Agentic AI Ops Engineer - Serverless & CI/CD

Expedite Commerce

📍 Remote - India

Summary

Join Rethem, a company revolutionizing sales with an AI-driven platform focused on buyer outcomes. We are seeking a hands-on Agentic AI Ops Engineer to build and maintain CI/CD infrastructure for Agentic AI solutions using Terraform on AWS. This critical role involves developing, deploying, and debugging intelligent agents and their associated tools, ensuring scalable, traceable, and cost-effective delivery. The ideal candidate will have experience with AWS serverless architecture, Terraform, CI/CD pipelines, and agent development in Python. This is a fully remote opportunity with benefits including health insurance, PTO, paid professional training, and a strong onboarding program. If you are passionate about AI-driven sales transformation and meet the requirements, apply now to help shape the future of Rethem.

Requirements

  • 2+ years of experience in DevOps, MLOps, or Cloud Infrastructure with exposure to AI/ML systems
  • Deep expertise in AWS serverless architecture, including hands-on experience with:
      • AWS Lambda: function design, performance tuning, cold-start optimization
      • Amazon API Gateway: managing REST/HTTP APIs and integrating with Lambda securely
      • Step Functions: orchestrating agentic workflows and managing execution states
      • S3, DynamoDB, EventBridge, SQS: event-driven and storage patterns for scalable AI systems
  • Strong proficiency in Terraform to build and manage serverless AWS environments using reusable, modular templates
  • Experience deploying and managing CI/CD pipelines for serverless and agent-based applications using AWS CodePipeline, CodeBuild, CodeDeploy, or GitHub Actions
  • Hands-on experience with agent and tool development in Python, including debugging and performance tuning in production
  • Solid understanding of IAM roles and policies, VPC configuration, and least-privilege access control for securing AI systems
  • Deep understanding of monitoring, alerting, and distributed tracing systems (e.g., CloudWatch, Grafana, OpenTelemetry)
  • Ability to manage environment parity across dev, staging, and production using automated infrastructure pipelines
  • Excellent debugging, documentation, and cross-team communication skills

Responsibilities

  • Design, implement, and maintain CI/CD pipelines for Agentic AI applications using Terraform, AWS CodePipeline, CodeBuild, and related tools
  • Automate deployment of multi-agent systems and associated tooling, ensuring version control, rollback strategies, and consistent environment parity across dev/test/prod
  • Collaborate with ML/NLP engineers to develop and deploy modular, tool-integrated AI agents in production
  • Lead the effort to create debuggable agent architectures, with structured logging, standardized agent behaviors, and feedback integration loops
  • Build agent lifecycle management tools that support quick iteration, rollback, and debugging of faulty behaviors
  • Implement end-to-end observability for agents and tools, including runtime performance metrics, tool invocation traces, and latency/accuracy tracking
  • Design dashboards and alerting mechanisms to capture agent failures, degraded performance, and tool bottlenecks in real time
  • Build lightweight tracing systems that help visualize agent workflows and simplify root cause analysis
  • Monitor and manage cost metrics associated with agentic operations, including API call usage, toolchain overhead, and model inference costs
  • Set up proactive alerts for usage anomalies, implement cost dashboards, and propose strategies for reducing operational expenses without compromising performance
  • Work closely with product, backend, and AI teams to evolve the agentic infrastructure design and tool orchestration workflows
  • Drive the adoption of best practices for Agentic AI DevOps, including retraining automation, secure deployments, and compliance in cloud-hosted environments
  • Participate in design reviews, postmortems, and architectural roadmap planning to continuously improve reliability and scalability

Benefits

  • Health insurance
  • PTO and leave time
  • Ongoing paid professional training and certifications
  • Fully remote work opportunity
  • Strong onboarding and training programs
