LLM Ops Engineer - Serverless & CI/CD

Expedite Commerce

📍 Remote - India

Summary

Join us to redefine SaaS infrastructure and champion a new era of AI-powered, product-led enterprise experiences. We are seeking a hands-on Agentic AI Ops Engineer who thrives at the intersection of cloud infrastructure, AI agent systems, and DevOps automation. You will build and maintain the CI/CD infrastructure for Agentic AI solutions using Terraform on AWS, while also developing, deploying, and debugging intelligent agents and their associated tools. This position is critical to ensuring scalable, traceable, and cost-effective delivery of agentic systems in production environments. The role involves designing, implementing, and maintaining CI/CD pipelines; automating the deployment of multi-agent systems; collaborating with ML/NLP engineers; building agent lifecycle management tools; implementing end-to-end observability; and optimizing operational costs. You will also work closely with other teams to improve infrastructure design and workflows.

Requirements

  • 2+ years of experience in DevOps, MLOps, or Cloud Infrastructure with exposure to AI/ML systems
  • Deep expertise in AWS serverless architecture, including hands-on experience with:
      • AWS Lambda – function design, performance tuning, cold-start optimization
      • Amazon API Gateway – managing REST/HTTP APIs and integrating with Lambda securely
      • Step Functions – orchestrating agentic workflows and managing execution states
      • S3, DynamoDB, EventBridge, SQS – event-driven and storage patterns for scalable AI systems
  • Strong proficiency in Terraform to build and manage serverless AWS environments using reusable, modular templates
  • Experience deploying and managing CI/CD pipelines for serverless and agent-based applications using AWS CodePipeline, CodeBuild, CodeDeploy, or GitHub Actions
  • Hands-on experience with agent and tool development in Python, including debugging and performance tuning in production
  • Solid understanding of IAM roles and policies, VPC configuration, and least-privilege access control for securing AI systems
  • Deep understanding of monitoring, alerting, and distributed tracing systems (e.g., CloudWatch, Grafana, OpenTelemetry)
  • Ability to manage environment parity across dev, staging, and production using automated infrastructure pipelines
  • Excellent debugging, documentation, and cross-team communication skills

Responsibilities

  • Design, implement, and maintain CI/CD pipelines for Agentic AI applications using Terraform, AWS CodePipeline, CodeBuild, and related tools
  • Automate deployment of multi-agent systems and associated tooling, ensuring version control, rollback strategies, and consistent environment parity across dev/test/prod
  • Collaborate with ML/NLP engineers to develop and deploy modular, tool-integrated AI agents in production
  • Lead the effort to create debuggable agent architectures, with structured logging, standardized agent behaviors, and feedback integration loops
  • Build agent lifecycle management tools that support quick iteration, rollback, and debugging of faulty behaviors
  • Implement end-to-end observability for agents and tools, including runtime performance metrics, tool invocation traces, and latency/accuracy tracking
  • Design dashboards and alerting mechanisms to capture agent failures, degraded performance, and tool bottlenecks in real time
  • Build lightweight tracing systems that help visualize agent workflows and simplify root cause analysis
  • Monitor and manage cost metrics associated with agentic operations, including API call usage, toolchain overhead, and model inference costs
  • Set up proactive alerts for usage anomalies, implement cost dashboards, and propose strategies for reducing operational expenses without compromising performance
  • Work closely with product, backend, and AI teams to evolve the agentic infrastructure design and tool orchestration workflows
  • Drive the adoption of best practices for Agentic AI DevOps, including retraining automation, secure deployments, and compliance in cloud-hosted environments
  • Participate in design reviews, postmortems, and architectural roadmap planning to continuously improve reliability and scalability

Benefits

  • Equity participation program
  • Health insurance, PTO, and leave time
  • Ongoing paid professional training and certifications
  • Fully remote work opportunity
  • Strong onboarding and training program
