Agentic AI Ops Engineer - Serverless & CI/CD

Expedite Commerce
Summary
Join Rethem, a company revolutionizing sales with an AI-driven platform focused on buyer outcomes. We are seeking a hands-on Agentic AI Ops Engineer to build and maintain CI/CD infrastructure for Agentic AI solutions using Terraform on AWS. This critical role involves developing, deploying, and debugging intelligent agents and their associated tools, ensuring scalable, traceable, and cost-effective delivery. The ideal candidate will have experience with AWS serverless architecture, Terraform, CI/CD pipelines, and agent development in Python. This is a fully remote opportunity with benefits including health insurance, PTO, paid professional training, and a strong onboarding program. If you are passionate about AI-driven sales transformation and meet the requirements, apply now to help shape the future of Rethem.
Requirements
- 2+ years of experience in DevOps, MLOps, or Cloud Infrastructure with exposure to AI/ML systems
- Deep expertise in AWS serverless architecture, including hands-on experience with:
  - AWS Lambda: function design, performance tuning, cold-start optimization
  - Amazon API Gateway: managing REST/HTTP APIs and integrating with Lambda securely
  - Step Functions: orchestrating agentic workflows and managing execution states
  - S3, DynamoDB, EventBridge, SQS: event-driven and storage patterns for scalable AI systems
- Strong proficiency in Terraform to build and manage serverless AWS environments using reusable, modular templates
- Experience deploying and managing CI/CD pipelines for serverless and agent-based applications using AWS CodePipeline, CodeBuild, CodeDeploy, or GitHub Actions
- Hands-on experience with agent and tool development in Python, including debugging and performance tuning in production
- Solid understanding of IAM roles and policies, VPC configuration, and least-privilege access control for securing AI systems
- Deep understanding of monitoring, alerting, and distributed tracing systems (e.g., CloudWatch, Grafana, OpenTelemetry)
- Ability to manage environment parity across dev, staging, and production using automated infrastructure pipelines
- Excellent debugging, documentation, and cross-team communication skills
Responsibilities
- Design, implement, and maintain CI/CD pipelines for Agentic AI applications using Terraform, AWS CodePipeline, CodeBuild, and related tools
- Automate deployment of multi-agent systems and associated tooling, ensuring version control, rollback strategies, and environment parity across dev/test/prod
- Collaborate with ML/NLP engineers to develop and deploy modular, tool-integrated AI agents in production
- Lead the effort to create debuggable agent architectures, with structured logging, standardized agent behaviors, and feedback integration loops
- Build agent lifecycle management tools that support quick iteration, rollback, and debugging of faulty behaviors
- Implement end-to-end observability for agents and tools, including runtime performance metrics, tool invocation traces, and latency/accuracy tracking
- Design dashboards and alerting mechanisms to capture agent failures, degraded performance, and tool bottlenecks in real time
- Build lightweight tracing systems that help visualize agent workflows and simplify root cause analysis
- Monitor and manage cost metrics associated with agentic operations, including API call usage, toolchain overhead, and model inference costs
- Set up proactive alerts for usage anomalies, implement cost dashboards, and propose strategies for reducing operational expenses without compromising performance
- Work closely with product, backend, and AI teams to evolve the agentic infrastructure design and tool orchestration workflows
- Drive the adoption of best practices for Agentic AI DevOps, including retraining automation, secure deployments, and compliance in cloud-hosted environments
- Participate in design reviews, postmortems, and architectural roadmap planning to continuously improve reliability and scalability
Benefits
- Health Insurance
- PTO and leave time
- Ongoing paid professional training and certifications
- Fully remote work opportunity
- Strong onboarding and training programs