MLOps Engineer

Cognigy

📍Remote - Germany

Summary

Join Cognigy, a leading AI Agent platform provider, as an MLOps Engineer. Based in Düsseldorf or remotely in Germany, you will build and operate scalable infrastructure for Large Language Models (LLMs) on Kubernetes. You will automate deployments, optimize resource usage, and ensure robust monitoring and alerting. You will also collaborate with ML and product engineers, prioritize security, and improve documentation. Cognigy offers significant career development opportunities and a supportive work environment.

Requirements

  • Hands-on experience running production ML or LLM workloads in Kubernetes
  • Familiarity with distributed ML frameworks such as KubeRay, Ray Serve, or similar
  • Deep understanding of Kubernetes internals, especially GPU scheduling, autoscaling, and multi-tenant environments
  • Proficiency with CI/CD systems for ML models and versioned deployment strategies
  • Strong experience with cloud platforms (AWS, GCP, or Azure), networking, and security best practices
  • Skilled in monitoring and observability for ML workloads (e.g., Prometheus, Grafana)
  • Passion for automation, performance tuning, and cost optimization for LLM workloads
  • Clear communicator and proactive team player who thrives in fast-paced, cross-functional environments

Responsibilities

  • Build & Operate LLM Infrastructure – Design and maintain scalable LLM-serving systems using Kubernetes and KubeRay
  • Automate & Optimize – Automate deployments, rollbacks, and scaling of LLMs while optimizing resource usage and performance
  • Enhance Observability – Ensure robust monitoring, logging, and alerting for LLM operations (Prometheus, Grafana, etc.)
  • Support AI Teams – Empower ML and product engineers with self-service pipelines and scalable infrastructure
  • Prioritize Security – Enforce secure deployments, compliance practices, and robust incident response strategies
  • Improve Documentation – Create and maintain technical documentation to streamline knowledge sharing and onboarding
  • Drive Innovation – Evaluate, adopt, and integrate the latest MLOps and LLM-serving technologies
  • Reduce SRE Toil – Eliminate repetitive tasks and improve operational efficiency across the platform

Preferred Qualifications

  • MLOps or DevOps certifications

Benefits

  • Attractive and performance-oriented salary
  • Company Pension Scheme
  • 25 days paid leave, plus 5 floating days, plus public holidays
  • Unique opportunity to help build and shape the company, with little hierarchy
  • Flexible working options
  • Colleague recognition, reward and celebration events
  • Global Employee Assistance Program
  • ClassPass membership, giving you access to a variety of fitness and wellness experiences
  • Ongoing learning and development opportunities, including Udemy
  • One paid 'Giving Back Day' each year, so you can volunteer for a charity or community activity of your choice
  • Subscription to the Calm app for you plus five friends/family members, giving you access to guided meditation, sleep stories, music, masterclasses, and much more
