Operations Engineer

Believe Solutions

πŸ“Remote - Worldwide

Summary

Join our team as a Kubernetes On-Premise Operations Engineer and manage our on-premise Kubernetes infrastructure, focusing on day-to-day operations, proactive monitoring, and troubleshooting to ensure high availability and system stability. You will collaborate with Level 3 Engineers to maintain seamless production operations. The role supports applications such as Mi Tigo, Tigo Sports, Apigee, and KannelGateway, which serve multiple countries. Responsibilities span Kubernetes cluster management, incident management, networking and ingress management, storage and database support, observability and monitoring, automation and configuration management, production deployments, and OS and security management. The position requires strong troubleshooting skills and hands-on experience with the tools and technologies listed under Requirements. This is a remote position open only to candidates in Bolivia.

Requirements

  • 5+ years in Operations, SRE, or DevOps roles
  • 3+ years managing on-premise Kubernetes clusters
  • Strong troubleshooting skills in Kubernetes, networking, and databases (MongoDB, MySQL, PostgreSQL)
  • Proficient in monitoring tools: Prometheus, Grafana, Loki
  • Familiar with operational processes, incident management, and runbooks
  • Experience with Helm, Ansible, and optionally Terraform
  • Prior experience with production on-call support and incident resolution
  • Competent in performing production deployments under change management practices
  • Experience managing Ubuntu systems

Responsibilities

  • Manage and maintain our on-premise Kubernetes infrastructure
  • Perform day-to-day operations, proactive monitoring, and troubleshooting
  • Ensure high availability and system stability
  • Collaborate with Level 3 Engineers to maintain seamless production operations
  • Kubernetes Cluster Management
      • Apply patches and updates
      • Monitor and troubleshoot performance issues
  • Incident Management & On-Call Support
      • Participate in on-call rotation
      • Respond to incidents, perform root cause analysis (RCA), and document resolutions
  • Networking & Ingress Management
      • Operate and troubleshoot Cilium, Nginx Ingress Controller, and Traefik
  • Storage & Databases
      • Support and maintain NFS, MongoDB, MySQL, and PostgreSQL, ensuring performance and data integrity
  • Observability & Monitoring
      • Manage Prometheus, Grafana, and Loki for proactive alerting and system logging
  • Automation & Configuration Management
      • Use Helm, Ansible, and CI/CD pipelines to apply and manage infrastructure configurations
  • Production Deployments
      • Execute, monitor, and manage production deployments with proper rollback strategies
  • OS & Security Management
      • Maintain Ubuntu-based systems, ensuring they are patched, secure, and performant
