Staff Engineer - DevOps Site Reliability


Nagarro

πŸ“Remote - Colombia

Summary

Join our digital product engineering company as an experienced L3 site reliability engineer (SRE) supporting business-critical SaaS applications. You will provide L3 support across the full stack (infrastructure, backend, and frontend) and automate SRE tooling for proactive monitoring. The role demands working under pressure, communicating effectively with many teams, and managing incidents and problems. Experience with multitenant applications, networking concepts, CI/CD pipelines, and AWS (especially EKS and serverless technologies) is crucial. Expertise in Kubernetes and Prometheus is essential, and the ideal candidate will have strong Python skills.

Requirements

Possess expert-level skills in EKS, GitHub Actions, Python, and Kubernetes

Responsibilities

  • Provide L3 support across the full stack (infrastructure, backend, and frontend), escalating to the engineering business unit only when necessary
  • Automate SRE tooling to provide proactive support, in line with our technology monitoring strategy
  • Work effectively under business pressure for business-critical applications
  • Communicate clearly and effectively with L1 and L2 support, engineering, product managers, leadership, and end users during troubleshooting
  • Manage incidents and problems effectively
  • Work with multitenant applications
  • Demonstrate a solid understanding of networking concepts (TCP/IP, DNS, routing, etc.), including VPCs, subnets, firewalls, load balancing, and TLS/SSL
  • Use CI/CD pipelines (e.g., Jenkins, GitHub Actions) and version control systems
  • Work with Python and React/Next.js
  • Use monitoring and logging tools (Grafana, Prometheus, Loki, or ELK) to track resource utilization and application performance and to identify potential issues
  • Utilize AWS, particularly EKS, serverless technologies, queueing systems, and various databases
  • Show solid knowledge of Kubernetes

Preferred Qualifications

  • Have previous experience building a user-facing GenAI/LLM software application
  • Be proficient in security best practices for cloud environments, including AWS managed services (RDS, Batch, Lambda, Fargate, Step Functions, SQS/SNS, etc.)
  • Have experience with FastAPI and Next.js
  • Be familiar with WebSockets, Server-Sent Events (SSE), and pub/sub technologies (RabbitMQ, Kafka, etc.)
  • Understand cloud security concepts (IAM, access control)
  • Have experience with Terraform

This job has been filled or is no longer available.