Summary
Join our dynamic and global Digital Product Engineering company as an experienced L3 SRE engineer! You will be responsible for providing L3 support across the full stack of our business-critical SaaS application. This role requires expertise in areas such as Kubernetes, AWS, CI/CD pipelines, and Python. You will work closely with various teams to troubleshoot issues and ensure application performance. The ideal candidate possesses strong communication skills and experience with incident and problem management. We offer a non-hierarchical work culture and opportunities to work on exciting projects.
Requirements
- Possess expertise in EKS
- Possess expertise in GitHub Actions
- Possess strong Python skills
- Possess expert-level Kubernetes skills
- Possess expertise in Prometheus
Responsibilities
- Provide L3 support across the full stack, including infrastructure, backend, and front-end, before escalating to the engineering business unit
- Automate SRE tools to provide proactive L3 support, in line with our tech monitoring strategy
- Work under business pressure on business-critical applications
- Communicate appropriately with L1, L2, Engineering, Product Managers, leadership, and end users during troubleshooting
- Manage incidents and problems
- Work with multitenant applications
- Apply an understanding of networking concepts (TCP/IP, DNS, routing, etc.) such as VPCs, subnets, firewalls, load balancing, and TLS/SSL
- Work with CI/CD pipelines (e.g., Jenkins, GitHub Actions) and version control
- Use Python and React/Next.js
- Use monitoring and logging tools (Grafana, Prometheus, Loki, or ELK) to analyze and track resource utilization and application performance, and to identify potential issues
- Work with AWS, particularly EKS, serverless services, queues, and various databases
- Utilize solid knowledge of Kubernetes
Preferred Qualifications
- Have previous experience building a user-facing GenAI/LLM software application
- Understand security best practices in cloud environments
- Have experience with AWS Managed Services (RDS, Batch, Lambda, Fargate, Step Functions, SQS/SNS, etc.)
- Have experience with FastAPI and Next.js
- Have experience with WebSockets, Server-Sent Events, and Pub/Sub (RabbitMQ, Kafka, etc.)
- Understand cloud security concepts (IAM, access control)
- Have Terraform experience