Senior Site Reliability Engineer

Airalo

๐Ÿ“Remote

Summary

Join Airalo, the world's first eSIM store, as a Site Reliability Engineer! This remote-first, full-time position offers a chance to work on highly reliable systems within a growing engineering team. You will develop and maintain efficient systems, define service level objectives, conduct post-incident reviews, and drive automation. The ideal candidate has extensive experience with SRE, AWS services, Kubernetes, and various other technologies. Airalo provides excellent benefits, including health insurance, a work-from-anywhere stipend, and annual wellness & learning credits.

Requirements

  • Bachelor's degree in Computer Engineering or a similar discipline
  • 5+ years of experience as a Site Reliability Engineer or in a similar role
  • 3+ years of experience with AWS services, including strong knowledge of container orchestration
  • 2+ years of Kubernetes experience
  • Deep understanding of observability principles and tools (logging, monitoring, tracing)
  • Experience with incident management and postmortem analysis
  • Experience with and interest in the infrastructure-as-code approach (Terraform)
  • Experience with chaos engineering and other techniques for testing system resilience
  • Experience with CI/CD tools such as GitHub Actions
  • Proficiency in at least one programming language (Python, Go, Java, etc.) for automation and tooling
  • Comfortable with messaging systems (SNS, SQS, etc.)
  • Ability to work independently and collaboratively in a fast-paced environment
  • Team player and open to new ideas
  • Good communication skills and fluency in English

Responsibilities

  • Develop and maintain reliable, scalable, and efficient systems
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and improve system reliability
  • Conduct blameless post-incident reviews to identify root causes and implement preventive measures
  • Drive automation of operational tasks and incident response
  • Develop and maintain runbooks and playbooks for common operational tasks and incident response
  • Mitigate operational risks
  • Work with software engineers to design systems for reliability, scalability, and maintainability
  • Continuously evaluate and optimize system performance, capacity, and cost
  • Participate in on-call rotation and be available to troubleshoot and resolve critical issues

Preferred Qualifications

  • Prior experience with Scrum and other agile methods
  • Certification in relevant areas such as AWS Certified DevOps Engineer, Certified Kubernetes Administrator (CKA), or similar
  • Experience with AI-driven SRE tools for anomaly detection and reliability improvements
  • Contributions to open-source SRE projects or communities
  • Prior work experience in telecommunications
  • Knowledge of eSIM and GSMA related technologies and services

Benefits

  • Health Insurance
  • Work-from-anywhere stipend
  • Annual wellness & learning credits
  • Annual all-expenses-paid company retreat in a gorgeous destination
  • Other benefits
