Site Reliability Engineer - Staff Engineer
Aviatrix (closed)
Job highlights
Summary
Join Aviatrix's growing US-based SRE team as a Staff Engineer and contribute to the reliability and performance of our critical systems and services. You will design, implement, and maintain highly available, fault-tolerant systems using automation and infrastructure-as-code. Responsibilities include ensuring system uptime, architecting scalable systems, developing automation tools, building monitoring tools, managing incidents, and collaborating with engineering teams. This remote position offers a competitive salary and benefits package, including comprehensive health insurance, 401k matching, and flexible vacation time. The ideal candidate possesses extensive experience in system administration, automation, and cloud technologies, with proficiency in Golang or Python and experience with Kubernetes and Terraform.
Requirements
- 8+ years of experience maintaining and deploying highly available, fault-tolerant systems at scale
- Proficiency in Golang or Python is required
- Deep understanding of Terraform core components with real-world experience using Terraform for infrastructure provisioning and management
- Experience with at least one cloud service provider (e.g., AWS, GCP, Azure, OCI)
- Good knowledge of Kubernetes
- Solid experience developing automation tools and frameworks
- Experience with logging solutions (e.g., Loki, Syslog, Elasticsearch, Logstash, Kibana, Filebeat, Fluent Bit)
- Experience with monitoring and metrics solutions (e.g., Prometheus, Grafana, VictoriaMetrics)
- Practical experience with Linux system administration
- Experience with version control systems (e.g., Git, GitHub) and code review
- Excellent communication skills are required
Responsibilities
- Ensure uptime for crucial services and systems based on business-required SLOs
- Minimize service disruptions through proactive monitoring, capacity planning and fault-tolerant design
- Design and architect complex, scalable and reliable systems
- Develop and implement automation tools and frameworks that automate routine tasks, reduce human error, and streamline operational processes to increase efficiency
- Define, build, deploy, maintain, and extend our observability and monitoring tools to enhance system reliability and availability
- Maintain an effective on-call rotation to ensure 24/7 coverage
- Follow incident response procedures to swiftly address and mitigate service disruptions
- Help define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to set clear expectations for system performance
- Work closely with product engineering to ensure service-level objectives and reliability targets are met
- Respond to escalations by troubleshooting complex system and application incidents, performing root cause analysis, and implementing necessary corrective actions
- Stay up to date with the latest industry trends and emerging technologies
- Iterate on best practices to increase the quality & velocity of development and deliverables
Preferred Qualifications
- Terragrunt
- cdk8s and operators
Benefits
- 100% of employee premiums and 88% of dependent(s) premiums for medical, dental and vision coverage
- 401(k) match
- Short and long-term disability
- Life/AD&D insurance
- $1,000/year education reimbursement
- Flexible vacation policy
- Comprehensive benefits package which (subject to regional variations) could include pension, private medical coverage for you and your dependents, generous holiday allowance, life assurance, long-term disability, and an annual wellbeing stipend
- Remote work