Site Reliability Engineer - Staff Engineer
Aviatrix
Summary
Join Aviatrix's SRE team as a Staff Engineer and play a key role in designing, implementing, and maintaining highly available systems. You will focus on automation, proactive monitoring, and Infrastructure-as-Code (IaC). Responsibilities include ensuring system reliability and availability, architecting scalable systems, developing automation tools, building monitoring tools, managing incidents, and collaborating with product engineering. The role requires 8+ years of experience maintaining highly available systems, proficiency in Golang or Python, and expertise in IaC and Kubernetes. This is a remote position open to US and Canada residents. Aviatrix offers a competitive salary and benefits package, including comprehensive health insurance, a 401(k) match, and a flexible vacation policy.
Requirements
- 8+ years of experience maintaining and deploying highly available, fault-tolerant systems at scale
- Proficiency in Golang or Python is required
- Infrastructure-as-Code (IaC): Deep understanding of Terraform core components, with real-world experience using Terraform for infrastructure provisioning and management (Terragrunt is a bonus)
- Experience with at least one cloud service provider (e.g., AWS, GCP, Azure, OCI)
- Good knowledge of Kubernetes (cdk8s and operators are a bonus)
- Solid experience developing automation tools and frameworks
- Experience with logging solutions (e.g., Loki, Syslog, Elasticsearch, Logstash, Kibana, Filebeat, Fluent Bit)
- Experience with monitoring and metrics solutions (e.g., Prometheus, Grafana, VictoriaMetrics)
- Practical experience with Linux system administration
- Experience with version control systems (e.g., Git, GitHub) and code review
- Excellent communication skills are required
Responsibilities
- Ensure Reliability and Availability: You will ensure uptime for crucial services and systems based on business-required SLOs, and minimize service disruptions through proactive monitoring, capacity planning, and fault-tolerant design
- Architecture and System Design: you will design and architect complex, scalable and reliable systems
- Automation and Efficiency: You will develop and implement automation tools and frameworks that automate routine tasks, reduce human error, and streamline operational processes to increase efficiency
- Build Observability and Monitoring tools: you will define, build, deploy, maintain, and extend our observability and monitoring tools to enhance system reliability and availability
- Incident Management and Response: You will maintain an effective on-call rotation to ensure 24/7 coverage. You will follow incident response procedures to swiftly address and mitigate service disruptions
- Performance Monitoring and SLIs/SLOs: You will help define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to set clear expectations for system performance
- Collaboration: you will work closely with product engineering to ensure service-level objectives and reliability targets are met
- Problem-Solving & Troubleshooting: You will respond to escalations by troubleshooting complex system and application incidents, performing root cause analysis, and implementing necessary corrective actions
- Thought Leadership and Innovation: Stay up to date with the latest industry trends and emerging technologies. Iterate on best practices to increase the quality and velocity of development and deliverables
- Manage application lifecycles, automate operational tasks, troubleshoot issues, integrate monitoring and alerting, optimize infrastructure, and ensure reliable operations using custom-built operators and cdk8s
- Implement Infrastructure-as-Code (IaC) to enable rapid provisioning, seamless configuration changes, and efficient scaling
- Build and enhance automation tools and frameworks in Golang and Python to streamline operations
Preferred Qualifications
- Terragrunt
- cdk8s and operators
Benefits
- We cover 100% of employee premiums and 88% of dependent(s) premiums for medical, dental and vision coverage
- 401(k) match
- Short and long-term disability
- Life/AD&D insurance
- $1,000/year education reimbursement
- Flexible vacation policy
- We offer a comprehensive benefits package which (subject to regional variations) could include pension, private medical for you and dependents, generous holiday allowance, life assurance, long-term disability, and an annual wellbeing stipend