Site Reliability Engineer - Staff Engineer at Aviatrix

Summary

Join Aviatrix's SRE team as a Staff Engineer and play a key role in designing, implementing, and maintaining highly available systems. You will focus on automation, proactive monitoring, and Infrastructure-as-Code (IaC). Responsibilities include ensuring system reliability and availability, architecting scalable systems, developing automation tools, building monitoring tools, managing incidents, and collaborating with product engineering. The role requires 8+ years of experience maintaining highly available systems, proficiency in Golang or Python, and expertise in IaC and Kubernetes. This is a remote position open to US and Canada residents. Aviatrix offers a competitive salary and benefits package, including comprehensive health insurance, a 401(k) match, and a flexible vacation policy.

Requirements

8+ years of experience maintaining and deploying highly available, fault-tolerant systems at scale
Proficiency in Golang or Python is required
Infrastructure-as-Code (IaC): Deep understanding of Terraform core components, with real-world experience using Terraform for infrastructure provisioning and management (Terragrunt is a bonus)
Experience with at least one cloud service provider (e.g., AWS, GCP, Azure, OCI)
Good knowledge of Kubernetes (cdk8s and operators are a bonus)
Solid experience developing automation tools and frameworks
Experience with logging solutions (e.g., Loki, Syslog, Elasticsearch, Logstash, Kibana, Filebeat, Fluent Bit)
Experience with monitoring and metrics solutions (e.g., Prometheus, Grafana, VictoriaMetrics)
Practical experience with Linux system administration
Experience with version control systems (e.g., Git, GitHub) and code review
Excellent communication skills are required

Responsibilities

Ensure Reliability and Availability: You will ensure uptime for crucial services and systems based on business-required SLOs and minimize service disruptions through proactive monitoring, capacity planning, and fault-tolerant design
Architecture and System Design: You will design and architect complex, scalable, and reliable systems
Automation and Efficiency: You will develop and implement automation tools and frameworks that automate routine tasks, reduce human error, and streamline operational processes
Build Observability and Monitoring Tools: You will define, build, deploy, maintain, and extend our observability and monitoring tools to enhance system reliability and availability
Incident Management and Response: You will maintain an effective on-call rotation to ensure 24/7 coverage and follow incident response procedures to swiftly address and mitigate service disruptions
Performance Monitoring and SLIs/SLOs: You will help define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to set clear expectations for system performance
Collaboration: You will work closely with product engineering to ensure service-level objectives and reliability targets are met
Problem-Solving & Troubleshooting: You will respond to escalations by troubleshooting complex system and application incidents, performing root cause analysis, and implementing necessary corrective actions
Thought Leadership and Innovation: You will stay up to date with the latest industry trends and emerging technologies, and iterate on best practices to increase the quality and velocity of development and deliverables
Manage application lifecycles, automate operational tasks, troubleshoot issues, integrate monitoring and alerting, optimize infrastructure, and ensure reliable operations using custom-built operators and cdk8s
Implement Infrastructure-as-Code (IaC) to enable rapid provisioning, seamless configuration changes, and efficient scaling
Build and enhance automation tools and frameworks in Golang and Python to streamline operations

Preferred Qualifications

Terragrunt
cdk8s and operators

Benefits

We cover 100% of employee premiums and 88% of dependent(s) premiums for medical, dental and vision coverage
401(k) match
Short and long-term disability
Life/AD&D insurance
$1,000/year education reimbursement
Flexible vacation policy
We offer a comprehensive benefits package which (subject to regional variations) could include a pension, private medical cover for you and your dependents, a generous holiday allowance, life assurance, long-term disability cover, and an annual wellbeing stipend

