Site Reliability Engineer - Staff Engineer

Aviatrix

💵 $177k-$190k
📍Remote - United States

Summary

Join Aviatrix's SRE team as a Staff Engineer and play a key role in designing, implementing, and maintaining highly available systems. You will focus on automation, proactive monitoring, and Infrastructure-as-Code (IaC). Responsibilities include ensuring system reliability and availability, architecting scalable systems, developing automation tools, building monitoring tools, managing incidents, and collaborating with product engineering. The role requires 8+ years of experience maintaining highly available systems, proficiency in Golang or Python, and expertise in IaC and Kubernetes. This is a remote position open to US and Canada residents. Aviatrix offers a competitive salary and benefits package, including comprehensive health insurance, a 401(k) match, and a flexible vacation policy.

Requirements

  • 8+ years of experience maintaining and deploying highly available, fault-tolerant systems at scale
  • Proficiency in Golang or Python is required
  • Infrastructure-as-Code (IaC): Deep understanding of Terraform core components, with real-world experience using Terraform for infrastructure provisioning and management (Terragrunt is a bonus)
  • Experience with at least one cloud service provider (e.g., AWS, GCP, Azure, OCI)
  • Good knowledge of Kubernetes (cdk8s and operators are a bonus)
  • Solid experience developing automation tools and frameworks
  • Experience with logging solutions (e.g., Loki, Syslog, Elasticsearch, Logstash, Kibana, Filebeat, Fluent Bit)
  • Experience with monitoring and metrics solutions (e.g., Prometheus, Grafana, VictoriaMetrics)
  • Practical experience with Linux system administration
  • Experience with version control systems (e.g., Git, GitHub) and code review
  • Excellent communication skills are required

Responsibilities

  • Ensure Reliability and Availability: You will ensure uptime for crucial services and systems based on business-required SLOs, and minimize service disruptions through proactive monitoring, capacity planning, and fault-tolerant design
  • Architecture and System Design: You will design and architect complex, scalable, and reliable systems
  • Automation and Efficiency: You will develop and implement automation tools and frameworks that automate routine tasks, reduce human error, and streamline operational processes to increase efficiency
  • Build Observability and Monitoring Tools: You will define, build, deploy, maintain, and extend our observability and monitoring tools to enhance system reliability and availability
  • Incident Management and Response: You will maintain an effective on-call rotation to ensure 24/7 coverage, and follow incident response procedures to swiftly address and mitigate service disruptions
  • Performance Monitoring and SLIs/SLOs: You will help define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to set clear expectations for system performance
  • Collaboration: You will work closely with product engineering to ensure service-level objectives and reliability targets are met
  • Problem-Solving & Troubleshooting: You will respond to escalations by troubleshooting complex system and application incidents, performing root cause analysis, and implementing necessary corrective actions
  • Thought Leadership and Innovation: You will stay up to date with the latest industry trends and emerging technologies, and iterate on best practices to increase the quality and velocity of development and deliverables
  • Manage application lifecycles, automate operational tasks, troubleshoot issues, integrate monitoring and alerting, optimize infrastructure, and ensure reliable operations using custom-built operators and cdk8s
  • Implement Infrastructure-as-Code (IaC) to enable rapid provisioning, seamless configuration changes, and efficient scaling
  • Build and enhance automation tools and frameworks in Golang and Python to streamline operations

Preferred Qualifications

  • Terragrunt
  • cdk8s and operators

Benefits

  • We cover 100% of employee premiums and 88% of dependent(s) premiums for medical, dental and vision coverage
  • 401(k) match
  • Short and long-term disability
  • Life/AD&D insurance
  • $1,000/year education reimbursement
  • Flexible vacation policy
  • We offer a comprehensive benefits package which (subject to regional variations) could include pension, private medical for you and dependents, generous holiday allowance, life assurance, long-term disability, and an annual wellbeing stipend
