Site Reliability Engineer

Evertz
Summary
Join our growing team as a highly motivated and passionate Site Reliability Engineer at evertz.io, where you will contribute to building services used by major players in the broadcast and media industry. You will work with talented teams to enhance our multi-tenant SaaS platform hosted on AWS, utilizing best-in-class observability tools. Your responsibilities will include debugging incidents, implementing platform improvements, automating processes, and building tools to ensure reliability. We offer flexible working hours and excellent benefits, along with the opportunity to experiment with new technologies. The role requires significant experience in managing production infrastructure, programming, and working with various AWS services.
Requirements
- At least 3 years of hands-on experience managing critical, high-availability production infrastructure, demonstrating success in maintaining reliability and maximizing application uptime
- Proficient in at least one programming language (such as Python, Java, or Rust), with experience designing and building production-quality automation, tools, or software libraries
- At least 3 years working with monitoring, log aggregation, and observability platforms such as Datadog, CloudWatch, Honeycomb, Splunk, or New Relic, using data-driven insights to proactively identify and resolve issues
- Excellent analytical skills with the ability to understand end-to-end use cases, map system flows, debug complex issues, and anticipate potential failure points
- Proven track record of translating SLOs and SLIs into actionable improvements. Reliability, monitoring, and observability are not just words to you
- At least 3 years of experience with cloud technologies, in particular AWS services and tools such as CloudFormation, Lambda, DynamoDB, SQS, SNS, EC2, S3, AWS CLI, and Boto3
- Solid foundation in Linux systems administration, networking, and security
- Familiarity with the use and configuration of CI/CD pipelines such as Jenkins and AWS CodePipeline
Responsibilities
- Work with our talented teams to help harden our multi-tenant SaaS platform
- Debug incidents using best-in-class observability tooling, and identify and implement improvements to the platform to ensure its continued reliability
- Eliminate toil by automating processes and building the tools needed to do so
Preferred Qualifications
- Experience architecting and deploying serverless applications in cloud environments
- Experience with infrastructure-as-code tools like Terraform or CloudFormation, enabling reproducible and scalable environments
- Previous participation in production on-call rotations, with direct involvement in incident management and post-incident reviews
- Demonstrated expertise in performance optimization for core AWS services, including Lambda, DynamoDB, API Gateway, SQS, EventBridge, and EC2
- Experience supporting and improving systems with frequent, high-velocity deployment cycles
- Familiarity with security compliance frameworks (e.g., OWASP, ISO, CSA, PCI), and hands-on experience conducting threat assessments and implementing remediation plans
- Background in security practices, including penetration testing, threat modeling, and usage of both open-source and commercial security tools
- Experience developing and implementing advanced deployment strategies for web application infrastructures, such as canary, A/B testing, blue/green, or red/black deployments
- Hands-on experience with chaos engineering—intentionally testing systems under extreme conditions to improve reliability and fault tolerance
- Track record of championing system reliability, continuous improvement, and operational excellence throughout an organization
Benefits
- Flexible working hours
- Great benefits