Site Reliability Engineer

Unitary
Summary
Join Unitary, a rapidly growing startup, as a Site Reliability Engineer to ensure the smooth and reliable operation of our systems at scale. You will play a key role in maintaining high availability and performance, working at the intersection of development and operations. Your responsibilities include designing and implementing comprehensive alerting systems, collaborating with development teams on observability, optimizing on-call processes, and building self-healing systems. You will also develop automation tools and ensure secure code deployment. This role involves joining a 24/7 support rotation and contributing to a positive customer experience. Unitary offers a collaborative environment and the opportunity to make a significant impact in a smaller company. We are looking for a versatile engineer who is comfortable balancing urgent issues with proactive system improvements.
Requirements
- Have worked with visualisation tools such as Grafana for creating and maintaining dashboards that provide meaningful insights into system performance
- Are proficient with metrics and telemetry tools such as Prometheus, InfluxDB, or OpenTelemetry for collecting and analysing system data
- Have experience with incident management tools such as Incident.io for coordinating response efforts and recording follow-up learnings and actions
- Can demonstrate strong problem-solving skills and the ability to work autonomously
- Are confident writing production code in languages such as Go or Python
- Thrive in a collaborative environment where group output and team achievements weigh more heavily than individual input
Responsibilities
- Design and implement comprehensive alerting systems that detect issues early and provide actionable insights to speed up their resolution
- Collaborate with our development teams to ensure our observability stack provides clear visibility into system health and performance
- Optimise on-call processes, including creating and maintaining detailed runbooks that enable efficient incident response and knowledge sharing across teams
- Build self-healing systems using AI tools that automatically resolve common issues before they require human intervention
- Develop automation tools and diagnostic capabilities that help teams quickly identify and resolve issues when manual investigation is required
- Ensure secure and reliable code deployment processes through robust CI/CD pipelines and infrastructure automation
- Join our 24/7 support rotation, which provides first-level platform support to ensure a great customer experience
Preferred Qualifications
- Experience working in a fully remote, international team
- Previous startup experience
- Experience building Slack bots or similar automation tools to streamline team workflows
- Experience with CI/CD platforms for building reliable deployment pipelines (e.g. GitLab CI, ArgoCD)
- Experience with Kubernetes and infrastructure-as-code tools such as Terraform for scalable system deployment
- Familiarity with MLOps practices and tools, and with monitoring machine learning systems in production
Benefits
- Flexible hours and location
- Competitive salary and equity package
- Occupational pension
- Generous paid parental leave
- Generous paid sick leave
- Annual budget for your professional development and growth
- Annual budget for your individual health and wellness
- Three team offsites to London or other exciting destinations in Europe