Site Reliability Engineer at Unitary

Summary

Join Unitary, a rapidly growing startup, as a Site Reliability Engineer to ensure the smooth and reliable operation of our systems at scale. You will play a key role in maintaining high availability and performance, working at the intersection of development and operations. Your responsibilities include designing and implementing comprehensive alerting systems, collaborating with development teams on observability, optimizing on-call processes, and building self-healing systems. You will also develop automation tools and ensure secure code deployment. This role involves joining a 24/7 support rotation and contributing to a positive customer experience. Unitary offers a collaborative environment and the opportunity to make a significant impact in a smaller company. We are looking for a versatile engineer who is comfortable balancing urgent issues with proactive system improvements.

Requirements

Have worked with visualisation tools such as Grafana for creating and maintaining dashboards that provide meaningful insights into system performance
Are proficient with metrics and telemetry tooling such as Prometheus, InfluxDB, or OpenTelemetry for collecting and analysing system data
Have experience with incident management tools such as Incident.io for coordinating response efforts and recording follow-up learnings and actions
Can demonstrate strong problem-solving skills and the ability to work autonomously
Are confident writing production code in languages such as Go or Python
Thrive in a collaborative environment where group output and team achievements matter more than individual contributions

Responsibilities

Design and implement comprehensive alerting systems that detect issues early and provide actionable insights to speed up their resolution
Collaborate with our development teams to ensure our observability stack provides clear visibility into system health and performance
Optimise on-call processes, including creating and maintaining detailed runbooks that enable efficient incident response and knowledge sharing across teams
Build self-healing systems using AI tools that automatically resolve common issues before they require human intervention
Develop automation tools and diagnostic capabilities that help teams quickly identify and resolve issues when manual investigation is required
Ensure secure and reliable code deployment processes through robust CI/CD pipelines and infrastructure automation
Join our 24/7 support rotation, which provides first-level platform support to ensure a great customer experience

Preferred Qualifications

Experience working in a fully remote, international team
Previous startup experience
Have built Slack bots or similar automation tools to streamline team workflows
Experience with CI/CD platforms for building reliable deployment pipelines (e.g. GitLab CI, ArgoCD)
Have worked with Kubernetes and infrastructure-as-code tools such as Terraform for scalable system deployment
Are familiar with MLOps practices and tools, including monitoring machine learning systems in production

Benefits

Flexible hours and location
Competitive salary and equity package
Occupational pension
Generous paid parental leave
Generous paid sick leave
Annual budget for your professional development and growth
Annual budget for your individual health and wellness
Three team offsites to London or other exciting destinations in Europe

Remote

DevOps

Mid-level
