Summary

Join ClickHouse's Site Reliability Engineering team as one of its first members, building and leading the processes that ensure the reliability, availability, scalability, and performance of our cloud infrastructure. You will collaborate with various teams, design and implement scalable systems, manage SLOs/SLAs, enhance incident response, and continuously improve services. You will also apply your software engineering expertise to develop software platforms and tools. This role has a significant impact on our high-performance, serverless ClickHouse Cloud. The position can be remote in any country where ClickHouse has a hiring presence. We offer competitive compensation and a range of benefits, including flexible work, healthcare contributions, stock options, generous time off, and a home office setup allowance.

Requirements

Bachelor’s or Master’s degree in Computer Science or a related field
At least 8 years of experience in Site Reliability Engineering or a related field
Previous experience using ClickHouse in production
Coding experience with Go and/or Python
Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm
Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet
You are a strong problem-solver and have solid production debugging skills
You are passionate about efficiency, availability, scalability, and data governance
You thrive in a fast-paced environment as part of a global team, and you see yourself as a partner to the business, sharing the goal of moving it forward
You have a high level of responsibility, ownership, and accountability
Excellent communication and interpersonal skills

Responsibilities

Collaborate with various engineering teams at ClickHouse to design and implement scalable, secure, and highly available systems
Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud
Make sure all infrastructure components in ClickHouse Cloud (including the Data Plane, Control Plane, and ClickHouse Core) have monitoring and alerting in place so that incidents are detected and resolved promptly
Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers
Continuously improve the reliability and performance of our ClickHouse services
Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime

Preferred Qualifications

Excellent understanding of distributed databases and SQL, particularly ClickHouse

Benefits

Flexible work environment - ClickHouse is a distributed company offering remote-first work to all employees
Healthcare - Employer contributions towards your healthcare
Equity in the company - Every new team member who joins our company receives stock options
Time off - Flexible time off in the US, generous entitlement in all countries
Home office setup - A $500 allowance if you’re a remote employee
Employee-driven international mobility - we enable you to relocate internationally if you wish (within certain countries and timelines and subject to role requirements, time zones and work permit considerations)
Cash compensation and a stock options grant

Senior Site Reliability Engineer

ClickHouse

Remote

DevOps

Senior
