Senior Site Reliability Engineer

ClickHouse

📍Remote - Germany

Summary

Join ClickHouse's Site Reliability Engineering team as one of its first members and play a key role in ensuring the reliability, availability, scalability, and performance of our cloud infrastructure. You will collaborate with engineering teams to design and implement scalable systems, establish service level objectives (SLOs) and service level agreements (SLAs), and enhance incident response processes. You will also build and lead the processes that keep our cloud infrastructure reliable, continuously improve the reliability and performance of ClickHouse services, manage on-call processes, and drive Chaos initiatives. This role offers a unique opportunity to significantly impact our high-performance, serverless ClickHouse Cloud.

Requirements

  • Bachelor’s or Master’s degree in Computer Science or a related field
  • At least 8 years of experience in Site Reliability Engineering or a related field
  • Previous experience using ClickHouse in production
  • Coding experience with Go and/or Python
  • Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
  • Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm
  • Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet
  • You are a strong problem-solver and have solid production debugging skills
  • You are passionate about efficiency, availability, scalability, and data governance
  • You thrive in a fast-paced environment as part of a global team, and you see yourself as a partner with the business with the shared goal of moving the business forward
  • You have a high level of responsibility, ownership, and accountability
  • Excellent communication and interpersonal skills

Responsibilities

  • Collaborate with engineering teams across ClickHouse to design and implement scalable, secure, and highly available systems
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud
  • Ensure all infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, and ClickHouse Core) have monitoring and alerting in place for timely detection and resolution of incidents
  • Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers
  • Continuously improve the reliability and performance of our ClickHouse services
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime

Preferred Qualifications

  • Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus

Benefits

  • Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
  • Healthcare - Employer contributions towards your healthcare
  • Equity in the company - Every new team member who joins our company receives stock options
  • Time off - Flexible time off in the US, generous entitlement in other countries
  • A $500 home office setup if you’re a remote employee
  • Global Gatherings - We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites
