Summary

Join ClickHouse's Site Reliability Engineering team as one of its first members and play a key role in ensuring the reliability, availability, scalability, and performance of ClickHouse. You will build and lead processes, collaborate with various teams, and guide them in implementing ClickHouse effectively for customers. Responsibilities include managing engineering escalations, conducting investigations and post-mortem analyses, and continuously improving ClickHouse's cloud operations. This position offers a unique opportunity to significantly impact the performance of ClickHouse in ClickHouse Cloud, and it is remote in any country where ClickHouse has a hiring presence. The ideal candidate has a Bachelor's or Master's degree in Computer Science or a related field, along with at least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering. Prior experience with ClickHouse or other SQL databases in production is essential.

Requirements

Bachelor’s or Master’s degree in Computer Science or a related field
At least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering
Previous experience operating ClickHouse or other SQL databases in production
Scripting experience with Shell or Python, and the ability to read and understand C++ code
Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
You are a strong problem-solver and have solid production debugging skills
You thrive in a fast-paced environment as part of a global team, and you see yourself as a partner with the business with the shared goal of moving the business forward
You have a high level of responsibility, ownership, and accountability
Excellent communication skills

Responsibilities

Continuously improve the reliability and performance of ClickHouse core
Create and improve metrics and alerts for ClickHouse to identify and prevent problems in production before they affect customers
Dig deeper into the most common problems customers encounter in ClickHouse core to identify their root causes, then submit bug fixes and issue reports and suggest improvements
Enhance and refine incident response processes and post-mortem analysis for outages related to ClickHouse core, including working with the Support and Cloud teams to communicate with impacted customers
Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize customer impact

Preferred Qualifications

Excellent understanding of distributed database internals and SQL; ClickHouse expertise in particular is a major plus

Benefits

Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
Healthcare - Employer contributions towards your healthcare
Equity in the company - Every new team member who joins our company receives stock options
Time off - Flexible time off in the US, generous entitlement in other countries
Home office setup - A $500 home office setup if you’re a remote employee
Global gatherings - We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites

Site Reliability Engineer

ClickHouse

Remote

DevOps

Mid-level
