Senior Cloud Performance Engineer

ClickHouse
Summary
Join ClickHouse's Cloud Engineering team as a distributed systems performance engineer. You will benchmark system performance, troubleshoot applications, recommend configuration optimizations, and collaborate with various teams to improve ClickHouse Cloud's performance. The role involves planning and executing chaos engineering initiatives, developing tools for chaos experiments, and studying software resilience. You'll need 6+ years of experience in building and operating scalable distributed systems, proficiency in programming languages like Go or C++, and expertise with cloud infrastructure and Kubernetes. ClickHouse offers a remote-first work environment, healthcare contributions, stock options, flexible time off, a home office setup allowance, and opportunities for international mobility.
Requirements
- 6+ years of relevant software development industry experience building and operating scalable, fault-tolerant, distributed systems
- Software development experience in Go, C/C++, Java, or similar
- Experience with concurrency, multithreading, and the deployment of distributed system architectures
- Experience developing cloud infrastructure services, preferably with Kubernetes
- Experience leading and shipping large scope technical projects in collaboration with multiple experienced engineers
- Expertise with a public cloud provider (AWS, GCP, Azure) and its infrastructure-as-a-service offering (e.g. EC2)
- Excellent communication skills and the ability to work well within a team and across engineering teams
- Strong problem-solving and solid production debugging skills
- Passionate about efficiency, availability, scalability and data governance
- Thrive in a fast-paced environment, and see yourself as a partner to the business with the shared goal of moving it forward
- High level of responsibility, ownership, and accountability
Responsibilities
- Benchmark system performance, and carry out database performance analysis, capacity sizing, and optimization
- Troubleshoot and debug applications, server errors, and logs, and triage issues accordingly
- Recommend configuration tuning/optimizations for performance bottlenecks
- Work closely and partner with ClickHouse's core development team, cloud team, and security team to improve the performance of ClickHouse Cloud
- Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
- Develop, deploy and manage tools to systematically run chaos experiments and measure impact
- Enjoy working on, and gaining a deep understanding of, large scale distributed systems
- Study the problems in the software resilience, operational, and delivery spaces
- Extend our entire backend to enable Chaos Engineering techniques in the system
- Observe running systems, and determine/prioritize innovative ways to disrupt them
Benefits
- Cash compensation and a stock options grant
- Flexible work environment - ClickHouse is a distributed company offering remote-first work to all employees
- Healthcare - Employer contributions towards your healthcare
- Equity in the company - Every new team member who joins our company receives stock options
- Time off - Flexible time off in the US, generous entitlement in all countries
- A $500 home office setup if you're a remote employee
- Employee-driven international mobility - we enable you to relocate internationally if you wish (within certain countries and timelines and subject to role requirements, time zones and work permit considerations)