Senior Software Engineer, Observability

CoreWeave

💵 $185k-$210k
📍Remote - United States

Summary

Join CoreWeave's Observability department and contribute to building high-performing systems for AI. As a senior observability engineer, you will focus on logging and tracing platforms, pipelines, and visualization, working with tools like Loki, ClickHouse, Kafka, and Grafana. You will build automated tooling, modernize logging platforms, design transparent migrations, and assist engineers in maximizing insights from observability data. You will also develop best practices, improve the performance and scalability of observability services, and work with massive telemetry data from GPU clusters. CoreWeave offers a dynamic environment, opportunities for growth, and a collaborative team culture. The role requires six or more years of experience in software or infrastructure engineering and expertise in logging and tracing platforms.

Requirements

  • You have six or more years of experience in software or infrastructure engineering
  • You enjoy helping your colleagues achieve more with less effort
  • You are customer obsessed, ecstatic to provide infrastructure as a service, and default to adopting a product lens when evaluating platform scale problems
  • You have experience with logging and tracing platforms in production and at scale, and are versed in reliability engineering concepts such as the different types of testing, progressive deployments, error budgets, the role observability plays, and fault-tolerant design
  • You are opinionated about the best use cases for each of the three pillars of observability, and are excited about enabling engineers of all stripes, from frontend down to the kernel, with the best tools for their specific jobs
  • You have experience using Kubernetes, a conceptual understanding of its major components, and/or have operated Kubernetes clusters at scale for both event-driven and stateful orchestration
  • You’re familiar with various logging and metrics systems like ClickHouse, Elastic, Loki, VictoriaMetrics, Prometheus, Thanos, and Grafana, and you have experience designing and operating these systems at scale
  • You are familiar with PromQL or another query language, and enjoy understanding the data models of observability systems
  • You are excited at the prospect of zero-downtime, transparent migrations of major production services
  • You’re comfortable with the idea of using Go as your primary programming language, but don’t shy away from templating YAML when circumstances call for it
  • You know your way around a Linux distro, shell scripting, and/or the Linux storage and networking stacks
  • You can transform problems into elastic solutions, decompose them into achievable tasks, and socialize both with your teammates
  • You’re excited about being part of a team of diverse perspectives and backgrounds that believe in tackling challenges, growing hand in hand, and winning together

Responsibilities

  • Build low-touch, automated tooling that enables CoreWeave engineers to clearly articulate which pieces of telemetry, visualizations, and dashboards should be presented to customers
  • Modernize logging platforms at cloud scale
  • Design migrations that are transparent to platform consumers
  • Assist CoreWeavers in differentiating between optimal use cases for structured and unstructured logs
  • Build governance mechanisms that empower CoreWeavers to effectively manage the telemetry their services produce and adopt best practices
  • Assist engineers across CoreWeave in maximizing the insights they glean from all three pillars of observability
  • Develop and enforce best practices regarding the health of telemetry ETL pipelines
  • Improve the performance, security, reliability, and scalability of observability services while participating in the team’s on-call rotation
  • Work with telemetry data at enormous scale, ingesting data from industry-leading GPU clusters
  • Grow, change, invest in your teammates, be invested-in, share your ideas, listen to others, be curious, have fun, and, above all, be yourself

Preferred Qualifications

  • Knowledge of Loki and/or ClickHouse is a strong plus
  • Knowledge of Kafka and Grafana will set you apart

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short- and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption
