Senior Site Reliability Engineer

Rackspace Technology

💵 $116k-$198k
📍Remote - United States, Canada

Summary

Join our team as a Senior Site Reliability Engineer (SRE)! This remote role requires expertise in managing large-scale, data-intensive systems and in Google Cloud Platform (GCP). You will administer and maintain cloud infrastructure, implement CI/CD pipelines, and build scalable infrastructure for ML model training and real-time inference. Strong troubleshooting and debugging skills are essential. The position demands experience with cloud-native databases, machine learning platforms, and cloud observability tools. You will collaborate with ML and data teams to ensure system reliability and performance.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field
  • 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, including hands-on operational support and participation in on-call rotations
  • Proven track record of managing large-scale applications, distributed systems, and networked services in production
  • Minimum of 5 years of hands-on experience in cloud environments
  • Deep understanding of Google Cloud Platform (GCP) — especially GKE, GCE, networking, and security
  • Strong troubleshooting and debugging skills across systems and networks
  • Experience with cloud-native databases and storage, including Google Cloud Storage (GCS), Cloud SQL, Spanner, and Firestore
  • Experience with machine learning and AI platforms, such as Vertex AI, Generative AI tools, BigQuery, Looker, and Dataproc
  • Hands-on experience with cloud observability and monitoring, including OpenTelemetry, tracing, metrics, and distributed logging systems

Responsibilities

  • Administer and optimize cloud-native databases and storage platforms, including Google Cloud Storage (GCS), Cloud SQL, Spanner, and Firestore
  • Support and maintain machine learning and analytics platforms, including Vertex AI, Generative AI, BigQuery, Looker, and Dataproc, ensuring scalable and reliable infrastructure for data pipelines and model workflows
  • Implement and manage cloud observability using OpenTelemetry and native GCP tools to enable real-time monitoring, distributed tracing, and incident resolution
  • Support and maintain large-scale applications, computer systems, and networks in production environments
  • Administer and troubleshoot Linux-based systems and core networking protocols such as TCP/IP, HTTP, DNS, and mail protocols, and manage components such as content delivery networks (CDNs) and load balancers
  • Manage and operate GCP services, including Google Kubernetes Engine (GKE), Compute Engine (GCE), networking, security, CI/CD pipelines, and other common cloud technologies
  • Build and maintain cloud infrastructure using Infrastructure as Code (IaC) tools such as Terraform, Ansible, and Helm Charts
  • Develop and deploy services using Python, Golang, or Java, and implement CI/CD pipelines to ensure consistent, reliable delivery of applications and infrastructure components

Benefits

  • The anticipated starting pay range for Colorado is: $116,100 - $170,280
  • The anticipated starting pay range for the states of Hawaii and New York (not including NYC) is: $123,600 - $181,280
  • The anticipated starting pay range for California, New York City and Washington is: $135,300 - $198,440
  • Unless already included in the posted pay range and based on eligibility, the role may include variable compensation in the form of bonus, commissions, or other discretionary payments
  • Remote work
