Summary

Join our team as a Senior Site Reliability Engineer (SRE)! This remote role requires expertise in managing large-scale, data-intensive systems and GCP. You will administer and maintain cloud infrastructure, implement CI/CD pipelines, and build scalable infrastructure for ML model training and real-time inference. Strong troubleshooting and debugging skills are essential. The position demands experience with cloud-native databases, machine learning platforms, and cloud observability tools. You will collaborate with ML and data teams to ensure system reliability and performance.

Requirements

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field
5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, including hands-on operational support and participation in on-call rotations
Proven track record of managing large-scale applications, distributed systems, and networked services in production
5+ years of hands-on experience in cloud environments
Deep understanding of Google Cloud Platform (GCP) — especially GKE, GCE, networking, and security
Strong troubleshooting and debugging skills across systems and networks
Experience with cloud-native databases and storage, including Google Cloud Storage (GCS), Cloud SQL, Spanner, and Firestore
Experience with machine learning and AI platforms such as Vertex AI, Generative AI tools, BigQuery, Looker, and Dataproc
Hands-on experience with cloud observability and monitoring, including OpenTelemetry, tracing, metrics, and distributed logging systems

Responsibilities

Administer and optimize cloud-native databases and storage platforms, including Google Cloud Storage (GCS), Cloud SQL, Spanner, and Firestore
Support and maintain machine learning and analytics platforms, including Vertex AI, Generative AI, BigQuery, Looker, and Dataproc, ensuring scalable and reliable infrastructure for data pipelines and model workflows
Implement and manage cloud observability using OpenTelemetry and native GCP tools to enable real-time monitoring, distributed tracing, and incident resolution
Support and maintain large-scale applications, computer systems, and networks in production environments
Administer and troubleshoot Linux-based systems and core networking protocols such as TCP/IP, HTTP, DNS, and mail protocols, and manage components such as content delivery networks (CDNs) and load balancers
Manage and operate GCP services, including Kubernetes Engine (GKE), Compute Engine (GCE), networking, security, and CI/CD pipelines, along with other common cloud technologies
Build and maintain cloud infrastructure using Infrastructure as Code (IaC) tools such as Terraform, Ansible, and Helm Charts
Develop and deploy services using Python, Golang, or Java, and implement CI/CD pipelines to ensure consistent, reliable delivery of applications and infrastructure components

Benefits

The anticipated starting pay range for Colorado is: $116,100 - $170,280
The anticipated starting pay range for the states of Hawaii and New York (not including NYC) is: $123,600 - $181,280
The anticipated starting pay range for California, New York City and Washington is: $135,300 - $198,440
Unless already included in the posted pay range and based on eligibility, the role may include variable compensation in the form of bonus, commissions, or other discretionary payments
Remote work

Senior Site Reliability Engineer

Rackspace Technology

Remote

DevOps

Senior
