Summary

Join Upwork's Hybrid Workforce Solutions Team as a technical leader in modern SRE practices, focusing on zero-trust infrastructure, platform observability, and cloud-native scalability. Guide the architectural evolution of reliability systems, champion SLO-driven engineering, and partner with platform and security teams. Develop AI-assisted tools and workflows to reduce operational burden, define and maintain end-to-end observability strategies, and drive infrastructure automation efforts. Lead post-incident reviews and reliability audits, and mentor engineers on designing and operating reliable, scalable systems. This is a full-time position that includes an on-call rotation.

Requirements

10+ years in SRE, DevOps, or production engineering roles, including experience operating large-scale distributed systems in production
Deep expertise in Kubernetes operations, including multi-cluster orchestration, service mesh (Istio or equivalent), and workload policy management (e.g., OPA, Kyverno)
Proven experience building and maintaining GitOps pipelines using tools like ArgoCD or Flux
Strong fluency in observability tooling (e.g., Prometheus, OpenTelemetry, Grafana, or Datadog), with a focus on SLO-based alerting and incident detection
Familiarity with reliability-as-code practices and automation using scripting languages (Python, Go, or Bash) and AI-enhanced workflows (e.g., Cursor, incident bots, PR-generating agents)
Experience designing and enforcing zero-trust service-to-service authentication, workload identity, and mTLS policies
Track record of leading incident review programs, standardizing postmortems, and driving systemic reliability improvements
Ability to work cross-functionally with platform, security, and developer enablement teams to embed resilience across the SDLC

Responsibilities

Serve as a technical leader in modern SRE practices with a focus on zero-trust infrastructure, platform observability, and cloud-native scalability
Guide the architectural evolution of reliability systems, including multi-cluster Kubernetes environments, GitOps workflows, and service mesh integration
Champion SLO-driven engineering across teams and establish frameworks for defining, tracking, and enforcing reliability standards
Partner with platform and security teams to enable service-to-service authentication, policy enforcement, and resilient control planes
Develop AI-assisted tools and workflows (e.g., for incident triage, RCA generation, auto-remediation) to reduce operational burden and accelerate resolution
Define and maintain end-to-end observability strategies including distributed tracing, metrics pipelines, and log enrichment
Drive infrastructure automation efforts using IaC best practices, with an emphasis on policy-as-code, workload identity, and platform governance
Lead post-incident reviews and reliability audits to surface systemic gaps and drive continuous improvement
Mentor engineers across infrastructure and application teams on designing and operating reliable, scalable systems

Principal Site Reliability Engineer

Upwork

Remote

DevOps

Principal
