Principal Site Reliability Engineer

Upwork

📍 Remote - Worldwide

Summary

Join Upwork's Hybrid Workforce Solutions Team as a technical leader in modern SRE practices, focusing on zero-trust infrastructure, platform observability, and cloud-native scalability. Guide the architectural evolution of reliability systems, champion SLO-driven engineering, and partner with platform and security teams. Develop AI-assisted tools and workflows to reduce operational burden, define and maintain end-to-end observability strategies, and drive infrastructure automation efforts. Lead post-incident reviews and reliability audits, and mentor engineers on designing and operating reliable, scalable systems. This full-time position includes participation in an on-call rotation.

Requirements

  • 10+ years in SRE, DevOps, or production engineering roles, including experience operating large-scale distributed systems in production
  • Deep expertise in Kubernetes operations, including multi-cluster orchestration, service mesh (Istio or equivalent), and workload policy management (e.g., OPA, Kyverno)
  • Proven experience building and maintaining GitOps pipelines using tools like ArgoCD or Flux
  • Strong fluency in observability tooling (e.g., Prometheus, OpenTelemetry, Grafana, or Datadog), with a focus on SLO-based alerting and incident detection
  • Familiarity with reliability-as-code practices and automation using scripting languages (Python, Go, or Bash) and AI-enhanced workflows (e.g., Cursor, incident bots, PR-generating agents)
  • Experience designing and enforcing zero-trust service-to-service authentication, workload identity, and mTLS policies
  • Track record of leading incident review programs, standardizing postmortems, and driving systemic reliability improvements
  • Ability to work cross-functionally with platform, security, and developer enablement teams to embed resilience across the SDLC

Responsibilities

  • Serve as a technical leader in modern SRE practices with a focus on zero-trust infrastructure, platform observability, and cloud-native scalability
  • Guide the architectural evolution of reliability systems, including multi-cluster Kubernetes environments, GitOps workflows, and service mesh integration
  • Champion SLO-driven engineering across teams and establish frameworks for defining, tracking, and enforcing reliability standards
  • Partner with platform and security teams to enable service-to-service authentication, policy enforcement, and resilient control planes
  • Develop AI-assisted tools and workflows (e.g., for incident triage, RCA generation, auto-remediation) to reduce operational burden and accelerate resolution
  • Define and maintain end-to-end observability strategies including distributed tracing, metrics pipelines, and log enrichment
  • Drive infrastructure automation efforts using IaC best practices, with an emphasis on policy-as-code, workload identity, and platform governance
  • Lead post-incident reviews and reliability audits to surface systemic gaps and drive continuous improvement
  • Mentor engineers across infrastructure and application teams on designing and operating reliable, scalable systems
