Senior Systems Engineer

Sauce Labs

💵 $135k-$165k
📍 Remote - United States

Summary

Join Sauce Labs as a Senior Systems Engineer and contribute to the architecture, operation, and scaling of our hybrid cloud infrastructure. Lead the design and management of Kubernetes clusters, develop infrastructure-as-code using Terraform, and manage hardware and systems in our global data centers. You will optimize performance, engineer our observability stack, and design disaster recovery strategies. This role requires expertise in Kubernetes, cloud technologies (GCP, AWS), automation, and Linux systems administration. The position offers a competitive salary and benefits package, including health insurance, parental leave, flexible time off, and a 401(k) plan.

Requirements

  • Proven ability to execute on high-level goals independently and to lead technical initiatives within cross-functional teams
  • 5+ years of experience as a Linux administrator/engineer at scale (hundreds of systems), with a deep understanding of designing and deploying highly available solutions
  • 3+ years of recent, hands-on professional experience architecting, operating, and scaling Kubernetes clusters in a large-scale production environment
  • Expertise in configuration management solutions, preferably Ansible, for managing infrastructure at scale
  • Strong skills in at least one programming language: Python (preferred) or Go
  • Solid experience in Linux performance tuning, profiling, and monitoring
  • Deep experience deploying and managing services in GCP and/or AWS using Terraform
  • Experience with virtualization technologies, specifically KVM-Qemu
  • A solid understanding of cloud, networking, and distributed computing concepts (TCP/IP, firewalls, VLANs, load balancing, etc.)
  • Experience with testing frameworks for infrastructure automation (e.g., InSpec, Ansible Molecule)
  • Familiarity with ZFS on Linux and managing storage appliances (iSCSI, NFS)
  • Deep experience with modern observability tooling (Prometheus, Grafana)
  • Excellent communication skills (verbal and written) and the ability to collaborate effectively across all levels of the organization
  • Familiarity with software engineering best practices and agile methodologies

Responsibilities

  • Kubernetes and Cloud Native Architecture: Lead the design, deployment, and lifecycle management of highly available, scalable Kubernetes clusters across both our data centers and public cloud providers (GCP, AWS)
  • Infrastructure Automation: Write and maintain expert-level infrastructure-as-code using Terraform to deploy and manage services in our hybrid cloud environment. Develop robust automation and self-service tooling in Python or Go to empower engineering teams
  • System and Hardware Operations: Install, configure, debug, and manage a diverse range of hardware and systems in our global data centers, including Dell, SuperMicro, storage arrays (NAS/SAN), and custom mobile device appliances
  • Scalability and Performance Engineering: Creatively solve complex scaling challenges within our rapidly expanding environment. Optimize hardware, hypervisor (KVM-Qemu), and Kubernetes configurations to enhance performance and efficiency
  • Observability and Monitoring: Engineer and enhance our observability stack (Prometheus, Grafana) to provide deep insights into the health and performance of our Kubernetes clusters, applications, and underlying infrastructure
  • Disaster Recovery and Resiliency: Design, implement, and maintain robust disaster recovery strategies for critical production services, with a focus on multi-cluster and multi-region Kubernetes deployments
  • Bare Metal Provisioning: Automate the deployment and lifecycle management of operating systems on bare metal servers using tools like PXE and Foreman
  • Documentation and Runbooks: Create and maintain clear, comprehensive documentation, architectural diagrams, and NOC runbooks for the environments you manage
  • Troubleshooting: Act as a senior escalation point for complex troubleshooting of application, server, and network issues within our containerized and virtualized environments
  • On-Call: Participate in a 24x7 on-call rotation to ensure the stability and availability of the Sauce Labs platform

Benefits

  • Health coverage (medical, dental, and vision) along with disability and life insurance
  • Parental leave benefits
  • Flexible time off
  • Professional development
  • A 401(k) retirement plan with match
