Director of Site Reliability Engineers

Cyware

πŸ“Remote - United States

Summary

Join Cyware as the Director of Site Reliability Engineers (SREs) and lead the team responsible for keeping all user-facing services and production systems running smoothly. You will guide and develop SREs, ensure system monitoring, drive root cause analysis, and support on-call teams. The role also involves leading automation efforts, defining and measuring SRE metrics, overseeing high availability and disaster recovery, and optimizing cloud infrastructure. Collaboration with engineering, security, and operations teams across time zones is essential. The ideal candidate has extensive experience in SRE management and a strong technical background.

Requirements

  • US Citizenship
  • Bachelor's degree or higher in Computer Science, Engineering, IT, or a related discipline
  • 7 to 10 years of total experience as an SRE
  • 4 to 6 years of experience managing a team of SREs
  • Experience in knowledge sharing and mentoring team members
  • Self-awareness, with the ability to handle conflict within the team and to give and receive feedback
  • Accountability: willing to proactively step in and do the right thing while providing candid and constructive feedback
  • Cloud: AWS/Azure/GCP
  • Linux: Solid understanding of Linux systems, sed/awk/grep/egrep, vi/Vim/Emacs, netstat, lsof, strace, ps/top/atop/dstat, GRUB boot config and system rescue, fstab/disk labels, ext3/ext4, iptables, sysstat (sar/vmstat/iostat, etc.), run levels and startup scripts, sudo/chroot
  • Scripting: Bash/Python
  • Development Languages and Frameworks: Python/Django, Vue, React, Go (Golang)
  • Fundamentals: Basic DNS & Networking, TCP/UDP, IP Routing, HA & Load Balancing Concepts
  • Application Protocols: SMTP, HTTP, HTTPS, FTP, IMAP, POP

Responsibilities

  • Guide and develop SREs, setting clear goals and fostering a high-performance culture
  • Ensure system monitoring, drive root cause analysis, and support on-call teams to meet SLAs
  • Lead efforts to automate deployments, infrastructure provisioning, and operational tasks to minimize human error
  • Define and measure SRE metrics (SLIs, SLOs, SLAs) and drive continuous improvement
  • Oversee high availability (HA), disaster recovery (DR), and compliance monitoring
  • Manage and optimize cloud infrastructure using tools like Terraform, Kubernetes, and Jenkins
  • Ensure smooth deployments, operational readiness, and security compliance
  • Work across time zones to coordinate with engineering, security, and operations teams

Preferred Qualifications

  • Database Systems Fundamentals (MySQL/Postgres)
  • Redis
  • Nginx/Apache
  • Supervisor (supervisorctl)
  • Nagios
  • Yum
  • RPM
  • Git
  • Grafana
  • Prometheus
  • New Relic
  • ELK
  • Docker
  • Jenkins
  • RHCSA/RHCE/AWS (SysOps) certification

Benefits

  • Time off
  • Paid holidays
  • Retirement plans
  • Insurance coverage
  • Professional development opportunities
  • Competitive compensation packages

This job is filled or no longer available.