Senior Site Reliability Engineer

Superhuman
Summary
Join Superhuman, a company building the productivity platform of the future, as a Senior Site Reliability Engineer (SRE) / DevOps Engineer. This dual role combines SRE responsibilities (60%), ensuring system availability and performance, with DevOps practices (40%), focusing on automation and CI/CD. You will collaborate with software engineers, design scalable systems, monitor service health, and implement disaster recovery plans. The ideal candidate has 6+ years of experience in SRE or DevOps, strong cloud platform proficiency, and expertise in containers, monitoring, Infrastructure as Code, and CI/CD tooling. Superhuman offers a competitive salary ($160,000 - $185,000) and comprehensive benefits, including health insurance, 401(k) matching, generous PTO, and professional development opportunities. We are open to candidates in the US, Canada, or Latin America.
Requirements
- 6+ years of experience in SRE, DevOps, or systems engineering roles
- Proven experience managing high-availability, mission-critical systems
- Strong proficiency with cloud platforms (GCP, AWS, or Azure)
- Hands-on experience with containers and orchestration tools (Docker, Kubernetes)
- Expertise in monitoring, logging, and alerting tools (e.g., Metabase, Datadog, Prometheus, Grafana)
- Proficiency in scripting/programming languages (Python, Go, Bash, etc.)
- Knowledge of database management systems (SQL/NoSQL)
- Strong knowledge of networking, security, and distributed systems
- Experience with Infrastructure as Code (Terraform, Ansible, Chef, or Puppet)
- Familiarity with version control systems (Git) and CI/CD pipelines (Jenkins, GitLab CI, etc.)
- Strong communication skills and ability to work collaboratively across teams
- Problem-solving mindset with a focus on root cause analysis
- Proactive, self-driven, and able to handle high-pressure environments
Responsibilities
- Collaborate with software engineers to design scalable, fault-tolerant systems and services, and help integrate AI solutions into existing architectures so that AI models, frameworks, and tools work efficiently within the broader system without causing disruptions
- Proactively monitor service health, availability, and performance using tools like Metabase, Datadog, Prometheus, and Grafana
- Establish SLAs, SLOs, and SLIs for key services and ensure alignment with business goals
- Respond to and troubleshoot production issues, ensuring quick resolution and minimal downtime
- Conduct post-incident reviews to ensure continuous learning and improvement
- Perform capacity planning and scaling activities to ensure system resilience during traffic spikes or unexpected failures
- Automate repetitive tasks to enhance efficiency (e.g., provisioning, monitoring, and alerting)
- Implement self-healing mechanisms to reduce manual intervention
- Continuously analyze system performance, identify bottlenecks, and work with teams to optimize applications and infrastructure
- Design and implement disaster recovery plans and high availability strategies
- Test failover mechanisms and backups regularly
- Collaborate with our security team to ensure infrastructure adheres to best practices and compliance requirements
- Implement and manage security monitoring, patching, and auditing for critical services
- Build, maintain, and enhance CI/CD pipelines using tools like Jenkins, GitLab CI, CircleCI, or similar
- Ensure smooth and efficient deployment processes, enabling fast and reliable delivery of code changes to production
- Manage and automate infrastructure provisioning and configuration using tools like Terraform
- Work on containerization solutions using Docker and orchestration with Kubernetes
- Work closely with development teams to ensure best practices in deployment and release processes
- Champion DevOps culture by mentoring and guiding other engineers in the use of tools and best practices
Benefits
- Medical, dental, and vision insurance: 100% coverage for you and 75% coverage for all your dependents
- Voluntary insurance: short-term disability, long-term disability, and life insurance
- 401(k) plan (we match 75 cents per dollar, up to 4% of your salary)
- Free access to Northstar, a financial wellness platform that provides financial advisors and personal finance tools
- Generous and flexible Paid Time Off (PTO) policy, with team members taking an average of 20 days per year
- 13 additional company holidays, plus your own Care Days, Flexible Holidays, and a company-wide Winter Break
- Generous parental, caregiver, healthcare, and compassionate leave policies
- $3,000 per year towards your professional development
- Free access to Calm and Aaptiv
- Allyship education program to help build your best self
- Custom MacBook Pro
- $1,000 budget for workstation setup
- $260/month for your lunches, groceries, or whatever nutrition you need to stay fueled up!
- Flexible spending accounts for commuter costs, dependent care, and healthcare expenses