Staff Site Reliability Engineer

IntelliPro

💵 $200K–$250K
📍 Remote - Worldwide

Summary

Join our team as a Staff Site Reliability Engineer and help scale and maintain the massive GPU infrastructure powering our cutting-edge AI systems. You will work closely with engineers and researchers, operate and manage thousands of GPUs across multiple cloud providers, design scalable solutions for growing compute demands, and build resilient systems. You will also develop automation tools, maintain monitoring systems, define and track SLOs/SLIs, and participate in an on-call rotation. This role requires 7+ years of experience in reliability engineering in fast-paced environments, deep knowledge of GPU infrastructure, proficiency in one or more scripting or programming languages, and experience with Kubernetes and Infrastructure-as-Code tools. We offer a competitive salary ($200K–$250K), equity, comprehensive health benefits, generous PTO, and support for professional development. The ideal candidate will have experience in AI/ML infrastructure or managing large-scale GPU clusters.

Requirements

  • 7+ years of proven experience as a reliability engineer, infrastructure engineer, or production engineer in fast-paced, high-growth environments
  • Deep knowledge of GPU infrastructure, including scheduling, scaling, cloud networking, storage, and security
  • Proficiency in one or more scripting or programming languages
  • Strong experience with Kubernetes or similar container orchestration systems
  • Familiarity with Infrastructure-as-Code tools like Terraform or CloudFormation
  • Experience working with observability tools like Prometheus, Grafana, Datadog, ELK, or Splunk
  • Excellent troubleshooting, debugging, and systems-thinking skills
  • Strong communication skills and a collaborative mindset

Responsibilities

  • Work closely with engineers and researchers to define and meet system performance, availability, and efficiency requirements
  • Operate and manage thousands of GPUs distributed across multiple cloud providers and clusters
  • Design scalable solutions to support rapid growth in compute demands for AI model training, data processing, and inference
  • Build resilient, fault-tolerant systems to ensure continuous uptime and seamless performance
  • Develop automation tools to eliminate toil and streamline infrastructure operations
  • Set up and maintain monitoring systems to proactively detect issues and drive performance improvements
  • Define and track SLOs and SLIs that uphold system reliability standards
  • Participate in an on-call rotation to ensure 24/7 system availability

Preferred Qualifications

  • Experience in AI/ML infrastructure or managing large-scale GPU clusters

Benefits

  • Base Salary: $200K–$250K/year
  • Competitive equity package (stock options)
  • Comprehensive health benefits
  • Generous PTO and flexible work policies
  • Support for ongoing professional development
