Summary
Join The Linux Foundation as a Senior Cloud Operations Engineer and play a crucial role in managing and optimizing our multi-cloud infrastructure and DevOps practices for the PyTorch project. You will design and manage multi-cloud environments, optimize cloud resource utilization, implement CI/CD pipelines, develop performance testing frameworks, and ensure infrastructure security and monitoring. This position requires extensive experience in cloud operations, GPU computing, and DevOps methodologies. The ideal candidate will have a strong background in infrastructure-as-code, automated testing, and agile practices. We offer a competitive salary, comprehensive benefits, and a flexible remote work environment.
Requirements
- Bachelor's degree in Computer Science, Engineering, or related field
- 7+ years of experience in cloud operations with extensive multi-cloud expertise (AWS, GCP, Azure)
- Demonstrated experience with GPU computing (AMD and NVIDIA) and specialized accelerators (TPUs, NPUs)
- Strong knowledge of CPU architectures (AMD, Intel) and instance-type optimization
- Advanced experience with GitHub Actions, including custom runner configuration and management
- Expertise in implementing non-blocking and out-of-tree CI jobs
- Strong background in version control systems and branching strategies
- Experience with agile methodologies and Scrum practices
- Proficiency in infrastructure-as-code tools, particularly Terraform
- Strong scripting abilities (Python, Bash, PowerShell, TypeScript)
- Experience with containerization and orchestration (Docker, Kubernetes)
- Demonstrated experience in implementing automated testing frameworks
Responsibilities
- Design and manage multi-cloud environments across AWS, GCP, and Azure
- Optimize instance selection and utilization across compute types, including AMD and Intel CPU-based instances
- Configure and manage GPU-accelerated instances (AMD and NVIDIA) and specialized accelerators (TPUs, NPUs)
- Implement and maintain infrastructure-as-code using Terraform and other IaC tools
- Optimize cloud resource utilization and implement FinOps practices for cost management
- Design and implement high-availability solutions across multiple cloud providers
- Design, implement, and maintain CI/CD pipelines using GitHub Actions
- Configure and manage both GitHub-hosted and self-hosted runners
- Implement and maintain non-blocking and out-of-tree CI jobs
- Design and implement matrix testing strategies across different hardware configurations
- Develop and maintain automated testing frameworks for various testing types (unit, integration, performance)
- Implement best practices for version control management and branching strategies
- Develop and implement performance testing frameworks for various hardware accelerators
- Optimize workload distribution across different types of compute instances
- Implement automated performance regression testing
- Design and maintain benchmarking systems for various hardware configurations
- Implement security best practices across multi-cloud environments
- Develop comprehensive monitoring solutions using cloud-native tools
- Participate in on-call rotations supporting operations and incident response
- Establish and maintain escalation procedures and resolution processes
- Manage access control and security policies across cloud platforms
Preferred Qualifications
- Experience optimizing workloads across different hardware accelerators
- Background in performance testing and optimization
- Contributions to open-source projects
- Experience mentoring other engineers
- Background in machine learning infrastructure
- Experience with Datadog
Benefits
- Competitive salary
- Comprehensive health, dental, and vision insurance
- Flexible PTO policy
- Remote work environment
- Professional development opportunities
- 401(k) matching
- Home office stipend