Manager of Infrastructure Operations

Voltage Park

📍Remote - Continental US (PST hours)

Summary

Join Voltage Park as Manager of Infrastructure Operations and lead the 24/7 team responsible for the stability, scalability, and performance of the company's compute, storage, and platform systems. The team delivers proactive monitoring, automation toolsets, and continuous optimization to maintain high availability, providing high-performance environments for AI/ML training, inference, and HPC workloads. The position requires strong technical and leadership skills with a focus on operational excellence. The role is fully remote for candidates based in the continental US who can work PST hours. Sponsorship is not provided for this role.

Requirements

  • Proficiency in Puppet, Terraform, and Ansible
  • Strong scripting skills in Bash, Python, or Go
  • Extensive experience in setting up, deploying, and managing Kubernetes clusters
  • Proven track record of architecting, building, and delivering complex systems from inception
  • Ability to strike a balance between pragmatic development and ideal architectures
  • Skilled at navigating trade-offs between design, risk, cost, and outcomes
  • Deep understanding of network protocols, network programming, Unix variants, monitoring, and security systems
  • Excellent written and verbal communication skills
  • Demonstrated ability to inspire and lead a team towards common goals, fostering a positive and collaborative work environment
  • Proven track record of effectively delegating tasks, providing constructive feedback, and developing team members' skills
  • Strong decision-making skills, capable of guiding the team through complex technical challenges and strategic initiatives
  • Ability to communicate a clear vision and align team efforts with broader company objectives
  • Experience in conflict resolution and team building, promoting diversity, equity, and inclusion within the team and the organization

Responsibilities

  • Establish and uphold the standard practices for our expanding InfraOps team
  • Lead and mentor a 24/7 Infrastructure Operations team responsible for monitoring, maintaining, and supporting our infrastructure
  • Develop and maintain operational runbooks, escalation procedures, and documentation for critical systems
  • Collaborate with the Infrastructure Engineering, Network Operations, Datacenter Operations, and Customer Success teams to support infrastructure rollouts, upgrades, and scaling efforts
  • Oversee observability systems (monitoring, logging, alerting) and drive continuous improvements in automation and root-cause analysis
  • Drive adoption of “Infrastructure as Code” and automated workflows to reduce manual intervention
  • Implement and enforce best practices for system availability, performance tuning, capacity planning, and lifecycle management
  • Be available for on-call support during urgent system incidents
  • Ensure compliance with security, regulatory, and organizational standards across all environments

Benefits

  • Full remote flexibility
  • Candidates must be based in the continental US and available to work during PST hours
