Remote Infrastructure Operations Engineer
Aethir
Remote - United States
Please let Aethir know you found this job on JobsCollider. Thanks!
Job highlights
Summary
Aethir is seeking a skilled Infrastructure Operations Engineer to manage and optimize its GPU-based compute infrastructure across multiple locations and partners. The role involves deploying, configuring, and maintaining servers, storage, networking, and the associated software stack, as well as implementing automation scripts and tools for streamlined deployment and management.
Requirements
- Experience in infrastructure operations, preferably in a DevOps, SRE, Sales Engineering, or Solution Architect role focused on GPU compute
- Proficiency in managing GPU-based compute infrastructure, including NVIDIA GPUs and CUDA programming
- Strong expertise in Linux system administration and scripting (e.g., Bash, Python)
- Experience with configuration management tools (e.g., Ansible, Chef, Puppet) and version control systems (e.g., Git)
- Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes)
- Solid understanding of networking concepts, protocols, and troubleshooting techniques
- Excellent analytical and problem-solving skills, with a proactive and results-oriented mindset
- Strong communication skills and the ability to collaborate effectively with cross-functional teams. Speaking Mandarin is a bonus, as Aethir has engineering teams in China and Southeast Asia
- Experience with cloud computing platforms (e.g., AWS, Azure, GCP) and hybrid cloud architectures
- Knowledge of HPC frameworks and job scheduling systems (e.g., Slurm, PBS Pro)
- Familiarity with GPU-accelerated libraries and frameworks (e.g., TensorFlow, PyTorch, CUDA Toolkit)
- Understanding of cybersecurity principles and practices, including encryption, access controls, and threat detection/prevention
Responsibilities
- Infrastructure Management: Deploy, configure, and maintain GPU-based compute infrastructure
- Monitoring and Optimization: Implement robust monitoring and alerting systems to proactively identify performance bottlenecks, resource constraints, and potential failures
- Automation and Orchestration: Develop automation scripts and tools to streamline deployment, configuration, and management of infrastructure components
- Security and Compliance: Implement and enforce security best practices to safeguard sensitive data and ensure compliance with relevant regulations and industry standards
- Incident Response and Troubleshooting: Provide tier-3 support for infrastructure-related issues, investigating root causes and implementing timely resolutions
- Capacity Planning and Scaling: Collaborate with cross-functional teams to forecast resource requirements, plan capacity upgrades, and scale infrastructure to accommodate growing workloads and user demands
- Documentation and Knowledge Sharing: Maintain comprehensive documentation of infrastructure configurations, procedures, and troubleshooting guidelines. Share knowledge and best practices with team members
Benefits
- Competitive compensation structure, with a flexible fiat/token mix
- Benefits are flexible, depending on location and setup
- Salary is also flexible, depending on location and setup
- Flexible work hours and remote work options