Remote Senior AI/HPC Storage Engineer


Recursion

💵 $160k-$182k
📍 Remote

Job highlights

Summary

Join our innovative team as a Senior AI/HPC Storage Engineer and design, implement, and manage advanced AI/HPC data systems that propel our groundbreaking drug discovery research. You will be instrumental in ensuring the performance, scalability, and reliability of our storage systems.

Requirements

  • A minimum of 7 years of experience in managing data storage infrastructure, preferably within global BioPharma organizations
  • In-depth knowledge of distributed/parallel file systems (IBM Storage Scale, formerly GPFS), multi-tier file storage (NAS), hybrid object storage (MinIO), and storage access and data transfer networking protocols
  • Experience with RDMA-capable high-speed networking
  • Extensive experience designing, deploying, testing, supporting, and troubleshooting complex Linux-based computing and data storage environments
  • Python programming and Bash scripting experience, plus in-depth hands-on experience provisioning, configuring, and managing infrastructure through modern CI/CD techniques, GitOps, Infrastructure as Code (IaC), and cloud automation principles
  • Solid experience with software-defined infrastructure and cloud computing platforms, including Kubernetes, GCP, AWS, and others
  • Practical knowledge of resource management and job scheduling using Slurm and Kubernetes, and of container technologies such as Apptainer and Docker

Responsibilities

  • Design, implement, test, maintain, and optimize our data storage infrastructure and services, utilizing an Infrastructure as Code approach across both on-premises and public cloud environments
  • Drive innovation across all storage tiers within our AI/HPC infrastructure, ensuring we deliver a scalable and effective data platform to support our mission
  • Automate and verify storage infrastructure provisioning and dynamic reconfiguration, enhancing support for our AI/HPC storage environments
  • Conduct performance analysis, benchmarking, troubleshooting, and fine-tuning of our data storage systems and services, while efficiently managing user tickets
  • Research, deploy, and optimize accessibility, performance, security, and data lifecycle management policies
  • Regularly assess our storage platforms' health and operational performance against established metrics, with a focus on meeting and exceeding operational service level objectives

Benefits

Comprehensive benefits package for United States-based candidates, including bonuses and equity compensation

