Site Reliability Engineer
MetroStar
Job highlights
Summary
Join MetroStar as a Site Reliability Engineer and play a crucial role in designing, implementing, and maintaining the reliability and efficiency of our platforms. You will collaborate with cross-functional teams, lead initiatives, and contribute to the strategic direction of our client's infrastructure. Responsibilities include leading the design and management of highly available systems, collaborating on performance optimization, designing monitoring and alerting strategies, driving automation initiatives, and participating in on-call rotations. A minimum of 3 years of experience in a similar role and an active U.S. Government Secret security clearance are required. Strong experience with cloud technologies, infrastructure as code, programming languages, and containerization technologies is also necessary. MetroStar offers a generous benefits package, professional growth opportunities, and valuable time to recharge.
Requirements
- An active U.S. Government issued Secret security clearance (or higher)
- Minimum of 3 years of professional experience in a Site Reliability Engineering role or similar capacity
- Strong experience with cloud technologies (e.g., AWS, Azure, GCP) and infrastructure as code (e.g., Terraform, Ansible)
- Proficiency in programming and scripting languages (e.g., Python, Go, Bash) and RPA tools (e.g., Blue Prism, UiPath) to automate tasks and develop tools
- Deep understanding of containerization and orchestration technologies (e.g., Kubernetes, Docker)
- Expertise in implementing and managing monitoring and logging solutions (e.g., Zabbix, Nagios, Prometheus, ELK stack)
- Proven track record of designing, building, and maintaining highly available and scalable systems
- Expert-level proficiency in developing automated functional, regression, and performance tests and in defining automated testing standards for development teams
- Experience facilitating change and configuration management processes to drive reliability
- Strong problem-solving skills, with the ability to diagnose complex issues and implement effective solutions
- Excellent communication skills, with the ability to collaborate effectively across diverse teams
Responsibilities
- Lead the design, implementation, and management of highly available and scalable systems, applying industry best practices and reliability engineering principles
- Collaborate with cross-functional teams to identify performance bottlenecks, troubleshoot complex issues, and optimize system performance to meet defined service level objectives
- Design and implement monitoring, alerting, and incident response strategies to proactively identify and mitigate potential issues, ensuring uninterrupted service availability
- Drive automation initiatives to streamline deployment, configuration management, and infrastructure provisioning processes
- Develop and maintain comprehensive documentation for system configurations, processes, and procedures
- Participate in on-call rotations and respond to incidents, working diligently to resolve issues and prevent recurrence
Benefits
- A generous benefits package
- Professional growth
- Valuable time to recharge