Site Reliability Engineer (SRE) at techruiter.

Summary

The Site Reliability Engineer maintains the stability and efficiency of LLM and Machine Learning platforms. The role involves collaborating with cross-functional teams, designing and automating infrastructure, managing deployment pipelines, implementing monitoring systems, leading incident response, performing capacity planning, ensuring security and compliance, continuously improving system reliability, and maintaining documentation.

Requirements

Bachelor's or Master's degree in Computer Science, Information Technology, or a related field
Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure
Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes)
Experience with configuration management and infrastructure-as-code tools (e.g., Ansible, Terraform) and with CI/CD pipelines
Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack)
Scripting and automation skills (e.g., Python, Bash)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills

Responsibilities

Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads
Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services
Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues
Incident Response: Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence
Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency
Security and Compliance: Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems
Continuous Improvement: Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimization

Remote

DevOps

Mid-level
