techruiter. is hiring a Site Reliability Engineer (SRE) in United Kingdom

Site Reliability Engineer (SRE)
🏢 techruiter.
💵 ~$117k-$210k
📍United Kingdom
📅 Posted on Jun 11, 2024

Summary

This role is for a Site Reliability Engineer who will maintain the stability and efficiency of LLM and Machine Learning platforms. The work includes collaborating with cross-functional teams, designing and automating infrastructure, managing deployment pipelines, implementing monitoring systems, leading incident response, performing capacity planning, ensuring security and compliance, continuously improving system reliability, and maintaining documentation.

Requirements

  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field
  • Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure
  • Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes)
  • Experience with configuration management and infrastructure-as-code tools (e.g., Ansible, Terraform) and CI/CD pipelines
  • Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack)
  • Scripting and automation skills (e.g., Python, Bash)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills

Responsibilities

  • Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads
  • Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services
  • Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues
  • Incident Response: Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence
  • Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency
  • Security and Compliance: Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems
  • Continuous Improvement: Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimization