Site Reliability Engineering Manager

Xebia Poland

πŸ“Remote - Worldwide

Job highlights

Summary

Join Xebia, a global leader in digital solutions, and become a key member of our Site Reliability Engineering (SRE) team. You will recruit, develop, and mentor the SRE team, setting goals and tracking achievements. Responsibilities include defining and implementing SRE best practices, delivering Terraform-based automation in Google Cloud, designing secure IAM roles, and collaborating with development and security teams. This role requires extensive experience in software development, distributed systems, and cloud computing, along with strong problem-solving and leadership skills. The ideal candidate will possess deep technical expertise in GCP and experience with IaC tooling and monitoring tools. Xebia offers a dynamic work environment focused on innovation and employee development.

Requirements

  • 8 years of experience with data structures or algorithms
  • 5 years of experience with software development in one or more programming languages
  • 3 years of experience managing people or teams, leading projects, and designing, analyzing, and troubleshooting distributed systems
  • Excellent problem-solving and analytical skills
  • Strong understanding of the software development lifecycle (SDLC) and DevOps principles
  • Deep technical expertise in cloud computing platforms (GCP preferred)
  • Proficiency with Infrastructure-as-Code (IaC) tooling, such as Terraform
  • Proven experience with monitoring tools (Prometheus, Datadog, New Relic)
  • Experience with automation frameworks (Ansible, Puppet, Chef)
  • Fluent in English (B2-C2)
  • Bachelor’s degree in Computer Science, a related field, or equivalent practical experience
  • Must work from within the European Union and hold a valid work permit

Responsibilities

  • Recruiting, developing, and mentoring the SRE team, including setting goals and tracking their achievement
  • Supporting engineers' skill development through coaching and clear expectation setting
  • Defining and implementing SRE best practices, standards, and processes, including Service Level Objectives (SLOs), to ensure service reliability and performance
  • Delivering Terraform-based automation in Google Cloud, including project creation, user management, and service enablement, while optimizing cloud costs
  • Designing secure IAM roles, permissions, and monitoring systems to enhance security, user experience, and proactive issue detection
  • Collaborating with development and security teams to ensure reliability, system security, and compliance, while proactively addressing potential issues
  • Prioritizing a customer-focused approach, delivering exceptional user experiences for infrastructure services with clear and effective communication
  • Analyzing system metrics to identify performance bottlenecks and opportunities for improvement, and implementing capacity planning strategies for resilience under high load
  • Continuously monitoring and optimizing system performance

Preferred Qualifications

  • Google Cloud, Azure, or Kubernetes certifications
