Site Reliability Engineering Manager
Xebia Poland
Job highlights
Summary
Join Xebia, a global leader in digital solutions, and become a key member of our Site Reliability Engineering (SRE) team. You will recruit, develop, and mentor the SRE team, setting goals and tracking achievements. Responsibilities include defining and implementing SRE best practices, delivering Terraform-based automation in Google Cloud, designing secure IAM roles, and collaborating with development and security teams. This role requires extensive experience in software development, distributed systems, and cloud computing, along with strong problem-solving and leadership skills. The ideal candidate will possess deep technical expertise in GCP and experience with IaC tooling and monitoring tools. Xebia offers a dynamic work environment focused on innovation and employee development.
Requirements
- 8 years of experience with data structures or algorithms
- 5 years of experience with software development in one or more programming languages
- 3 years of experience managing people or teams, leading projects, and designing, analyzing, and troubleshooting distributed systems
- Excellent problem-solving and analytical skills
- Strong understanding of software development lifecycle (SDLC) and DevOps principles
- Deep technical expertise in cloud computing platforms (GCP preferred)
- Proficiency with Infrastructure as Code (IaC) tooling, such as Terraform
- Proven experience with monitoring tools (Prometheus, Datadog, New Relic)
- Experience with automation frameworks (Ansible, Puppet, Chef)
- Fluent in English (B2-C2)
- Bachelor's degree in Computer Science, a related field, or equivalent practical experience
- Must work from within the European Union and hold a work permit
Responsibilities
- Recruiting, developing, and mentoring the SRE team, including setting goals and tracking their achievement
- Supporting engineers' skill development through coaching and clear expectation setting
- Defining and implementing SRE best practices, standards and processes, including Service Level Objectives (SLOs), to ensure service reliability and performance
- Delivering Terraform-based automation in Google Cloud, including project creation, user management, and service enablement, while optimizing cloud costs
- Designing secure IAM roles, permissions, and monitoring systems to enhance security, user experience, and proactive issue detection
- Collaborating with development and security teams to ensure reliability, system security, and compliance, while proactively addressing potential issues
- Prioritizing a customer-focused approach, delivering exceptional user experiences for infrastructure services with clear and effective communication
- Analyzing system metrics to identify performance bottlenecks and opportunities for improvement, and implementing capacity planning strategies for resilience under high load
- Continuously monitoring and optimizing system performance
Preferred Qualifications
- Google Cloud, Azure, or Kubernetes certifications