Senior Manager, Site Reliability Engineering

SEON
Summary
Join SEON's Site Reliability Engineering (SRE) team as a highly experienced and motivated SRE Manager to lead a team of Site Reliability Engineers. You will play a crucial role in maintaining the reliability and efficiency of our products and services while coordinating with cross-functional teams across various geographical regions. The role is flexible: it can be based in Budapest with a hybrid schedule or remotely in the European Union with occasional travel. You will lead and grow a high-performing SRE team, own incident management, drive implementation of SLAs and SLOs, champion automation, collaborate with engineering teams, and oversee system monitoring. You will also manage on-call rotations, drive continuous improvement, ensure compliance, provide mentorship, and communicate effectively with stakeholders.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- Proven success in leading high-performing SRE or DevOps teams in a large-scale, fast-paced environment
- Extensive experience running high-availability web services at a large scale, with comprehensive knowledge of cloud-native architectures and advanced networking concepts
- Strategic vision to balance immediate operational needs with long-term reliability and scalability objectives
- Outstanding communication and interpersonal skills, with the ability to build strong relationships with team members and stakeholders
- Strong technical background with hands-on experience in cloud computing, system architecture, automation, and monitoring
- Excellent problem-solving skills with a focus on root cause analysis and proactive improvements
- Exceptional organizational skills, with the ability to manage multiple priorities and projects simultaneously
- Experience with tools and technologies such as AWS, Kubernetes, Terraform, Prometheus, Grafana, Jenkins, and similar
Responsibilities
- Lead and grow a high-performing SRE team responsible for the reliability, performance, and scalability of production systems
- Own the incident management process, postmortems, and root cause analysis to improve system resilience
- Drive implementation of SLAs, SLOs, and error budgets across services to align operational goals with business objectives
- Champion the use of automation to reduce manual work and improve deployment and recovery times
- Collaborate with software engineering and platform engineering teams to ensure systems are designed for reliability and operational efficiency
- Oversee system monitoring, alerting, and observability efforts using tools like Prometheus, Grafana, Datadog, or similar
- Manage on-call rotations and ensure that documentation, runbooks, and playbooks are properly maintained
- Identify and drive continuous improvement in system architecture, capacity planning, and deployment strategies
- Ensure compliance with security, privacy, and regulatory requirements within the infrastructure
- Provide mentorship, performance reviews, and career development opportunities for SRE team members
- Communicate effectively with stakeholders at all levels, providing updates on team performance, project status, and incident resolutions
- Advocate for the SRE team within the broader organization, representing its needs and concerns
Preferred Qualifications
- Cloud Architect Certification in one of the public clouds (AWS, GCP, Azure)
- Good knowledge of security controls for SOC 2 and ISO certifications
Benefits
This role offers flexibility. It can be based in Budapest with a hybrid schedule or anywhere in the European Union with a remote setup, including occasional travel to our other offices.