Remote Senior Site Reliability Engineer
Tyk
Remote - Canada
Job highlights
Summary
Join Tyk, a company on a mission to connect every system in the world, and work as a Senior Site Reliability Engineer (SRE) to optimize, automate, and improve performance. The role involves leading the SRE team, collaborating with the Principal SRE, and ensuring SLA compliance for the cloud environment through proactive monitoring.
Requirements
- Proven experience in a senior SRE role or similar
- Strong knowledge of cloud technologies and SLA, SLO, and SLI management
- Experience leading teams and implementing SCRUM processes
- Excellent communication and leadership skills
- Experience line managing, mentoring and coaching
- Ability to analyze and improve operational processes and performance metrics
- Experience in software design, automation, and root cause analysis
- On-call support experience and customer-focused mindset
- Collaborative attitude with commercial and technical teams
- Launching and operating production Kubernetes clusters
- Designing and operating infrastructure on AWS and other providers
- Operating MongoDB (or other document database) clusters
- Operating Redis (or other key-value storage) clusters
- Administering Linux servers
- Maintaining distributed software
- Operating Prometheus and Grafana
- Operating log collection and analysis systems
Responsibilities
- Collaborate with the Principal SRE to shape and implement the SRE strategic plan
- Lead the SRE team in translating strategy into actionable plans, coordinating these through the SCRUM process
- Address wellbeing and performance concerns, fostering a positive and productive team environment
- Work with the Principal SRE and Scrum Master to analyse wellbeing survey outcomes and develop improvement plans
- Champion operational communication, ensuring high-quality and timely updates on team progress
- Ensure SLA compliance for our cloud environment through proactive monitoring
- Develop and oversee the roadmap for proactive alerting and monitoring
- Define and track key performance metrics for cloud services, driving continuous improvement
- Design and implement solutions to maintain and enhance KPIs
- Lead performance tuning and fault finding by analysing metrics from operating systems and applications
- Optimise system and infrastructure performance, focusing on innovation and anticipating customer needs
- Engage with commercial teams to understand growth plans and develop corresponding SRE strategies
- Direct the analysis of cloud infrastructure, focusing on automation, scalability, and management
- Align with the Principal SRE on automation strategies for cloud-operations tasks
- Model excellence in software design and automation to enhance Tyk Cloud services, creating runbooks and sharing knowledge
- Conduct blame-free root cause analysis postmortems, reporting findings and recommendations
- Document operational processes and policies, ensuring replicability and adherence
- Provide on-call support, ensuring effective response and resolution in line with SLAs
- Plan and execute software upgrades to optimise cloud services
- Assist commercial teams with data requests and account management
- Champion and adhere to SCRUM methodologies within the SRE team
Benefits
- Everyone has unlimited paid holidays
- We have total flexibility in hours, as we believe creativity flows better when our people are given the freedom to decide when they are most productive. Everyone is unique, after all
- Employee share scheme
- Generous maternity and paternity leave
- Volunteering Days
- Company retreats
- Employee Wellbeing platform
This job is filled or no longer available