Site Reliability Engineer

RT2
Summary
Join RT² as a Site Reliability Engineer and play a key role in maintaining and improving the reliability of our systems. You will enhance system stability, optimize performance, and automate deployment processes. This position requires experience with infrastructure tools such as Terraform, Bicep, and Ansible across on-premises and cloud (Azure) environments. You will work with various teams to identify and mitigate potential system failures, participate in capacity planning, and create automation to improve deployment speed and incident response. The ideal candidate is a proactive problem-solver who is passionate about infrastructure and continuous improvement. This is an exciting opportunity to make a meaningful impact at a rapidly growing company.
Requirements
- Experience with server operating systems such as Windows, Unix, and Linux
- Experience with monitoring tools such as the ELK stack, Grafana, and Azure Application Insights
- Experience with Git or other distributed source control systems
- Bachelor's degree (or equivalent) in computer science or a related discipline
- Experience with infrastructure tools such as Terraform, Bicep, and Ansible
- Experience with both on-premises and cloud environments, preferably Azure
- Experience with Hyper-V and VMware
- Experience with CI/CD tools such as Azure Pipelines, GitHub Actions, and Octopus Deploy
- Experience with scripting languages such as PowerShell, Python, and Bash
- Proactive approach to identifying problems, performance bottlenecks, and areas for improvement
- Experience with observability and alerting tools such as Grafana, UptimeRobot, ELK, and PagerDuty
- Experience working with Agile methodologies
Responsibilities
- Help maintain and enhance production monitoring and notifications
- Improve reliability and quality of production systems
- Measure and help optimize system performance
- Work with delivery and other teams to identify potential points of failure, then help improve systems to mitigate them
- Participate in capacity planning
- Create automation to improve deployment speed, testing, and response to operational issues
- Work to meet service level objectives
- Help build runbooks, tools, and other supporting resources to improve incident response
- Monitor production systems and help manage incident response
- Participate in postmortems and document outages, recovery steps, and future mitigation strategies
- Work on both on-premises (data center) and cloud-based infrastructure (Azure)
Benefits
- Remote, flexible working options
- Competitive compensation
- Generous short-term incentive (STI) and long-term incentive (LTI) provisions
- Health, Dental and Vision Insurance
- Paid Annual Leave
- Paid Sick Leave
- 401(k), and more