Site Reliability Engineer

RT2

๐Ÿ“Remote - Worldwide

Summary

Join RT² as a Site Reliability Engineer and play a key role in maintaining and improving the reliability of our systems. You will enhance system stability, optimize performance, and automate deployment processes. This position requires experience with infrastructure tools such as Terraform, Bicep, and Ansible across on-premises and cloud (Azure) environments. You will work with various teams to identify and mitigate potential system failures, participate in capacity planning, and create automation to improve deployment speed and incident response. The ideal candidate is a proactive problem-solver who is passionate about infrastructure and continuous improvement. This is an exciting opportunity to make a meaningful impact at a rapidly growing company.

Requirements

  • Experience working with server operating systems such as Windows, Unix, and Linux
  • Experience with monitoring tools such as the ELK stack, Grafana, and Azure Application Insights
  • Experience with Git or other distributed source control systems
  • Bachelor’s degree (or equivalent) in computer science or a related discipline
  • Experience with tools such as Terraform, Bicep, and Ansible
  • Experience with both on-premises and cloud providers, preferably Azure
  • Experience with Hyper-V and VMware
  • Experience with CI/CD pipelines such as Azure Pipelines, GitHub Actions, and Octopus Deploy
  • Experience with scripting languages such as PowerShell, Python, and Bash
  • Proactive approach to identifying problems, performance bottlenecks, and areas for improvement
  • Experience with observability and alerting tools such as Grafana, UptimeRobot, ELK, and PagerDuty
  • Experience working with Agile methodologies

Responsibilities

  • Help maintain and enhance production monitoring and notifications
  • Improve reliability and quality of production systems
  • Measure and help optimize system performance
  • Work with delivery and other teams to identify potential points of failure, then help improve systems to mitigate them
  • Participate in capacity planning
  • Create automation to improve deployment speed, testing, and response to operational issues
  • Work to meet service level objectives
  • Help build runbooks and other supporting tools to improve incident response
  • Monitor production systems and help manage incident response
  • Participate in postmortems and document outages, recovery steps, and future mitigation strategies
  • Work on both on-premises (data center) and cloud-based infrastructure (Azure)

Benefits

  • Remote, flexible working options
  • Competitive compensation
  • Generous STI and LTI provisions
  • Health, dental, and vision insurance
  • Paid Annual Leave
  • Paid Sick Leave
  • 401(k), and more
