Site Reliability Engineer

RT2

💵 $100k-$130k
📍 Remote - Worldwide

Job highlights

Summary

Join RT², a company offering flexible, cutting-edge retail management solutions, as a Site Reliability Engineer. This key role focuses on maintaining and improving the reliability of our systems. You will enhance system stability, optimize performance, and automate deployment processes. Experience with infrastructure tools like Terraform, Bicep, and Ansible in both on-premises and cloud (Azure) environments is essential. This position offers a chance to make a significant impact in a dynamic, team-oriented environment. The role involves working with various technologies and methodologies to ensure system reliability and efficiency. Competitive compensation and benefits are offered.

Requirements

  • Experience working with server operating systems such as Windows, Unix, and Linux
  • Experience with monitoring tools such as the ELK stack, Grafana, and Azure Application Insights
  • Experience with Git or other distributed source control systems
  • Bachelor’s degree (or equivalent) in computer science or related discipline
  • Experience with infrastructure tools such as Terraform, Bicep, and Ansible
  • Experience with both on-premises and cloud environments, preferably Azure
  • Experience with Hyper-V and VMware
  • Experience with CI/CD pipelines like Azure Pipelines, GitHub Actions, and Octopus Deploy
  • Experience with scripting languages like PowerShell, Python and Bash
  • Proactive approach to identifying problems, performance bottlenecks, and areas for improvement
  • Experience with observability tools like Grafana, UptimeRobot, ELK, and PagerDuty
  • Experience working with Agile methodologies

Responsibilities

  • Help maintain and enhance production monitoring and notifications
  • Improve reliability and quality of production systems
  • Measure and help optimize system performance
  • Work with delivery and other teams to identify potential points of failure, then help improve systems to mitigate them
  • Participate in capacity planning
  • Create automation to speed up deployment, testing, and response to operational issues
  • Work to meet service level objectives
  • Help build runbooks and other supporting tools to improve incident response
  • Monitor production systems and help manage incident response
  • Participate in postmortems and document outages, recovery steps, and future mitigation strategies
  • Work on both on-premises (data center) and cloud-based infrastructure (Azure)

Benefits

  • Remote, flexible working options
  • Competitive compensation
  • Generous STI and LTI provisions
  • Health, Dental and Vision Insurance
  • Paid Annual Leave
  • Paid Sick Leave
  • 401K, and more
