StarCompliance is hiring a
Site Reliability Engineer, Remote - United States


Site Reliability Engineer closed

🏢 StarCompliance

💵 $100k-$150k
📍 United States

Summary

The Site Reliability Engineer will maintain and improve the platform's reliability, availability, and performance, with Azure as the core cloud platform. Key responsibilities include analyzing reliability challenges, working with cross-functional teams, identifying and addressing toil, conducting post-mortems, and driving the reliability and supportability of cloud services.

Requirements

  • 4+ years of experience in reliability engineering
  • 2+ years of recent experience with Azure systems
  • Advanced knowledge of the New Relic ecosystem
  • Working knowledge of monitoring and APM tools such as Azure App Insights, Grafana, and Selenium
  • Knowledge of networking and troubleshooting latency, connectivity, and performance
  • Experience with infrastructure as code (IaC) using Terraform and configuration as code (CaC) using Ansible
  • Familiarity with one or more databases: SQL Server, MongoDB, or PostgreSQL
  • Hands-on experience with SRE practices, including writing and running chaos engineering experiments
  • Proficient in Linux and Windows administration, troubleshooting, and support
  • Experience with Azure DevOps
  • Excellent debugging skills across a variety of integrated platforms

Responsibilities

  • Analyze reliability challenges and develop automated solutions for incident resolution
  • Work with development teams to improve applications' operational features for faster MTTD and MTTR and for auto-recovery
  • Lead the establishment of SLIs, SLOs, error budgets, and policies, and work with the respective engineers to instrument and visualize services so that peer engineers and developers gain greater insight into operational performance (observability)
  • Identify, track, and address toil
  • Conduct post-mortems
  • Identify and implement continuous improvement in various facets of production operations
  • Offer advanced technical support for cross-product issues and incidents
  • Leverage SRE tooling to develop, implement, and deliver on the SRE mission
  • Conduct chaos testing
  • Identify, define, and implement new tools and technologies to improve the quality and efficiency of distributed platforms
  • Drive the reliability and supportability of cloud services, including change management, triage of customer escalations, remediation plans, playbooks, and automation
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
  • Engage in and improve the whole lifecycle of services from inception and design through deployment, operation, and refinement

Preferred Qualifications

  • Experience with C# and .NET, plus PowerShell, Python, or Golang
  • Experience with containerization
  • Experience in High Availability and distributed systems

Benefits

StarCompliance Background Checks

This job is filled or no longer available
