L3 Cloud DevOps Engineer/Site Reliability Engineer (SRE)
NTD Software
💵 $80k-$120k
📍Remote - Worldwide
Summary
Join our team as an experienced L3 Cloud DevOps Engineer with a strong focus on Site Reliability Engineering (SRE). You will create and enhance monitoring and alerting tools using Grafana, Prometheus, and Datadog.
Requirements
- Extensive hands-on experience with Python scripting
- Strong expertise in Site Reliability Engineering (SRE) practices
- Proficiency in Grafana, including dashboard creation and modification
- In-depth knowledge of Prometheus and Datadog for monitoring and alerting
- Experience with user and system monitoring, along with the ability to create and enhance dashboards and runbooks
- DevOps experience is a secondary but desirable skill set
- Relevant certifications or courses in Python, SRE, Grafana, and Prometheus are a plus
Responsibilities
- Proactively build and enhance Grafana dashboards to improve monitoring capabilities
- Collaborate with cross-functional teams to ensure effective monitoring and alerting
- Manage and respond to alerts, focusing on timely remediation and implementation of solutions for service issues
- Conduct user and system monitoring to identify and address potential problems
- Develop and maintain runbooks to support operational efficiency and incident response
- Use Python scripting to automate and improve DevOps and SRE processes
This job is filled or no longer available.