L3 Cloud DevOps Engineer/Site Reliability Engineer (SRE)

NTD Software

💵 $80k-$120k
📍Remote - Worldwide

Summary

Join our team as an experienced L3 Cloud DevOps Engineer with a strong focus on Site Reliability Engineering (SRE). You will create and enhance monitoring and alerting tools using Grafana, Prometheus, and Datadog.

Requirements

  • Extensive hands-on experience with Python scripting
  • Strong expertise in Site Reliability Engineering (SRE) practices
  • Proficiency in Grafana, including dashboard creation and modification
  • In-depth knowledge of Prometheus and Datadog for monitoring and alerting
  • Experience with user and system monitoring, along with the ability to create and enhance dashboards and runbooks
  • DevOps experience is a secondary but desirable skill set
  • Relevant certifications or courses in Python, SRE, Grafana, and Prometheus are a plus

Responsibilities

  • Proactively build and enhance Grafana dashboards to improve monitoring capabilities
  • Collaborate with cross-functional teams to ensure effective monitoring and alerting
  • Manage and respond to alerts, focusing on timely remediation and implementation of solutions for service issues
  • Conduct user and system monitoring to identify and address potential problems
  • Develop and maintain runbooks to support operational efficiency and incident response
  • Use Python scripting to automate and improve DevOps and SRE processes

This job is filled or no longer available.