Site Reliability Engineer

GoodLeap

💵 $97k-$141k
📍 Remote - United States

Summary

Join GoodLeap as a Site Reliability Engineer (SRE) and ensure the reliability, scalability, and performance of our applications and services. This hybrid role blends software and systems engineering, focusing on automation, monitoring, and incident response to maintain high service availability. Collaborate with development and operations teams to implement best practices, reduce manual work, and improve system stability. Leverage DataDog for enhanced system visibility and observability, and develop comprehensive documentation to support operational excellence. Contribute to scaling initiatives and optimize system performance using data-driven insights. This position is crucial for supporting our DevOps initiatives and enhancing the overall health of our production environments.

Requirements

  • Solid understanding of the Software Development Lifecycle (SDLC), including source control, defect tracking, automated build systems, and production control processes
  • Strong knowledge of CI/CD and DevOps principles, tools, and integrations
  • Hands-on experience with Amazon Web Services (AWS), including DynamoDB, CloudFormation, CloudFront, S3, Route 53, and Lambda, as well as YAML-based configuration
  • Proficiency with containerization and serverless technologies
  • Experience with infrastructure as code and container orchestration tools, particularly Terraform and Kubernetes
  • Strong understanding of observability concepts, including tracing, structured logging, and metrics
  • Experience using application and infrastructure monitoring tools—specifically DataDog—to ensure system health and performance
  • Familiarity with designing and implementing self-healing, fault-tolerant, and autoscaling systems
  • Experience working with SQL and relational databases; familiarity with MongoDB Cloud Atlas is a plus
  • Proficiency with Git and source control workflows; understanding of change management best practices
  • Demonstrated problem-solving and analytical skills in fast-paced environments
  • Excellent verbal and written communication skills, with the ability to explain complex technical topics to both technical and non-technical stakeholders
  • Self-motivated with a strong sense of ownership, accountability, and follow-through

Responsibilities

  • Partner with engineering, DevOps, and product teams to understand system requirements, communicate reliability best practices, and embed a culture of shared ownership
  • Lead incident response efforts, facilitate root cause analysis, and drive continuous improvements post-incident
  • Identify opportunities to reduce manual work by building and maintaining internal tools and automation pipelines
  • Leverage DataDog to enhance system visibility, improve alerting strategies, and ensure observability across services
  • Develop and maintain documentation including runbooks, service readiness guides, and knowledge articles to support operational excellence
  • Collaborate with teams to support scaling initiatives and optimize system performance using data-informed insights
