Summary

Join our team as a Senior Site Reliability Engineer to develop and maintain advanced observability solutions, focusing on blackbox and whitebox monitoring, synthetic tests, and platform reliability across on-premise and GCP environments.

Requirements

Bachelor's degree in Computer Science, Engineering, or equivalent experience
3+ years of experience in DevOps and Site Reliability Engineering, with a focus on automation, infrastructure as code, and continuous integration/continuous deployment (CI/CD) practices
3+ years of experience in programming, with a strong focus on Golang development
3+ years of experience with APM and monitoring tools such as Dynatrace, Prometheus, ELK, Splunk, or similar
Proficiency in Google Cloud Platform (GCP) and experience with on-premise environments, particularly with application deployment and management on OpenShift
Experience with container orchestration technologies like Kubernetes (K8s) and OpenShift
Experience with CI/CD deployment pipelines, ensuring automated and reliable deployment processes
Demonstrable experience in designing and deploying scalable and resilient systems, with an understanding of cloud-native principles
Extensive experience in implementing both blackbox and whitebox monitoring solutions, with a focus on SLOs and anomaly detection

Responsibilities

Lead efforts in blackbox monitoring, including the development and enhancement of the Health Mesh product
Implement and manage synthetic tests that monitor critical platform services, providing early detection of incidents
Utilize Prometheus for blackbox monitoring and develop simple Go APIs to support these activities
Implement whitebox monitoring strategies with a focus on Service Level Objectives (SLOs) for core Google Cloud Platform (GCP) services and applications on OpenShift
Ensure that both platform operators and customers have clear visibility into the system's performance and health
Develop and refine anomaly detection mechanisms using the same metrics applied in whitebox monitoring
Leverage tools such as Prometheus and Dynatrace to identify and address potential issues before they escalate, contributing to overall platform stability
Create tools and processes that help operators distinguish between platform-level incidents and individual user errors
Enhance the observability of API gateways and other critical infrastructure components
Maintain and improve observability tools that support both on-premise and cloud environments, ensuring seamless operation across different infrastructure setups
Support on-premise applications, using technologies such as OpenShift to manage on-premise environments
Collaborate with various teams to ensure effective incident management and response
Focus on separating platform incidents from issues specific to individual application teams, providing clear communication and resolution strategies

Preferred Qualifications

Knowledge of both Debian and Ubuntu environments
Experience with Jenkins, Terraform, Datadog, K6, or similar technologies
Understanding of web protocols and technologies such as HTTP, TLS, REST, Nginx, and API gateways

Gorilla Logic is hiring a Site Reliability Engineer

Remote

DevOps

Senior
