Site Reliability Engineer at Catchpoint

Summary

Join Catchpoint as a Site Reliability Engineer and support the systems running our global monitoring platform. You will collaborate with operations and development teams to build, automate, and monitor infrastructure at scale, ensuring a highly reliable system for our customers. This role demands an operational mindset and problem-solving skills on a global scale. You will analyze system telemetry, logs, and passive monitoring data to create automation for platform control, rollout, and maintenance. Success in this position requires expertise in infrastructure automation, cloud platforms, and incident resolution. A strong programming background and experience with various monitoring and deployment tools are essential.

Requirements

Strong experience administering application servers, web servers, and databases
Familiarity with infrastructure automation, configuration management, and CI/CD tools (preferably Terraform)
Experience with multiple cloud platforms (AWS, GCP, Azure)
Good networking knowledge and experience with internet architecture (BGP, peering, DNS)
2+ years of incident resolution experience in a large-scale operations environment
Hands-on experience with cloud deployment, monitoring, and ops analysis tools such as Prometheus, Elasticsearch, Grafana, Kibana, Splunk, Terraform, Jenkins, etc.
3+ years of programming experience with Python, Bash, PowerShell, C, etc.
Virtualization experience required
BS degree in Computer Science or related technical field involving coding or equivalent practical experience
Appreciation of the value of diversity of opinions

Responsibilities

Define and refine the whole service lifecycle, from inception and design through deployment, operation, and finally retirement
Assess live services by measuring and monitoring availability, latency, and overall system health. Establish performance baselines, and define actions and automations based on data correlated from multiple sources
Design, build, and maintain logging and telemetry systems that are used to manage all services
Design, code, test, and deliver software to automate manual operational work
Troubleshoot priority incidents, facilitate blameless post-mortems, and ensure permanent closure of incidents
Identify application patterns and analytics in support of better service level objectives
Deploy and maintain systems that run on multiple cloud providers (AWS, GCP, Azure, Alibaba, Tencent, Oracle, IBM) and physical systems around the world
Be part of an on-call rotation to support production systems

Remote

DevOps

Mid-level
