Site Reliability Engineer at CMG

Summary

Join Capital Markets Gateway LLC (CMG), a financial technology firm connecting investors and underwriters, as a Site Reliability Engineer (SRE). You will play a key role in ensuring the reliability, performance, and scalability of our infrastructure and applications. Key responsibilities include designing, implementing, and maintaining monitoring and observability solutions with tools such as Prometheus, the Grafana stack, Datadog, and OpenTelemetry; defining and implementing SLOs and SLIs; developing dashboards and alerts; and optimizing system performance. You will also collaborate with cross-functional teams and contribute to automation and tooling efforts. This is a 2+ year contract position offering flexible working hours, a top-of-the-line MacBook, tech courses and conferences, and 15 vacation days. The role is open to candidates based in Latin America and requires proven SRE experience, proficiency with observability, cloud, and container tooling, and excellent communication skills.

Requirements

Must be based in Latin America
English level - C1 or C2
Proven experience as a Site Reliability Engineer or similar role
Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry)
Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform)
Strong programming and scripting skills (Python, Bash)
Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes)
Understanding of Linux-based systems, networking, and security principles related to containerized applications
Strong problem-solving and troubleshooting skills, with a passion for identifying and resolving complex technical issues
Excellent communication and collaboration abilities
Ability to thrive in a fast-paced, constantly evolving environment

Responsibilities

Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, the Grafana stack (Loki, Grafana, Tempo, Alertmanager), Datadog, and OpenTelemetry
Define and implement SLOs, SLIs, and error budgets to measure system reliability
Develop and optimize dashboards, alerts, and reports for system performance and business metrics
Design actionable alerting strategies to minimize noise and ensure meaningful notifications
Integrate alerting systems with Jira
Establish and refine runbooks for on-call teams to handle alerts efficiently
Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency and scalability
Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads
Identify opportunities for automation and develop tools to streamline operational processes, such as deployment, configuration management, and monitoring
Build monitoring and alerting into automated workflows so that issues are detected and resolved proactively
Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions
Communicate system changes, incidents, and improvements clearly to stakeholders

Preferred Qualifications

Experience with PostgreSQL monitoring and optimization

Benefits

2+ year contract
15 days of vacation
Tech courses and conferences
Top-of-the-line MacBook
Flexible working hours

Remote

DevOps

Mid-level
