Site Reliability Engineer at Masabi

Summary

Join Masabi's Site Reliability Engineering team and be at the forefront of ensuring our platform's reliability, performance, and security. As an SRE, you will drive automation, refine processes, ensure infrastructure security and compliance, maintain monitoring systems, optimize cloud costs, implement disaster recovery, manage incidents, collaborate with developers, and maintain documentation. This pivotal role shapes the future of our infrastructure and offers the chance to work with a dynamic team in a rapidly evolving industry. The platform is JVM-based and cloud-native, hosted on AWS. We use standard tooling, including GitLab, Terraform, CloudFormation, Puppet, Kibana, Grafana, and Confluent Cloud. Masabi offers a supportive and inclusive work environment.

Requirements

Significant experience in SRE or related roles, with a proven track record in building and maintaining reliable systems
Expertise in AWS Cloud technologies
Hands-on experience with Terraform and Grafana, along with strong knowledge of security principles and networking components
Hands-on experience with EKS and ECS is essential
Experience in building pipelines and robust CI/CD infrastructure
A collaborative team player who approaches projects with an open mind and prioritizes security
Passionate about leveraging technology to drive advancements while ensuring reliability and security
Excellent communication skills, a collaborative mindset, and a willingness to learn and contribute to team success
Self-sufficient and capable of working independently, while also knowing when to seek support or input

Responsibilities

Drive automation to reduce operational overhead and human error. Build CI/CD pipelines, develop Infrastructure as Code (IaC) using tools like Terraform and CloudFormation, and design scalable systems to handle high traffic while optimizing resource utilization
Refine processes, tools, and workflows to enhance system reliability, scalability, and efficiency. Plan capacity to anticipate future needs and support high-performance systems
Ensure infrastructure meets organizational security standards and supports compliance frameworks like SOC 2 and PCI
Maintain real-time monitoring systems aligned with SLIs and SLOs, ensuring uptime and performance meet or exceed SLAs. Set up proactive alerting mechanisms to address issues before they escalate
Monitor and optimize cloud infrastructure costs through autoscaling, rightsizing, and architectural reviews to balance cost-effectiveness with reliability
Implement failover strategies, disaster recovery plans, and redundancy to ensure system resilience under all conditions
Respond to production incidents, minimize downtime, and restore availability. Perform root cause analysis, implement preventive measures, and contribute to post-incident reviews to share lessons learned
Partner with developers to design reliable, maintainable systems. Coach teams on best practices for reliability, scalability, and observability, fostering a culture of ownership
Maintain detailed documentation for infrastructure, incident response, and workflows. Develop playbooks and runbooks to ensure seamless knowledge transfer

Preferred Qualifications

Familiarity with PCI DSS v4 compliance requirements is a plus
AWS Cloud certification

Remote

DevOps

Mid-level
