Senior Site Reliability Engineer

Moniepoint

📍 Remote - Nigeria

Summary

Join Moniepoint, a rapidly growing financial services platform in Africa, as a Site Reliability Engineer (SRE). You will ensure the smooth and efficient operation of our systems, engineering solutions to enhance visibility, automate tasks, and boost system resilience. This role involves on-call responsibilities for detecting and resolving issues, acting as Incident Commander during major incidents, and conducting root cause analyses. You will also develop automation, maintain monitoring dashboards, participate in feature development, define SLIs/SLOs, and resolve escalated customer complaints. The ideal candidate balances real-time responsibilities with strategic engineering work for sustainable service reliability. Moniepoint offers a supportive culture, learning opportunities, and competitive compensation.

Requirements

  • Minimum of 3 years of experience supporting enterprise applications in an SRE or similar role
  • Knowledge of distributed systems, microservices architecture and software design patterns
  • Experience with cloud platforms such as AWS, GCP, or Azure
  • Strong knowledge of Kubernetes and other container orchestration tools
  • Experience using application performance monitoring tools, OpenTelemetry, and observability platforms such as New Relic, Datadog, ELK, or SigNoz
  • Excellent problem-solving and troubleshooting skills as an on-call engineer, with the ability to resolve complex infrastructure and application issues
  • Proficiency in setting up and maintaining monitoring dashboards and alerts using Grafana and Prometheus
  • Working knowledge of a scripting/programming language (e.g., Python, Bash)
  • Proficiency in SQL databases (e.g., MySQL), writing complex SQL queries against large datasets, and hands-on experience in database administration

Responsibilities

  • Participate in on-call rotations as the primary technical lead for detecting, triaging, and resolving service degradation, outages, or reliability issues across all environments
  • Act as the Incident Commander during major incidents: initiating war room or bridge calls, coordinating cross-functional teams, providing timely and clear status updates to all stakeholders, and leading and documenting blameless Root Cause Analyses (RCAs) that drive long-term fixes
  • Develop automation to eliminate manual, repetitive operational tasks (toil) across both applications and infrastructure, improving efficiency and system resilience
  • Create and maintain dashboards and alerts that track application and infrastructure health
  • Participate in feature development discussions to ensure services are built with observability from the ground up
  • Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in collaboration with Product and Engineering teams
  • Investigate and resolve customer complaints escalated beyond L1 and L2 support, especially those involving performance, reliability, or complex system behavior

Benefits

  • Attractive salary
  • Pension
  • Health insurance
  • Annual bonus
