Lead Site Reliability Engineer

Sprinto
Summary
Join Sprinto as a Lead Site Reliability Engineer and take ownership of the observability pipeline, CI/CD pipeline development, and full infrastructure management. Ensure high availability, scalability, and reliable product delivery by collaborating with application engineers to develop necessary tooling for efficient operations. Establish and maintain on-call protocols and incident response processes. This role requires expertise in IaC tools, APM tools, application capacity planning, and incident response. Strong problem-solving and communication skills are essential. Familiarity with Sprinto's tech stack (Node.js, React, Apollo GraphQL, PostgreSQL, and AWS) is a plus. Sprinto offers a remote-first policy, flexible hours, group medical insurance, accident cover, a company-sponsored device, and education reimbursement.
Requirements
- Proficiency with Infrastructure as Code (IaC) tools such as Terraform and Ansible
- Skilled in using Application Performance Monitoring tools, setting up on-call practices, identifying bottlenecks across the stack, and collaborating with teams to address these issues effectively
- Proven experience in application capacity planning, owning incident response workflows, conducting Root Cause Analyses (RCAs), and maintaining runbooks
- Strong problem-solving abilities and excellent communication skills, both spoken and written
Responsibilities
- Take ownership of the observability pipeline to ensure high availability and optimal performance of applications
- Design, build, and maintain the Continuous Integration/Continuous Deployment (CI/CD) pipelines to facilitate smooth and reliable product deliveries
- Own the complete infrastructure stack of the product, contributing to scalability and enhancements of the overall offering
- Work closely with application engineers to develop and refine tooling necessary for efficient operations management
- Establish and maintain on-call protocols and incident response processes to ensure timely resolution of issues and maintain service reliability
Preferred Qualifications
Familiarity with our current tech stack is a plus, as it will enable you to start contributing sooner. Our tech stack includes Node.js, React, Apollo GraphQL, PostgreSQL, and AWS.
Benefits
- Remote-First Policy
- 5-Day Work Week with Flexible Hours
- Group Medical Insurance (Parents, Spouse, Children)
- Group Accident Cover
- Company-Sponsored Device
- Education Reimbursement Policy