Summary
Join Nextiva as a Senior Site Reliability Engineer (SRE) in Bangalore and redefine customer experiences. You will support and scale our Kafka and Elasticsearch infrastructure, the core systems powering our SaaS platform. This role demands automation expertise, AI-driven observability, and quick adoption of new technologies. You will proactively build resilient systems, own them end-to-end, and write clean automation within a fast-paced, innovative team. You will also mentor junior engineers and lead large-scale reliability projects. The position requires a Bachelor's degree or equivalent practical experience, 6+ years of relevant experience, and strong Linux and cloud platform skills.
Requirements
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- Fluent English communication skills (spoken and written)
- 6+ years of experience in software development, automation, or infrastructure engineering
- Deep experience with Kafka and/or Elasticsearch in production environments
- Strong Linux systems expertise and 6+ years managing Linux-based environments
- Hands-on experience with cloud platforms - GCP and/or AWS required
- Proficient in scripting languages such as Python and Bash
- Automation-first mindset - deep experience with Ansible, Terraform, Jenkins
- Expert-level understanding of Git and GitHub workflows for CI/CD and infrastructure-as-code
- Proficient with container tools (Docker) and orchestrators (Kubernetes)
- Strong understanding of SRE principles - SLAs/SLOs, alerting, observability, and incident management
- Experience with SQL, caching systems (e.g., Redis), and troubleshooting distributed systems
- Quick learner with a strong curiosity for new tools, frameworks, and AI/ML use cases in operations
Responsibilities
- Triage, troubleshoot, and resolve complex production issues involving Kafka and Elasticsearch
- Design and build automated monitoring, alerting, and logging systems - leveraging AI/ML techniques where possible
- Write tools and infrastructure software to support self-healing, auto-scaling, and incident prevention
- Automate system administration tasks - from patching and upgrades to config and deployment workflows
- Use and manage GitHub extensively for infrastructure-as-code, release management, and collaboration
- Partner with development, QA, and performance teams to ensure middleware systems are production-ready
- Participate in the on-call rotation and continuously improve incident response and resolution playbooks
- Mentor junior engineers and contribute to a culture of automation, learning, and accountability
- Lead large-scale reliability and observability projects in collaboration with global teams
Preferred Qualifications
- Observability Tools: Datadog, Splunk, Kibana, Opsgenie
- Programming: Java/Spring, JavaScript/React
- Middleware: RabbitMQ, Tomcat
- Experience with AI/ML-based anomaly detection, AIOps platforms, and LLM integrations for infrastructure
- Azure cloud experience (nice to have)
Benefits
- Health - Comprehensive medical coverage, including dental care
- Insurance - Insurance covering life and disability
- Work-Life Balance - PTO and paid sick time as per CBA, paid parental leave
- Financial Security - Private pension plan available
- Wellness - Employee Assistance Program and comprehensive wellness initiatives
- Growth - Access to ongoing learning and development opportunities and career advancement