Senior Site Reliability Engineer

Granicus

📍 Remote - India

Summary

Join Granicus as a Senior Site Reliability Engineer (SRE) and contribute to the reliability, scalability, and performance of our services. You will lead efforts to build and maintain robust infrastructure and guide the team in implementing site reliability best practices. The role covers on-call production support, system monitoring and maintenance, process automation, incident management, system improvements, collaboration with software engineers, documentation, capacity planning, and security implementation.

Requirements

  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field, or equivalent practical experience
  • 5+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems
  • Expertise in Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud)
  • Proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++)
  • Advanced knowledge of monitoring and logging tools (Prometheus, Grafana, Splunk), configuration management (Ansible, Chef, Puppet), and CI/CD pipelines
  • Strong analytical and problem-solving skills with the ability to diagnose and resolve complex issues efficiently
  • Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders
  • Demonstrated ability to lead and mentor a team, drive projects to completion, and manage cross-functional initiatives
  • Responsible for Granicus information security by appropriately preserving the Confidentiality, Integrity, and Availability (CIA) of Granicus information assets in accordance with the company's information security program
  • Responsible for protecting the privacy of our employees, our customers, and their data, and for completing all required privacy training in a timely manner, in accordance with company policies
  • Flexibility in working hours to cover any required overlap and attend team meetings as needed
  • Participation in a 24/7 on-call rotation, including weekends (typically one week every month)

Responsibilities

  • On-call Production Support: Provide production support in shifts, following the team's on-call roster
  • When not on call for production support, work on tickets raised by customers and by internal engineering/implementation teams. For example, a client may ask for a data correction on the database server that cannot be made through the web interface
  • Work on SRE backlog items
  • Monitor and Maintain Systems: Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability
  • Automate Processes: Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention
  • Incident Management: Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence
  • System Improvements: Participate in designing and implementing system improvements to enhance reliability, scalability, and performance
  • Collaboration: Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes
  • Documentation: Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team
  • Capacity Planning: Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth
  • Security: Implement and adhere to security best practices to protect our systems and data

Preferred Qualifications

  • 5+ years of experience in an SRE, DevOps, or Software Engineering role
  • Relevant certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or similar
  • In-depth understanding of containerization (Docker, Kubernetes) and infrastructure as code (Terraform, CloudFormation)
  • Experience with database management (SQL, NoSQL), load balancing, and distributed systems

Benefits

  • We are a remote-first company with a globally distributed workforce across the United States, Canada, United Kingdom, India, Armenia, Australia, and New Zealand
  • At Granicus, we are building a transparent, inclusive, and safe space for everyone who wants to be a part of our journey
  • Employee Resource Groups to encourage diverse voices
  • Coffee with Mark sessions: our employees get to talk with our CEO about important and sometimes difficult issues, ranging from mental health to work-life balance and current affairs
  • Microsoft Teams communities focused on wellness, art, furbabies, family, parenting, and more
  • We bring in special guests from time to time to discuss issues that impact our employee population
