Summary

Join Granicus as a Senior Site Reliability Engineering Manager and lead a team ensuring the reliability, scalability, and performance of our services. You will build and maintain robust infrastructure, automate processes, and implement best practices in site reliability. Responsibilities include on-call production support, managing tickets, working on SRE backlog items, monitoring systems, automating processes, incident management, system improvements, collaboration with software engineers, documentation, capacity planning, and implementing security best practices. The ideal candidate possesses strong technical skills in Linux/Unix systems, networking, cloud services, scripting and programming languages, and experience with various tools and technologies. A Bachelor's or Master's degree in a related field or equivalent experience, along with 5+ years of experience in SRE and 5+ years as a people manager, is required. Additional preferred qualifications include experience with AI tools, anomaly detection tools, containerization, and database management.

Requirements

Good understanding of Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud)
Experience with scripting languages such as Python, Bash, or Ruby
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience
5+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems
5+ years of experience as a people manager
Production experience with the Elastic Stack or a similar observability tool (mandatory)
Availability for 24/7 on-call duty, including weekends (typically one week every month)

Responsibilities

Manage a team of engineers providing production support in shifts according to the team on-call roster
Work on tickets raised by customers and by internal engineering/implementation teams while not on call for production support. For example, a client may request a data correction on the database server that cannot be made through the web interface
Work on SRE backlog items
Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability
Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention
Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence
Participate in the design and implementation of system improvements to enhance reliability, scalability, and performance
Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes
Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team
Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth
Implement and adhere to security best practices to protect our systems and data
Uphold Granicus information security by appropriately preserving the Confidentiality, Integrity, and Availability (CIA) of Granicus information assets in accordance with the company's information security program
Protect the privacy of our employees, our customers, and their data, and complete all required privacy training in a timely manner, in accordance with company policies

Preferred Qualifications

Expertise in Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud)
Proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++)
Advanced knowledge of monitoring and logging tools (Prometheus, Grafana, Splunk), configuration management (Ansible, Chef, Puppet), and CI/CD pipelines
Strong analytical and problem-solving skills with the ability to diagnose and resolve complex issues efficiently
Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders
Demonstrated ability to lead and mentor a team, drive projects to completion, and manage cross-functional initiatives
Exposure to AI tools for code/script/agent development
Experience with AI-based anomaly detection tools built into industry-standard observability tools such as Datadog, New Relic, or Elastic Stack
8+ years of experience in an SRE, DevOps, or Software Engineering role and a minimum of 5 years as a people manager
Relevant certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or similar
In-depth understanding of containerisation (Docker, Kubernetes) and infrastructure as code (Terraform, CloudFormation)
Experience with database management (SQL, NoSQL), load balancing, and distributed systems
AWS Certified Solutions Architect certification (Associate or Professional) is desirable

Site Reliability Engineering Manager

Granicus

Remote

DevOps

Manager
