Senior Site Reliability Engineer

Granicus
Summary
Join Granicus as a Senior Site Reliability Engineer (SRE) and play a pivotal role in ensuring the reliability, scalability, and performance of our services. You will lead efforts to build and maintain a robust infrastructure and guide the team in implementing site reliability best practices. This role involves on-call production support, working on customer and internal tickets, managing the SRE backlog, monitoring systems, automating processes, and participating in incident management and system improvements. Collaboration with software engineers, documentation, capacity planning, and implementing security best practices are also key responsibilities. Granicus offers a competitive benefits package and is a remote-first company with a globally distributed workforce.
Requirements
- 5+ years in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems
- Expertise in Linux/Unix systems and cloud platforms (AWS, Azure, or Google Cloud)
- Strong proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++)
- Familiarity with AI/ML operations, including model lifecycle management, vector databases, and inference performance tuning
- Experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, monitoring, and observability
- Experience with configuration management tools (Ansible, Chef, Puppet)
- Responsible for Granicus information security by appropriately preserving the Confidentiality, Integrity, and Availability (CIA) of Granicus information assets in accordance with the company's information security program
- Responsible for ensuring the privacy of our employees, our customers, and their data, and for completing all required privacy training in a timely manner, in accordance with company policies
Responsibilities
- Provide production support on a shift according to the team on-call roster
- Work on tickets raised by customers and by internal engineering/implementation teams while not on call for production support. For example, a client may ask for data on the database server to be corrected when this cannot be done through the web interface
- Work on SRE backlog items
- Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability
- Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention
- Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence
- Participate in designing and implementing system improvements to enhance reliability, scalability, and performance
- Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes
- Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team
- Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth
- Implement and adhere to security best practices to protect our systems and data
Preferred Qualifications
- Experience supporting AI/ML infrastructure, including model deployment, inference optimization, and integration with services like AWS Bedrock, is highly desirable
- Exposure to AI/ML toolchains, including AWS Bedrock, SageMaker, and LLMOps frameworks
- Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Machine Learning - Specialty, Google Cloud Professional DevOps Engineer, or similar are a plus
Benefits
- Flexible Time Off
- Medical (includes an option that is paid 100% by Granicus!), Dental & Vision Insurance
- 401(k) plan with matching contribution
- Paid Parental Leave
- Employer-paid Short and Long Term Disability Insurance, Group Term Life Insurance and AD&D Insurance
- Group legal coverage
