Summary

Join Graylog's multinational cloud services team as a Site Reliability Engineer. You will provide architectural guidance and technical solutions for a 24x7 support cloud offering, focusing on high availability, resilience, security, scalability, and cost efficiency. Responsibilities include cloud infrastructure management using AWS, Terraform, and Kubernetes; implementing security measures and ensuring compliance; developing internal tools; resolving infrastructure issues; advocating for cloud strategies; and sharing knowledge. The role is full-time and permanent, based in North America, and reports to the Engineering Manager, Site Reliability. Graylog offers a remote-friendly work environment and various benefits.

Requirements

Proficiency in managing cloud infrastructure, especially AWS, along with associated tools like Terraform and Kubernetes, ensuring high availability, scalability, and resilience
Hands-on experience with IaC tools and techniques, including configuration management and cloud provisioning
Basic programming skills in at least one language, such as Python, for tool development and automation tasks
Knowledge of security protocols and compliance requirements specific to cloud environments, with experience in implementing security measures
Experience in diagnosing and resolving infrastructure-related issues, working closely with development and support teams
Familiarity with cloud monitoring tools and performance metrics to continuously evaluate and improve the infrastructure
Understanding of continuous integration and continuous deployment practices for efficient and reliable product releases
Ability to document technical processes clearly and effectively communicate architectural decisions and changes to various stakeholders

Responsibilities

Providing architectural guidance and technical solutions for adapting our product to a 24x7 support cloud offering, with a focus on making it highly available, resilient, secure, scalable, and cost-efficient while consistently delivering valuable product outcomes to consumers
Writing pull requests (PRs) that improve and optimize our AWS, Terraform, and Kubernetes setup, with a focus on high availability, scalability, and resilience
Implementing security measures, auditing the cloud environment, and ensuring adherence to compliance standards
Expanding our internal tool base, focusing on Infrastructure as Code and configuration management improvements
Collaborating with teams to identify and resolve infrastructure-related issues swiftly, minimizing any impact on product performance
Championing cloud strategies that align with and advance our business objectives, especially during pitch cycles and other planning meetings
Connecting with Cloud Engineers, Site Reliability Engineers, and application engineers, documenting key decisions where possible and making sure critical knowledge isn't siloed in a single spot in the organization

Benefits

Opportunity to work with a globally distributed and diverse team
Grow and develop professionally and personally in a fast-growing environment
Choice of the latest equipment to help you succeed
Monthly allowance to help cover your commute costs and outfit your work-from-home environment
Remote-friendly work environment

Site Reliability Engineer

Graylog

Remote

DevOps

Mid-level
