Site Reliability Engineer

Graylog

πŸ“Remote - United States

Job highlights

Summary

Join Graylog's multinational cloud services team as a Site Reliability Engineer. You will provide architectural guidance and technical solutions for a cloud offering with 24x7 support, focusing on high availability, resilience, security, scalability, and cost efficiency. Responsibilities include managing cloud infrastructure with AWS, Terraform, and Kubernetes; implementing security measures and ensuring compliance; developing internal tools; resolving infrastructure issues; advocating for cloud strategies; and sharing knowledge. The role is full-time and permanent, based in North America, and reports to the Engineering Manager, Site Reliability. Graylog offers a remote-friendly work environment and various benefits.

Requirements

  • Proficiency in managing cloud infrastructure, especially on AWS, with associated tools such as Terraform and Kubernetes, to ensure high availability, scalability, and resilience
  • Hands-on experience with IaC tools and techniques, including configuration management and cloud provisioning
  • Basic programming skills in at least one language, such as Python, for tool development and automation tasks
  • Knowledge of security protocols and compliance requirements specific to cloud environments, with experience in implementing security measures
  • Experience in diagnosing and resolving infrastructure-related issues, working closely with development and support teams
  • Familiarity with cloud monitoring tools and performance metrics to continuously evaluate and improve the infrastructure
  • Understanding of continuous integration and continuous deployment practices for efficient and reliable product releases
  • Ability to document technical processes clearly and effectively communicate architectural decisions and changes to various stakeholders

Responsibilities

  • Providing architectural guidance and technical solutions for adapting our product to a cloud offering with 24x7 support, with a focus on a product that is highly available, resilient, secure, scalable, and cost-efficient, and that consistently delivers valuable outcomes to consumers
  • Writing pull requests (PRs) that improve and optimize our AWS, Terraform, and Kubernetes setup, with a focus on high availability, scalability, and resilience
  • Implementing security measures, auditing the cloud environment, and ensuring adherence to compliance standards
  • Expanding our internal tool base, focusing on Infrastructure as Code and configuration management improvements
  • Collaborating with teams to identify and resolve infrastructure-related issues swiftly, minimizing any impact on product performance
  • Championing cloud strategies that align with and advance our business objectives, especially during pitch cycles and other planning meetings
  • Connecting with Cloud Engineers, Site Reliability Engineers, and application engineers, documenting key decisions where possible and making sure critical knowledge isn't siloed in one part of the organization

Benefits

  • Opportunity to work with a globally distributed and diverse team
  • Grow and develop professionally and personally in a fast-growing environment
  • Choice of the latest equipment to help you succeed
  • Monthly allowance to cover your commute costs and help outfit your work-from-home environment
  • Remote-friendly work environment

Please let Graylog know you found this job on JobsCollider. Thanks! 🙏