Site Reliability Engineer

Cloudbeds

๐Ÿ“Remote

Summary

Join Cloudbeds as a Site Reliability Engineer (SRE) and ensure the reliability, availability, and performance of our systems and applications. Collaborate with cross-functional teams to design and implement scalable and resilient solutions using automation and best practices. You will have ample opportunities for architecture design and implementation within AWS cloud infrastructure. Help provide the highest quality full-stack management solution for hotels worldwide. The role requires 2+ years of experience as a DevOps or SRE Engineer with AWS expertise and strong skills in Linux system administration, Kubernetes, Docker, and more. The position is based in Europe and offers a remote-first work environment.

Requirements

  • 2+ years of experience as a DevOps or SRE Engineer, working with AWS
  • Exceptional skills in Linux system administration
  • 2+ years of strong experience with Kubernetes, Docker, and Helm charts
  • Experience implementing and scaling Amazon Elastic Kubernetes Service (EKS) platforms
  • Strong experience with application containerization methodologies and delivery
  • Strong experience with monitoring, logging, and alerting technologies (any of ELK, Datadog, Loki, Amazon CloudWatch)
  • Experience with infrastructure-as-code methodologies such as Terraform
  • Experience designing, building, and supporting CI/CD pipelines (GitHub Actions, Bitbucket Pipelines, and ArgoCD)
  • Experience with web application servers and traffic handling (NGINX, Ingress controllers, load balancing), databases (MySQL, PostgreSQL, Aurora), cache technologies (any of Redis, Memcached), and queue technologies (SQS)
  • Ability to write Bash/Python scripts
  • Good networking skills
  • Good written and verbal communication in English
  • Good team player qualities
  • Ability to work remotely and manage your own time in a global team
  • Bachelor's degree in Computer Science or related field, or equivalent experience

Responsibilities

  • Design and implement reliable, scalable, and efficient systems to meet the needs of the organization
  • Maintain and support highly loaded Kubernetes (EKS) clusters and infrastructure-related components
  • Develop and continuously improve product monitoring and logging systems based on the Prometheus, Datadog, and Loki stacks
  • Respond to and resolve incidents, ensuring minimal impact on services
  • Collaborate with development teams to establish Service Level Objectives (SLOs) and ensure systems meet or exceed reliability targets
  • Optimize system performance and troubleshoot issues as they arise
  • Support development teams by sharing SRE best practices and expertise, and assist with environment and application configuration from a resiliency perspective
  • Collaborate with security teams to implement and maintain security best practices
  • Support the release process via CI/CD pipelines
  • Automate the platform with infrastructure-as-code and configuration management
  • Maintain clear and comprehensive documentation for systems, processes, and procedures
  • Share knowledge with team members to enhance overall understanding
  • Participate in the on-call rotation to support production environment outages

Preferred Qualifications

  • Advanced experience with Database Administration (Aurora, MySQL, PostgreSQL)
  • Experience working in a Scrum team using Jira and providing L3/L4 support
  • Experience working in a PCI-compliant environment
  • Experience working with Kong API Gateway

Benefits

  • Remote First, Remote Always
  • PTO in accordance with local labor requirements
  • Two corporate apartments available to team members free of charge (San Diego & São Paulo)
  • Fully Paid Parental Leave
  • Home office stipend based on country of residency
  • Professional development courses in Cloudbeds University
  • Access provided to professional Therapy and Coaching
  • Access to professional development, including manager training, upskilling and knowledge transfer
