Remote Senior Site Reliability Engineer

Lightspeed

πŸ“Remote - United Kingdom

Job highlights

Summary

Join our NuOrder by Lightspeed team as a Staff Site Reliability Engineer and contribute to building software solutions that help merchants grow their business. You will be part of a team responsible for cross-cutting concerns such as cloud infrastructure, reliability, and incident management, and you will support our growing development teams with the infrastructure and tools they need to scale.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or equivalent real-world experience
  • 6+ years of experience in site reliability engineering, systems administration, and/or software engineering
  • Expertise in container orchestration platforms, specifically Kubernetes
  • Strong understanding of both relational (e.g., PostgreSQL, MySQL) and NoSQL databases (e.g., MongoDB, Cassandra, Redis)
  • Familiarity with network protocols and IP networking, along with experience in network troubleshooting
  • Proficiency in at least one programming or scripting language, such as Bash, Python, or Go
  • Proven track record of managing large-scale infrastructure in cloud environments like Google Cloud, AWS, or Azure
  • Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog) and logging solutions (e.g., ELK stack)
  • Strong understanding of security best practices
  • Excellent problem-solving skills and the ability to work under pressure to troubleshoot and resolve complex issues
  • Excellent communication skills for effective collaboration with cross-functional teams

Responsibilities

  • Design, build, and maintain robust infrastructure on GCP, leveraging cloud-native technologies such as GKE, Cloud SQL, and BigQuery
  • Develop and manage CI/CD pipelines for efficient deployment and release using various technologies (GitLab, GitHub, Helm, Terraform, etc.)
  • Work closely with development teams to provide tools and practices for monitoring software health in production, defining and measuring reliability metrics (SLI, SLO), and managing error budgets
  • Build platform solutions and apply software engineering principles to improve software reliability and accelerate delivery
  • Support the incident management process and conduct post-mortem analysis to prevent future outages
  • Mentor junior SREs and developers, offering guidance on best practices in cloud architecture, data management, and software development
  • Manage infrastructure changes through infrastructure as code (IaC) using Terraform
  • Participate in the on-call rotation
  • Stay current with industry trends and emerging technologies, advocating for the adoption of new technologies and practices to improve product quality and team efficiency

Benefits

  • Work in a talented global team with strong role growth opportunities
  • Flexible working policy
  • Lightspeed share scheme (we are all owners)
  • Company pension program
  • Private medical insurance
  • Health and wellness benefit
  • Mental health online platform and counseling & coaching services
  • Paid leave and assistance for new parents
  • Language classes & LinkedIn Learning license
  • Volunteer day

This job is filled or no longer available