Senior Site Reliability Engineer


Focal Systems

💵 $122k-$129k
📍 Remote - Canada

Summary

Join Focal Systems, a leading retail AI solutions company, as a Sr. DevOps/Site Reliability Engineer. You will play a crucial role in ensuring the smooth operation and continuous improvement of our infrastructure. Responsibilities include managing large GCP Kubernetes clusters, ensuring 99.9% uptime for distributed services, and collaborating with various teams on infrastructure automation. You will also design and manage a robust CI/CD pipeline and lead uptime improvement processes. This role requires extensive experience in SRE, containerization, cloud cost management, and various technologies. The position offers a competitive salary, stock options, paid time off, quarterly team retreats, and education grants.

Requirements

  • Solid experience in an infrastructure or Site Reliability Engineer (SRE) role
  • Hands-on experience with containerization (Docker) and orchestration platforms (Kubernetes) required
  • Experience in cloud cost management
  • Strong understanding of SQL, networking, distributed systems, operating systems (Debian), and software engineering practices
  • Experience with messaging systems
  • Experience with Terraform or another Infrastructure as Code automation solution
  • Experience operating relational SQL databases and Redis at terabyte scale
  • Proven experience with setting up monitoring/alerting and reliability engineering
  • Scripting skills in Python
  • Must be comfortable with 12-hour on-call rotations

Responsibilities

  • Set up and manage blue/green and canary deployments to ensure smooth launches without downtime
  • Operate multiple large GCP Kubernetes clusters and fine-tune them to balance reliability against cost
  • Manage the company's distributed services, ensuring graceful updates, comprehensive test coverage, log tracking, and 99.9% uptime
  • Work with Backend, Frontend and Deep Learning teams and write infrastructure automation code for their needs
  • Identify scalability bottlenecks through load testing and plan infrastructure architecture
  • Create tools to provide transparency/ease of access into the company's rich datasets stored across varying geographic locations and data formats
  • Design, build, and manage a robust Continuous Integration and Continuous Deployment (CI/CD) pipeline
  • Lead uptime improvement processes, including postmortem reviews and on-call setup

Preferred Qualifications

  • GitOps
  • Setting up automation for complex load testing scenarios
  • Tuning deep learning pipelines with Python, PyTorch, and multiprocessing
  • Backend programming with Python

Benefits

  • Competitive Salary & Attractive Stock
  • Paid Time Off
  • Quarterly Team Retreats
  • Education grants

This job is filled or no longer available.