Senior Site Reliability Engineer
Focal Systems
$122k-$129k
Remote - Canada
Please let Focal Systems know you found this job on JobsCollider. Thanks!
Job highlights
Summary
Join Focal Systems, a leading retail AI solutions company, as a Sr. DevOps/Site Reliability Engineer. You will play a crucial role in ensuring the smooth operation and continuous improvement of our infrastructure. Responsibilities include managing large GCP Kubernetes clusters, ensuring 99.9% uptime for distributed services, and collaborating with various teams on infrastructure automation. You will also design and manage a robust CI/CD pipeline and lead uptime improvement processes. This role requires extensive experience in SRE, containerization, cloud cost management, and related infrastructure technologies. The position offers a competitive salary, stock options, paid time off, quarterly team retreats, and education grants.
Requirements
- Solid experience in an infrastructure or Site Reliability Engineer (SRE) role
- Hands-on experience with containerization (Docker) and orchestration platforms (Kubernetes) required
- Experience in cloud cost management
- Strong understanding of SQL, networking, distributed systems, operating systems (Debian), and software engineering practices
- Experience with messaging systems
- Experience with Terraform or another Infrastructure as Code automation solution
- Experience operating relational SQL databases and Redis at terabyte scale
- Proven experience with setting up monitoring/alerting and reliability engineering
- Scripting skills in Python
- Must be comfortable with 12-hour on-call rotations
Responsibilities
- Set up and manage blue/green and canary deployments to ensure smooth launches without downtime
- Operate multiple large GCP Kubernetes clusters and fine-tune them to balance reliability against cost
- Manage the company's distributed services, ensuring graceful updates, comprehensive test coverage, log tracking, and 99.9% uptime
- Work with Backend, Frontend and Deep Learning teams and write infrastructure automation code for their needs
- Identify scalability bottlenecks through load testing and plan infrastructure architecture
- Create tools to provide transparency/ease of access into the company's rich datasets stored across varying geographic locations and data formats
- Design, build, and manage a robust Continuous Integration and Continuous Deployment (CI/CD) pipeline
- Lead uptime improvement processes, including postmortem reviews and on-call setup
Preferred Qualifications
- GitOps
- Setting up automation for complex load testing scenarios
- Tuning Deep Learning pipelines with Python, PyTorch, and multiprocessing
- Backend programming with Python
Benefits
- Competitive Salary & Attractive Stock
- Paid Time Off
- Quarterly Team Retreats
- Education grants