Senior Site Reliability Engineer at Focal Systems

Summary

Join Focal Systems, a leading retail AI solutions company, as a Senior DevOps/Site Reliability Engineer. You will play a crucial role in keeping our infrastructure running smoothly and continuously improving it. Responsibilities include managing large GCP Kubernetes clusters, ensuring 99.9% uptime for distributed services, and collaborating with the Backend, Frontend, and Deep Learning teams on infrastructure automation. You will also design and manage a robust CI/CD pipeline and lead uptime improvement processes. This role requires extensive experience in SRE, containerization, cloud cost management, and related technologies such as Terraform, SQL databases, Redis, and Python. The position offers a competitive salary, stock options, paid time off, quarterly team retreats, and education grants.

Requirements

Solid experience in an infrastructure or Site Reliability Engineer (SRE) role
Hands-on experience with containerization (Docker) and orchestration platforms (Kubernetes) required
Experience in cloud cost management
Strong understanding of SQL, networking, distributed systems, operating systems (Debian), and software engineering practices
Experience with messaging systems
Experience with Terraform or another Infrastructure as Code automation solution
Experience operating relational SQL databases and Redis at terabyte scale
Proven experience with setting up monitoring/alerting and reliability engineering
Scripting skills in Python
Must be comfortable with 12-hour on-call rotations

Responsibilities

Set up and manage blue/green and canary deployments to ensure smooth launches without downtime
Operate multiple large GCP Kubernetes clusters and fine-tune them to balance reliability against cost
Manage the company's distributed services, ensuring graceful updates, comprehensive test coverage, log tracking, and 99.9% uptime
Work with Backend, Frontend and Deep Learning teams and write infrastructure automation code for their needs
Identify scalability bottlenecks through load testing and plan infrastructure architecture
Create tools to provide transparency/ease of access into the company's rich datasets stored across varying geographic locations and data formats
Design, build, and manage a robust Continuous Integration and Continuous Deployment (CI/CD) pipeline
Lead uptime improvement processes, including postmortem reviews and on-call setup

Preferred Qualifications

GitOps
Setting up automation for complex load testing scenarios
Tuning Deep Learning pipelines with Python, PyTorch, and multiprocessing
Backend programming with Python

Benefits

Competitive Salary & Attractive Stock
Paid Time Off
Quarterly Team Retreats
Education grants

Remote

DevOps

Senior
