Summary
Join Focal Systems, a leading retail AI solutions company, as a Sr. DevOps/Site Reliability Engineer. You will play a crucial role in keeping our infrastructure running smoothly and continuously improving it. Responsibilities include managing large GCP Kubernetes clusters, ensuring 99.9% uptime for distributed services, and collaborating with the Backend, Frontend, and Deep Learning teams on infrastructure automation. You will also design and manage a robust CI/CD pipeline and lead uptime improvement processes. This role requires extensive experience in SRE, containerization, and cloud cost management, along with technologies such as Terraform, SQL databases, Redis, and Python. The position offers a competitive salary, stock options, paid time off, quarterly team retreats, and education grants.
Requirements
- Solid experience in an infrastructure or Site Reliability Engineer (SRE) role
- Hands-on experience with containerization (Docker) and orchestration platforms (Kubernetes) required
- Experience in cloud cost management
- Strong understanding of SQL, networking, distributed systems, operating systems (Debian), and software engineering practices
- Experience with messaging systems
- Experience with Terraform or another Infrastructure as Code automation solution
- Experience operating relational SQL databases and Redis at terabyte scale
- Proven experience with setting up monitoring/alerting and reliability engineering
- Scripting skills in Python
- Must be comfortable with 12-hour on-call rotations
Responsibilities
- Set up and manage blue/green and canary deployments to ensure smooth launches without downtime
- Operate multiple large GCP Kubernetes clusters and fine-tune them to balance reliability against cost
- Manage the company's distributed services, ensuring graceful updates, comprehensive test coverage, log tracking, and 99.9% uptime
- Work with Backend, Frontend and Deep Learning teams and write infrastructure automation code for their needs
- Identify scalability bottlenecks through load testing and plan infrastructure architecture
- Create tools that provide transparency into, and easy access to, the company's rich datasets stored across varying geographic locations and data formats
- Design, build, and manage a robust Continuous Integration and Continuous Deployment (CI/CD) pipeline
- Lead uptime improvement processes, including postmortem reviews and on-call setup
Preferred Qualifications
- GitOps
- Setting up automation for complex load testing scenarios
- Tuning Deep Learning pipelines with Python, PyTorch, and multiprocessing
- Backend programming with Python
Benefits
- Competitive Salary & Attractive Stock Options
- Paid Time Off
- Quarterly Team Retreats
- Education Grants