Senior Customer Reliability Engineer

Astronomer
Summary
Join Astronomer's Customer Reliability Engineering (CRE) team and play a crucial role in ensuring the success of our customers using our managed Airflow service. As an infrastructure specialist, you will focus on the reliability of our cloud infrastructure and Kubernetes clusters, responding to incidents and implementing permanent solutions. This customer-facing role offers exposure to diverse problems and technologies across various cloud providers. You will directly impact customer success, improve the customer experience, and contribute to the architecture of our products. The role involves collaboration with a globally distributed team and participation in on-call rotations. Astronomer values diverse experiences and unconventional backgrounds.
Requirements
- 5 years of experience, preferably with large, complex SaaS infrastructures operating at scale
- Commercial experience using or managing Kubernetes clusters
- Experience managing a production distributed system on at least one major cloud provider (AWS, GCP, or Azure)
- Strong networking experience with one of the major cloud providers
- Strong Linux experience
- Knowledge of how to operate distributed systems and monitor them for issues
- Experience with observability tools
- Previous experience handling customer issues (internal and external)
- Strong communication skills
- DevOps or CI/CD experience
- Python scripting
- Good troubleshooting skills
Responsibilities
- Provide solutions to customers to make them successful using our products
- Troubleshoot customer environments and actively triage issues with customers
- Provide feedback to the product development teams on customer needs and pain points
- Build out our monitoring and alerting systems
- Build and maintain automation to ensure daily operational tasks are handled as efficiently as possible
- Help direct the architecture of the products and contribute where possible
- Own the customer experience, working directly with customers to prioritize and solve issues, meet SLAs, and provide "white glove" guidance on the path to production
- Participate remotely within a fully distributed team
- Enhance and enrich customer documentation
- Work on a modern, sophisticated, cloud-native product that customers use to connect to dozens of other systems
- Help maintain 24x7 coverage through a specified 6-hour pager period during your work day
- Participate in a paid on-call rotation for weekend coverage
Preferred Qualifications
- Experience as a Site Reliability Engineer
- Experience working with Kubernetes Custom Resources
- Depth of knowledge with Azure
- Airflow/Big Data Orchestration experience
- IaC experience
Benefits
Remote work, flexible hours