Summary

Crisis Text Line is a mental health support organization that provides free, 24/7 text-based crisis intervention services. Its engineering, product, and design teams are seeking a Site Reliability Engineer (SRE) to lead and maintain cloud infrastructure on AWS Fargate, design and maintain CloudWatch alerting and monitoring configurations, mentor junior team members, collaborate with cross-functional teams, and automate repetitive tasks. The ideal candidate has experience in site reliability engineering or related roles, hands-on experience with AWS services, proficiency with infrastructure as code tools, strong scripting, automation, and problem-solving skills, experience with container orchestration platforms, and a solid understanding of networking concepts, security best practices, and DevOps principles.

Requirements

Bachelor's degree in Computer Science, Engineering, or related field (Master's degree preferred) or equivalent experience
Experience in site reliability engineering (SRE) or related roles, with a focus on cloud infrastructure management
Hands-on experience with AWS services, particularly AWS Fargate, CloudWatch, and related tools
Proficiency with infrastructure as code (IaC) tools such as Terraform or CloudFormation
Strong scripting and automation skills using languages such as Python, Bash, or PowerShell
Experience with container orchestration platforms such as Kubernetes or Amazon ECS
Solid understanding of networking concepts, security best practices, and DevOps principles
Strong problem-solving skills and the ability to work effectively in a fast-paced, collaborative environment

Responsibilities

Lead and maintain highly available, scalable, and secure infrastructure on AWS Fargate
Design and maintain CloudWatch alerting and monitoring configurations to proactively identify and resolve potential issues
Mentor and guide junior team members, sharing best practices and promoting a culture of excellence
Collaborate with cross-functional teams to define and implement best practices for infrastructure as code (IaC), continuous integration/continuous deployment (CI/CD), and site reliability engineering (SRE) methodologies
Lead incident response and resolution, including troubleshooting complex system issues and implementing preventive measures to minimize downtime
Automate repetitive tasks and processes to improve operational efficiency and reduce manual intervention
Conduct performance tuning and optimization of infrastructure components to ensure optimal resource utilization and cost efficiency
Stay up to date with emerging technologies and industry trends to drive innovation and continuous improvement

Benefits

20 paid holidays, including federal holidays such as Juneteenth and Labor Day, Election Day, a holiday break from December 24 through January 1, 2 days for renewal, and 2 floating holidays
Flexible paid time off, including 15 vacation days, 3 personal days, and 7 sick days
Medical, dental, and vision benefits for the staff member and family at no cost to the employee
403(b) retirement plan (the nonprofit equivalent of a 401(k)): 3% contribution by Crisis Text Line to support building financial wellness, regardless of personal contribution
12 weeks paid parental leave (after 6 months of employment)
Student loan repayment (after 2 years of continuous full-time service)
Family support through a virtual childcare platform
Stipends and allowances for mental health, internet service, professional development, and wellness

Crisis Text Line is hiring a Senior Infrastructure Site Reliability Engineer (location: Worldwide)
