Senior Site Reliability Engineer

Echo360
Summary
Join Echo360 as a Site Reliability Engineer and play a critical role in ensuring the reliability, scalability, cost efficiency, and security of our cloud infrastructure. You will design and implement automated monitoring and alerting systems, collaborate with development teams, and conduct failure testing. Leveraging your AWS expertise, you will optimize performance, automate infrastructure provisioning, and enforce security best practices. Beyond technical skills, you will engage in incident response, mentorship, and continuous improvement. This fully remote position offers a competitive salary and comprehensive benefits. If you thrive in a fast-paced environment and are passionate about cloud optimization, this is an exciting opportunity to make a significant impact.
Requirements
- 5+ years of experience as a Site Reliability Engineer or similar role
- Strong understanding of AWS cloud services, including DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, EKS, ECS, and EC2
- Experience with infrastructure automation tools like Ansible, Terraform, or CloudFormation
- Experience with monitoring and alerting tools like CloudWatch, DataDog, Prometheus, Grafana, Kibana, and PagerDuty
- Experience with GitHub Actions, CI/CD pipelines, and deployment strategies
- Strong problem-solving and analytical skills
- Excellent communication and collaboration skills
- Ability to work independently and take ownership of complex tasks
- Passion for technology and a desire to learn and grow
Responsibilities
- Ensure service reliability and SLO/SLA adherence in production, preventing incidents by proactively conducting failure testing
- Implement automated monitoring and alerting systems for early detection of potential problems
- Collaborate with development teams to perform deployments and rollbacks with minimal disruption
- Optimize the performance and scalability of our AWS infrastructure, including RDS, DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, SES, EKS, ECS, and EC2
- Automate infrastructure provisioning and deployment processes using Terraform, CI/CD pipelines, and configuration management tools
- Proactively identify and address potential security vulnerabilities, maintaining compliance and enforcing best practices for IAM and secrets management
- Participate in incident response and post-mortem analysis activities to identify root causes and prevent future occurrences
- Help onboard and mentor junior team members, sharing your knowledge and expertise
- Stay up to date on the latest cloud technologies and best practices for SRE
- Participate in a well-structured on-call rotation with other Site Reliability Engineers
- Explore new technologies and innovative solutions to improve service quality and speed to market
- Participate in technical discussions and deep dives with the other engineering and product teams
Preferred Qualifications
- Experience with Jenkins, PostgreSQL, and MongoDB
- Experience with cloud cost optimization, security best practices and tools
- Experience working in a fast-paced, agile environment
- Experience with Rancher, Cattleprod, and TeamCity is a plus
Benefits
- Medical, dental, vision, life & disability insurance
- A 401(k) plan with company match
- An unlimited PTO policy
- Fully remote