Site Reliability Engineer

Twilio
Summary
Join Twilio as a Site Reliability Engineer on our Data Infrastructure Platform! This role involves designing, building, and optimizing our platform to support various data-driven initiatives. You will collaborate with cross-functional teams, architect scalable solutions, and implement data solutions and infrastructure. The ideal candidate is passionate about leveraging data, possesses strong technical skills, and has experience with modern data technologies. You will be responsible for designing and implementing data streaming solutions, ensuring data quality and security, and staying current with emerging technologies. Mentoring junior engineers and contributing to a culture of continuous learning are also key aspects of this position. This remote role offers competitive pay and benefits.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field
- 8+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles with a focus on infrastructure or backend systems
- Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability
- Hands-on experience with Kubernetes (preferably EKS), including deploying and managing stateful services and operators in Kubernetes environments
- Deep understanding of AWS cloud services, particularly those relevant to data infrastructure (e.g., EC2, EBS, S3, IAM, MSK, CloudWatch, VPC, ALB/NLB)
- Proficiency in infrastructure-as-code tools, such as Terraform or CloudFormation, for managing and automating infrastructure
- Expertise in observability tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog) to monitor distributed systems and set up alerting for reliability and latency
- Proficient in at least one programming language (e.g., Go, Python, Java, or similar) for building automation, tooling, and contributing to platform services
- Experience designing and implementing incident response processes, SLOs/SLIs, runbooks, and participating in on-call rotations
- Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs
- Proven track record of driving reliability improvements in high-scale, data-intensive systems and collaborating with platform and data engineering teams
- Excellent problem-solving and analytical skills
- Strong verbal & written communication skills, with the ability to work effectively in a cross-functional team environment
Responsibilities
- Design, build, and maintain infrastructure and scalable frameworks to support data ingestion, processing, and analysis
- Collaborate with stakeholders, analysts, and product teams to understand business requirements and translate them into technical solutions
- Architect and implement data streaming solutions using modern data technologies such as Kafka, AWS MSK, Terraform, Hive, Hudi, Presto, Airflow, and cloud-based services like AWS EKS, Lake Formation, Glue, and Athena
- Design and implement frameworks and solutions for performance, reliability, and cost-efficiency
- Ensure data quality, integrity, and security throughout the data lifecycle
- Stay current with emerging big data technologies and best practices
- Mentor early-career engineers and contribute to a culture of continuous learning and improvement
Preferred Qualifications
- Experience with data technologies such as Apache Kafka, AWS MSK, Flink, and ClickHouse
- Bias to action, with the ability to iterate and ship rapidly
- Passion for building data products, with prior projects in this area
Benefits
- Competitive pay
- Generous time off
- Ample parental and wellness leave
- Healthcare
- A retirement savings program