Site Reliability Engineer

TextNow

πŸ“Remote - United States, Canada

Summary

Join TextNow, the nation's largest free phone service provider, and become a key member of our Site Reliability Engineering (SRE) team. We're on a mission to democratize phone service, and you'll play a vital role in ensuring the reliability and scalability of our infrastructure. Your responsibilities will encompass designing, building, and maintaining highly available systems, automating infrastructure using tools like Terraform and Ansible, and participating in incident response and on-call support. You'll also focus on performance monitoring and optimization, collaborating with cross-functional teams, and driving continuous improvement initiatives. We offer a strong work-life blend, flexible work arrangements, competitive pay and benefits, and a culture that values collaboration and innovation.

Requirements

  • Experienced in SRE/DevOps: You have 2+ years of experience in an operationally focused role, such as SRE, DevOps, or Infrastructure Engineering, with a deep understanding of reliability, scalability, and performance optimization
  • Proficient with Key Technologies: Hands-on experience with AWS, GitHub, Terraform, Ansible, or similar tools to build and manage cloud infrastructure efficiently
  • Incident Management Expert: You are comfortable handling production incidents, analyzing root causes, and implementing long-term fixes to prevent recurrence
  • Automation & Observability Focused: Passionate about reducing toil through scripting and automation while ensuring robust observability using logging, metrics, and monitoring tools
  • Collaborative & Impact-Driven: You enjoy working cross-functionally with engineers, product teams, and leadership to drive meaningful improvements to system reliability

Responsibilities

  • Ensure System Reliability: Design, build, and maintain scalable, resilient, and highly available systems to support TextNow's infrastructure and services
  • Automation & Infrastructure as Code: Develop and maintain automation using Terraform, Ansible, and other tools to enable efficient deployment, scaling, and operations of cloud-based systems (AWS preferred)
  • Incident Response & On-Call Support: Participate in an on-call rotation, troubleshoot issues, and drive incident resolution to minimize downtime and improve system performance. Conduct post-mortems and implement corrective actions to enhance reliability
  • Performance Monitoring & Optimization: Implement and improve observability tools, logging, and monitoring solutions to identify and mitigate potential system issues proactively
  • Collaboration & Cross-Team Engagement: Work closely with software engineers, DevOps, and product teams to align technical efforts with business objectives and improve system reliability from development to production
  • Continuous Improvement: Identify areas for improvement in architecture, automation, and operational practices. Contribute to the design and implementation of new SRE best practices

Benefits

  • Strong work-life blend
  • Flexible work arrangements (WFH, remote, or access to one of our office spaces)
  • Employee Stock Options
  • Unlimited vacation
  • Competitive pay and benefits
  • Parental leave
  • Benefits for both physical and mental well-being (wellness credit and L&D credit)
  • We travel a few times a year for various team events, company-wide off-sites, and more
