Senior Site Reliability Engineer

Shippo

📍Remote - Brazil

Summary

Join Shippo, the shipping layer of the internet, as a Site Reliability Engineer (SRE). You will apply platform engineering principles to keep Shippo's services reliable, scalable, and performant. In this hybrid software development and operations role, you'll design, build, scale, and secure the infrastructure supporting our applications, and build the automation and monitoring systems that keep it maintainable. You will collaborate closely with other engineering teams, and your work will directly affect our ability to meet SLAs. Shippo offers a remote-first program, which allows for flexibility in location.

Requirements

  • Experience developing, managing and troubleshooting highly available distributed systems, including operational experience with Kubernetes in a production environment
  • Extensive expertise with at least one public cloud provider (AWS, GCP, Azure)
  • Exceptional verbal, written, and interpersonal communication skills
  • Interest in and understanding of best-in-class security practices, and automation and testing methods
  • Familiarity with configuration and maintenance of common infrastructure components such as Redis, Elasticsearch, and Hadoop
  • Deep understanding of customer needs and passion for customer success
  • BS or MS degree in Computer Science or equivalent experience

Responsibilities

  • Design, scale, and secure infrastructure to stay ahead of business needs through fault-tolerant architecture design, performance testing, profiling, and tuning, and capacity planning
  • Design, build, deploy, and maintain automation, monitoring, and alerting systems, as well as design, implement, and test disaster recovery solutions
  • Ensure scalability and maintainability through microservices adoption, decoupling of concerns and data models, job queuing, and application layering
  • Enhance and maintain our CI/CD pipeline for smooth and safe production releases via automated testing and verification
  • Verify the correctness of systems and ensure they meet response-time and throughput targets
  • Participate in peer reviews, testing, and design reviews for new features, products, and systems, and contribute to automated test suites
  • Participate in an on-call rotation

Preferred Qualifications

  • Advanced knowledge of managing and optimizing PostgreSQL server configuration
  • 3+ years of experience in software development
  • Experience managing service meshes (e.g., Istio)
  • Experience defining and monitoring Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs) to ensure that systems meet reliability and performance targets
  • Experience with monitoring tools such as New Relic, Prometheus, Grafana, and/or Datadog, plus knowledge of OpenTelemetry for distributed tracing and metrics collection and experience using it in production environments
  • Experience managing Python and Golang applications in production microservices architectures
  • Experience with DevOps tooling such as Docker, Terraform, Argo CD, Argo Workflows, CircleCI, GitHub Actions, New Relic, PagerDuty, etc.
  • Experience with AWS/cloud services such as EKS, EC2, S3, Lambda, Route 53, CloudFront, Cloudflare, IAM, etc.

Benefits

Remote work, flexible hours
