Senior Site Reliability Engineer

Discogs

💵 $130k-$140k
📍 Remote - United States

Summary

Join the Discogs Platform team as a Senior Site Reliability Engineer and contribute to centralized infrastructure maintenance, monitoring, and automation. You will lead incident response and postmortems, collaborating with engineering teams to improve technologies and processes. This remote position is open to candidates in OR, WA, CA, CO, TX, and IL, and offers a competitive salary ($130,000-$140,000) and excellent benefits. You will maintain the organization's cloud presence, automate infrastructure, mentor engineering squads, and assist with capacity planning. Responsibilities include writing documentation, implementing monitoring systems, working in a containerized environment, and participating in on-call rotations. The ideal candidate has extensive experience in DevOps and cloud technologies.

Requirements

  • A Bachelor's Degree in Computer Science or similar area of focus, or equivalent relevant work experience
  • 5+ years of experience in Ops, DevOps, Site Reliability, Platform, or other systems roles
  • Infrastructure-as-code (Terraform)
  • CI/CD (GitHub Actions)
  • GitOps (ArgoCD)
  • Kubernetes (EKS, Kustomize, Karpenter, administration, application manifests)
  • AWS and cloud development (VPC, EKS, RDS, S3)
  • FinOps and cloud cost optimization
  • Observability (Datadog, Sentry)
  • Scripting (Shell, Python)
  • Track record of collaboration and mentorship
  • Excellent written communication and documentation skills
  • Continuous learning
  • Ownership and proactive approach to solving large problems

Responsibilities

  • Maintaining the organization's cloud presence in AWS
  • Automating and deploying infrastructure configurations using Infrastructure as Code (IaC)
  • Mentoring engineering squads on Platform best practices for Kubernetes, MySQL, Kafka, and other software development lifecycle areas
  • Assisting engineering squads with capacity planning, infrastructure budgeting, and production readiness
  • Writing documentation and runbooks that contribute to the engineering organization's knowledge base
  • Implementing monitoring and alerting systems with Discogs observability tools
  • Working in a containerized, orchestrated environment
  • Participating in the on-call rotation, responding to incidents, and troubleshooting data and other operations issues
  • Contributing to efforts on the reliability and design patterns of our Kafka, Kafka Connect, and database implementations

Preferred Qualifications

  • Kafka: Cluster administration (Strimzi), Kafka Connect (Debezium, JDBC)
  • Relational database administration and performance (MySQL, Percona Server, AWS RDS)
  • Elasticsearch (ECK administration, scaling, performance)
  • Python (SQLAlchemy, FastAPI)
  • GraphQL (schema design, Apollo federation)
  • REST API
  • HashiCorp Vault
  • Redis
  • Memcached

Benefits

  • Competitive compensation: salary, plus performance-related bonus program
  • 401(k) with employer match
  • 100% company-paid medical and dental insurance benefits for you and your dependents
  • 4 weeks paid vacation, increasing based on tenure
  • 18 weeks paid leave for birth moms
  • 8 weeks paid parental leave, including for adoption
  • Monthly wellness allowance
  • Annual professional and personal development allowance
  • Work from home office set-up and expense allowances
  • Flexible work location opportunities
  • Employer matching toward charitable contributions
