Summary

Join the Discogs Platform team as a Senior Site Reliability Engineer and contribute to centralized infrastructure maintenance, monitoring, and automation. You will lead incident response and postmortems, collaborating with engineering teams to improve technologies and processes. This remote position, open to candidates in OR, WA, CA, CO, TX, and IL, offers a competitive salary ($130,000-$140,000) and excellent benefits. You will maintain the organization's cloud presence, automate infrastructure, mentor engineering squads, and assist with capacity planning. Responsibilities include writing documentation, implementing monitoring systems, working in a containerized environment, and participating in on-call rotations. The ideal candidate has extensive experience in DevOps and cloud technologies.

Requirements

A Bachelor's Degree in Computer Science or similar area of focus, or equivalent relevant work experience
5+ years of experience in Ops, DevOps, Site Reliability, Platform, or other systems roles
Infrastructure-as-code (Terraform)
CI/CD (GitHub Actions)
GitOps (ArgoCD)
Kubernetes (EKS, Kustomize, Karpenter, administration, application manifests)
AWS and cloud development (VPC, EKS, RDS, S3)
FinOps and cloud cost optimization
Observability (Datadog, Sentry)
Scripting (Shell, Python)
Track record of collaboration and mentorship
Excellent written communication and documentation skills
Continuous learning
Ownership and proactive approach to solving large problems

Responsibilities

Maintaining the organization's cloud presence in AWS
Automating and deploying infrastructure configurations using Infrastructure as Code (IaC)
Mentoring engineering squads on Platform best practices for Kubernetes, MySQL, Kafka, and other software development lifecycle areas
Assisting engineering squads with capacity planning, infrastructure budgeting, and production readiness
Writing documentation and runbooks that contribute to the engineering organization’s knowledge base
Implementing monitoring and alerting systems with Discogs observability tools
Working in a containerized, orchestrated environment
Participating in the on-call rotation, responding to incidents, and troubleshooting data and other operations issues
Contributing to efforts on the reliability and design patterns of our Kafka, Kafka Connect, and database implementations

Preferred Qualifications

Kafka: Cluster administration (Strimzi), Kafka Connect (Debezium, JDBC)
Relational database administration and performance (MySQL, Percona Server, AWS RDS)
Elasticsearch (ECK administration, scaling, performance)
Python (SQLAlchemy, FastAPI)
GraphQL (schema design, Apollo federation)
REST API
HashiCorp Vault
Redis
Memcached

Benefits

Competitive compensation: salary, plus performance-related bonus program
401(k) with employer match
100% company-paid medical and dental insurance benefits for you and your dependents
4 weeks paid vacation, increasing based on tenure
18 weeks paid leave for birth moms
8 weeks paid parental leave, including for adoption
Monthly wellness allowance
Annual professional and personal development allowance
Work from home office set-up and expense allowances
Flexible work location opportunities
Employer matching toward charitable contributions

Senior Site Reliability Engineer

Discogs

Remote

DevOps

Senior
