Senior Site Reliability Engineer

Cordial

💵 $135k-$170k
📍 Remote - United States

Summary

Join Cordial as a Site Reliability Engineer to monitor, develop, and scale the platform, ensuring a delightful client experience. You will collaborate with DevOps and Product teams to optimize performance, identify and resolve issues, and implement comprehensive monitoring. Responsibilities include administering and troubleshooting application and network components, designing and deploying Kubernetes manifests, contributing to infrastructure design, debugging code, providing production support, and participating in on-call rotations. The ideal candidate possesses extensive experience in Unix/Linux systems, AWS, Kubernetes, and various monitoring tools. Cordial offers a competitive salary, equity, bonus, robust benefits, and perks such as wellness stipends and education reimbursements.

Requirements

  • 5+ years of Unix/Linux systems and network administration (DNS, IPsec, VPN, load balancing, process tracing)
  • Experience with AWS (we use EC2, EKS)
  • Experience deploying and/or maintaining Kubernetes/EKS clusters
  • Hands-on experience writing and maintaining custom Helm charts
  • Experience working with one or more service meshes (AWS App Mesh, Istio, Linkerd)
  • Experience with monitoring, logging and alerting tools
  • Previous experience in an SRE and/or DevOps role
  • Development experience in PHP
  • Extensive experience with Docker/containers & Kubernetes
  • Experience with HashiCorp products such as Consul and Vault
  • Comfortable working in a globally distributed team across time zones
  • Strong teamwork and communication skills
  • A genuine desire to learn new technologies and grow
  • Fluent in verbal and written English
  • Experience with large-scale distributed systems
  • Proficiency in infrastructure as code (IaC) tools (e.g., Terraform, CloudFormation)
  • Understanding of observability principles and tools (e.g., Prometheus, Grafana, ELK stack, distributed tracing)
  • Familiarity with CI/CD pipelines (e.g., Jenkins, GitLab CI, ArgoCD)
  • A strong grasp of networking fundamentals
  • Knowledge of security best practices in a cloud environment

Responsibilities

  • Use your knowledge of web, application, network, server, storage, and security technologies to administer, monitor, and troubleshoot application and network components in our cloud-based environment (we are AWS hosted and make extensive use of Kubernetes, Consul, and Vault clusters)
  • Help design, author, deploy, and monitor manifests for our multiple Kubernetes clusters, Helm charts/repos, and service mesh configurations
  • Actively contribute to platform Infrastructure Design and Implementation discussions
  • Use your software engineering skills to trace/debug code and identify root causes of production data corruption and/or performance issues
  • Provide production support for the Product Development teams
  • Participate in an on-call rotation
  • Work with the team to develop and deploy monitoring and alerting architecture, and implement monitoring/logging solutions
  • Troubleshoot complex issues promptly to maintain the performance and stability of our production application environment
  • Help build out SLOs, and document and monitor SLAs

Benefits

  • $135,000.00-$170,000.00 annually
  • Equity and bonus
  • Robust benefit plan (medical/dental/vision/life)
  • 401k match
  • Flexible time off
  • Monthly wellness and cell phone stipends
  • Yearly childcare and continuing education reimbursements
