Software Reliability Engineer

Milk Moovement

πŸ“Remote - Canada

Summary

Join Milk Moovement as a Software Reliability Engineer (SRE) and contribute to the smooth operation of our dairy industry platform. You will proactively monitor and resolve platform issues, implement monitoring solutions, investigate performance anomalies, and refine our incident response process. This critical role ensures high system availability and performance through collaboration with various teams. The ideal candidate possesses at least 3 years of SRE or DevOps experience with a focus on reliability, along with expertise in log aggregation, cloud-deployed applications, and incident management platforms. Milk Moovement offers a remote work environment, flexible hours, and unique perks.

Requirements

  • Strong experience with log aggregation and monitoring solutions. (Datadog, Splunk, ELK)
  • Experience monitoring cloud-deployed applications. (AWS, GCP, Azure)
  • Familiarity with configuring incident management platforms. (Squadcast, PagerDuty)
  • Experience using IaC for deployment and management. (Terraform, CloudFormation, CDK)
  • Proficiency in JavaScript or Python for automation and debugging
  • Extensive experience in troubleshooting & triaging performance issues and incidents
  • At least 3 years of prior SRE or DevOps experience, with a focus on the reliability side

Responsibilities

  • Implement and maintain monitoring solutions using Datadog, focusing on proactive detection and resolution of platform issues
  • Develop alerting mechanisms that trigger based on symptoms rather than just outages, ensuring early detection of problems
  • Analyze system metrics, logs, and performance data to identify trends and potential reliability concerns
  • Lead incident response efforts, including triaging, troubleshooting, and post-mortem analysis for continuous improvement
  • Manage and optimize logging and monitoring infrastructure to ensure observability across all services
  • Work closely with development teams to ensure features are deployed with minimal impact on platform reliability
  • Participate in on-call rotations and incident management workflows, ensuring rapid issue response and resolution
  • Assist in cloud engineering tasks where necessary, particularly in reliability-focused automation and infrastructure improvements

Preferred Qualifications

  • Datadog certification or extensive experience configuring and tuning monitoring solutions
  • Related AWS certifications or ample experience administering AWS environments
  • Proficiency building internal tooling and APIs leveraging serverless infrastructure (Lambda)
  • Experience working with container-based services. (Docker, ECS, Kubernetes)
  • Working knowledge of both SQL and NoSQL databases, including troubleshooting and performance tuning. (MongoDB, PostgreSQL, DynamoDB)
  • Familiarity with CI/CD processes and automation frameworks

Benefits

  • Remote work environment - work from home or from one of our hubs in Halifax and St. John's
  • Flexible hours - night owl or early riser? No problem
  • Tools - need the latest and greatest software to perform more efficiently? Ask and you shall receive
  • Quarterly guest speakers - from shark trainers and graffiti artists to astronomers and sandwich aficionados. The more unique, the better
