Remote Site Reliability Engineer II
Abnormal Security
Remote - Canada
Please let Abnormal Security know you found this job on JobsCollider. Thanks!
Job highlights
Summary
Join Abnormal Security's team as a Site Reliability Engineer to prevent, detect, efficiently remediate, and quickly recover from outages that impact the Abnormal Security Platform. This role involves building tools and processes for deployment operations and for incident prevention, detection, remediation, and recovery.
Requirements
- Bachelor's in Computer Science, Computer Engineering, or equivalent professional experience
- 4+ years of experience as a Site Reliability Engineer, responsible for the reliability of shared services
- Experience with a public cloud provider (AWS, Azure, GCP), observability stack (Prometheus, Grafana), and incident management tools (PagerDuty, Sentry, Slack integration)
Responsibilities
- Build tools and processes to standardize deployment of the Abnormal Security product suite in a multi-datacenter setup
- Partner with R&D teams to develop pre- and post-deployment checklists, canary test environments and workflows, and safe rollback processes
- Identify gaps in existing processes and advocate for necessary changes to improve overall system stability and availability
- Lead the Production Readiness Review process to ensure the resilience of systems before customer deployment
- Oversee the Critical Change Management Review process for the safe application of changes to critical services
- Develop and enforce architecture guidelines to minimize downtime and ensure high system availability
- Establish a consistent definition of metrics for "Is this product working?"
- Define and monitor SLAs/SLOs for critical systems, actively tracking deviations and triggering alerts when necessary
- Define incident severity classification guidelines and implement incident response protocols to promptly address issues and reduce downtime
- Facilitate effective communication between Engineering and Customer Success teams during incidents
- Design and implement tools to expedite system recovery and minimize the impact of incidents
- Develop guidelines for post-incident postmortems to prevent recurrence