Site Reliability Engineer
Pavilion Payments
Job highlights
Summary
Join Pavilion Payments as its first Site Reliability Engineer (SRE) and play a crucial role in building resilient infrastructure and ensuring high availability across its systems. You will work closely with IT teams to implement best practices in system reliability, observability, and automated response. The position emphasizes reliability, platform management, and network security. Key responsibilities include defining reliability metrics, developing monitoring systems, establishing incident response processes, and collaborating on platform management and service level objectives. You will also focus on automation, infrastructure as code (IaC), and CI/CD pipelines, and collaborate with network and security teams. The role requires proficiency with the tools listed under Requirements.
Requirements
- Proficiency with SUSE, AKS, Linux, Azure Cloud, Grafana, Rancher, Terraform, Azure DevOps pipelines
- Strong experience with Grafana for observability and Opsgenie for incident response, with a focus on maintaining uptime and proactive alerting
- Proficiency in scripting (e.g., Bash, Python) and experience with Tailscale for secure networking
- Experience in identifying and remediating performance and security issues, focusing on proactive, long-term solutions
Responsibilities
- Establish and track reliability metrics such as latency, traffic, errors, and capacity, focusing on uptime across applications and products, with plans to expand monitoring to kiosk and edge networks
- Develop and refine monitoring systems using Grafana to ensure comprehensive visibility, focusing on continuous improvements in reliability
- Establish robust processes for incident response and root cause analysis, leveraging Opsgenie to ensure timely and structured responses
- Work with Tailscale, SUSE, and F5 to support secure, resilient network connectivity and load balancing
- Collaborate with IT leadership to define and maintain service level objectives (SLOs) and monitor performance against these standards
- Structure and optimize platform management with a focus on supporting uptime in our production environment
- Develop and maintain Terraform configurations for scalable, repeatable infrastructure deployment, focusing on minimizing manual tasks and ensuring resource consistency
- Work with DevOps to optimize CI/CD workflows using Azure DevOps, focusing on pipeline automation and deployment efficiency
- Automate repetitive tasks and enhance deployment processes within AKS and Azure environments, aiming to reduce potential deployment bottlenecks
- Partner with network engineers to optimize and maintain F5 load balancers and Palo Alto Networks/Panorama for secure, resilient network operations
- Collaborate with security teams to ensure network traffic and access patterns align with security best practices, integrating observability into network operations