Site Reliability Engineer at Pavilion Payments

Summary

Join Pavilion Payments as its first Site Reliability Engineer (SRE) and play a crucial role in building resilient infrastructure and ensuring high availability across its systems. You will work closely with IT teams to implement best practices in system reliability, observability, and automated response. The position emphasizes reliability, platform management, and network security. Key responsibilities include establishing reliability metrics, developing monitoring systems, defining incident response processes, and collaborating on platform management and service level objectives. You will also focus on automation, infrastructure as code (IaC), and CI/CD pipelines, and collaborate with network and security teams. The role requires proficiency with the tools listed under Requirements, including Linux, Azure, Grafana, and Terraform.

Requirements

Proficiency with SUSE, AKS, Linux, Azure Cloud, Grafana, Rancher, Terraform, and Azure DevOps pipelines
Strong experience with Grafana for observability and Opsgenie for incident response, with a focus on maintaining uptime and proactive alerting
Proficiency in scripting (e.g., Bash, Python) and experience with Tailscale for secure networking
Experience in identifying and remediating performance and security issues, focusing on proactive, long-term solutions

Responsibilities

Establish and track reliability metrics such as latency, traffic, errors, and capacity, focusing on uptime across applications and products, with plans to expand monitoring to kiosk and edge networks
Develop and refine monitoring systems using Grafana to ensure comprehensive visibility, focusing on continuous improvements in reliability
Establish robust processes for incident response and root cause analysis, using Opsgenie to ensure timely and structured responses
Work with Tailscale, SUSE, and F5 to support secure, resilient network connectivity and load balancing
Collaborate with IT leadership to define and maintain service level objectives (SLOs) and monitor performance against these standards
Structure and optimize platform management with a focus on supporting uptime in our production environment
Develop and maintain Terraform configurations for scalable, repeatable infrastructure deployment, focusing on minimizing manual tasks and ensuring resource consistency
Work with DevOps to optimize CI/CD workflows using Azure DevOps, focusing on pipeline automation and deployment efficiency
Automate repetitive tasks and enhance deployment processes within AKS and Azure environments, aiming to reduce potential deployment bottlenecks
Partner with network engineers to optimize and maintain F5 load balancers and Palo Alto Networks/Panorama for secure, resilient network operations
Collaborate with security teams to ensure network traffic and access patterns align with security best practices, integrating observability into network operations

Remote

DevOps

Mid-level
