Summary

Join UltraViolet Cyber as a Principal Site Reliability Engineer (SRE) and play a key role in improving the scalability, reliability, and security of our cloud infrastructure. You will work with a team of experts to keep our systems resilient and efficient through automation and modern DevOps practices. The role calls for hands-on expertise, leadership, and continuous learning as we mature our infrastructure and reliability processes. Your responsibilities span system reliability and performance, Kubernetes and EKS management, infrastructure as code, CI/CD pipelines, monitoring and incident response, security and compliance, capacity planning and scaling, cross-functional collaboration and leadership, incident management and root cause analysis, and cost optimization. We offer a competitive salary and a comprehensive benefits package.

Requirements

Extensive experience in AWS, with deep expertise in managing EKS clusters, networking, IAM, security groups, and other core AWS services
Strong proficiency in Kubernetes (EKS, Helm, kubectl, Operators) with a proven track record of deploying, maintaining, and scaling containerized applications
Hands-on experience in DevOps tools & methodologies, including Terraform, Ansible or SaltStack, Helm, GitOps, ArgoCD, and CI/CD platforms such as GitHub Actions or Jenkins
Proficiency in scripting and automation using Python, Bash, or Golang to enhance system reliability and efficiency
Experience with observability and monitoring tools, including Prometheus, Grafana, Loki, or AWS CloudWatch
Deep understanding of networking principles, including DNS, VPC, Load Balancers, VPNs, and Service Mesh architectures
Strong background in security best practices, including IAM policies, encryption, secrets management, and vulnerability scanning (AWS KMS, HashiCorp Vault, etc.)
Experience working with highly available, distributed systems, including microservices architecture and cloud-native applications
Previous experience in an Agile or DevOps culture, promoting collaboration, automation, and iterative improvements
Excellent troubleshooting skills, with the ability to analyze complex system failures and drive solutions
Strong communication and leadership skills, with the ability to mentor junior engineers and collaborate effectively with cross-functional teams
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience

Responsibilities

Ensure the availability, performance, scalability, and security of our cloud-based services using best practices in SRE and DevOps
Architect, deploy, and maintain Kubernetes clusters, primarily using Amazon Elastic Kubernetes Service (EKS)
Automate infrastructure provisioning, configuration, and management using Terraform, Pulumi, or similar tools
Build, maintain, and enhance continuous integration and continuous deployment (CI/CD) pipelines, optimizing deployment workflows for speed and reliability
Design and implement comprehensive monitoring, alerting, and logging solutions using tools such as Prometheus, Grafana, and CloudWatch to proactively identify and address system issues
Enforce security best practices, implement access controls, and ensure compliance with industry standards
Analyze system performance and scalability, implementing proactive strategies to accommodate growth and prevent downtime
Work closely with Engineering and Product teams to integrate reliability principles into the software development lifecycle
Lead post-mortem investigations for critical incidents, identifying actionable improvements to enhance system resilience
Assess and optimize cloud costs while maintaining performance and reliability, leveraging AWS savings plans, right-sizing resources, and improving infrastructure efficiency

Benefits

401(k), including an employer match of 100% of the first 3% contributed and 50% of the next 2% contributed
Medical, Dental, and Vision Insurance (available on the 1st day of the month following your first day of employment)
Group Term Life, Short-Term Disability, Long-Term Disability
Voluntary Life, Hospital Indemnity, Accident, and/or Critical Illness
Participation in the Discretionary Time Off (DTO) Program
11 Paid Holidays Annually
Salary range: $170,000 - $200,000 a year
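
The 401(k) match listed above is tiered, so a short worked example may help. The sketch below assumes the percentages are fractions of annual salary, which the posting does not state. The function name `employer_match` and the $100,000 salary are hypothetical, chosen only for illustration.

```python
def employer_match(salary: float, employee_rate: float) -> float:
    """Return the annual employer 401(k) match in dollars.

    Assumption (not stated in the posting): contribution percentages
    are measured against annual salary.
    """
    if salary < 0 or employee_rate < 0:
        raise ValueError("salary and employee_rate must be non-negative")

    # Tier 1: the first 3% contributed is matched at 100%.
    first_tier = min(employee_rate, 0.03)
    # Tier 2: the next 2% contributed is matched at 50%.
    second_tier = min(max(employee_rate - 0.03, 0.0), 0.02)

    return round(salary * (first_tier * 1.0 + second_tier * 0.5), 2)


if __name__ == "__main__":
    # Hypothetical salary; contribution rates from 2% to 10%.
    for rate in (0.02, 0.04, 0.05, 0.10):
        print(f"{rate:.0%} contributed -> {employer_match(100_000, rate):,.2f} matched")
```

Under that assumption, the match tops out at 4% of salary once the employee contributes 5% or more; contributing beyond 5% does not increase the employer's share.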

Principal Site Reliability Engineer

UltraViolet Cyber

Remote

DevOps

Principal
