Principal Site Reliability Engineer

UltraViolet Cyber

💵 $170k-$200k
📍 Remote - Worldwide

Summary

Join UltraViolet Cyber as a Principal Site Reliability Engineer (SRE) and play a key role in improving the scalability, reliability, and security of our cloud infrastructure. You will work with a team of experts to keep our systems resilient and efficient through automation and modern DevOps practices. The role calls for hands-on expertise, leadership, and continuous learning as we mature our infrastructure and reliability processes. Your responsibilities will span system reliability and performance, Kubernetes and EKS management, infrastructure as code, CI/CD pipelines, monitoring and incident response, security and compliance, capacity planning and scaling, cross-functional collaboration and leadership, incident management and root cause analysis, and cost optimization. We offer a competitive salary and a comprehensive benefits package.

Requirements

  • Extensive experience in AWS, with deep expertise in managing EKS clusters, networking, IAM, security groups, and other core AWS services
  • Strong proficiency in Kubernetes (EKS, Helm, kubectl, Operators) with a proven track record of deploying, maintaining, and scaling containerized applications
  • Hands-on experience in DevOps tools & methodologies, including Terraform, Ansible or SaltStack, Helm, GitOps, ArgoCD, and CI/CD platforms such as GitHub Actions or Jenkins
  • Proficiency in scripting and automation using Python, Bash, or Golang to enhance system reliability and efficiency
  • Experience with observability and monitoring tools, including Prometheus, Grafana, Loki, or AWS CloudWatch
  • Deep understanding of networking principles, including DNS, VPC, Load Balancers, VPNs, and Service Mesh architectures
  • Strong background in security best practices, including IAM policies, encryption, secrets management, and vulnerability scanning (AWS KMS, HashiCorp Vault, etc.)
  • Experience working with highly available, distributed systems, including microservices architecture and cloud-native applications
  • Previous experience in an Agile or DevOps culture, promoting collaboration, automation, and iterative improvements
  • Excellent troubleshooting skills, with the ability to analyze complex system failures and drive solutions
  • Strong communication and leadership skills, with the ability to mentor junior engineers and collaborate effectively with cross-functional teams
  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience

Responsibilities

  • Ensure the availability, performance, scalability, and security of our cloud-based services using best practices in SRE and DevOps
  • Architect, deploy, and maintain Kubernetes clusters, primarily using Amazon Elastic Kubernetes Service (EKS)
  • Automate infrastructure provisioning, configuration, and management using Terraform, Pulumi, or similar tools
  • Build, maintain, and enhance continuous integration and continuous deployment (CI/CD) pipelines, optimizing deployment workflows for speed and reliability
  • Design and implement comprehensive monitoring, alerting, and logging solutions using tools such as Prometheus, Grafana, and CloudWatch to proactively identify and address system issues
  • Enforce security best practices, implement access controls, and ensure compliance with industry standards
  • Analyze system performance and scalability, implementing proactive strategies to accommodate growth and prevent downtime
  • Work closely with Engineering and Product teams to integrate reliability principles into the software development lifecycle
  • Lead post-mortem investigations for critical incidents, identifying actionable improvements to enhance system resilience
  • Assess and optimize cloud costs while maintaining performance and reliability, leveraging AWS savings plans, right-sizing resources, and improving infrastructure efficiency

Benefits

  • 401(k), including an employer match of 100% of the first 3% contributed and 50% of the next 2% contributed (a maximum employer match of 4% when you contribute 5% or more)
  • Medical, Dental, and Vision Insurance (available on the 1st day of the month following your first day of employment)
  • Group Term Life, Short-Term Disability, Long-Term Disability
  • Voluntary Life, Hospital Indemnity, Accident, and/or Critical Illness
  • Participation in the Discretionary Time Off (DTO) Program
  • 11 Paid Holidays Annually
  • Salary range: $170,000 - $200,000 a year
