Summary

Join Blackpoint Cyber, a leading cybersecurity firm, as a Site Reliability Engineer (SRE). You will design, build, and scale robust infrastructure, CI/CD pipelines, and build systems. Collaborate with cross-functional teams to enhance system reliability, performance, and automation. Champion a culture of innovation and continuous improvement. This role requires expertise in cloud infrastructure and automation, including tools such as Terraform, Kubernetes, and Kafka. The ideal candidate has strong problem-solving and communication skills and experience in agile environments. Blackpoint Cyber offers competitive benefits for eligible US employees.

Requirements

4+ years of proven experience as an SRE or in a similar role, with a strong focus on cloud infrastructure and automation
Excellent problem-solving skills with the ability to troubleshoot complex systems in production
Strong communication and collaboration skills, with experience working in agile environments
Expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt
Deep knowledge of AWS cloud services and best practices for designing secure and scalable architectures
Hands-on experience with Confluent Cloud and Kafka for distributed data streaming
Strong experience with Redis for caching and RDS for data storage
Strong experience with OpenSearch/Elasticsearch/ChaosSearch
Proficiency in monitoring and alerting using Prometheus, Grafana, and Alertmanager
Extensive experience managing Kubernetes clusters, including package management with Helm, deployment with ArgoCD, and service mesh configurations using Istio
Familiarity with Kustomize for Kubernetes resource configuration
Development experience in Node.js/Python/Go

Responsibilities

Design, build, and maintain highly scalable infrastructure using Terraform and Terragrunt to automate cloud resource provisioning
Manage and optimize AWS cloud environments for cost-efficiency, security, and high availability
Continuously improve infrastructure automation tools and methodologies to support scalability and maintainability
Manage and scale Kafka and Confluent Cloud platforms for real-time data streaming
Deploy and maintain Redis instances to support caching and real-time data processing workloads
Implement and maintain robust monitoring and alerting systems using Prometheus, Grafana, Alertmanager, and Opsgenie to ensure system reliability and visibility
Troubleshoot and resolve complex system issues, ensuring optimal performance and uptime
Manage Kubernetes clusters using tools like Helm, ArgoCD, Istio, and Kustomize to support modern infrastructure-as-code and continuous delivery practices
Enable feature flag management and safe, controlled rollouts using LaunchDarkly
Work closely with development teams to seamlessly integrate new features and services into the infrastructure
Foster a culture of continuous improvement by regularly evaluating and adopting emerging SRE tools, technologies, and best practices

Preferred Qualifications

Experience with multi-cloud environments (e.g., GCP, Azure)
Familiarity with security and compliance best practices in cloud and containerized environments
Knowledge of serverless architectures and CI/CD tools such as Jenkins and/or GitHub Actions

Benefits

Competitive health, vision, dental, and life insurance plans
A robust 401(k) plan
Discretionary Time Off

Site Reliability Engineer

Blackpoint Cyber

Remote

DevOps

Mid-level
