Summary
Join Blackpoint Cyber, a leading cybersecurity company experiencing rapid growth, as a Senior SRE Engineer. You will play a key role in designing, implementing, and maintaining our infrastructure and CI/CD pipelines. This position requires expertise in cloud infrastructure, automation, and technologies such as Terraform, AWS, Kafka, and Kubernetes. You will collaborate with cross-functional teams to ensure system reliability and efficiency. The ideal candidate has extensive SRE experience and a strong understanding of cloud security and scalability. Blackpoint Cyber offers competitive benefits, including health insurance, a 401(k) plan, and discretionary time off.
Requirements
- 8+ years of proven experience as a Senior SRE Engineer or in a similar role with a strong focus on cloud infrastructure and automation
- Expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt
- Deep knowledge of AWS cloud services and best practices for designing secure and scalable architectures
- Hands-on experience with Confluent Cloud and Kafka for distributed data streaming
- Strong experience with Redis for caching and RDS for data storage
- Strong experience with OpenSearch/Elasticsearch/ChaosSearch
- Proficiency in monitoring and alerting using Prometheus, Grafana, Alertmanager, and Opsgenie
- Experience with LaunchDarkly for feature flag management
- Extensive experience managing Kubernetes clusters, including package management with Helm, deployment with ArgoCD, and service mesh configurations using Istio
- Familiarity with Kustomize for Kubernetes resource configuration
- Excellent problem-solving skills with the ability to troubleshoot complex systems in production
- Strong communication and collaboration skills, with experience working in agile environments
Responsibilities
- Design, build, and maintain highly scalable infrastructure using Terraform and Terragrunt to automate cloud resource provisioning
- Manage cloud environments, particularly in AWS, ensuring cost optimization, security, and high availability
- Work with Confluent Cloud and Kafka to manage and scale our data streaming platforms
- Deploy and manage Redis instances for caching and real-time data processing
- Implement and maintain monitoring and alerting solutions using Prometheus, Grafana, Alertmanager, and Opsgenie to ensure system reliability
- Enable feature flag management and controlled rollouts using LaunchDarkly
- Manage Kubernetes clusters using Helm, ArgoCD, Istio, and Kustomize for continuous delivery and infrastructure-as-code practices
- Collaborate with development teams to ensure seamless integration of new services and features into our infrastructure
- Troubleshoot and resolve complex system issues, ensuring high performance and uptime
- Continuously improve automation tools, processes, and methodologies to enhance system scalability and maintainability
- Stay up-to-date with emerging SRE trends and technologies, ensuring the organization leverages the latest advancements
Preferred Qualifications
- Experience with multi-cloud environments (e.g., GCP, Azure)
- Familiarity with security best practices in cloud and containerized environments
- Knowledge of serverless architectures and CI/CD tools such as Jenkins and GitHub Actions
- Some development experience in Node.js, Python, or Go
Benefits
- Competitive Health, Vision, Dental, and Life Insurance plans
- A robust 401(k) plan
- Discretionary Time Off
- Other minor perks