Senior Site Reliability Engineer


Coalfire

💵 $78k-$135k
📍 Remote - United States

Job highlights

Summary

Join Coalfire's Cloud Services team as a Senior Site Reliability Engineer! This remote (US-based) position offers a unique blend of cloud infrastructure administration, site reliability engineering, security operations, and vulnerability management. You will work with leading cloud software companies to keep their SaaS products reliable and scalable. As a subject matter expert, you'll use automation, orchestration, and configuration management to optimize client environments across AWS, Azure, and GCP. You will collaborate with client teams, maintain infrastructure, and provide 24/7 support. This role requires strong technical skills, hands-on experience with Infrastructure-as-Code, CI/CD, container, and monitoring tools, and a passion for problem-solving.

Requirements

  • BS or above in related Information Technology field or equivalent combination of education and experience
  • 5+ years of experience in 24x7x365 production operations
  • 5+ years of experience supporting cloud operations and automation in AWS, Azure, or GCP (and aligned certifications)
  • 5+ years of experience with Infrastructure-as-Code and orchestration/automation tools such as Terraform and Ansible
  • Expert-level experience with IaaS platform capabilities and services (cloud certifications expected)
  • Strong experience working within an automated CI/CD pipeline for release development, testing, remediation, and deployment
  • Strong experience working with containers and container orchestration solutions such as Docker, Kubernetes, EKS, and/or ECS
  • Strong experience with ticketing tools such as Jira and ServiceNow
  • Experience using environment analytics tools such as Splunk and Elastic Stack for querying, monitoring, and alerting
  • Strong experience in multiple scripting languages (Bash, Python, PowerShell)
  • Excellent communication, organizational, and problem-solving skills in a dynamic environment
  • Effective documentation skills, including technical diagrams and written descriptions
  • Ability to work independently and as part of a team with professional attitude and demeanor

Responsibilities

  • Become a member of a highly collaborative engineering team offering a unique blend of Cloud Infrastructure Administration, Site Reliability Engineering, Security Operations, and Vulnerability Management across multiple clients
  • Coordinate with client product teams, engineering team members, and other stakeholders to monitor and maintain a secure and resilient cloud-hosted infrastructure to established SLAs in both production and non-production environments
  • Be a subject matter expert in developing and implementing automated orchestration and configuration management techniques. Deeply understand the design, deployment, and management of secure and compliant enterprise servers, network infrastructure, boundary protection, and cloud architectures using Infrastructure-as-Code
  • Create, maintain, and peer review automated orchestration and configuration management codebases, as well as Infrastructure-as-Code codebases. Maintain IaC tooling and versioning within Client environments
  • Define processes for, implement, and upgrade client environments with CI/CD infrastructure code, and provide and facilitate internal feedback to development teams on environment requirements and necessary changes
  • Own clients across AWS, Azure, and GCP, serving as an SME and optimizing their unique native services in client environments
  • Configure and tune cloud-based tools, and manage cost, security, and compliance for the client's environments
  • Identify repetitive tasks and areas for improvement, and develop technical solutions to automate those tasks and enhance CMS offerings
  • Respond to environment-specific alerts, and review dashboards via analytics tools such as Splunk and Elastic Stack
  • Work closely with client DevOps and product teams to provide 24x7x365 support to environments through Client ticketing systems
  • Lead the definition, testing, and validation of incident response and disaster recovery documentation and exercises
  • Participate in on-call rotations as needed to support Client critical events that may fall outside of business hours
  • Serve as an SME for testing and data reviews that evaluate the effectiveness of current security and operational measures, and remediate deviations from those measures
  • Author and maintain detailed diagrams representing the Client's cloud architecture
  • Create, maintain, and peer review standard operating procedures, operational runbooks, technical documents, and troubleshooting guidelines

Preferred Qualifications

  • Previous experience in a consulting role supporting dynamic, fast-paced environments
  • Previous experience supporting a 24x7x365 highly-available environment for a SaaS vendor
  • Experience contributing to security incident handling and investigation, and/or system scenario recreation
  • Cloud-based networking experience (Palo Alto, Cisco ASAv, etc.)
  • Familiarity with frameworks such as FedRAMP, FISMA, SOC, ISO, HIPAA, HITRUST, PCI, etc.
  • Familiarity with configuration baseline standards such as CIS Benchmarks & DISA STIG
  • Knowledge of encryption technologies (e.g., SSL and PKI)
  • Experience with establishing, administering, and monitoring CI/CD pipelines or application deployments in cloud native environments
  • Strong experience with diagramming (Visio, Lucidchart, etc.)
  • Application development experience for cloud-based systems

Benefits

  • Flexible work model
  • Paid parental leave
  • Flexible time off
  • Certification and training reimbursement
  • Digital mental health and wellbeing support membership
  • Comprehensive insurance options
  • Annual incentive
  • Commission
  • Recognition programs

