Senior Manager, Platform Engineering

SMG - Service Management Group
Summary
Join SMG, a leading experience management provider, as a Senior Manager of Platform Engineering. Lead and mentor a multidisciplinary team, fostering collaboration and continuous improvement. Define and enhance CI/CD workflows for rapid and safe software delivery. Establish Infrastructure-as-Code standards and robust development environments. Deliver tooling and governance to accelerate development while maintaining compliance. Manage vendor relationships and implement code vulnerability management processes. This role requires a Bachelor's or Master's degree in a related field, 10+ years of progressive experience in software, platform, or reliability engineering (including at least 3 years in people management), and expertise in cloud-native platforms, CI/CD tooling, and SRE principles. SMG offers a remote-first work environment, unlimited PTO, and a diverse, supportive team.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field
- 10+ years of progressive experience in software, platform, or reliability engineering, with at least 3 years in a people‑management role
- Hands‑on experience designing and operating cloud‑native platforms on AWS, Azure, or GCP
- Expertise with CI/CD tooling (e.g., GitHub Actions, Jenkins, ArgoCD), Infrastructure‑as‑Code frameworks (e.g., Terraform, CloudFormation), and container orchestration (e.g., Kubernetes)
- Deep understanding of SRE principles, reliability metrics (SLO/SLA, error budgets), observability stacks (e.g., Prometheus, Grafana, ELK), and incident management best practices
- Demonstrated success implementing security and vulnerability management programs across the SDLC
- Proven track record driving cost optimization, resource rightsizing, and FinOps initiatives
- Exceptional communication, stakeholder management, and leadership skills
- Ability to thrive in a remote‑first, fast‑paced environment and influence across functional boundaries
Responsibilities
- Lead, mentor, and develop a multidisciplinary Platform Engineering team, fostering a culture of collaboration, ownership, and continuous improvement
- Define and continuously improve CI/CD workflows and pipelines that enable rapid, safe, and repeatable delivery of software
- Establish Infrastructure‑as‑Code (IaC) standards, reusable module libraries, and governance checks to ensure consistency across environments
- Provide and support robust local development environments that mirror production, boosting developer productivity
- Deliver tooling and governance enablement—including guardrails, automated policy enforcement, and self‑service platforms—to accelerate development velocity while maintaining compliance
- Own vendor evaluations for platform tooling, frameworks, and managed services; negotiate contracts and manage vendor relationships
- Define, document, and enforce organization‑wide Engineering, Troubleshooting, and Runbook standards
- Implement and operate code vulnerability management processes and tooling, ensuring remediation SLAs are met
- Establish and maintain an authoritative component inventory / software bill of materials (SBOM)
- Design and maintain an Engineering Documentation Framework to ensure knowledge is current, discoverable, and actionable
- Manage the lifecycle of internal engineering frameworks—from selection through deprecation—ensuring version currency and support
- Partner with FinOps to define and operationalize a cost tagging framework; champion rightsizing, cost analysis, and patch management of shared platform services
- Design and lead Incident and Problem Management architecture and processes, including on‑call escalation and rotation management
- Own the Postmortem framework, ensuring blameless retrospectives, action item tracking, and learning dissemination
- Define and maintain the strategy for Observability and Monitoring (metrics, logs, traces, dashboards, alerts)
- Develop and routinely test Disaster Recovery plans, achieving or surpassing agreed RTO/RPO targets
- Drive Site and Service Reliability through SLO/SLA definition, error budget policies, and proactive reliability engineering
- Measure and optimize capacity and performance of platform services through data‑driven analysis and forecasting
- Define policies for Scheduled Changes and Change Management to minimize risk and downtime
- Perform cost analysis and reporting, surfacing insights to engineering and finance stakeholders
Benefits
- Remote-first company (fully remote)
- Unlimited PTO
- Tech provided
