Senior Manager, Platform Engineering

SMG - Service Management Group

📍Remote - Worldwide

Summary

Join SMG, a leading experience management provider, as a Senior Manager of Platform Engineering. Lead and mentor a multidisciplinary team, fostering collaboration and continuous improvement. Define and enhance CI/CD workflows for rapid and safe software delivery. Establish Infrastructure-as-Code standards and robust development environments. Deliver tooling and governance to accelerate development while maintaining compliance. Manage vendor relationships and implement code vulnerability management processes. This role requires a Bachelor's or Master's degree in a related field, 10+ years of progressive experience in software engineering (with 3+ years in people management), and expertise in cloud-native platforms, CI/CD tooling, and SRE principles. SMG offers a remote-first work environment, unlimited PTO, and a diverse, supportive team.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field
  • 10+ years of progressive experience in software, platform, or reliability engineering, with at least 3 years in a people‑management role
  • Hands‑on experience designing and operating cloud‑native platforms on AWS, Azure, or GCP
  • Expertise with CI/CD tooling (e.g., GitHub Actions, Jenkins, ArgoCD), Infrastructure‑as‑Code frameworks (e.g., Terraform, CloudFormation), and container orchestration (e.g., Kubernetes)
  • Deep understanding of SRE principles, reliability metrics (SLO/SLA, error budgets), observability stacks (e.g., Prometheus, Grafana, ELK), and incident management best practices
  • Demonstrated success implementing security and vulnerability management programs across the SDLC
  • Proven track record driving cost optimization, resource rightsizing, and FinOps initiatives
  • Exceptional communication, stakeholder management, and leadership skills
  • Ability to thrive in a remote‑first, fast‑paced environment and influence across functional boundaries

Responsibilities

  • Lead, mentor, and develop a multidisciplinary Platform Engineering team, fostering a culture of collaboration, ownership, and continuous improvement
  • Define and continuously improve CI/CD workflows and pipelines that enable rapid, safe, and repeatable delivery of software
  • Establish Infrastructure‑as‑Code (IaC) standards, reusable module libraries, and governance checks to ensure consistency across environments
  • Provide and support robust local development environments that mirror production, boosting developer productivity
  • Deliver tooling and governance enablement—including guardrails, automated policy enforcement, and self‑service platforms—to accelerate development velocity while maintaining compliance
  • Own vendor evaluations for platform tooling, frameworks, and managed services; negotiate contracts and manage vendor relationships
  • Define, document, and enforce organization‑wide Engineering, Troubleshooting, and Runbook standards
  • Implement and operate code vulnerability management processes and tooling, ensuring remediation SLAs are met
  • Establish and maintain an authoritative component inventory / software bill of materials (SBOM)
  • Design and maintain an Engineering Documentation Framework to ensure knowledge is current, discoverable, and actionable
  • Manage the lifecycle of internal engineering frameworks—from selection through deprecation—ensuring version currency and support
  • Partner with FinOps to define and operationalize a cost tagging framework; champion rightsizing, cost analysis, and patch management of shared platform services
  • Design and lead Incident and Problem Management architecture and processes, including on‑call escalation and rotation management
  • Own the Postmortem framework, ensuring blameless retrospectives, action item tracking, and learning dissemination
  • Define and maintain the strategy for Observability and Monitoring (metrics, logs, traces, dashboards, alerts)
  • Develop and routinely test Disaster Recovery plans, achieving or surpassing agreed RTO/RPO targets
  • Drive Site and Service Reliability through SLO/SLA definition, error budget policies, and proactive reliability engineering
  • Measure and optimize capacity and performance of platform services through data‑driven analysis and forecasting
  • Define policies for Scheduled Changes and Change Management to minimize risk and downtime
  • Perform cost analysis and reporting, surfacing insights to engineering and finance stakeholders

Benefits

  • Remote-first company (fully remote)
  • Unlimited PTO
  • Tech provided
