Manager, Infrastructure Engineering

Wiser
Summary
Join Wiser Solutions as a Manager, Infrastructure Engineering, leading a team that builds and maintains the core infrastructure of its omnichannel retail intelligence platform. You will work remotely from Canada, collaborating with engineering and product leadership to craft long-term roadmaps and quarterly deliverables. Responsibilities include leading developer experience and cloud infrastructure, building developer-centric platforms, fostering a developer-centric product mindset, and driving AI-first infrastructure innovation. You will also champion infrastructure automation, implement SRE best practices, lead platform consolidation, enhance observability and reliability, and collaborate across engineering teams. The role involves managing a globally distributed, high-performing team and fostering a culture of DevOps and ownership. This position requires extensive experience in infrastructure, SRE, or platform engineering, with a strong background in cloud-native systems and DevOps.
Requirements
- 10+ years of professional experience in infrastructure, SRE, or platform engineering, with 2+ years in a people management or technical leadership role
- Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
- Proven agile leadership experience, including mentoring and coaching high-performing teams, with demonstrated ability to recruit and develop top engineering talent
- Strong ability to lead sprint planning, manage execution roadmaps, and drive cross-functional collaboration
- Experience treating internal platforms as products, gathering developer feedback, and applying product management principles to infrastructure services
- Experience integrating disparate systems and platforms, ideally in post-acquisition environments or complex legacy migration scenarios
- Proven track record of building internal platforms, CI/CD systems, and developer tooling that improve engineering productivity
- Deep understanding of cloud-native systems, container orchestration (e.g., Kubernetes), and Infrastructure as Code (IaC)
- Hands-on experience migrating legacy infrastructure to modern cloud environments (e.g., AWS, Kubernetes)
- Strong operational expertise in high-availability systems, distributed architectures, and incident response
- Security-first mindset with working knowledge of SOC 2, PCI DSS, and other compliance standards
- Experience or strong interest in leveraging AI/ML for infrastructure optimization, automated troubleshooting, predictive operations, or self-healing systems
- Proficient in Linux system administration and comfortable working at the command line
- Deep familiarity with SRE practices including SLIs, SLOs, error budgets, and reliability reviews
- Experience with DevOps and automation tools such as Vault, Terraform, Atlantis, GitHub Actions, and Kubernetes
- Strong scripting skills, especially in Bash (Python or Go experience is a plus)
- Passion for developer experience, with attention to usability, performance, and documentation
- Excellent communication skills, both verbal and written, across technical and non-technical audiences
- Willingness to participate in after-hours support for critical incidents as needed
- Strong background in systems operations, reliability engineering, and cloud infrastructure best practices
Responsibilities
- Lead Developer Experience and Cloud Infrastructure: Manage a team focused on improving the end-to-end experience for developers—from local development to production—while ensuring a stable and scalable cloud-native platform
- Build Developer-Centric Platforms: Design and evolve internal platforms, CI/CD pipelines, and tooling that empower product engineers to ship features faster, safely, and with confidence
- Foster Developer-Centric Product Thinking: Gather developer feedback, measure platform adoption metrics, and iterate based on user needs to ensure your platforms truly serve internal customers
- Drive AI-First Infrastructure Innovation: Lead AI-first approaches to infrastructure automation, predictive scaling, intelligent alerting, and self-healing systems to reduce toil and improve reliability
- Champion Infrastructure Automation: Drive complete automation of infrastructure provisioning, configuration, deployment, and monitoring to enable repeatability, self-service, and reliability
- Implement SRE Best Practices: Introduce and uphold Site Reliability Engineering principles, including error budgets, incident response, and toil reduction to improve system resilience and uptime
- Lead Platform Consolidation: Spearhead platform consolidation efforts resulting from multiple acquisitions, managing technical debt and legacy system migrations while maintaining operational excellence
- Enhance Observability and Reliability: Increase system observability through comprehensive metrics, tracing, and alerting to enable rapid detection and resolution of production issues
- Collaborate Across Engineering: Partner closely with application developers, QA, security, and product teams to ensure infrastructure and developer platforms meet evolving business needs
- Improve SLAs and Operational Excellence: Continuously improve service level indicators (SLIs), objectives (SLOs), and agreements (SLAs) across systems to meet or exceed uptime goals
- Manage a Distributed High-Performing Team: Lead a globally distributed team of infrastructure, DevEx, and SRE engineers; recruit, mentor, and grow the team to meet technical and leadership goals
- Foster a Culture of DevOps and Ownership: Promote ownership and self-service among development teams by building tools and platforms that support full lifecycle responsibility
Preferred Qualifications
- Experience managing large-scale deployments of PostgreSQL, MongoDB, RabbitMQ, or Elasticsearch
- Background in operating hosted or hybrid data center environments
- Familiarity with Windows Server infrastructure, especially in hybrid cloud setups
- Exposure to time-series databases or event-sourced system architectures
- Advanced experience with AI/ML-powered observability, automation, or self-healing infrastructure
- Active participation in incident retrospectives, with a track record of driving systemic improvements
Benefits
- Performance-based discretionary bonuses and variable pay plans are available for some positions