Core & MLOps Team Lead

Zyte
Summary
Join Zyte, a globally distributed team building powerful, easy-to-use tools for web data extraction, as an experienced Team Lead to manage the Core & MLOps Squad. This hands-on technical leadership role demands expertise in MLOps, systems programming, and orchestration. You will lead a cross-functional team in designing and maintaining the scalable infrastructure powering Zyte. Responsibilities include designing the core platform, owning the model platform, building a 'Golden Path' for streamlined development, and ensuring MLOps excellence. Team management involves roadmap planning, delivery, mentoring, and fostering high engineering standards. The role requires collaboration with other teams and a commitment to platform thinking.
Requirements
- 5+ years of experience building distributed systems; 3+ years in MLOps/ML platform engineering (or equivalent impact)
- Knowledge of Linux/OS internals (process model, cgroups/namespaces), networking (TCP/IP, HTTP/2), concurrency, and performance profiling
- Deep understanding of Kubernetes (bonus: Mesos)
- Proficiency developing high-performance services in Java, Rust, Go, or C++ (bonus: familiarity with the Vert.x and Netty frameworks); strong Python skills
- Experience with GPU infrastructure (scheduling, containerization, optimization)
- Track record of designing and operating model platforms (registry, training, serving, monitoring) in production
- Demonstrated success leading technical teams and implementing organization-wide platform solutions
Responsibilities
- Design and evolve the core platform (Kubernetes, Mesos, GPU scheduling/autoscaling, distributed compute)
- Own the model platform: registry, experiment tracking, training orchestration, evaluation, serving, and monitoring
- Build the Golden Path: reference repos, a scaffold CLI, opinionated CI/CD pipelines, runtime contracts (health/metrics/tracing/SLOs), high-performance clients, circuit breakers, and other production‑ready defaults
- Operate a secure, multi‑tenant model registry and training platform with standardized experiment/evaluation harnesses
- Provide turnkey serving patterns (online + batch), drift/quality monitoring, and rollback playbooks
- Integrate public/open‑source AI capabilities as managed platform services with cost and data‑governance guardrails
- Run the squad: roadmap/prioritization, delivery, mentoring, and high engineering standards
- Partner with product engineering (Zyte API, Scrapy Cloud), Prod Ops, and Security on adoption and rollout plans
- Mentor the team and foster a platform-thinking mindset
Preferred Qualifications
- Streaming & workflows: Kafka plus Argo/Temporal/Airflow or equivalents
- eBPF‑based observability, perf tooling, or io_uring experience
- Cost optimization for ML/AI; multi‑tenant quotas and fairness
- Hands‑on experience authoring Golden Paths (service chassis/templates, CI/CD blueprints, CLI scaffolds)
- SRE practices (SLIs/SLOs, incident management)
Benefits
- Freedom and flexibility to work from wherever you do your best work, as we are a completely remote company