Data & Reporting SRE

ZILO
Summary
Join ZILO™, a technology company redefining possibilities in the global Transfer Agency sector, as a Site Reliability Engineer (SRE). You will be responsible for the reliability, performance, and operational excellence of real-time and batch data pipelines built on AWS, Apache Flink, Kafka, and Python. This role involves incident management, reliability and monitoring, architecture and automation, performance optimization, and collaboration and knowledge sharing. You will act as the first line of defense for data-related incidents, rapidly diagnose root causes, and implement resilient solutions. The position requires deep subject-matter expertise in data processing and reporting. ZILO™ offers a comprehensive benefits package including enhanced leave, private health care, life assurance, flexible working arrangements, an employee assistance program, a company pension, access to training and development, a buy and sell holiday scheme, and the opportunity for global mobility.
Requirements
- Experience with data processing and reporting
Responsibilities
- Serve as on-call escalation for data pipeline incidents, including real-time stream failures and batch job errors
- Rapidly analyze logs, metrics, and trace data to pinpoint failure points across AWS, Flink, Kafka, and Python layers
- Lead post-incident reviews: identify root causes, document findings, and drive corrective actions to closure
- Design, implement, and maintain robust observability for data pipelines: dashboards, alerts, distributed tracing
- Define SLOs/SLIs for data freshness, throughput, and error rates; continuously monitor and optimize against them (see the freshness-check sketch after this list)
- Automate capacity planning, scaling policies, and disaster-recovery drills for stream and batch environments
- Collaborate with data engineering and product teams to architect scalable, fault-tolerant pipelines using AWS services (e.g., Step Functions, EMR, Lambda, Redshift) integrated with Apache Flink and Kafka
- Troubleshoot and maintain Python-based applications
- Harden CI/CD for data jobs: implement automated testing of data schemas, versioned Flink jobs, and migration scripts
- Profile and tune streaming jobs: optimize checkpoint intervals, state backends, and parallelism settings in Flink (see the PyFlink sketch below)
- Analyze Kafka cluster health: tune broker configurations, partition strategies, and retention policies to meet SLAs (see the AdminClient sketch below)
- Leverage Python profiling and vectorized libraries to streamline batch analytics and report generation (see the vectorization sketch below)
- Act as the SME for the data & reporting stack: mentor peers and lead brown-bag sessions on best practices
- Contribute to runbooks, design docs, and on-call playbooks detailing common failure modes and recovery steps
- Work cross-functionally with DevOps, Security, and Product teams to align reliability goals and incident response workflows
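
To give the SLO item above a concrete flavour, here is a minimal data-freshness check in Python. The five-minute target, the timestamps, and the alert hook are illustrative assumptions, not details of ZILO's actual stack:

```python
# Minimal data-freshness SLI sketch. The freshness target and the
# alerting hook below are illustrative assumptions, not ZILO specifics.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLO = timedelta(minutes=5)  # assumed target: data <= 5 min old

def freshness_sli(latest_event_time: datetime) -> timedelta:
    """Return current staleness: now minus the newest ingested event."""
    return datetime.now(timezone.utc) - latest_event_time

def check_freshness(latest_event_time: datetime) -> bool:
    """True if the pipeline meets the freshness SLO right now."""
    staleness = freshness_sli(latest_event_time)
    if staleness > FRESHNESS_SLO:
        # A real setup would page through the alerting stack;
        # here we just report the breach.
        print(f"SLO breach: data is {staleness} old (target {FRESHNESS_SLO})")
        return False
    return True

if __name__ == "__main__":
    # Example: an event ingested 7 minutes ago breaches a 5-minute SLO.
    stale = datetime.now(timezone.utc) - timedelta(minutes=7)
    check_freshness(stale)
```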
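For the Flink tuning item, a minimal PyFlink sketch of the knobs involved (checkpoint interval, state backend, parallelism); the values are generic starting points, not tuned recommendations for any particular job:

```python
# Sketch of Flink checkpoint/state tuning via PyFlink. All values are
# illustrative starting points.
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()

# Exactly-once checkpoints every 60s; trade the interval off against
# recovery time and steady-state throughput.
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)

cfg = env.get_checkpoint_config()
cfg.set_min_pause_between_checkpoints(30_000)  # let the job make progress
cfg.set_checkpoint_timeout(120_000)            # fail slow checkpoints

# RocksDB keeps large state off-heap and supports incremental checkpoints.
env.set_state_backend(EmbeddedRocksDBStateBackend())

# Job-wide parallelism; per-operator overrides are also possible.
env.set_parallelism(4)
```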
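For the Kafka health item, a sketch using the confluent-kafka AdminClient to inspect brokers, partition counts, and a topic's retention setting; the broker address and topic name are placeholders:

```python
# Sketch of a Kafka health/config check with confluent-kafka's
# AdminClient. Broker address and topic name are placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Cluster view: broker count and per-topic partition counts.
metadata = admin.list_topics(timeout=10)
print(f"brokers: {len(metadata.brokers)}")
for name, topic in metadata.topics.items():
    print(f"{name}: {len(topic.partitions)} partitions")

# Inspect retention on one topic before judging it against the SLA.
resource = ConfigResource(ConfigResource.Type.TOPIC, "reports")
entries = admin.describe_configs([resource])[resource].result()
print("retention.ms =", entries["retention.ms"].value)
```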
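And for the batch-analytics item, a small illustration of the vectorization principle: replacing a per-row Python loop with a single columnar operation. The column names and report logic are invented for the example:

```python
# Vectorization sketch: columnar operations instead of per-row loops.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "units": rng.integers(1, 100, size=1_000_000),
    "price": rng.random(1_000_000) * 50,
})

# Slow, row-at-a-time style:
#   df["value"] = df.apply(lambda r: r["units"] * r["price"], axis=1)

# Vectorized equivalent: one columnar multiply, orders of magnitude faster.
df["value"] = df["units"] * df["price"]

# Report aggregation stays columnar too.
print(df["value"].sum())
```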
Benefits
- Enhanced leave – 38 days, inclusive of 8 UK public holidays
- Private Health Care including family cover
- Life Assurance – 5x salary
- Flexible working – work from home and/or in our London office
- Employee Assistance Program
- Company Pension (Salary Sacrifice options available)
- Access to training and development
- Buy and Sell holiday scheme
- The opportunity for “work from anywhere/global mobility”