Data & Reporting SRE

ZILO
Summary
Join ZILO™, a technology company redefining possibilities in the global Transfer Agency sector, as a Site Reliability Engineer (SRE). You will be responsible for the reliability, performance, and operational excellence of real-time and batch data pipelines built on AWS, Apache Flink, Kafka, and Python. This role involves incident management, reliability and monitoring, architecture and automation, performance optimization, and collaboration and knowledge sharing. You will act as the first line of defense for data-related incidents, rapidly diagnose root causes, and implement resilient solutions. The position requires deep subject-matter expertise in data processing and reporting. ZILO™ offers a comprehensive benefits package including enhanced leave, private health care, life assurance, flexible working arrangements, an employee assistance program, a company pension, access to training and development, a buy and sell holiday scheme, and the opportunity for global mobility.
Requirements
- Experience with data processing and reporting
Responsibilities
- Serve as on-call escalation for data pipeline incidents, including real-time stream failures and batch job errors
- Rapidly analyze logs, metrics, and trace data to pinpoint failure points across AWS, Flink, Kafka, and Python layers
- Lead post-incident reviews: identify root causes, document findings, and drive corrective actions to closure
- Design, implement, and maintain robust observability for data pipelines: dashboards, alerts, distributed tracing
- Define SLOs/SLIs for data freshness, throughput, and error rates; continuously monitor and optimize against them (see the freshness-check sketch after this list)
- Automate capacity planning, scaling policies, and disaster-recovery drills for stream and batch environments
- Collaborate with data engineering and product teams to architect scalable, fault-tolerant pipelines using AWS services (e.g., Step Functions, EMR, Lambda, Redshift) integrated with Apache Flink and Kafka
- Troubleshoot and maintain Python-based applications
- Harden CI/CD for data jobs: implement automated testing of data schemas, versioned Flink jobs, and migration scripts
- Profile and tune streaming jobs: optimize checkpoint intervals, state backends, and parallelism settings in Flink (see the PyFlink sketch below)
- Analyze Kafka cluster health: tune broker configurations, partition strategies, and retention policies to meet SLAs (see the AdminClient sketch below)
- Leverage Python profiling and vectorized libraries to streamline batch analytics and report generation (see the vectorization sketch below)
- Act as the SME for the data & reporting stack: mentor peers and lead brown-bag sessions on best practices
- Contribute to runbooks, design docs, and on-call playbooks detailing common failure modes and recovery steps
- Work cross-functionally with DevOps, Security, and Product teams to align reliability goals and incident response workflows
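
To give the SLO item above a concrete flavour, here is a minimal data-freshness check in Python. The five-minute target, the timestamps, and the alert hook are illustrative assumptions, not details of ZILO's actual stack:

```python
# Minimal data-freshness SLI sketch. The freshness target and the
# alerting hook below are illustrative assumptions, not ZILO specifics.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLO = timedelta(minutes=5)  # assumed target: data <= 5 min old

def freshness_sli(latest_event_time: datetime) -> timedelta:
    """Return current staleness: now minus the newest ingested event."""
    return datetime.now(timezone.utc) - latest_event_time

def check_freshness(latest_event_time: datetime) -> bool:
    """True if the pipeline meets the freshness SLO right now."""
    staleness = freshness_sli(latest_event_time)
    if staleness > FRESHNESS_SLO:
        # A real setup would page through the alerting stack;
        # here we just report the breach.
        print(f"SLO breach: data is {staleness} old (target {FRESHNESS_SLO})")
        return False
    return True

if __name__ == "__main__":
    # Example: an event ingested 7 minutes ago breaches a 5-minute SLO.
    stale = datetime.now(timezone.utc) - timedelta(minutes=7)
    check_freshness(stale)
```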
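For the Flink tuning item, a minimal PyFlink sketch of the knobs involved (checkpoint interval, state backend, parallelism); the values are generic starting points, not tuned recommendations for any particular job:

```python
# Sketch of Flink checkpoint/state tuning via PyFlink. All values are
# illustrative starting points.
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()

# Exactly-once checkpoints every 60s; trade the interval off against
# recovery time and steady-state throughput.
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)

cfg = env.get_checkpoint_config()
cfg.set_min_pause_between_checkpoints(30_000)  # let the job make progress
cfg.set_checkpoint_timeout(120_000)            # fail slow checkpoints

# RocksDB keeps large state off-heap and supports incremental checkpoints.
env.set_state_backend(EmbeddedRocksDBStateBackend())

# Job-wide parallelism; per-operator overrides are also possible.
env.set_parallelism(4)
```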
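For the Kafka health item, a sketch using the confluent-kafka AdminClient to inspect brokers, partition counts, and a topic's retention setting; the broker address and topic name are placeholders:

```python
# Sketch of a Kafka health/config check with confluent-kafka's
# AdminClient. Broker address and topic name are placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Cluster view: broker count and per-topic partition counts.
metadata = admin.list_topics(timeout=10)
print(f"brokers: {len(metadata.brokers)}")
for name, topic in metadata.topics.items():
    print(f"{name}: {len(topic.partitions)} partitions")

# Inspect retention on one topic before judging it against the SLA.
resource = ConfigResource(ConfigResource.Type.TOPIC, "reports")
entries = admin.describe_configs([resource])[resource].result()
print("retention.ms =", entries["retention.ms"].value)
```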
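And for the batch-analytics item, a small illustration of the vectorization principle: replacing a per-row Python loop with a single columnar operation. The column names and report logic are invented for the example:

```python
# Vectorization sketch: columnar operations instead of per-row loops.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "units": rng.integers(1, 100, size=1_000_000),
    "price": rng.random(1_000_000) * 50,
})

# Slow, row-at-a-time style:
#   df["value"] = df.apply(lambda r: r["units"] * r["price"], axis=1)

# Vectorized equivalent: one columnar multiply, orders of magnitude faster.
df["value"] = df["units"] * df["price"]

# Report aggregation stays columnar too.
print(df["value"].sum())
```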
Benefits
- Enhanced leave – 38 days, inclusive of 8 UK public holidays
- Private Health Care including family cover
- Life Assurance – 5x salary
- Flexible working – work from home and/or in our London office
- Employee Assistance Program
- Company Pension (Salary Sacrifice options available)
- Access to training and development
- Buy and Sell holiday scheme
- The opportunity for “work from anywhere/global mobility”