Data Engineer

Wikimedia Foundation
Summary
Join the Wikimedia Foundation's Data Platform team as a Data Engineer and help shape how Wikimedia's vast data ecosystem serves internal teams and the global community. You will contribute to unifying data systems across the Foundation to deliver scalable solutions that support the open knowledge movement: designing and building data pipelines, monitoring data quality, supporting data governance and lineage, and collaborating with peers to improve the shared data platform. You will also enhance operational excellence by identifying and implementing improvements in system reliability, maintainability, and performance. The role is part of a geographically distributed team and directly impacts billions of users while advancing the accessibility of open knowledge. The Wikimedia Foundation is a remote-first organization.
Requirements
- 3+ years of data engineering experience, with exposure to on-premises systems (e.g., Spark, Hadoop, HDFS)
- Understanding of engineering best practices with a strong emphasis on writing maintainable and reliable code
- Hands-on experience in troubleshooting systems and pipelines for performance and scaling
- Working experience with data pipeline tools like Airflow, Kafka, Spark, and Hive
- Proficiency in Python or Java/Scala, with working knowledge of the language's development tools and ecosystem
- Knowledge of SQL and experience with various database/query dialects (e.g., MariaDB, HiveQL, CQL, Spark SQL, Presto)
- Working knowledge of CI/CD processes and software containerization
- Familiarity with stream processing frameworks like Spark Streaming or Flink
- Good communication and collaboration skills to interact effectively within and across teams
- Ability to produce clear, well-documented technical designs and articulate ideas to both technical and non-technical stakeholders
Responsibilities
- Designing and Building Data Pipelines: Develop scalable, robust infrastructure and processes using tools such as Airflow, Spark, and Kafka
- Monitoring and Alerting for Data Quality: Implement systems to detect and address potential data issues promptly
- Supporting Data Governance and Lineage: Assist in designing and implementing solutions to track and manage data across pipelines
- Evolving the Shared Data Platform: Collaborate with peers to improve the platform, enabling use cases such as product analytics, bot detection, and image classification
- Enhancing Operational Excellence: Identify and implement improvements in system reliability, maintainability, and performance
Preferred Qualifications
- Exposure to architectural/system design or technical ownership
- Experience in data governance, data lineage, and data quality initiatives
- Familiarity with additional technologies such as Kubernetes, Flink, Iceberg, Druid, Presto, Cassandra
- Working knowledge of AI development tooling and AI applications in software engineering
Benefits
- The anticipated annual pay range for this position for applicants based within the United States is US$101,102 to US$156,045; the pay offered is determined by multiple individualized factors, including cost of living in the applicant's location
- For applicants located outside of the US, the pay range will be adjusted for the country of hire