Data Engineer

Wikimedia Foundation
Summary
Join the Wikimedia Foundation's Data Platform team as a Data Engineer and help shape how Wikimedia's vast data ecosystem serves internal teams and the global community. You will contribute to unifying data systems across the Foundation to deliver scalable solutions that support the open knowledge movement: designing and building data pipelines, monitoring data quality, supporting data governance and lineage, and collaborating with peers to improve the shared data platform. You will also enhance operational excellence by identifying and implementing improvements in system reliability, maintainability, and performance. The role is part of a geographically distributed team and directly impacts billions of users while advancing the accessibility of open knowledge. The Wikimedia Foundation is a remote-first organization.
Requirements
- 3+ years of data engineering experience, with exposure to on-premises systems (e.g., Spark, Hadoop, HDFS)
- Understanding of engineering best practices with a strong emphasis on writing maintainable and reliable code
- Hands-on experience in troubleshooting systems and pipelines for performance and scaling
- Working experience with data pipeline tools like Airflow, Kafka, Spark, and Hive
- Proficiency in Python or Java/Scala, with working knowledge of the language's development tools and ecosystem
- Knowledge of SQL and experience with various database/query dialects (e.g., MariaDB, HiveQL, CQL, Spark SQL, Presto)
- Working knowledge of CI/CD processes and software containerization
- Familiarity with stream processing frameworks like Spark Streaming or Flink
- Good communication and collaboration skills to interact effectively within and across teams
- Ability to produce clear, well-documented technical designs and articulate ideas to both technical and non-technical stakeholders
Responsibilities
- Designing and Building Data Pipelines: Develop scalable, robust infrastructure and processes using tools such as Airflow, Spark, and Kafka
- Monitoring and Alerting for Data Quality: Implement systems to detect and address potential data issues promptly
- Supporting Data Governance and Lineage: Assist in designing and implementing solutions to track and manage data across pipelines
- Evolving the Shared Data Platform: Collaborate with peers to improve the platform, enabling use cases such as product analytics, bot detection, and image classification
- Enhancing Operational Excellence: Identify and implement improvements in system reliability, maintainability, and performance
Preferred Qualifications
- Exposure to architectural/system design or technical ownership
- Experience in data governance, data lineage, and data quality initiatives
- Familiarity with additional technologies such as Kubernetes, Flink, Iceberg, Druid, Presto, Cassandra
- Working knowledge of AI development tooling and AI applications in software engineering
Benefits
- The anticipated annual pay range for this position for applicants based within the United States is US$101,102 to US$156,045; the pay offered is determined by multiple individualized factors, including cost of living in the applicant's location
- For applicants located outside of the US, the pay range will be adjusted for the country of hire