In a Nutshell
Contribute to the Data Platform Engineering team’s effort to unify data systems across the Wikimedia Foundation and deliver scalable solutions.
Responsibilities
- Designing and Building Data Pipelines: Develop scalable, robust infrastructure and processes using tools such as Airflow, Spark, and Kafka (a minimal sketch of this kind of work follows this list).
- Monitoring and Alerting for Data Quality: Implement systems to detect and address potential data issues promptly.
- Supporting Data Governance and Lineage: Assist in designing and implementing solutions to track and manage data across pipelines.
- Evolving the Shared Data Platform: Collaborate with peers to improve and evolve the shared data platform, enabling use cases like product analytics, bot detection, and image classification.
- Enhancing Operational Excellence: Identify and implement improvements in system reliability, maintainability, and performance.
Skillset
- 3+ years of data engineering experience, with exposure to on-premises systems (e.g., Spark, Hadoop, HDFS).
- Understanding of engineering best practices with a strong emphasis on writing maintainable and reliable code.
- Hands-on experience in troubleshooting systems and pipelines for performance and scaling.
- Desirable: Exposure to architectural/system design or technical ownership.
- Desirable: Experience in data governance, data lineage, and data quality initiatives.
- Working experience with data pipeline tools like Airflow, Kafka, Spark, and Hive.
- Proficient in Python or Java/Scala, with working knowledge of the relevant development tools and ecosystem.
- Knowledge of SQL and experience with various database/query dialects (e.g., MariaDB, HiveQL, CQL, Spark SQL, Presto); a short Spark SQL example follows this list.
- Working knowledge of CI/CD processes and software containerization.