This course provides a comprehensive introduction to designing and implementing data pipelines for scientific research. It focuses on topics that are often overlooked in traditional ML courses, addressing common challenges scientists face when incorporating ML into their research workflows.
Dr. Ahmad Abu-Khazneh
Senior Machine Learning Engineer, Accelerate Programme
This course takes an idiosyncratic approach based on both industrial and academic experience. Rather than treating data pipelines as merely a means to an end, we explore them as valuable software artifacts that can have lasting impact beyond their original purpose.
While there's no standard definition, this course defines a data pipeline as:
A software artifact comprising all the steps involved in preparing data for a scientific study, published with an accompanying testing framework and documentation, and designed so that it can easily be installed, forked, extended and deployed.
This contrasts with several other common uses of the term:

- Industrial Definition (ETL)
  - Extract-Transform-Load pipelines
  - Emphasis on all three operations as equally crucial components
  - Common in commercial/industrial applications
- Data Science Definition
  - Focuses on data preprocessing and feature engineering
  - Often tightly coupled with modeling components
  - Common in data science courses and bootcamps
- Academic Definition
  - Often viewed as legacy scripts passed down through research groups
  - May be treated as a "black box" that "just works"
  - This course aims to expand this limited view
- Programming Definition
  - Example: sklearn.pipeline.Pipeline in Python
  - Represents a technical implementation of pipeline concepts
  - One of many possible programming frameworks for pipeline implementation
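As a minimal sketch of this last view (only `sklearn.pipeline.Pipeline` itself is named in the course material; the steps and toy data below are invented for illustration):

```python
# Minimal sketch of a preprocessing pipeline built with sklearn.pipeline.Pipeline.
# The choice of steps and the toy data are illustrative, not from the course.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),      # normalise each feature to zero mean, unit variance
    ("reduce", PCA(n_components=2)),  # project onto the two leading principal components
])

X = np.random.default_rng(0).normal(size=(100, 5))  # toy data: 100 samples, 5 features
X_prepared = pipeline.fit_transform(X)
print(X_prepared.shape)  # (100, 2)
```

Wrapping the steps in a single object makes the order of operations explicit and lets the whole preparation procedure be fitted, serialised and reused as one unit, which is closer to the course's notion of a pipeline as a self-contained artifact.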
Treating pipelines as first-class, published artifacts brings benefits both to the science and beyond it:
- Enhances visibility of assumptions and biases for peer review
- Improves reproducibility by making code shareable
- Facilitates extensibility and interoperability
- Increases assurance through proper testing (see the testing sketch after this list)
- Helps manage concept drift and data drift (see the drift-check sketch after this list)
- Meets increasing journal requirements for code quality
- Creates opportunities for commercialization
- Enables industrial partnerships
- Facilitates broader impact through code reuse
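On the testing point: a published pipeline step can ship with unit tests that pin down its behaviour. A minimal sketch using pytest conventions; `drop_incomplete_rows` is a hypothetical step invented for illustration:

```python
# Minimal pytest-style test for a hypothetical pipeline step.
import numpy as np
import pandas as pd


def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: remove rows with any missing values."""
    return df.dropna(axis=0).reset_index(drop=True)


def test_drop_incomplete_rows_removes_nans():
    df = pd.DataFrame({"flux": [1.0, np.nan, 3.0], "error": [0.1, 0.2, 0.3]})
    cleaned = drop_incomplete_rows(df)
    assert len(cleaned) == 2               # the row containing NaN is gone
    assert not cleaned.isna().any().any()  # no missing values remain
```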
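On drift: one rough way to catch data drift is to store summary statistics of the data the pipeline was developed on and compare new batches against them. A sketch under simplifying assumptions (a per-feature mean comparison with an arbitrary z-score threshold; real drift detection would be more careful):

```python
# Rough sketch of a data-drift check: compare per-feature means of a new batch
# against reference statistics recorded when the pipeline was built.
import numpy as np


def fit_reference(X: np.ndarray) -> dict:
    """Record per-feature mean and standard error on the reference data."""
    sem = X.std(axis=0, ddof=1) / np.sqrt(X.shape[0])
    return {"mean": X.mean(axis=0), "sem": sem}


def check_drift(X_new: np.ndarray, ref: dict, z_threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask of features whose batch mean has drifted."""
    sem_new = X_new.std(axis=0, ddof=1) / np.sqrt(X_new.shape[0])
    se = np.sqrt(ref["sem"] ** 2 + sem_new ** 2)  # combined standard error
    z = np.abs(X_new.mean(axis=0) - ref["mean"]) / se
    return z > z_threshold  # the threshold is an illustrative choice


rng = np.random.default_rng(0)
ref = fit_reference(rng.normal(size=(1000, 3)))
# Shift the mean of feature 1 in the "new" batch to simulate drift.
drifted = check_drift(rng.normal(loc=[0.0, 0.5, 0.0], size=(200, 3)), ref)
print(drifted)  # expect feature 1 to be flagged
```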
The course combines analysis of existing pipelines with hands-on work:
- Analysis of well-designed published pipelines from various domains
- Case studies including:
  - James Webb Space Telescope (JWST) astronomy pipeline (see the invocation sketch after this list)
  - Bioinformatics pipelines from nf-core
- Hands-on work with students' own pipelines:
  - Identifying limitations and weaknesses
  - Implementing improvements
  - Applying best practices
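To make the case-study material concrete: the published JWST calibration pipeline is distributed as the `jwst` Python package from STScI and can be invoked in a few lines. A sketch only — the filename is a hypothetical placeholder, and a real run needs a downloaded raw exposure plus CRDS reference-file configuration:

```python
# Sketch of running stage 1 of the JWST calibration pipeline on a raw exposure.
# Requires the jwst package, CRDS environment configuration, and a real
# *_uncal.fits file; the filename below is a placeholder.
from jwst.pipeline import Detector1Pipeline

result = Detector1Pipeline.call("jw00001001001_01101_00001_nrca1_uncal.fits")
```

The nf-core pipelines play an analogous role in bioinformatics: versioned, tested and documented pipelines that anyone can install, run and extend.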
The course explores several key themes:
- Data-centric vs model-centric machine learning
- Software engineering practices in different team sizes
- Academic vs industrial pipeline engineering
- Pipelines as means vs ends in themselves
This course is designed for:
- Scientists incorporating ML into their research
- Researchers looking to improve code reproducibility
- Teams wanting to create maintainable data processing workflows
- Anyone interested in best practices for scientific software engineering
Participants are expected to have:
- Basic programming experience
- Familiarity with data analysis concepts
- Interest in improving research code quality
This course is part of the Accelerate Programme Spring School 2023.