
Designing & Implementing Data Pipelines for Scientific Research

Course Overview

This course provides a comprehensive introduction to designing and implementing data pipelines for scientific research. It focuses on topics often overlooked in traditional ML courses and addresses common challenges scientists face when incorporating ML into their research workflows.

Instructor

Dr. Ahmad Abu-Khazneh
Senior Machine Learning Engineer, Accelerate Programme

Course Philosophy

This course takes an idiosyncratic approach, drawing on both industrial and academic experience. Rather than treating data pipelines as merely a means to an end, we explore them as valuable software artifacts that can have lasting impact beyond their original purpose.

What is a Data Pipeline?

While there's no standard definition, this course defines a data pipeline as:

A software artifact that consists of all the steps related to preparing data for a scientific study, published with its accompanying testing framework and documentation, and that can easily be installed, forked, extended and deployed.

Different Perspectives on Data Pipelines:

  1. Industrial Definition (ETL)

    • Extract-Transform-Load pipelines
    • Emphasis on all three operations as equally crucial components
    • Common in commercial/industrial applications
  2. Data Science Definition

    • Focuses on data preprocessing and feature engineering
    • Often tightly coupled with modeling components
    • Common in data science courses/bootcamps
  3. Academic Definition

    • Often viewed as legacy scripts passed down through research groups
    • May be treated as a "black box" that "just works"
    • This course aims to expand this limited view
  4. Programming Definition

    • Example: sklearn.pipeline.Pipeline in Python (see the sketch after this list)
    • Represents a technical implementation of pipeline concepts
    • One of many possible programming frameworks for pipeline implementation
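
To make the programming-level definition concrete, here is a minimal sketch using sklearn.pipeline.Pipeline. The imputation, scaling and logistic-regression steps, and the toy data, are illustrative assumptions rather than course material.

```python
# Minimal sketch: a scikit-learn Pipeline chains preprocessing steps and an
# estimator behind a single fit/predict interface. Step choices are illustrative.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardise features
    ("model", LogisticRegression()),               # downstream estimator
])

# Toy data purely for demonstration
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 1, 0, 1])

pipeline.fit(X, y)          # each step is fitted in order
print(pipeline.predict(X))  # the same transformations are re-applied at predict time
```

Because the steps are fitted and applied in order behind one interface, the same preprocessing is guaranteed to run at both training and prediction time, which is one reason such frameworks are a useful, if partial, model of a data pipeline.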

Why Should Scientists Care About Data Pipelines?

Research Benefits

  • Enhances visibility of assumptions and biases for peer review
  • Improves reproducibility by making code shareable
  • Facilitates extensibility and interoperability
  • Increases assurance through proper testing (see the sketch after this list)
  • Helps manage concept drift and data drift
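
As an illustration of the assurance that testing can provide, below is a minimal sketch of a unit test for a single pipeline step. The normalise() function and the use of pytest as the test runner are assumptions for illustration, not part of the course code.

```python
# Minimal sketch: pinning down one pipeline step's behaviour with unit tests.
import numpy as np
import pytest


def normalise(values: np.ndarray) -> np.ndarray:
    """Scale an array to the [0, 1] range; a stand-in for a real cleaning step."""
    span = values.max() - values.min()
    if span == 0:
        raise ValueError("Cannot normalise a constant array")
    return (values - values.min()) / span


def test_normalise_range():
    result = normalise(np.array([2.0, 4.0, 6.0]))
    assert result.min() == 0.0 and result.max() == 1.0


def test_normalise_rejects_constant_input():
    # Making failure modes explicit is part of the assurance referred to above.
    with pytest.raises(ValueError):
        normalise(np.array([3.0, 3.0, 3.0]))
```

Tests like these make the assumptions baked into each preprocessing step visible to reviewers and collaborators, which is exactly what the reproducibility and peer-review benefits above depend on.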

Professional Benefits

  • Meets increasing journal requirements for code quality
  • Creates opportunities for commercialization
  • Enables industrial partnerships
  • Facilitates broader impact through code reuse

Course Structure

Labs and Practical Work

  • Analysis of well-designed published pipelines from various domains
  • Case studies including:
    • James Webb Space Telescope (JWST) astronomy pipeline
    • Bioinformatics pipelines from nf-core
  • Hands-on work with students' own pipelines
    • Identifying limitations and weaknesses
    • Implementing improvements
    • Applying best practices

Course Themes

The course explores several key themes:

  • Data-centric vs model-centric machine learning
  • Software engineering practices in different team sizes
  • Academic vs industrial pipeline engineering
  • Pipelines as means vs ends in themselves

Target Audience

This course is designed for:

  • Scientists incorporating ML into their research
  • Researchers looking to improve code reproducibility
  • Teams wanting to create maintainable data processing workflows
  • Anyone interested in best practices for scientific software engineering

Prerequisites

  • Basic programming experience
  • Familiarity with data analysis concepts
  • Interest in improving research code quality

This course is part of the Accelerate Programme Spring School 2023.
