This project demonstrates a scalable preprocessing pipeline for Whole Slide Images (WSIs) used in medical diagnosis. WSIs are gigapixel images (up to 11 GB per file) and pose a significant challenge for conventional image processing tools. We use Apache Beam and Google Cloud Dataflow to handle the distributed processing and prepare the data in a format ready for ML model training on Vertex AI.
- Large WSI images that exceed in-memory processing limits
- Multiple images per patient with a single patient-level label (Multiple Instance Learning)
- Background and noise that must be filtered out to extract meaningful tiles
- Efficient data serialization into `TFRecord` format for downstream ML tasks
- A scalable preprocessing pipeline capable of handling thousands of such images in parallel
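The background-filtering challenge above can be sketched with a simple intensity threshold: WSI background is near-white, so a tile counts as tissue if enough of its pixels are dark. The `is_tissue` helper and its thresholds below are illustrative choices, not the project's actual filter:

```python
import numpy as np

def is_tissue(tile, white_thresh=220, min_tissue_frac=0.1):
    """Heuristic tissue check: keep a tile if enough of its pixels
    are darker than the near-white slide background.

    tile: HxWx3 uint8 RGB array (e.g. from OpenSlide's read_region).
    """
    gray = tile.mean(axis=-1)                     # simple grayscale
    tissue_frac = (gray < white_thresh).mean()    # fraction of non-background pixels
    return tissue_frac >= min_tissue_frac

# Iterating tiles from a slide would look roughly like this (requires the
# openslide-python package and a real slide file, so it is not run here):
def iter_tissue_tiles(slide_path, tile_size=512):
    import openslide  # lazy import: heavy optional dependency
    slide = openslide.OpenSlide(slide_path)
    w, h = slide.dimensions
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            rgba = slide.read_region((x, y), 0, (tile_size, tile_size))
            rgb = np.asarray(rgba)[..., :3]       # drop alpha channel
            if is_tissue(rgb):
                yield (x, y), rgb
```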
We used the Mayo Clinic dataset from Kaggle. It consists of:
- **CSV metadata:** contains `image_id`, `center_id`, `patient_id`, `image_num`, and `label`. Note: there are multiple rows (images) per patient, and the label is assigned at the patient level.
- **TIFF images:** gigapixel whole slide images of tissue slices. Background regions are present and need to be filtered out.
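As a rough sketch of how the patient-level grouping works, plain Python (with the column names from the CSV above) is enough; the sample rows and label values here are made up for illustration:

```python
import csv
import io
from collections import defaultdict

def group_by_patient(csv_text):
    """Group image rows by patient_id; the label is assigned per patient,
    so every row belonging to one patient carries the same label."""
    groups = defaultdict(lambda: {"label": None, "image_ids": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        g = groups[row["patient_id"]]
        g["label"] = row["label"]          # identical across a patient's rows
        g["image_ids"].append(row["image_id"])
    return dict(groups)

# Illustrative sample: two images for patA, one for patB.
sample = """image_id,center_id,patient_id,image_num,label
img_0,1,patA,0,A
img_1,1,patA,1,A
img_2,2,patB,0,B
"""
```

This patient-to-images mapping is exactly what the Multiple Instance Learning setup needs downstream: one bag of tiles per patient, one label per bag.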
- Read & parse the CSV to group image rows by patient
- Load the TIFF image corresponding to each row
- Extract tiles from relevant (tissue) regions
- Merge tiles with patient info
- Serialize into `tf.train.Example` protos
- Write to TFRecord format in Google Cloud Storage
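The serialization step uses TensorFlow's `tf.train.Example` protos; a minimal sketch of packing one encoded tile plus its patient metadata might look like this (the feature names are illustrative, not the project's actual schema):

```python
import tensorflow as tf

def _bytes(v):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[v]))

def _int64(v):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[v]))

def tile_to_example(tile_png, patient_id, label):
    """Pack one encoded tile and its patient-level metadata into an Example."""
    features = tf.train.Features(feature={
        "image": _bytes(tile_png),                        # PNG/JPEG bytes of the tile
        "patient_id": _bytes(patient_id.encode("utf-8")),
        "label": _int64(label),                           # patient-level label as int
    })
    return tf.train.Example(features=features)

ex = tile_to_example(b"\x89PNG...", "patA", 1)
serialized = ex.SerializeToString()
# Round-trip to check the proto survives serialization:
decoded = tf.train.Example.FromString(serialized)
```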
The pipeline is built with Apache Beam and executed on Google Cloud Dataflow.
| Component | Description |
|---|---|
| Apache Beam | Unified programming model for large-scale data processing |
| Google Cloud Dataflow | Serverless runner for Beam pipelines |
| Google Cloud Storage | Input TIFFs and output TFRecord storage |
| TensorFlow | Used for TFRecord serialization |
| OpenSlide | Reads `.tif` whole slide images |
| Kaggle Datasets | Source of Mayo Clinic pathology image data |

