This project demonstrates a scalable preprocessing pipeline for Whole Slide Images (WSIs) used in medical diagnosis. WSIs are gigapixel images (up to 11 GB per file) and pose a significant challenge for conventional image processing tools. We use Apache Beam and Google Cloud Dataflow to handle the distributed processing and prepare the data in a format ready for ML model training on Vertex AI.
- Large WSI images that exceed in-memory processing limits
- Multiple images per patient with a single patient-level label (Multiple Instance Learning)
- Background and noise that must be filtered out to extract meaningful tiles
- Efficient data serialization into `TFRecord` format for downstream ML tasks
- A scalable preprocessing pipeline capable of handling thousands of such images in parallel
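The background-filtering challenge above can be sketched with a simple intensity threshold: WSI background is near-white, so a tile counts as tissue if enough of its pixels are dark. The `is_tissue` helper and its thresholds below are illustrative choices, not the project's actual filter:

```python
import numpy as np

def is_tissue(tile, white_thresh=220, min_tissue_frac=0.1):
    """Heuristic tissue check: keep a tile if enough of its pixels
    are darker than the near-white slide background.

    tile: HxWx3 uint8 RGB array (e.g. from OpenSlide's read_region).
    """
    gray = tile.mean(axis=-1)                     # simple grayscale
    tissue_frac = (gray < white_thresh).mean()    # fraction of non-background pixels
    return tissue_frac >= min_tissue_frac

# Iterating tiles from a slide would look roughly like this (requires the
# openslide-python package and a real slide file, so it is not run here):
def iter_tissue_tiles(slide_path, tile_size=512):
    import openslide  # lazy import: heavy optional dependency
    slide = openslide.OpenSlide(slide_path)
    w, h = slide.dimensions
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            rgba = slide.read_region((x, y), 0, (tile_size, tile_size))
            rgb = np.asarray(rgba)[..., :3]       # drop alpha channel
            if is_tissue(rgb):
                yield (x, y), rgb
```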
We used the Mayo Clinic dataset from Kaggle. It consists of:
- **CSV metadata:** contains `image_id`, `center_id`, `patient_id`, `image_num`, and `label`. Note: there are multiple rows (images) per patient, and the label is assigned at the patient level.
- **TIFF images:** gigapixel whole slide images of tissue slices. Background regions are present and need to be filtered out.
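As a rough sketch of how the patient-level grouping works, plain Python (with the column names from the CSV above) is enough; the sample rows and label values here are made up for illustration:

```python
import csv
import io
from collections import defaultdict

def group_by_patient(csv_text):
    """Group image rows by patient_id; the label is assigned per patient,
    so every row belonging to one patient carries the same label."""
    groups = defaultdict(lambda: {"label": None, "image_ids": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        g = groups[row["patient_id"]]
        g["label"] = row["label"]          # identical across a patient's rows
        g["image_ids"].append(row["image_id"])
    return dict(groups)

# Illustrative sample: two images for patA, one for patB.
sample = """image_id,center_id,patient_id,image_num,label
img_0,1,patA,0,A
img_1,1,patA,1,A
img_2,2,patB,0,B
"""
```

This patient-to-images mapping is exactly what the Multiple Instance Learning setup needs downstream: one bag of tiles per patient, one label per bag.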
- Read & parse the CSV to group image rows by patient
- Load the TIFF image corresponding to each row
- Extract tiles from relevant (tissue) regions
- Merge tiles with patient info
- Serialize into `tf.train.Example` protos
- Write to TFRecord format in Google Cloud Storage
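The serialization step uses TensorFlow's `tf.train.Example` protos; a minimal sketch of packing one encoded tile plus its patient metadata might look like this (the feature names are illustrative, not the project's actual schema):

```python
import tensorflow as tf

def _bytes(v):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[v]))

def _int64(v):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[v]))

def tile_to_example(tile_png, patient_id, label):
    """Pack one encoded tile and its patient-level metadata into an Example."""
    features = tf.train.Features(feature={
        "image": _bytes(tile_png),                        # PNG/JPEG bytes of the tile
        "patient_id": _bytes(patient_id.encode("utf-8")),
        "label": _int64(label),                           # patient-level label as int
    })
    return tf.train.Example(features=features)

ex = tile_to_example(b"\x89PNG...", "patA", 1)
serialized = ex.SerializeToString()
# Round-trip to check the proto survives serialization:
decoded = tf.train.Example.FromString(serialized)
```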
The pipeline is built with Apache Beam and executed on Google Cloud Dataflow.
| Component | Description |
|---|---|
| Apache Beam | Unified programming model for large-scale data processing |
| Google Cloud Dataflow | Serverless runner for Beam pipelines |
| Google Cloud Storage | Input TIFFs and output TFRecord storage |
| TensorFlow | Used for TFRecord serialization |
| OpenSlide | Reads `.tif` whole slide images |
| Kaggle Datasets | Source of Mayo Clinic pathology image data |

