Refactor pipeline to use grain crop classes by SylviaWhittle · Pull Request #1022 · AFM-SPM/TopoStats

SylviaWhittle · 2024-11-22T17:19:21Z

This PR is huge (sorry)

Main things

This PR is designed to improve how we handle grains in the processing stages of TopoStats, from the grain finding stage up to the disordered tracing stage. In future this might be extended through the disordered tracing stage and beyond, but I've restricted the scope of this PR for the sake of everyone's sanity. The reason for stopping at disordered tracing is that once disordered tracing returns, all the data is wrapped up in neatly structured dictionaries, by grain and molecule, similar to what I've implemented here, so I deemed it similar enough not to bother changing yet.

This PR tries to standardize how we handle grains by using DataClasses:

ImageGrainCrops
- has two attributes, above and below, each holding a GrainCropsDirection object for that direction's grain crops
GrainCropsDirection
- two attributes: crops and full_mask_tensor
- crops stores a dictionary of GrainCrop objects (dict[int, GrainCrop])
- full_mask_tensor stores a full-sized mask for the image, of size NxNxC where C is the number of classes. This is NOT automatically updated when the crops property is edited, because we don't want to update things during a loop. This can be discussed if it is the wrong decision!
GrainCrop
- Stores various properties about the grain, such as mask, image, bbox padding etc.
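
To make the nesting described above concrete, here is a minimal sketch of how the three classes could fit together. The class names and the `above`, `below`, `crops` and `full_mask_tensor` attributes come from this description; the individual `GrainCrop` field names and types are assumptions for illustration only, and nested lists stand in for arrays so that the sketch has no dependencies.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class GrainCrop:
    """One cropped grain. Field names here are illustrative assumptions."""

    image: Any  # cropped height data, e.g. a 2D array
    mask: Any  # per-class mask for the crop, e.g. an HxWxC array
    padding: int  # pixels of padding around the grain
    bbox: tuple[int, int, int, int]  # bounding box in the full image
    pixel_to_nm_scaling: float  # repeated for every crop, as noted in this PR


@dataclass
class GrainCropsDirection:
    """All grain crops for one direction plus the full-sized mask tensor."""

    crops: dict[int, GrainCrop] = field(default_factory=dict)
    full_mask_tensor: Any = None  # NxNxC; not refreshed when crops is edited


@dataclass
class ImageGrainCrops:
    """Grain crops for a whole image, split by direction."""

    above: GrainCropsDirection | None = None
    below: GrainCropsDirection | None = None


crop = GrainCrop(
    image=[[0.0, 1.2], [1.1, 0.0]],
    mask=[[[1, 0], [0, 1]], [[0, 1], [1, 0]]],
    padding=1,
    bbox=(0, 0, 2, 2),
    pixel_to_nm_scaling=0.5,
)
image_grain_crops = ImageGrainCrops(above=GrainCropsDirection(crops={0: crop}))
```

Everything about a grain can then be reached from a single object, for example `image_grain_crops.above.crops[0].padding`.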

This has the benefit of standardizing how we handle grains going forward, as we had previously been rather discordant in the types of data structures that we use in various parts of the codebase.

It also adds a helpful (I hope!!) layer of abstraction to processing functions. For example, the run_grainstats function in processing no longer needs to take image, grain_masks and pixel_to_nm_scaling; it now takes just image_grain_crops, which contains all the data for each crop.

This of course does come at the cost of increased memory usage, as parts of images are duplicated in the data structures and the pixel_to_nm_scaling factor etc. is repeated for every crop. However, I personally find that the benefits here far outweigh the negatives. When working on the harbo-rings project, I found myself naturally extracting all the grains and storing them in a dictionary rather than keeping track of full image masks. I know Max also does this, based on how he's handled the tracing code.
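
As a rough illustration of that abstraction, the sketch below shows a function that takes only an `ImageGrainCrops`-like object and reads the mask and scaling from each crop. `run_grainstats_sketch` and the trimmed-down classes are hypothetical stand-ins, not the real `run_grainstats` signature, and the area calculation is just a placeholder statistic.

```python
from dataclasses import dataclass, field


@dataclass
class GrainCrop:
    mask: list  # 2D binary mask (nested lists keep the sketch dependency-free)
    pixel_to_nm_scaling: float


@dataclass
class GrainCropsDirection:
    crops: dict = field(default_factory=dict)


@dataclass
class ImageGrainCrops:
    above: GrainCropsDirection = None
    below: GrainCropsDirection = None


def run_grainstats_sketch(image_grain_crops):
    """Hypothetical stand-in: compute each grain's area from its own crop."""
    rows = []
    for direction in ("above", "below"):
        direction_crops = getattr(image_grain_crops, direction)
        if direction_crops is None:
            continue  # nothing was found in this direction
        for grain_number, crop in direction_crops.crops.items():
            pixels = sum(sum(row) for row in crop.mask)
            rows.append(
                {
                    "direction": direction,
                    "grain_number": grain_number,
                    "area_nm2": pixels * crop.pixel_to_nm_scaling**2,
                }
            )
    return rows


crops = ImageGrainCrops(
    above=GrainCropsDirection({0: GrainCrop([[1, 1], [0, 1]], 2.0)})
)
stats = run_grainstats_sketch(crops)
```

The caller no longer has to keep the image, the masks and the scaling factor in step with each other; each crop carries its own copy.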

`disordered_tracing.py`

Removed prep_arrays. It is no longer needed, since it made a dictionary of grain crops, which we already have with the refactor.

TopoStats Pull Requests

Please provide a descriptive summary of the changes your Pull Request introduces.

The Software Development section of
the Contributing Guidelines may be useful if you are unfamiliar with linting, pre-commit, docstrings and testing.

NB - This header should be replaced with the description but please complete the below checklist or a short
description of why a particular item is not relevant.

Before submitting a Pull Request please check the following.

Existing tests pass.
Documentation has been updated and builds. Remember to update as required...
- docs/configuration.md
- docs/usage.md
- docs/data_dictionary.md
- docs/advanced.md and new pages it should link to.
Pre-commit checks pass.
New functions/methods have typehints and docstrings.
New functions/methods have tests which check the intended behaviour is correct.

Optional

`topostats/default_config.yaml`

If adding options to topostats/default_config.yaml please ensure.

There is a comment adjacent to the option explaining what it is and the valid values.
A check is made in topostats/validation.py to ensure entries are valid.
Add the option to the relevant sub-parser in topostats/entry_point.py.

SylviaWhittle · 2024-12-03T17:19:26Z

Proposed solution to the data frame issue

| image | direction | class | grain | molecule | ... <grainstats> ... | ... <dnatracingstats> ... |
|---|---|---|---|---|---|---|
| mini.spm | above | dna_only | 0 | 0 | ... <stats> ... | NONE |
| mini.spm | above | dna_only | 0 | 1 | ... <stats> ... | NONE |
| mini.spm | above | dna_only | 1 | 0 | ... <stats> ... | NONE |
| mini.spm | above | dna_only | 1 | 1 | ... <stats> ... | NONE |
| mini.spm | above | protein_only | 0 | 0 | ... <stats> ... | NONE |
| mini.spm | above | protein_only | 0 | 1 | ... <stats> ... | NONE |
| mini.spm | above | protein_only | 1 | 0 | ... <stats> ... | NONE |
| mini.spm | above | protein_only | 1 | 1 | ... <stats> ... | NONE |
| mini.spm | above | combined_mask | 0 | 0 | ... <stats> ... | ... <stats> ... |
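
A small sketch of the proposed layout, built as a list of dicts with one dict per row. The column names and values are copied from the table above; the statistics columns are left out, and the uniqueness check is only an illustration of why the five leading columns are enough to identify a row.

```python
from itertools import product

# One dict per row, mirroring the proposed table (stats columns omitted).
rows = []
for class_name, grain, molecule in product(
    ("dna_only", "protein_only"), (0, 1), (0, 1)
):
    rows.append(
        {
            "image": "mini.spm",
            "direction": "above",
            "class": class_name,
            "grain": grain,
            "molecule": molecule,
        }
    )
rows.append(
    {
        "image": "mini.spm",
        "direction": "above",
        "class": "combined_mask",
        "grain": 0,
        "molecule": 0,
    }
)

# The five leading columns together identify each row uniquely.
key_columns = ("image", "direction", "class", "grain", "molecule")
keys = [tuple(row[column] for column in key_columns) for row in rows]
all_unique = len(set(keys)) == len(rows)
```

Note that grain 0 appears under every class, so grain number alone cannot be used as an index in this layout.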

ns-rse · 2024-12-03T17:26:42Z

It's that or split into separate files.

I'm ambivalent as to the preferred solution as I don't use the output, but consideration for end users should be given. Whilst data management, manipulation, summarisation and plotting are, in my view, core skills for researchers these days, experience levels vary widely and I don't know what would be easiest.

ns-rse · 2024-12-10T12:02:21Z

Are we aiming to include this refactoring in v2.3.0 release?

topostats/processing.py

topostats/default_config.yaml

topostats/tracing/disordered_tracing.py

Co-authored-by: Neil Shephard <n.shephard@sheffield.ac.uk>

- Spotted a few `print()` statements from debugging.
- Explicitly test the number of grains below that are returned.
- Switching to a dictionary in the parameterisation of `test_merge_classes()` instead of multiple individual options with comments/labels. The dictionaries are expressive about what the values are since the keys are the configuration options themselves. This in turn means we can just use `**vet_grains_conf` to unpack the dictionary of options when calling the `vet_grains()` function.
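
The dictionary-unpacking pattern mentioned in the last point can be sketched as below. The `vet_grains` function here is a hypothetical stand-in with made-up option names (`min_size`, `max_size`), not the TopoStats function; the point is only that each test case carries its options as a dictionary whose keys are the option names.

```python
def vet_grains(grain_sizes, min_size=0, max_size=float("inf")):
    """Hypothetical stand-in: keep grains whose size is within the limits."""
    return [size for size in grain_sizes if min_size <= size <= max_size]


# Each case pairs a self-describing options dictionary with the expected result.
cases = [
    ({"min_size": 10}, [10, 50, 200]),
    ({"min_size": 10, "max_size": 100}, [10, 50]),
]

results = []
for vet_grains_conf, expected in cases:
    # ** unpacks the dictionary into keyword arguments.
    results.append(vet_grains([5, 10, 50, 200], **vet_grains_conf) == expected)
```

With `pytest.mark.parametrize` the same dictionaries would be passed as a single parameter, which avoids a long list of positional values that need comments to explain them.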

ns-rse

EPIC amount of excellent work @SylviaWhittle. I think making our objects Classes is a really good decision long term and this gives us a good basis on which to build.

I think you've seen #1092 which made some minor suggestions. I've looked through most things and think I have a general feel for things (many of the changes are names and/or where a function's options have changed, removing those options).

Various in-line suggestions have been made.

tests/resources/process_scan_expected_above_height_profiles.pickle

tests/resources/process_scan_expected_below_height_profiles.pickle

tests/_regtest_outputs/test_grainstats_minicircle.test_grainstats_regression.out

tests/_regtest_outputs/test_processing.test_process_scan_above.out

tests/conftest.py

topostats/unet_masking.py

ns-rse · 2025-02-20T13:36:30Z

topostats/processing.py

                    grain_out_path_direction = grain_out_path / f"{direction}"
+                    # Possibly delete this creation of the directory since we already do this earlier?
                    if plotting_config["image_set"] == "all":
-                        grain_out_path_direction.mkdir(parents=True, exist_ok=True)


Not checked in detail but are we sure the grain_out_path_direction will exist and this line is no longer needed?

I believe so? Here there is code in processing.py > get_out_paths that does it:

Which is run as the first step in processing.py > process_scan:

Might be understanding this wrong though ^^
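
For what it's worth, the `mkdir(parents=True, exist_ok=True)` call in question is idempotent, so leaving it in would be harmless even if the directory has already been created earlier. A minimal sketch using a temporary directory (the path names are illustrative only):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    grain_out_path = Path(tmp) / "grains"
    grain_out_path_direction = grain_out_path / "above"

    # First call creates both "grains" and "grains/above".
    grain_out_path_direction.mkdir(parents=True, exist_ok=True)
    # Second call is a no-op rather than raising FileExistsError.
    grain_out_path_direction.mkdir(parents=True, exist_ok=True)

    created = grain_out_path_direction.is_dir()
```

So the question is only whether the call is redundant, not whether removing or keeping it changes behaviour when the directory exists.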

topostats/processing.py

topostats/grainstats.py

ns-rse · 2025-02-20T13:49:00Z

tests/resources/tracing/nodestats/catenanes_nodestats_grainstats.csv

 ,image,grain_number,num_crossings,avg_crossing_confidence,min_crossing_confidence
-grain_0,test_image,0,4,0.4013589828832889,0.2129989376767838
-grain_1,test_image,1,4,0.3441057054647598,0.17063184531586506
+grain_0,test_image,0,4,0.21426594097881008,0.001258249874731443


Are we happy with these changes? They seem quite large.

I was suspicious of these changes too. However, I looked at the traces and they are fine, though slightly different (due to randomness, Max and I believe). The differences in the traces are not significant, but they do result in some quite different results, as you can see. The crossings in question for that image are very difficult and a slight change in trace does produce a large change in results. However, upon manual review, we believe these changes do reflect the slightly altered (but not problematic) trace.

So TLDR: yes, caused by a small change in the trace due to randomness. It turns out the tracing for that image is actually slightly better now (but that's just luck).

Co-authored-by: Neil Shephard <n.shephard@sheffield.ac.uk>

ns-rse

Looks great @SylviaWhittle

I think we will very soon have to deal with how these newly introduced classes are saved when we write .topostats files and loaded by AFMReader when we load them. That will no doubt be quite involved (we've both had a look through the nanosurf code and seen how they handle it!). I checked running against the latest AFMReader:main branch and lots of tests fail.

In light of that I think we will have to advise people to be wary of using any of the newly introduced entry points as they probably won't work, but we are also now in a catch-22 of TopoStats:main installing the latest AFMReader from GitHub which, as per above, causes a lot of errors when loading .topostats objects. Hopefully not too many people are using the handful of entry points that have been added to the swiss-army knife. 🤞

docs/advanced/grain_finding.md

ns-rse · 2025-03-03T16:58:46Z

docs/advanced/grain_finding.md


-The U-Net model will take the bounding box of each grain, makes it square, and passees it to a trained U-Net model
-which makes a prediction for a better mask, which then replaces the original mask.
+Each `GrainCrop`'s image crop is passed to a trained U-Net model which makes a prediction for a better mask, which then


I've queried this before and think the answer was that the pre-trained U-net models can be provisioned on request.

Is that still the case?

If so do we have documented how to request the trained models, and once they have been received where to place them so they are loaded and used?

The topostats/default_config.yaml currently has unet_config["model_path"] defined as Null and the documentation states...

The path to the U-Net model to override traditional segmentation. Supply a path to a tensorflow U-net model to use, else U-Net segmentation will be skipped.

...but how are users (outside the Pyne Lab) to know where to get this from?

We should at least say something along the lines of...

Please contact topostats@sheffield.ac.uk for pre-trained models.

...at some point in the documentation.

NB This is something we should have had in place when the features were first merged into main, sorry for not picking up on this earlier. I'd be happy to make this a separate issue and deal with soon after this pull request is merged.

I've added a little note explaining that our U-Nets are available with our papers, but until the lab decide how they want to handle this (and make a list of papers) I'll write up an issue and leave it for now, if that's okay?

Thanks @SylviaWhittle

Saw the issue #1103 and put some thoughts in there. We'll just have to update these once the papers have been published and there is a reliable location to point people to for downloading.

ns-rse · 2025-03-03T17:02:59Z

tests/test_processing.py

-    assert grainstats_df.shape[0] == 13
-    assert len(grainstats_df.columns) == 22
+    # Expect 6 grains in the above direction for cropped minicircle
+    assert grainstats_df.shape[0] == 6


Presumably those small bits of dirt/noise get removed at some point for being below the minimal size threshold.

topostats/grains.py

The refactoring to use classes for objects rather than dictionaries breaks the `topostats grainstats` entry point that was introduced with #1094. Previously the dictionary item `image=topostats_object["image"]` was passed into `processing.run_grainstats()` when called from `processing.process_grainstats()`. The refactoring requires that `image_grain_crop=image_grain_crop`, an object of the new type `ImageGrainCrops`, is now passed to `grainstats`. This isn't currently possible though, because `AFMReader` loads `.topostats` objects and returns a dictionary and, whilst the refactoring does save the new `image_grain_crop` (/`ImageGrainCrops`), the loading does _not_ currently re-create these structures.

For now I have disabled the test of the entry point. Once this refactoring has been merged we will have to...

- Make `TopoStats` a dependency of `AFMReader` (somewhat wary of this as it may cause headaches further down the line but for now we'll go with it!).
- Modify `AFMReader.topostats.load_topostats()` to modify the `data["grain_tensors"]["above"]` and `data["grain_tensors"]["below"]` so that they are of class `ImageGrainCrops` (and the associated nested classes).
- Once that is done we can then pass the loaded `data["grain_tensors"]` to `processing.run_grainstats()` (this may require reconstructing to be the same as `image_grain_crop`, not sure at the moment!)
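
The kind of reconstruction this would need can be sketched roughly as below. This is only an illustration under assumptions: the nested dictionary layout (`"crops"` keyed by string grain numbers, with `"padding"` and `"bbox"` entries) and the `rebuild_image_grain_crops` helper are hypothetical, and the real layout written to `.topostats` files may well differ.

```python
from dataclasses import dataclass, field


@dataclass
class GrainCrop:
    padding: int
    bbox: tuple


@dataclass
class GrainCropsDirection:
    crops: dict = field(default_factory=dict)


@dataclass
class ImageGrainCrops:
    above: GrainCropsDirection = None
    below: GrainCropsDirection = None


def rebuild_image_grain_crops(grain_data: dict) -> ImageGrainCrops:
    """Hypothetical helper: turn a loaded nested dictionary back into classes."""
    directions = {}
    for direction in ("above", "below"):
        raw = grain_data.get(direction)
        if raw is None:
            directions[direction] = None  # direction absent in the file
            continue
        crops = {
            # Keys read back from a file are often strings, so cast to int.
            int(number): GrainCrop(padding=item["padding"], bbox=tuple(item["bbox"]))
            for number, item in raw["crops"].items()
        }
        directions[direction] = GrainCropsDirection(crops=crops)
    return ImageGrainCrops(**directions)


# Assumed shape of the loaded data, for illustration only.
loaded = {
    "above": {"crops": {"0": {"padding": 1, "bbox": [0, 0, 4, 4]}}},
    "below": None,
}
rebuilt = rebuild_image_grain_crops(loaded)
```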

SylviaWhittle and others added 21 commits November 20, 2024 14:35

WIP: Scope out refactor for grains.py

58757ad

WIP: Begin grains > grainstats pipeline overhaul. Outline data struct…

f49b1e6

…ures.

WIP: Scope out changes to GrainStats.calculate_stats to allow for mul…

4334027

…ti class & subgrains

WIP: Initial proposal for grainstats using grain dictionary refactor

d106b0f

Add function: graincrops_merge_classes

1acee67

Add function: graincrops_update_background_class

a7b9075

Update: extract_grains_from_full_image now works in theory, untested

913e3b5

Fix: extract_grains_from_full_image_mask: allocating region to empty …

843a606

…mask required bool. Tested in debugger.

WIP: Switch vetting, merging and update background to work on dicts o…

535bea8

…f GrainCrops

WIP: Update vet_grains to take / return dicts of GrainCrops

36c07e2

WIP: Update grainstats handling of dataframe to use list of dicts for…

d1e4c1d

… each row

Fix: validate_full_mask_tensor_shape: Require len(shape) == 3, shape[…

ace46ea

…2] >= 2, shape[1]==shape[2]

Edit: find_grains now stores grains in self.image_grain_crops: ImageG…

e2824f9

…rainCrops. Locally debugged working

WIP: Handle ImageGrainCrops between run_grains and run_grainstats

f21eccd

WIP: Graintstats handles ImageGrainCrops

63d0003

Fix: grainstats: process scan no longer needing grain plots returned

02ecf43

WIP: Begin grains > disordered_tracing pipeline overhaul

651e89c

Merge branch 'main' into SylviaWhittle/grain_restructure

a0370cc

[pre-commit.ci] Fixing issues with pre-commit

a03524e

WIP: grains > disorderd_tracing pipeline | fix typing and remove whol…

8d215fd

…e image plotting

[pre-commit.ci] Fixing issues with pre-commit

3a18880

Add: class index to disordered tracing config

7781c86

[WIP] Fix: Attempt to fix grain_number double index issue

80b711c

Max-Gamill reviewed Dec 10, 2024

topostats/processing.py

Max-Gamill reviewed Dec 10, 2024

topostats/default_config.yaml

Max-Gamill reviewed Dec 10, 2024

topostats/tracing/disordered_tracing.py

remove raising error on empty direction

4c6f0f3

Co-authored-by: Neil Shephard <n.shephard@sheffield.ac.uk>

SylviaWhittle and others added 4 commits February 13, 2025 16:16

Fix test: test_validate_full_mask_tensor_shape

9936270

Major Edit: Use class for GrainCrop instead of dataclass to allow for…

524f3a8

… validation during construction using instance property 'padding'

Merge branch 'main' into SylviaWhittle/grain_restructure

3b52279
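
The "Major Edit" commit above swaps the dataclass for a plain class so that `padding` can be validated during construction through a property. A minimal sketch of that pattern follows; the specific rule used here (an integer of at least 1) is an assumption for illustration, and the real `GrainCrop` has more attributes than shown.

```python
class GrainCrop:
    """Sketch of validation on construction via a property setter."""

    def __init__(self, padding):
        # Assigning to self.padding goes through the setter below,
        # so invalid values are rejected as the object is built.
        self.padding = padding

    @property
    def padding(self):
        return self._padding

    @padding.setter
    def padding(self, value):
        # Assumed rule for illustration: an integer of at least 1.
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"padding must be an integer >= 1, got {value!r}")
        self._padding = value


valid_crop = GrainCrop(padding=2)

try:
    GrainCrop(padding=0)
    raised = False
except ValueError:
    raised = True
```

The same check also runs on later assignments such as `valid_crop.padding = 3`, which a validation step that only runs in the constructor would miss.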

ns-rse requested changes Feb 20, 2025

SylviaWhittle and others added 9 commits February 20, 2025 14:38

Fix incorrect bbox anchor for grainstats grain positions

e998d7f

Fix: flipped x and y coords in grainstats

9b2707b

Merge pull request #1092 from AFM-SPM/ns-rse/grain_refactor_suggestions

0d09436

Add u-net reference to docs, thanks @ns-rse

c8cb2b8

Co-authored-by: Neil Shephard <n.shephard@sheffield.ac.uk>

rename .pickle files to .pkl

5717e43

more type hints! thanks @ns-rse

f230af0

Co-authored-by: Neil Shephard <n.shephard@sheffield.ac.uk>

Fix: wrong bbox in dummy grain in conftest & filenames & regtest

3e71782

tidy: use region.bbox instead of extra varaible region_bbox

f403b37

tidy: markdown linting in grains.md

d1d0265

SylviaWhittle requested a review from ns-rse February 25, 2025 15:42

Merge branch 'main' into SylviaWhittle/grain_restructure

a2baedf

ns-rse approved these changes Mar 3, 2025

ns-rse changed the title ~~Refactor pipeline to use grain crop dictionaries~~ Refactor pipeline to use grain crop classes Mar 4, 2025

This was referenced Mar 4, 2025

load_topostats() should use new ImageGrainCrops classes AFM-SPM/AFMReader#121

Closed

Update grainstats entry point to work with ImageGrainCrops #1101

Closed

ns-rse and others added 2 commits March 5, 2025 12:20

Merge pull request #1100 from AFM-SPM/ns-rse/disable-run_grainstats-e…

d7eda49

…ntry-point-test tests: disable test_run_modules::test_grainstats

Add a little more on U-Nets

a9e7783

ns-rse approved these changes Mar 11, 2025

SylviaWhittle added this pull request to the merge queue Mar 11, 2025

Merged via the queue into main with commit 8243ddd Mar 11, 2025
11 checks passed

SylviaWhittle deleted the SylviaWhittle/grain_restructure branch March 11, 2025 11:09

ns-rse mentioned this pull request May 7, 2025

TopoStats classes internally and for writing/reading HDF5 #1151

Closed

14 tasks
