
Conversation

@alyssadai
Contributor

@alyssadai alyssadai commented Dec 22, 2025

Changes proposed in this pull request:

  • Add new env var NB_DATASETS_METADATA_PATH (to be hardcoded in the deployment docker-compose.yml) to specify datasets metadata JSON, and load file on app startup
  • Update dataset-level SPARQL query templates to no longer return dataset name/portal
  • For a given matching dataset in a /datasets response, retrieve static dataset metadata from the datasets metadata JSON file and append it to the computed dataset attributes
  • Update cohort SPARQL query template to use latest JSONLD data model
  • Split response handling logic and models for /query and /subjects endpoints to preserve existing legacy behaviour of /query (see also #520: Officially deprecate legacy GET /query endpoint for POST /subjects)
    • Update /subjects response to exclude dataset metadata
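The startup loading described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the function and variable names (`load_metadata_for_node_datasets`, `DATASETS_METADATA`) mirror identifiers mentioned in this PR, but the exact signatures are assumptions.

```python
import json
import os

# Module-level cache, standing in for the PR's env_settings.DATASETS_METADATA global.
DATASETS_METADATA: dict = {}


def load_metadata_for_node_datasets() -> None:
    """Populate DATASETS_METADATA from the file named by NB_DATASETS_METADATA_PATH."""
    path = os.environ["NB_DATASETS_METADATA_PATH"]
    with open(path, encoding="utf-8") as f:
        DATASETS_METADATA.update(json.load(f))


def clear_metadata() -> None:
    """Clear the cache, e.g. on app shutdown in a lifespan context."""
    DATASETS_METADATA.clear()
```

In the deployment, `NB_DATASETS_METADATA_PATH` would be set in docker-compose.yml as noted above, so the file is read once at startup rather than on every request.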

Checklist

This section is for the PR reviewer

  • PR has an interpretable title with a prefix ([ENH], [FIX], [REF], [TST], [CI], [MNT], [INF], [MODEL], [DOC]) (see our Contributing Guidelines for more info)
  • PR has a label for the release changelog or skip-release (to be applied by maintainers only)
  • PR links to GitHub issue with mention Closes #XXXX
  • Tests pass
  • Checks pass
  • If the PR changes the SPARQL query template, the default Neurobagel query file has also been regenerated

For new features:

  • Tests have been added

For bug fixes:

  • There is at least one test that would fail under the original bug conditions.

Summary by Sourcery

Expand dataset metadata returned by the POST /datasets endpoint by combining per-dataset static metadata from a JSON file with dynamic query results, and simplify SPARQL dataset selections to only include dataset and subject identifiers.

New Features:

  • Load per-node dataset metadata from a configured JSON file at startup and expose it for use in API responses.

Enhancements:

  • Extend the dataset query response model to include richer static metadata fields (e.g., authors, access information, links, and keywords).
  • Refactor the /datasets POST handler to return typed DatasetQueryResponse objects built from merged static and dynamic metadata.
  • Introduce a dedicated CohortQueryResponse model to preserve the legacy /query response shape for backwards compatibility.
  • Add configuration and validation for the datasets metadata file path via a new NB_DATASETS_METADATA_PATH setting.
  • Simplify SPARQL query templates and models to stop selecting dataset_name and dataset_portal_uri directly from the graph.

Tests:

  • Update SPARQL query tests to reflect the reduced set of selected variables and grouping columns.
  • Extend settings tests to cover the new datasets metadata path configuration.


@alyssadai alyssadai added the pr-minor-breaking Feature or enhancement that breaks compatibility, will increment minor version (0.+1.0) label Dec 22, 2025
@sourcery-ai

sourcery-ai bot commented Dec 22, 2025

Reviewer's Guide

This PR enhances the POST /datasets endpoint to return enriched per-dataset metadata by merging dynamic SPARQL query results with static metadata loaded from a JSON file at startup. It also introduces a dedicated /subjects handler and models to keep cohort (/query) behavior backward compatible, and adds configuration and validation for the datasets metadata file, along with corresponding SPARQL template and test updates.

Sequence diagram for enhanced POST /datasets response building

sequenceDiagram
    actor Client
    participant Router_datasets as Router_datasets
    participant CRUD as crud
    participant Util as util
    participant GraphAPI as Graph_API
    participant Env as env_settings

    Client->>Router_datasets: POST /datasets (QueryModel)
    Router_datasets->>CRUD: post_datasets(query)

    CRUD->>Util: create_sparql_queries_for_datasets(query)
    Util-->>CRUD: phenotypic_query, imaging_query

    CRUD->>GraphAPI: POST phenotypic_query
    GraphAPI-->>CRUD: phenotypic_results
    CRUD->>GraphAPI: POST imaging_query
    GraphAPI-->>CRUD: imaging_results

    CRUD->>CRUD: combine results
    CRUD->>CRUD: compute matching_dataset_sizes
    CRUD->>CRUD: compute matching_dataset_imaging_modals_and_pipelines

    loop per dataset_uuid in combined_query_results.groupby(dataset)
        CRUD->>Env: DATASETS_METADATA.get(prefixed_dataset_uuid, {})
        Env-->>CRUD: dataset_static_metadata
        CRUD->>CRUD: dataset_dynamic_metadata dict
        CRUD->>CRUD: DatasetQueryResponse(**dataset_static_metadata, **dataset_dynamic_metadata)
        CRUD->>CRUD: append to response list
    end

    CRUD-->>Router_datasets: list[DatasetQueryResponse]
    Router_datasets-->>Client: 200 OK JSON
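The per-dataset merge step in the loop above can be sketched in a few lines. This is an illustrative reduction: the real code builds typed `DatasetQueryResponse` objects, and the key and field names below are assumptions rather than the PR's exact schema.

```python
# Stand-in for the JSON mapping loaded at startup (illustrative keys/fields).
DATASETS_METADATA = {
    "nb:ds-001": {"dataset_name": "Example Dataset", "authors": ["A. Author"]},
}


def build_dataset_response(dataset_uuid: str, dynamic_metadata: dict) -> dict:
    """Merge static file-based metadata with query-derived attributes for one dataset."""
    static_metadata = DATASETS_METADATA.get(dataset_uuid, {})
    # Unknown datasets fall back to an empty static dict; dynamic values win
    # on any key collision because they are unpacked last.
    return {"dataset_uuid": dataset_uuid, **static_metadata, **dynamic_metadata}
```

The same dict-unpacking order appears in the sequence diagram, where the response model is constructed from `**dataset_static_metadata, **dataset_dynamic_metadata`.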

Class diagram for updated query and dataset response models

classDiagram

    class SessionResponse {
        +str session
        +str session_file_path
        +str dataset_uuid
        +str dataset_name
        +str sub_id
        +int age
        +str sex
        +str diagnosis
        +str image_modal
        +dict completed_pipelines
    }

    class CohortQueryResponse {
        +str dataset_uuid
        +str dataset_name
        +str dataset_portal_uri
        +int dataset_total_subjects
        +bool records_protected
        +int num_matching_subjects
        +list image_modals
        +dict available_pipelines
        +list~SessionResponse~ subject_data
        +str subject_data
    }

    class DatasetQueryResponse {
        +str dataset_uuid
        +str dataset_name
        +list~str~ authors
        +str homepage
        +list~str~ references_and_links
        +list~str~ keywords
        +str repository_url
        +str access_instructions
        +str access_type
        +str access_email
        +str access_link
        +int dataset_total_subjects
        +bool records_protected
        +int num_matching_subjects
        +list~str~ image_modals
        +dict available_pipelines
    }

    class SubjectsQueryResponse {
        +str dataset_uuid
        +list~SessionResponse~ subject_data
        +str subject_data
    }

    class Settings {
        +str graph_address
        +str graph_db
        +int graph_port
        +Path datasets_metadata_path
        +bool return_agg
        +int min_cell_size
        +bool auth_enabled
    }

    class EnvGlobals {
        +dict CONTEXT
        +dict ALL_VOCABS
        +dict DATASETS_METADATA
    }

    CohortQueryResponse --> SessionResponse : uses
    SubjectsQueryResponse --> SessionResponse : uses
    DatasetQueryResponse ..> EnvGlobals : metadata_source
    CohortQueryResponse ..> Settings : guarded_by_min_cell_size
    SubjectsQueryResponse ..> Settings : guarded_by_min_cell_size
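The list-or-str union on `subject_data` in the diagram (both `list~SessionResponse~` and `str` are shown) can be expressed directly in the models. The sketch below uses stdlib dataclasses for self-containment; the actual PR presumably uses Pydantic models, and the fields shown are a subset.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SessionResponse:
    dataset_uuid: str
    sub_id: str
    session: str
    image_modal: Union[str, None] = None


@dataclass
class SubjectsQueryResponse:
    dataset_uuid: str
    # Either row-level session data, or a sentinel string when the data are
    # withheld (e.g. a cohort below min_cell_size) — hence the union type.
    subject_data: Union[List[SessionResponse], str]
```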

File-Level Changes

Enrich POST /datasets responses by merging dynamic query data with static per-dataset metadata loaded at startup.
  • Introduce NB_DATASETS_METADATA_PATH setting and load dataset metadata JSON during app lifespan into a global DATASETS_METADATA mapping.
  • Refactor post_datasets to group SPARQL results only by dataset UUID, compute dynamic metrics (subject counts, modalities, pipelines), look up static metadata by prefixed dataset ID, and construct typed DatasetQueryResponse objects from the merged data.
  • Add a new DatasetQueryResponse model with explicit static metadata fields (name, authors, links, access info, keywords) plus dynamic fields (subject counts, modalities, pipelines).
  • Add tests to verify that /datasets responses include metadata from the JSON file and that combine_sparql_query_results no longer depends on dataset name/portal columns.
Files: app/main.py, app/api/env_settings.py, app/api/utility.py, app/api/crud.py, app/api/models.py, tests/test_datasets.py, tests/test_utility.py, tests/test_settings.py, tests/test_app_events.py, tests/test_query.py
Split subject-level querying into a dedicated /subjects flow while preserving legacy /query response behavior via a new cohort model.
  • Add post_subjects CRUD function that executes the SPARQL query, normalizes the result DataFrame, groups by dataset UUID, enforces min_cell_size, and returns a list of SubjectsQueryResponse objects with only dataset_uuid and subject_data.
  • Introduce separate CohortQueryResponse and SubjectsQueryResponse models so that /query retains the old combined dataset+subject metadata shape while /subjects responses exclude dataset metadata.
  • Update /subjects router to call crud.post_subjects instead of crud.query_records, and adjust /subjects tests and fixtures to use the new function and response structure.
  • Adjust /query router to use CohortQueryResponse as the response model and update crud.query_records signature and callers accordingly.
Files: app/api/crud.py, app/api/models.py, app/api/routers/query.py, app/api/routers/subjects.py, tests/test_subjects.py, tests/conftest.py
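The grouping and min_cell_size guard described for `post_subjects` can be sketched as below. The real implementation reportedly works on a pandas DataFrame with groupby; this stdlib version only illustrates the logic, and the threshold behaviour and sentinel value are assumptions.

```python
from collections import defaultdict


def group_subjects_by_dataset(rows: list, min_cell_size: int) -> list:
    """Group flat query rows by dataset UUID and suppress small cohorts."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["dataset_uuid"]].append(row)

    responses = []
    for dataset_uuid, subject_rows in grouped.items():
        unique_subjects = {r["sub_id"] for r in subject_rows}
        if len(unique_subjects) <= min_cell_size:
            # Below the privacy threshold: withhold row-level data entirely.
            subject_data = "protected"
        else:
            subject_data = subject_rows
        # Only dataset_uuid and subject_data, matching the new /subjects shape.
        responses.append({"dataset_uuid": dataset_uuid, "subject_data": subject_data})
    return responses
```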
Simplify dataset-level SPARQL models and queries to select only dataset and subject identifiers, aligning tests with the new templates.
  • Update SPARQL_SELECTED_VARS and Dataset SPARQL model to drop dataset_name and dataset_portal_uri from selections and triples, relying instead on static metadata JSON for these fields.
  • Adjust dataset SPARQL query builders and templates (phenotypic and imaging) to select ?dataset and ?subject only and to update GROUP BY clauses accordingly.
  • Modify query creation for the cohort (/query) path to use nb:hasAccessLink instead of nb:hasPortalURI while still supporting legacy dataset_portal_uri in the response for compatibility.
  • Update SPARQL-related tests to expect the reduced variable set and new query strings, and clean up test data frames that previously carried dataset_name and dataset_portal_uri columns.
Files: app/api/sparql_models.py, app/api/utility.py, tests/test_sparql.py, tests/test_utility.py
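The shape of the trimmed dataset-level query might look like the template below. This is not the PR's actual template: the prefix, predicates, and filter placeholder are illustrative; only the reduction to `?dataset ?subject` (and the matching GROUP BY) reflects the change described above.

```python
# Only dataset and subject identifiers are selected; dataset_name and
# dataset_portal_uri now come from the external metadata JSON, not the graph.
DATASET_QUERY_TEMPLATE = """
PREFIX nb: <http://neurobagel.org/vocab/>
SELECT ?dataset ?subject
WHERE {{
    ?dataset a nb:Dataset;
             nb:hasSamples ?subject.
    {filters}
}}
GROUP BY ?dataset ?subject
"""


def build_dataset_query(filters: str = "") -> str:
    """Render the template with optional extra triple patterns/filters."""
    return DATASET_QUERY_TEMPLATE.format(filters=filters)
```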
Add startup-time validation and test scaffolding for dataset metadata configuration to keep the app robust under missing or misconfigured settings.
  • Extend validate_environment_variables to fail fast when the datasets metadata file path does not exist, with an informative RuntimeError message.
  • Introduce load_metadata_for_node_datasets helper to read the JSON file and wire it into the FastAPI lifespan to populate env_settings.DATASETS_METADATA, clearing it on shutdown.
  • Add a reusable pytest fixture that creates a temporary datasets metadata JSON file and injects its path into settings for tests that touch app startup.
  • Update app startup, security, and integration tests to use the new fixture and to cover error handling when the file is missing.
Files: app/main.py, app/api/env_settings.py, tests/conftest.py, tests/test_app_events.py, tests/test_security.py, tests/test_query.py
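The fail-fast check described in the first bullet can be sketched as a small validator. The function name echoes the PR's `validate_environment_variables`, but the signature and error wording here are assumptions.

```python
from pathlib import Path


def validate_datasets_metadata_path(datasets_metadata_path: Path) -> None:
    """Raise at startup if the configured metadata file does not exist."""
    if not datasets_metadata_path.is_file():
        raise RuntimeError(
            f"Datasets metadata file not found: {datasets_metadata_path}. "
            "Check that NB_DATASETS_METADATA_PATH points to a valid JSON file."
        )
```

Running this inside the FastAPI lifespan (before the metadata is loaded) turns a misconfigured path into an immediate, informative startup failure instead of a 500 on the first /datasets request.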

Assessment against linked issues

  • #509 — Load the datasets metadata JSON file at API startup via a configurable environment variable, store it in memory, and fail startup if the file is not found.
  • #509 — Update dataset- and subject-level API models, endpoints, and SPARQL queries so that POST /datasets returns expanded dataset-level metadata (without dataset_portal_uri) using the metadata JSON, and /subjects responses include only dataset_uuid plus subject data; separate a dedicated response model for the legacy GET /query endpoint.
  • #509 — Update SPARQL query templates, utilities, and tests so that dataset selection for POST /datasets no longer returns dataset_name or dataset_portal_uri, and test data align with the new response structures and behavior.


@alyssadai alyssadai marked this pull request as ready for review December 23, 2025 20:18
@alyssadai
Contributor Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


We've reviewed this pull request using the Sourcery rules engine


@codecov

codecov bot commented Dec 23, 2025

Codecov Report

❌ Patch coverage is 97.34513% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.23%. Comparing base (a37055e) to head (b742f57).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
app/api/crud.py 90.47% 2 Missing ⚠️
tests/test_app_events.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #519      +/-   ##
==========================================
+ Coverage   96.94%   97.23%   +0.29%     
==========================================
  Files          33       33              
  Lines        1177     1267      +90     
==========================================
+ Hits         1141     1232      +91     
+ Misses         36       35       -1     

☔ View full report in Codecov by Sentry.

@rmanaem rmanaem self-requested a review January 5, 2026 19:33
Collaborator

@rmanaem rmanaem left a comment


Thanks for the PR @alyssadai! I've left a couple of comments regarding the TODOs in the code. I'd suggest moving them from the code to the issues, as we tend to track them more effectively there. Other than that, this is good to go 🧑‍🍳



Development

Successfully merging this pull request may close these issues.

n-API can return dataset level information
