
Conversation

@alyssadai
Contributor

@alyssadai alyssadai commented Dec 22, 2025

Changes proposed in this pull request:

  • Add new env var NB_DATASETS_METADATA_PATH (to be hardcoded in the deployment docker-compose.yml) to specify datasets metadata JSON, and load file on app startup
  • Update dataset-level SPARQL query templates to no longer return dataset name/portal
  • For a given matching dataset in a /datasets response, retrieve static dataset metadata from the datasets metadata JSON file and append it to the computed dataset attributes
  • Update cohort SPARQL query template to use latest JSONLD data model
  • Split response handling logic and models for /query and /subjects endpoints to preserve existing legacy behaviour of /query (see also #520: Officially deprecate legacy GET /query endpoint for POST /subjects)
    • Update /subjects response to exclude dataset metadata
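The startup loading described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the function and variable names (`load_metadata_for_node_datasets`, `DATASETS_METADATA`) mirror identifiers mentioned in this PR, but the exact signatures are assumptions.

```python
import json
import os

# Module-level cache, standing in for the PR's env_settings.DATASETS_METADATA global.
DATASETS_METADATA: dict = {}


def load_metadata_for_node_datasets() -> None:
    """Populate DATASETS_METADATA from the file named by NB_DATASETS_METADATA_PATH."""
    path = os.environ["NB_DATASETS_METADATA_PATH"]
    with open(path, encoding="utf-8") as f:
        DATASETS_METADATA.update(json.load(f))


def clear_metadata() -> None:
    """Clear the cache, e.g. on app shutdown in a lifespan context."""
    DATASETS_METADATA.clear()
```

In the deployment, `NB_DATASETS_METADATA_PATH` would be set in docker-compose.yml as noted above, so the file is read once at startup rather than on every request.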

Checklist

This section is for the PR reviewer

  • PR has an interpretable title with a prefix ([ENH], [FIX], [REF], [TST], [CI], [MNT], [INF], [MODEL], [DOC]) (see our Contributing Guidelines for more info)
  • PR has a label for the release changelog or skip-release (to be applied by maintainers only)
  • PR links to GitHub issue with mention Closes #XXXX
  • Tests pass
  • Checks pass
  • If the PR changes the SPARQL query template, the default Neurobagel query file has also been regenerated

For new features:

  • Tests have been added

For bug fixes:

  • There is at least one test that would fail under the original bug conditions.

Summary by Sourcery

Expand dataset metadata returned by the POST /datasets endpoint by combining per-dataset static metadata from a JSON file with dynamic query results, and simplify SPARQL dataset selections to only include dataset and subject identifiers.

New Features:

  • Load per-node dataset metadata from a configured JSON file at startup and expose it for use in API responses.

Enhancements:

  • Extend the dataset query response model to include richer static metadata fields (e.g., authors, access information, links, and keywords).
  • Refactor the /datasets POST handler to return typed DatasetQueryResponse objects built from merged static and dynamic metadata.
  • Introduce a dedicated CohortQueryResponse model to preserve the legacy /query response shape for backwards compatibility.
  • Add configuration and validation for the datasets metadata file path via a new NB_DATASETS_METADATA_PATH setting.
  • Simplify SPARQL query templates and models to stop selecting dataset_name and dataset_portal_uri directly from the graph.

Tests:

  • Update SPARQL query tests to reflect the reduced set of selected variables and grouping columns.
  • Extend settings tests to cover the new datasets metadata path configuration.


@alyssadai alyssadai added the pr-minor-breaking Feature or enhancement that breaks compatibility, will increment minor version (0.+1.0) label Dec 22, 2025
@sourcery-ai

sourcery-ai bot commented Dec 22, 2025

Reviewer's Guide

This PR enhances the POST /datasets endpoint to return enriched per-dataset metadata by merging dynamic SPARQL query results with static metadata loaded from a JSON file at startup. It also introduces a dedicated /subjects handler and models to keep cohort (/query) behavior backward compatible, and adds configuration and validation for the datasets metadata file, along with corresponding SPARQL template and test updates.

Sequence diagram for enhanced POST /datasets response building

sequenceDiagram
    actor Client
    participant Router_datasets as Router_datasets
    participant CRUD as crud
    participant Util as util
    participant GraphAPI as Graph_API
    participant Env as env_settings

    Client->>Router_datasets: POST /datasets (QueryModel)
    Router_datasets->>CRUD: post_datasets(query)

    CRUD->>Util: create_sparql_queries_for_datasets(query)
    Util-->>CRUD: phenotypic_query, imaging_query

    CRUD->>GraphAPI: POST phenotypic_query
    GraphAPI-->>CRUD: phenotypic_results
    CRUD->>GraphAPI: POST imaging_query
    GraphAPI-->>CRUD: imaging_results

    CRUD->>CRUD: combine results
    CRUD->>CRUD: compute matching_dataset_sizes
    CRUD->>CRUD: compute matching_dataset_imaging_modals_and_pipelines

    loop per dataset_uuid in combined_query_results.groupby(dataset)
        CRUD->>Env: DATASETS_METADATA.get(prefixed_dataset_uuid, {})
        Env-->>CRUD: dataset_static_metadata
        CRUD->>CRUD: dataset_dynamic_metadata dict
        CRUD->>CRUD: DatasetQueryResponse(**dataset_static_metadata, **dataset_dynamic_metadata)
        CRUD->>CRUD: append to response list
    end

    CRUD-->>Router_datasets: list[DatasetQueryResponse]
    Router_datasets-->>Client: 200 OK JSON
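The per-dataset merge step in the loop above can be sketched in a few lines. This is an illustrative reduction: the real code builds typed `DatasetQueryResponse` objects, and the key and field names below are assumptions rather than the PR's exact schema.

```python
# Stand-in for the JSON mapping loaded at startup (illustrative keys/fields).
DATASETS_METADATA = {
    "nb:ds-001": {"dataset_name": "Example Dataset", "authors": ["A. Author"]},
}


def build_dataset_response(dataset_uuid: str, dynamic_metadata: dict) -> dict:
    """Merge static file-based metadata with query-derived attributes for one dataset."""
    static_metadata = DATASETS_METADATA.get(dataset_uuid, {})
    # Unknown datasets fall back to an empty static dict; dynamic values win
    # on any key collision because they are unpacked last.
    return {"dataset_uuid": dataset_uuid, **static_metadata, **dynamic_metadata}
```

The same dict-unpacking order appears in the sequence diagram, where the response model is constructed from `**dataset_static_metadata, **dataset_dynamic_metadata`.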

Class diagram for updated query and dataset response models

classDiagram

    class SessionResponse {
        +str session
        +str session_file_path
        +str dataset_uuid
        +str dataset_name
        +str sub_id
        +int age
        +str sex
        +str diagnosis
        +str image_modal
        +dict completed_pipelines
    }

    class CohortQueryResponse {
        +str dataset_uuid
        +str dataset_name
        +str dataset_portal_uri
        +int dataset_total_subjects
        +bool records_protected
        +int num_matching_subjects
        +list image_modals
        +dict available_pipelines
        +list~SessionResponse~ subject_data
        +str subject_data
    }

    class DatasetQueryResponse {
        +str dataset_uuid
        +str dataset_name
        +list~str~ authors
        +str homepage
        +list~str~ references_and_links
        +list~str~ keywords
        +str repository_url
        +str access_instructions
        +str access_type
        +str access_email
        +str access_link
        +int dataset_total_subjects
        +bool records_protected
        +int num_matching_subjects
        +list~str~ image_modals
        +dict available_pipelines
    }

    class SubjectsQueryResponse {
        +str dataset_uuid
        +list~SessionResponse~ subject_data
        +str subject_data
    }

    class Settings {
        +str graph_address
        +str graph_db
        +int graph_port
        +Path datasets_metadata_path
        +bool return_agg
        +int min_cell_size
        +bool auth_enabled
    }

    class EnvGlobals {
        +dict CONTEXT
        +dict ALL_VOCABS
        +dict DATASETS_METADATA
    }

    CohortQueryResponse --> SessionResponse : uses
    SubjectsQueryResponse --> SessionResponse : uses
    DatasetQueryResponse ..> EnvGlobals : metadata_source
    CohortQueryResponse ..> Settings : guarded_by_min_cell_size
    SubjectsQueryResponse ..> Settings : guarded_by_min_cell_size
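The list-or-str union on `subject_data` in the diagram (both `list~SessionResponse~` and `str` are shown) can be expressed directly in the models. The sketch below uses stdlib dataclasses for self-containment; the actual PR presumably uses Pydantic models, and the fields shown are a subset.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SessionResponse:
    dataset_uuid: str
    sub_id: str
    session: str
    image_modal: Union[str, None] = None


@dataclass
class SubjectsQueryResponse:
    dataset_uuid: str
    # Either row-level session data, or a sentinel string when the data are
    # withheld (e.g. a cohort below min_cell_size) — hence the union type.
    subject_data: Union[List[SessionResponse], str]
```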

File-Level Changes

Enrich POST /datasets responses by merging dynamic query data with static per-dataset metadata loaded at startup.
  • Introduce NB_DATASETS_METADATA_PATH setting and load dataset metadata JSON during app lifespan into a global DATASETS_METADATA mapping.
  • Refactor post_datasets to group SPARQL results only by dataset UUID, compute dynamic metrics (subject counts, modalities, pipelines), look up static metadata by prefixed dataset ID, and construct typed DatasetQueryResponse objects from the merged data.
  • Add a new DatasetQueryResponse model with explicit static metadata fields (name, authors, links, access info, keywords) plus dynamic fields (subject counts, modalities, pipelines).
  • Add tests to verify that /datasets responses include metadata from the JSON file and that combine_sparql_query_results no longer depends on dataset name/portal columns.
Files: app/main.py, app/api/env_settings.py, app/api/utility.py, app/api/crud.py, app/api/models.py, tests/test_datasets.py, tests/test_utility.py, tests/test_settings.py, tests/test_app_events.py, tests/test_query.py
Split subject-level querying into a dedicated /subjects flow while preserving legacy /query response behavior via a new cohort model.
  • Add post_subjects CRUD function that executes the SPARQL query, normalizes the result DataFrame, groups by dataset UUID, enforces min_cell_size, and returns a list of SubjectsQueryResponse objects with only dataset_uuid and subject_data.
  • Introduce separate CohortQueryResponse and SubjectsQueryResponse models so that /query retains the old combined dataset+subject metadata shape while /subjects responses exclude dataset metadata.
  • Update /subjects router to call crud.post_subjects instead of crud.query_records, and adjust /subjects tests and fixtures to use the new function and response structure.
  • Adjust /query router to use CohortQueryResponse as the response model and update crud.query_records signature and callers accordingly.
Files: app/api/crud.py, app/api/models.py, app/api/routers/query.py, app/api/routers/subjects.py, tests/test_subjects.py, tests/conftest.py
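The grouping and min_cell_size guard described for `post_subjects` can be sketched as below. The real implementation reportedly works on a pandas DataFrame with groupby; this stdlib version only illustrates the logic, and the threshold behaviour and sentinel value are assumptions.

```python
from collections import defaultdict


def group_subjects_by_dataset(rows: list, min_cell_size: int) -> list:
    """Group flat query rows by dataset UUID and suppress small cohorts."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["dataset_uuid"]].append(row)

    responses = []
    for dataset_uuid, subject_rows in grouped.items():
        unique_subjects = {r["sub_id"] for r in subject_rows}
        if len(unique_subjects) <= min_cell_size:
            # Below the privacy threshold: withhold row-level data entirely.
            subject_data = "protected"
        else:
            subject_data = subject_rows
        # Only dataset_uuid and subject_data, matching the new /subjects shape.
        responses.append({"dataset_uuid": dataset_uuid, "subject_data": subject_data})
    return responses
```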
Simplify dataset-level SPARQL models and queries to select only dataset and subject identifiers, aligning tests with the new templates.
  • Update SPARQL_SELECTED_VARS and Dataset SPARQL model to drop dataset_name and dataset_portal_uri from selections and triples, relying instead on static metadata JSON for these fields.
  • Adjust dataset SPARQL query builders and templates (phenotypic and imaging) to select ?dataset and ?subject only and to update GROUP BY clauses accordingly.
  • Modify query creation for the cohort (/query) path to use nb:hasAccessLink instead of nb:hasPortalURI while still supporting legacy dataset_portal_uri in the response for compatibility.
  • Update SPARQL-related tests to expect the reduced variable set and new query strings, and clean up test data frames that previously carried dataset_name and dataset_portal_uri columns.
Files: app/api/sparql_models.py, app/api/utility.py, tests/test_sparql.py, tests/test_utility.py
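The shape of the trimmed dataset-level query might look like the template below. This is not the PR's actual template: the prefix, predicates, and filter placeholder are illustrative; only the reduction to `?dataset ?subject` (and the matching GROUP BY) reflects the change described above.

```python
# Only dataset and subject identifiers are selected; dataset_name and
# dataset_portal_uri now come from the external metadata JSON, not the graph.
DATASET_QUERY_TEMPLATE = """
PREFIX nb: <http://neurobagel.org/vocab/>
SELECT ?dataset ?subject
WHERE {{
    ?dataset a nb:Dataset;
             nb:hasSamples ?subject.
    {filters}
}}
GROUP BY ?dataset ?subject
"""


def build_dataset_query(filters: str = "") -> str:
    """Render the template with optional extra triple patterns/filters."""
    return DATASET_QUERY_TEMPLATE.format(filters=filters)
```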
Add startup-time validation and test scaffolding for dataset metadata configuration to keep the app robust under missing or misconfigured settings.
  • Extend validate_environment_variables to fail fast when the datasets metadata file path does not exist, with an informative RuntimeError message.
  • Introduce load_metadata_for_node_datasets helper to read the JSON file and wire it into the FastAPI lifespan to populate env_settings.DATASETS_METADATA, clearing it on shutdown.
  • Add a reusable pytest fixture that creates a temporary datasets metadata JSON file and injects its path into settings for tests that touch app startup.
  • Update app startup, security, and integration tests to use the new fixture and to cover error handling when the file is missing.
Files: app/main.py, app/api/env_settings.py, tests/conftest.py, tests/test_app_events.py, tests/test_security.py, tests/test_query.py
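The fail-fast check described in the first bullet can be sketched as a small validator. The function name echoes the PR's `validate_environment_variables`, but the signature and error wording here are assumptions.

```python
from pathlib import Path


def validate_datasets_metadata_path(datasets_metadata_path: Path) -> None:
    """Raise at startup if the configured metadata file does not exist."""
    if not datasets_metadata_path.is_file():
        raise RuntimeError(
            f"Datasets metadata file not found: {datasets_metadata_path}. "
            "Check that NB_DATASETS_METADATA_PATH points to a valid JSON file."
        )
```

Running this inside the FastAPI lifespan (before the metadata is loaded) turns a misconfigured path into an immediate, informative startup failure instead of a 500 on the first /datasets request.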

Assessment against linked issues

  • #509 — Load the datasets metadata JSON file at API startup via a configurable environment variable, store it in memory, and fail startup if the file is not found.
  • #509 — Update dataset- and subject-level API models, endpoints, and SPARQL queries so that POST /datasets returns expanded dataset-level metadata (without dataset_portal_uri) using the metadata JSON, and /subjects responses include only dataset_uuid plus subject data; separate a dedicated response model for the legacy GET /query endpoint.
  • #509 — Update SPARQL query templates, utilities, and tests so that dataset selection for POST /datasets no longer returns dataset_name or dataset_portal_uri, and test data align with the new response structures and behavior.


@alyssadai alyssadai marked this pull request as ready for review December 23, 2025 20:18
@alyssadai
Contributor Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


We've reviewed this pull request using the Sourcery rules engine


@codecov

codecov bot commented Dec 23, 2025

Codecov Report

❌ Patch coverage is 97.34513% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.23%. Comparing base (a37055e) to head (b742f57).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
app/api/crud.py 90.47% 2 Missing ⚠️
tests/test_app_events.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #519      +/-   ##
==========================================
+ Coverage   96.94%   97.23%   +0.29%     
==========================================
  Files          33       33              
  Lines        1177     1267      +90     
==========================================
+ Hits         1141     1232      +91     
+ Misses         36       35       -1     

☔ View full report in Codecov by Sentry.

@rmanaem rmanaem self-requested a review January 5, 2026 19:33
Collaborator

@rmanaem rmanaem left a comment


Thanks for the PR @alyssadai! I've left a couple of comments regarding the TODOs in the code. I'd suggest moving them from the code to the issues, as we tend to track them more effectively there. Other than that, this is good to go 🧑‍🍳



Development

Successfully merging this pull request may close these issues.

n-API can return dataset level information
