diff --git a/modules/contextualization/cdf_file_annotation/CONTRIBUTING.md b/modules/contextualization/cdf_file_annotation/CONTRIBUTING.md new file mode 100644 index 00000000..582be646 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/CONTRIBUTING.md @@ -0,0 +1,196 @@ +# Contributing to CDF File Annotation Module + +Thank you for your interest in contributing to the CDF File Annotation Module! This document outlines the process for contributing to this project. + +## Contribution Workflow + +All contributions to this project must follow this workflow: + +### 1. Create a GitHub Issue + +Before making any changes, please create a GitHub issue to discuss: + +- **Bug Reports**: Describe the bug, steps to reproduce, expected vs. actual behavior, and your environment +- **Feature Requests**: Describe the feature, its use case, and how it would benefit the project +- **Documentation Improvements**: Describe what documentation is missing or needs clarification +- **Code Improvements**: Describe the refactoring or optimization you'd like to make + +**Why create an issue first?** + +- Ensures alignment on the problem and proposed solution +- Prevents duplicate work +- Allows for discussion before investing time in implementation +- Provides context for the eventual pull request + +### 2. Create a Pull Request + +Once the issue has been discussed and you're ready to contribute: + +1. **Fork the repository** to your GitHub account +2. **Create a feature branch** from `main`: + + ```bash + git checkout -b feature/issue-123-short-description + ``` + + or + + ```bash + git checkout -b fix/issue-456-short-description + ``` + +3. **Make your changes** following the code standards below + +4. **Commit your changes** with clear, descriptive commit messages: + + ```bash + git commit -m "Fix: Resolve cache invalidation issue (#123) + + - Updated cache validation logic to handle edge cases + - Added unit tests for cache service + - Updated documentation" + ``` + +5. 
**Push to your fork**: + + ```bash + git push origin feature/issue-123-short-description + ``` + +6. **Create a Pull Request** on GitHub: + - Reference the related issue in the PR description (e.g., "Closes #123" or "Fixes #456") + - Provide a clear description of what changed and why + - Include any relevant testing details or screenshots + - Add `@dude-with-a-mug` as a reviewer (or the current maintainer) + +### 3. Code Review and Approval + +- **All PRs require approval** from the project maintainer (@dude-with-a-mug or designated reviewer) before merging +- The maintainer will review your code for: + + - Code quality and adherence to project standards + - Test coverage + - Documentation updates + - Breaking changes or backward compatibility + - Performance implications + +- Address any feedback or requested changes +- Once approved, the maintainer will merge your PR + +**Note**: PRs will not be merged without maintainer approval, even if all automated checks pass. + +## Code Standards + +### Python Code Style + +- Use type hints for all function parameters and return values +- Maximum line length: 120 characters (as configured in the project) +- Use meaningful variable and function names + +### Documentation + +- **All functions must include Google-style docstrings** with: + - Brief description + - `Args`: Parameter descriptions + - `Returns`: Return value description + - `Raises`: Exception descriptions (if applicable) +- Update README.md or relevant documentation if your changes affect user-facing behavior +- Add inline comments for complex logic or non-obvious decisions + +### Example Docstring Format + +```python +def process_annotations( + self, + file_node: Node, + regular_item: dict | None, + pattern_item: dict | None +) -> tuple[str, str]: + """ + Processes diagram detection results and applies annotations to a file. + + Handles both regular entity matching and pattern mode results, applying + confidence thresholds and deduplication logic. 
+ + Args: + file_node: The file node instance to annotate. + regular_item: Dictionary containing regular diagram detect results. + pattern_item: Dictionary containing pattern mode results. + + Returns: + A tuple containing: + - Summary message of regular annotations applied + - Summary message of pattern annotations created + + Raises: + CogniteAPIError: If the API calls to apply annotations fail. + ValueError: If the file node is missing required properties. + """ + # Implementation... +``` + +### Testing + +- Add tests for new functionality where applicable +- Ensure existing tests pass before submitting your PR +- Test locally using the VSCode debugger setup (see [DEPLOYMENT.md](DEPLOYMENT.md)) + +### Configuration Changes + +- If you modify the configuration structure (`ep_file_annotation.config.yaml`), ensure: + - Pydantic models are updated accordingly + - Documentation in `detailed_guides/CONFIG.md` is updated + - Backward compatibility is maintained or a migration path is provided + +## What We're Looking For + +Contributions that align with the project's philosophy: + +- **Configuration-driven**: Prefer adding configuration options over hardcoded behavior +- **Interface-based**: Extend functionality through interfaces rather than modifying core logic +- **Well-documented**: Code should be self-explanatory with clear documentation +- **Production-ready**: Code should handle edge cases, errors, and scale considerations +- **Backward compatible**: Avoid breaking changes unless absolutely necessary + +## Types of Contributions We Welcome + +- **Bug fixes**: Resolve issues, fix edge cases, improve error handling +- **Performance improvements**: Optimize queries, caching, or processing logic +- **Documentation**: Improve guides, add examples, clarify confusing sections +- **New configuration options**: Add flexibility through new config parameters +- **New service implementations**: Create alternative implementations of existing interfaces +- **Test coverage**: 
Add unit tests, integration tests, or test utilities +- **Examples**: Add example configurations or use cases + +## Types of Changes Requiring Extra Discussion + +These types of changes require significant discussion in the GitHub issue before proceeding: + +- Breaking changes to the configuration format +- Changes to the core architecture or interfaces +- New external dependencies +- Changes affecting the data model structure +- Performance changes that trade off memory/CPU/network differently + +## Questions or Need Help? + +- Create a GitHub issue with your question +- Tag it with the "question" label +- The maintainer will respond as soon as possible + +## Code of Conduct + +- Be respectful and constructive in all interactions +- Provide thoughtful, actionable feedback during code reviews +- Assume good intentions from all contributors +- Focus on the code and ideas, not the person + +## License + +By contributing to this project, you agree that your contributions will be licensed under the same license as the project (see LICENSE file). + +--- + +Thank you for contributing to making this project better! 🚀 + +Return to [Main README](README.md) diff --git a/modules/contextualization/cdf_file_annotation/DEPLOYMENT.md b/modules/contextualization/cdf_file_annotation/DEPLOYMENT.md new file mode 100644 index 00000000..c859ad13 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/DEPLOYMENT.md @@ -0,0 +1,229 @@ +# Deployment Guide + +This guide provides step-by-step instructions for deploying the CDF File Annotation Module to your Cognite Data Fusion (CDF) project.
+ +## Prerequisites + +Before deploying this module, ensure you have the following: + +- **Python 3.11+** installed on your system +- **An active Cognite Data Fusion (CDF) project** +- **CDF Toolkit** installed (see step 1 below) +- **Required Python packages** are listed in: + - `cdf_file_annotation/functions/fn_file_annotation_launch/requirements.txt` + - `cdf_file_annotation/functions/fn_file_annotation_finalize/requirements.txt` + +### Data Preparation Requirements + +Alias and tag generation is abstracted out of the annotation function. You'll need to create a transformation that populates the `aliases` and `tags` properties of your file and target entity views: + +#### Aliases Property + +- Used to match files with entities +- Should contain a list of alternative names or identifiers that can be found in the file's image +- Examples: `["FT-101A", "Flow Transmitter 101A", "FT101A"]` + +#### Tags Property + +The `tags` property serves multiple purposes and consists of the following: + +- **`DetectInDiagrams`**: Identifies files and assets to include as entities filtered by primary scope and secondary scope (if provided) +- **`ScopeWideDetect`**: Identifies files and assets to include as entities filtered by a primary scope only +- **`ToAnnotate`**: Identifies files that need to be annotated +- **`AnnotationInProcess`**: Identifies files that are in the process of being annotated +- **`Annotated`**: Identifies files that have been annotated +- **`AnnotationFailed`**: Identifies files that have failed the annotation process (either by erroring out or by receiving 0 possible matches) + +> **Note**: Don't worry if these concepts don't immediately make sense. Aliases and tags are explained in greater detail in the `detailed_guides/` documentation. The template also includes a jupyter notebook that prepares the files and assets for annotation if using the toolkit's quickstart module. 
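To make the `aliases` requirement concrete, here is a minimal sketch of deriving alias variants from a canonical tag before a transformation writes them to the view. The expansion rules below are illustrative assumptions; real alias logic is site-specific:

```python
def generate_aliases(tag: str) -> list[str]:
    """Derive common alias variants for a canonical tag such as 'FT-101A'.

    The variant rules here are assumptions for illustration; tailor them
    to how tags actually appear in your drawings.
    """
    variants = {tag}
    variants.add(tag.replace("-", ""))   # compact form, e.g. FT101A
    variants.add(tag.replace("-", " "))  # spaced form, e.g. FT 101A
    variants.add(tag.replace("-", "_"))  # underscored form, e.g. FT_101A
    return sorted(variants)
```

A transformation (SQL or SDK-based) would then write this list to the `aliases` property of each file and target entity instance, so that every string a diagram might contain for the entity is available for matching.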
+ +## Deployment Steps + +_**NOTE:** I'm constantly improving this template, thus some parts of the video walkthroughs are from an older version. The video tutorials below are still **relevant**. Any breaking changes will receive a new video tutorial._ + +_(If videos fail to load, try loading the page in incognito or re-sign into GitHub)_ + +### Step 1: Create a CDF Project through Toolkit + +Follow the [CDF Toolkit guide](https://docs.cognite.com/cdf/deploy/cdf_toolkit/) to set up your project. + +Optionally, initialize the quickstart package using toolkit CLI: + +```bash +poetry init +poetry add cognite-toolkit +poetry run cdf modules init +``` + + + + + +### Step 2: Integrate the Module + +1. Move the `local_setup/` folder to the root and unpack `.vscode/` and `.env.tmpl` +2. Update the `default.config.yaml` file with project-specific configurations +3. Add the module name to the list of selected modules in your `config.{env}.yaml` file +4. Create a `.env` file with credentials pointing to your CDF project + + + + + +### Step 3: Build and Deploy the Module + +1. (Optional) Build and deploy the quickstart template modules +2. Build and deploy this module: + +```bash +poetry run cdf build --env dev +poetry run cdf deploy --dry-run +poetry run cdf deploy +``` + +#### Example Configuration File + +Below is an example `config.{env}.yaml` configuration: + +```yaml +# config.{env}.yaml used in examples below +environment: + name: dev + project: + validation-type: dev + selected: + - modules/ + +variables: + modules: + # stuff from quickstart package... + organization: tx + + # ...
+ + cdf_ingestion: + workflow: ingestion + groupSourceId: + ingestionClientId: ${IDP_CLIENT_ID} # Changed from ${INGESTION_CLIENT_ID} + ingestionClientSecret: ${IDP_CLIENT_SECRET} # Changed from ${INGESTION_CLIENT_SECRET} + pandidContextualizationFunction: contextualization_p_and_id_annotater + contextualization_connection_writer: contextualization_connection_writer + schemaSpace: sp_enterprise_process_industry + schemaSpace2: cdf_cdm + schemaSpace3: cdf_idm + instanceSpaces: + - springfield_instances + - cdf_cdm_units + runWorkflowUserIds: + - + + contextualization: + cdf_file_annotation: + # used in /data_sets, /data_models, /functions, /extraction_pipelines, and /workflows + annotationDatasetExternalId: ds_file_annotation + + # used in /data_models and /extraction_pipelines + annotationStateExternalId: FileAnnotationState + annotationStateInstanceSpace: sp_dat_cdf_annotation_states + annotationStateSchemaSpace: sp_hdm #NOTE: stands for space helper data model + annotationStateVersion: v1.0.1 + fileSchemaSpace: sp_enterprise_process_industry + fileExternalId: txFile + fileVersion: v1 + + # used in /raw and /extraction_pipelines + rawDb: db_file_annotation + rawTableDocTag: annotation_documents_tags + rawTableDocDoc: annotation_documents_docs + rawTableCache: annotation_entities_cache + + # used in /extraction_pipelines + extractionPipelineExternalId: ep_file_annotation + targetEntitySchemaSpace: sp_enterprise_process_industry + targetEntityExternalId: txEquipment + targetEntityVersion: v1 + + # used in /functions and /workflows + launchFunctionExternalId: fn_file_annotation_launch #NOTE: if this is changed, then the folder holding the launch function must be named the same as the new external ID + launchFunctionVersion: v1.0.0 + finalizeFunctionExternalId: fn_file_annotation_finalize #NOTE: if this is changed, then the folder holding the finalize function must be named the same as the new external ID + finalizeFunctionVersion: v1.0.0 + functionClientId: 
${IDP_CLIENT_ID} + functionClientSecret: ${IDP_CLIENT_SECRET} + + # used in /workflows + workflowSchedule: "*/10 * * * *" + workflowExternalId: wf_file_annotation + workflowVersion: v1 + + # used in /auth + groupSourceId: # source ID from Azure AD for the corresponding groups + + + # ... +``` + + + + + +### Step 4: Run the Workflow + +After deployment, the annotation process is managed by a workflow that orchestrates the `Launch` and `Finalize` functions. The workflow is automatically triggered based on the schedule defined in the configuration. You can monitor the progress and logs of the functions in the CDF UI. + +**Optional preparatory steps:** + +1. Run the ingestion workflow from the quickstart package to create instances of `File`, `Asset`, etc. +2. Check out the instantiated files that have been annotated using the annotation function from the quickstart package +3. Run the `local_setup.ipynb` notebook to set up the files for annotation + +**Run the File Annotation Workflow** in the CDF UI and monitor its progress. + + + + + + + + + +## Local Development and Debugging + +This template is configured for easy local execution and debugging directly within Visual Studio Code. + +### Setup Instructions + +1. **Create Environment File**: Before running locally, you must create a `.env` file in the root directory. This file will hold the necessary credentials and configuration for connecting to your CDF project. Populate it with the required environment variables for `IDP_CLIENT_ID`, `CDF_CLUSTER`, etc. In the `local_runs/` folder you'll find a `.env` template. + +2. **Use the VS Code Debugger**: The repository includes a pre-configured `local_runs/.vscode/launch.json` file. Move the `.vscode/` folder to the top level of your repo. 
+ + - Navigate to the "Run and Debug" view in the VS Code sidebar + - You will see dropdown options for launching the different functions (e.g., `Launch Function`, `Finalize Function`) + - Select the function you wish to run and click the green "Start Debugging" arrow + - This will start the function on your local machine, with the debugger attached, allowing you to set breakpoints and inspect variables + - Feel free to change/adjust the arguments passed into the function call to point to a test extraction pipeline and/or change the log level + + + +## Troubleshooting + +### Common Issues + +- **Authentication Errors**: Ensure your `.env` file contains valid credentials and that your service principal has the necessary permissions +- **Module Not Found**: Verify that the module is listed in your `config.{env}.yaml` file under `selected` +- **Function Deployment Fails**: Check that the function folder names match the external IDs defined in your configuration +- **Workflow Not Triggering**: Verify the workflow schedule is valid cron syntax and that the workflow has been deployed successfully + +For additional help, please refer to the [detailed guides](detailed_guides/) or [open an issue](../../issues) on GitHub. + +## Next Steps + +After successful deployment: + +1. Review the [Configuration Guide](detailed_guides/CONFIG.md) to understand all available options +2. Check the [Configuration Patterns Guide](detailed_guides/CONFIG_PATTERNS.md) for common use cases +3. Explore the [Development Guide](detailed_guides/DEVELOPING.md) if you need to extend functionality +4. 
Monitor your workflows and extraction pipelines in the CDF UI + +--- + +Return to [Main README](README.md) diff --git a/modules/contextualization/cdf_file_annotation/README.md b/modules/contextualization/cdf_file_annotation/README.md index ee3f4638..086f0b5b 100644 --- a/modules/contextualization/cdf_file_annotation/README.md +++ b/modules/contextualization/cdf_file_annotation/README.md @@ -1,4 +1,4 @@ -# Cognite Data Model-Based Annotation Function +# Cognite Data Model-Based Annotation Module ## Overview @@ -7,189 +7,30 @@ The Annotation template is a framework designed to automate the process of annot ## Key Features - **Configuration-Driven Workflow:** The entire process is controlled by a single config.yaml file, allowing adaptation to different data models and operational parameters without code changes. +- **Dual Annotation Modes**: Simultaneously runs standard entity matching and pattern-based detection mode: + - **Standard Mode**: Links files to known entities in your data model with confidence-based approval thresholds. + - **Pattern Mode**: Automatically generates regex-like patterns from entity aliases and detects all matching text in files, creating a comprehensive searchable catalog of potential entities for review and approval. +- **Automatic Pattern Promotion:** Post-processes pattern-mode annotations to automatically resolve cross-scope entity references using intelligent text matching and multi-tier caching, dramatically reducing manual review burden. +- **Intelligent Pattern Generation:** Automatically analyzes entity aliases to generate pattern samples, with support for manual pattern overrides at global, site, or unit levels. - **Large Document Support (\>50 Pages):** Automatically handles files with more than 50 pages by breaking them into manageable chunks, processing them iteratively, and tracking the overall progress. 
- **Parallel Execution Ready:** Designed for concurrent execution with a robust optimistic locking mechanism to prevent race conditions when multiple finalize function instances run in parallel. -- **Detailed Reporting:** Local logs and processed annotation details stored in CDF RAW tables, fucntion logs, and extraction pipeline runs for auditing and analysis. -- **Local Running and Debugging:** Both the launch and finalize handler can be ran locally and have default setups in the 'Run & Debug' tab in vscode. Requires a .env file to be placed in the directory. - - +- **Comprehensive Reporting:** Annotations stored in three dedicated RAW tables (doc-to-doc links, doc-to-tag links, and pattern detections) plus extraction pipeline logs for full traceability. +- **Local Running and Debugging:** All function handlers can be run locally and have default setups in the 'Run & Debug' tab in VSCode. Requires a .env file to be placed in the directory. ## Getting Started -Deploying this annotation module into a new Cognite Data Fusion (CDF) project is a streamlined process. Since all necessary resources (Data Sets, Extraction Pipelines, Functions, etc.) are bundled into a single module, you only need to configure one file to get started. - -### Prerequisites - -- Python 3.11+ -- An active Cognite Data Fusion (CDF) project. -- The required Python packages are listed in the `cdf_file_annotation/functions/fn_file_annotation_launch/requirements.txt` and `cdf_file_annotation/functions/fn_file_annotation_finalize/requirements.txt` files. -- Alias and tag generation is abstracted out of the annotation function. Thus, you'll need to create a transformation that populates the `aliases` and `tags` property of your file and target entity view. - - The `aliases` property is used to match files with entities and should contain a list of alternative names or identifiers that can be found in the files image. - - The `tags` property serves multiple purposes and consists of the following... 
- - (`DetectInDiagrams`) Identifies files and assets to include as entities filtered by primary scope and secondary scope (if provided). - - (`ScopeWideDetect`) Identifies files and asset to include as entities filtered by a primary scope. - - (`ToAnnotate`) Identifies files that need to be annotated. - - (`AnnotationInProcess`) Identifies files that are in the process of being annotated. - - (`Annotated`) Identifies files that have been annotated. - - (`AnnotationFailed`) Identifies files that have failed the annotation process. Either by erroring out or by receiving 0 possible matches. - - Don't worry if these concepts don't immediately make sense. Aliases and tags are explained in greater detail in the detailed_guides/ documentation. The template also includes a jupyter notebook that prepare the files and assets for annotation if using the toolkit's quickstart module. - -### Deployment Steps - -_**NOTE:** I'm constantly improving this template, thus some parts of the video walkthroughs are from an older version. The video tutorials below are still **relevant**. Any breaking changes will receive a new video tutorial._ - -_(if videos fail to load, try loading page in incognito or re-sign into github) ~ Hope y'all enjoy :)_ - -1. **Create a CDF Project through Toolkit** - - Follow the guide [here](https://docs.cognite.com/cdf/deploy/cdf_toolkit/) - - (optional) Initialize the quickstart package using toolkit CLI - ```bash - poetry init - poetry add cognite-toolkit - poetry run cdf modules init - ``` - - - - - -2. **Integrate the Module** - - Move the `local_setup/` folder to the root and unpack .vscode/ and .env.tmpl - - Update the default.config.yaml file with project-specific configurations - - Add the module name to the list of selected modules in your config.{env}.yaml file - - Make sure to create a .env file with credentials pointing to your CDF project - - - - - -3. 
**Build and Deploy the Module** - - - (optional) Build and deploy the quickstart template modules - - Build and deploy this module - - ```bash - poetry run cdf build --env dev - poetry run cdf deploy --dry-run - poetry run cdf deploy - ``` - - ```yaml - # config..yaml used in examples below - environment: - name: dev - project: - validation-type: dev - selected: - - modules/ - - variables: - modules: - # stuff from quickstart package... - organization: tx - - # ... - - cdf_ingestion: - workflow: ingestion - groupSourceId: - ingestionClientId: ${IDP_CLIENT_ID} # Changed from ${INGESTION_CLIENT_ID} - ingestionClientSecret: ${IDP_CLIENT_SECRET} # Changed from ${INGESTION_CLIENT_SECRET} - pandidContextualizationFunction: contextualization_p_and_id_annotater - contextualization_connection_writer: contextualization_connection_writer - schemaSpace: sp_enterprise_process_industry - schemaSpace2: cdf_cdm - schemaSpace3: cdf_idm - instanceSpaces: - - springfield_instances - - cdf_cdm_units - runWorkflowUserIds: - - - - contextualization: - cdf_file_annotation: - # used in /data_sets, /data_models, /functions, /extraction_pipelines, and /workflows - annotationDatasetExternalId: ds_file_annotation - - # used in /data_models and /extraction_pipelines - annotationStateExternalId: FileAnnotationState - annotationStateInstanceSpace: sp_dat_cdf_annotation_states - annotationStateSchemaSpace: sp_hdm #NOTE: stands for space helper data model - annotationStateVersion: v1.0.1 - fileSchemaSpace: sp_enterprise_process_industry - fileExternalId: txFile - fileVersion: v1 - - # used in /raw and /extraction_pipelines - rawDb: db_file_annotation - rawTableDocTag: annotation_documents_tags - rawTableDocDoc: annotation_documents_docs - rawTableCache: annotation_entities_cache - - # used in /extraction_pipelines - extractionPipelineExternalId: ep_file_annotation - targetEntitySchemaSpace: sp_enterprise_process_industry - targetEntityExternalId: txEquipment - targetEntityVersion: v1 - - # used in 
/functions and /workflows - launchFunctionExternalId: fn_file_annotation_launch #NOTE: if this is changed, then the folder holding the launch function must be named the same as the new external ID - launchFunctionVersion: v1.0.0 - finalizeFunctionExternalId: fn_file_annotation_finalize #NOTE: if this is changed, then the folder holding the finalize function must be named the same as the new external ID - finalizeFunctionVersion: v1.0.0 - functionClientId: ${IDP_CLIENT_ID} - functionClientSecret: ${IDP_CLIENT_SECRET} - - # used in /workflows - workflowSchedule: "*/10 * * * *" - workflowExternalId: wf_file_annotation - workflowVersion: v1 - - # used in /auth - groupSourceId: # source ID from Azure AD for the corresponding groups - - # ... - ``` - - - - - -4. **Run the Workflow** - - After deployment, the annotation process is managed by a workflow that orchestrates the `Launch` and `Finalize` functions. The workflow is automatically triggered based on the schedule defined in the configuration. You can monitor the progress and logs of the functions in the CDF UI. - - - (optional) Run the ingestion workflow from the quickstart package to create instances of File, Asset, etc - - (optional) Checkout the instantiated files that have been annotated using the annotation function from the quickstart package - - (optional) Run the local_setup.ipynb to setup the files for annotation - - Run the File Annotation Workflow - - - - - - - - - -### Local Development and Debugging - -This template is configured for easy local execution and debugging directly within Visual Studio Code. - -1. **Create Environment File**: Before running locally, you must create a `.env` file in the root directory. This file will hold the necessary credentials and configuration for connecting to your CDF project. Populate it with the required environment variables for `IDP_CLIENT_ID`, `CDF_CLUSTER`, etc. In the `local_runs/` folder you'll find a .env template. +Ready to deploy? 
Check out the **[Deployment Guide](DEPLOYMENT.md)** for step-by-step instructions on: -2. **Use the VS Code Debugger**: The repository includes a pre-configured `local_runs/.vscode/launch.json` file. Please move the .vscode/ folder to the top level of your repo. +- Prerequisites and data preparation requirements +- CDF Toolkit setup +- Module integration and configuration +- Local development and debugging - - Navigate to the "Run and Debug" view in the VS Code sidebar. - - You will see dropdown options for launching the different functions (e.g., `Launch Function`, `Finalize Function`). - - Select the function you wish to run and click the green "Start Debugging" arrow. This will start the function on your local machine, with the debugger attached, allowing you to set breakpoints and inspect variables. - - Feel free to change/adjust the arguments passed into the function call to point to a test_extraction_pipeline and/or change the log level. - - +For a quick overview, deploying this annotation module into a new Cognite Data Fusion (CDF) project is a streamlined process. Since all necessary resources (Data Sets, Extraction Pipelines, Functions, etc.) are bundled into a single module, you only need to configure one file to get started. ## How It Works -The template operates in three main phases, orchestrated by CDF Workflows. Since the prepare phase is relatively small, it is bundled in with the launch phase. However, conceptually it should be treated as a separate process. +The template operates in four main phases, orchestrated by CDF Workflows. Since the prepare phase is relatively small, it is bundled in with the launch phase. However, conceptually it should be treated as a separate process. ### Prepare Phase @@ -201,49 +42,253 @@ The template operates in three main phases, orchestrated by CDF Workflows. 
Since ### Launch Phase -![LaunchService](https://github.com/user-attachments/assets/3e5ba403-50bb-4f6a-a723-be8947c65ebc) - - **Goal**: Launch the annotation jobs for files that are ready. - **Process**: 1. It queries for `AnnotationState` instances with a "New" or "Retry" status. - 2. It groups these files by a primary scope to provide context. - 3. For each group, it fetches the relevant file and target entity information, using a cache to avoid redundant lookups. - 4. It calls the Cognite Diagram Detect API to start the annotation job. - 5. It updates the `AnnotationState` instance with the `diagramDetectJobId` and sets the status to "Processing". + 2. It groups these files by a primary scope (e.g., site, unit) to provide operational context. + 3. For each group, it fetches the relevant file and target entity information using an intelligent caching system: + - Checks if a valid cache exists in RAW (based on scope and time limit). + - If cache is stale or missing, queries the data model for entities within scope. + - Automatically generates pattern samples from entity aliases (e.g., "FT-101A" β†’ "[FT]-000[A]"). + - Retrieves manual pattern overrides from RAW catalog (GLOBAL, site-level, or unit-level). + - Merges and deduplicates auto-generated and manual patterns. + - Stores the combined entity list and pattern samples in RAW cache for reuse. + 4. It calls the Cognite Diagram Detect API to initiate two async jobs: + - A `standard annotation` job to find and link known entities with confidence scoring. + - A `pattern mode` job (if enabled) to detect all text matching the pattern samples, creating a searchable reference catalog. + 5. It updates the `AnnotationState` instance with both the `diagramDetectJobId` and `patternModeJobId` (if applicable) and sets the overall `annotationStatus` to "Processing". +
<details>
<summary>Click to view Mermaid flowchart for Launch Phase</summary>

```mermaid
flowchart TD
    Start([Start Launch Phase]) --> QueryFiles[Query AnnotationState<br/>for New or Retry status]
    QueryFiles --> CheckFiles{Any files<br/>to process?}
    CheckFiles -->|No| End([End])
    CheckFiles -->|Yes| GroupFiles[Group files by<br/>primary scope<br/>e.g., site, unit]

    GroupFiles --> NextScope{Next scope<br/>group?}
    NextScope -->|Yes| CheckCache{Valid cache<br/>exists in RAW?}

    CheckCache -->|No - Stale/Missing| QueryEntities[Query data model for<br/>entities within scope]
    QueryEntities --> GenPatterns[Auto-generate pattern samples<br/>from entity aliases<br/>e.g., FT-101A → #91;FT#93;-000#91;A#93;]
    GenPatterns --> GetManual[Retrieve manual pattern<br/>overrides from RAW catalog<br/>GLOBAL, site, or unit level]
    GetManual --> MergePatterns[Merge and deduplicate<br/>auto-generated and<br/>manual patterns]
    MergePatterns --> StoreCache[Store entity list and<br/>pattern samples in<br/>RAW cache]
    StoreCache --> UseCache[Use entities and patterns]

    CheckCache -->|Yes - Valid| LoadCache[Load entities and<br/>patterns from RAW cache]
    LoadCache --> UseCache

    UseCache --> ProcessBatch[Process files in batches<br/>up to max batch size]
    ProcessBatch --> SubmitJobs[Submit Diagram Detect jobs:<br/>1 Standard annotation<br/>2 Pattern mode if enabled]
    SubmitJobs --> UpdateState[Update AnnotationState:<br/>- Set status to Processing<br/>- Store both job IDs]
    UpdateState --> NextScope
    NextScope -->|No more groups| QueryFiles

    style Start fill:#d4f1d4
    style End fill:#f1d4d4
    style CheckFiles fill:#fff4e6
    style CheckCache fill:#fff4e6
    style NextScope fill:#fff4e6
    style UseCache fill:#e6f3ff
    style UpdateState fill:#e6f3ff
```

</details>
### Finalize Phase -![FinalizeService](https://github.com/user-attachments/assets/152d9eaf-afdb-46fe-9125-11430ff10bc9) - - **Goal**: Retrieve, process, and store the results of completed annotation jobs. - **Process**: - 1. It queries for `AnnotationState` instances with a "Processing" status. - 2. It checks the status of the corresponding diagram detection job. - 3. Once a job is complete, it retrieves the annotation results. - 4. It applies the new annotations, optionally cleaning up old ones first. - 5. It updates the `AnnotationState` status to "Annotated" or "Failed" and tags the file accordingly. - 6. It writes a summary of the approved annotations to a CDF RAW table for reporting. + 1. It queries for `AnnotationState` instances with a "Processing" or "Finalizing" status (using optimistic locking to claim jobs). + 2. It waits until both the standard and pattern mode jobs for a given file are complete. + 3. It retrieves and processes the results from both jobs: + - Creates a stable hash for each detection to enable deduplication between standard and pattern results. + - Filters standard annotations by confidence thresholds (auto-approve vs. suggest). + - Skips pattern detections that duplicate standard annotations. + 4. It optionally cleans old annotations first (on first run for multi-page files), then: + - **Standard annotations**: Creates edges in the data model linking files to specific entities, writes results to RAW tables (`doc_tag` for assets, `doc_doc` for file-to-file links). + - **Pattern annotations**: Creates edges linking files to a configurable "sink node" for review, writes results to a dedicated `doc_pattern` RAW table for the searchable catalog. + 5. Updates the file node tag from "AnnotationInProcess" to "Annotated". + 6. Updates the `AnnotationState` status to "Annotated", "Failed", or back to "New" (if more pages remain), tracking page progress for large files. +
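Step 3's deduplication and confidence filtering can be sketched as below. The hash inputs, field names, and threshold values are assumptions for illustration; the module's actual implementation and configured thresholds may differ:

```python
import hashlib

AUTO_APPROVE_THRESHOLD = 0.85  # assumed; real values come from the pipeline config
SUGGEST_THRESHOLD = 0.50       # assumed

def detection_key(file_xid: str, page: int, text: str) -> str:
    """Stable hash identifying a detection, so a pattern-mode hit that
    duplicates a standard annotation can be recognized and skipped."""
    return hashlib.sha256(f"{file_xid}|{page}|{text}".encode()).hexdigest()

def triage(standard: list[dict], pattern: list[dict]) -> tuple[list, list, list]:
    """Split standard detections into approved/suggested and drop pattern
    detections that duplicate a standard one."""
    approved, suggested, catalog, seen = [], [], [], set()
    for det in standard:
        seen.add(detection_key(det["file"], det["page"], det["text"]))
        if det["confidence"] >= AUTO_APPROVE_THRESHOLD:
            approved.append(det)
        elif det["confidence"] >= SUGGEST_THRESHOLD:
            suggested.append(det)
    for det in pattern:
        if detection_key(det["file"], det["page"], det["text"]) in seen:
            continue  # already covered by a standard annotation
        catalog.append(det)
    return approved, suggested, catalog
```

Because the hash is stable across jobs, the same detection always collapses to the same key regardless of which job reported it.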
+Click to view Mermaid flowchart for Finalize Phase + + ```mermaid + flowchart TD + Start([Start Finalize Phase]) --> QueryState[Query for ONE AnnotationState
with Processing status
Use optimistic locking to claim it] + QueryState --> CheckState{Found annotation
state instance?} + CheckState -->|No| End([End]) + CheckState -->|Yes| GetJobId[Extract job ID and
pattern mode job ID] + + GetJobId --> FindFiles[Find ALL files with
the same job ID] + FindFiles --> CheckJobs{Both standard
and pattern jobs
complete?} + CheckJobs -->|No| ResetStatus[Update AnnotationStates
back to Processing
Wait 30 seconds] + ResetStatus --> QueryState + + CheckJobs -->|Yes| RetrieveResults[Retrieve results from
both completed jobs] + RetrieveResults --> MergeResults[Merge regular and pattern
results by file ID
Creates unified result per file] + MergeResults --> LoopFiles[For each file in merged results] + + LoopFiles --> ProcessResults[Process file results:
- Create stable hash for deduplication
- Filter standard by confidence threshold
- Skip pattern duplicates] + + ProcessResults --> CheckClean{First run for
multi-page file?} + CheckClean -->|Yes| CleanOld[Clean old annotations] + CheckClean -->|No| CreateEdges + CleanOld --> CreateEdges[Create edges in data model] + + CreateEdges --> StandardEdges[Standard annotations:
Link file to entities
Write to doc_tag and doc_doc RAW tables] + StandardEdges --> PatternEdges[Pattern annotations:
Link file to sink node
Write to doc_pattern RAW table] + + PatternEdges --> UpdateTag[Update file tag:
AnnotationInProcess → Annotated] + UpdateTag --> PrepareUpdate[Prepare AnnotationState update:
- Annotated if complete
- Failed if error
- New if more pages remain
Track page progress] + + PrepareUpdate --> MoreFiles{More files in
merged results?} + MoreFiles -->|Yes| LoopFiles + MoreFiles -->|No| BatchUpdate[Batch update ALL
AnnotationState instances
for this job] + + BatchUpdate --> QueryState + + style Start fill:#d4f1d4 + style End fill:#f1d4d4 + style CheckState fill:#fff4e6 + style CheckJobs fill:#fff4e6 + style CheckClean fill:#fff4e6 + style MoreFiles fill:#fff4e6 + style MergeResults fill:#e6f3ff + style ProcessResults fill:#e6f3ff + style CreateEdges fill:#e6f3ff + style BatchUpdate fill:#e6f3ff + ``` + +
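The "optimistic locking to claim it" step at the top of the flowchart is essentially a compare-and-set on the instance version. A minimal in-memory sketch of that pattern, assuming a version field on each state record (this illustrates the idea, not the actual SDK call):

```python
def try_update(store, external_id, expected_version, **changes):
    """Optimistic-locking claim: apply changes only if the stored version
    still matches the version we read. Mirrors the conditional write a
    data-model upsert performs when an expected version is supplied.
    """
    node = store[external_id]
    if node["version"] != expected_version:
        return False  # another function call claimed the state first
    node.update(changes)
    node["version"] += 1
    return True

store = {"state:pid-001": {"status": "Processing", "version": 3}}
# Two concurrent finalize calls both read version 3; only the first wins.
claimed = try_update(store, "state:pid-001", 3, status="Finalizing")
stale = try_update(store, "state:pid-001", 3, status="Finalizing")
```

The losing caller simply moves on to the next `AnnotationState`, so two finalize invocations never process the same job.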
+ +### Promote Phase + +- **Goal**: Automatically resolve pattern-mode annotations by finding matching entities and updating edges from the sink node to actual entities. +- **Process**: + 1. Queries for pattern-mode annotation edges (edges pointing to the sink node with status "Suggested"). + 2. Groups candidates by unique text to process each text only once per batch. + 3. For each unique text: + - Generates text variations to handle different naming conventions (case, special characters, leading zeros). + - Searches for matching entities using a multi-tier caching strategy: + - **TIER 1**: In-memory cache (fastest, this run only). + - **TIER 2**: Persistent RAW cache (shared across runs and with manual promotions). + - **TIER 3**: Entity search via data model (queries the smaller, stable entity dataset). + - Updates all edges with the same text based on search results. + 4. Updates edges and RAW tables based on results: + - **Approved**: A single unambiguous match is found → the edge is repointed to the actual entity and tagged "PromotedAuto". + - **Rejected**: No match is found → the edge stays on the sink node and is tagged "PromoteAttempted". + - **Suggested**: Multiple ambiguous matches are found → the edge is kept for manual review and tagged "AmbiguousMatch". + 5. Runs continuously (designed for repeated execution) until all resolvable pattern annotations are promoted. +
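The variation generation and three-tier lookup described above can be sketched as follows. The normalization rules shown (case folding, separator stripping, leading-zero removal) are illustrative assumptions; the template's `entitySearchService` configuration controls the real set:

```python
import re

def text_variations(text: str) -> set[str]:
    """Candidate spellings for a detected tag: case, separators, leading zeros."""
    variants = {text, text.upper(), text.lower()}
    for v in list(variants):
        variants.add(v.replace("-", ""))   # "V-0912" -> "V0912"
        variants.add(v.replace("-", "_"))  # "V-0912" -> "V_0912"
    for v in list(variants):
        # Drop leading zeros in digit runs: "V-0912" -> "V-912"
        variants.add(re.sub(r"(?<=\D)0+(\d)", r"\1", v))
    return variants

def resolve(text, mem_cache, raw_cache, search_entities):
    """Three-tier lookup for one unique detected text."""
    if text in mem_cache:                             # TIER 1: this run only
        return mem_cache[text]
    if text in raw_cache:                             # TIER 2: persisted in RAW
        mem_cache[text] = raw_cache[text]
        return raw_cache[text]
    matches = search_entities(text_variations(text))  # TIER 3: data model query
    if len(matches) == 1:                             # cache only unambiguous hits
        mem_cache[text] = raw_cache[text] = matches[0]
        return matches[0]
    return None  # no match or ambiguous: edge stays on the sink node

mem, raw = {}, {}
aliases = {"V-912": "asset-42"}  # hypothetical alias -> entity mapping
def search(variants):
    return sorted({aliases[v] for v in variants if v in aliases})

first = resolve("V-0912", mem, raw, search)   # TIER 3, then cached
second = resolve("V-0912", mem, raw, search)  # TIER 1 hit
missing = resolve("XX-999", mem, raw, search)
```

Because ambiguous results are never cached, a later run can still resolve them once the entity data improves.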
+Click to view Mermaid flowchart for Promote Phase + + ```mermaid + flowchart TD + Start([Start Promote Phase]) --> QueryEdges[Query for pattern-mode edges
pointing to sink node
with Suggested status] + QueryEdges --> CheckEdges{Any edges
to promote?} + CheckEdges -->|No| End([End]) + CheckEdges -->|Yes| GroupText[Group edges by
unique text + type
Process each text once] + + GroupText --> NextText{Next unique
text?} + NextText -->|Yes| GenVariations[Generate text variations
Case, special chars, zeros
e.g., V-0912 → 8 variations] + + GenVariations --> CheckMemCache{In-memory
cache hit?} + CheckMemCache -->|Yes| UseMemCache[Use cached entity
TIER 1: Fastest] + CheckMemCache -->|No| CheckRAWCache{Persistent RAW
cache hit?} + + CheckRAWCache -->|Yes| UseRAWCache[Use cached entity
TIER 2: Fast
Populate in-memory cache] + CheckRAWCache -->|No| SearchEntities[Query entities via
data model
TIER 3: Server-side IN filter
on aliases property] + + SearchEntities --> CacheResult{Match found
and unambiguous?} + CacheResult -->|Yes| CachePositive[Cache positive result
in-memory + RAW] + CacheResult -->|No match| CacheNegative[Cache negative result
in-memory only] + CacheResult -->|Ambiguous| NoCache[Don't cache
ambiguous results] + + UseMemCache --> ProcessResult + UseRAWCache --> ProcessResult + CachePositive --> ProcessResult[Determine result type:
Single match, No match,
or Ambiguous] + CacheNegative --> ProcessResult + NoCache --> ProcessResult + + ProcessResult --> UpdateEdges{Result type?} + UpdateEdges -->|Single Match| ApproveEdges[Update ALL edges with this text:
- Point to matched entity
- Status: Approved
- Tag: PromotedAuto
- Update RAW pattern table] + UpdateEdges -->|No Match| RejectEdges[Update ALL edges with this text:
- Keep on sink node
- Status: Rejected
- Tag: PromoteAttempted
- Update RAW pattern table] + UpdateEdges -->|Ambiguous| FlagEdges[Update ALL edges with this text:
- Keep on sink node
- Status: Suggested
- Tags: PromoteAttempted,
AmbiguousMatch
- Update RAW pattern table] + + ApproveEdges --> BatchUpdate[Batch update edges
and RAW rows in CDF] + RejectEdges --> BatchUpdate + FlagEdges --> BatchUpdate + + BatchUpdate --> NextText + NextText -->|No more texts| QueryEdges + + style Start fill:#d4f1d4 + style End fill:#f1d4d4 + style CheckEdges fill:#fff4e6 + style CheckMemCache fill:#fff4e6 + style CheckRAWCache fill:#fff4e6 + style CacheResult fill:#fff4e6 + style UpdateEdges fill:#fff4e6 + style NextText fill:#fff4e6 + style UseMemCache fill:#e6ffe6 + style UseRAWCache fill:#e6f3ff + style SearchEntities fill:#ffe6e6 + style ProcessResult fill:#e6f3ff + style BatchUpdate fill:#e6f3ff + ``` + +
## Configuration -The templates behavior is entirely controlled by the `ep_file_annotation.config.yaml` file. This YAML file is parsed by Pydantic models in the code, ensuring a strongly typed and validated configuration. +The template's behavior is entirely controlled by the `ep_file_annotation.config.yaml` file. This YAML file is parsed by Pydantic models in the code, ensuring a strongly typed and validated configuration. Key configuration sections include: -- `dataModelViews`: Defines the data model views for files, annotation states, and target entities. -- `prepareFunction`: Configures the queries to find files to annotate. -- `launchFunction`: Sets parameters for the annotation job, such as batch size and entity matching properties. -- `finalizeFunction`: Defines how to process and apply the final annotations. +- `dataModelViews`: Defines the data model views for files, annotation states, core annotations, and target entities. +- `prepareFunction`: Configures the queries to find files to annotate and optionally reset. +- `launchFunction`: Sets parameters for the annotation job: + - `batchSize`: Maximum files per diagram detect call (1-50). + - `patternMode`: Boolean flag to enable pattern-based detection alongside standard matching. + - `primaryScopeProperty` / `secondaryScopeProperty`: Properties used for batching and cache scoping (e.g., "site", "unit"). + - `cacheService`: Configuration for entity cache storage and time limits. + - `annotationService`: Diagram detect parameters including `pageRange` for multi-page file processing. +- `finalizeFunction`: Defines how to process and apply the final annotations: + - `autoApprovalThreshold` / `autoSuggestThreshold`: Confidence thresholds for standard annotations. + - `cleanOldAnnotations`: Whether to remove existing annotations before applying new ones. + - `maxRetryAttempts`: Retry limit for failed files. + - `sinkNode`: Target node for pattern mode annotations pending review. 
+- `promoteFunction`: Configures automatic resolution of pattern-mode annotations: + - `getCandidatesQuery`: Query to find pattern-mode edges to promote (batch size controlled via limit). + - `entitySearchService`: Controls entity search and text normalization (case, special chars, leading zeros). + - `cacheService`: Configuration for the persistent text→entity cache shared across runs and with manual promotions. + - `rawDb` / `rawTableDocPattern`: Location of RAW tables for storing promotion results. + +This file allows for deep customization. For example, you can use a list of query configurations to combine them with `OR` logic, or you can set `primaryScopeProperty` to `None` to process files that are not tied to a specific scope. Manual pattern samples can be added to the RAW catalog at `GLOBAL`, site, or unit levels to override or supplement auto-generated patterns. + +## Documentation + +This README provides a high-level overview of the template's purpose and architecture. For more detailed information: + +### Deployment & Setup -This file allows for deep customization. For example, you can use a list of query configurations to combine them with `OR` logic, or you can set `primaryScopeProperty` to `None` to process files that are not tied to a specific scope. +- **[Deployment Guide](DEPLOYMENT.md)**: Step-by-step instructions for deploying to CDF, including prerequisites, configuration, and local debugging setup. -## Detailed Guides +### Configuration & Usage -This README provides a high-level overview of the template's purpose and architecture. To gain a deeper understanding of how to configure and extend the template, I highly recommend exploring the detailed guides located in the `cdf_file_annotation/detailed_guides/` directory: +- **[CONFIG.md](detailed_guides/CONFIG.md)**: Comprehensive guide to the `ep_file_annotation.config.yaml` file and all configuration options.
+- **[CONFIG_PATTERNS.md](detailed_guides/CONFIG_PATTERNS.md)**: Recipes for common operational tasks, including processing specific subsets, reprocessing files, and performance tuning. -- **`CONFIG.md`**: A document outlining the `ep_file_annotation.config.yaml` file to control the behavior of the Annotation Function. -- **`CONFIG_PATTERNS.md`**: A guide with recipes for common operational tasks, such as processing specific subsets of data, reprocessing files for debugging, and tuning performance by adjusting the configuration. -- **`DEVELOPING.md`**: A guide for developers who wish to extend the template's functionality. It details the interface-based architecture and provides a step-by-step walkthrough on how to create and integrate your own custom service implementations for specialized logic. +### Development & Extension + +- **[DEVELOPING.md](detailed_guides/DEVELOPING.md)**: Guide for developers extending the template's functionality, including the interface-based architecture and how to create custom service implementations. + +### Contributing + +- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Guidelines for contributing to this project, including the issue/PR workflow, code standards, and review process. ## Design Philosophy @@ -274,7 +319,25 @@ Instead of using a simpler store like a RAW table to track the status of each fi When processing tens of thousands of files, naively fetching context for each file is inefficient. This module implements a significant optimization based on experiences with large-scale projects. - **Rationale:** For many projects, the entities relevant to a given file are often co-located within the same site or operational unit. By grouping files based on these properties before processing, we can create a highly effective cache. -- **Implementation:** The `launchFunction` configuration allows specifying a `primary_scope_property` and an optional `secondary_scope_property`. 
The `LaunchService` uses these properties to organize all files into ordered batches. The cache for entities is then loaded once for each context, drastically reducing the number of queries to CDF and improving overall throughput. +- **Implementation:** The `launchFunction` configuration allows specifying a `primary_scope_property` and an optional `secondary_scope_property`. The `LaunchService` uses these properties to organize all files into ordered batches. For each unique scope combination: + + 1. Check if a valid cache exists in RAW (scoped by primary/secondary values and time limit). + 2. If stale or missing, query the data model for all relevant entities within that scope. + 3. Transform entities into the format required by diagram detect. + 4. Automatically generate pattern samples by analyzing entity alias properties. + 5. Retrieve and merge manual pattern overrides from the RAW catalog. + 6. Store the complete entity list and pattern samples in RAW for reuse. + + This cache is loaded once per scope and reused for all files in that batch, drastically reducing the number of queries to CDF and improving overall throughput. The pattern generation process extracts common naming conventions from aliases, creating regex-like patterns that can match variations (e.g., detecting "FT-102A" even if only "FT-101A" was in the training data). + +### Efficient Entity Search for Pattern Promotion + +The promote function's entity search strategy is deliberately optimized for scale: + +- **Dataset Size Analysis:** When pattern-mode annotations need resolution, there are two potential query strategies: query annotation edges (to find proven matches) or query entities directly. Without property indexes on either `startNodeText` (edges) or `aliases` (entities), the smaller dataset wins. +- **Growth Patterns:** Annotation edges grow as O(Files × Entities), potentially reaching hundreds of thousands or millions.
Entity counts grow linearly and remain relatively stable at thousands. +- **Design Choice:** The promote function queries entities directly via server-side IN filters on the aliases property, avoiding the much larger annotation edge dataset. This provides 50-500x better performance at scale. +- **Self-Improving Cache:** The persistent RAW cache accumulates successful text→entity mappings over time and is shared between automated promotions and manual promotions from the Streamlit dashboard, creating a self-improving system. ### Interface-Based Extensibility @@ -285,6 +348,6 @@ The template is designed around a core set of abstract interfaces (e.g., `IDataM ## About Me -Hey everyone\! I'm Jack Zhao, the creator of this template. I want to give a huge shoutout to Thomas Molbach and Noah Karsky for providing invaluable input from a solution architect's point of view. I also want to thank Khaled Shaheen and Gayatri Babel for their help in building this. +Hey everyone\! I'm Jack Zhao, the creator of this template. I want to give a huge shoutout to Thomas Molbach, Noah Karsky, and Darren Downtain for providing invaluable input from a solution architect's point of view. I also want to thank Lucas Guimaraes, Khaled Shaheen, and Gayatri Babel for their help in building this. This code is my attempt to create a standard template that 'breaks' the cycle where projects build simple tools, outgrow them, and are then forced to build a new and often hard-to-reuse solution. My current belief is that it's impossible for a template to have long-term success if it's not built on the fundamental premise of being extended. Customer needs will evolve, and new product features will create new opportunities for optimization.
diff --git a/modules/contextualization/cdf_file_annotation/data_models/hdm.container.yaml b/modules/contextualization/cdf_file_annotation/data_models/hdm.container.yaml index 3aa8a3aa..707754e3 100644 --- a/modules/contextualization/cdf_file_annotation/data_models/hdm.container.yaml +++ b/modules/contextualization/cdf_file_annotation/data_models/hdm.container.yaml @@ -11,7 +11,8 @@ type: list: false type: int64 + name: Annotated page count annotationMessage: autoIncrement: false immutable: false nullable: true @@ -19,6 +23,7 @@ collation: ucs_basic list: false type: text + name: Annotation message annotationStatus: autoIncrement: false immutable: false @@ -27,6 +32,16 @@ collation: ucs_basic list: false type: text + name: Annotation status + patternModeMessage: + autoIncrement: false + immutable: false + nullable: true + type: + type: text + collation: ucs_basic + list: false + name: Pattern mode message attemptCount: autoIncrement: false immutable: false @@ -34,6 +49,7 @@ type: list: false type: int64 + name: Attempt count diagramDetectJobId: autoIncrement: false immutable: false @@ -41,6 +57,15 @@ type: list: false type: int64 + name: Diagram detect job Id + patternModeJobId: + autoIncrement: false + immutable: false + nullable: true + type: + list: false + type: int64 + name: Pattern mode job Id linkedFile: autoIncrement: false immutable: false @@ -48,6 +73,7 @@ type: list: false type: direct + name: Linked file pageCount: autoIncrement: false immutable: false @@ -55,5 +81,61 @@ type: list: false type: int64 + name: Page count + launchFunctionId: # NOTE: Id of the function that was called. Will be useful as an index for query calls. B-tree + type: + list: false + type: int64 + immutable: false + nullable: true + autoIncrement: false + name: Launch function Id + launchFunctionCallId: # NOTE: specific Id that points to the function log. Will be useful as an index for query calls.
B-tree + type: + list: false + type: int64 + immutable: false + nullable: true + autoIncrement: false + name: Launch function call Id + finalizeFunctionId: # NOTE: Id of the function that was called. Will be useful as an index for query calls. B-tree + type: + list: false + type: int64 + immutable: false + nullable: true + autoIncrement: false + name: Finalize function Id + finalizeFunctionCallId: # NOTE: specific Id that points to the function log. Will be useful as an index for query calls. B-tree + type: + list: false + type: int64 + immutable: false + nullable: true + autoIncrement: false + name: Finalize function call Id space: {{ annotationStateSchemaSpace }} usedFor: node + indexes: + annotationStatus: + indexType: btree + properties: + - annotationStatus + cursorable: true + diagramDetectJobId: + indexType: btree + properties: + - diagramDetectJobId + cursorable: true + launchFunction: + indexType: btree + properties: + - launchFunctionId + - launchFunctionCallId + cursorable: true + finalizeFunction: + indexType: btree + properties: + - finalizeFunctionId + - finalizeFunctionCallId + cursorable: true \ No newline at end of file diff --git a/modules/contextualization/cdf_file_annotation/data_models/hdm.node.yaml b/modules/contextualization/cdf_file_annotation/data_models/hdm.node.yaml new file mode 100644 index 00000000..1c46051c --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/data_models/hdm.node.yaml @@ -0,0 +1,24 @@ +- space: SolutionTagsInstanceSpace # NOTE: space that comes from enabling labels in Canvas UI + externalId: file_annotations_solution_tag + sources: + - source: + space: cdf_apps_shared + externalId: CogniteSolutionTag + version: 'v1' + type: view + properties: + name: 'File Annotations' + description: 'Label is used by canvases generated by the file annotation streamlit module. Can be used for any canvas related to file annotations.'
+ color: Green # NOTE: can't seem to get this working + +- space: {{patternModeInstanceSpace}} + externalId: {{patternDetectSink}} + sources: + - source: + space: cdf_cdm + externalId: CogniteFile # Using CogniteFile as a base type for simplicity + version: 'v1' + type: view + properties: + name: 'Pattern Detection Sink Node' + description: 'A single, static node used as the end target for all pattern detection edges. The actual detection details are stored on the edge itself.' \ No newline at end of file diff --git a/modules/contextualization/cdf_file_annotation/data_models/hdm.space.yaml b/modules/contextualization/cdf_file_annotation/data_models/hdm.space.yaml index 4c110ed6..fe392ce7 100644 --- a/modules/contextualization/cdf_file_annotation/data_models/hdm.space.yaml +++ b/modules/contextualization/cdf_file_annotation/data_models/hdm.space.yaml @@ -1,6 +1,7 @@ - description: Helper data model space name: {{ annotationStateSchemaSpace }} space: {{ annotationStateSchemaSpace }} -- description: Instance space for contextualization pipeline annotation states - name: {{ annotationStateInstanceSpace }} - space: {{ annotationStateInstanceSpace }} + +- description: Pattern mode results instance space + name: {{ patternModeInstanceSpace }} + space: {{ patternModeInstanceSpace }} \ No newline at end of file diff --git a/modules/contextualization/cdf_file_annotation/data_models/hdm.view.yaml b/modules/contextualization/cdf_file_annotation/data_models/hdm.view.yaml index 18d059be..e501ef2d 100644 --- a/modules/contextualization/cdf_file_annotation/data_models/hdm.view.yaml +++ b/modules/contextualization/cdf_file_annotation/data_models/hdm.view.yaml @@ -25,7 +25,7 @@ space: {{ annotationStateSchemaSpace }} type: container containerPropertyIdentifier: annotationMessage - description: Annotation message + description: Contains annotations applied or error message name: Annotation message annotationStatus: container: @@ -33,7 +33,7 @@ space: {{ annotationStateSchemaSpace }} 
type: container containerPropertyIdentifier: annotationStatus - description: Annotation status + description: Holds the status of the file's diagram detect job name: Annotation status attemptCount: container: @@ -49,8 +49,24 @@ space: {{ annotationStateSchemaSpace }} type: container containerPropertyIdentifier: diagramDetectJobId - description: Diagram detect job ID - name: Diagram detect job ID + description: Diagram detect job Id + name: Diagram detect job Id + patternModeJobId: + container: + externalId: {{ annotationStateExternalId }} + space: {{ annotationStateSchemaSpace }} + type: container + containerPropertyIdentifier: patternModeJobId + description: Diagram detect job Id with pattern mode + name: Pattern mode job Id + patternModeMessage: + container: + externalId: {{ annotationStateExternalId }} + space: {{ annotationStateSchemaSpace }} + type: container + containerPropertyIdentifier: patternModeMessage + description: Contains entities found from pattern mode or error message + name: Pattern mode message linkedFile: container: externalId: {{ annotationStateExternalId }} @@ -96,5 +112,37 @@ containerPropertyIdentifier: sourceUpdatedTime description: Last updated time name: Last updated time + launchFunctionId: + container: + externalId: {{ annotationStateExternalId }} + space: {{ annotationStateSchemaSpace }} + type: container + containerPropertyIdentifier: launchFunctionId + description: Id of the launch function that was called + name: Launch function Id + launchFunctionCallId: + container: + externalId: {{ annotationStateExternalId }} + space: {{ annotationStateSchemaSpace }} + type: container + containerPropertyIdentifier: launchFunctionCallId + description: Specific Id that points to the function log that created the diagram detect job for the file + name: Launch function call Id + finalizeFunctionId: + container: + externalId: {{ annotationStateExternalId }} + space: {{ annotationStateSchemaSpace }} + type: container + containerPropertyIdentifier:
finalizeFunctionId + description: Id of the finalize function that was called + name: Finalize function Id + finalizeFunctionCallId: + container: + externalId: {{ annotationStateExternalId }} + space: {{ annotationStateSchemaSpace }} + type: container + containerPropertyIdentifier: finalizeFunctionCallId + description: Specific Id that points to the function log that applied annotations to the file + name: Finalize function call Id space: {{ annotationStateSchemaSpace }} version: {{ annotationStateVersion }} diff --git a/modules/contextualization/cdf_file_annotation/default.config.yaml b/modules/contextualization/cdf_file_annotation/default.config.yaml index ec62175d..10812f4a 100644 --- a/modules/contextualization/cdf_file_annotation/default.config.yaml +++ b/modules/contextualization/cdf_file_annotation/default.config.yaml @@ -3,10 +3,12 @@ annotationDatasetExternalId: ds_file_annotation # used in /data_models and /extraction_pipelines annotationStateExternalId: FileAnnotationState -annotationStateInstanceSpace: sp_dat_cdf_annotation_states annotationStateSchemaSpace: sp_hdm #NOTE: stands for space helper data model annotationStateVersion: v1.0.0 +patternModeInstanceSpace: sp_dat_pattern_mode_results +patternDetectSink: pattern_detection_sink_node fileSchemaSpace: +fileInstanceSpace: fileExternalId: fileVersion: @@ -14,26 +16,49 @@ fileVersion: rawDb: db_file_annotation rawTableDocTag: annotation_documents_tags rawTableDocDoc: annotation_documents_docs +rawTableDocPattern: annotation_documents_patterns rawTableCache: annotation_entities_cache +rawManualPatternsCatalog: manual_patterns_catalog +rawTablePromoteCache: annotation_tags_cache # used in /extraction_pipelines extractionPipelineExternalId: ep_file_annotation targetEntitySchemaSpace: +targetEntityInstanceSpace: targetEntityExternalId: targetEntityVersion: -# used in /functions and /workflows +# used in /functions +functionClientId: ${IDP_CLIENT_ID} +functionClientSecret: ${IDP_CLIENT_SECRET} + +# used in 
prepare function +prepareFunctionExternalId: fn_file_annotation_prepare #NOTE: if this is changed, then the folder holding the prepare function must be named the same as the new external ID +prepareFunctionVersion: v1.0.0 +prepareWorkflowVersion: v1_prepare +prepareWorkflowTrigger: wf_prepare_trigger + +# used in launch function launchFunctionExternalId: fn_file_annotation_launch #NOTE: if this is changed, then the folder holding the launch function must be named the same as the new external ID launchFunctionVersion: v1.0.0 +launchWorkflowVersion: v1_launch +launchWorkflowTrigger: wf_launch_trigger + +# used in finalize function finalizeFunctionExternalId: fn_file_annotation_finalize #NOTE: if this is changed, then the folder holding the finalize function must be named the same as the new external ID finalizeFunctionVersion: v1.0.0 -functionClientId: ${IDP_CLIENT_ID} -functionClientSecret: ${IDP_CLIENT_SECRET} +finalizeWorkflowVersion: v1_finalize +finalizeWorkflowTrigger: wf_finalize_trigger + +# used in promote function +promoteFunctionExternalId: fn_file_annotation_promote #NOTE: if this is changed, then the folder holding the promote function must be named the same as the new external ID +promoteFunctionVersion: v1.0.0 +promoteWorkflowVersion: v1_promote +promoteWorkflowTrigger: wf_promote_trigger # used in /workflows workflowSchedule: "3-59/10 * * * *" # NOTE: runs every 10 minutes with a 3 minute offset workflowExternalId: wf_file_annotation -workflowVersion: v1 # used in /auth groupSourceId: # source ID from Azure AD for the corresponding groups \ No newline at end of file diff --git a/modules/contextualization/cdf_file_annotation/detailed_guides/CONFIG.md b/modules/contextualization/cdf_file_annotation/detailed_guides/CONFIG.md index 76f9a866..b26ce858 100644 --- a/modules/contextualization/cdf_file_annotation/detailed_guides/CONFIG.md +++ b/modules/contextualization/cdf_file_annotation/detailed_guides/CONFIG.md @@ -63,25 +63,29 @@ Settings for the main 
annotation job launching process. Parsed by the `LaunchFun - `targetEntitiesSearchProperty` (str): Property on `targetEntitiesView` for matching (e.g., `aliases`). - `primaryScopeProperty` (str, optional): File property for primary grouping/context (e.g., `site`). If set to `None` or omitted, the function processes files without a primary scope grouping. _(Pydantic field: `primary_scope_property`)_ - `secondaryScopeProperty` (str, optional): File property for secondary grouping/context (e.g., `unit`). Defaults to `None`. _(Pydantic field: `secondary_scope_property`)_ + - `patternMode` (bool): Enables pattern-based detection mode alongside standard entity matching. When `True`, automatically generates regex-like patterns from entity aliases and detects all matching text in files. Defaults to `False`. _(Pydantic field: `pattern_mode`)_ + - `fileResourceProperty` (str, optional): Property on `fileView` to use for file-to-file link resource matching. Defaults to `None`. _(Pydantic field: `file_resource_property`)_ + - `targetEntitiesResourceProperty` (str, optional): Property on `targetEntitiesView` to use for resource matching. Defaults to `None`. _(Pydantic field: `target_entities_resource_property`)_ - **`dataModelService`** (`DataModelServiceConfig`): **Note:** For the query configurations below, you can provide a single query object or a list of query objects. If a list is provided, the queries are combined with a logical **OR**. - `getFilesToProcessQuery` (`QueryConfig | list[QueryConfig]`): Selects `AnnotationState` nodes ready for launching (e.g., status "New", "Retry"). - - `getTargetEntitiesQuery` (`QueryConfig | list[QueryConfig]`): Queries entities from `targetEntitiesView` for the cache. - - `getFileEntitiesQuery` (`QueryConfig | list[QueryConfig]`): Queries file entities from `fileView` for the cache. 
+ - `getTargetEntitiesQuery` (`QueryConfig | list[QueryConfig]`): Queries entities from `targetEntitiesView` for the cache (e.g., assets tagged "DetectInDiagrams"). + - `getFileEntitiesQuery` (`QueryConfig | list[QueryConfig]`): Queries file entities from `fileView` for the cache, enabling file-to-file linking (e.g., files tagged "DetectInDiagrams"). - **`cacheService`** (`CacheServiceConfig`): - `cacheTimeLimit` (int): Cache validity in hours (e.g., `24`). - `rawDb` (str): RAW database for the entity cache (e.g., `db_file_annotation`). - `rawTableCache` (str): RAW table for the entity cache (e.g., `annotation_entities_cache`). + - `rawManualPatternsCatalog` (str): RAW table for storing manual pattern overrides at GLOBAL, site, or unit levels (e.g., `manual_patterns_catalog`). _(Pydantic field: `raw_manual_patterns_catalog`)_ - **`annotationService`** (`AnnotationServiceConfig`): - - `pageRange` (int): Parameter for creating start and end page for `FileReference`. - - `partialMatch` (bool): Parameter for `client.diagrams.detect()`. - - `minTokens` (int): Parameter for `client.diagrams.detect()`. - - `diagramDetectConfig` (`DiagramDetectConfigModel`, optional): Detailed API configuration. + - `pageRange` (int): Number of pages to process per batch for large documents. For files with more than `pageRange` pages, the file is processed iteratively in chunks (e.g., `50`). + - `partialMatch` (bool): Parameter for `client.diagrams.detect()`. Enables partial text matching. + - `minTokens` (int, optional): Parameter for `client.diagrams.detect()`. Minimum number of tokens required for a match. + - `diagramDetectConfig` (`DiagramDetectConfigModel`, optional): Detailed API configuration for diagram detection. - Contains fields like `connectionFlags` (`ConnectionFlagsConfig`), `customizeFuzziness` (`CustomizeFuzzinessConfig`), `readEmbeddedText`, etc. - The Pydantic model's `as_config()` method converts this into an SDK `DiagramDetectConfig` object. 
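The chunked processing implied by `pageRange` can be sketched with a small helper. This is an illustration of the windowing arithmetic only, not the template's actual implementation:

```python
def page_chunks(page_count: int, page_range: int):
    """Yield 1-based inclusive (start, end) page windows for a file.

    A 120-page file with pageRange=50 is processed over three runs:
    (1, 50), (51, 100), (101, 120).
    """
    for start in range(1, page_count + 1, page_range):
        yield start, min(start + page_range - 1, page_count)
```

Each window would correspond to one `FileReference` start/end page pair passed to a diagram detect call, with the `AnnotationState` tracking which window was processed last.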
@@ -93,23 +97,24 @@ Settings for processing completed annotation jobs. Parsed by the `FinalizeFuncti - **Direct Parameters:** - - `cleanOldAnnotations` (bool): If `True`, deletes existing annotations before applying new ones. - - `maxRetryAttempts` (int): Max retries for a file if processing fails. + - `cleanOldAnnotations` (bool): If `True`, deletes existing annotations before applying new ones (only on the first run for multi-page files). _(Pydantic field: `clean_old_annotations`)_ + - `maxRetryAttempts` (int): Maximum number of retry attempts for a file before marking it as "Failed". _(Pydantic field: `max_retry_attempts`)_ - **`retrieveService`** (`RetrieveServiceConfig`): - - `getJobIdQuery` (`QueryConfig`): Selects `AnnotationState` nodes whose jobs are ready for result retrieval (e.g., status "Processing", `diagramDetectJobId` exists). + - `getJobIdQuery` (`QueryConfig`): Selects `AnnotationState` nodes whose jobs are ready for result retrieval. Uses optimistic locking to claim jobs (e.g., status "Processing", `diagramDetectJobId` exists). _(Pydantic field: `get_job_id_query`)_ - **`applyService`** (`ApplyServiceConfig`): - - `autoApprovalThreshold` (float): Confidence score for "Approved" status. - - `autoSuggestThreshold` (float): Confidence score for "Suggested" status. - -- **`reportService`** (`ReportServiceConfig`): - - `rawDb` (str): RAW DB for reports. - - `rawTableDocTag` (str): RAW table for document-tag links. - - `rawTableDocDoc` (str): RAW table for document-document links. - - `rawBatchSize` (int): Rows to batch before writing to RAW. + - `autoApprovalThreshold` (float): Confidence score threshold for automatically approving standard annotations (e.g., `1.0` for exact matches only). _(Pydantic field: `auto_approval_threshold`)_ + - `autoSuggestThreshold` (float): Confidence score threshold for suggesting standard annotations for review (e.g., `1.0`). 
_(Pydantic field: `auto_suggest_threshold`)_ + - `sinkNode` (`SinkNodeConfig`): Configuration for the target node where pattern mode annotations are linked for review. _(Pydantic field: `sink_node`)_ + - `space` (str): The space where the sink node resides. + - `externalId` (str): The external ID of the sink node. _(Pydantic field: `external_id`)_ + - `rawDb` (str): RAW database for storing annotation reports. _(Pydantic field: `raw_db`)_ + - `rawTableDocTag` (str): RAW table name for document-to-asset annotation links (e.g., `doc_tag`). _(Pydantic field: `raw_table_doc_tag`)_ + - `rawTableDocDoc` (str): RAW table name for document-to-document annotation links (e.g., `doc_doc`). _(Pydantic field: `raw_table_doc_doc`)_ + - `rawTableDocPattern` (str): RAW table name for pattern mode detections, creating a searchable catalog of potential entity matches (e.g., `doc_pattern`). _(Pydantic field: `raw_table_doc_pattern`)_ --- diff --git a/modules/contextualization/cdf_file_annotation/detailed_guides/CONFIG_PATTERNS.md b/modules/contextualization/cdf_file_annotation/detailed_guides/CONFIG_PATTERNS.md index 8015c183..b64748f6 100644 --- a/modules/contextualization/cdf_file_annotation/detailed_guides/CONFIG_PATTERNS.md +++ b/modules/contextualization/cdf_file_annotation/detailed_guides/CONFIG_PATTERNS.md @@ -99,7 +99,41 @@ launchFunction: # ... (rest of launchFunction config) ``` -### Recipe 4: Fine-Tuning the Diagram Detection API +### Recipe 4: Enabling and Configuring Pattern Mode + +**Goal:** Enable pattern-based detection alongside standard entity matching to create a comprehensive searchable catalog of potential entity occurrences in files. + +**Scenario:** You want to detect all text in files that matches patterns generated from entity aliases (e.g., "FT-101A" generates pattern "[FT]-000[A]"), in addition to standard exact entity matching. 
+ +**Configuration:** +Enable `patternMode` in the `launchFunction` section and configure the sink node in `finalizeFunction.applyService`. + +```yaml +# In ep_file_annotation.config.yaml + +launchFunction: + patternMode: True # Enable pattern detection mode + # ... (other configs) + cacheService: + rawManualPatternsCatalog: "manual_patterns_catalog" # Table for manual pattern overrides + +finalizeFunction: + # ... (other configs) + applyService: + sinkNode: + space: "sp_pattern_review" # Space where pattern detections are linked + externalId: "pattern_detection_sink" # Sink node for review + rawTableDocPattern: "doc_pattern" # RAW table for pattern detections +``` + +**Pattern Mode Features:** + +- **Auto-generation**: Automatically creates regex-like patterns from entity aliases +- **Manual overrides**: Add custom patterns to RAW table at GLOBAL, site, or unit levels +- **Deduplication**: Automatically skips pattern detections that duplicate standard annotations +- **Separate catalog**: Pattern detections stored separately for review in `doc_pattern` RAW table + +### Recipe 5: Fine-Tuning the Diagram Detection API **Goal:** Adjust the behavior of the diagram detection model, for example, by making it more or less strict about fuzzy text matching. @@ -119,7 +153,7 @@ launchFunction: # ... (other DiagramDetectConfig properties) ``` -### Recipe 5: Combining Queries with OR Logic +### Recipe 6: Combining Queries with OR Logic **Goal:** To select files for processing that meet one of several distinct criteria. This is useful when you want to combine different sets of filters with a logical OR. @@ -165,7 +199,7 @@ prepareFunction: targetProperty: tags ``` -### Recipe 6: Annotating Files Without a Scope +### Recipe 7: Annotating Files Without a Scope **Goal:** To annotate files that do not have a `primaryScopeProperty` (e.g., `city`). This is useful for processing files that are not assigned to a specific city or for a global-level annotation process. 
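The auto-generation step in Recipe 4 above — deriving a pattern like `[FT]-000[A]` from the alias `FT-101A` — can be sketched roughly as follows. This is an illustrative approximation under assumed rules (letter runs bracketed, digit runs zeroed), not the module's actual pattern-generation code:

```python
import re


def alias_to_pattern(alias: str) -> str:
    # Hypothetical sketch: split the alias into letter runs, digit runs,
    # and separators; bracket letters, replace digits with zeros.
    parts = re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d]+", alias)
    out = []
    for p in parts:
        if p.isdigit():
            out.append("0" * len(p))
        elif p.isalpha():
            out.append(f"[{p}]")
        else:
            out.append(p)  # keep separators like "-" as-is
    return "".join(out)


print(alias_to_pattern("FT-101A"))  # -> [FT]-000[A]
```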
@@ -190,7 +224,7 @@ launchFunction: This section covers high-level architectural decisions about how the template finds and partitions data. The choice between these patterns is fundamental and depends on your organization's requirements for governance, security, and operational structure. -### Recipe 7: Global Scoping (Searching Across All Spaces) +### Recipe 8: Global Scoping (Searching Across All Spaces) **Goal:** To run a single, unified annotation process that finds and annotates all new files based on their properties, regardless of which physical `instanceSpace` they reside in. @@ -220,7 +254,7 @@ dataModelViews: - When a single team uses a single, consistent set of rules to annotate all files across the organization. - For simpler systems where strict data partitioning between different domains is not a requirement. -### Recipe 8: Isolated Scoping (Targeting a Specific Space) +### Recipe 9: Isolated Scoping (Targeting a Specific Space) **Goal:** To run a dedicated annotation process that operates only within a single, physically separate data partition. diff --git a/modules/contextualization/cdf_file_annotation/detailed_guides/DEVELOPING.md b/modules/contextualization/cdf_file_annotation/detailed_guides/DEVELOPING.md index f389f58b..615cae2e 100644 --- a/modules/contextualization/cdf_file_annotation/detailed_guides/DEVELOPING.md +++ b/modules/contextualization/cdf_file_annotation/detailed_guides/DEVELOPING.md @@ -26,11 +26,17 @@ While any service can be replaced, these are the most common candidates for cust - **`AbstractLaunchService`**: The orchestrator for the launch function. You would implement this if your project requires a fundamentally different file batching, grouping, or processing workflow that can't be achieved with the `primary_scope_property` and `secondary_scope_property` configuration. +- **`AbstractFinalizeService`**: The orchestrator for the finalize function. 
Implement this if your project needs custom job claiming logic, result merging strategies, or unique annotation state update patterns. + - **`IDataModelService`**: The gateway to Cognite Data Fusion. Implement this if your project needs highly optimized or complex queries to fetch files and entities that go beyond the declarative `QueryConfig` filter system. -- **`IApplyService`**: The service responsible for writing annotations back to the data model. Implement this if your project has custom rules for how to set annotation properties (like status) or needs to create additional relationships in the data model. +- **`IRetrieveService`**: Handles retrieving diagram detection job results and claiming jobs with optimistic locking. Implement this if you need custom job claiming strategies or want to integrate with external job tracking systems. + +- **`IApplyService`**: The service responsible for writing annotations back to the data model and RAW tables. Implement this if your project has custom rules for confidence thresholds, deduplication logic, or needs to create additional relationships in the data model or external systems. + +- **`ICacheService`**: Manages the in-memory entity cache and pattern generation. You might implement this if your project has a different caching strategy (e.g., different cache key logic, custom pattern generation algorithms, or fetching context from an external system). -- **`ICacheService`**: Manages the in-memory entity cache. You might implement this if your project has a different caching strategy (e.g., different cache key logic, or fetching context from an external system). +- **`IAnnotationService`**: Handles interaction with the Cognite Diagram Detect API. Implement this if you need custom retry logic, want to use a different annotation API, or need to pre/post-process annotation requests. 
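As a concrete, purely illustrative example of the retry customization mentioned for `IAnnotationService`, a backoff wrapper might look like the sketch below. Here `detect_fn` stands in for the real diagram-detect call; the helper and its signature are hypothetical, not part of the module:

```python
import random
import time


def detect_with_retry(detect_fn, request, max_attempts: int = 3, base_delay: float = 0.5):
    # Hypothetical sketch: exponential backoff with jitter around an
    # annotation API call. detect_fn is any callable taking the request.
    for attempt in range(1, max_attempts + 1):
        try:
            return detect_fn(request)
        except Exception:
            if attempt == max_attempts:
                raise
            # Sleep base_delay * 2^(attempt-1) plus jitter before retrying.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
```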
## How to Create a Custom Implementation @@ -102,7 +108,7 @@ class HighPriorityLaunchService(GeneralLaunchService): ### Step 2: Use Your Custom Implementation ```python -# In fn_dm_context_annotation_launch/handler.py +# In fn_file_annotation_launch/handler.py # ... (other imports) from services.LaunchService import AbstractLaunchService @@ -110,7 +116,7 @@ from services.LaunchService import AbstractLaunchService from services.my_custom_launch_service import HighPriorityLaunchService # 2. Instantiate your new custom class instead of GeneralLaunchService -def _create_launch_service(config, client, logger, tracker) -> AbstractLaunchService: +def _create_launch_service(config, client, logger, tracker, function_call_info) -> AbstractLaunchService: cache_instance: ICacheService = create_general_cache_service(config, client, logger) data_model_instance: IDataModelService = create_general_data_model_service( config, client, logger @@ -126,6 +132,7 @@ def _create_launch_service(config, client, logger, tracker) -> AbstractLaunchSer data_model_service=data_model_instance, cache_service=cache_instance, annotation_service=annotation_instance, + function_call_info=function_call_info, ) return launch_instance diff --git a/modules/contextualization/cdf_file_annotation/extraction_pipelines/ep_file_annotation.ExtractionPipeline.yaml b/modules/contextualization/cdf_file_annotation/extraction_pipelines/ep_file_annotation.ExtractionPipeline.yaml index dcdbc901..103ad3bc 100644 --- a/modules/contextualization/cdf_file_annotation/extraction_pipelines/ep_file_annotation.ExtractionPipeline.yaml +++ b/modules/contextualization/cdf_file_annotation/extraction_pipelines/ep_file_annotation.ExtractionPipeline.yaml @@ -7,27 +7,16 @@ rawTables: tableName: {{ rawTableDocTag }} - dbName: {{ rawDb }} tableName: {{ rawTableDocDoc }} + - dbName: {{ rawDb }} + tableName: {{ rawTableDocPattern }} - dbName: {{ rawDb }} tableName: {{ rawTableCache }} + - dbName: {{ rawDb }} + tableName: {{ 
rawManualPatternsCatalog }} + - dbName: {{ rawDb }} + tableName: {{ rawTablePromoteCache }} source: "Files" documentation: > - # Guide to Configuring the Annotation Function via YAML - - This document outlines how to use the `ep_file_annotation.config.yaml` file to control the behavior of the Annotation Function. The Python code, particularly `ConfigService.py`, uses Pydantic models to parse this YAML, making the function adaptable to different data models and operational parameters. - - ## Overall Structure - - The YAML configuration is organized into logical blocks that correspond to different phases and components of the toolkit: - - - `dataModelViews`: Defines common Data Model views used across functions. - - `prepareFunction`: Settings for the initial file preparation phase. - - `launchFunction`: Settings for launching annotation tasks. - - `finalizeFunction`: Settings for processing and finalizing annotation results. - - The entire structure is parsed into a main `Config` Pydantic model. - - --- - ## 1. `dataModelViews` This section specifies the Data Model views the function will interact with. Each view is defined using a structure mapping to the `ViewPropertyConfig` Pydantic model. @@ -52,10 +41,8 @@ documentation: > Configures the initial setup phase, primarily for selecting files to be annotated. Parsed by the `PrepareFunction` Pydantic model. - **Note:** For the query configurations below, you can provide a single query object or a list of query objects. If a list is provided, the queries are combined with a logical **OR**. - - **`getFilesForAnnotationResetQuery`** (`QueryConfig | list[QueryConfig]`, optional): - **Purpose:** Selects specific files to have their annotation status reset (e.g., remove "Annotated"/"AnnotationInProcess" tags) to make them eligible for re-annotation. @@ -78,25 +65,29 @@ documentation: > - `targetEntitiesSearchProperty` (str): Property on `targetEntitiesView` for matching (e.g., `aliases`). 
- `primaryScopeProperty` (str, optional): File property for primary grouping/context (e.g., `site`). If set to `None` or omitted, the function processes files without a primary scope grouping. _(Pydantic field: `primary_scope_property`)_ - `secondaryScopeProperty` (str, optional): File property for secondary grouping/context (e.g., `unit`). Defaults to `None`. _(Pydantic field: `secondary_scope_property`)_ + - `patternMode` (bool): Enables pattern-based detection mode alongside standard entity matching. When `True`, automatically generates regex-like patterns from entity aliases and detects all matching text in files. Defaults to `False`. _(Pydantic field: `pattern_mode`)_ + - `fileResourceProperty` (str, optional): Property on `fileView` to use for file-to-file link resource matching. Defaults to `None`. _(Pydantic field: `file_resource_property`)_ + - `targetEntitiesResourceProperty` (str, optional): Property on `targetEntitiesView` to use for resource matching. Defaults to `None`. _(Pydantic field: `target_entities_resource_property`)_ - **`dataModelService`** (`DataModelServiceConfig`): **Note:** For the query configurations below, you can provide a single query object or a list of query objects. If a list is provided, the queries are combined with a logical **OR**. - `getFilesToProcessQuery` (`QueryConfig | list[QueryConfig]`): Selects `AnnotationState` nodes ready for launching (e.g., status "New", "Retry"). - - `getTargetEntitiesQuery` (`QueryConfig | list[QueryConfig]`): Queries entities from `targetEntitiesView` for the cache. - - `getFileEntitiesQuery` (`QueryConfig | list[QueryConfig]`): Queries file entities from `fileView` for the cache. + - `getTargetEntitiesQuery` (`QueryConfig | list[QueryConfig]`): Queries entities from `targetEntitiesView` for the cache (e.g., assets tagged "DetectInDiagrams"). 
+ - `getFileEntitiesQuery` (`QueryConfig | list[QueryConfig]`): Queries file entities from `fileView` for the cache, enabling file-to-file linking (e.g., files tagged "DetectInDiagrams"). - **`cacheService`** (`CacheServiceConfig`): - `cacheTimeLimit` (int): Cache validity in hours (e.g., `24`). - `rawDb` (str): RAW database for the entity cache (e.g., `db_file_annotation`). - `rawTableCache` (str): RAW table for the entity cache (e.g., `annotation_entities_cache`). + - `rawManualPatternsCatalog` (str): RAW table for storing manual pattern overrides at GLOBAL, site, or unit levels (e.g., `manual_patterns_catalog`). _(Pydantic field: `raw_manual_patterns_catalog`)_ - **`annotationService`** (`AnnotationServiceConfig`): - - `pageRange` (int): Parameter for creating start and end page for `FileReference`. - - `partialMatch` (bool): Parameter for `client.diagrams.detect()`. - - `minTokens` (int): Parameter for `client.diagrams.detect()`. - - `diagramDetectConfig` (`DiagramDetectConfigModel`, optional): Detailed API configuration. + - `pageRange` (int): Number of pages to process per batch for large documents. For files with more than `pageRange` pages, the file is processed iteratively in chunks (e.g., `50`). + - `partialMatch` (bool): Parameter for `client.diagrams.detect()`. Enables partial text matching. + - `minTokens` (int, optional): Parameter for `client.diagrams.detect()`. Minimum number of tokens required for a match. + - `diagramDetectConfig` (`DiagramDetectConfigModel`, optional): Detailed API configuration for diagram detection. - Contains fields like `connectionFlags` (`ConnectionFlagsConfig`), `customizeFuzziness` (`CustomizeFuzzinessConfig`), `readEmbeddedText`, etc. - The Pydantic model's `as_config()` method converts this into an SDK `DiagramDetectConfig` object. @@ -108,23 +99,24 @@ documentation: > - **Direct Parameters:** - - `cleanOldAnnotations` (bool): If `True`, deletes existing annotations before applying new ones. 
- - `maxRetryAttempts` (int): Max retries for a file if processing fails. + - `cleanOldAnnotations` (bool): If `True`, deletes existing annotations before applying new ones (only on the first run for multi-page files). _(Pydantic field: `clean_old_annotations`)_ + - `maxRetryAttempts` (int): Maximum number of retry attempts for a file before marking it as "Failed". _(Pydantic field: `max_retry_attempts`)_ - **`retrieveService`** (`RetrieveServiceConfig`): - - `getJobIdQuery` (`QueryConfig`): Selects `AnnotationState` nodes whose jobs are ready for result retrieval (e.g., status "Processing", `diagramDetectJobId` exists). + - `getJobIdQuery` (`QueryConfig`): Selects `AnnotationState` nodes whose jobs are ready for result retrieval. Uses optimistic locking to claim jobs (e.g., status "Processing", `diagramDetectJobId` exists). _(Pydantic field: `get_job_id_query`)_ - **`applyService`** (`ApplyServiceConfig`): - - `autoApprovalThreshold` (float): Confidence score for "Approved" status. - - `autoSuggestThreshold` (float): Confidence score for "Suggested" status. - - - **`reportService`** (`ReportServiceConfig`): - - `rawDb` (str): RAW DB for reports. - - `rawTableDocTag` (str): RAW table for document-tag links. - - `rawTableDocDoc` (str): RAW table for document-document links. - - `rawBatchSize` (int): Rows to batch before writing to RAW. + - `autoApprovalThreshold` (float): Confidence score threshold for automatically approving standard annotations (e.g., `1.0` for exact matches only). _(Pydantic field: `auto_approval_threshold`)_ + - `autoSuggestThreshold` (float): Confidence score threshold for suggesting standard annotations for review (e.g., `1.0`). _(Pydantic field: `auto_suggest_threshold`)_ + - `sinkNode` (`SinkNodeConfig`): Configuration for the target node where pattern mode annotations are linked for review. _(Pydantic field: `sink_node`)_ + - `space` (str): The space where the sink node resides. + - `externalId` (str): The external ID of the sink node. 
_(Pydantic field: `external_id`)_ + - `rawDb` (str): RAW database for storing annotation reports. _(Pydantic field: `raw_db`)_ + - `rawTableDocTag` (str): RAW table name for document-to-asset annotation links (e.g., `doc_tag`). _(Pydantic field: `raw_table_doc_tag`)_ + - `rawTableDocDoc` (str): RAW table name for document-to-document annotation links (e.g., `doc_doc`). _(Pydantic field: `raw_table_doc_doc`)_ + - `rawTableDocPattern` (str): RAW table name for pattern mode detections, creating a searchable catalog of potential entity matches (e.g., `doc_pattern`). _(Pydantic field: `raw_table_doc_pattern`)_ --- @@ -148,6 +140,5 @@ documentation: > - **`limit`** (Optional[int], default `-1`): Specifies the upper limit of instances that can be retrieved from the query. - The Python code uses `QueryConfig.build_filter()` (which internally uses `FilterConfig.as_filter()`) to convert these YAML definitions into Cognite SDK `Filter` objects for querying CDF. diff --git a/modules/contextualization/cdf_file_annotation/extraction_pipelines/ep_file_annotation.config.yaml b/modules/contextualization/cdf_file_annotation/extraction_pipelines/ep_file_annotation.config.yaml index cf709dd6..4a14d99e 100644 --- a/modules/contextualization/cdf_file_annotation/extraction_pipelines/ep_file_annotation.config.yaml +++ b/modules/contextualization/cdf_file_annotation/extraction_pipelines/ep_file_annotation.config.yaml @@ -7,16 +7,18 @@ config: version: v1 annotationStateView: schemaSpace: {{ annotationStateSchemaSpace }} - instanceSpace: {{annotationStateInstanceSpace}} + instanceSpace: {{fileInstanceSpace}} externalId: {{ annotationStateExternalId }} version: {{ annotationStateVersion }} fileView: schemaSpace: {{ fileSchemaSpace }} + instanceSpace: {{fileInstanceSpace}} externalId: {{ fileExternalId }} version: {{ fileVersion }} annotationType: diagrams.FileLink targetEntitiesView: schemaSpace: {{ targetEntitySchemaSpace }} + instanceSpace: {{targetEntityInstanceSpace}} externalId: {{ 
targetEntityExternalId }} version: {{ targetEntityVersion }} annotationType: diagrams.AssetLink @@ -52,6 +54,9 @@ config: targetEntitiesSearchProperty: aliases primaryScopeProperty: None secondaryScopeProperty: + patternMode: True + fileResourceProperty: + targetEntitiesResourceProperty: dataModelService: getFilesToProcessQuery: targetView: @@ -63,6 +68,9 @@ config: negate: False operator: In targetProperty: annotationStatus + - negate: False + operator: Exists + targetProperty: linkedFile limit: 1000 getTargetEntitiesQuery: targetView: @@ -88,21 +96,15 @@ config: cacheTimeLimit: 24 # hours rawDb: {{ rawDb }} rawTableCache: {{ rawTableCache }} + rawManualPatternsCatalog: {{ rawManualPatternsCatalog }} annotationService: pageRange: 50 partialMatch: True - minTokens: 2 diagramDetectConfig: connectionFlags: noTextInbetween: True naturalReadingOrder: True - customizeFuzziness: - fuzzyScore: 0.93 - maxBoxes: - minChars: 10 - minFuzzyScore: 0.915 readEmbeddedText: True - removeLeadingZeros: True finalizeFunction: cleanOldAnnotations: True maxRetryAttempts: 3 @@ -117,14 +119,49 @@ config: negate: False operator: Equals targetProperty: annotationStatus - - negate: False # # NOTE: Do not change unless there's a good reason + - negate: False # NOTE: Do not change unless there's a good reason operator: Exists targetProperty: diagramDetectJobId applyService: autoApprovalThreshold: 1.0 autoSuggestThreshold: 1.0 - reportService: + sinkNode: + space: {{ patternModeInstanceSpace }} + externalId: {{patternDetectSink}} rawDb: {{ rawDb }} rawTableDocTag: {{ rawTableDocTag }} rawTableDocDoc: {{ rawTableDocDoc }} - rawBatchSize: 10000 + rawTableDocPattern: {{ rawTableDocPattern }} + promoteFunction: + getCandidatesQuery: + targetView: + schemaSpace: cdf_cdm + externalId: CogniteDiagramAnnotation + version: v1 + filters: + - values: "Suggested" # Only process suggested annotations + negate: False + operator: Equals + targetProperty: status + - values: ["PromoteAttempted"] # Skip already 
attempted edges + negate: True + operator: In + targetProperty: tags + limit: 500 # Number of edges to process per batch + rawDb: {{ rawDb }} + rawTableDocPattern: {{ rawTableDocPattern }} + rawTableDocTag: {{ rawTableDocTag }} + rawTableDocDoc: {{ rawTableDocDoc }} + deleteRejectedEdges: True + deleteSuggestedEdges: True + entitySearchService: + enableExistingAnnotationsSearch: True # Primary: Query annotation edges (fast, checks existing annotation edges) + enableGlobalEntitySearch: True # Fallback: Global entity search - (slow, unstable as instance count grows) + maxEntitySearchLimit: 1000 # Max entities to fetch in global search + textNormalization: + removeSpecialCharacters: True # Remove non-alphanumeric characters (e.g., "V-0912" β†’ "V0912") + convertToLowercase: False # Convert to lowercase (e.g., "V0912" β†’ "v0912") + stripLeadingZeros: True # Remove leading zeros (e.g., "v0912" β†’ "v912") + cacheService: + cacheTableName: {{ rawTablePromoteCache }} + diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/dependencies.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/dependencies.py index 4bffc16b..33adf057 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/dependencies.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/dependencies.py @@ -12,7 +12,6 @@ from services.RetrieveService import GeneralRetrieveService from services.ApplyService import GeneralApplyService from services.LoggerService import CogniteFunctionLogger -from services.ReportService import GeneralReportService from services.PipelineService import GeneralPipelineService @@ -101,12 +100,6 @@ def create_general_retrieve_service( return GeneralRetrieveService(client, config, logger) -def create_general_report_service( - client: CogniteClient, config: Config, logger: CogniteFunctionLogger -) -> GeneralReportService: - return 
GeneralReportService(client, config, logger) - - def create_general_apply_service( client: CogniteClient, config: Config, logger: CogniteFunctionLogger ) -> GeneralApplyService: diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/handler.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/handler.py index feee0dc3..a44b4ab9 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/handler.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/handler.py @@ -1,5 +1,7 @@ import sys import threading +import time +import random from datetime import datetime, timezone, timedelta from cognite.client import CogniteClient @@ -7,14 +9,12 @@ create_config_service, create_logger_service, create_write_logger_service, - create_general_report_service, create_general_retrieve_service, create_general_apply_service, create_general_pipeline_service, ) from services.FinalizeService import AbstractFinalizeService, GeneralFinalizeService from services.ApplyService import IApplyService -from services.ReportService import IReportService from services.RetrieveService import IRetrieveService from services.PipelineService import IPipelineService from utils.DataStructures import PerformanceTracker @@ -44,11 +44,14 @@ def handle(data: dict, function_call_info: dict, client: CogniteClient) -> dict: client, pipeline_ext_id=data["ExtractionPipelineExtId"] ) - finalize_instance, report_instance = _create_finalize_service( - config_instance, client, logger_instance, tracker_instance + finalize_instance = _create_finalize_service( + config_instance, client, logger_instance, tracker_instance, function_call_info ) run_status: str = "success" + # NOTE: Add a random delay to stagger API requests; this prevents API load shedding that can return empty results under high concurrency. 
+ delay = random.uniform(0.1, 1.0) + time.sleep(delay) try: while datetime.now(timezone.utc) - start_time < timedelta(minutes=7): if finalize_instance.run() == "Done": @@ -61,17 +64,13 @@ def handle(data: dict, function_call_info: dict, client: CogniteClient) -> dict: logger_instance.error(message=msg, section="BOTH") return {"status": run_status, "message": msg} finally: - logger_instance.info(report_instance.update_report()) logger_instance.info(tracker_instance.generate_overall_report(), "BOTH") - # only want to report on the count of successful and failed files in ep_logs if there were files that were processed or an error occured - # else run log will be too messy - if tracker_instance.files_failed != 0 or tracker_instance.files_success != 0 or run_status == "failure": - function_id = function_call_info.get("function_id") - call_id = function_call_info.get("call_id") - pipeline_instance.update_extraction_pipeline( - msg=tracker_instance.generate_ep_run("Finalize", function_id, call_id) - ) - pipeline_instance.upload_extraction_pipeline(status=run_status) + function_id = function_call_info.get("function_id") + call_id = function_call_info.get("call_id") + pipeline_instance.update_extraction_pipeline( + msg=tracker_instance.generate_ep_run("Finalize", function_id, call_id) + ) + pipeline_instance.upload_extraction_pipeline(status=run_status) def run_locally(config_file: dict[str, str], log_path: str | None = None): @@ -94,8 +93,12 @@ def run_locally(config_file: dict[str, str], log_path: str | None = None): tracker_instance = PerformanceTracker() - finalize_instance, report_instance = _create_finalize_service( - config_instance, client, logger_instance, tracker_instance + finalize_instance = _create_finalize_service( + config_instance, + client, + logger_instance, + tracker_instance, + function_call_info={"function_id": None, "call_id": None}, ) try: @@ -109,8 +112,6 @@ def run_locally(config_file: dict[str, str], log_path: str | None = None): section="BOTH", ) 
finally: - result = report_instance.update_report() - logger_instance.info(result) logger_instance.info(tracker_instance.generate_overall_report(), "BOTH") logger_instance.close() @@ -138,11 +139,10 @@ def run_locally_parallel( thread_4.join() -def _create_finalize_service(config, client, logger, tracker) -> tuple[AbstractFinalizeService, IReportService]: +def _create_finalize_service(config, client, logger, tracker, function_call_info) -> AbstractFinalizeService: """ Instantiate Finalize with interfaces. """ - report_instance: IReportService = create_general_report_service(client, config, logger) retrieve_instance: IRetrieveService = create_general_retrieve_service(client, config, logger) apply_instance: IApplyService = create_general_apply_service(client, config, logger) finalize_instance = GeneralFinalizeService( @@ -152,9 +152,9 @@ def _create_finalize_service(config, client, logger, tracker) -> tuple[AbstractF tracker=tracker, retrieve_service=retrieve_instance, apply_service=apply_instance, - report_service=report_instance, + function_call_info=function_call_info, ) - return finalize_instance, report_instance + return finalize_instance if __name__ == "__main__": diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ApplyService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ApplyService.py index 316ac61e..1a3dc183 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ApplyService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ApplyService.py @@ -14,16 +14,11 @@ Node, NodeId, NodeApply, - NodeApplyResultList, EdgeId, InstancesApplyResult, ) - -from cognite.client.data_classes.filters import ( - In, - Or, -) - +from cognite.client.data_classes.filters import And, Equals, Not +from cognite.client import data_modeling as dm from services.ConfigService 
import Config, ViewPropertyConfig from utils.DataStructures import DiagramAnnotationStatus @@ -36,89 +31,319 @@ class IApplyService(abc.ABC): """ @abc.abstractmethod - def apply_annotations(self, result_item: dict, file_id: NodeId) -> tuple[list, list]: - pass - - @abc.abstractmethod - def update_nodes(self, list_node_apply: list[NodeApply]) -> NodeApplyResultList: + def process_and_apply_annotations_for_file( + self, + file_node: Node, + regular_item: dict | None, + pattern_item: dict | None, + clean_old: bool, + ) -> tuple[str, str]: pass @abc.abstractmethod - def delete_annotations_for_file(self, file_node: NodeId) -> tuple[list[str], list[str]]: + def update_instances( + self, + list_node_apply: list[NodeApply] | NodeApply | None = None, + list_edge_apply: list[EdgeApply] | EdgeApply | None = None, + ) -> InstancesApplyResult: pass class GeneralApplyService(IApplyService): """ - Interface for applying/deleting annotations to a node + Implementation of the ApplyService interface. """ EXTERNAL_ID_LIMIT = 256 - FUNCTION_ID = "fn_dm_context_annotation_finalize" + FUNCTION_ID = "fn_file_annotation_finalize" def __init__(self, client: CogniteClient, config: Config, logger: CogniteFunctionLogger): self.client: CogniteClient = client self.config: Config = config self.logger: CogniteFunctionLogger = logger - - self.core_annotation_view_id: ViewId = self.config.data_model_views.core_annotation_view.as_view_id() - self.file_view_id: ViewId = self.config.data_model_views.file_view.as_view_id() + self.core_annotation_view_id: ViewId = config.data_model_views.core_annotation_view.as_view_id() + self.file_view_id: ViewId = config.data_model_views.file_view.as_view_id() self.file_annotation_type = config.data_model_views.file_view.annotation_type + self.approve_threshold = config.finalize_function.apply_service.auto_approval_threshold + self.suggest_threshold = config.finalize_function.apply_service.auto_suggest_threshold + self.sink_node_ref = DirectRelationReference( + 
space=config.finalize_function.apply_service.sink_node.space, + external_id=config.finalize_function.apply_service.sink_node.external_id, + ) - self.approve_threshold = self.config.finalize_function.apply_service.auto_approval_threshold - self.suggest_threshold = self.config.finalize_function.apply_service.auto_suggest_threshold - - # NOTE: could implement annotation edges to be updated in batches for performance gains but leaning towards no. Since it will over complicate error handling. - def apply_annotations(self, result_item: dict, file_id: NodeId) -> tuple[list[RowWrite], list[RowWrite]]: + def process_and_apply_annotations_for_file( + self, + file_node: Node, + regular_item: dict | None, + pattern_item: dict | None, + clean_old: bool, + ) -> tuple[str, str]: """ - Push the annotations to the file and set the "AnnotationInProcess" tag to "Annotated" + Performs the complete annotation workflow for a single file. + + Processes diagram detection results (regular and pattern mode), removes old annotations if needed, + creates annotation edges in the data model, writes annotation data to RAW tables, + and updates the file node's tag status. + + Args: + file_node: The file node instance to annotate. + regular_item: Dictionary containing regular diagram detect results. + pattern_item: Dictionary containing pattern mode diagram detect results. + clean_old: Whether to delete existing annotations before applying new ones. 
+ + Returns: + A tuple containing: + - Summary message of regular annotations applied + - Summary message of pattern annotations created """ + file_id = file_node.as_id() + source_id = cast(str, file_node.properties.get(self.file_view_id, {}).get("sourceId")) - file_node: Node | None = self.client.data_modeling.instances.retrieve_nodes( - nodes=file_id, sources=self.file_view_id - ) - if not file_node: - raise ValueError("No file node found.") + if clean_old: + deleted_counts = self._delete_annotations_for_file(file_id) + self.logger.info( + f"\t- Deleted {deleted_counts['doc']} doc, {deleted_counts['tag']} tag, and {deleted_counts['pattern']} pattern annotations." + ) - node_apply: NodeApply = file_node.as_write() + # Step 1: Process regular annotations and collect their stable hashes + regular_edges, doc_rows, tag_rows = [], [], [] + processed_hashes = set() + if regular_item and regular_item.get("annotations"): + for annotation in regular_item["annotations"]: + stable_hash = self._create_stable_hash(annotation) + processed_hashes.add(stable_hash) + edges = self._detect_annotation_to_edge_applies(file_id, source_id, doc_rows, tag_rows, annotation) + regular_edges.extend(edges.values()) + + # Step 2: Process pattern annotations, skipping any that were already processed + pattern_edges, pattern_rows = [], [] + if pattern_item and pattern_item.get("annotations"): + pattern_edges, pattern_rows = self._process_pattern_results(pattern_item, file_node, processed_hashes) + + # Step 3: Update the file node tag + node_apply = file_node.as_write() node_apply.existing_version = None + tags = cast(list[str], node_apply.sources[0].properties["tags"]) + if "AnnotationInProcess" in tags: + tags[tags.index("AnnotationInProcess")] = "Annotated" + elif "Annotated" not in tags: + self.logger.warning( + f"File {file_id.external_id} was processed, but 'AnnotationInProcess' tag was not found." 
+ ) - tags_property: list[str] = cast(list[str], node_apply.sources[0].properties["tags"]) - - # NOTE: There are cases where the 'annotated' tag is set but a job was queued up again for the file. - # This is because the rate at which the jobs are processed by finalize is slower than the rate at which launch fills up the queue. - # So if the wait time that was set in the extractor config file goes passed the time it takes for the finalize function to get to the job. Annotate will appear in the tags list. - if "AnnotationInProcess" in tags_property: - index = tags_property.index("AnnotationInProcess") - tags_property[index] = "Annotated" - elif "Annotated" not in tags_property: - raise ValueError("Annotated and AnnotationInProcess not found in tag property of file node") - source_id: str | None = cast(str, file_node.properties[self.file_view_id].get("sourceId")) - doc_doc, doc_tag = [], [] - edge_applies: list[EdgeApply] = [] - for detect_annotation in result_item["annotations"]: - edge_apply_dict: dict[tuple, EdgeApply] = self._detect_annotation_to_edge_applies( - file_id, - source_id, - doc_doc, - doc_tag, - detect_annotation, + # Step 4: Apply all data model and RAW changes + self.update_instances(list_node_apply=node_apply, list_edge_apply=regular_edges + pattern_edges) + db_name = self.config.finalize_function.apply_service.raw_db + if doc_rows: + self.client.raw.rows.insert( + db_name=db_name, + table_name=self.config.finalize_function.apply_service.raw_table_doc_doc, + row=doc_rows, + ensure_parent=True, + ) + if tag_rows: + self.client.raw.rows.insert( + db_name=db_name, + table_name=self.config.finalize_function.apply_service.raw_table_doc_tag, + row=tag_rows, + ensure_parent=True, + ) + if pattern_rows: + self.client.raw.rows.insert( + db_name=db_name, + table_name=self.config.finalize_function.apply_service.raw_table_doc_pattern, + row=pattern_rows, + ensure_parent=True, ) - edge_applies.extend(edge_apply_dict.values()) - 
self.client.data_modeling.instances.apply( - nodes=node_apply, - edges=edge_applies, - replace=False, + return ( + f"Applied {len(doc_rows)} doc and {len(tag_rows)} tag annotations.", + f"Created {len(pattern_rows)} new pattern detections.", ) - return doc_doc, doc_tag - def update_nodes(self, list_node_apply: list[NodeApply]) -> NodeApplyResultList: - update_results: InstancesApplyResult = self.client.data_modeling.instances.apply( - nodes=list_node_apply, - replace=False, # ensures we don't delete other properties in the view + def update_instances(self, list_node_apply=None, list_edge_apply=None) -> InstancesApplyResult: + """ + Applies node and/or edge updates to the data model. + + Args: + list_node_apply: Optional NodeApply or list of NodeApply objects to update. + list_edge_apply: Optional EdgeApply or list of EdgeApply objects to update. + + Returns: + InstancesApplyResult containing the results of the apply operation. + """ + return self.client.data_modeling.instances.apply(nodes=list_node_apply, edges=list_edge_apply, replace=False) + + def _delete_annotations_for_file(self, file_id: NodeId) -> dict[str, int]: + """ + Removes all existing annotations for a file from both data model and RAW tables. + + Deletes annotation edges (doc-to-doc, doc-to-tag, and pattern annotations) and their + corresponding RAW table entries to prepare for fresh annotations. + + Args: + file_id: NodeId of the file whose annotations should be deleted. + + Returns: + Dictionary with counts of deleted annotations: {"doc": int, "tag": int, "pattern": int}. 
+ """ + + counts = {"doc": 0, "tag": 0, "pattern": 0} + std_edges = self._list_annotations_for_file( + file_id, file_id.space + ) # NOTE: Annotations produced from regular diagram detect are stored in the same instance space as the file node + if std_edges: + edge_ids, doc_keys, tag_keys = [], [], [] + for edge in std_edges: + edge_ids.append(edge.as_id()) + if edge.type.external_id == self.file_annotation_type: + doc_keys.append(edge.external_id) + else: + tag_keys.append(edge.external_id) + if edge_ids: + self.client.data_modeling.instances.delete(edges=edge_ids) + if doc_keys: + self.client.raw.rows.delete( + db_name=self.config.finalize_function.apply_service.raw_db, + table_name=self.config.finalize_function.apply_service.raw_table_doc_doc, + key=doc_keys, + ) + if tag_keys: + self.client.raw.rows.delete( + db_name=self.config.finalize_function.apply_service.raw_db, + table_name=self.config.finalize_function.apply_service.raw_table_doc_tag, + key=tag_keys, + ) + counts["doc"], counts["tag"] = len(doc_keys), len(tag_keys) + + pattern_edges = self._list_annotations_for_file( + file_id, self.sink_node_ref.space + ) # NOTE: Annotations produced from pattern mode are stored in the same instance space as the sink node + if pattern_edges: + edge_ids = [edge.as_id() for edge in pattern_edges] + row_keys = [edge.external_id for edge in pattern_edges] + if edge_ids: + self.client.data_modeling.instances.delete(edges=edge_ids) + if row_keys: + self.client.raw.rows.delete( + db_name=self.config.finalize_function.apply_service.raw_db, + table_name=self.config.finalize_function.apply_service.raw_table_doc_pattern, + key=row_keys, + ) + counts["pattern"] = len(row_keys) + return counts + + def _list_annotations_for_file(self, node_id: NodeId, edge_instance_space: str): + """ + Retrieves all annotation edges for a specific file from a given instance space. + + Args: + node_id: NodeId of the file to query annotations for. 
+ edge_instance_space: Instance space where the annotation edges are stored. + + Returns: + EdgeList of all annotation edges connected to the file node. + """ + start_node_filter = Equals( + ["edge", "startNode"], + {"space": node_id.space, "externalId": node_id.external_id}, + ) + + return self.client.data_modeling.instances.list( + instance_type="edge", + sources=[self.core_annotation_view_id], + space=edge_instance_space, + filter=start_node_filter, + limit=-1, ) - return update_results.nodes + + def _process_pattern_results( + self, result_item: dict, file_node: Node, existing_hashes: set + ) -> tuple[list[EdgeApply], list[RowWrite]]: + """ + Processes pattern mode detection results into annotation edges and RAW rows. + + Creates pattern-based annotations that link to a sink node rather than specific entities, + allowing review and approval of pattern-detected annotations before linking to actual entities. + Skips patterns already covered by regular detection results. + + Args: + result_item: Dictionary containing pattern mode detection results. + file_node: The file node being annotated. + existing_hashes: Set of annotation hashes from regular detection to avoid duplicates. 
+ + Returns: + A tuple containing: + - List of EdgeApply objects for pattern annotations + - List of RowWrite objects for RAW table entries + """ + file_id = file_node.as_id() + source_id = cast(str, file_node.properties.get(self.file_view_id, {}).get("sourceId")) + doc_patterns, edge_applies = [], [] + for detect_annotation in result_item.get("annotations", []): + stable_hash = self._create_stable_hash(detect_annotation) + if stable_hash in existing_hashes: + continue # Skip creating a pattern edge if a regular one already exists for this detection + + entities = detect_annotation.get("entities", []) + if not entities: + continue + entity = entities[0] + + external_id = self._create_pattern_annotation_id(file_id, detect_annotation) + now = datetime.now(timezone.utc).replace(microsecond=0) + annotation_type = entity.get( + "annotation_type", + self.config.data_model_views.target_entities_view.annotation_type, + ) + annotation_properties = { + "name": file_id.external_id, + "confidence": detect_annotation.get("confidence", 0.0), + "status": DiagramAnnotationStatus.SUGGESTED.value, + "tags": [], + "startNodePageNumber": detect_annotation.get("region", {}).get("page"), + "startNodeXMin": min(v.get("x", 0) for v in detect_annotation.get("region", {}).get("vertices", [])), + "startNodeYMin": min(v.get("y", 0) for v in detect_annotation.get("region", {}).get("vertices", [])), + "startNodeXMax": max(v.get("x", 0) for v in detect_annotation.get("region", {}).get("vertices", [])), + "startNodeYMax": max(v.get("y", 0) for v in detect_annotation.get("region", {}).get("vertices", [])), + "startNodeText": detect_annotation.get("text"), + "sourceCreatedUser": self.FUNCTION_ID, + "sourceUpdatedUser": self.FUNCTION_ID, + "sourceCreatedTime": now.isoformat(), + "sourceUpdatedTime": now.isoformat(), + } + edge_apply = EdgeApply( + space=self.sink_node_ref.space, + external_id=external_id, + type=DirectRelationReference( + space=self.core_annotation_view_id.space, + 
external_id=annotation_type, + ), + start_node=DirectRelationReference(space=file_id.space, external_id=file_id.external_id), + end_node=self.sink_node_ref, + sources=[ + NodeOrEdgeData( + source=self.core_annotation_view_id, + properties=annotation_properties, + ) + ], + ) + edge_applies.append(edge_apply) + row_columns = { + "externalId": external_id, + "startSourceId": source_id, + "startNode": file_id.external_id, + "startNodeSpace": file_id.space, + "endNode": self.sink_node_ref.external_id, + "endNodeSpace": self.sink_node_ref.space, + "endNodeResourceType": entity.get("resource_type", "Unknown"), + "viewId": self.core_annotation_view_id.external_id, + "viewSpace": self.core_annotation_view_id.space, + "viewVersion": self.core_annotation_view_id.version, + **annotation_properties, + } + doc_patterns.append(RowWrite(key=external_id, columns=row_columns)) + return edge_applies, doc_patterns def _detect_annotation_to_edge_applies( self, @@ -128,69 +353,60 @@ def _detect_annotation_to_edge_applies( doc_tag: list[RowWrite], detect_annotation: dict[str, Any], ) -> dict[tuple, EdgeApply]: + """ + Converts a single detection annotation into edge applies and RAW row writes. + + Creates annotation edges linking the file to detected entities, applying confidence thresholds + to determine approval/suggestion status. Also creates corresponding RAW table entries. - # NOTE: Using a set to ensure uniqueness and solve the duplicate external edge ID problem - diagram_annotations: dict[tuple, EdgeApply] = {} - annotation_schema_space: str = self.config.data_model_views.core_annotation_view.schema_space + Args: + file_instance_id: NodeId of the file being annotated. + source_id: Source ID of the file for RAW table logging. + doc_doc: List to append doc-to-doc annotation RAW rows to. + doc_tag: List to append doc-to-tag annotation RAW rows to. + detect_annotation: Dictionary containing a single detection result. 
- for entity in detect_annotation["entities"]: - if detect_annotation["confidence"] >= self.approve_threshold: - annotation_status = DiagramAnnotationStatus.APPROVED.value - elif detect_annotation["confidence"] >= self.suggest_threshold: - annotation_status = DiagramAnnotationStatus.SUGGESTED.value + Returns: + Dictionary mapping edge keys to EdgeApply objects (deduplicated by start/end/type). + """ + diagram_annotations = {} + for entity in detect_annotation.get("entities", []): + if detect_annotation.get("confidence", 0.0) >= self.approve_threshold: + status = DiagramAnnotationStatus.APPROVED.value + elif detect_annotation.get("confidence", 0.0) >= self.suggest_threshold: + status = DiagramAnnotationStatus.SUGGESTED.value else: continue - external_id = self._create_annotation_id( - file_instance_id, - entity, - detect_annotation["text"], - detect_annotation, - ) - - doc_log = { - "external_id": external_id, - "start_source_id": source_id, - "start_node": file_instance_id.external_id, - "end_node": entity["external_id"], - "end_node_space": entity["space"], - "view_id": self.core_annotation_view_id.external_id, - "view_space": self.core_annotation_view_id.space, - "view_version": self.core_annotation_view_id.version, - } + external_id = self._create_annotation_id(file_instance_id, entity, detect_annotation) now = datetime.now(timezone.utc).replace(microsecond=0) - annotation_properties = { "name": file_instance_id.external_id, - "confidence": detect_annotation["confidence"], - "status": annotation_status, - "startNodePageNumber": detect_annotation["region"]["page"], - "startNodeXMin": min(v["x"] for v in detect_annotation["region"]["vertices"]), - "startNodeYMin": min(v["y"] for v in detect_annotation["region"]["vertices"]), - "startNodeXMax": max(v["x"] for v in detect_annotation["region"]["vertices"]), - "startNodeYMax": max(v["y"] for v in detect_annotation["region"]["vertices"]), - "startNodeText": detect_annotation["text"], + "confidence": 
detect_annotation.get("confidence"), + "status": status, + "startNodePageNumber": detect_annotation.get("region", {}).get("page"), + "startNodeXMin": min(v.get("x", 0) for v in detect_annotation.get("region", {}).get("vertices", [])), + "startNodeYMin": min(v.get("y", 0) for v in detect_annotation.get("region", {}).get("vertices", [])), + "startNodeXMax": max(v.get("x", 0) for v in detect_annotation.get("region", {}).get("vertices", [])), + "startNodeYMax": max(v.get("y", 0) for v in detect_annotation.get("region", {}).get("vertices", [])), + "startNodeText": detect_annotation.get("text"), "sourceCreatedUser": self.FUNCTION_ID, "sourceUpdatedUser": self.FUNCTION_ID, + "sourceCreatedTime": now.isoformat(), + "sourceUpdatedTime": now.isoformat(), } - - doc_log.update(annotation_properties) - annotation_properties["sourceCreatedTime"] = now.isoformat() - annotation_properties["sourceUpdatedTime"] = now.isoformat() - - edge_apply_instance = EdgeApply( + edge = EdgeApply( space=file_instance_id.space, external_id=external_id, - existing_version=None, type=DirectRelationReference( - space=annotation_schema_space, - external_id=entity["annotation_type_external_id"], + space=self.core_annotation_view_id.space, + external_id=entity.get("annotation_type"), ), start_node=DirectRelationReference( space=file_instance_id.space, external_id=file_instance_id.external_id, ), - end_node=DirectRelationReference(space=entity["space"], external_id=entity["external_id"]), + end_node=DirectRelationReference(space=entity.get("space"), external_id=entity.get("external_id")), sources=[ NodeOrEdgeData( source=self.core_annotation_view_id, @@ -198,106 +414,116 @@ def _detect_annotation_to_edge_applies( ) ], ) + key = self._get_edge_apply_unique_key(edge) + if key not in diagram_annotations: + diagram_annotations[key] = edge - edge_apply_key = self._get_edge_apply_unique_key(edge_apply_instance) - if edge_apply_key not in diagram_annotations: - diagram_annotations[edge_apply_key] = 
edge_apply_instance - - if entity["annotation_type_external_id"] == self.file_annotation_type: - doc_doc.append(RowWrite(key=doc_log["external_id"], columns=doc_log)) + doc_log = { + "externalId": external_id, + "startSourceId": source_id, + "startNode": file_instance_id.external_id, + "startNodeSpace": file_instance_id.space, + "endNode": entity.get("external_id"), + "endNodeSpace": entity.get("space"), + "endNodeResourceType": entity.get("resource_type"), + "viewId": self.core_annotation_view_id.external_id, + "viewSpace": self.core_annotation_view_id.space, + "viewVersion": self.core_annotation_view_id.version, + **annotation_properties, + } + if entity.get("annotation_type") == self.file_annotation_type: + doc_doc.append(RowWrite(key=external_id, columns=doc_log)) else: - doc_tag.append(RowWrite(key=doc_log["external_id"], columns=doc_log)) - + doc_tag.append(RowWrite(key=external_id, columns=doc_log)) return diagram_annotations - def _create_annotation_id( - self, - file_id: NodeId, - entity: dict[str, Any], - text: str, - raw_annotation: dict[str, Any], - ) -> str: - hash_ = sha256(json.dumps(raw_annotation, sort_keys=True).encode()).hexdigest()[:10] - naive = f"{file_id.space}:{file_id.external_id}:{entity['space']}:{entity['external_id']}:{text}:{hash_}" - if len(naive) < self.EXTERNAL_ID_LIMIT: - return naive + def _create_stable_hash(self, raw_annotation: dict[str, Any]) -> str: + """ + Generates a stable hash for an annotation to enable deduplication. - prefix = f"{file_id.external_id}:{entity['external_id']}:{text}" - shorten = f"{prefix}:{hash_}" - if len(shorten) < self.EXTERNAL_ID_LIMIT: - return shorten + Creates a deterministic hash based on annotation text, page, and bounding box vertices, + ensuring that identical detections from regular and pattern mode are recognized as duplicates. - return prefix[: self.EXTERNAL_ID_LIMIT - 10] + hash_ + Args: + raw_annotation: Dictionary containing annotation detection data. 
- def delete_annotations_for_file( - self, - file_node: NodeId, - ) -> tuple[list[str], list[str]]: + Returns: + 10-character hash string representing the annotation. """ - Delete all annotation edges for a file node. + text = raw_annotation.get("text", "") + region = raw_annotation.get("region", {}) + vertices = region.get("vertices", []) + sorted_vertices = sorted(vertices, key=lambda v: (v.get("x", 0), v.get("y", 0))) + stable_representation = { + "text": text, + "page": region.get("page"), + "vertices": sorted_vertices, + } + return sha256(json.dumps(stable_representation, sort_keys=True).encode()).hexdigest()[:10] + + def _create_annotation_id(self, file_id: NodeId, entity: dict[str, Any], raw_annotation: dict[str, Any]) -> str: + """ + Creates a unique external ID for a regular annotation edge. + + Combines file ID, entity ID, detected text, and hash to create a human-readable + yet unique identifier, truncating if necessary to stay within CDF's 256 character limit. Args: - client (CogniteClient): The Cognite client instance. - annotation_view_id (ViewId): The ViewId of the annotation view. - node (NodeId): The NodeId of the file node. - """ - annotations = self._list_annotations_for_file(file_node) - - if not annotations: - return [], [] - - doc_annotations_delete: list[str] = [] - tag_annotations_delete: list[str] = [] - edge_ids = [] - for edge in annotations: - edge_ids.append(EdgeId(space=file_node.space, external_id=edge.external_id)) - if edge.type.external_id == self.file_annotation_type: - doc_annotations_delete.append(edge.external_id) - else: - tag_annotations_delete.append(edge.external_id) - self.client.data_modeling.instances.delete(edges=edge_ids) + file_id: NodeId of the file being annotated. + entity: Dictionary containing the detected entity information. + raw_annotation: Dictionary containing annotation detection data. - return doc_annotations_delete, tag_annotations_delete + Returns: + Unique external ID string for the annotation edge. 
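The `_create_stable_hash` logic above can be sketched in isolation. This is a standalone approximation, not the module itself: the free function name and sample payload are illustrative, but the hashing scheme (JSON over text, page, and order-normalized vertices, truncated to 10 hex chars) mirrors the diff.

```python
import json
from hashlib import sha256
from typing import Any


def create_stable_hash(raw_annotation: dict[str, Any]) -> str:
    """Deterministic 10-char hash over text, page, and sorted vertices.

    Sorting vertices by (x, y) makes the hash independent of vertex order,
    so the same detection coming from regular and pattern mode collapses
    to a single deduplication key.
    """
    region = raw_annotation.get("region", {})
    vertices = sorted(
        region.get("vertices", []),
        key=lambda v: (v.get("x", 0), v.get("y", 0)),
    )
    stable = {
        "text": raw_annotation.get("text", ""),
        "page": region.get("page"),
        "vertices": vertices,
    }
    return sha256(json.dumps(stable, sort_keys=True).encode()).hexdigest()[:10]
```

Because only geometry and text feed the hash, two detections of the same region agree even when their `entities` lists differ, which is exactly what the regular-vs-pattern dedup in `process_and_apply_annotations_for_file` relies on.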
+ """ + hash_ = self._create_stable_hash(raw_annotation) + text = raw_annotation.get("text", "") + naive = f"{file_id.external_id}:{entity.get('external_id')}:{text}:{hash_}" + if len(naive) < self.EXTERNAL_ID_LIMIT: + return naive + prefix = f"{file_id.external_id}:{entity.get('external_id')}:{text}" + if len(prefix) > self.EXTERNAL_ID_LIMIT - 11: + prefix = prefix[: self.EXTERNAL_ID_LIMIT - 11] + return f"{prefix}:{hash_}" - def _list_annotations_for_file( - self, - node: NodeId, - ): + def _create_pattern_annotation_id(self, file_id: NodeId, raw_annotation: dict[str, Any]) -> str: """ - List all annotation edges for a file node. + Creates a unique external ID for a pattern annotation edge. + + Similar to regular annotations but prefixed with "pattern:" to distinguish pattern-detected + annotations that link to sink nodes rather than specific entities. Args: - client (CogniteClient): The Cognite client instance. - annotation_view_id (ViewId): The ViewId of the annotation view. - node (NodeId): The NodeId of the file node. + file_id: NodeId of the file being annotated. + raw_annotation: Dictionary containing annotation detection data. Returns: - list: A list of edges (annotations) linked to the file node. + Unique external ID string for the pattern annotation edge. 
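The external-ID truncation used by both `_create_annotation_id` and `_create_pattern_annotation_id` reserves 11 characters for the `:` separator plus the 10-character hash so the result never exceeds CDF's 256-character external-ID limit. A minimal sketch of just that rule (the free function is illustrative):

```python
EXTERNAL_ID_LIMIT = 256  # CDF external-ID length limit


def truncate_external_id(prefix: str, hash_: str) -> str:
    """Keep a human-readable prefix while guaranteeing the ID fits the limit.

    Reserves 11 characters: one for ':' and ten for the stable hash suffix.
    """
    if len(prefix) > EXTERNAL_ID_LIMIT - 11:
        prefix = prefix[: EXTERNAL_ID_LIMIT - 11]
    return f"{prefix}:{hash_}"
```

Keeping the hash suffix even after truncation is the design point: two long prefixes that collide after slicing still yield distinct IDs as long as their underlying detections hash differently.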
""" - annotations = self.client.data_modeling.instances.list( - instance_type="edge", - sources=[self.core_annotation_view_id], - space=node.space, - filter=Or(In(["edge", "startNode"], [node])), - limit=-1, - ) - - return annotations + hash_ = self._create_stable_hash(raw_annotation) + text = raw_annotation.get("text", "") + prefix = f"pattern:{file_id.external_id}:{text}" + if len(prefix) > self.EXTERNAL_ID_LIMIT - 11: + prefix = prefix[: self.EXTERNAL_ID_LIMIT - 11] + return f"{prefix}:{hash_}" def _get_edge_apply_unique_key(self, edge_apply_instance: EdgeApply) -> tuple: """ - Create a hashable value for EdgeApply objects to use as a key for any hashable collection + Generates a unique key for an edge based on its start node, end node, and type. + + Used for deduplication to prevent creating multiple edges with identical connections. + + Args: + edge_apply_instance: EdgeApply object to generate key for. + + Returns: + Tuple of (start_node_tuple, end_node_tuple, type_tuple) for deduplication. 
""" - start_node_key = ( - edge_apply_instance.start_node.space, - edge_apply_instance.start_node.external_id, - ) - end_node_key = ( - edge_apply_instance.end_node.space, - edge_apply_instance.end_node.external_id, - ) - type_key = ( - edge_apply_instance.type.space, - edge_apply_instance.type.external_id, + start_node = edge_apply_instance.start_node + end_node = edge_apply_instance.end_node + type_ = edge_apply_instance.type + return ( + (start_node.space, start_node.external_id) if start_node else None, + (end_node.space, end_node.external_id) if end_node else None, + (type_.space, type_.external_id) if type_ else None, ) - return (start_node_key, end_node_key, type_key) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ConfigService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ConfigService.py index 8c126a18..f1d2584d 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ConfigService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ConfigService.py @@ -8,6 +8,7 @@ CustomizeFuzziness, DirectionWeights, ) +from cognite.client.data_classes.data_modeling import NodeId from cognite.client.data_classes.filters import Filter from cognite.client import CogniteClient from cognite.client import data_modeling as dm @@ -168,6 +169,7 @@ class CacheServiceConfig(BaseModel, alias_generator=to_camel): cache_time_limit: int raw_db: str raw_table_cache: str + raw_manual_patterns_catalog: str class AnnotationServiceConfig(BaseModel, alias_generator=to_camel): @@ -188,6 +190,9 @@ class LaunchFunction(BaseModel, alias_generator=to_camel): secondary_scope_property: Optional[str] = None file_search_property: str = "aliases" target_entities_search_property: str = "aliases" + pattern_mode: bool + file_resource_property: Optional[str] = None + 
target_entities_resource_property: Optional[str] = None data_model_service: DataModelServiceConfig cache_service: CacheServiceConfig annotation_service: AnnotationServiceConfig @@ -201,13 +206,11 @@ class RetrieveServiceConfig(BaseModel, alias_generator=to_camel): class ApplyServiceConfig(BaseModel, alias_generator=to_camel): auto_approval_threshold: float = Field(gt=0.0, le=1.0) auto_suggest_threshold: float = Field(gt=0.0, le=1.0) - - -class ReportServiceConfig(BaseModel, alias_generator=to_camel): + sink_node: NodeId raw_db: str raw_table_doc_tag: str raw_table_doc_doc: str - raw_batch_size: int + raw_table_doc_pattern: str class FinalizeFunction(BaseModel, alias_generator=to_camel): @@ -215,7 +218,76 @@ class FinalizeFunction(BaseModel, alias_generator=to_camel): max_retry_attempts: int retrieve_service: RetrieveServiceConfig apply_service: ApplyServiceConfig - report_service: ReportServiceConfig + + +# Promote Related Configs +class TextNormalizationConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for text normalization and variation generation. + + Controls how text is normalized for matching and what variations are generated + to improve match rates across different naming conventions. + + These flags affect both the normalize() function (for cache keys and direct matching) + and generate_text_variations() function (for query-based matching). + """ + + remove_special_characters: bool = True + convert_to_lowercase: bool = True + strip_leading_zeros: bool = True + + +class EntitySearchServiceConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the EntitySearchService in the promote function. 
+ + Controls entity search and text normalization behavior: + - Queries entities directly (server-side IN filter on entity/file aliases) + - Text normalization for generating search variations + + Uses efficient server-side filtering on the smaller entity dataset rather than + the larger annotation edge dataset for better performance at scale. + """ + + enable_existing_annotations_search: bool = True + enable_global_entity_search: bool = True + max_entity_search_limit: int = Field(default=1000, gt=0, le=10000) + text_normalization: TextNormalizationConfig + + +class PromoteCacheServiceConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the CacheService in the promote function. + + Controls caching behavior for textβ†’entity mappings. + """ + + cache_table_name: str + + +class PromoteFunctionConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the promote function. + + The promote function resolves pattern-mode annotations by finding matching entities + and updating annotation edges from pointing to a sink node to pointing to actual entities. + + Configuration is organized by service interface: + - entitySearchService: Controls entity search strategies + - cacheService: Controls caching behavior + + Batch size is controlled via getCandidatesQuery.limit field. 
+ """ + + get_candidates_query: QueryConfig | list[QueryConfig] + raw_db: str + raw_table_doc_pattern: str + raw_table_doc_tag: str + raw_table_doc_doc: str + delete_rejected_edges: bool + delete_suggested_edges: bool + entity_search_service: EntitySearchServiceConfig + cache_service: PromoteCacheServiceConfig class DataModelViews(BaseModel, alias_generator=to_camel): @@ -230,6 +302,7 @@ class Config(BaseModel, alias_generator=to_camel): prepare_function: PrepareFunction launch_function: LaunchFunction finalize_function: FinalizeFunction + promote_function: PromoteFunctionConfig @classmethod def parse_direct_relation(cls, value: Any) -> Any: diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/FinalizeService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/FinalizeService.py index 9b42001c..9388c336 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/FinalizeService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/FinalizeService.py @@ -7,8 +7,8 @@ from cognite.client.data_classes.data_modeling import ( Node, NodeId, + NodeList, NodeApply, - NodeApplyList, NodeOrEdgeData, ) @@ -16,7 +16,6 @@ from services.LoggerService import CogniteFunctionLogger from services.RetrieveService import IRetrieveService from services.ApplyService import IApplyService -from services.ReportService import IReportService from utils.DataStructures import ( BatchOfNodes, PerformanceTracker, @@ -39,7 +38,6 @@ def __init__( tracker: PerformanceTracker, retrieve_service: IRetrieveService, apply_service: IApplyService, - report_service: IReportService, ): self.client: CogniteClient = client self.config: Config = config @@ -47,7 +45,6 @@ def __init__( self.tracker: PerformanceTracker = tracker self.retrieve_service: IRetrieveService = retrieve_service self.apply_service: IApplyService = 
apply_service - self.report_service: IReportService = report_service @abc.abstractmethod def run(self) -> str | None: @@ -56,9 +53,7 @@ def run(self) -> str | None: class GeneralFinalizeService(AbstractFinalizeService): """ - Orchestrates the file annotation finalize process. - This service retrieves the results of the diagram detect jobs from the launch function and then applies annotations to the file. - Additionally, it captures the file and asset annotations into separate RAW tables. + Implementation of the FinalizeService. """ def __init__( @@ -69,7 +64,7 @@ def __init__( tracker: PerformanceTracker, retrieve_service: IRetrieveService, apply_service: IApplyService, - report_service: IReportService, + function_call_info: dict, ): super().__init__( client, @@ -78,7 +73,6 @@ def __init__( tracker, retrieve_service, apply_service, - report_service, ) self.annotation_state_view: ViewPropertyConfig = config.data_model_views.annotation_state_view @@ -86,46 +80,36 @@ def __init__( self.page_range: int = config.launch_function.annotation_service.page_range self.max_retries: int = config.finalize_function.max_retry_attempts self.clean_old_annotations: bool = config.finalize_function.clean_old_annotations + self.function_id: int | None = function_call_info.get("function_id") + self.call_id: int | None = function_call_info.get("call_id") def run(self) -> Literal["Done"] | None: """ - Retrieves the result of a diagram detect job and then pushes the annotation to mpcFile. - Specifically, - 1. Get a unique jobId and all instances of mpcAnnotationState that share that jobId - 2. If an error occurs - - Retrieve another job - 3. If no error occurs - - Continue - 4. Check the status of the job - 5. If the job is complete - - Iterate through all items in the diagram detect job results push the annotation to mpcFile - 6. 
If a file does have annotations - - Push the annotations to the file - - Update status of FileAnnotationState to "Annotated" - - Add annotations to the annotations report - 7. If a file doesn't have any annotations or an error occurs - - Update status of mpcAnnotationState to "Retry" or "Fail" - 8. If the job isn't complete - - Update status of FileAnnotationState to "Processing" - - End the run + Main execution loop for finalizing diagram detection jobs. + + Retrieves completed jobs, fetches their results, processes annotations for each file, + and updates annotation state instances. Handles multi-page files by tracking progress + and requeueing files with remaining pages. + + Args: + None + + Returns: + "Done" if no jobs available, None if processing should continue. + + Raises: + CogniteAPIError: Various API errors are handled gracefully (version conflicts, + timeouts, etc.). """ - self.logger.info( - message="Starting Finalize Function", - section="START", - ) + self.logger.info("Starting Finalize Function", section="START") try: - job_id, file_to_state_map = self.retrieve_service.get_job_id() + job_id, pattern_mode_job_id, file_to_state_map = self.retrieve_service.get_job_id() if not job_id or not file_to_state_map: - self.logger.info(message="No diagram detect jobs found", section="END") + self.logger.info("No diagram detect jobs found", section="END") return "Done" - else: - self.logger.info( - message=f"Retrieved job id ({job_id}) and claimed {len(file_to_state_map.values())} files" - ) + self.logger.info(f"Retrieved job id ({job_id}) and claimed {len(file_to_state_map.values())} files") except CogniteAPIError as e: - # NOTE: Reliant on the CogniteAPI message to stay the same across new releases. If unexpected changes were to occur please refer to this section of the code and check if error message is now different. if e.code == 400 and e.message == "A version conflict caused the ingest to fail.": - # NOTE: Expected behavior. 
Means jobs has been claimed already. self.logger.info( message=f"Retrieved job id that has already been claimed. Grabbing another job.", section="END", @@ -135,14 +119,17 @@ def run(self) -> Literal["Done"] | None: e.code == 408 and e.message == "Graph query timed out. Reduce load or contention, or optimise your query." ): - # NOTE: 408 indicates a timeout error. Keep retrying the query if a timeout occurs. self.logger.error(message=f"Ran into the following error:\n{str(e)}", section="END") return else: raise e + job_results: dict | None = None + pattern_mode_job_results: dict | None = None try: - job_results: dict | None = self.retrieve_service.get_diagram_detect_job_result(job_id) + job_results = self.retrieve_service.get_diagram_detect_job_result(job_id) + if pattern_mode_job_id: + pattern_mode_job_results = self.retrieve_service.get_diagram_detect_job_result(pattern_mode_job_id) except Exception as e: self.logger.info( message=f"Unfinalizing {len(file_to_state_map.keys())} files - job id ({job_id}) is a bad gateway", @@ -154,9 +141,17 @@ def run(self) -> Literal["Done"] | None: failed=True, ) - if job_results is None: + # A job is considered complete if: + # 1. The main job is finished, AND + # 2. EITHER pattern mode was not enabled (no pattern job ID) + # OR pattern mode was enabled AND its job is also finished. 
+ jobs_complete: bool = job_results is not None and ( + not pattern_mode_job_id or pattern_mode_job_results is not None + ) + + if not jobs_complete: self.logger.info( - message=f"Unfinalizing {len(file_to_state_map.keys())} files - job id ({job_id}) is not complete yet", + message=f"Unfinalizing {len(file_to_state_map.keys())} files - job id ({job_id}) and/or pattern id ({pattern_mode_job_id}) not complete", section="END", ) self._update_batch_state( @@ -168,160 +163,127 @@ def run(self) -> Literal["Done"] | None: return self.logger.info( - message=f"Applying annotations to {len(job_results['items'])} files", + f"Both jobs ({job_id}, {pattern_mode_job_id}) complete. Applying all annotations.", section="END", ) - count_retry = 0 - count_failed = 0 - annotation_state_node_applies: list[NodeApply] = [] - failed_file_ids: list[NodeId] = [] - for diagram_detect_item in job_results["items"]: - file_id: NodeId = NodeId.load(diagram_detect_item["fileInstanceId"]) - annotation_state_node: Node = file_to_state_map[file_id] + merged_results = { + (item["fileInstanceId"]["space"], item["fileInstanceId"]["externalId"]): {"regular": item} + for item in job_results["items"] + } + if pattern_mode_job_results: + for item in pattern_mode_job_results["items"]: + key = ( + item["fileInstanceId"]["space"], + item["fileInstanceId"]["externalId"], + ) + if key in merged_results: + merged_results[key]["pattern"] = item + else: + merged_results[key] = {"pattern": item} + + count_retry, count_failed, count_success = 0, 0, 0 + annotation_state_node_applies = [] + + for (space, external_id), results in merged_results.items(): + file_id = NodeId(space, external_id) + file_node = self.client.data_modeling.instances.retrieve_nodes( + nodes=file_id, sources=self.file_view.as_view_id() + ) + if not file_node: + continue - current_attempt_count: int = cast( + annotation_state_node = file_to_state_map[file_id] + current_attempt = cast( int, 
annotation_state_node.properties[self.annotation_state_view.as_view_id()]["attemptCount"], ) - next_attempt_count = current_attempt_count + 1 - job_node_to_update: NodeApply | None = None - if diagram_detect_item.get("annotations") and len(diagram_detect_item["annotations"]) > 0: - try: - self.logger.info(f"Applying annotations to file NodeId - {str(file_id)}") - if self.clean_old_annotations: - self.logger.info("Deleting old annotations") - doc_annotations_delete, tag_annotations_delete = self.apply_service.delete_annotations_for_file( - file_node=file_id - ) - self.logger.info( - f"\t- deleted {len(doc_annotations_delete)} document annotations\n- deleted {len(tag_annotations_delete)} tag annoations" - ) - self.report_service.delete_annotations(doc_annotations_delete, tag_annotations_delete) - - doc_annotations, tag_annotations = self.apply_service.apply_annotations( - diagram_detect_item, file_id + next_attempt = current_attempt + 1 + + try: + self.logger.info(f"Processing file {file_id}:") + annotation_msg, pattern_msg = self.apply_service.process_and_apply_annotations_for_file( + file_node, + results.get("regular"), + results.get("pattern"), + self.clean_old_annotations + and annotation_state_node.properties[self.annotation_state_view.as_view_id()].get( + "annotatedPageCount" ) - doc_msg = f"added/updated {len(doc_annotations)} document annotations" - tag_msg = f"added/updated {len(tag_annotations)} tag annotations" - - page_count: int = diagram_detect_item["pageCount"] - annotated_page_count: int = self._check_all_pages_annotated(annotation_state_node, page_count) - if annotated_page_count == page_count: - job_node_to_update = self._process_annotation_state( - node=annotation_state_node, - status=AnnotationStatus.ANNOTATED, - attempt_count=next_attempt_count, - annotated_page_count=annotated_page_count, - page_count=page_count, - annotation_message=f"{doc_msg} and {tag_msg}", - ) - else: - job_node_to_update = self._process_annotation_state( - 
node=annotation_state_node, - status=AnnotationStatus.NEW, - attempt_count=current_attempt_count, # NOTE: using current_attempt_count since don't want to increment this if not fully annotated - annotated_page_count=annotated_page_count, - page_count=page_count, - annotation_message=f"{doc_msg} and {tag_msg}", - ) - - self.report_service.add_annotations(doc_rows=doc_annotations, tag_rows=tag_annotations) - self.logger.info(f"\t- {doc_msg}\n- {tag_msg}") - - except Exception as e: - msg = str(e) - if next_attempt_count >= self.max_retries: - job_node_to_update = self._process_annotation_state( - node=annotation_state_node, - status=AnnotationStatus.FAILED, - attempt_count=next_attempt_count, - annotation_message=msg, - ) - count_failed += 1 - self.logger.info( - f"\t- set the annotation status to {AnnotationStatus.FAILED}\n- ran into the following error: {msg}" - ) - failed_file_ids.append(file_id) - else: - job_node_to_update = self._process_annotation_state( - node=annotation_state_node, - status=AnnotationStatus.RETRY, - attempt_count=next_attempt_count, - annotation_message=msg, - ) - count_retry += 1 - self.logger.info( - f"\t- set the annotation status to 'Retry'\n- ran into the following error: {msg}" - ) - else: - msg = f"found 0 annotations in diagram_detect_item for file {str(file_id)}" - if next_attempt_count >= self.max_retries: + is None, + ) + self.logger.info(f"\t- {annotation_msg}") + self.logger.info(f"\t- {pattern_msg}") + + # Logic to handle multi-page files + page_count = results.get("regular", {}).get("pageCount", 1) + annotated_pages = self._check_all_pages_annotated(annotation_state_node, page_count) + + if annotated_pages == page_count: job_node_to_update = self._process_annotation_state( - node=annotation_state_node, - status=AnnotationStatus.FAILED, - attempt_count=next_attempt_count, - annotation_message=msg, + annotation_state_node, + AnnotationStatus.ANNOTATED, + next_attempt, + annotated_pages, + page_count, + annotation_msg, + 
pattern_msg, ) - count_failed += 1 - self.logger.info(f"\t- set the annotation status to 'Failed'\n- {msg}") - failed_file_ids.append(file_id) + count_success += 1 else: job_node_to_update = self._process_annotation_state( - node=annotation_state_node, - status=AnnotationStatus.RETRY, - attempt_count=next_attempt_count, - annotation_message=msg, + annotation_state_node, + AnnotationStatus.NEW, + current_attempt, + annotated_pages, + page_count, + "Processed page batch, more pages remaining", + pattern_msg, ) - count_retry += 1 - self.logger.info(f"\t- set the annotation status to 'Retry'\n- {msg}") - if job_node_to_update: - annotation_state_node_applies.append(job_node_to_update) - - if failed_file_ids: - file_applies: NodeApplyList = self.client.data_modeling.instances.retrieve_nodes( - nodes=failed_file_ids, sources=self.file_view.as_view_id() - ).as_write() - for node_apply in file_applies: - node_apply.existing_version = None - tags_property: list[str] = cast(list[str], node_apply.sources[0].properties["tags"]) - if "AnnotationInProcess" in tags_property: - index = tags_property.index("AnnotationInProcess") - tags_property[index] = "AnnotationFailed" - elif "Annotated" in tags_property: - self.logger.debug( - f"Annotated is in the tags property of {node_apply.as_id()}\nTherefore, this set of pages does not contain any annotations while the prior pages do" + count_success += 1 # Still a success for this batch + + except Exception as e: + self.logger.error(f"Failed to process annotations for file {file_id}: {e}") + if next_attempt >= self.max_retries: + job_node_to_update = self._process_annotation_state( + annotation_state_node, + AnnotationStatus.FAILED, + next_attempt, + annotation_message=str(e), + pattern_mode_message=str(e), ) - elif "AnnotationFailed" not in tags_property: - self.logger.error( - f"AnnotationFailed and AnnotationInProcess not found in tag property of {node_apply.as_id()}" + count_failed += 1 + else: + job_node_to_update = 
self._process_annotation_state( + annotation_state_node, + AnnotationStatus.RETRY, + next_attempt, + annotation_message=str(e), + pattern_mode_message=str(e), ) - try: - self.client.data_modeling.instances.apply(nodes=file_applies, replace=False) - except CogniteAPIError as e: - self.logger.error(f"Ran into the following error:\n\t{str(e)}\nTrying again in 30 seconds") - time.sleep(30) - self.client.data_modeling.instances.apply(nodes=file_applies, replace=False) + count_retry += 1 + + annotation_state_node_applies.append(job_node_to_update) + # Batch update the state nodes at the end if annotation_state_node_applies: - node_count = len(annotation_state_node_applies) - count_annotated = node_count - count_retry - count_failed self.logger.info( - message=f"Updating {node_count} annotation state instances", + f"Updating {len(annotation_state_node_applies)} annotation state instances", section="START", ) try: - self.apply_service.update_nodes(list_node_apply=annotation_state_node_applies) + self.apply_service.update_instances(list_node_apply=annotation_state_node_applies) self.logger.info( - f"\t- {count_annotated} set to Annotated\n- {count_retry} set to retry\n- {count_failed} set to failed" + f"\t- {count_success} set to Annotated/New\n\t- {count_retry} set to Retry\n\t- {count_failed} set to Failed" ) except Exception as e: self.logger.error( - message=f"Error during batch update of individual annotation states: \n{e}", + f"Error during batch update of annotation states: {e}", section="END", ) - self.tracker.add_files(success=count_annotated, failed=(count_failed + count_retry)) + self.tracker.add_files(success=count_success, failed=(count_failed + count_retry)) + return None def _process_annotation_state( self, @@ -331,9 +293,28 @@ def _process_annotation_state( annotated_page_count: int | None = None, page_count: int | None = None, annotation_message: str | None = None, + pattern_mode_message: str | None = None, ) -> NodeApply: """ - Create a node apply from the 
node passed into the function. + Creates a NodeApply to update an annotation state instance with processing results. + + Updates status, attempt count, timestamps, and page tracking for multi-page files. + The annotatedPageCount and pageCount properties are updated based on progress through + the file's pages. + + Args: + node: The annotation state node to update. + status: New annotation status (ANNOTATED, FAILED, NEW, RETRY). + attempt_count: Current attempt count for this file. + annotated_page_count: Number of pages successfully annotated so far. + page_count: Total number of pages in the file. + annotation_message: Message describing regular annotation results. + pattern_mode_message: Message describing pattern mode results. + + Returns: + NodeApply object ready to be applied to update the annotation state. + + NOTE: Create a node apply from the node passed into the function. The annotatedPageCount and pageCount properties won't be set if this is the first time the job has been run for the specific node. Thus, we set it here and include logic to handle the scenario where it is set. 
NOTE: Always want to use the latest page count from the diagram detect results @@ -349,29 +330,23 @@ def _process_annotation_state( - If an error occurs, the annotated_page_count and page_count won't be passed - Don't want to touch the pageCount and annotatedPageCount properties in this scenario """ - if not annotated_page_count or not page_count: - update_properties = { - "annotationStatus": status, - "sourceUpdatedTime": datetime.now(timezone.utc).replace(microsecond=0).isoformat(), - "annotationMessage": annotation_message, - "attemptCount": attempt_count, - "diagramDetectJobId": None, # clear the job id - } - else: - update_properties = { - "annotationStatus": status, - "sourceUpdatedTime": datetime.now(timezone.utc).replace(microsecond=0).isoformat(), - "annotationMessage": annotation_message, - "attemptCount": attempt_count, - "diagramDetectJobId": None, # clear the job id - "annotatedPageCount": annotated_page_count, - "pageCount": page_count, - } + update_properties = { + "annotationStatus": status, + "sourceUpdatedTime": datetime.now(timezone.utc).replace(microsecond=0).isoformat(), + "annotationMessage": annotation_message, + "patternModeMessage": pattern_mode_message, + "attemptCount": attempt_count, + "finalizeFunctionId": self.function_id, + "finalizeFunctionCallId": self.call_id, + } + if annotated_page_count and page_count: + update_properties["annotatedPageCount"] = annotated_page_count + update_properties["pageCount"] = page_count node_apply = NodeApply( space=node.space, external_id=node.external_id, - existing_version=None, # update the node regardless of existing version + existing_version=None, sources=[ NodeOrEdgeData( source=self.annotation_state_view.as_view_id(), @@ -384,8 +359,19 @@ def _process_annotation_state( def _check_all_pages_annotated(self, node: Node, page_count: int) -> int: """ - The annotatedPageCount and pageCount properties won't be set if this is the first time the job has been run for the specific node. 
+ Calculates how many pages have been annotated after this batch completes. + Handles progressive annotation of multi-page files by tracking which pages have been + processed based on the configured page_range batch size. + + Args: + node: The annotation state node being processed. + page_count: Total number of pages in the file from diagram detect results. + + Returns: + Number of pages annotated after this batch (includes previous batches). + + NOTE: The annotatedPageCount and pageCount properties won't be set if this is the first time the job has been run for the specific node. - if annotated_page_count is not set (first run): - if page_range >= to the page count: - annotated_page_count = page_count b/c all of the pages were passed into the FileReference during LaunchService @@ -425,15 +411,18 @@ def _update_batch_state( failed: bool = False, ): """ - Updates the properties of FileAnnnotationState - 1. If failed set to True - - update the status and delete the diagram detect jobId of the nodes - 2. If there's an annoatation message and attempt count - - if status is "Processing": - - Update the status of the nodes - - Set 'sourceUpdateTime' to the time it was claimed so that the jobs first in line for pickup again - - else: - - Update the status of the nodes + Updates annotation state instances in bulk, typically for error scenarios. + + Used when jobs are incomplete or failed to reset job IDs and update status for + retry or re-queuing. + + Args: + batch: BatchOfNodes containing annotation state nodes to update. + status: New annotation status to set for all nodes. + failed: Whether this is a failure scenario (clears job IDs if True). 
+ + Returns: + None """ if len(batch.nodes) == 0: return @@ -444,6 +433,7 @@ def _update_batch_state( "annotationStatus": status, "sourceUpdatedTime": datetime.now(timezone.utc).replace(microsecond=0).isoformat(), "diagramDetectJobId": None, + "patternModeJobId": None, } batch.update_node_properties( new_properties=update_properties, @@ -466,7 +456,7 @@ def _update_batch_state( view_id=self.annotation_state_view.as_view_id(), ) try: - update_results = self.apply_service.update_nodes(list_node_apply=batch.apply) + update_results = self.apply_service.update_instances(list_node_apply=batch.apply) self.logger.info(f"- set annotation status to {status}") except Exception as e: self.logger.error( @@ -474,5 +464,5 @@ def _update_batch_state( section="END", ) time.sleep(30) - update_results = self.apply_service.update_nodes(list_node_apply=batch.apply) + update_results = self.apply_service.update_instances(list_node_apply=batch.apply) self.logger.info(f"- set annotation status to {status}") diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/LoggerService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/LoggerService.py index 17f24d6b..773b7797 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/LoggerService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/LoggerService.py @@ -25,6 +25,16 @@ def __init__( self.write = False def _format_message_lines(self, prefix: str, message: str) -> list[str]: + """ + Formats multi-line messages with consistent indentation. + + Args: + prefix: The log level prefix (e.g., "[INFO]", "[ERROR]"). + message: The message to format. + + Returns: + List of formatted message lines with proper indentation. 
+ """ formatted_lines = [] if "\n" not in message: formatted_lines.append(f"{prefix} {message}") @@ -37,6 +47,16 @@ def _format_message_lines(self, prefix: str, message: str) -> list[str]: return formatted_lines def _print(self, prefix: str, message: str) -> None: + """ + Prints formatted log messages to console and optionally to file. + + Args: + prefix: The log level prefix to prepend to the message. + message: The message to log. + + Returns: + None + """ lines_to_log = self._format_message_lines(prefix, message) if self.write and self.file_handler: try: @@ -51,6 +71,16 @@ def _print(self, prefix: str, message: str) -> None: print(line) def debug(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs a debug-level message. + + Args: + message: The debug message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ if section == "START" or section == "BOTH": self._section() if self.log_level == "DEBUG": @@ -59,6 +89,16 @@ def debug(self, message: str, section: Literal["START", "END", "BOTH"] | None = self._section() def info(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs an info-level message. + + Args: + message: The informational message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ if section == "START" or section == "BOTH": self._section() if self.log_level in ("DEBUG", "INFO"): @@ -67,6 +107,16 @@ def info(self, message: str, section: Literal["START", "END", "BOTH"] | None = N self._section() def warning(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs a warning-level message. + + Args: + message: The warning message to log. + section: Optional section separator position (START, END, or BOTH). 
+ + Returns: + None + """ if section == "START" or section == "BOTH": self._section() if self.log_level in ("DEBUG", "INFO", "WARNING"): @@ -75,6 +125,16 @@ def warning(self, message: str, section: Literal["START", "END", "BOTH"] | None self._section() def error(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs an error-level message. + + Args: + message: The error message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ if section == "START" or section == "BOTH": self._section() self._print("[ERROR]", message) @@ -82,6 +142,12 @@ def error(self, message: str, section: Literal["START", "END", "BOTH"] | None = self._section() def _section(self) -> None: + """ + Prints a visual separator line for log sections. + + Returns: + None + """ if self.write and self.file_handler: self.file_handler.write( "--------------------------------------------------------------------------------\n" @@ -89,6 +155,12 @@ def _section(self) -> None: print("--------------------------------------------------------------------------------") def close(self) -> None: + """ + Closes the file handler if file logging is enabled. 
+ + Returns: + None + """ if self.file_handler: try: self.file_handler.close() diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/PipelineService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/PipelineService.py index 5dd95bc7..7cf5d885 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/PipelineService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/PipelineService.py @@ -36,7 +36,13 @@ def __init__(self, pipeline_ext_id: str, client: CogniteClient): def update_extraction_pipeline(self, msg: str) -> None: """ - Update the message log for the extraction pipeline + Appends a message to the extraction pipeline run log. + + Args: + msg: The message to append to the pipeline log. + + Returns: + None """ if not self.ep_write.message: self.ep_write.message = msg @@ -48,7 +54,13 @@ def upload_extraction_pipeline( status: Literal["success", "failure", "seen"], ) -> None: """ - Upload the extraction pipeline run so that status and message logs are captured + Creates an extraction pipeline run with accumulated status and messages. + + Args: + status: The run status to report (success, failure, or seen). 
+ + Returns: + None """ self.ep_write.status = status self.client.extraction_pipelines.runs.create(self.ep_write) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ReportService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ReportService.py deleted file mode 100644 index c991b405..00000000 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/ReportService.py +++ /dev/null @@ -1,127 +0,0 @@ -import abc - -from cognite.client import CogniteClient -from cognite.client.data_classes import RowWrite - -from services.ConfigService import Config -from services.LoggerService import CogniteFunctionLogger - - -class IReportService(abc.ABC): - """ - Interface for reporting the annotations that have been applied - e.g.) Used as the numerator for annotation link rate at Marathon - """ - - @abc.abstractmethod - def add_annotations(self, doc_rows: list[RowWrite], tag_rows: list[RowWrite]) -> None: - pass - - @abc.abstractmethod - def delete_annotations( - self, - doc_row_keys: list[str], - tag_row_keys: list[str], - ) -> None: - pass - - @abc.abstractmethod - def update_report(self) -> str: - pass - - -class GeneralReportService(IReportService): - """ - Interface for reporting the annotations that have been applied - e.g.) 
Used as the numerator for annotation link rate at Marathon - """ - - def __init__(self, client: CogniteClient, config: Config, logger: CogniteFunctionLogger): - self.client = client - self.config = config - self.logger = logger - - self.db: str = config.finalize_function.report_service.raw_db - self.doc_table: tuple[str, list[RowWrite], list[str]] = ( - config.finalize_function.report_service.raw_table_doc_doc, - [], - [], - ) - self.tag_table: tuple[str, list[RowWrite], list[str]] = ( - config.finalize_function.report_service.raw_table_doc_tag, - [], - [], - ) - self.batch_size: int = config.finalize_function.report_service.raw_batch_size - self.delete: bool = self.config.finalize_function.clean_old_annotations - - def add_annotations(self, doc_rows: list[RowWrite], tag_rows: list[RowWrite]) -> None: - """ - NOTE: Using batch size to ensure that we're writing to raw efficiently. IMO report doesn't need to be pushed to raw at the end of every diagram detect job. - Though we don't want to be too efficient to where we lose out on data in case anything happens to the thread. Thus this balances efficiency with data secureness. - Updating report at the end of every job with 50 files that's processed leads to around 15 seconds of additional time added. - Thus, for 61,000 files / 50 files per job = 1220 jobs * 15 seconds added = 18300 seconds = 305 minutes saved by writing to RAW more efficiently. - """ - self.doc_table[1].extend(doc_rows) - self.tag_table[1].extend(tag_rows) - if len(self.doc_table[1]) + len(self.tag_table[1]) > self.batch_size: - msg = self.update_report() - self.logger.info(f"{msg}", "BOTH") - return - - def delete_annotations( - self, - doc_row_keys: list[str], - tag_row_keys: list[str], - ) -> None: - self.doc_table[2].extend(doc_row_keys) - self.tag_table[2].extend(tag_row_keys) - return - - def update_report(self) -> str: - """ - Upload annotation edges to RAW for reporting. 
- If clean old annotations is set to true, delete the rows before uploading the rows in RAW. - NOTE: tuple meaning -> self.doc_table[0] = tbl_name, [1] = rows to upload, [2] = keys of the rows to delete - """ - delete_msg = None - if self.delete: - self.client.raw.rows.delete( - db_name=self.db, - table_name=self.doc_table[0], - key=self.doc_table[2], - ) - self.client.raw.rows.delete( - db_name=self.db, - table_name=self.tag_table[0], - key=self.tag_table[2], - ) - delete_msg = f"Deleted annotations from db: {self.db}\n- deleted {len(self.doc_table[2])} rows from tbl: {self.doc_table[0]}\n- deleted {len(self.tag_table[2])} rows from tbl: {self.tag_table[0]}" - - update_msg = "No annotations to upload" - if len(self.doc_table[1]) > 0 or len(self.tag_table[1]) > 0: - update_msg = f"Uploaded annotations to db: {self.db}\n- added {len(self.doc_table[1])} rows to tbl: {self.doc_table[0]}\n- added {len(self.tag_table[1])} rows to tbl: {self.tag_table[0]}" - self.client.raw.rows.insert( - db_name=self.db, - table_name=self.doc_table[0], - row=self.doc_table[1], - ensure_parent=True, - ) - self.client.raw.rows.insert( - db_name=self.db, - table_name=self.tag_table[0], - row=self.tag_table[1], - ensure_parent=True, - ) - self._clear_tables() - - if delete_msg: - return f" {delete_msg}\n{update_msg}" - return f" {update_msg}" - - def _clear_tables(self) -> None: - self.doc_table[1].clear() - self.tag_table[1].clear() - if self.delete: - self.doc_table[2].clear() - self.tag_table[2].clear() diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/RetrieveService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/RetrieveService.py index 97bb7cba..272c0770 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/RetrieveService.py +++ 
b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/services/RetrieveService.py @@ -34,7 +34,9 @@ def get_diagram_detect_job_result(self, job_id: int) -> dict | None: pass @abc.abstractmethod - def get_job_id(self) -> tuple[int, dict[NodeId, Node]] | tuple[None, None]: + def get_job_id( + self, + ) -> tuple[int, int | None, dict[NodeId, Node]] | tuple[None, None, None]: pass @@ -55,6 +57,18 @@ def __init__(self, client: CogniteClient, config: Config, logger: CogniteFunctio self.job_api: str = f"/api/v1/projects/{self.client.config.project}/context/diagram/detect" def get_diagram_detect_job_result(self, job_id: int) -> dict | None: + """ + Retrieves the results of a diagram detection job by job ID. + + Polls the diagram detect API to check if a job has completed and returns the results + if available. + + Args: + job_id: The diagram detection job ID to retrieve results for. + + Returns: + Dictionary containing job results if completed, None if still processing or failed. + """ url = f"{self.job_api}/{job_id}" result = None response = self.client.get(url) @@ -69,9 +83,30 @@ def get_diagram_detect_job_result(self, job_id: int) -> dict | None: self.logger.debug(f"{job_id} - Request to get job result failed with {response.status_code} code") return - def get_job_id(self) -> tuple[int, dict[NodeId, Node]] | tuple[None, None]: + def get_job_id( + self, + ) -> tuple[int, int | None, dict[NodeId, Node]] | tuple[None, None, None]: """ - To ensure threads are protected, we do the following... + Retrieves and claims an available diagram detection job for processing. + + Implements optimistic locking to ensure thread-safe job claiming across parallel + function executions. Queries for jobs ready to finalize and attempts to claim them + by updating their status to "Finalizing". 
+ + Args: + None + + Returns: + A tuple containing: + - Regular diagram detection job ID + - Optional pattern mode job ID + - Dictionary mapping file NodeIds to their annotation state nodes + Returns (None, None, None) if no jobs are available. + + Raises: + CogniteAPIError: If another thread has already claimed the job (version conflict). + + NOTE: To ensure threads are protected, we do the following... 1. Query for an available job id 2. Find all annotation state nodes with that job id 3. Claim those nodes by providing the existing version in the node apply request @@ -86,7 +121,7 @@ def get_job_id(self) -> tuple[int, dict[NodeId, Node]] | tuple[None, None]: sort_by_time.append( instances.InstanceSort( property=self.annotation_state_view.as_property_ref("sourceUpdatedTime"), - direction="descending", + direction="ascending", ) ) @@ -94,19 +129,22 @@ def get_job_id(self) -> tuple[int, dict[NodeId, Node]] | tuple[None, None]: instance_type="node", sources=self.annotation_state_view.as_view_id(), space=self.annotation_state_view.instance_space, - limit=-1, + limit=1, filter=self.filter_jobs, sort=sort_by_time, ) if len(annotation_state_instance) == 0: - return None, None + return None, None, None job_node: Node = annotation_state_instance.pop(-1) job_id: int = cast( int, job_node.properties[self.annotation_state_view.as_view_id()]["diagramDetectJobId"], ) + pattern_mode_job_id: int | None = job_node.properties[self.annotation_state_view.as_view_id()].get( + "patternModeJobId" + ) filter_job_id = Equals( property=self.annotation_state_view.as_property_ref("diagramDetectJobId"), @@ -132,11 +170,27 @@ def get_job_id(self) -> tuple[int, dict[NodeId, Node]] | tuple[None, None]: file_node_id = NodeId(space=file_reference["space"], external_id=file_reference["externalId"]) file_to_state_map[file_node_id] = node - return job_id, file_to_state_map + return job_id, pattern_mode_job_id, file_to_state_map def _attempt_to_claim(self, list_job_nodes_to_claim: NodeApplyList) -> 
None: """ - (Optimistic locking based off the node version) + Attempts to claim annotation state nodes using optimistic locking. + + Updates node status from "Processing" to "Finalizing" while preserving existing version + for conflict detection. Includes client-side validation to handle read-after-write + consistency edge cases. + + Args: + list_job_nodes_to_claim: NodeApplyList of annotation state nodes to claim. + + Returns: + None + + Raises: + CogniteAPIError: If another thread has claimed the job (version conflict) or if + client-side lock bypass detection triggers. + + NOTE: (Optimistic locking based off the node version) Attempt to 'claim' the annotation state nodes by updating the annotation status property. This relies on how the API applies changes to nodes. Specifically... if an existing version is provided in the nodes that are used for the .apply() endpoint, a version conflict will occur if another thread has already claimed the job. diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/utils/DataStructures.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/utils/DataStructures.py index 0f7bc3f2..7bc3c435 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/utils/DataStructures.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_finalize/utils/DataStructures.py @@ -31,6 +31,7 @@ class EnvConfig: class DiagramAnnotationStatus(str, Enum): SUGGESTED = "Suggested" APPROVED = "Approved" + REJECTED = "Rejected" class AnnotationStatus(str, Enum): @@ -80,8 +81,8 @@ class AnnotationState: sourceUpdatedTime: str = field( default_factory=lambda: datetime.now(timezone.utc).replace(microsecond=0).isoformat() ) - sourceCreatedUser: str = "fn_dm_context_annotation_launch" - sourceUpdatedUser: str = "fn_dm_context_annotation_launch" + sourceCreatedUser: str = "fn_dm_context_annotation_finalize" + 
sourceUpdatedUser: str = "fn_dm_context_annotation_finalize" def _create_external_id(self) -> str: """ @@ -125,18 +126,17 @@ class entity: "external_id": file.external_id, "name": file.properties[job_config.file_view.as_view_id()]["name"], "space": file.space, - search_property: file.properties[job_config.file_view.as_view_id()][ - search_property - ], - "annotation_type_external_id": job_config.file_view.type, + "annotation_type": job_config.file_view.type, + "resource_type": file.properties[job_config.file_view.as_view_id()][{resource_type}], + "search_property": file.properties[job_config.file_view.as_view_id()][{search_property}], } - Note: kind of prefer a generic variable name here as opposed to specific ones that changes based off config -> i.e.) for marathon the variable here would be aliases instead of search_property """ external_id: str name: str space: str - annotation_type_external_id: Literal["diagrams.FileLink", "diagrams.AssetLink"] | None + annotation_type: Literal["diagrams.FileLink", "diagrams.AssetLink"] | None + resource_type: str search_property: list[str] = field(default_factory=list) def to_dict(self): @@ -309,7 +309,7 @@ def generate_overall_report(self) -> str: def generate_ep_run( self, - caller: Literal["Launch", "Finalize"], + caller: Literal["Prepare", "Launch", "Finalize"], function_id: str | None, call_id: str | None, ) -> str: diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/handler.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/handler.py index 4ff0be16..f5c03ed7 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/handler.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/handler.py @@ -26,7 +26,7 @@ def handle(data: dict, function_call_info: dict, client: CogniteClient) -> dict: 2. 
Create an instance of the launch function and create implementations of the interfaces 3. Run the launch instance until... 4. It's been 7 minutes - 5. There are no files left that need to be annoated + 5. There are no files left that need to be launched NOTE: Cognite functions have a run-time limit of 10 minutes. Don't want the function to die at the 10minute mark since there's no guarantee all code will execute. Thus we set a timelimit of 7 minutes (conservative) so that code execution is guaranteed. @@ -47,19 +47,11 @@ def handle(data: dict, function_call_info: dict, client: CogniteClient) -> dict: client=client, logger=logger_instance, tracker=tracker_instance, + function_call_info=function_call_info, ) run_status: str = "success" try: - while datetime.now(timezone.utc) - start_time < timedelta(minutes=7): - if launch_instance.prepare() == "Done": - break - logger_instance.info(tracker_instance.generate_local_report()) - - overall_report: str = tracker_instance.generate_overall_report() - logger_instance.info(overall_report, "BOTH") - tracker_instance.reset() - while datetime.now(timezone.utc) - start_time < timedelta(minutes=7): if launch_instance.run() == "Done": return {"status": run_status, "data": data} @@ -72,15 +64,12 @@ def handle(data: dict, function_call_info: dict, client: CogniteClient) -> dict: return {"status": run_status, "message": msg} finally: logger_instance.info(tracker_instance.generate_overall_report(), "BOTH") - # only want to report on the count of successful and failed files in ep_logs if there were files that were processed or an error occured - # else run log will be too messy. 
- if tracker_instance.files_failed != 0 or tracker_instance.files_success != 0 or run_status == "failure": - function_id = function_call_info.get("function_id") - call_id = function_call_info.get("call_id") - pipeline_instance.update_extraction_pipeline( - msg=tracker_instance.generate_ep_run("Launch", function_id, call_id) - ) - pipeline_instance.upload_extraction_pipeline(status=run_status) + function_id = function_call_info.get("function_id") + call_id = function_call_info.get("call_id") + pipeline_instance.update_extraction_pipeline( + msg=tracker_instance.generate_ep_run("Launch", function_id, call_id) + ) + pipeline_instance.upload_extraction_pipeline(status=run_status) def run_locally(config_file: dict[str, str], log_path: str | None = None): @@ -89,7 +78,7 @@ def run_locally(config_file: dict[str, str], log_path: str | None = None): 1. Create an instance of config, logger, and tracker 2. Create an instance of the Launch function and create implementations of the interfaces 3. Run the launch instance until - 4. There are no files left that need to be annoated + 4. 
There are no files left that need to be launched """ log_level = config_file.get("logLevel", "DEBUG") config_instance, client = create_config_service(function_data=config_file) @@ -105,16 +94,9 @@ def run_locally(config_file: dict[str, str], log_path: str | None = None): client=client, logger=logger_instance, tracker=tracker_instance, + function_call_info={"function_id": None, "call_id": None}, ) try: - while True: - if launch_instance.prepare() == "Done": - break - logger_instance.info(tracker_instance.generate_local_report()) - - logger_instance.info(tracker_instance.generate_overall_report(), "BOTH") - tracker_instance.reset() - while True: if launch_instance.run() == "Done": break @@ -129,7 +111,7 @@ def run_locally(config_file: dict[str, str], log_path: str | None = None): logger_instance.close() -def _create_launch_service(config, client, logger, tracker) -> AbstractLaunchService: +def _create_launch_service(config, client, logger, tracker, function_call_info) -> AbstractLaunchService: cache_instance: ICacheService = create_general_cache_service(config, client, logger) data_model_instance: IDataModelService = create_general_data_model_service(config, client, logger) annotation_instance: IAnnotationService = create_general_annotation_service(config, client, logger) @@ -141,11 +123,12 @@ def _create_launch_service(config, client, logger, tracker) -> AbstractLaunchSer data_model_service=data_model_instance, cache_service=cache_instance, annotation_service=annotation_instance, + function_call_info=function_call_info, ) return launch_instance -def _create_local_launch_service(config, client, logger, tracker) -> AbstractLaunchService: +def _create_local_launch_service(config, client, logger, tracker, function_call_info) -> AbstractLaunchService: cache_instance: ICacheService = create_general_cache_service(config, client, logger) data_model_instance: IDataModelService = create_general_data_model_service(config, client, logger) annotation_instance: IAnnotationService 
= create_general_annotation_service(config, client, logger) @@ -157,6 +140,7 @@ def _create_local_launch_service(config, client, logger, tracker) -> AbstractLau data_model_service=data_model_instance, cache_service=cache_instance, annotation_service=annotation_instance, + function_call_info=function_call_info, ) return launch_instance diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/AnnotationService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/AnnotationService.py index 9437851a..c134afd6 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/AnnotationService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/AnnotationService.py @@ -1,4 +1,5 @@ import abc +import copy from typing import Any from cognite.client import CogniteClient from services.ConfigService import Config @@ -21,6 +22,10 @@ class IAnnotationService(abc.ABC): def run_diagram_detect(self, files: list[FileReference], entities: list[dict[str, Any]]) -> int: pass + @abc.abstractmethod + def run_pattern_mode_detect(self, files: list[FileReference], pattern_samples: list[dict[str, Any]]) -> int: + pass + # maybe a different class for debug mode and run mode? class GeneralAnnotationService(IAnnotationService): @@ -37,8 +42,27 @@ def __init__(self, config: Config, client: CogniteClient, logger: CogniteFunctio self.diagram_detect_config: DiagramDetectConfig | None = None if config.launch_function.annotation_service.diagram_detect_config: self.diagram_detect_config = config.launch_function.annotation_service.diagram_detect_config.as_config() + # NOTE: Remove Leading Zeros has a weird interaction with pattern mode so will always turn off + if config.launch_function.pattern_mode: + # NOTE: Shallow copy that still references Mutable objects in self.diagram_detect_config. 
+            # Since RemoveLeadingZeros is a boolean value, it is immutable, so we can modify the copy without affecting the original.
+            self.pattern_detect_config = copy.copy(self.diagram_detect_config)
+            self.pattern_detect_config.remove_leading_zeros = False

     def run_diagram_detect(self, files: list[FileReference], entities: list[dict[str, Any]]) -> int:
+        """
+        Initiates a diagram detection job using CDF's diagram detect API.
+
+        Args:
+            files: List of file references to process for annotation.
+            entities: List of entity dictionaries containing searchable properties for annotation matching.
+
+        Returns:
+            The job ID of the initiated diagram detection job.
+
+        Raises:
+            Exception: If the API call does not return a valid job ID.
+        """
         detect_job: DiagramDetectResults = self.client.diagrams.detect(
             file_references=files,
             entities=entities,
@@ -50,4 +74,35 @@ def run_diagram_detect(self, files: list[FileReference], entities: list[dict[str
         if detect_job.job_id:
             return detect_job.job_id
         else:
-            raise Exception(f"404 ---- No job Id was created")
+            raise Exception("API call to diagram/detect did not return a job ID")
+
+    def run_pattern_mode_detect(self, files: list[FileReference], pattern_samples: list[dict[str, Any]]) -> int:
+        """
+        Initiates a diagram detection job in pattern mode using generated pattern samples.
+
+        Pattern mode enables detection of entities based on regex-like patterns rather than exact matches,
+        useful for finding variations of asset tags and identifiers.
+
+        Args:
+            files: List of file references to process for annotation.
+            pattern_samples: List of pattern sample dictionaries containing regex-like patterns for matching.
+
+        Returns:
+            The job ID of the initiated pattern mode diagram detection job.
+
+        Raises:
+            Exception: If the API call does not return a valid job ID.
+ """ + detect_job: DiagramDetectResults = self.client.diagrams.detect( + file_references=files, + entities=pattern_samples, + partial_match=self.annotation_config.partial_match, + min_tokens=self.annotation_config.min_tokens, + search_field="sample", + configuration=self.pattern_detect_config, + pattern_mode=True, + ) + if detect_job.job_id: + return detect_job.job_id + else: + raise Exception("API call to diagram/detect in pattern mode did not return a job ID") diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/CacheService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/CacheService.py index 742e14c3..f122c132 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/CacheService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/CacheService.py @@ -1,7 +1,11 @@ import abc +import re +from typing import Iterator, Any, Dict, List, Set, cast +from collections import defaultdict from datetime import datetime, timezone, timedelta from cognite.client import CogniteClient from cognite.client.data_classes import RowWrite, Row +from cognite.client.exceptions import CogniteNotFoundError from cognite.client.data_classes.data_modeling import ( Node, NodeList, @@ -25,15 +29,19 @@ def get_entities( data_model_service: IDataModelService, primary_scope_value: str, secondary_scope_value: str | None, - ) -> list[dict]: + ) -> tuple[list[dict], list[dict]]: + pass + + @abc.abstractmethod + def _update_cache(self, row_to_write: RowWrite) -> None: pass @abc.abstractmethod - def _update_cache(self) -> list[dict]: + def _validate_cache(self, last_update_datetime_str: str) -> bool: pass @abc.abstractmethod - def _validate_cache(self) -> bool: + def _generate_tag_samples_from_entities(self, entities: list[dict]) -> list[dict]: pass @@ -51,6 +59,7 @@ def __init__(self, config: Config, client: 
CogniteClient, logger: CogniteFunctio self.db_name: str = config.launch_function.cache_service.raw_db self.tbl_name: str = config.launch_function.cache_service.raw_table_cache + self.manual_patterns_tbl_name: str = config.launch_function.cache_service.raw_manual_patterns_catalog self.cache_time_limit: int = config.launch_function.cache_service.cache_time_limit # in hours self.file_view: ViewPropertyConfig = config.data_model_views.file_view @@ -61,10 +70,23 @@ def get_entities( data_model_service: IDataModelService, primary_scope_value: str, secondary_scope_value: str | None, - ) -> list[dict]: + ) -> tuple[list[dict], list[dict]]: """ - Returns file and asset entities for use in diagram detect job - Ensures that the cache is up to date and valid + Retrieves or generates entities and pattern samples for diagram detection. + + This method orchestrates the cache lifecycle: checking validity, fetching fresh data if needed, + generating pattern samples, and updating the cache. The cache is scoped by primary and secondary + scope values to ensure relevant entities are used for each file context. + + Args: + data_model_service: Service instance for querying data model instances. + primary_scope_value: Primary scope identifier (e.g., site, facility). + secondary_scope_value: Optional secondary scope identifier (e.g., unit, area). + + Returns: + A tuple containing: + - Combined list of entity dictionaries (assets + files) for diagram detection. + - Combined list of pattern sample dictionaries for pattern mode detection. 
""" entities: list[dict] = [] if secondary_scope_value: @@ -72,67 +94,91 @@ def get_entities( else: key = f"{primary_scope_value}" - cdf_raw = self.client.raw.rows - row: Row | None = cdf_raw.retrieve(db_name=self.db_name, table_name=self.tbl_name, key=key) + try: + row: Row | None = self.client.raw.rows.retrieve(db_name=self.db_name, table_name=self.tbl_name, key=key) + except: + row = None - if row and row.columns: - last_update_time_str = row.columns["LastUpdateTimeUtcIso"] - if self._validate_cache(last_update_time_str) == False: - self.logger.debug("Refreshing RAW entities cache") - entities = self._update_cache(data_model_service, key, primary_scope_value, secondary_scope_value) - else: - asset_entity: list[dict] = row.columns["AssetEntities"] - file_entity: list[dict] = row.columns["FileEntities"] - entities = asset_entity + file_entity - else: - entities = self._update_cache(data_model_service, key, primary_scope_value, secondary_scope_value) + # Attempt to retrieve from the cache + if row and row.columns and self._validate_cache(row.columns["LastUpdateTimeUtcIso"]): + self.logger.debug(f"Cache valid for key: {key}. Retrieving entities and patterns.") + asset_entities: list[dict] = row.columns.get("AssetEntities", []) + file_entities: list[dict] = row.columns.get("FileEntities", []) + combined_pattern_samples: list[dict] = row.columns.get("CombinedPatternSamples", []) + return (asset_entities + file_entities), combined_pattern_samples - return entities + self.logger.info(f"Refreshing RAW entities cache and patterns cache for key: {key}") - def _update_cache( - self, - data_model_service: IDataModelService, - key: str, - primary_scope_value: str, - secondary_scope_value: str | None, - ) -> list[dict]: - """ - Creates (or overwrites) the cache for a given group. It fetches all relevant - contextualization entities for the files in the group from the data model - and stores them in the cache table. 
- """ - asset_instances: NodeList - file_instances: NodeList + # Fetch data asset_instances, file_instances = data_model_service.get_instances_entities( primary_scope_value, secondary_scope_value ) - asset_entities: list[dict] = [] - file_entities: list[dict] = [] + # Convert to entities for diagram detect job asset_entities, file_entities = self._convert_instances_to_entities(asset_instances, file_instances) + entities = asset_entities + file_entities + + # Generate pattern samples from the same entities + asset_pattern_samples = self._generate_tag_samples_from_entities(asset_entities) + file_pattern_samples = self._generate_tag_samples_from_entities(file_entities) + auto_pattern_samples = asset_pattern_samples + file_pattern_samples + + # Grab the manual pattern samples + manual_pattern_samples = self._get_manual_patterns(primary_scope_value, secondary_scope_value) + + # Merge the auto and manual patterns + combined_pattern_samples = self._merge_patterns(auto_pattern_samples, manual_pattern_samples) - current_time_seconds = datetime.now(timezone.utc).isoformat() + # Update cache new_row = RowWrite( key=key, columns={ "AssetEntities": asset_entities, "FileEntities": file_entities, - "LastUpdateTimeUtcIso": current_time_seconds, + "AssetPatternSamples": asset_pattern_samples, + "FilePatternSamples": file_pattern_samples, + "ManualPatternSamples": manual_pattern_samples, + "CombinedPatternSamples": combined_pattern_samples, + "LastUpdateTimeUtcIso": datetime.now(timezone.utc).isoformat(), }, ) + self._update_cache(new_row) + return entities, combined_pattern_samples + + def _update_cache(self, row_to_write: RowWrite) -> None: + """ + Writes a cache entry to the RAW database table. + + This method's only responsibility is the database insertion. All data preparation + and formatting should be done before calling this method. + + Args: + row_to_write: Fully-formed RowWrite object containing cache data to persist. 
+ + Returns: + None + """ self.client.raw.rows.insert( db_name=self.db_name, table_name=self.tbl_name, - row=new_row, + row=row_to_write, + ensure_parent=True, ) - - entities = asset_entities + file_entities - return entities + self.logger.info(f"Successfully updated RAW cache") + return def _validate_cache(self, last_update_datetime_str: str) -> bool: """ - Checks if the retrieved cache is still valid by comparing its creation - timestamp with the 'cacheTimeLimit' from the configuration. + Validates whether cached data is still fresh based on time elapsed since last update. + + Compares the cache's last update timestamp against the configured cache time limit + to determine if a refresh is needed. + + Args: + last_update_datetime_str: ISO-formatted datetime string of the cache's last update. + + Returns: + True if the cache is still valid (within time limit), False if expired. """ last_update_datetime_utc = datetime.fromisoformat(last_update_datetime_str) current_datetime_utc = datetime.now(timezone.utc) @@ -151,42 +197,314 @@ def _convert_instances_to_entities( self, asset_instances: NodeList, file_instances: NodeList ) -> tuple[list[dict], list[dict]]: """ - Convert the asset and file nodes into an entity + Transforms data model node instances into entity dictionaries for diagram detection. + + Extracts relevant properties from asset and file nodes and formats them as entity + dictionaries compatible with the diagram detect API. + + Args: + asset_instances: NodeList of asset instances from the data model. + file_instances: NodeList of file instances from the data model. + + Returns: + A tuple containing: + - List of target entity dictionaries (typically assets). + - List of file entity dictionaries. 
""" + target_entities_resource_type: str | None = self.config.launch_function.target_entities_resource_property target_entities_search_property: str = self.config.launch_function.target_entities_search_property target_entities: list[dict] = [] + for instance in asset_instances: instance_properties = instance.properties.get(self.target_entities_view.as_view_id()) + asset_resource_type: str = ( + instance_properties[target_entities_resource_type] + if target_entities_resource_type + else self.target_entities_view.external_id + ) if target_entities_search_property in instance_properties: asset_entity = entity( external_id=instance.external_id, name=instance_properties.get("name"), space=instance.space, + annotation_type=self.target_entities_view.annotation_type, + resource_type=asset_resource_type, search_property=instance_properties.get(target_entities_search_property), - annotation_type_external_id=self.target_entities_view.annotation_type, ) target_entities.append(asset_entity.to_dict()) else: + search_value: list = [instance_properties.get("name")] asset_entity = entity( external_id=instance.external_id, name=instance_properties.get("name"), space=instance.space, - search_property=instance_properties.get("name"), - annotation_type_external_id=self.target_entities_view.annotation_type, + annotation_type=self.target_entities_view.annotation_type, + resource_type=asset_resource_type, + search_property=search_value, ) target_entities.append(asset_entity.to_dict()) + file_resource_type_prop: str | None = self.config.launch_function.file_resource_property file_search_property: str = self.config.launch_function.file_search_property file_entities: list[dict] = [] + for instance in file_instances: instance_properties = instance.properties.get(self.file_view.as_view_id()) + file_entity_resource_type: str = ( + instance_properties[file_resource_type_prop] + if target_entities_resource_type + else self.file_view.external_id + ) file_entity = entity( 
external_id=instance.external_id, name=instance_properties.get("name"), space=instance.space, + annotation_type=self.file_view.annotation_type, + resource_type=file_entity_resource_type, search_property=instance_properties.get(file_search_property), - annotation_type_external_id=self.file_view.annotation_type, ) file_entities.append(file_entity.to_dict()) return target_entities, file_entities + + def _generate_tag_samples_from_entities(self, entities: list[dict]) -> list[dict]: + """ + Generates regex-like pattern samples from entity search properties for pattern mode detection. + + Analyzes entity aliases to extract common patterns and variations, creating consolidated + pattern samples that can match multiple similar tags (e.g., "[FT]-000[A|B]"). + + Args: + entities: List of entity dictionaries containing search properties (aliases). + + Returns: + List of pattern sample dictionaries, each containing: + - sample: List of pattern strings + - resource_type: Entity resource type + - annotation_type: Annotation type for the entity + """ + # Structure: { resource_type: {"patterns": { template_key: [...] }, "annotation_type": "..."} } + pattern_builders: Dict[str, Dict[str, Any]] = defaultdict(lambda: {"patterns": {}, "annotation_type": None}) + self.logger.info(f"Generating pattern samples from {len(entities)} entities.") + + def _parse_alias(alias: str, resource_type_key: str) -> tuple[str, list[list[str]]]: + """ + Parse an alias into a normalized template string and collect variable letter groups. + + - Treat hyphens '-' and spaces ' ' as literal characters. + - Wrap all other non-alphanumeric characters in brackets to mark them as required literals (e.g., [+], [.]). + - Replace digits with '0' and letters with 'A' in alphanumeric segments. + - If an alphanumeric segment equals the resource type and is token-boundary isolated, wrap it in brackets to mark it constant. 
+ """ + # Tokenize alias into alphanumeric runs and single-character separators + tokens: list[str] = [] + current_alnum: list[str] = [] + for ch in alias: + if ch.isalnum(): + current_alnum.append(ch) + else: + if current_alnum: + tokens.append("".join(current_alnum)) + current_alnum = [] + tokens.append(ch) + if current_alnum: + tokens.append("".join(current_alnum)) + + full_template_key_parts: list[str] = [] + all_variable_parts: list[list[str]] = [] + + def is_separator(tok: str) -> bool: + return len(tok) == 1 and not tok.isalnum() + + for i, part in enumerate(tokens): + if not part: + continue + if is_separator(part): + # Hyphen and space are plain literals; other specials must be wrapped in brackets + if part == "-" or part == " ": + full_template_key_parts.append(part) + else: + full_template_key_parts.append(f"[{part}]") + continue + + # Alphanumeric segment + left_ok = (i == 0) or is_separator(tokens[i - 1]) + right_ok = (i == len(tokens) - 1) or is_separator(tokens[i + 1]) + if left_ok and right_ok and part == resource_type_key: + full_template_key_parts.append(f"[{part}]") + continue + + segment_template = re.sub(r"\d", "0", part) + segment_template = re.sub(r"[A-Za-z]", "A", segment_template) + full_template_key_parts.append(segment_template) + + variable_letters = re.findall(r"[A-Za-z]+", part) + if variable_letters: + all_variable_parts.append(variable_letters) + + return "".join(full_template_key_parts), all_variable_parts + + for entity in entities: + key = entity["resource_type"] + if pattern_builders[key]["annotation_type"] is None: + pattern_builders[key]["annotation_type"] = entity.get("annotation_type") + + aliases = entity.get("search_property", []) + for alias in aliases: + if not alias: + continue + template_key, variable_parts_from_alias = _parse_alias(alias, key) + resource_patterns = pattern_builders[key]["patterns"] + if template_key in resource_patterns: + existing_variable_sets = resource_patterns[template_key] + for i, part_group in 
enumerate(variable_parts_from_alias): + for j, letter_group in enumerate(part_group): + existing_variable_sets[i][j].add(letter_group) + else: + new_variable_sets = [] + for part_group in variable_parts_from_alias: + new_variable_sets.append([set([lg]) for lg in part_group]) + resource_patterns[template_key] = new_variable_sets + + result = [] + for resource_type, data in pattern_builders.items(): + final_samples = [] + templates: Dict[str, List[List[Set[str]]]] = data.get("patterns") or {} + annotation_type = data["annotation_type"] + for template_key, collected_vars in templates.items(): + var_iter: Iterator[List[Set[str]]] = iter(collected_vars) + + def build_segment(segment_template: str) -> str: + if "A" not in segment_template: + return segment_template + try: + letter_groups_for_segment: List[Set[str]] = next(var_iter) + letter_group_iter: Iterator[Set[str]] = iter(letter_groups_for_segment) + + def replace_A(match): + alternatives = sorted(list(next(letter_group_iter))) + return f"[{'|'.join(alternatives)}]" + + return re.sub(r"A+", replace_A, segment_template) + except StopIteration: + return segment_template + + # Split by bracketed constants or any single non-alphanumeric separator to preserve them as tokens + parts = [p for p in re.split(r"(\[[^\]]+\]|[^A-Za-z0-9])", template_key) if p != ""] + final_pattern_parts = [build_segment(p) if re.search(r"A", p) else p for p in parts] + final_samples.append("".join(final_pattern_parts)) + + # Sanity filter: drop overly generic numeric-only patterns (must contain a letter or a character class) + def _has_alpha_or_class(s: str) -> bool: + if re.search(r"[A-Za-z]", s): + return True + # Character class: bracketed alternatives like [A|B] or [1|2] + if re.search(r"\[[^\]]*\|[^\]]*\]", s): + return True + return False + + final_samples = [s for s in final_samples if _has_alpha_or_class(s)] + + if final_samples: + result.append( + { + "sample": sorted(final_samples), + "resource_type": resource_type, + 
"annotation_type": annotation_type, + } + ) + return result + + def _get_manual_patterns(self, primary_scope: str, secondary_scope: str | None) -> list[dict]: + """ + Retrieves manually defined pattern samples from the RAW catalog. + + Fetches patterns at three levels of specificity: global, primary scope, and combined scope, + allowing for hierarchical pattern definitions with increasing specificity. + + Args: + primary_scope: Primary scope identifier for fetching scope-specific patterns. + secondary_scope: Optional secondary scope identifier for fetching more specific patterns. + + Returns: + List of manually defined pattern dictionaries from all applicable scope levels. + """ + keys_to_fetch = ["GLOBAL"] + if primary_scope: + keys_to_fetch.append(primary_scope) + if primary_scope and secondary_scope: + keys_to_fetch.append(f"{primary_scope}_{secondary_scope}") + + self.logger.info(f"Fetching manual patterns for keys: {keys_to_fetch}") + all_manual_patterns = [] + for key in keys_to_fetch: + try: + row: Row | None = self.client.raw.rows.retrieve( + db_name=self.db_name, + table_name=self.manual_patterns_tbl_name, + key=key, + ) + if row: + patterns = (row.columns or {}).get("patterns", []) + all_manual_patterns.extend(patterns) + except CogniteNotFoundError: + self.logger.info(f"No manual patterns found for key: {key}. This may be expected.") + except Exception as e: + self.logger.error(f"Failed to retrieve manual patterns for key {key}: {e}") + + return all_manual_patterns + + def _merge_patterns(self, auto_patterns: list[dict], manual_patterns: list[dict]) -> list[dict]: + """ + Combines automatically generated and manually defined patterns by resource type. + + Merges pattern samples from both sources, ensuring no duplicates while preserving + all unique patterns for each resource type. Auto-pattern annotation types take precedence. + + Args: + auto_patterns: List of automatically generated pattern dictionaries. 
+ manual_patterns: List of manually defined pattern dictionaries. + + Returns: + List of merged pattern dictionaries, deduplicated and organized by resource type. + """ + merged: Dict[str, Dict[str, Any]] = defaultdict(lambda: {"samples": set(), "annotation_type": None}) + + # Process auto-generated patterns + for item in auto_patterns: + resource_type = item.get("resource_type") + if resource_type: + bucket = merged[resource_type] + samples_set = cast(Set[str], bucket["samples"]) + sample_list = item.get("sample") or [] + samples_set.update(sample_list) + # Set annotation_type if not already set + if not bucket.get("annotation_type"): + bucket["annotation_type"] = item.get("annotation_type") + + # Process manual patterns + for item in manual_patterns: + resource_type = item.get("resource_type") + if resource_type and item.get("sample"): + bucket = merged[resource_type] + samples_set = cast(Set[str], bucket["samples"]) + samples_set.add(cast(str, item["sample"])) + # Set annotation_type if not already set (auto-patterns take precedence) + if not bucket.get("annotation_type"): + # NOTE: UI that creates manual patterns will need to also have the annotation type as a required entry + bucket["annotation_type"] = item.get("annotation_type", "diagrams.AssetLink") + + # Convert the merged dictionary back to the required list format + final_list = [] + for resource_type, data in merged.items(): + samples_safe: Set[str] = cast(Set[str], data.get("samples") or set()) + final_list.append( + { + "resource_type": resource_type, + "sample": sorted(list(samples_safe)), + "annotation_type": data.get("annotation_type"), + } + ) + + self.logger.info(f"Merged auto and manual patterns into {len(final_list)} resource types.") + return final_list diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/ConfigService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/ConfigService.py index 
8c126a18..f1d2584d 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/ConfigService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/ConfigService.py @@ -8,6 +8,7 @@ CustomizeFuzziness, DirectionWeights, ) +from cognite.client.data_classes.data_modeling import NodeId from cognite.client.data_classes.filters import Filter from cognite.client import CogniteClient from cognite.client import data_modeling as dm @@ -168,6 +169,7 @@ class CacheServiceConfig(BaseModel, alias_generator=to_camel): cache_time_limit: int raw_db: str raw_table_cache: str + raw_manual_patterns_catalog: str class AnnotationServiceConfig(BaseModel, alias_generator=to_camel): @@ -188,6 +190,9 @@ class LaunchFunction(BaseModel, alias_generator=to_camel): secondary_scope_property: Optional[str] = None file_search_property: str = "aliases" target_entities_search_property: str = "aliases" + pattern_mode: bool + file_resource_property: Optional[str] = None + target_entities_resource_property: Optional[str] = None data_model_service: DataModelServiceConfig cache_service: CacheServiceConfig annotation_service: AnnotationServiceConfig @@ -201,13 +206,11 @@ class RetrieveServiceConfig(BaseModel, alias_generator=to_camel): class ApplyServiceConfig(BaseModel, alias_generator=to_camel): auto_approval_threshold: float = Field(gt=0.0, le=1.0) auto_suggest_threshold: float = Field(gt=0.0, le=1.0) - - -class ReportServiceConfig(BaseModel, alias_generator=to_camel): + sink_node: NodeId raw_db: str raw_table_doc_tag: str raw_table_doc_doc: str - raw_batch_size: int + raw_table_doc_pattern: str class FinalizeFunction(BaseModel, alias_generator=to_camel): @@ -215,7 +218,76 @@ class FinalizeFunction(BaseModel, alias_generator=to_camel): max_retry_attempts: int retrieve_service: RetrieveServiceConfig apply_service: ApplyServiceConfig - report_service: ReportServiceConfig + + +# Promote Related Configs +class 
TextNormalizationConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for text normalization and variation generation. + + Controls how text is normalized for matching and what variations are generated + to improve match rates across different naming conventions. + + These flags affect both the normalize() function (for cache keys and direct matching) + and generate_text_variations() function (for query-based matching). + """ + + remove_special_characters: bool = True + convert_to_lowercase: bool = True + strip_leading_zeros: bool = True + + +class EntitySearchServiceConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the EntitySearchService in the promote function. + + Controls entity search and text normalization behavior: + - Queries entities directly (server-side IN filter on entity/file aliases) + - Text normalization for generating search variations + + Uses efficient server-side filtering on the smaller entity dataset rather than + the larger annotation edge dataset for better performance at scale. + """ + + enable_existing_annotations_search: bool = True + enable_global_entity_search: bool = True + max_entity_search_limit: int = Field(default=1000, gt=0, le=10000) + text_normalization: TextNormalizationConfig + + +class PromoteCacheServiceConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the CacheService in the promote function. + + Controls caching behavior for textβ†’entity mappings. + """ + + cache_table_name: str + + +class PromoteFunctionConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the promote function. + + The promote function resolves pattern-mode annotations by finding matching entities + and updating annotation edges from pointing to a sink node to pointing to actual entities. 
+ + Configuration is organized by service interface: + - entitySearchService: Controls entity search strategies + - cacheService: Controls caching behavior + + Batch size is controlled via getCandidatesQuery.limit field. + """ + + get_candidates_query: QueryConfig | list[QueryConfig] + raw_db: str + raw_table_doc_pattern: str + raw_table_doc_tag: str + raw_table_doc_doc: str + delete_rejected_edges: bool + delete_suggested_edges: bool + entity_search_service: EntitySearchServiceConfig + cache_service: PromoteCacheServiceConfig class DataModelViews(BaseModel, alias_generator=to_camel): @@ -230,6 +302,7 @@ class Config(BaseModel, alias_generator=to_camel): prepare_function: PrepareFunction launch_function: LaunchFunction finalize_function: FinalizeFunction + promote_function: PromoteFunctionConfig @classmethod def parse_direct_relation(cls, value: Any) -> Any: diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/DataModelService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/DataModelService.py index 695a539b..ee374874 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/DataModelService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/DataModelService.py @@ -104,7 +104,14 @@ def __init__(self, config: Config, client: CogniteClient, logger: CogniteFunctio def get_files_for_annotation_reset(self) -> NodeList | None: """ - Query for files based on the getFilesForAnnotationReset config parameters + Retrieves files that need their annotation status reset based on configuration. + + Args: + None + + Returns: + NodeList of file instances to reset, or None if no reset query is configured. 
+ NOTE: Not building the filter in the object instantiation because the filter will only ever be used once throughout all runs of prepare. Furthermore, there is an implicit guarantee that a filter will be returned because launch checks if the query exists. """ @@ -125,8 +132,16 @@ def get_files_for_annotation_reset(self) -> NodeList | None: def get_files_to_annotate(self) -> NodeList | None: """ - Query for files that are marked "ToAnnotate" in tags and don't have 'AnnotataionInProcess' and 'Annotated' in tags. - More specific details of the query come from the getFilesToAnnotate config parameter. + Retrieves files ready for annotation processing based on their tag status. + + Queries for files marked "ToAnnotate" that don't have 'AnnotationInProcess' or 'Annotated' tags. + The specific query filters are defined in the getFilesToAnnotate config parameter. + + Args: + None + + Returns: + NodeList of file instances ready for annotation, or None if no files found. """ result: NodeList | None = self.client.data_modeling.instances.list( instance_type="node", @@ -142,9 +157,19 @@ def get_files_to_process( self, ) -> tuple[NodeList, dict[NodeId, Node]] | tuple[None, None]: """ - Query for FileAnnotationStateInstances based on the getFilesToProcess config parameter. - Extract the NodeIds of the file that is referenced in mpcAnnotationState. - Retrieve the files with the NodeIds. + Retrieves files with annotation state instances that are ready for diagram detection. + + Queries for FileAnnotationStateInstances based on the getFilesToProcess config parameter, + extracts the linked file NodeIds, and retrieves the corresponding file nodes. + + Args: + None + + Returns: + A tuple containing: + - NodeList of file instances to process + - Dictionary mapping file NodeIds to their annotation state Node instances + Returns (None, None) if no files are found. 
""" annotation_state_filter = self._get_annotation_state_filter() annotation_state_instances: NodeList = self.client.data_modeling.instances.list( @@ -181,7 +206,18 @@ def get_files_to_process( def _get_annotation_state_filter(self) -> Filter: """ - filter = (getFilesToProcess filter || (annotationStatus == Processing && now() - lastUpdatedTime) > 1440 minutes) + Builds a filter for annotation state instances, including automatic retry logic for stuck jobs. + + Combines the configured filter with a fallback filter that catches annotation state instances + stuck in Processing/Finalizing status for more than 12 hours. + + Args: + None + + Returns: + Combined Filter for querying annotation state instances. + + NOTE: filter = (getFilesToProcess filter || (annotationStatus == Processing && now() - lastUpdatedTime) > 1440 minutes) - getFilesToProcess filter comes from extraction pipeline - (annotationStatus == Processing | Finalizing && now() - lastUpdatedTime) > 720 minutes/12 hours -> hardcoded -> reprocesses any file that's stuck - Edge case that occurs very rarely but can happen. @@ -202,7 +238,13 @@ def _get_annotation_state_filter(self) -> Filter: def update_annotation_state(self, list_node_apply: list[NodeApply]) -> NodeApplyResultList: """ - Updates annotation state nodes from the node applies passed into the function + Updates existing annotation state nodes with new property values. + + Args: + list_node_apply: List of NodeApply objects containing updated properties. + + Returns: + NodeApplyResultList containing the results of the update operation. 
""" update_results: InstancesApplyResult = self.client.data_modeling.instances.apply( nodes=list_node_apply, @@ -212,7 +254,13 @@ def update_annotation_state(self, list_node_apply: list[NodeApply]) -> NodeApply def create_annotation_state(self, list_node_apply: list[NodeApply]) -> NodeApplyResultList: """ - Creates annotation state nodes from the node applies passed into the function + Creates new annotation state nodes, replacing any existing nodes with the same IDs. + + Args: + list_node_apply: List of NodeApply objects to create as new annotation state instances. + + Returns: + NodeApplyResultList containing the results of the creation operation. """ update_results: InstancesApplyResult = self.client.data_modeling.instances.apply( nodes=list_node_apply, @@ -225,9 +273,22 @@ def get_instances_entities( self, primary_scope_value: str, secondary_scope_value: str | None ) -> tuple[NodeList, NodeList]: """ - Return the entities that can be used in diagram detect - 1. grab assets that meet the filter requirement - 2. grab files that meet the filter requirement + Retrieves target entities and file entities for use in diagram detection. + + Queries the data model for entities (assets) and files that match the configured filters + and scope values, which will be used to create the entity cache for diagram detection. + + Args: + primary_scope_value: Primary scope identifier (e.g., site, facility). + secondary_scope_value: Optional secondary scope identifier (e.g., unit, area). + + Returns: + A tuple containing: + - NodeList of target entity instances (typically assets) + - NodeList of file entity instances + + NOTE: 1. grab assets that meet the filter requirement + NOTE: 2. 
grab files that meet the filter requirement """ target_filter: Filter = self._get_target_entities_filter(primary_scope_value, secondary_scope_value) file_filter: Filter = self._get_file_entities_filter(primary_scope_value, secondary_scope_value) @@ -250,7 +311,18 @@ def get_instances_entities( def _get_target_entities_filter(self, primary_scope_value: str, secondary_scope_value: str | None) -> Filter: """ - Create a filter that... + Builds a filter for target entities (assets) based on scope and configuration. + + Creates a filter combining scope-specific filtering with global 'ScopeWideDetect' entities. + + Args: + primary_scope_value: Primary scope identifier for filtering entities. + secondary_scope_value: Optional secondary scope identifier for more specific filtering. + + Returns: + Combined Filter for querying target entities. + + NOTE: Create a filter that... - grabs assets in the primary_scope_value and secondary_scope_value provided with detectInDiagram in the tags property or - grabs assets in the primary_scope_value with ScopeWideDetect in the tags property (hard coded) -> provides an option to include entities outside of the secondary_scope_value @@ -283,7 +355,19 @@ def _get_target_entities_filter(self, primary_scope_value: str, secondary_scope_ def _get_file_entities_filter(self, primary_scope_value: str, secondary_scope_value: str | None) -> Filter: """ - Create a filter that... + Builds a filter for file entities based on scope and configuration. + + Creates a filter combining scope-specific filtering with global 'ScopeWideDetect' files, + ensuring file entities have the required search properties. + + Args: + primary_scope_value: Primary scope identifier for filtering file entities. + secondary_scope_value: Optional secondary scope identifier for more specific filtering. + + Returns: + Combined Filter for querying file entities. + + NOTE: Create a filter that... 
- grabs assets in the primary_scope_value and secondary_scope_value provided with DetectInDiagram in the tags property or - grabs assets in the primary_scope_value with ScopeWideDetect in the tags property (hard coded) -> provides an option to include entities outside of the secondary_scope_value diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/LaunchService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/LaunchService.py index 51ab9803..62c44cfe 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/LaunchService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/LaunchService.py @@ -28,8 +28,8 @@ class AbstractLaunchService(abc.ABC): """ - Orchestrates the file annotation launch process. This service prepares files for annotation, - manages batching and caching, and initiates diagram detection jobs. + Orchestrates the file annotation launch process. This service manages batching and caching, + and initiates diagram detection jobs for files ready to be annotated. """ def __init__( @@ -50,13 +50,6 @@ def __init__( self.cache_service = cache_service self.annotation_service = annotation_service - @abc.abstractmethod - def prepare(self) -> str | None: - """ - Peronally think it's cleaner having this operate as a separate cognite function -> but due to mpc function constraints it wouldn't make sense for our project to go down this route (Jack) - """ - pass - @abc.abstractmethod def run(self) -> str | None: pass @@ -64,8 +57,8 @@ def run(self) -> str | None: class GeneralLaunchService(AbstractLaunchService): """ - Orchestrates the file annotation launch process. This service prepares files for annotation, - manages batching and caching, and initiates diagram detection jobs. + Orchestrates the file annotation launch process. 
This service manages batching and caching, + and initiates diagram detection jobs for files ready to be annotated. """ def __init__( @@ -77,6 +70,7 @@ def __init__( data_model_service: IDataModelService, cache_service: ICacheService, annotation_service: IAnnotationService, + function_call_info: dict, ): super().__init__( client, @@ -94,131 +88,31 @@ def __init__( self.file_view: ViewPropertyConfig = config.data_model_views.file_view self.in_memory_cache: list[dict] = [] + self.in_memory_patterns: list[dict] = [] self._cached_primary_scope: str | None = None self._cached_secondary_scope: str | None = None self.primary_scope_property: str = self.config.launch_function.primary_scope_property self.secondary_scope_property: str | None = self.config.launch_function.secondary_scope_property - self.reset_files: bool = False - if self.config.prepare_function.get_files_for_annotation_reset_query: - self.reset_files = True + self.function_id: int | None = function_call_info.get("function_id") + self.call_id: int | None = function_call_info.get("call_id") - # NOTE: I believe this code should be encapsulated as a separate CDF function named prepFunction. Due to the amount of cdf functions we can spin up, we're coupling this within the launchFunction. 
- def prepare(self) -> Literal["Done"] | None: - """ - Retrieves files marked "ToAnnotate" in the tags property and creates a 1-to-1 ratio of FileAnnotationState instances to files + def run(self) -> Literal["Done"] | None: """ - self.logger.info( - message=f"Starting Prepare Function", - section="START", - ) - try: - if self.reset_files: - file_nodes_to_reset: NodeList | None = self.data_model_service.get_files_for_annotation_reset() - if not file_nodes_to_reset: - self.logger.info( - "No files found with the getFilesForAnnotationReset query provided in the config file" - ) - else: - self.logger.info(f"Resetting {len(file_nodes_to_reset)} files") - reset_node_apply: list[NodeApply] = [] - for file_node in file_nodes_to_reset: - file_node_apply: NodeApply = file_node.as_write() - tags_property: list[str] = cast(list[str], file_node_apply.sources[0].properties["tags"]) - if "AnnotationInProcess" in tags_property: - tags_property.remove("AnnotationInProcess") - if "Annotated" in tags_property: - tags_property.remove("Annotated") - if "AnnotationFailed" in tags_property: - tags_property.remove("AnnotationFailed") - - reset_node_apply.append(file_node_apply) - update_results = self.data_model_service.update_annotation_state(reset_node_apply) - self.logger.info( - f"Removed the AnnotationInProcess/Annotated/AnnotationFailed tag of {len(update_results)} files" - ) - self.reset_files = False - except CogniteAPIError as e: - # NOTE: Reliant on the CogniteAPI message to stay the same across new releases. If unexpected changes were to occur please refer to this section of the code and check if error message is now different. - if ( - e.code == 408 - and e.message == "Graph query timed out. Reduce load or contention, or optimise your query." - ): - # NOTE: 408 indicates a timeout error. Keep retrying the query if a timeout occurs. 
- self.logger.error(message=f"Ran into the following error:\n{str(e)}") - return - else: - raise e - - try: - file_nodes: NodeList | None = self.data_model_service.get_files_to_annotate() - if not file_nodes: - self.logger.info( - message=f"No files found to prepare", - section="END", - ) - return "Done" - self.logger.info(f"Preparing {len(file_nodes)} files") - except CogniteAPIError as e: - # NOTE: Reliant on the CogniteAPI message to stay the same across new releases. If unexpected changes were to occur please refer to this section of the code and check if error message is now different. - if ( - e.code == 408 - and e.message == "Graph query timed out. Reduce load or contention, or optimise your query." - ): - # NOTE: 408 indicates a timeout error. Keep retrying the query if a timeout occurs. - self.logger.error(message=f"Ran into the following error:\n{str(e)}") - return - else: - raise e - - annotation_state_instances: list[NodeApply] = [] - file_apply_instances: list[NodeApply] = [] - for file_node in file_nodes: - node_id = {"space": file_node.space, "externalId": file_node.external_id} - annotation_instance = AnnotationState( - annotationStatus=AnnotationStatus.NEW, - linkedFile=node_id, - ) - if not self.annotation_state_view.instance_space: - msg = ( - "Need an instance space in DataModelViews/AnnotationStateView config to store the annotation state" - ) - self.logger.error(msg) - raise ValueError(msg) - annotation_instance_space: str = self.annotation_state_view.instance_space + Main execution loop for launching diagram detection jobs. 
- annotation_node_apply: NodeApply = annotation_instance.to_node_apply( - node_space=annotation_instance_space, - annotation_state_view=self.annotation_state_view.as_view_id(), - ) - annotation_state_instances.append(annotation_node_apply) + Retrieves files ready for processing, organizes them into context-aware batches based on scope, + ensures appropriate entity caches are loaded, and initiates diagram detection jobs for each batch. - file_node_apply: NodeApply = file_node.as_write() - tags_property: list[str] = cast(list[str], file_node_apply.sources[0].properties["tags"]) - if "AnnotationInProcess" not in tags_property: - tags_property.append("AnnotationInProcess") - file_apply_instances.append(file_node_apply) + Args: + None - try: - create_results = self.data_model_service.create_annotation_state(annotation_state_instances) - self.logger.info(message=f"Created {len(create_results)} annotation state instances") - update_results = self.data_model_service.update_annotation_state(file_apply_instances) - self.logger.info( - message=f"Added 'AnnotationInProcess' to the tag property for {len(update_results)} files", - section="END", - ) - except Exception as e: - self.logger.error(message=f"Ran into the following error:\n{str(e)}", section="END") - raise + Returns: + "Done" if no more files to process or max jobs reached, None if processing should continue. - self.tracker.add_files(success=len(file_nodes)) - return - - def run(self) -> Literal["Done"] | None: - """ - The main entry point for the launch service. It prepares the files and then - processes them in organized, context-aware batches. + Raises: + CogniteAPIError: If query timeout (408) or max jobs reached (429), handled gracefully. 
""" self.logger.info( message=f"Starting Launch Function", @@ -288,10 +182,17 @@ def run(self) -> Literal["Done"] | None: def _organize_files_for_processing(self, list_files: NodeList) -> list[FileProcessingBatch]: """ - Groups files based on the 'primary_scope_property' and 'secondary_scope_property' - defined in the configuration. This strategy allows us to load a relevant entity cache - once for a group of files that share the same operational context, significantly - reducing redundant CDF queries. + Organizes files into batches grouped by scope for efficient processing. + + Groups files based on primary and secondary scope properties defined in configuration. + This strategy enables loading a relevant entity cache once per group, significantly + reducing redundant CDF queries for files sharing the same operational context. + + Args: + list_files: NodeList of file instances to organize into batches. + + Returns: + List of FileProcessingBatch objects, each containing files from the same scope. """ organized_data: dict[str, dict[str, list[Node]]] = defaultdict(lambda: defaultdict(list)) @@ -327,8 +228,20 @@ def _organize_files_for_processing(self, list_files: NodeList) -> list[FileProce def _ensure_cache_for_batch(self, primary_scope_value: str, secondary_scope_value: str | None): """ - Ensure self.in_memory_cache is populated for the given site and unit. - Checks if there's a mismatch in site, unit, or if the in_memory_cache is empty + Ensures the in-memory entity cache is loaded and current for the given scope. + + Checks if cache needs refreshing (scope mismatch or empty cache) and fetches fresh + entities and patterns from the cache service if needed. + + Args: + primary_scope_value: Primary scope identifier for the batch being processed. + secondary_scope_value: Optional secondary scope identifier for the batch. + + Returns: + None + + Raises: + CogniteAPIError: If query timeout (408) occurs, handled gracefully by returning early. 
""" if ( self._cached_primary_scope != primary_scope_value @@ -337,8 +250,10 @@ def _ensure_cache_for_batch(self, primary_scope_value: str, secondary_scope_valu ): self.logger.info(f"Refreshing in memory cache") try: - self.in_memory_cache = self.cache_service.get_entities( - self.data_model_service, primary_scope_value, secondary_scope_value + self.in_memory_cache, self.in_memory_patterns = self.cache_service.get_entities( + self.data_model_service, + primary_scope_value, + secondary_scope_value, ) self._cached_primary_scope = primary_scope_value self._cached_secondary_scope = secondary_scope_value @@ -356,16 +271,28 @@ def _ensure_cache_for_batch(self, primary_scope_value: str, secondary_scope_valu def _process_batch(self, batch: BatchOfPairedNodes): """ - Processes a single batch of files. For each file, it starts a diagram - detection job and then updates the corresponding 'AnnotationState' node - with the job ID and a 'Processing' status. + Processes a batch of files by initiating diagram detection jobs and updating state. + + Runs both regular and pattern mode diagram detection (if enabled) for all files in the batch, + then updates annotation state instances with job IDs and processing status. + + Args: + batch: BatchOfPairedNodes containing file references and their annotation state nodes. + + Returns: + None + + Raises: + CogniteAPIError: If max concurrent jobs reached (429), handled gracefully. 
""" if batch.is_empty(): return - self.logger.info(f"Running diagram detect on {batch.size()} files with {len(self.in_memory_cache)} entities") - try: + # Run regular diagram detect + self.logger.info( + f"Running diagram detect on {batch.size()} files with {len(self.in_memory_cache)} entities" + ) job_id: int = self.annotation_service.run_diagram_detect( files=batch.file_references, entities=self.in_memory_cache ) @@ -373,14 +300,35 @@ def _process_batch(self, batch: BatchOfPairedNodes): "annotationStatus": AnnotationStatus.PROCESSING, "sourceUpdatedTime": datetime.now(timezone.utc).replace(microsecond=0).isoformat(), "diagramDetectJobId": job_id, + "launchFunctionId": self.function_id, + "launchFunctionCallId": self.call_id, } + + # Run diagram detect on pattern mode + pattern_job_id: int | None = None + if self.config.launch_function.pattern_mode: + total_patterns = 0 + if self.in_memory_patterns and len(self.in_memory_patterns) >= 2: + total_patterns = len(self.in_memory_patterns[0].get("sample", [])) + len( + self.in_memory_patterns[1].get("sample", []) + ) + elif self.in_memory_patterns and len(self.in_memory_patterns) >= 1: + total_patterns = len(self.in_memory_patterns[0].get("sample", [])) + self.logger.info( + f"Running pattern mode diagram detect on {batch.size()} files with {total_patterns} sample patterns" + ) + pattern_job_id = self.annotation_service.run_pattern_mode_detect( + files=batch.file_references, pattern_samples=self.in_memory_patterns + ) + update_properties["patternModeJobId"] = pattern_job_id + batch.batch_states.update_node_properties( new_properties=update_properties, view_id=self.annotation_state_view.as_view_id(), ) - update_results = self.data_model_service.update_annotation_state(batch.batch_states.apply) + self.data_model_service.update_annotation_state(batch.batch_states.apply) self.logger.info( - message=f" Updated the annotation state instances:\n- annotation status set to 'Processing'\n- job id set to {job_id}", + 
message=f"Updated the annotation state instances:\n- annotation status set to 'Processing'\n- job id set to {job_id}\n- pattern mode job id set to {pattern_job_id}", section="END", ) finally: @@ -389,22 +337,36 @@ def _process_batch(self, batch: BatchOfPairedNodes): class LocalLaunchService(GeneralLaunchService): """ - A Launch service that uses a custom, local process for handling batches, - while inheriting all other functionality from GeneralLaunchService. + Launch service variant for local development and debugging. + + Extends GeneralLaunchService with custom error handling for local runs, including + sleep/retry logic for API rate limiting rather than immediate termination. """ def _process_batch(self, batch: BatchOfPairedNodes): """ - This method overrides the original _process_batch. - Instead of calling the annotation service, it could, for example, - process the files locally. + Processes a batch with local-specific error handling. + + Extends the base _process_batch with additional error handling suitable for local runs, + including automatic retry with sleep on rate limit errors (429) rather than terminating. + + Args: + batch: BatchOfPairedNodes containing file references and their annotation state nodes. + + Returns: + None + + Raises: + Exception: If non-rate-limit errors occur. 
""" if batch.is_empty(): return - self.logger.info(f"Running diagram detect on {batch.size()} files with {len(self.in_memory_cache)} entities") - try: + # Run regular diagram detect + self.logger.info( + f"Running diagram detect on {batch.size()} files with {len(self.in_memory_cache)} entities" + ) job_id: int = self.annotation_service.run_diagram_detect( files=batch.file_references, entities=self.in_memory_cache ) @@ -412,14 +374,35 @@ def _process_batch(self, batch: BatchOfPairedNodes): "annotationStatus": AnnotationStatus.PROCESSING, "sourceUpdatedTime": datetime.now(timezone.utc).replace(microsecond=0).isoformat(), "diagramDetectJobId": job_id, + "launchFunctionId": self.function_id, + "launchFunctionCallId": self.call_id, } + + # Run diagram detect on pattern mode + pattern_job_id: int | None = None + if self.config.launch_function.pattern_mode: + total_patterns = 0 + if self.in_memory_patterns and len(self.in_memory_patterns) >= 2: + total_patterns = len(self.in_memory_patterns[0].get("sample", [])) + len( + self.in_memory_patterns[1].get("sample", []) + ) + elif self.in_memory_patterns and len(self.in_memory_patterns) >= 1: + total_patterns = len(self.in_memory_patterns[0].get("sample", [])) + self.logger.info( + f"Running pattern mode diagram detect on {batch.size()} files with {total_patterns} sample patterns" + ) + pattern_job_id = self.annotation_service.run_pattern_mode_detect( + files=batch.file_references, pattern_samples=self.in_memory_patterns + ) + update_properties["patternModeJobId"] = pattern_job_id + batch.batch_states.update_node_properties( new_properties=update_properties, view_id=self.annotation_state_view.as_view_id(), ) - update_results = self.data_model_service.update_annotation_state(batch.batch_states.apply) + self.data_model_service.update_annotation_state(batch.batch_states.apply) self.logger.info( - message=f" Updated the annotation state instances:\n- annotation status set to 'Processing'\n- job id set to {job_id}", + 
message=f"Updated the annotation state instances:\n- annotation status set to 'Processing'\n- job id set to {job_id}\n- pattern mode job id set to {pattern_job_id}", section="END", ) except CogniteAPIError as e: diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/LoggerService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/LoggerService.py index 17f24d6b..773b7797 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/LoggerService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/LoggerService.py @@ -25,6 +25,16 @@ def __init__( self.write = False def _format_message_lines(self, prefix: str, message: str) -> list[str]: + """ + Formats multi-line messages with consistent indentation. + + Args: + prefix: The log level prefix (e.g., "[INFO]", "[ERROR]"). + message: The message to format. + + Returns: + List of formatted message lines with proper indentation. + """ formatted_lines = [] if "\n" not in message: formatted_lines.append(f"{prefix} {message}") @@ -37,6 +47,16 @@ def _format_message_lines(self, prefix: str, message: str) -> list[str]: return formatted_lines def _print(self, prefix: str, message: str) -> None: + """ + Prints formatted log messages to console and optionally to file. + + Args: + prefix: The log level prefix to prepend to the message. + message: The message to log. + + Returns: + None + """ lines_to_log = self._format_message_lines(prefix, message) if self.write and self.file_handler: try: @@ -51,6 +71,16 @@ def _print(self, prefix: str, message: str) -> None: print(line) def debug(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs a debug-level message. + + Args: + message: The debug message to log. + section: Optional section separator position (START, END, or BOTH). 
+ + Returns: + None + """ if section == "START" or section == "BOTH": self._section() if self.log_level == "DEBUG": @@ -59,6 +89,16 @@ def debug(self, message: str, section: Literal["START", "END", "BOTH"] | None = self._section() def info(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs an info-level message. + + Args: + message: The informational message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ if section == "START" or section == "BOTH": self._section() if self.log_level in ("DEBUG", "INFO"): @@ -67,6 +107,16 @@ def info(self, message: str, section: Literal["START", "END", "BOTH"] | None = N self._section() def warning(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs a warning-level message. + + Args: + message: The warning message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ if section == "START" or section == "BOTH": self._section() if self.log_level in ("DEBUG", "INFO", "WARNING"): @@ -75,6 +125,16 @@ def warning(self, message: str, section: Literal["START", "END", "BOTH"] | None self._section() def error(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs an error-level message. + + Args: + message: The error message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ if section == "START" or section == "BOTH": self._section() self._print("[ERROR]", message) @@ -82,6 +142,12 @@ def error(self, message: str, section: Literal["START", "END", "BOTH"] | None = self._section() def _section(self) -> None: + """ + Prints a visual separator line for log sections. 
+ + Returns: + None + """ if self.write and self.file_handler: self.file_handler.write( "--------------------------------------------------------------------------------\n" @@ -89,6 +155,12 @@ def _section(self) -> None: print("--------------------------------------------------------------------------------") def close(self) -> None: + """ + Closes the file handler if file logging is enabled. + + Returns: + None + """ if self.file_handler: try: self.file_handler.close() diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/PipelineService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/PipelineService.py index 5dd95bc7..7cf5d885 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/PipelineService.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/services/PipelineService.py @@ -36,7 +36,13 @@ def __init__(self, pipeline_ext_id: str, client: CogniteClient): def update_extraction_pipeline(self, msg: str) -> None: """ - Update the message log for the extraction pipeline + Appends a message to the extraction pipeline run log. + + Args: + msg: The message to append to the pipeline log. + + Returns: + None """ if not self.ep_write.message: self.ep_write.message = msg @@ -48,7 +54,13 @@ def upload_extraction_pipeline( status: Literal["success", "failure", "seen"], ) -> None: """ - Upload the extraction pipeline run so that status and message logs are captured + Creates an extraction pipeline run with accumulated status and messages. + + Args: + status: The run status to report (success, failure, or seen). 
+ + Returns: + None """ self.ep_write.status = status self.client.extraction_pipelines.runs.create(self.ep_write) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/utils/DataStructures.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/utils/DataStructures.py index 0f7bc3f2..8ef6675d 100644 --- a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/utils/DataStructures.py +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_launch/utils/DataStructures.py @@ -31,6 +31,7 @@ class EnvConfig: class DiagramAnnotationStatus(str, Enum): SUGGESTED = "Suggested" APPROVED = "Approved" + REJECTED = "Rejected" class AnnotationStatus(str, Enum): @@ -80,8 +81,8 @@ class AnnotationState: sourceUpdatedTime: str = field( default_factory=lambda: datetime.now(timezone.utc).replace(microsecond=0).isoformat() ) - sourceCreatedUser: str = "fn_dm_context_annotation_launch" - sourceUpdatedUser: str = "fn_dm_context_annotation_launch" + sourceCreatedUser: str = "fn_dm_context_annotation_prepare" + sourceUpdatedUser: str = "fn_dm_context_annotation_prepare" def _create_external_id(self) -> str: """ @@ -125,18 +126,17 @@ class entity: "external_id": file.external_id, "name": file.properties[job_config.file_view.as_view_id()]["name"], "space": file.space, - search_property: file.properties[job_config.file_view.as_view_id()][ - search_property - ], - "annotation_type_external_id": job_config.file_view.type, + "annotation_type": job_config.file_view.type, + "resource_type": file.properties[job_config.file_view.as_view_id()][resource_type], + "search_property": file.properties[job_config.file_view.as_view_id()][search_property], } - Note: a generic variable name is preferable here to config-specific ones that change with the config; e.g.,
for marathon the variable here would be aliases instead of search_property """ external_id: str name: str space: str - annotation_type_external_id: Literal["diagrams.FileLink", "diagrams.AssetLink"] | None + annotation_type: Literal["diagrams.FileLink", "diagrams.AssetLink"] | None + resource_type: str search_property: list[str] = field(default_factory=list) def to_dict(self): @@ -309,7 +309,7 @@ def generate_overall_report(self) -> str: def generate_ep_run( self, - caller: Literal["Launch", "Finalize"], + caller: Literal["Prepare", "Launch", "Finalize"], function_id: str | None, call_id: str | None, ) -> str: diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/__init__.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/dependencies.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/dependencies.py new file mode 100644 index 00000000..55d4e7bd --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/dependencies.py @@ -0,0 +1,96 @@ +import os + +from pathlib import Path +from dotenv import load_dotenv +from typing import Any, Tuple, Literal +from cognite.client import CogniteClient, ClientConfig +from cognite.client.credentials import OAuthClientCredentials + +from utils.DataStructures import EnvConfig +from services.LoggerService import CogniteFunctionLogger +from services.ConfigService import Config, load_config_parameters +from services.DataModelService import GeneralDataModelService +from services.PipelineService import GeneralPipelineService + + +def get_env_variables() -> EnvConfig: + print("Loading environment variables from .env...") + + project_path = (Path(__file__).parent / ".env").resolve() + print(f"project_path is set to: {project_path}") 
+ + load_dotenv(dotenv_path=project_path) + + required_envvars = ( + "CDF_PROJECT", + "CDF_CLUSTER", + "IDP_TENANT_ID", + "IDP_CLIENT_ID", + "IDP_CLIENT_SECRET", + ) + + missing = [envvar for envvar in required_envvars if envvar not in os.environ] + if missing: + raise ValueError(f"Missing required environment variables: {missing}") + + return EnvConfig( + cdf_project=os.getenv("CDF_PROJECT"), # type: ignore + cdf_cluster=os.getenv("CDF_CLUSTER"), # type: ignore + tenant_id=os.getenv("IDP_TENANT_ID"), # type: ignore + client_id=os.getenv("IDP_CLIENT_ID"), # type: ignore + client_secret=os.getenv("IDP_CLIENT_SECRET"), # type: ignore + ) + + +def create_client(env_config: EnvConfig, debug: bool = False) -> CogniteClient: + SCOPES = [f"https://{env_config.cdf_cluster}.cognitedata.com/.default"] + TOKEN_URL = f"https://login.microsoftonline.com/{env_config.tenant_id}/oauth2/v2.0/token" + creds = OAuthClientCredentials( + token_url=TOKEN_URL, + client_id=env_config.client_id, + client_secret=env_config.client_secret, + scopes=SCOPES, + ) + cnf = ClientConfig( + client_name="DEV_Working", + project=env_config.cdf_project, + base_url=f"https://{env_config.cdf_cluster}.cognitedata.com", # NOTE: base_url might need to be adjusted if on PSAAS or Private Link + credentials=creds, + debug=debug, + ) + client = CogniteClient(cnf) + return client + + +def create_config_service( + function_data: dict[str, Any], client: CogniteClient | None = None +) -> Tuple[Config, CogniteClient]: + if client is None: + env_config = get_env_variables() + client = create_client(env_config) + config = load_config_parameters(client=client, function_data=function_data) + return config, client + + +def create_logger_service(log_level: str) -> CogniteFunctionLogger: + if log_level not in ["DEBUG", "INFO", "WARNING", "ERROR"]: + return CogniteFunctionLogger() + else: + return CogniteFunctionLogger(log_level=log_level) + + +def create_write_logger_service(log_level: str, filepath: str) -> CogniteFunctionLogger: + if log_level not in ["DEBUG", "INFO", "WARNING", "ERROR"]: + return CogniteFunctionLogger(write=True,
filepath=filepath) + else: + return CogniteFunctionLogger(log_level=log_level, write=True, filepath=filepath) + + +def create_general_data_model_service( + config: Config, client: CogniteClient, logger: CogniteFunctionLogger +) -> GeneralDataModelService: + return GeneralDataModelService(config=config, client=client, logger=logger) + + +def create_general_pipeline_service(client: CogniteClient, pipeline_ext_id: str) -> GeneralPipelineService: + return GeneralPipelineService(pipeline_ext_id, client) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/handler.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/handler.py new file mode 100644 index 00000000..0a3a562e --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/handler.py @@ -0,0 +1,145 @@ +import sys +from cognite.client import CogniteClient +from datetime import datetime, timezone, timedelta + +from dependencies import ( + create_config_service, + create_logger_service, + create_write_logger_service, + create_general_data_model_service, + create_general_pipeline_service, +) +from services.PrepareService import GeneralPrepareService, LocalPrepareService, AbstractPrepareService +from services.DataModelService import IDataModelService +from services.PipelineService import IPipelineService +from utils.DataStructures import PerformanceTracker + + +def handle(data: dict, function_call_info: dict, client: CogniteClient) -> dict: + """ + Main entry point for the cognite function. + 1. Create an instance of config, logger, and tracker + 2. Create an instance of the prepare function and create implementations of the interfaces + 3. Run the prepare instance until either: + - it's been 7 minutes, or + - there are no files left that need to be prepared + NOTE: Cognite functions have a run-time limit of 10 minutes.
We don't want the function to die at the 10-minute mark, since there's no guarantee all code will execute. + Thus we set a conservative time limit of 7 minutes so that code execution is guaranteed to complete. + Documentation on calling a function can be found here: https://api-docs.cognite.com/20230101/tag/Function-calls/operation/postFunctionsCall + """ + start_time = datetime.now(timezone.utc) + log_level = data.get("logLevel", "INFO") + + config_instance, client = create_config_service(function_data=data, client=client) + logger_instance = create_logger_service(log_level) + tracker_instance = PerformanceTracker() + pipeline_instance: IPipelineService = create_general_pipeline_service( + client, pipeline_ext_id=data["ExtractionPipelineExtId"] + ) + + prepare_instance: AbstractPrepareService = _create_prepare_service( + config=config_instance, + client=client, + logger=logger_instance, + tracker=tracker_instance, + function_call_info=function_call_info, + ) + + run_status: str = "success" + try: + while datetime.now(timezone.utc) - start_time < timedelta(minutes=7): + if prepare_instance.run() == "Done": + return {"status": run_status, "data": data} + logger_instance.info(tracker_instance.generate_local_report()) + return {"status": run_status, "data": data} + except Exception as e: + run_status = "failure" + msg = str(e) + logger_instance.error(message=msg, section="BOTH") + return {"status": run_status, "message": msg} + finally: + logger_instance.info(tracker_instance.generate_overall_report(), "BOTH") + # Only report the counts of successful and failed files in the ep logs if files were processed or an error occurred; + # otherwise the run log will be too messy.
+ function_id = function_call_info.get("function_id") + call_id = function_call_info.get("call_id") + pipeline_instance.update_extraction_pipeline( + msg=tracker_instance.generate_ep_run("Prepare", function_id, call_id) + ) + pipeline_instance.upload_extraction_pipeline(status=run_status) + + +def run_locally(config_file: dict[str, str], log_path: str | None = None) -> None: + """ + Entry point for running the function locally. + 1. Create an instance of config, logger, and tracker + 2. Create an instance of the Prepare function and create implementations of the interfaces + 3. Run the prepare instance until there are no files left that need to be prepared + """ + log_level = config_file.get("logLevel", "DEBUG") + config_instance, client = create_config_service(function_data=config_file) + + if log_path: + logger_instance = create_write_logger_service(log_level=log_level, filepath=log_path) + else: + logger_instance = create_logger_service(log_level=log_level) + tracker_instance = PerformanceTracker() + + prepare_instance: AbstractPrepareService = _create_local_prepare_service( + config=config_instance, + client=client, + logger=logger_instance, + tracker=tracker_instance, + function_call_info={"function_id": None, "call_id": None}, + ) + try: + while True: + if prepare_instance.run() == "Done": + break + logger_instance.info(tracker_instance.generate_local_report()) + except Exception as e: + logger_instance.error( + message=f"Ran into the following error: \n{e}", + section="END", + ) + finally: + logger_instance.info(tracker_instance.generate_overall_report(), "BOTH") + logger_instance.close() + + +def _create_prepare_service(config, client, logger, tracker, function_call_info) -> AbstractPrepareService: + data_model_instance: IDataModelService = create_general_data_model_service(config, client, logger) + prepare_instance: AbstractPrepareService = GeneralPrepareService( + client=client, + config=config, + logger=logger, + tracker=tracker, +
data_model_service=data_model_instance, + function_call_info=function_call_info, + ) + return prepare_instance + + +def _create_local_prepare_service(config, client, logger, tracker, function_call_info) -> AbstractPrepareService: + data_model_instance: IDataModelService = create_general_data_model_service(config, client, logger) + prepare_instance: AbstractPrepareService = LocalPrepareService( + client=client, + config=config, + logger=logger, + tracker=tracker, + data_model_service=data_model_instance, + function_call_info=function_call_info, + ) + return prepare_instance + + +if __name__ == "__main__": + # NOTE: Receives the arguments from .vscode/launch.json. Mimics arguments that are passed into the serverless function. + config_file = { + "ExtractionPipelineExtId": sys.argv[1], + "logLevel": sys.argv[2], + } + log_path = sys.argv[3] + run_locally(config_file, log_path) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/requirements.txt b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/requirements.txt new file mode 100644 index 00000000..bd7f2bc3 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/requirements.txt @@ -0,0 +1,24 @@ +annotated-types==0.7.0 +certifi==2025.4.26 +cffi==1.17.1 +charset-normalizer==3.4.2 +cognite-sdk==7.76.0 +cryptography==44.0.3 +dotenv==0.9.9 +idna==3.10 +msal==1.32.3 +oauthlib==3.2.2 +packaging==25.0 +protobuf==6.30.2 +pycparser==2.22 +pydantic==2.11.4 +pydantic_core==2.33.2 +PyJWT==2.10.1 +python-dotenv==1.1.0 +PyYAML==6.0.2 +requests==2.32.3 +requests-oauthlib==1.3.1 +typing-inspection==0.4.0 +typing_extensions==4.13.2 +urllib3==2.5.0 + diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/ConfigService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/ConfigService.py new file mode 100644 index 
00000000..f1d2584d --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/ConfigService.py @@ -0,0 +1,371 @@ +from enum import Enum +from typing import Any, Literal, cast, Optional + +import yaml +from cognite.client.data_classes.contextualization import ( + DiagramDetectConfig, + ConnectionFlags, + CustomizeFuzziness, + DirectionWeights, +) +from cognite.client.data_classes.data_modeling import NodeId +from cognite.client.data_classes.filters import Filter +from cognite.client import CogniteClient +from cognite.client import data_modeling as dm +from cognite.client.exceptions import CogniteAPIError +from pydantic import BaseModel, Field +from pydantic.alias_generators import to_camel +from utils.DataStructures import AnnotationStatus, FilterOperator + + +# Configuration Classes +class ViewPropertyConfig(BaseModel, alias_generator=to_camel): + schema_space: str + instance_space: Optional[str] = None + external_id: str + version: str + annotation_type: Optional[Literal["diagrams.FileLink", "diagrams.AssetLink"]] = None + + def as_view_id(self) -> dm.ViewId: + return dm.ViewId(space=self.schema_space, external_id=self.external_id, version=self.version) + + def as_property_ref(self, property) -> list[str]: + return [self.schema_space, f"{self.external_id}/{self.version}", property] + + +class FilterConfig(BaseModel, alias_generator=to_camel): + values: Optional[list[AnnotationStatus | str] | AnnotationStatus | str] = None + negate: bool = False + operator: FilterOperator + target_property: str + + def as_filter(self, view_properties: ViewPropertyConfig) -> Filter: + property_reference = view_properties.as_property_ref(self.target_property) + + # Converts enum value into string -> i.e.) 
in the case of AnnotationStatus + if isinstance(self.values, list): + find_values = [v.value if isinstance(v, Enum) else v for v in self.values] + elif isinstance(self.values, Enum): + find_values = self.values.value + else: + find_values = self.values + + filter: Filter + if find_values is None: + if self.operator == FilterOperator.EXISTS: + filter = dm.filters.Exists(property=property_reference) + else: + raise ValueError(f"Operator {self.operator} requires a value") + elif self.operator == FilterOperator.IN: + if not isinstance(find_values, list): + raise ValueError(f"Operator 'IN' requires a list of values for property {self.target_property}") + filter = dm.filters.In(property=property_reference, values=find_values) + elif self.operator == FilterOperator.EQUALS: + filter = dm.filters.Equals(property=property_reference, value=find_values) + elif self.operator == FilterOperator.CONTAINSALL: + filter = dm.filters.ContainsAll(property=property_reference, values=find_values) + elif self.operator == FilterOperator.SEARCH: + filter = dm.filters.Search(property=property_reference, value=find_values) + else: + raise NotImplementedError(f"Operator {self.operator} is not implemented.") + + if self.negate: + return dm.filters.Not(filter) + else: + return filter + + +class QueryConfig(BaseModel, alias_generator=to_camel): + target_view: ViewPropertyConfig + filters: list[FilterConfig] + limit: Optional[int] = -1 + + def build_filter(self) -> Filter: + list_filters: list[Filter] = [f.as_filter(self.target_view) for f in self.filters] + + if len(list_filters) == 1: + return list_filters[0] + else: + return dm.filters.And(*list_filters) # NOTE: '*' Unpacks each filter in the list + + +class ConnectionFlagsConfig(BaseModel, alias_generator=to_camel): + no_text_inbetween: Optional[bool] = None + natural_reading_order: Optional[bool] = None + + def as_connection_flag(self) -> ConnectionFlags: + params = {key: value for key, value in self.model_dump().items() if value is not None} 
+ return ConnectionFlags(**params) + + +class CustomizeFuzzinessConfig(BaseModel, alias_generator=to_camel): + fuzzy_score: Optional[float] = None + max_boxes: Optional[int] = None + min_chars: Optional[int] = None + + def as_customize_fuzziness(self) -> CustomizeFuzziness: + params = {key: value for key, value in self.model_dump().items() if value is not None} + return CustomizeFuzziness(**params) + + +class DirectionWeightsConfig(BaseModel, alias_generator=to_camel): + left: Optional[float] = None + right: Optional[float] = None + up: Optional[float] = None + down: Optional[float] = None + + def as_direction_weights(self) -> DirectionWeights: + params = {key: value for key, value in self.model_dump().items() if value is not None} + return DirectionWeights(**params) + + +class DiagramDetectConfigModel(BaseModel, alias_generator=to_camel): + # NOTE: configs come from v7 of the Cognite Python SDK + annotation_extract: Optional[bool] = None + case_sensitive: Optional[bool] = None + connection_flags: Optional[ConnectionFlagsConfig] = None + customize_fuzziness: Optional[CustomizeFuzzinessConfig] = None + direction_delta: Optional[float] = None + direction_weights: Optional[DirectionWeightsConfig] = None + min_fuzzy_score: Optional[float] = None + read_embedded_text: Optional[bool] = None + remove_leading_zeros: Optional[bool] = None + substitutions: Optional[dict[str, list[str]]] = None + + def as_config(self) -> DiagramDetectConfig: + params = {} + if self.annotation_extract is not None: + params["annotation_extract"] = self.annotation_extract + if self.case_sensitive is not None: + params["case_sensitive"] = self.case_sensitive + if self.connection_flags is not None: + params["connection_flags"] = self.connection_flags.as_connection_flag() + if self.customize_fuzziness is not None: + params["customize_fuzziness"] = self.customize_fuzziness.as_customize_fuzziness() + if self.direction_delta is not None: + params["direction_delta"] = self.direction_delta +
if self.direction_weights is not None: + params["direction_weights"] = self.direction_weights.as_direction_weights() + if self.min_fuzzy_score is not None: + params["min_fuzzy_score"] = self.min_fuzzy_score + if self.read_embedded_text is not None: + params["read_embedded_text"] = self.read_embedded_text + if self.remove_leading_zeros is not None: + params["remove_leading_zeros"] = self.remove_leading_zeros + if self.substitutions is not None: + params["substitutions"] = self.substitutions + + return DiagramDetectConfig(**params) + + +# Launch Related Configs +class DataModelServiceConfig(BaseModel, alias_generator=to_camel): + get_files_to_process_query: QueryConfig | list[QueryConfig] + get_target_entities_query: QueryConfig | list[QueryConfig] + get_file_entities_query: QueryConfig | list[QueryConfig] + + +class CacheServiceConfig(BaseModel, alias_generator=to_camel): + cache_time_limit: int + raw_db: str + raw_table_cache: str + raw_manual_patterns_catalog: str + + +class AnnotationServiceConfig(BaseModel, alias_generator=to_camel): + page_range: int = Field(gt=0, le=50) + partial_match: bool = True + min_tokens: int = 1 + diagram_detect_config: Optional[DiagramDetectConfigModel] = None + + +class PrepareFunction(BaseModel, alias_generator=to_camel): + get_files_for_annotation_reset_query: Optional[QueryConfig | list[QueryConfig]] = None + get_files_to_annotate_query: QueryConfig | list[QueryConfig] + + +class LaunchFunction(BaseModel, alias_generator=to_camel): + batch_size: int = Field(gt=0, le=50) + primary_scope_property: str + secondary_scope_property: Optional[str] = None + file_search_property: str = "aliases" + target_entities_search_property: str = "aliases" + pattern_mode: bool + file_resource_property: Optional[str] = None + target_entities_resource_property: Optional[str] = None + data_model_service: DataModelServiceConfig + cache_service: CacheServiceConfig + annotation_service: AnnotationServiceConfig + + +# Finalize Related Configs +class 
RetrieveServiceConfig(BaseModel, alias_generator=to_camel): + get_job_id_query: QueryConfig | list[QueryConfig] + + +class ApplyServiceConfig(BaseModel, alias_generator=to_camel): + auto_approval_threshold: float = Field(gt=0.0, le=1.0) + auto_suggest_threshold: float = Field(gt=0.0, le=1.0) + sink_node: NodeId + raw_db: str + raw_table_doc_tag: str + raw_table_doc_doc: str + raw_table_doc_pattern: str + + +class FinalizeFunction(BaseModel, alias_generator=to_camel): + clean_old_annotations: bool + max_retry_attempts: int + retrieve_service: RetrieveServiceConfig + apply_service: ApplyServiceConfig + + +# Promote Related Configs +class TextNormalizationConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for text normalization and variation generation. + + Controls how text is normalized for matching and what variations are generated + to improve match rates across different naming conventions. + + These flags affect both the normalize() function (for cache keys and direct matching) + and generate_text_variations() function (for query-based matching). + """ + + remove_special_characters: bool = True + convert_to_lowercase: bool = True + strip_leading_zeros: bool = True + + +class EntitySearchServiceConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the EntitySearchService in the promote function. + + Controls entity search and text normalization behavior: + - Queries entities directly (server-side IN filter on entity/file aliases) + - Text normalization for generating search variations + + Uses efficient server-side filtering on the smaller entity dataset rather than + the larger annotation edge dataset for better performance at scale. 
+ """ + + enable_existing_annotations_search: bool = True + enable_global_entity_search: bool = True + max_entity_search_limit: int = Field(default=1000, gt=0, le=10000) + text_normalization: TextNormalizationConfig + + +class PromoteCacheServiceConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the CacheService in the promote function. + + Controls caching behavior for text→entity mappings. + """ + + cache_table_name: str + + +class PromoteFunctionConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the promote function. + + The promote function resolves pattern-mode annotations by finding matching entities + and updating annotation edges from pointing to a sink node to pointing to actual entities. + + Configuration is organized by service interface: + - entitySearchService: Controls entity search strategies + - cacheService: Controls caching behavior + + Batch size is controlled via the getCandidatesQuery.limit field. + """ + + get_candidates_query: QueryConfig | list[QueryConfig] + raw_db: str + raw_table_doc_pattern: str + raw_table_doc_tag: str + raw_table_doc_doc: str + delete_rejected_edges: bool + delete_suggested_edges: bool + entity_search_service: EntitySearchServiceConfig + cache_service: PromoteCacheServiceConfig + + +class DataModelViews(BaseModel, alias_generator=to_camel): + core_annotation_view: ViewPropertyConfig + annotation_state_view: ViewPropertyConfig + file_view: ViewPropertyConfig + target_entities_view: ViewPropertyConfig + + +class Config(BaseModel, alias_generator=to_camel): + data_model_views: DataModelViews + prepare_function: PrepareFunction + launch_function: LaunchFunction + finalize_function: FinalizeFunction + promote_function: PromoteFunctionConfig + + @classmethod + def parse_direct_relation(cls, value: Any) -> Any: + if isinstance(value, dict): + return dm.DirectRelationReference.load(value) + return value + + +# Functions to construct queries +def get_limit_from_query(query: QueryConfig |
list[QueryConfig]) -> int: + """ + Determines the retrieval limit from a query configuration. + Handles 'None' by treating it as the default -1 (unlimited). + """ + default_limit = -1 + if isinstance(query, list): + if not query: + return default_limit + limits = [q.limit if q.limit is not None else default_limit for q in query] + return max(limits) + else: + return query.limit if query.limit is not None else default_limit + + +def build_filter_from_query(query: QueryConfig | list[QueryConfig]) -> Filter: + """ + Builds a Cognite Filter from a query configuration. + + If the query is a list, it builds a filter for each item and combines them with a logical OR. + If the query is a single object, it builds the filter directly from it. + """ + if isinstance(query, list): + list_filters: list[Filter] = [q.build_filter() for q in query] + if not list_filters: + raise ValueError("Query list cannot be empty.") + return dm.filters.Or(*list_filters) if len(list_filters) > 1 else list_filters[0] + else: + return query.build_filter() + + +def load_config_parameters( + client: CogniteClient, + function_data: dict[str, Any], +) -> Config: + """ + Retrieves the configuration parameters from the function data and loads the configuration from CDF. 
+ """ + if "ExtractionPipelineExtId" not in function_data: + raise ValueError("Missing key 'ExtractionPipelineExtId' in input data to the function") + + pipeline_ext_id = function_data["ExtractionPipelineExtId"] + try: + raw_config = client.extraction_pipelines.config.retrieve(pipeline_ext_id) + if raw_config.config is None: + raise ValueError(f"No config found for extraction pipeline: {pipeline_ext_id!r}") + except CogniteAPIError: + raise RuntimeError(f"Not able to retrieve pipeline config for extraction pipeline: {pipeline_ext_id!r}") + + loaded_yaml_data = yaml.safe_load(raw_config.config) + + if isinstance(loaded_yaml_data, dict): + return Config.model_validate(loaded_yaml_data) + else: + raise ValueError( + "Invalid configuration structure from CDF: \nExpected a YAML dictionary with a top-level 'config' key." + ) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/DataModelService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/DataModelService.py new file mode 100644 index 00000000..ee374874 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/DataModelService.py @@ -0,0 +1,403 @@ +import abc +from datetime import datetime, timezone, timedelta +from cognite.client import CogniteClient +from cognite.client.data_classes.data_modeling import ( + Node, + NodeId, + NodeList, + NodeApply, + NodeApplyResultList, + instances, + InstancesApplyResult, +) +from cognite.client.data_classes.filters import ( + Filter, + Equals, + In, + Range, + Exists, +) + +from services.ConfigService import ( + Config, + ViewPropertyConfig, + build_filter_from_query, + get_limit_from_query, +) +from services.LoggerService import CogniteFunctionLogger +from utils.DataStructures import AnnotationStatus + + +class IDataModelService(abc.ABC): + """ + Interface for interacting with data model instances in CDF + """ + + 
@abc.abstractmethod + def get_files_for_annotation_reset(self) -> NodeList | None: + pass + + @abc.abstractmethod + def get_files_to_annotate(self) -> NodeList | None: + pass + + @abc.abstractmethod + def get_files_to_process( + self, + ) -> tuple[NodeList, dict[NodeId, Node]] | tuple[None, None]: + pass + + @abc.abstractmethod + def update_annotation_state( + self, + list_node_apply: list[NodeApply], + ) -> NodeApplyResultList: + pass + + @abc.abstractmethod + def create_annotation_state( + self, + list_node_apply: list[NodeApply], + ) -> NodeApplyResultList: + pass + + @abc.abstractmethod + def get_instances_entities( + self, primary_scope_value: str, secondary_scope_value: str | None + ) -> tuple[NodeList, NodeList]: + pass + + +class GeneralDataModelService(IDataModelService): + """ + Implementation used for real runs + """ + + def __init__(self, config: Config, client: CogniteClient, logger: CogniteFunctionLogger): + self.client: CogniteClient = client + self.config: Config = config + self.logger: CogniteFunctionLogger = logger + + self.annotation_state_view: ViewPropertyConfig = config.data_model_views.annotation_state_view + self.file_view: ViewPropertyConfig = config.data_model_views.file_view + self.target_entities_view: ViewPropertyConfig = config.data_model_views.target_entities_view + + self.get_files_to_annotate_retrieve_limit: int | None = get_limit_from_query( + config.prepare_function.get_files_to_annotate_query + ) + self.get_files_to_process_retrieve_limit: int | None = get_limit_from_query( + config.launch_function.data_model_service.get_files_to_process_query + ) + + self.filter_files_to_annotate: Filter = build_filter_from_query( + config.prepare_function.get_files_to_annotate_query + ) + self.filter_files_to_process: Filter = build_filter_from_query( + config.launch_function.data_model_service.get_files_to_process_query + ) + self.filter_target_entities: Filter = build_filter_from_query( + 
config.launch_function.data_model_service.get_target_entities_query + ) + self.filter_file_entities: Filter = build_filter_from_query( + config.launch_function.data_model_service.get_file_entities_query + ) + + def get_files_for_annotation_reset(self) -> NodeList | None: + """ + Retrieves files that need their annotation status reset based on configuration. + + Args: + None + + Returns: + NodeList of file instances to reset, or None if no reset query is configured. + + NOTE: Not building the filter in the object instantiation because the filter will only ever be used once throughout all runs of prepare + Furthermore, there is an implicit guarantee that a filter will be returned b/c launch checks if the query exists. + """ + if not self.config.prepare_function.get_files_for_annotation_reset_query: + return + + filter_files_for_annotation_reset: Filter = build_filter_from_query( + self.config.prepare_function.get_files_for_annotation_reset_query + ) + result: NodeList | None = self.client.data_modeling.instances.list( + instance_type="node", + sources=self.file_view.as_view_id(), + space=self.file_view.instance_space, + limit=-1, # NOTE: this should always be kept at -1 so that all files defined in the query will get reset + filter=filter_files_for_annotation_reset, + ) + return result + + def get_files_to_annotate(self) -> NodeList | None: + """ + Retrieves files ready for annotation processing based on their tag status. + + Queries for files marked "ToAnnotate" that don't have 'AnnotationInProcess' or 'Annotated' tags. + The specific query filters are defined in the getFilesToAnnotate config parameter. + + Args: + None + + Returns: + NodeList of file instances ready for annotation, or None if no files found. 
+ """ + result: NodeList | None = self.client.data_modeling.instances.list( + instance_type="node", + sources=self.file_view.as_view_id(), + space=self.file_view.instance_space, + limit=self.get_files_to_annotate_retrieve_limit, # NOTE: whether the number of instances returned matters depends on the memory constraints of the Azure/AWS function runtime + filter=self.filter_files_to_annotate, + ) + + return result + + def get_files_to_process( + self, + ) -> tuple[NodeList, dict[NodeId, Node]] | tuple[None, None]: + """ + Retrieves files with annotation state instances that are ready for diagram detection. + + Queries for FileAnnotationStateInstances based on the getFilesToProcess config parameter, + extracts the linked file NodeIds, and retrieves the corresponding file nodes. + + Args: + None + + Returns: + A tuple containing: + - NodeList of file instances to process + - Dictionary mapping file NodeIds to their annotation state Node instances + Returns (None, None) if no files are found.
+ """ + annotation_state_filter = self._get_annotation_state_filter() + annotation_state_instances: NodeList = self.client.data_modeling.instances.list( + instance_type="node", + sources=self.annotation_state_view.as_view_id(), + space=self.annotation_state_view.instance_space, + limit=self.get_files_to_process_retrieve_limit, + filter=annotation_state_filter, + ) + + if not annotation_state_instances: + return None, None + + file_to_state_map: dict[NodeId, Node] = {} + list_file_node_ids: list[NodeId] = [] + + for node in annotation_state_instances: + file_reference = node.properties.get(self.annotation_state_view.as_view_id()).get("linkedFile") + if self.file_view.instance_space is None or self.file_view.instance_space == file_reference["space"]: + file_node_id = NodeId( + space=file_reference["space"], + external_id=file_reference["externalId"], + ) + + file_to_state_map[file_node_id] = node + list_file_node_ids.append(file_node_id) + + file_instances: NodeList = self.client.data_modeling.instances.retrieve_nodes( + nodes=list_file_node_ids, + sources=self.file_view.as_view_id(), + ) + + return file_instances, file_to_state_map + + def _get_annotation_state_filter(self) -> Filter: + """ + Builds a filter for annotation state instances, including automatic retry logic for stuck jobs. + + Combines the configured filter with a fallback filter that catches annotation state instances + stuck in Processing/Finalizing status for more than 12 hours. + + Args: + None + + Returns: + Combined Filter for querying annotation state instances. + + NOTE: filter = (getFilesToProcess filter || (annotationStatus == Processing && now() - lastUpdatedTime) > 1440 minutes) + - getFilesToProcess filter comes from extraction pipeline + - (annotationStatus == Processing | Finalizing && now() - lastUpdatedTime) > 720 minutes/12 hours -> hardcoded -> reprocesses any file that's stuck + - Edge case that occurs very rarely but can happen. 
+ NOTE: Implementation of a more complex query that can't be handled in config should come from an implementation of the interface. + """ + annotation_status_property = self.annotation_state_view.as_property_ref("annotationStatus") + annotation_last_updated_property = self.annotation_state_view.as_property_ref("sourceUpdatedTime") + # NOTE: While this number is hard coded, I believe it doesn't need to be configured. Number comes from my experience with the pipeline. Feel free to change if your experience leads to a different number + latest_permissible_time_utc = datetime.now(timezone.utc) - timedelta(minutes=720) + latest_permissible_time_utc = latest_permissible_time_utc.isoformat(timespec="milliseconds") + filter_stuck = In( + annotation_status_property, + [AnnotationStatus.PROCESSING, AnnotationStatus.FINALIZING], + ) & Range(annotation_last_updated_property, lt=latest_permissible_time_utc) + + filter = self.filter_files_to_process | filter_stuck # | == OR + return filter + + def update_annotation_state(self, list_node_apply: list[NodeApply]) -> NodeApplyResultList: + """ + Updates existing annotation state nodes with new property values. + + Args: + list_node_apply: List of NodeApply objects containing updated properties. + + Returns: + NodeApplyResultList containing the results of the update operation. + """ + update_results: InstancesApplyResult = self.client.data_modeling.instances.apply( + nodes=list_node_apply, + replace=False, # ensures we don't delete other properties in the view + ) + return update_results.nodes + + def create_annotation_state(self, list_node_apply: list[NodeApply]) -> NodeApplyResultList: + """ + Creates new annotation state nodes, replacing any existing nodes with the same IDs. + + Args: + list_node_apply: List of NodeApply objects to create as new annotation state instances. + + Returns: + NodeApplyResultList containing the results of the creation operation. 
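The 12-hour stuck-job cutoff used in `_get_annotation_state_filter` above is a plain UTC timestamp computed before the query. A minimal sketch of how the comparison value is produced (720 minutes is the hardcoded retry window; the resulting ISO string is compared with a `Range(lt=...)` filter):

```python
from datetime import datetime, timezone, timedelta

# Any state still Processing/Finalizing whose last-updated timestamp is older
# than this cutoff is considered stuck and gets picked up again.
cutoff = datetime.now(timezone.utc) - timedelta(minutes=720)
cutoff_iso = cutoff.isoformat(timespec="milliseconds")
```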
+ """ + update_results: InstancesApplyResult = self.client.data_modeling.instances.apply( + nodes=list_node_apply, + auto_create_direct_relations=True, + replace=True, # ensures we reset the properties of the node + ) + return update_results.nodes + + def get_instances_entities( + self, primary_scope_value: str, secondary_scope_value: str | None + ) -> tuple[NodeList, NodeList]: + """ + Retrieves target entities and file entities for use in diagram detection. + + Queries the data model for entities (assets) and files that match the configured filters + and scope values, which will be used to create the entity cache for diagram detection. + + Args: + primary_scope_value: Primary scope identifier (e.g., site, facility). + secondary_scope_value: Optional secondary scope identifier (e.g., unit, area). + + Returns: + A tuple containing: + - NodeList of target entity instances (typically assets) + - NodeList of file entity instances + + NOTE: 1. grab assets that meet the filter requirement + NOTE: 2. 
grab files that meet the filter requirement + """ + target_filter: Filter = self._get_target_entities_filter(primary_scope_value, secondary_scope_value) + file_filter: Filter = self._get_file_entities_filter(primary_scope_value, secondary_scope_value) + + target_entities: NodeList = self.client.data_modeling.instances.list( + instance_type="node", + sources=self.target_entities_view.as_view_id(), + space=self.target_entities_view.instance_space, + filter=target_filter, + limit=-1, # NOTE: this should always be kept at -1 so that all entities are retrieved + ) + file_entities: NodeList = self.client.data_modeling.instances.list( + instance_type="node", + sources=self.file_view.as_view_id(), + space=self.file_view.instance_space, + filter=file_filter, + limit=-1, # NOTE: this should always be kept at -1 so that all entities are retrieved + ) + return target_entities, file_entities + + def _get_target_entities_filter(self, primary_scope_value: str, secondary_scope_value: str | None) -> Filter: + """ + Builds a filter for target entities (assets) based on scope and configuration. + + Creates a filter combining scope-specific filtering with global 'ScopeWideDetect' entities. + + Args: + primary_scope_value: Primary scope identifier for filtering entities. + secondary_scope_value: Optional secondary scope identifier for more specific filtering. + + Returns: + Combined Filter for querying target entities. + + NOTE: Create a filter that... 
+ - grabs assets in the primary_scope_value and secondary_scope_value provided with detectInDiagram in the tags property + or + - grabs assets in the primary_scope_value with ScopeWideDetect in the tags property (hard coded) -> provides an option to include entities outside of the secondary_scope_value + """ + filter_primary_scope: Filter = Equals( + property=self.target_entities_view.as_property_ref(self.config.launch_function.primary_scope_property), + value=primary_scope_value, + ) + filter_entities: Filter = self.filter_target_entities + # NOTE: ScopeWideDetect is an optional string that allows annotating across scopes + filter_scope_wide: Filter = In( + property=self.target_entities_view.as_property_ref("tags"), + values=["ScopeWideDetect"], + ) + if not primary_scope_value: + target_filter = filter_entities | filter_scope_wide + elif secondary_scope_value: + filter_secondary_scope: Filter = Equals( + property=self.target_entities_view.as_property_ref( + self.config.launch_function.secondary_scope_property + ), + value=secondary_scope_value, + ) + target_filter = (filter_primary_scope & filter_secondary_scope & filter_entities) | ( + filter_primary_scope & filter_scope_wide + ) + else: + target_filter = (filter_primary_scope & filter_entities) | (filter_primary_scope & filter_scope_wide) + return target_filter + + def _get_file_entities_filter(self, primary_scope_value: str, secondary_scope_value: str | None) -> Filter: + """ + Builds a filter for file entities based on scope and configuration. + + Creates a filter combining scope-specific filtering with global 'ScopeWideDetect' files, + ensuring file entities have the required search properties. + + Args: + primary_scope_value: Primary scope identifier for filtering file entities. + secondary_scope_value: Optional secondary scope identifier for more specific filtering. + + Returns: + Combined Filter for querying file entities. + + NOTE: Create a filter that... 
+            - grabs files in the primary_scope_value and secondary_scope_value provided with DetectInDiagram in the tags property
+            or
+            - grabs files in the primary_scope_value with ScopeWideDetect in the tags property (hard coded) -> provides an option to include files outside of the secondary_scope_value
+        """
+        filter_primary_scope: Filter = Equals(
+            property=self.file_view.as_property_ref(self.config.launch_function.primary_scope_property),
+            value=primary_scope_value,
+        )
+        filter_entities: Filter = self.filter_file_entities
+        filter_search_property_exists: Filter = Exists(
+            property=self.file_view.as_property_ref(self.config.launch_function.file_search_property),
+        )
+        # NOTE: ScopeWideDetect is an optional string that allows annotating across scopes
+        filter_scope_wide: Filter = In(
+            property=self.file_view.as_property_ref("tags"),
+            values=["ScopeWideDetect"],
+        )
+        if not primary_scope_value:
+            file_filter = (filter_entities & filter_search_property_exists) | (filter_scope_wide)
+        elif secondary_scope_value:
+            filter_secondary_scope: Filter = Equals(
+                property=self.file_view.as_property_ref(self.config.launch_function.secondary_scope_property),
+                value=secondary_scope_value,
+            )
+            file_filter = (
+                filter_primary_scope & filter_entities & filter_secondary_scope & filter_search_property_exists
+            ) | (filter_primary_scope & filter_scope_wide)
+        else:
+            file_filter = (filter_primary_scope & filter_entities & filter_search_property_exists) | (
+                filter_primary_scope & filter_scope_wide
+            )
+
+        return file_filter
diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/LoggerService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/LoggerService.py
new file mode 100644
index 00000000..773b7797
--- /dev/null
+++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/LoggerService.py
@@ -0,0 +1,169 @@
+from typing import Literal
+import os + + +class CogniteFunctionLogger: + def __init__( + self, + log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = "INFO", + write: bool = False, + filepath: str | None = None, + ): + self.log_level = log_level.upper() + self.write = write + self.filepath = filepath + self.file_handler = None + + if self.filepath and self.write: + try: + dir_name = os.path.dirname(self.filepath) + if dir_name: + os.makedirs(dir_name, exist_ok=True) + self.file_handler = open(self.filepath, "a", encoding="utf-8") + except Exception as e: + print(f"[LOGGER_SETUP_ERROR] Could not open log file {self.filepath}: {e}") + self.write = False + + def _format_message_lines(self, prefix: str, message: str) -> list[str]: + """ + Formats multi-line messages with consistent indentation. + + Args: + prefix: The log level prefix (e.g., "[INFO]", "[ERROR]"). + message: The message to format. + + Returns: + List of formatted message lines with proper indentation. + """ + formatted_lines = [] + if "\n" not in message: + formatted_lines.append(f"{prefix} {message}") + else: + lines = message.split("\n") + formatted_lines.append(f"{prefix}{lines[0]}") + padding = " " * len(prefix) + for line_content in lines[1:]: + formatted_lines.append(f"{padding} {line_content}") + return formatted_lines + + def _print(self, prefix: str, message: str) -> None: + """ + Prints formatted log messages to console and optionally to file. + + Args: + prefix: The log level prefix to prepend to the message. + message: The message to log. 
+ + Returns: + None + """ + lines_to_log = self._format_message_lines(prefix, message) + if self.write and self.file_handler: + try: + for line in lines_to_log: + print(line) + self.file_handler.write(line + "\n") + self.file_handler.flush() + except Exception as e: + print(f"[LOGGER_SETUP_ERROR] Could not write to {self.filepath}: {e}") + elif not self.write: + for line in lines_to_log: + print(line) + + def debug(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs a debug-level message. + + Args: + message: The debug message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ + if section == "START" or section == "BOTH": + self._section() + if self.log_level == "DEBUG": + self._print("[DEBUG]", message) + if section == "END" or section == "BOTH": + self._section() + + def info(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs an info-level message. + + Args: + message: The informational message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ + if section == "START" or section == "BOTH": + self._section() + if self.log_level in ("DEBUG", "INFO"): + self._print("[INFO]", message) + if section == "END" or section == "BOTH": + self._section() + + def warning(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs a warning-level message. + + Args: + message: The warning message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ + if section == "START" or section == "BOTH": + self._section() + if self.log_level in ("DEBUG", "INFO", "WARNING"): + self._print("[WARNING]", message) + if section == "END" or section == "BOTH": + self._section() + + def error(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + """ + Logs an error-level message. 
+ + Args: + message: The error message to log. + section: Optional section separator position (START, END, or BOTH). + + Returns: + None + """ + if section == "START" or section == "BOTH": + self._section() + self._print("[ERROR]", message) + if section == "END" or section == "BOTH": + self._section() + + def _section(self) -> None: + """ + Prints a visual separator line for log sections. + + Returns: + None + """ + if self.write and self.file_handler: + self.file_handler.write( + "--------------------------------------------------------------------------------\n" + ) + print("--------------------------------------------------------------------------------") + + def close(self) -> None: + """ + Closes the file handler if file logging is enabled. + + Returns: + None + """ + if self.file_handler: + try: + self.file_handler.close() + except Exception as e: + print(f"[LOGGER_CLEANUP_ERROR] Error closing log file: {e}") + self.file_handler = None diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/PipelineService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/PipelineService.py new file mode 100644 index 00000000..7cf5d885 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/PipelineService.py @@ -0,0 +1,66 @@ +import abc + +from typing import Literal +from cognite.client import CogniteClient +from cognite.client.data_classes import ExtractionPipelineRunWrite + + +class IPipelineService(abc.ABC): + """ + Interface for creating and updating extraction pipeline logs. 
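The accumulate-then-report pattern this interface defines can be shown with a toy stand-in. `InMemoryPipelineLog` is a hypothetical example class, not part of the module; it mirrors how `GeneralPipelineService` appends messages across a run and flushes them once with a final status.

```python
class InMemoryPipelineLog:
    """Toy stand-in for the extraction pipeline run log (illustrative only)."""

    def __init__(self) -> None:
        self.message: str | None = None
        self.status: str = "seen"

    def update(self, msg: str) -> None:
        # Mirrors update_extraction_pipeline: append each message on a new line.
        self.message = msg if self.message is None else f"{self.message}\n{msg}"

    def upload(self, status: str) -> None:
        # Mirrors upload_extraction_pipeline: report everything in one run record.
        self.status = status


log = InMemoryPipelineLog()
log.update("prepared 10 files")
log.update("created 10 state instances")
log.upload("success")
```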
+ """ + + @abc.abstractmethod + def update_extraction_pipeline(self, msg: str) -> None: + pass + + @abc.abstractmethod + def upload_extraction_pipeline( + self, + status: Literal["success", "failure", "seen"], + ) -> None: + pass + + +class GeneralPipelineService(IPipelineService): + """ + Implementation of the pipeline interface + """ + + def __init__(self, pipeline_ext_id: str, client: CogniteClient): + self.client: CogniteClient = client + self.ep_write: ExtractionPipelineRunWrite = ExtractionPipelineRunWrite( + extpipe_external_id=pipeline_ext_id, + status="seen", + ) + + def update_extraction_pipeline(self, msg: str) -> None: + """ + Appends a message to the extraction pipeline run log. + + Args: + msg: The message to append to the pipeline log. + + Returns: + None + """ + if not self.ep_write.message: + self.ep_write.message = msg + else: + self.ep_write.message = f"{self.ep_write.message}\n{msg}" + + def upload_extraction_pipeline( + self, + status: Literal["success", "failure", "seen"], + ) -> None: + """ + Creates an extraction pipeline run with accumulated status and messages. + + Args: + status: The run status to report (success, failure, or seen). 
+ + Returns: + None + """ + self.ep_write.status = status + self.client.extraction_pipelines.runs.create(self.ep_write) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/PrepareService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/PrepareService.py new file mode 100644 index 00000000..2bbeeb14 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/services/PrepareService.py @@ -0,0 +1,209 @@ +import abc +from typing import cast, Literal +from cognite.client import CogniteClient +from cognite.client.exceptions import CogniteAPIError +from cognite.client.data_classes.data_modeling import ( + NodeList, + NodeApply, +) + +from services.ConfigService import Config, ViewPropertyConfig +from services.DataModelService import IDataModelService +from services.LoggerService import CogniteFunctionLogger +from utils.DataStructures import ( + AnnotationStatus, + AnnotationState, + PerformanceTracker, +) + + +class AbstractPrepareService(abc.ABC): + """ + Orchestrates the file annotation prepare process. This service prepares files for annotation + by creating annotation state instances for files marked ToAnnotate. + """ + + def __init__( + self, + client: CogniteClient, + config: Config, + logger: CogniteFunctionLogger, + tracker: PerformanceTracker, + data_model_service: IDataModelService, + ): + self.client = client + self.config = config + self.logger = logger + self.tracker = tracker + self.data_model_service = data_model_service + + @abc.abstractmethod + def run(self) -> str | None: + pass + + +class GeneralPrepareService(AbstractPrepareService): + """ + Orchestrates the file annotation prepare process. This service prepares files for annotation + by creating annotation state instances for files marked ToAnnotate. 
+ """ + + def __init__( + self, + client: CogniteClient, + config: Config, + logger: CogniteFunctionLogger, + tracker: PerformanceTracker, + data_model_service: IDataModelService, + function_call_info: dict, + ): + super().__init__( + client, + config, + logger, + tracker, + data_model_service, + ) + + self.annotation_state_view: ViewPropertyConfig = config.data_model_views.annotation_state_view + self.file_view: ViewPropertyConfig = config.data_model_views.file_view + + self.function_id: int | None = function_call_info.get("function_id") + self.call_id: int | None = function_call_info.get("call_id") + + self.reset_files: bool = False + if self.config.prepare_function.get_files_for_annotation_reset_query: + self.reset_files = True + + def run(self) -> Literal["Done"] | None: + """ + Prepares files for annotation by creating annotation state instances. + + Retrieves files marked "ToAnnotate", creates corresponding FileAnnotationState instances, + and updates file tags to indicate processing has started. Can also reset files if configured. + + Args: + None + + Returns: + "Done" if no more files need preparation, None if processing should continue. + + Raises: + CogniteAPIError: If query timeout or other API errors occur (408 errors are handled gracefully). + ValueError: If annotation state view instance space is not configured. 
+ """ + self.logger.info( + message=f"Starting Prepare Function", + section="START", + ) + try: + if self.reset_files: + file_nodes_to_reset: NodeList | None = self.data_model_service.get_files_for_annotation_reset() + if not file_nodes_to_reset: + self.logger.info( + "No files found with the getFilesForAnnotationReset query provided in the config file" + ) + else: + self.logger.info(f"Resetting {len(file_nodes_to_reset)} files") + reset_node_apply: list[NodeApply] = [] + for file_node in file_nodes_to_reset: + file_node_apply: NodeApply = file_node.as_write() + tags_property: list[str] = cast(list[str], file_node_apply.sources[0].properties["tags"]) + if "AnnotationInProcess" in tags_property: + tags_property.remove("AnnotationInProcess") + if "Annotated" in tags_property: + tags_property.remove("Annotated") + if "AnnotationFailed" in tags_property: + tags_property.remove("AnnotationFailed") + + reset_node_apply.append(file_node_apply) + update_results = self.data_model_service.update_annotation_state(reset_node_apply) + self.logger.info( + f"Removed the AnnotationInProcess/Annotated/AnnotationFailed tag of {len(update_results)} files" + ) + self.reset_files = False + except CogniteAPIError as e: + # NOTE: Reliant on the CogniteAPI message to stay the same across new releases. If unexpected changes were to occur please refer to this section of the code and check if error message is now different. + if ( + e.code == 408 + and e.message == "Graph query timed out. Reduce load or contention, or optimise your query." + ): + # NOTE: 408 indicates a timeout error. Keep retrying the query if a timeout occurs. 
+                self.logger.error(message=f"Ran into the following error:\n{str(e)}")
+                return
+            else:
+                raise e
+
+        try:
+            file_nodes: NodeList | None = self.data_model_service.get_files_to_annotate()
+            if not file_nodes:
+                self.logger.info(
+                    message="No files found to prepare",
+                    section="END",
+                )
+                return "Done"
+            self.logger.info(f"Preparing {len(file_nodes)} files")
+        except CogniteAPIError as e:
+            # NOTE: This relies on the Cognite API error message staying the same across releases. If unexpected failures occur here, check whether the error message below has changed.
+            if (
+                e.code == 408
+                and e.message == "Graph query timed out. Reduce load or contention, or optimise your query."
+            ):
+                # NOTE: 408 indicates a timeout. Return here so the query is retried on the next scheduled run.
+                self.logger.error(message=f"Ran into the following error:\n{str(e)}")
+                return
+            else:
+                raise e
+
+        annotation_state_instances: list[NodeApply] = []
+        file_apply_instances: list[NodeApply] = []
+        for file_node in file_nodes:
+            node_id = {"space": file_node.space, "externalId": file_node.external_id}
+            annotation_instance = AnnotationState(
+                annotationStatus=AnnotationStatus.NEW,
+                linkedFile=node_id,
+            )
+            if not self.annotation_state_view.instance_space:
+                msg = (
+                    "Need an instance space in DataModelViews/AnnotationStateView config to store the annotation state"
+                )
+                self.logger.error(msg)
+                raise ValueError(msg)
+            annotation_instance_space: str = self.annotation_state_view.instance_space
+
+            annotation_node_apply: NodeApply = annotation_instance.to_node_apply(
+                node_space=annotation_instance_space,
+                annotation_state_view=self.annotation_state_view.as_view_id(),
+            )
+            annotation_state_instances.append(annotation_node_apply)
+
+            file_node_apply: NodeApply = file_node.as_write()
+            tags_property: list[str] = cast(list[str], file_node_apply.sources[0].properties["tags"])
+            if "AnnotationInProcess" not in tags_property:
+                tags_property.append("AnnotationInProcess")
+
file_apply_instances.append(file_node_apply) + + try: + create_results = self.data_model_service.create_annotation_state(annotation_state_instances) + self.logger.info(message=f"Created {len(create_results)} annotation state instances") + update_results = self.data_model_service.update_annotation_state(file_apply_instances) + self.logger.info( + message=f"Added 'AnnotationInProcess' to the tag property for {len(update_results)} files", + section="END", + ) + except Exception as e: + self.logger.error(message=f"Ran into the following error:\n{str(e)}", section="END") + raise + + self.tracker.add_files(success=len(file_nodes)) + return + + +class LocalPrepareService(GeneralPrepareService): + """ + Prepare service variant for local development and debugging. + + Extends GeneralPrepareService with any local-specific behavior if needed. + """ + + pass diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/utils/DataStructures.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/utils/DataStructures.py new file mode 100644 index 00000000..8ef6675d --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_prepare/utils/DataStructures.py @@ -0,0 +1,331 @@ +from dataclasses import dataclass, asdict, field +from typing import Literal, cast +from enum import Enum +from datetime import datetime, timezone, timedelta + +from cognite.client.data_classes.data_modeling import ( + Node, + NodeId, + NodeApply, + NodeOrEdgeData, + ViewId, +) +from cognite.client.data_classes.contextualization import ( + FileReference, +) + + +@dataclass +class EnvConfig: + """ + Data structure holding the configs to connect to CDF client locally + """ + + cdf_project: str + cdf_cluster: str + tenant_id: str + client_id: str + client_secret: str + + +class DiagramAnnotationStatus(str, Enum): + SUGGESTED = "Suggested" + APPROVED = "Approved" + REJECTED = "Rejected" + + +class 
AnnotationStatus(str, Enum):
+    """
+    Defines the values that the annotationStatus property can take on Annotation State instances.
+    Inherits from 'str' so that the enum members are also string instances,
+    making them directly usable where a string is expected (e.g., serialization).
+    """
+
+    NEW = "New"
+    RETRY = "Retry"
+    PROCESSING = "Processing"
+    FINALIZING = "Finalizing"
+    ANNOTATED = "Annotated"
+    FAILED = "Failed"
+
+
+class FilterOperator(str, Enum):
+    """
+    Defines the types of filter operations that can be specified in the configuration.
+    Inherits from 'str' so that the enum members are also string instances,
+    making them directly usable where a string is expected (e.g., serialization).
+    """
+
+    EQUALS = "Equals"  # Checks for equality against a single value.
+    EXISTS = "Exists"  # Checks if a property exists (is not null).
+    CONTAINSALL = "ContainsAll"  # Checks if an item contains all specified values for a given property.
+    IN = "In"  # Checks if a value is within a list of specified values. CONTAINSANY is not implemented because IN is usually more suitable.
+    SEARCH = "Search"  # Performs full-text search on a specified property.
+
+
+@dataclass
+class AnnotationState:
+    """
+    Data structure holding the mpcAnnotationState view properties. Time values convert to Timestamp when ingested into CDF.
+ """ + + annotationStatus: AnnotationStatus + linkedFile: dict[str, str] = field(default_factory=dict) + attemptCount: int = 0 + annotationMessage: str | None = None + diagramDetectJobId: int | None = None + sourceCreatedTime: str = field( + default_factory=lambda: datetime.now(timezone.utc).replace(microsecond=0).isoformat() + ) + sourceUpdatedTime: str = field( + default_factory=lambda: datetime.now(timezone.utc).replace(microsecond=0).isoformat() + ) + sourceCreatedUser: str = "fn_dm_context_annotation_prepare" + sourceUpdatedUser: str = "fn_dm_context_annotation_prepare" + + def _create_external_id(self) -> str: + """ + Create a deterministic external ID so that we can replace mpcAnnotationState of files that have been updated and aren't new + """ + prefix = "an_state" + linked_file_space = self.linkedFile["space"] + linked_file_id = self.linkedFile["externalId"] + return f"{prefix}_{linked_file_space}_{linked_file_id}" + + def to_dict(self) -> dict: + return asdict(self) + + def to_node_apply(self, node_space: str, annotation_state_view: ViewId) -> NodeApply: + external_id: str = self._create_external_id() + + return NodeApply( + space=node_space, + external_id=external_id, + sources=[ + NodeOrEdgeData( + source=annotation_state_view, + properties=self.to_dict(), + ) + ], + ) + + +@dataclass +class FileProcessingBatch: + primary_scope_value: str + secondary_scope_value: str | None + files: list[Node] + + +@dataclass +class entity: + """ + data structure for the 'entities' fed into diagram detect, + { + "external_id": file.external_id, + "name": file.properties[job_config.file_view.as_view_id()]["name"], + "space": file.space, + "annotation_type": job_config.file_view.type, + "resource_type": file.properties[job_config.file_view.as_view_id()][{resource_type}], + "search_property": file.properties[job_config.file_view.as_view_id()][{search_property}], + } + """ + + external_id: str + name: str + space: str + annotation_type: Literal["diagrams.FileLink", 
"diagrams.AssetLink"] | None + resource_type: str + search_property: list[str] = field(default_factory=list) + + def to_dict(self): + return asdict(self) + + +@dataclass +class BatchOfNodes: + nodes: list[Node] = field(default_factory=list) + ids: list[NodeId] = field(default_factory=list) + apply: list[NodeApply] = field(default_factory=list) + + def add(self, node: Node): + self.nodes.append(node) + node_id = node.as_id() + self.ids.append(node_id) + return + + def clear(self): + self.nodes.clear() + self.ids.clear() + self.apply.clear() + return + + def update_node_properties(self, new_properties: dict, view_id: ViewId): + for node in self.nodes: + node_apply = NodeApply( + space=node.space, + external_id=node.external_id, + existing_version=None, + sources=[ + NodeOrEdgeData( + source=view_id, + properties=new_properties, + ) + ], + ) + self.apply.append(node_apply) + return + + +@dataclass +class BatchOfPairedNodes: + """ + Where nodeA is an instance of the file view and nodeB is an instance of the annotation state view + """ + + file_to_state_map: dict[NodeId, Node] + batch_files: BatchOfNodes = field(default_factory=BatchOfNodes) + batch_states: BatchOfNodes = field(default_factory=BatchOfNodes) + file_references: list[FileReference] = field(default_factory=list) + + def add_pair(self, file_node: Node, file_reference: FileReference): + self.file_references.append(file_reference) + self.batch_files.add(file_node) + file_node_id: NodeId = file_node.as_id() + state_node: Node = self.file_to_state_map[file_node_id] + self.batch_states.add(state_node) + + def create_file_reference( + self, + file_node_id: NodeId, + page_range: int, + annotation_state_view_id: ViewId, + ) -> FileReference: + """ + Create a file reference that has a page range for annotation. + The current implementation of the detect api 20230101-beta only allows annotation of files up to 50 pages. + Thus, this is my idea of how we can enables annotating files that are more than 50 pages long. 
+ + The annotatedPageCount and pageCount properties won't be set in the initial creation of the annotation state nodes. + That's because we don't know how many pages are in the pdf until we run the diagram detect job where the page count gets returned from the results of the job. + Thus, annotatedPageCount and pageCount get set in the finalize function. + The finalize function will set the page count properties based on the page count that returned from diagram detect job results. + - If the pdf has less than 50 pages, say 3 pages, then... + - annotationStatus property will get set to 'complete' + - annotatedPageCount and pageCount properties will be set to 3. + - Elif the pdf has more than 50 pages, say 80, then... + - annotationStatus property will get set to 'new' + - annotatedPageCount set to 50 + - pageCount set to 80 + - attemptCount doesn't get incremented + + NOTE: Chose to create the file_reference here b/c I already have access to the file node and state node. + If I chose to have this logic in the launchService then we'd have to iterate on all of the nodes that have already been added. + Thus -> O(N) + O(N) to create the BatchOfPairedNodes and then to create the file references + Instead, this approach makes it just O(N) + """ + annotation_state_node: Node = self.file_to_state_map[file_node_id] + annotated_page_count: int | None = cast( + int, + annotation_state_node.properties[annotation_state_view_id].get("annotatedPageCount"), + ) + page_count: int | None = cast( + int, + annotation_state_node.properties[annotation_state_view_id].get("pageCount"), + ) + if not annotated_page_count or not page_count: + file_reference: FileReference = FileReference( + file_instance_id=file_node_id, + first_page=1, + last_page=page_range, + ) + else: + # NOTE: adding 1 here since that annotated_page_count variable holds the last page that was annotated. Thus we want to annotate the following page + # e.g.) 
first run annotates pages 1-50 second run would annotate 51-100 + first_page = annotated_page_count + 1 + last_page = annotated_page_count + page_range + if page_count <= last_page: + last_page = page_count + file_reference: FileReference = FileReference( + file_instance_id=file_node_id, + first_page=first_page, + last_page=last_page, + ) + + return file_reference + + def clear_pair(self): + self.batch_files.clear() + self.batch_states.clear() + self.file_references.clear() + + def size(self) -> int: + return len(self.file_references) + + def is_empty(self) -> bool: + if self.file_references: + return False + return True + + +@dataclass +class PerformanceTracker: + """ + Keeps track of metrics + """ + + files_success: int = 0 + files_failed: int = 0 + total_runs: int = 0 + total_time_delta: timedelta = timedelta(0) + latest_run_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc)) + + def _run_time(self) -> timedelta: + time_delta = datetime.now(timezone.utc) - self.latest_run_time + return time_delta + + def _average_run_time(self) -> timedelta: + if self.total_runs == 0: + return timedelta(0) + return self.total_time_delta / self.total_runs + + def add_files(self, success: int, failed: int = 0): + self.files_success += success + self.files_failed += failed + + def generate_local_report(self) -> str: + self.total_runs += 1 + time_delta = self._run_time() + self.total_time_delta += time_delta + self.latest_run_time = datetime.now(timezone.utc) + + report = f"run time: {time_delta}" + return report + + def generate_overall_report(self) -> str: + report = f" Run started {datetime.now(timezone.utc)}\n- total runs: {self.total_runs}\n- total files processed: {self.files_success+self.files_failed}\n- successful files: {self.files_success}\n- failed files: {self.files_failed}\n- total run time: {self.total_time_delta}\n- average run time: {self._average_run_time()}" + return report + + def generate_ep_run( + self, + caller: Literal["Prepare", 
"Launch", "Finalize"], + function_id: str | None, + call_id: str | None, + ) -> str: + """Generates the report string for the extraction pipeline run.""" + report = ( + f"(caller:{caller}, function_id:{function_id}, call_id:{call_id}) - " + f"total files processed: {self.files_success + self.files_failed} - " + f"successful files: {self.files_success} - " + f"failed files: {self.files_failed}" + ) + return report + + def reset(self) -> None: + self.files_success = 0 + self.files_failed = 0 + self.total_runs: int = 0 + self.total_time_delta = timedelta(0) + self.latest_run_time = datetime.now(timezone.utc) + print("PerformanceTracker state has been reset") diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/dependencies.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/dependencies.py new file mode 100644 index 00000000..831b5447 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/dependencies.py @@ -0,0 +1,190 @@ +import os + +from pathlib import Path +from dotenv import load_dotenv +from typing import Any, Tuple, Literal, cast +from cognite.client import CogniteClient, ClientConfig, global_config +from cognite.client.credentials import OAuthClientCredentials +from utils.DataStructures import EnvConfig + +from services.ConfigService import Config, load_config_parameters +from services.LoggerService import CogniteFunctionLogger +from services.EntitySearchService import EntitySearchService +from services.CacheService import CacheService + + +def get_env_variables() -> EnvConfig: + """ + Loads environment variables required for CDF authentication from .env file. 
+
+    Required environment variables:
+        - CDF_PROJECT: CDF project name
+        - CDF_CLUSTER: CDF cluster (e.g., westeurope-1)
+        - IDP_TENANT_ID: Azure AD tenant ID
+        - IDP_CLIENT_ID: Azure AD application client ID
+        - IDP_CLIENT_SECRET: Azure AD application client secret
+
+    Returns:
+        EnvConfig object containing all required environment variables
+
+    Raises:
+        ValueError: If any required environment variables are missing
+    """
+    print("Loading environment variables from .env...")
+
+    project_path: Path = (Path(__file__).parent / ".env").resolve()
+    print(f"project_path is set to: {project_path}")
+
+    load_dotenv(dotenv_path=project_path)  # load the resolved .env file rather than relying on the cwd search
+
+    required_envvars: tuple[str, ...] = (
+        "CDF_PROJECT",
+        "CDF_CLUSTER",
+        "IDP_TENANT_ID",
+        "IDP_CLIENT_ID",
+        "IDP_CLIENT_SECRET",
+    )
+
+    missing: list[str] = [envvar for envvar in required_envvars if envvar not in os.environ]
+    if missing:
+        raise ValueError(f"Missing required environment variables: {missing}")
+
+    return EnvConfig(
+        cdf_project=os.getenv("CDF_PROJECT"),  # type: ignore
+        cdf_cluster=os.getenv("CDF_CLUSTER"),  # type: ignore
+        tenant_id=os.getenv("IDP_TENANT_ID"),  # type: ignore
+        client_id=os.getenv("IDP_CLIENT_ID"),  # type: ignore
+        client_secret=os.getenv("IDP_CLIENT_SECRET"),  # type: ignore
+    )
+
+
+def create_client(env_config: EnvConfig, debug: bool = False) -> CogniteClient:
+    """
+    Creates an authenticated CogniteClient using OAuth client credentials flow.
+ + Args: + env_config: Environment configuration containing CDF connection details + debug: Whether to enable debug mode on the client (default: False) + + Returns: + Authenticated CogniteClient instance + """ + SCOPES: list[str] = [f"https://{env_config.cdf_cluster}.cognitedata.com/.default"] + TOKEN_URL: str = f"https://login.microsoftonline.com/{env_config.tenant_id}/oauth2/v2.0/token" + creds: OAuthClientCredentials = OAuthClientCredentials( + token_url=TOKEN_URL, + client_id=env_config.client_id, + client_secret=env_config.client_secret, + scopes=SCOPES, + ) + settings: dict[str, bool] = { + "disable_ssl": True, + } + global_config.apply_settings(settings) + cnf: ClientConfig = ClientConfig( + client_name="DEV_Working", + project=env_config.cdf_project, + base_url=f"https://{env_config.cdf_cluster}.cognitedata.com", + credentials=creds, + debug=debug, + ) + client: CogniteClient = CogniteClient(cnf) + return client + + +def create_logger_service(log_level: str, filepath: str | None) -> CogniteFunctionLogger: + """ + Creates a logger service for tracking function execution. + + Args: + log_level: Logging level ("DEBUG", "INFO", "WARNING", "ERROR") + filepath: Optional file path for writing logs to disk + + Returns: + CogniteFunctionLogger instance configured with specified settings + """ + write: bool + if filepath: + write = True + else: + write = False + if log_level not in ["DEBUG", "INFO", "WARNING", "ERROR"]: + return CogniteFunctionLogger() + else: + # Cast to Literal type to satisfy type checker + validated_log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = cast( + Literal["DEBUG", "INFO", "WARNING", "ERROR"], log_level + ) + return CogniteFunctionLogger(log_level=validated_log_level, write=write, filepath=filepath) + + +def create_config_service( + function_data: dict[str, Any], client: CogniteClient | None = None +) -> Tuple[Config, CogniteClient]: + """ + Creates configuration service and CogniteClient for the function. 
+
+    Loads configuration from CDF based on the ExtractionPipelineExtId provided in function_data.
+    If no client is provided, creates one using environment variables.
+
+    Args:
+        function_data: Dictionary containing function input data (must include ExtractionPipelineExtId)
+        client: Optional pre-initialized CogniteClient (if None, creates new client)
+
+    Returns:
+        Tuple of (Config, CogniteClient)
+    """
+    if not client:
+        env_config: EnvConfig = get_env_variables()
+        client = create_client(env_config)
+    config: Config = load_config_parameters(client=client, function_data=function_data)
+    return config, client
+
+
+def create_entity_search_service(
+    config: Config, client: CogniteClient, logger: CogniteFunctionLogger
+) -> EntitySearchService:
+    """
+    Creates an EntitySearchService instance for finding entities by text.
+
+    Factory function that initializes EntitySearchService with configuration.
+
+    Args:
+        config: Configuration object containing data model views and entity search settings
+        client: CogniteClient for API interactions
+        logger: Logger instance for tracking execution
+
+    Returns:
+        Initialized EntitySearchService instance
+
+    Raises:
+        ValueError: If regular_annotation_space (file_view.instance_space) is None
+    """
+    return EntitySearchService(config=config, client=client, logger=logger)
+
+
+def create_cache_service(
+    config: Config, client: CogniteClient, logger: CogniteFunctionLogger, entity_search_service: EntitySearchService
+) -> CacheService:
+    """
+    Creates a CacheService instance for caching text → entity mappings.
+
+    Factory function that initializes CacheService with configuration.
+    Importantly, reuses the normalize() function from EntitySearchService to ensure
+    consistent text normalization between caching and searching.
+ + Args: + config: Configuration object containing RAW database settings and data model views + client: CogniteClient for API interactions + logger: Logger instance for tracking execution + entity_search_service: EntitySearchService instance (to reuse normalize function) + + Returns: + Initialized CacheService instance + """ + return CacheService( + config=config, + client=client, + logger=logger, + normalize_fn=entity_search_service.normalize, # Reuse normalization from entity search + ) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/handler.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/handler.py new file mode 100644 index 00000000..273134b0 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/handler.py @@ -0,0 +1,164 @@ +import os +import sys +import time +from datetime import datetime, timezone, timedelta +from cognite.client import CogniteClient +from dependencies import ( + create_config_service, + create_logger_service, + create_entity_search_service, + create_cache_service, +) +from services.PromoteService import GeneralPromoteService +from services.ConfigService import Config +from services.LoggerService import CogniteFunctionLogger +from services.EntitySearchService import EntitySearchService +from services.CacheService import CacheService +from utils.DataStructures import PromoteTracker + + +def handle(data: dict, function_call_info: dict, client: CogniteClient) -> dict[str, str]: + """ + Main entry point for the Cognite Function - promotes pattern-mode annotations. + + This function runs in a loop for up to 7 minutes, processing batches of pattern-mode + annotations. For each batch: + 1. Retrieves candidate edges (pattern-mode annotations pointing to sink node) + 2. Searches for matching entities using EntitySearchService (with caching) + 3. Updates edges and RAW tables based on search results + 4. 
Pauses 10 seconds between batches + + Pattern-mode annotations are created when diagram detection finds text matching + regex patterns but can't match it to the provided entity list. This function + attempts to resolve those annotations post-hoc. + + Args: + data: Function input data containing: + - ExtractionPipelineExtId: ID of extraction pipeline for config + - logLevel: Logging level (DEBUG, INFO, WARNING, ERROR) + - logPath: Optional path for writing logs to file + function_call_info: Metadata about the function call (not currently used) + client: Pre-initialized CogniteClient for API interactions + + Returns: + Dictionary with execution status: + - {"status": "success", "message": "..."} on normal completion + - {"status": "failure", "message": "..."} on error + + Raises: + Exception: Any unexpected errors are caught, logged, and returned in status dict + """ + start_time: datetime = datetime.now(timezone.utc) + + config: Config + config, client = create_config_service(function_data=data, client=client) + logger: CogniteFunctionLogger = create_logger_service(data.get("logLevel", "DEBUG"), data.get("logPath")) + tracker: PromoteTracker = PromoteTracker() + + # Create service dependencies + entity_search_service: EntitySearchService = create_entity_search_service(config, client, logger) + cache_service: CacheService = create_cache_service(config, client, logger, entity_search_service) + + # Create promote service with injected dependencies + promote_service: GeneralPromoteService = GeneralPromoteService( + client=client, + config=config, + logger=logger, + tracker=tracker, + entity_search_service=entity_search_service, + cache_service=cache_service, + ) + + run_status: str = "success" + try: + # Run in a loop for a maximum of 7 minutes b/c serverless functions can run for max 10 minutes before hardware dies + while datetime.now(timezone.utc) - start_time < timedelta(minutes=7): + result: str | None = promote_service.run() + if result == "Done": + 
logger.info("No more candidates to process. Exiting.", section="END") + break + # Log batch report and pause between batches + logger.info(tracker.generate_local_report(), section="START") + return {"status": run_status, "data": data} + except Exception as e: + run_status = "failure" + msg: str = f"{str(e)}" + logger.error(f"An unexpected error occurred: {msg}", section="BOTH") + return {"status": run_status, "message": msg} + finally: + # Generate overall summary report + logger.info(tracker.generate_overall_report(), section="BOTH") + + +def run_locally(config_file: dict) -> None: + """ + Entry point for local execution and debugging. + + Runs the promote function locally using environment variables for authentication + instead of Cognite Functions runtime. Useful for development and testing. + + Args: + config_file: Configuration dictionary containing: + - ExtractionPipelineExtId: ID of extraction pipeline for config + - logLevel: Logging level (DEBUG, INFO, WARNING, ERROR) + - logPath: Path for writing logs to file + + Returns: + None (execution results are logged) + + Raises: + ValueError: If required environment variables are missing + """ + from dependencies import create_client, get_env_variables + from utils.DataStructures import EnvConfig + + env_vars: EnvConfig = get_env_variables() + client: CogniteClient = create_client(env_vars) + + # Mock function_call_info for local runs + config: Config + config, client = create_config_service(function_data=config_file) + logger: CogniteFunctionLogger = create_logger_service( + config_file.get("logLevel", "DEBUG"), config_file.get("logPath") + ) + tracker: PromoteTracker = PromoteTracker() + + # Create service dependencies + entity_search_service: EntitySearchService = create_entity_search_service(config, client, logger) + cache_service: CacheService = create_cache_service(config, client, logger, entity_search_service) + + # Create promote service with injected dependencies + promote_service: GeneralPromoteService 
= GeneralPromoteService( + client=client, + config=config, + logger=logger, + tracker=tracker, + entity_search_service=entity_search_service, + cache_service=cache_service, + ) + + try: + # Run in a loop for a maximum of 7 minutes b/c serverless functions can run for max 10 minutes before hardware dies + while True: + result: str | None = promote_service.run() + if result == "Done": + logger.info("No more candidates to process. Exiting.", section="END") + break + # Log batch report and pause between batches + logger.info(tracker.generate_local_report(), section="START") + except Exception as e: + run_status = "failure" + msg: str = f"{str(e)}" + logger.error(f"An unexpected error occurred: {msg}", section="BOTH") + finally: + # Generate overall summary report + logger.info(tracker.generate_overall_report(), section="BOTH") + + +if __name__ == "__main__": + config_file = { + "ExtractionPipelineExtId": sys.argv[1], + "logLevel": sys.argv[2], + "logPath": sys.argv[3], + } + run_locally(config_file) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/requirements.txt b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/requirements.txt new file mode 100644 index 00000000..bd7f2bc3 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/requirements.txt @@ -0,0 +1,24 @@ +annotated-types==0.7.0 +certifi==2025.4.26 +cffi==1.17.1 +charset-normalizer==3.4.2 +cognite-sdk==7.76.0 +cryptography==44.0.3 +dotenv==0.9.9 +idna==3.10 +msal==1.32.3 +oauthlib==3.2.2 +packaging==25.0 +protobuf==6.30.2 +pycparser==2.22 +pydantic==2.11.4 +pydantic_core==2.33.2 +PyJWT==2.10.1 +python-dotenv==1.1.0 +PyYAML==6.0.2 +requests==2.32.3 +requests-oauthlib==1.3.1 +typing-inspection==0.4.0 +typing_extensions==4.13.2 +urllib3==2.5.0 + diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/CacheService.py 
b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/CacheService.py
new file mode 100644
index 00000000..6c79ba58
--- /dev/null
+++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/CacheService.py
@@ -0,0 +1,356 @@
+import abc
+from datetime import datetime, timezone, timedelta
+from typing import Callable, Any
+from cognite.client import CogniteClient
+from cognite.client.data_classes.data_modeling import Node, NodeList
+from cognite.client.data_classes.data_modeling.ids import ViewId
+from cognite.client.data_classes.raw import Row
+from services.LoggerService import CogniteFunctionLogger
+from services.ConfigService import Config, ViewPropertyConfig
+
+
+class ICacheService(abc.ABC):
+    """
+    Interface for services that cache text → entity mappings to improve lookup performance.
+    """
+
+    @abc.abstractmethod
+    def get(self, text: str, annotation_type: str) -> Node | None:
+        """
+        Retrieves a cached entity node for the given text and annotation type.
+
+        Args:
+            text: Text to look up
+            annotation_type: Type of annotation
+
+        Returns:
+            Cached Node if found, None if cache miss
+        """
+        pass
+
+    @abc.abstractmethod
+    def set(self, text: str, annotation_type: str, node: Node | None) -> None:
+        """
+        Caches an entity node for the given text and annotation type.
+
+        Args:
+            text: Text being cached
+            annotation_type: Type of annotation
+            node: Entity node to cache, or None for negative caching
+        """
+        pass
+
+    @abc.abstractmethod
+    def get_from_memory(self, text: str, annotation_type: str) -> Node | None:
+        """
+        Retrieves from in-memory cache only (no persistent storage lookup).
+
+        Args:
+            text: Text to look up
+            annotation_type: Type of annotation
+
+        Returns:
+            Cached Node if found in memory, None otherwise
+        """
+        pass
+
+
+class CacheService(ICacheService):
+    """
+    Manages two-tier caching for text → entity mappings to dramatically improve performance.
+ + **TIER 1: In-Memory Cache** (This Run Only): + - Ultra-fast lookup (in-memory dictionary) + - Dictionary stored in memory: {(text, type): (space, id) or None} + - **Includes negative caching** (remembers "no match found" to avoid repeated searches) + - Cleared when function execution ends + - Used for: Both positive matches AND negative results (not found) + + **TIER 2: Persistent RAW Cache** (All Runs): + - Fast lookup (single database query) + - Stored in RAW table: promote_text_to_entity_cache + - Benefits all future function runs indefinitely + - Tracks hit count for analytics + - **Only caches positive matches** (unambiguous single entities found) + - Does NOT cache negative results (to allow for new entities added over time) + + **Performance Impact:** + - First lookup: Slowest (query annotation edges + entity retrieval) + - Cached lookup (same run): Fastest (in-memory dictionary) + - Cached lookup (future run): Fast (single database query) + - Self-improving: Gets faster as cache fills + """ + + def __init__( + self, + config: Config, + client: CogniteClient, + logger: CogniteFunctionLogger, + normalize_fn: Callable[[str], str], + ): + """ + Initializes the cache service. 
+ + Args: + config: Configuration object containing data model views and cache settings + client: Cognite client + logger: Logger instance + normalize_fn: Function to normalize text for cache keys + """ + self.client = client + self.logger = logger + self.config = config + self.normalize = normalize_fn + + # Extract view configurations + file_view: ViewPropertyConfig = config.data_model_views.file_view + target_entities_view: ViewPropertyConfig = config.data_model_views.target_entities_view + + # Extract view IDs + self.file_view_id = file_view.as_view_id() + self.target_entities_view_id = target_entities_view.as_view_id() + + # Extract RAW database and cache table configuration + self.raw_db: str = config.promote_function.raw_db + self.cache_table_name: str = config.promote_function.cache_service.cache_table_name + + self.function_id = "fn_file_annotation_promote" + + # In-memory cache: {(text, type): (space, ext_id) or None} + self._memory_cache: dict[tuple[str, str], tuple[str, str] | None] = {} + + def get(self, text: str, annotation_type: str) -> Node | None: + """ + Retrieves a cached entity node for the given text and annotation type. + + Checks in-memory cache first, then falls back to persistent RAW cache. 
+
+        Args:
+            text: The text to look up
+            annotation_type: Type of annotation ("diagrams.FileLink" or "diagrams.AssetLink")
+
+        Returns:
+            Cached Node if found, None if cache miss
+        """
+        cache_key: tuple[str, str] = (text, annotation_type)
+
+        # TIER 1: In-memory cache (instant)
+        if cache_key in self._memory_cache:
+            cached_result: tuple[str, str] | None = self._memory_cache[cache_key]
+            if cached_result is None:
+                # Negative cache entry
+                return None
+
+            # Retrieve the node from cache
+            space: str
+            ext_id: str
+            space, ext_id = cached_result
+            view_id: ViewId = (
+                self.file_view_id if annotation_type == "diagrams.FileLink" else self.target_entities_view_id
+            )
+
+            try:
+                retrieved: Any = self.client.data_modeling.instances.retrieve_nodes(
+                    nodes=(space, ext_id), sources=view_id
+                )
+                if retrieved:
+                    self.logger.debug(f"✓ [CACHE] In-memory cache HIT for '{text}'")
+                    node: Node | None = self._extract_single_node(retrieved)
+                    return node
+            except Exception as e:
+                self.logger.warning(f"[CACHE] Failed to retrieve cached node for '{text}': {e}")
+                # Invalidate this cache entry
+                del self._memory_cache[cache_key]
+                return None
+
+        # TIER 2: Persistent RAW cache (fast)
+        cached_node: Node | None = self._get_from_persistent_cache(text, annotation_type)
+        if cached_node:
+            self.logger.info(f"✓ [CACHE] Persistent cache HIT for '{text}'")
+            # Populate in-memory cache for future lookups in this run
+            self._memory_cache[cache_key] = (cached_node.space, cached_node.external_id)
+            return cached_node
+
+        # Cache miss
+        return None
+
+    def get_from_memory(self, text: str, annotation_type: str) -> Node | None:
+        """
+        Retrieves from in-memory cache only (no persistent storage lookup).
+
+        Useful for checking if we've already looked up this text in this run.
+
+        Args:
+            text: The text to look up
+            annotation_type: Type of annotation
+
+        Returns:
+            Cached Node if found in memory, None otherwise
+        """
+        cache_key: tuple[str, str] = (text, annotation_type)
+        if cache_key not in self._memory_cache:
+            return None
+
+        cached_result: tuple[str, str] | None = self._memory_cache[cache_key]
+        if cached_result is None:
+            return None
+
+        space: str
+        ext_id: str
+        space, ext_id = cached_result
+        view_id: ViewId = self.file_view_id if annotation_type == "diagrams.FileLink" else self.target_entities_view_id
+
+        try:
+            retrieved: Any = self.client.data_modeling.instances.retrieve_nodes(nodes=(space, ext_id), sources=view_id)
+            if retrieved:
+                return self._extract_single_node(retrieved)
+        except Exception:
+            pass
+
+        return None
+
+    def set(self, text: str, annotation_type: str, node: Node | None) -> None:
+        """
+        Caches an entity node for the given text and annotation type.
+
+        Caching behavior:
+        - Positive matches (node provided): Cached in BOTH in-memory AND persistent RAW
+        - Negative results (node=None): Cached ONLY in-memory (allows for new entities over time)
+
+        Args:
+            text: The text being cached
+            annotation_type: Type of annotation
+            node: The entity node to cache, or None for negative caching (in-memory only)
+        """
+        cache_key: tuple[str, str] = (text, annotation_type)
+
+        if node is None:
+            # Negative cache entry (IN-MEMORY ONLY - not persisted to RAW)
+            # This avoids repeated searches within the same run but allows new entities added later
+            self._memory_cache[cache_key] = None
+            self.logger.debug(f"✓ [CACHE] Cached negative result for '{text}' (in-memory only)")
+            return
+
+        # Positive cache entry (BOTH in-memory AND persistent RAW)
+        self._memory_cache[cache_key] = (node.space, node.external_id)
+        self._set_in_persistent_cache(text, annotation_type, node)
+        self.logger.debug(f"✓ [CACHE] Cached positive match for '{text}' → {node.external_id} (in-memory + RAW)")
+
+    def _get_from_persistent_cache(self, text,
+                                   annotation_type: str) -> Node | None:
+        """
+        Checks persistent RAW cache for text → entity mapping.
+
+        Returns:
+            Node if cache hit, None if miss
+        """
+        try:
+            # Normalize text for consistent cache keys
+            cache_key: str = self.normalize(text)
+
+            row: Any = self.client.raw.rows.retrieve(
+                db_name=self.raw_db,
+                table_name=self.cache_table_name,
+                key=cache_key,
+            )
+
+            if not row or not row.columns:
+                return None
+
+            # Verify annotation type matches
+            if row.columns.get("annotationType") != annotation_type:
+                return None
+
+            # Retrieve the cached node
+            end_node_space: Any = row.columns.get("endNodeSpace")
+            end_node_ext_id: Any = row.columns.get("endNode")
+
+            if not end_node_space or not end_node_ext_id:
+                return None
+
+            view_id: ViewId = (
+                self.file_view_id if annotation_type == "diagrams.FileLink" else self.target_entities_view_id
+            )
+
+            retrieved: Any = self.client.data_modeling.instances.retrieve_nodes(
+                nodes=(end_node_space, end_node_ext_id), sources=view_id
+            )
+
+            if retrieved:
+                return self._extract_single_node(retrieved)
+
+            return None
+
+        except Exception as e:
+            # Cache miss or error - just continue without cache
+            self.logger.debug(f"[CACHE] Cache check failed for '{text}': {e}")
+            return None
+
+    def _set_in_persistent_cache(self, text: str, annotation_type: str, node: Node) -> None:
+        """
+        Updates persistent RAW cache with text → entity mapping.
+        Only caches unambiguous single matches.
+
+        NOTE: This cache has two entry points: rows generated automatically by this code, and rows created
+        by manual promotions in the Streamlit app. For automatically generated rows, sourceCreatedUser is
+        set to the function ID; for manual promotions it is set to the promoting user's ID.
+ """ + try: + cache_key: str = self.normalize(text) + + cache_data: Row = Row( + key=cache_key, + columns={ + "originalText": text, + "endNode": node.external_id, + "endNodeSpace": node.space, + "annotationType": annotation_type, + "lastUpdateTimeUtcIso": datetime.now(timezone.utc).isoformat(), + "sourceCreatedUser": self.function_id, + }, + ) + + self.client.raw.rows.insert( + db_name=self.raw_db, + table_name=self.cache_table_name, + row=cache_data, + ensure_parent=True, + ) + + except Exception as e: + # Don't fail the run if cache update fails + self.logger.warning(f"Failed to update cache for '{text}': {e}") + + def _extract_single_node(self, retrieved: Node | NodeList) -> Node | None: + """ + Extracts a single Node from the retrieved result. + + Handles both single Node and NodeList returns from the SDK. + """ + if isinstance(retrieved, NodeList) and len(retrieved) > 0: + first_node = list(retrieved)[0] + return first_node if isinstance(first_node, Node) else None + elif isinstance(retrieved, Node): + return retrieved + else: + return None + + def get_stats(self) -> dict[str, int]: + """ + Returns statistics about the in-memory cache. + + Returns: + Dictionary with cache statistics + """ + total_entries = len(self._memory_cache) + negative_entries = sum(1 for v in self._memory_cache.values() if v is None) + positive_entries = total_entries - negative_entries + + return { + "total_entries": total_entries, + "positive_entries": positive_entries, + "negative_entries": negative_entries, + } + + def clear_memory_cache(self) -> None: + """Clears the in-memory cache. 
Useful for testing.""" + self._memory_cache.clear() + self.logger.debug("In-memory cache cleared") diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/ConfigService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/ConfigService.py new file mode 100644 index 00000000..f1d2584d --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/ConfigService.py @@ -0,0 +1,371 @@ +from enum import Enum +from typing import Any, Literal, cast, Optional + +import yaml +from cognite.client.data_classes.contextualization import ( + DiagramDetectConfig, + ConnectionFlags, + CustomizeFuzziness, + DirectionWeights, +) +from cognite.client.data_classes.data_modeling import NodeId +from cognite.client.data_classes.filters import Filter +from cognite.client import CogniteClient +from cognite.client import data_modeling as dm +from cognite.client.exceptions import CogniteAPIError +from pydantic import BaseModel, Field +from pydantic.alias_generators import to_camel +from utils.DataStructures import AnnotationStatus, FilterOperator + + +# Configuration Classes +class ViewPropertyConfig(BaseModel, alias_generator=to_camel): + schema_space: str + instance_space: Optional[str] = None + external_id: str + version: str + annotation_type: Optional[Literal["diagrams.FileLink", "diagrams.AssetLink"]] = None + + def as_view_id(self) -> dm.ViewId: + return dm.ViewId(space=self.schema_space, external_id=self.external_id, version=self.version) + + def as_property_ref(self, property) -> list[str]: + return [self.schema_space, f"{self.external_id}/{self.version}", property] + + +class FilterConfig(BaseModel, alias_generator=to_camel): + values: Optional[list[AnnotationStatus | str] | AnnotationStatus | str] = None + negate: bool = False + operator: FilterOperator + target_property: str + + def as_filter(self, view_properties: ViewPropertyConfig) -> 
Filter: + property_reference = view_properties.as_property_ref(self.target_property) + + # Converts enum value into string -> i.e.) in the case of AnnotationStatus + if isinstance(self.values, list): + find_values = [v.value if isinstance(v, Enum) else v for v in self.values] + elif isinstance(self.values, Enum): + find_values = self.values.value + else: + find_values = self.values + + filter: Filter + if find_values is None: + if self.operator == FilterOperator.EXISTS: + filter = dm.filters.Exists(property=property_reference) + else: + raise ValueError(f"Operator {self.operator} requires a value") + elif self.operator == FilterOperator.IN: + if not isinstance(find_values, list): + raise ValueError(f"Operator 'IN' requires a list of values for property {self.target_property}") + filter = dm.filters.In(property=property_reference, values=find_values) + elif self.operator == FilterOperator.EQUALS: + filter = dm.filters.Equals(property=property_reference, value=find_values) + elif self.operator == FilterOperator.CONTAINSALL: + filter = dm.filters.ContainsAll(property=property_reference, values=find_values) + elif self.operator == FilterOperator.SEARCH: + filter = dm.filters.Search(property=property_reference, value=find_values) + else: + raise NotImplementedError(f"Operator {self.operator} is not implemented.") + + if self.negate: + return dm.filters.Not(filter) + else: + return filter + + +class QueryConfig(BaseModel, alias_generator=to_camel): + target_view: ViewPropertyConfig + filters: list[FilterConfig] + limit: Optional[int] = -1 + + def build_filter(self) -> Filter: + list_filters: list[Filter] = [f.as_filter(self.target_view) for f in self.filters] + + if len(list_filters) == 1: + return list_filters[0] + else: + return dm.filters.And(*list_filters) # NOTE: '*' Unpacks each filter in the list + + +class ConnectionFlagsConfig(BaseModel, alias_generator=to_camel): + no_text_inbetween: Optional[bool] = None + natural_reading_order: Optional[bool] = None + + def 
as_connection_flag(self) -> ConnectionFlags: + params = {key: value for key, value in self.model_dump().items() if value is not None} + return ConnectionFlags(**params) + + +class CustomizeFuzzinessConfig(BaseModel, alias_generator=to_camel): + fuzzy_score: Optional[float] = None + max_boxes: Optional[int] = None + min_chars: Optional[int] = None + + def as_customize_fuzziness(self) -> CustomizeFuzziness: + params = {key: value for key, value in self.model_dump().items() if value is not None} + return CustomizeFuzziness(**params) + + +class DirectionWeightsConfig(BaseModel, alias_generator=to_camel): + left: Optional[float] = None + right: Optional[float] = None + up: Optional[float] = None + down: Optional[float] = None + + def as_direction_weights(self) -> DirectionWeights: + params = {key: value for key, value in self.model_dump().items() if value is not None} + return DirectionWeights(**params) + + +class DiagramDetectConfigModel(BaseModel, alias_generator=to_camel): + # NOTE: these config options come from v7 of the Cognite Python SDK + annotation_extract: Optional[bool] = None + case_sensitive: Optional[bool] = None + connection_flags: Optional[ConnectionFlagsConfig] = None + customize_fuzziness: Optional[CustomizeFuzzinessConfig] = None + direction_delta: Optional[float] = None + direction_weights: Optional[DirectionWeightsConfig] = None + min_fuzzy_score: Optional[float] = None + read_embedded_text: Optional[bool] = None + remove_leading_zeros: Optional[bool] = None + substitutions: Optional[dict[str, list[str]]] = None + + def as_config(self) -> DiagramDetectConfig: + params = {} + if self.annotation_extract is not None: + params["annotation_extract"] = self.annotation_extract + if self.case_sensitive is not None: + params["case_sensitive"] = self.case_sensitive + if self.connection_flags is not None: + params["connection_flags"] = self.connection_flags.as_connection_flag() + if self.customize_fuzziness is not None: + params["customize_fuzziness"] =
self.customize_fuzziness.as_customize_fuzziness() + if self.direction_delta is not None: + params["direction_delta"] = self.direction_delta + if self.direction_weights is not None: + params["direction_weights"] = self.direction_weights.as_direction_weights() + if self.min_fuzzy_score is not None: + params["min_fuzzy_score"] = self.min_fuzzy_score + if self.read_embedded_text is not None: + params["read_embedded_text"] = self.read_embedded_text + if self.remove_leading_zeros is not None: + params["remove_leading_zeros"] = self.remove_leading_zeros + if self.substitutions is not None: + params["substitutions"] = self.substitutions + + return DiagramDetectConfig(**params) + + +# Launch Related Configs +class DataModelServiceConfig(BaseModel, alias_generator=to_camel): + get_files_to_process_query: QueryConfig | list[QueryConfig] + get_target_entities_query: QueryConfig | list[QueryConfig] + get_file_entities_query: QueryConfig | list[QueryConfig] + + +class CacheServiceConfig(BaseModel, alias_generator=to_camel): + cache_time_limit: int + raw_db: str + raw_table_cache: str + raw_manual_patterns_catalog: str + + +class AnnotationServiceConfig(BaseModel, alias_generator=to_camel): + page_range: int = Field(gt=0, le=50) + partial_match: bool = True + min_tokens: int = 1 + diagram_detect_config: Optional[DiagramDetectConfigModel] = None + + +class PrepareFunction(BaseModel, alias_generator=to_camel): + get_files_for_annotation_reset_query: Optional[QueryConfig | list[QueryConfig]] = None + get_files_to_annotate_query: QueryConfig | list[QueryConfig] + + +class LaunchFunction(BaseModel, alias_generator=to_camel): + batch_size: int = Field(gt=0, le=50) + primary_scope_property: str + secondary_scope_property: Optional[str] = None + file_search_property: str = "aliases" + target_entities_search_property: str = "aliases" + pattern_mode: bool + file_resource_property: Optional[str] = None + target_entities_resource_property: Optional[str] = None + data_model_service: 
DataModelServiceConfig + cache_service: CacheServiceConfig + annotation_service: AnnotationServiceConfig + + +# Finalize Related Configs +class RetrieveServiceConfig(BaseModel, alias_generator=to_camel): + get_job_id_query: QueryConfig | list[QueryConfig] + + +class ApplyServiceConfig(BaseModel, alias_generator=to_camel): + auto_approval_threshold: float = Field(gt=0.0, le=1.0) + auto_suggest_threshold: float = Field(gt=0.0, le=1.0) + sink_node: NodeId + raw_db: str + raw_table_doc_tag: str + raw_table_doc_doc: str + raw_table_doc_pattern: str + + +class FinalizeFunction(BaseModel, alias_generator=to_camel): + clean_old_annotations: bool + max_retry_attempts: int + retrieve_service: RetrieveServiceConfig + apply_service: ApplyServiceConfig + + +# Promote Related Configs +class TextNormalizationConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for text normalization and variation generation. + + Controls how text is normalized for matching and what variations are generated + to improve match rates across different naming conventions. + + These flags affect both the normalize() function (for cache keys and direct matching) + and generate_text_variations() function (for query-based matching). + """ + + remove_special_characters: bool = True + convert_to_lowercase: bool = True + strip_leading_zeros: bool = True + + +class EntitySearchServiceConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the EntitySearchService in the promote function. + + Controls entity search and text normalization behavior: + - Queries entities directly (server-side IN filter on entity/file aliases) + - Text normalization for generating search variations + + Uses efficient server-side filtering on the smaller entity dataset rather than + the larger annotation edge dataset for better performance at scale. 
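A matching YAML fragment for this section might look like the following sketch (values are illustrative, not taken from the module; field names follow the camelCase alias generator):

```yaml
entitySearchService:
  enableExistingAnnotationsSearch: true
  enableGlobalEntitySearch: true
  maxEntitySearchLimit: 1000
  textNormalization:
    removeSpecialCharacters: true
    convertToLowercase: true
    stripLeadingZeros: true
```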
+ """ + + enable_existing_annotations_search: bool = True + enable_global_entity_search: bool = True + max_entity_search_limit: int = Field(default=1000, gt=0, le=10000) + text_normalization: TextNormalizationConfig + + +class PromoteCacheServiceConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the CacheService in the promote function. + + Controls caching behavior for text→entity mappings. + """ + + cache_table_name: str + + +class PromoteFunctionConfig(BaseModel, alias_generator=to_camel): + """ + Configuration for the promote function. + + The promote function resolves pattern-mode annotations by finding matching entities + and updating annotation edges from pointing to a sink node to pointing to actual entities. + + Configuration is organized by service interface: + - entitySearchService: Controls entity search strategies + - cacheService: Controls caching behavior + + Batch size is controlled via the getCandidatesQuery.limit field. + """ + + get_candidates_query: QueryConfig | list[QueryConfig] + raw_db: str + raw_table_doc_pattern: str + raw_table_doc_tag: str + raw_table_doc_doc: str + delete_rejected_edges: bool + delete_suggested_edges: bool + entity_search_service: EntitySearchServiceConfig + cache_service: PromoteCacheServiceConfig + + +class DataModelViews(BaseModel, alias_generator=to_camel): + core_annotation_view: ViewPropertyConfig + annotation_state_view: ViewPropertyConfig + file_view: ViewPropertyConfig + target_entities_view: ViewPropertyConfig + + +class Config(BaseModel, alias_generator=to_camel): + data_model_views: DataModelViews + prepare_function: PrepareFunction + launch_function: LaunchFunction + finalize_function: FinalizeFunction + promote_function: PromoteFunctionConfig + + @classmethod + def parse_direct_relation(cls, value: Any) -> Any: + if isinstance(value, dict): + return dm.DirectRelationReference.load(value) + return value + + +# Functions to construct queries +def get_limit_from_query(query: QueryConfig |
list[QueryConfig]) -> int: + """ + Determines the retrieval limit from a query configuration. + Handles 'None' by treating it as the default -1 (unlimited). + """ + default_limit = -1 + if isinstance(query, list): + if not query: + return default_limit + limits = [q.limit if q.limit is not None else default_limit for q in query] + return max(limits) + else: + return query.limit if query.limit is not None else default_limit + + +def build_filter_from_query(query: QueryConfig | list[QueryConfig]) -> Filter: + """ + Builds a Cognite Filter from a query configuration. + + If the query is a list, it builds a filter for each item and combines them with a logical OR. + If the query is a single object, it builds the filter directly from it. + """ + if isinstance(query, list): + list_filters: list[Filter] = [q.build_filter() for q in query] + if not list_filters: + raise ValueError("Query list cannot be empty.") + return dm.filters.Or(*list_filters) if len(list_filters) > 1 else list_filters[0] + else: + return query.build_filter() + + +def load_config_parameters( + client: CogniteClient, + function_data: dict[str, Any], +) -> Config: + """ + Retrieves the configuration parameters from the function data and loads the configuration from CDF. 
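The expected input contract can be sketched as a plain dict (the external ID value below is illustrative, not taken from the module):

```python
# Hypothetical input payload for the Cognite Function; only the key is contractual.
function_data = {"ExtractionPipelineExtId": "ep_ctx_file_annotation"}

# load_config_parameters raises ValueError if this key is missing,
# before any call to CDF is made.
has_required_key = "ExtractionPipelineExtId" in function_data
print(has_required_key)
```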
+ """ + if "ExtractionPipelineExtId" not in function_data: + raise ValueError("Missing key 'ExtractionPipelineExtId' in input data to the function") + + pipeline_ext_id = function_data["ExtractionPipelineExtId"] + try: + raw_config = client.extraction_pipelines.config.retrieve(pipeline_ext_id) + if raw_config.config is None: + raise ValueError(f"No config found for extraction pipeline: {pipeline_ext_id!r}") + except CogniteAPIError as e: + raise RuntimeError(f"Unable to retrieve pipeline config for extraction pipeline: {pipeline_ext_id!r}") from e + + loaded_yaml_data = yaml.safe_load(raw_config.config) + + if isinstance(loaded_yaml_data, dict): + return Config.model_validate(loaded_yaml_data) + else: + raise ValueError( + "Invalid configuration structure from CDF: expected the pipeline config to parse to a YAML mapping (dictionary)." + ) diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/EntitySearchService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/EntitySearchService.py new file mode 100644 index 00000000..f1dc1cb4 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/EntitySearchService.py @@ -0,0 +1,404 @@ +import abc +import re +from typing import Callable, Any +from cognite.client import CogniteClient +from cognite.client.data_classes.data_modeling import Node, NodeList, ViewId +from cognite.client.data_classes.filters import Filter, Equals, In +from services.LoggerService import CogniteFunctionLogger +from services.ConfigService import Config, ViewPropertyConfig + + +class IEntitySearchService(abc.ABC): + """ + Interface for services that find entities by text using various search strategies. + """ + + @abc.abstractmethod + def find_entity(self, text: str, annotation_type: str, entity_space: str) -> list[Node]: + """ + Finds entities matching the given text using multiple strategies.
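The normalization applied by the concrete implementation can be sketched standalone, assuming all three text-normalization flags are enabled (this mirrors the normalize() helper further below, not a new API):

```python
import re

def normalize(s: str) -> str:
    # Remove non-alphanumeric characters, lowercase, then strip leading zeros
    # from every digit run (the three text-normalization flags, all enabled).
    s = re.sub(r"[^a-zA-Z0-9]", "", s).lower()
    return re.sub(r"\d+", lambda m: str(int(m.group(0))), s)

print(normalize("V-0912"))     # v912
print(normalize("P&ID-0001"))  # pid1
```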
+ + Args: + text: Text to search for + annotation_type: Type of annotation being searched + entity_space: Space to search in for global fallback + + Returns: + List of matched Node objects + """ + pass + + +class EntitySearchService(IEntitySearchService): + """ + Finds entities by text using server-side filtering on entity aliases. + + This service queries entities directly using an IN filter on the aliases property, + which is more efficient than querying annotation edges: + + **Why query entities directly instead of annotation edges?** + - Entity dataset is smaller and stable (~1,000-10,000 entities) + - Annotation edges grow quadratically (Files × Entities = potentially millions) + - Neither startNodeText nor aliases properties are indexed + - Without indexes, smaller dataset = better performance + - Entity count doesn't increase as more files are annotated + + **Search Strategy:** + - Generate text variations (e.g., "V-0912" → ["V-0912", "v-0912", "V-912", "v912", ...]) + - Query entities with server-side IN filter on aliases property + - Uses text variations to handle different naming conventions + - Returns matches from specified entity space + + **Utilities:** + - `generate_text_variations()`: Creates common variations (case, leading zeros, special chars) + - `normalize()`: Normalizes text for cache keys (removes special chars, lowercase, strips zeros) + """ + + def __init__( + self, + config: Config, + client: CogniteClient, + logger: CogniteFunctionLogger, + ): + """ + Initializes the entity search service.
+ + Args: + config: Configuration object containing data model views and entity search settings + client: Cognite client + logger: Logger instance + + Raises: + ValueError: If regular_annotation_space (file_view.instance_space) is None + """ + self.client = client + self.logger = logger + self.config = config + + # Extract view IDs + self.core_annotation_view_id = config.data_model_views.core_annotation_view.as_view_id() + self.file_view_id = config.data_model_views.file_view.as_view_id() + self.target_entities_view_id = config.data_model_views.target_entities_view.as_view_id() + + # Extract regular annotation space + self.regular_annotation_space: str | None = config.data_model_views.file_view.instance_space + if not self.regular_annotation_space: + raise ValueError("regular_annotation_space (file_view.instance_space) is required but was None") + + # Extract text normalization config + self.text_normalization_config = config.promote_function.entity_search_service.text_normalization + + def find_entity(self, text: str, annotation_type: str, entity_space: str) -> list[Node]: + """ + Finds entities matching the given text by querying entity aliases. + + This is the main entry point for entity search. + + Strategy: + 1. Generate text variations (e.g., "V-0912" → ["V-0912", "v-0912", "V-912", "v912", ...]) + 2.
Query entities with server-side IN filter on aliases property + + Note: We query entities directly rather than annotation edges because: + - Entity dataset is smaller and more stable (~1,000-10,000 entities) + - Annotation edges grow quadratically (Files × Entities = potentially millions) + - Neither startNodeText nor aliases properties are indexed + - Without indexes, smaller dataset = better performance + + Args: + text: Text to search for (e.g., "V-123", "G18A-921") + annotation_type: Type of annotation ("diagrams.FileLink" or "diagrams.AssetLink") + entity_space: Space to search in + + Returns: + List of matched nodes: + - [] if no match found + - [node] if single unambiguous match + - [node1, node2] if ambiguous (multiple matches) + """ + # Generate text variations once + text_variations: list[str] = self.generate_text_variations(text) + self.logger.info(f"Generated {len(text_variations)} text variation(s) for '{text}': {text_variations}") + + # Determine which view to query based on annotation type + if annotation_type == "diagrams.FileLink": + source: ViewId = self.file_view_id + else: + source = self.target_entities_view_id + + # Query entities directly by aliases + found_nodes: list[Node] = self.find_global_entity(text_variations, source, entity_space) + + return found_nodes + + def find_from_existing_annotations(self, text_variations: list[str], annotation_type: str) -> list[Node]: + """ + [UNUSED] Searches for existing successful annotations with matching startNodeText.
+ + ** WHY THIS FUNCTION IS NOT USED: ** + While this was originally designed as a "smart" optimization to find proven matches, + it actually queries the LARGER dataset: + + - Annotation edges grow quadratically: O(Files × Entities) = potentially millions + - Entity/file nodes grow linearly: O(Entities) = thousands + - Neither startNodeText nor aliases properties are indexed + - Without indexes, querying the smaller dataset (entities) is always faster + + Performance comparison at scale: + - This function: Scans ~500,000+ annotation edges (grows over time) + - Global entity search: Scans ~1,000-10,000 entities (relatively stable) + + Result: Global entity search is 50-500x faster at scale. + + This function is kept for reference but should not be used in production. + + Args: + text_variations: List of text variations to search for (e.g., ["V-0912", "v-0912", "V-912", ...]) + annotation_type: "diagrams.FileLink" or "diagrams.AssetLink" + + Returns: + List of matched entity nodes (0, 1, or 2+ for ambiguous) + """ + # Use first text variation (original text) for logging + original_text: str = text_variations[0] if text_variations else "unknown" + + try: + # Query edges directly with IN filter + # These are annotation edges that are from regular diagram detect (not pattern mode) + # NOTE: manually promoted results from pattern mode are added to the + text_filter: Filter = In(self.core_annotation_view_id.as_property_ref("startNodeText"), text_variations) + edges: Any = self.client.data_modeling.instances.list( + instance_type="edge", + sources=[self.core_annotation_view_id], + filter=text_filter, + space=self.regular_annotation_space, # Where regular annotations live + limit=1000, # Reasonable limit + ) + + if not edges: + return [] + + # Count occurrences of each endNode + matched_end_nodes: dict[tuple[str, str], int] = {} # {(space, externalId): count} + for edge in edges: + # Check annotation type matches + edge_props: dict[str, Any] =
edge.properties.get(self.core_annotation_view_id, {}) + edge_type: Any = edge_props.get("type") + + if edge_type != annotation_type: + continue # Skip edges of different type + + # Extract endNode from the edge + end_node_ref: Any = edge.end_node + if end_node_ref: + key: tuple[str, str] = (end_node_ref.space, end_node_ref.external_id) + matched_end_nodes[key] = matched_end_nodes.get(key, 0) + 1 + + if not matched_end_nodes: + return [] + + # If multiple different endNodes found, it's ambiguous + top_matches: list[tuple[str, str]] + if len(matched_end_nodes) > 1: + self.logger.warning( + f"Found {len(matched_end_nodes)} different entities for '{original_text}' in existing annotations. " + f"This indicates data quality issues or legitimate ambiguity." + ) + # Return list of most common matches (limit to 2 for ambiguity detection) + sorted_matches: list[tuple[tuple[str, str], int]] = sorted( + matched_end_nodes.items(), key=lambda x: x[1], reverse=True + ) + top_matches = [match[0] for match in sorted_matches[:2]] + else: + # Single consistent match found + top_matches = [list(matched_end_nodes.keys())[0]] + + # Fetch the actual node objects for the matched entities + view_to_use: ViewId = ( + self.file_view_id if annotation_type == "diagrams.FileLink" else self.target_entities_view_id + ) + + matched_nodes: list[Node] = [] + for space, ext_id in top_matches: + retrieved: Any = self.client.data_modeling.instances.retrieve_nodes( + nodes=(space, ext_id), sources=view_to_use + ) + # Handle both single Node and NodeList returns + if retrieved: + if isinstance(retrieved, list): + matched_nodes.extend(retrieved) + else: + matched_nodes.append(retrieved) + + if matched_nodes: + self.logger.info( + f"Found {len(matched_nodes)} match(es) for '{original_text}' from existing annotations " + f"(appeared {matched_end_nodes.get((matched_nodes[0].space, matched_nodes[0].external_id), 0)} times)" + ) + + return matched_nodes + + except Exception as e: + self.logger.error(f"Error 
searching existing annotations for '{original_text}': {e}") + return [] + + def find_global_entity(self, text_variations: list[str], source: ViewId, entity_space: str) -> list[Node]: + """ + Performs a global, un-scoped search for an entity matching the given text variations. + Uses server-side IN filter with text variations to handle different naming conventions. + + This approach uses server-side filtering on the aliases property, making it efficient + and scalable even with large numbers of entities in a space. + + Args: + text_variations: List of text variations to search for (e.g., ["V-0912", "v-0912", "V-912", ...]) + source: View to query (file_view or target_entities_view) + entity_space: Space to search in + + Returns: + List of matched nodes (0, 1, or 2 for ambiguity detection) + """ + # Use first text variation (original text) for logging + original_text: str = text_variations[0] if text_variations else "unknown" + + try: + # Query entities with IN filter on aliases property + aliases_filter: Filter = In(source.as_property_ref("aliases"), text_variations) + + entities: Any = self.client.data_modeling.instances.list( + instance_type="node", + sources=source, + filter=aliases_filter, + space=entity_space, + limit=1000, # Reasonable limit to prevent timeouts + ) + + if not entities: + return [] + + # Convert to list and check for ambiguity + matched_entities: list[Node] = list(entities) + + if len(matched_entities) > 1: + self.logger.warning( + f"Found {len(matched_entities)} entities with aliases matching '{original_text}' in space '{entity_space}'. " + f"This is ambiguous. Returning first 2 for ambiguity detection." 
+ ) + return matched_entities[:2] + + if matched_entities: + self.logger.debug( + f"Found {len(matched_entities)} match(es) for '{original_text}' via global entity search" + ) + + return matched_entities + + except Exception as e: + self.logger.error(f"Error searching for entity '{original_text}' in space '{entity_space}': {e}") + return [] + + def generate_text_variations(self, text: str) -> list[str]: + """ + Generates common variations of a text string to improve matching. + + Respects text_normalization_config settings: + - removeSpecialCharacters: Generate variations without special characters + - convertToLowercase: Generate lowercase variations + - stripLeadingZeros: Generate variations with leading zeros removed + + Examples (all flags enabled): + "V-0912" → ["V-0912", "v-0912", "V-912", "v-912", "V0912", "v0912", "V912", "v912"] + "P&ID-001" → ["P&ID-001", "p&id-001", "P&ID-1", "p&id-1", "PID001", "pid001", "PID1", "pid1"] + + Examples (all flags disabled): + "V-0912" → ["V-0912"] # Only original + + Args: + text: Original text from pattern detection + + Returns: + List of text variations based on config settings + """ + # Helper function to strip leading zeros + def strip_leading_zeros_in_text(s: str) -> str: + return re.sub(r"\b0+(\d+)", r"\1", s) + + # Helper function to remove special characters + def remove_special_chars(s: str) -> str: + return re.sub(r"[^a-zA-Z0-9]", "", s) + + # Generate all combinations of transformations systematically, + # building up variations by applying each transformation flag + base_variations: set[str] = {text} # Always include the original + + # Apply removeSpecialCharacters transformations + if self.text_normalization_config.remove_special_characters: + new_variations: set[str] = set() + for v in base_variations: + new_variations.add(remove_special_chars(v)) + base_variations.update(new_variations) + + # Apply convertToLowercase transformations + if
self.text_normalization_config.convert_to_lowercase: + new_variations = set() + for v in base_variations: + new_variations.add(v.lower()) + base_variations.update(new_variations) + + # Apply stripLeadingZeros transformations + if self.text_normalization_config.strip_leading_zeros: + new_variations = set() + for v in base_variations: + new_variations.add(strip_leading_zeros_in_text(v)) + base_variations.update(new_variations) + + return list(base_variations) + + def normalize(self, s: str) -> str: + """ + Normalizes a string for comparison based on text_normalization_config settings. + + Applies transformations in sequence based on config: + 1. removeSpecialCharacters: Remove non-alphanumeric characters + 2. convertToLowercase: Convert to lowercase + 3. stripLeadingZeros: Remove leading zeros from number sequences + + Examples (all flags enabled): + "V-0912" -> "v912" + "FT-101A" -> "ft101a" + "P&ID-0001" -> "pid1" + + Examples (all flags disabled): + "V-0912" -> "V-0912" # No transformation + + Examples (only removeSpecialCharacters): + "V-0912" -> "V0912" # Special chars removed, case and zeros preserved + + Args: + s: String to normalize + + Returns: + Normalized string based on config settings + """ + if not isinstance(s, str): + return "" + + # Apply transformations based on config + if self.text_normalization_config.remove_special_characters: + s = re.sub(r"[^a-zA-Z0-9]", "", s) + + if self.text_normalization_config.convert_to_lowercase: + s = s.lower() + + if self.text_normalization_config.strip_leading_zeros: + # Define a replacer function that converts any matched number to an int and back to a string + def strip_leading_zeros(match): + # match.group(0) is the matched string (e.g., "0912") + return str(int(match.group(0))) + + # Apply the replacer function to all sequences of digits (\d+) in the string + s = re.sub(r"\d+", strip_leading_zeros, s) + + return s diff --git 
a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/LoggerService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/LoggerService.py new file mode 100644 index 00000000..17f24d6b --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/LoggerService.py @@ -0,0 +1,97 @@ +from typing import Literal +import os + + +class CogniteFunctionLogger: + def __init__( + self, + log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = "INFO", + write: bool = False, + filepath: str | None = None, + ): + self.log_level = log_level.upper() + self.write = write + self.filepath = filepath + self.file_handler = None + + if self.filepath and self.write: + try: + dir_name = os.path.dirname(self.filepath) + if dir_name: + os.makedirs(dir_name, exist_ok=True) + self.file_handler = open(self.filepath, "a", encoding="utf-8") + except Exception as e: + print(f"[LOGGER_SETUP_ERROR] Could not open log file {self.filepath}: {e}") + self.write = False + + def _format_message_lines(self, prefix: str, message: str) -> list[str]: + formatted_lines = [] + if "\n" not in message: + formatted_lines.append(f"{prefix} {message}") + else: + lines = message.split("\n") + formatted_lines.append(f"{prefix} {lines[0]}") + padding = " " * len(prefix) + for line_content in lines[1:]: + formatted_lines.append(f"{padding} {line_content}") + return formatted_lines + + def _print(self, prefix: str, message: str) -> None: + lines_to_log = self._format_message_lines(prefix, message) + if self.write and self.file_handler: + try: + for line in lines_to_log: + print(line) + self.file_handler.write(line + "\n") + self.file_handler.flush() + except Exception as e: + print(f"[LOGGER_WRITE_ERROR] Could not write to {self.filepath}: {e}") + else: + for line in lines_to_log: + print(line) + + def debug(self, message: str, section: Literal["START", "END", "BOTH"] |
None = None) -> None: + if section == "START" or section == "BOTH": + self._section() + if self.log_level == "DEBUG": + self._print("[DEBUG]", message) + if section == "END" or section == "BOTH": + self._section() + + def info(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + if section == "START" or section == "BOTH": + self._section() + if self.log_level in ("DEBUG", "INFO"): + self._print("[INFO]", message) + if section == "END" or section == "BOTH": + self._section() + + def warning(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + if section == "START" or section == "BOTH": + self._section() + if self.log_level in ("DEBUG", "INFO", "WARNING"): + self._print("[WARNING]", message) + if section == "END" or section == "BOTH": + self._section() + + def error(self, message: str, section: Literal["START", "END", "BOTH"] | None = None) -> None: + if section == "START" or section == "BOTH": + self._section() + self._print("[ERROR]", message) + if section == "END" or section == "BOTH": + self._section() + + def _section(self) -> None: + if self.write and self.file_handler: + self.file_handler.write( + "--------------------------------------------------------------------------------\n" + ) + print("--------------------------------------------------------------------------------") + + def close(self) -> None: + if self.file_handler: + try: + self.file_handler.close() + except Exception as e: + print(f"[LOGGER_CLEANUP_ERROR] Error closing log file: {e}") + self.file_handler = None diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/PromoteService.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/PromoteService.py new file mode 100644 index 00000000..19771b14 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/services/PromoteService.py @@ -0,0 +1,457 
@@ +import abc +import time +from typing import Any, Literal +from cognite.client import CogniteClient +from cognite.client.data_classes import RowWrite +from cognite.client.data_classes.data_modeling import ( + Edge, + EdgeId, + EdgeList, + EdgeApply, + Node, + NodeOrEdgeData, + DirectRelationReference, + NodeList, +) +from services.ConfigService import Config, build_filter_from_query, get_limit_from_query +from services.LoggerService import CogniteFunctionLogger +from services.CacheService import CacheService +from services.EntitySearchService import EntitySearchService +from utils.DataStructures import DiagramAnnotationStatus, PromoteTracker + + +class IPromoteService(abc.ABC): + """ + Interface for services that promote pattern-mode annotations by finding entities + and updating annotation edges. + """ + + @abc.abstractmethod + def run(self) -> Literal["Done"] | None: + """ + Main execution method for promoting pattern-mode annotations. + + Returns: + "Done" if no more candidates need processing, None if processing should continue. + """ + pass + + +class GeneralPromoteService(IPromoteService): + """ + Promotes pattern-mode annotations by finding matching entities and updating annotation edges. + + This service retrieves candidate pattern-mode annotations (edges pointing to sink node), + searches for matching entities using EntitySearchService (with caching via CacheService), + and updates both the data model edges and RAW tables with the results. + + Pattern-mode annotations are created during diagram detection when entities can't be + matched to the provided entity list but match regex patterns. This service attempts to + resolve those annotations by searching existing annotations and entity aliases. 
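The deduplication step that run() performs, grouping candidate edges by (startNodeText, annotation type) so each unique pair triggers only one entity search, can be sketched with plain tuples standing in for Edge objects:

```python
from collections import defaultdict

# (startNodeText, annotation type) pairs standing in for candidate edges.
candidates = [
    ("V-0912", "diagrams.AssetLink"),
    ("V-0912", "diagrams.AssetLink"),
    ("DOC-001", "diagrams.FileLink"),
]

grouped: dict[tuple[str, str], list[tuple[str, str]]] = defaultdict(list)
for text, ann_type in candidates:
    grouped[(text, ann_type)].append((text, ann_type))

# Three candidates collapse into two unique text/type combinations,
# so one entity search is avoided.
print(len(grouped), len(candidates) - len(grouped))  # 2 1
```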
+ """ + + def __init__( + self, + client: CogniteClient, + config: Config, + logger: CogniteFunctionLogger, + tracker: PromoteTracker, + entity_search_service: EntitySearchService, + cache_service: CacheService, + ): + """ + Initialize the promote service with required dependencies. + + Args: + client: CogniteClient for API interactions + config: Configuration object containing data model views and settings + logger: Logger instance for tracking execution + tracker: Performance tracker for metrics (edges promoted/rejected/ambiguous) + entity_search_service: Service for finding entities by text (injected) + cache_service: Service for caching text→entity mappings (injected) + """ + self.client = client + self.config = config + self.logger = logger + self.tracker = tracker + self.core_annotation_view = self.config.data_model_views.core_annotation_view + self.file_view = self.config.data_model_views.file_view + self.target_entities_view = self.config.data_model_views.target_entities_view + + # Sink node reference (from finalize_function config as it's shared) + self.sink_node_ref = DirectRelationReference( + space=self.config.finalize_function.apply_service.sink_node.space, + external_id=self.config.finalize_function.apply_service.sink_node.external_id, + ) + + # RAW database and table configuration + self.raw_db = self.config.promote_function.raw_db + self.raw_pattern_table = self.config.promote_function.raw_table_doc_pattern + self.raw_doc_doc_table = self.config.promote_function.raw_table_doc_doc + self.raw_doc_tag_table = self.config.promote_function.raw_table_doc_tag + + # Promote flags + self.delete_rejected_edges: bool = self.config.promote_function.delete_rejected_edges + self.delete_suggested_edges: bool = self.config.promote_function.delete_suggested_edges + + # Injected service dependencies + self.entity_search_service = entity_search_service + self.cache_service = cache_service + + def run(self) -> Literal["Done"] | None: + """ + Main execution method for
promoting pattern-mode annotations. + + Process flow: + 1. Retrieve candidate edges (pattern-mode annotations not yet promoted) + 2. Group candidates by (text, type) for deduplication + 3. For each unique text/type: + - Check cache for previous results + - Search for matching entity via EntitySearchService + - Update cache with results + 4. Prepare edge and RAW table updates + 5. Apply updates to data model and RAW tables + + Args: + None + + Returns: + "Done" if no candidates found (processing complete), + None if candidates were processed (more batches may exist). + + Raises: + Exception: Any unexpected errors during processing are logged and re-raised. + """ + self.logger.info("Starting Promote batch", section="START") + + try: + candidates: EdgeList | None = self._get_promote_candidates() + if not candidates: + self.logger.info("No Promote candidates found.", section="END") + return "Done" + except Exception as e: + self.logger.error(f"Ran into the following error: {str(e)}") + self.logger.info("Retrying in 15 seconds") + time.sleep(15) + return + + self.logger.info(f"Found {len(candidates)} Promote candidates. 
Starting processing.")
+
+        # Group candidates by (startNodeText, annotationType) for deduplication
+        grouped_candidates: dict[tuple[str, str], list[Edge]] = {}
+        for edge in candidates:
+            properties: dict[str, Any] = edge.properties[self.core_annotation_view.as_view_id()]
+            text: Any = properties.get("startNodeText")
+            annotation_type: str = edge.type.external_id
+
+            if text and annotation_type:
+                key: tuple[str, str] = (text, annotation_type)
+                if key not in grouped_candidates:
+                    grouped_candidates[key] = []
+                grouped_candidates[key].append(edge)
+
+        self.logger.info(
+            message=f"Grouped {len(candidates)} candidates into {len(grouped_candidates)} unique text/type combinations.",
+        )
+        self.logger.debug(
+            message=f"Deduplication savings: {len(candidates) - len(grouped_candidates)} queries avoided.",
+            section="END",
+        )
+
+        edges_to_update: list[EdgeApply] = []
+        raw_rows_to_update: list[RowWrite] = []
+        # TODO: think about whether we need to delete the corresponding RAW row of edges that we delete OR if it should be placed in another RAW table when rejected
+        # raw_rows_to_delete: list[RowWrite] = []
+        edges_to_delete: list[EdgeId] = []
+
+        # Track results for this batch
+        batch_promoted: int = 0
+        batch_rejected: int = 0
+        batch_ambiguous: int = 0
+
+        try:
+            # Process each unique text/type combination once
+            for (text_to_find, annotation_type), edges_with_same_text in grouped_candidates.items():
+                entity_space: str | None = (
+                    self.file_view.instance_space
+                    if annotation_type == "diagrams.FileLink"
+                    else self.target_entities_view.instance_space
+                )
+
+                if not entity_space:
+                    self.logger.warning(f"Could not determine entity space for type '{annotation_type}'. 
Skipping.")
+                    continue
+
+                # Strategy: Check cache → query edges → fallback to global search
+                found_nodes: list[Node] | list = self._find_entity_with_cache(
+                    text_to_find, annotation_type, entity_space
+                )
+
+                # Determine result type for tracking AND deletion decision
+                num_edges: int = len(edges_with_same_text)
+                should_delete: bool = False
+
+                if len(found_nodes) == 1:
+                    batch_promoted += num_edges
+                    should_delete = False  # Never delete promoted edges
+                elif len(found_nodes) == 0:
+                    batch_rejected += num_edges
+                    should_delete = self.delete_rejected_edges
+                else:  # Multiple matches
+                    batch_ambiguous += num_edges
+                    should_delete = self.delete_suggested_edges
+
+                # Apply the same result to ALL edges with this text
+                for edge in edges_with_same_text:
+                    edge_apply, raw_row = self._prepare_edge_update(edge, found_nodes)
+
+                    if should_delete:
+                        # Delete the edge but still update RAW row to track what happened
+                        edges_to_delete.append(EdgeId(edge.space, edge.external_id))
+                        if raw_row is not None:
+                            raw_rows_to_update.append(raw_row)
+                    else:
+                        # Update both edge and RAW row
+                        if edge_apply is not None:
+                            edges_to_update.append(edge_apply)
+                        if raw_row is not None:
+                            raw_rows_to_update.append(raw_row)
+        finally:
+            # Update tracker with batch results
+            self.tracker.add_edges(promoted=batch_promoted, rejected=batch_rejected, ambiguous=batch_ambiguous)
+
+        if edges_to_update:
+            self.client.data_modeling.instances.apply(edges=edges_to_update)
+            self.logger.info(
+                f"Successfully updated {len(edges_to_update)} edges in data model:\n"
+                f"  ├─ Promoted: {batch_promoted}\n"
+                f"  ├─ Rejected: {batch_rejected}\n"
+                f"  └─ Ambiguous: {batch_ambiguous}",
+                section="BOTH",
+            )
+
+        if edges_to_delete:
+            self.client.data_modeling.instances.delete(edges=edges_to_delete)
+            self.logger.info(f"Successfully deleted {len(edges_to_delete)} edges from data model.", section="END")
+
+        if raw_rows_to_update:
+            self.client.raw.rows.insert(
+                db_name=self.raw_db,
+                
table_name=self.raw_pattern_table, + row=raw_rows_to_update, + ensure_parent=True, + ) + self.logger.info(f"Successfully updated {len(raw_rows_to_update)} rows in RAW table.", section="END") + + if not edges_to_update and not edges_to_delete and not raw_rows_to_update: + self.logger.info("No edges were updated in this run.", section="END") + + return None # Continue running if more candidates might exist + + def _get_promote_candidates(self) -> EdgeList | None: + """ + Retrieves pattern-mode annotation edges that are candidates for promotion. + + Uses query configuration from promote_function config if available, otherwise falls back + to hardcoded filter for backward compatibility. + + Default query criteria (when no config): + - End node is the sink node (placeholder for unresolved entities) + - Status is "Suggested" (not yet approved/rejected) + - Tags do not contain "PromoteAttempted" (haven't been processed yet) + + Args: + None + + Returns: + EdgeList of candidate edges, or None if no candidates found. + Limited by getCandidatesQuery.limit (default 500 if -1/unlimited). 
+ """ + # Use query config if available + if self.config.promote_function and self.config.promote_function.get_candidates_query: + query_filter = build_filter_from_query(self.config.promote_function.get_candidates_query) + limit = get_limit_from_query(self.config.promote_function.get_candidates_query) + # If limit is -1 (unlimited), use sensible default + if limit == -1: + limit = 500 + else: + # Backward compatibility: hardcoded filter + query_filter = { + "and": [ + {"equals": {"property": ["edge", "space"], "value": self.sink_node_ref.space}}, + {"equals": {"property": self.core_annotation_view.as_property_ref("status"), "value": "Suggested"}}, + { + "not": { + "containsAny": { + "property": self.core_annotation_view.as_property_ref("tags"), + "values": ["PromoteAttempted"], + } + } + }, + ] + } + limit = 500 # Default batch size + + return self.client.data_modeling.instances.list( + instance_type="edge", + sources=[self.core_annotation_view.as_view_id()], + filter=query_filter, + limit=limit, + space=self.sink_node_ref.space + ) + + def _find_entity_with_cache(self, text: str, annotation_type: str, entity_space: str) -> list[Node] | list: + """ + Finds entity for text using multi-tier caching strategy. 
+
+        Caching strategy (fastest to slowest):
+        - TIER 1: In-memory cache (this run only, in-memory dictionary)
+        - TIER 2: Persistent RAW cache (all runs, single database query)
+        - TIER 3: EntitySearchService (edge lookup, then global entity search with server-side IN filter on aliases)
+
+        Caching behavior:
+        - Only caches unambiguous single matches (len(found_nodes) == 1)
+        - Caches negative results (no match found) to avoid repeated lookups
+        - Does NOT cache ambiguous results (multiple matches)
+
+        Args:
+            text: Text to search for (e.g., "V-123", "G18A-921")
+            annotation_type: Type of annotation ("diagrams.FileLink" or "diagrams.AssetLink")
+            entity_space: Space to search in for global fallback
+
+        Returns:
+            List of matched Node objects:
+            - Empty list [] if no match found
+            - Single-element list [node] if unambiguous match
+            - Two-element list [node1, node2] if ambiguous (data quality issue)
+        """
+        # TIER 1 & 2: Check cache (in-memory + persistent)
+        cached_node: Node | None = self.cache_service.get(text, annotation_type)
+        if cached_node is not None:
+            return [cached_node]
+
+        # Check if we've already determined there's no match
+        # (negative caching is handled internally by cache service)
+        if self.cache_service.get_from_memory(text, annotation_type) is None:
+            # We've checked this before in this run and found nothing
+            if (text, annotation_type) in self.cache_service._memory_cache:
+                return []
+
+        # TIER 3: Use EntitySearchService (edge lookup → global search)
+        found_nodes: list[Node] = self.entity_search_service.find_entity(text, annotation_type, entity_space)
+
+        # Update cache based on result
+        if found_nodes and len(found_nodes) == 1:
+            # Unambiguous match - cache it
+            self.cache_service.set(text, annotation_type, found_nodes[0])
+        elif not found_nodes:
+            # No match - cache negative result
+            self.cache_service.set(text, annotation_type, None)
+        # Don't cache ambiguous results (len > 1)
+
+        return found_nodes
+
+    def _prepare_edge_update(
+        self, edge: Edge, found_nodes: 
list[Node] | list
+    ) -> tuple[EdgeApply | None, RowWrite | None]:
+        """
+        Prepares updates for both data model edge and RAW table based on entity search results.
+
+        Handles three scenarios:
+        1. Single match (len==1): Mark as "Approved", point edge to entity, add "PromotedAuto" tag
+        2. No match (len==0): Mark as "Rejected", keep pointing to sink, add "PromoteAttempted" tag
+        3. Ambiguous (len>=2): Keep "Suggested", add "PromoteAttempted" and "AmbiguousMatch" tags
+
+        For all cases:
+        - Retrieves existing RAW row to preserve all data
+        - Updates edge properties (status, tags, endNode if match found)
+        - Updates RAW row with same changes
+        - Returns both for atomic update
+
+        Args:
+            edge: The annotation edge to update (pattern-mode annotation)
+            found_nodes: List of matched entity nodes from entity search
+                - [] = no match
+                - [node] = single unambiguous match
+                - [node1, node2] = ambiguous (multiple matches)
+
+        Returns:
+            Tuple of (EdgeApply, RowWrite):
+            - EdgeApply: Edge update for data model (always returned)
+            - RowWrite: Row update for RAW table, or None if no RAW data could be assembled.
+        """
+        # Get the current edge properties before creating the write version
+        edge_props: Any = edge.properties.get(self.core_annotation_view.as_view_id(), {})
+        current_tags: Any = edge_props.get("tags", [])
+        updated_tags: list[str] = list(current_tags) if isinstance(current_tags, list) else []
+
+        # Now create the write version
+        edge_apply: EdgeApply = edge.as_write()
+
+        # Fetch existing RAW row to preserve all data
+        raw_data: dict[str, Any] = {}
+        try:
+            existing_row: Any = self.client.raw.rows.retrieve(
+                db_name=self.raw_db, table_name=self.raw_pattern_table, key=edge.external_id
+            )
+            if existing_row and existing_row.columns:
+                raw_data = {k: v for k, v in existing_row.columns.items()}
+        except Exception as e:
+            self.logger.warning(f"Could not retrieve RAW row for edge {edge.external_id}: {e}")
+
+        # Prepare update properties for the edge
+        update_properties: dict[str, Any] = {}
+
+        if len(found_nodes) == 1:  # Success - single match found
+            matched_node: Node = found_nodes[0]
+            self.logger.info(
+                f"✓ Found single match for '{edge_props.get('startNodeText')}' → {matched_node.external_id}. \n\t- Promoting edge: ({edge.space}, {edge.external_id})\n\t- Start node: ({edge.start_node.space}, {edge.start_node.external_id})."
+            )
+
+            # Update edge to point to the found entity
+            edge_apply.end_node = DirectRelationReference(matched_node.space, matched_node.external_id)
+            update_properties["status"] = DiagramAnnotationStatus.APPROVED.value
+            updated_tags.append("PromotedAuto")
+
+            # Update RAW row with new end node information
+            raw_data["endNode"] = matched_node.external_id
+            raw_data["endNodeSpace"] = matched_node.space
+            raw_data["status"] = DiagramAnnotationStatus.APPROVED.value
+
+            # Get resource type from the matched entity
+            entity_props: Any = matched_node.properties.get(self.target_entities_view.as_view_id(), {})
+            resource_type: Any = entity_props.get("resourceType") or entity_props.get("type")
+            if resource_type:
+                raw_data["endNodeResourceType"] = resource_type
+
+        elif len(found_nodes) == 0:  # Failure - no match found
+            self.logger.info(
+                f"✗ No match found for '{edge_props.get('startNodeText')}'.\n\t- Rejecting edge: ({edge.space}, {edge.external_id})\n\t- Start node: ({edge.start_node.space}, {edge.start_node.external_id})."
+            )
+            update_properties["status"] = DiagramAnnotationStatus.REJECTED.value
+            updated_tags.append("PromoteAttempted")
+
+            # Update RAW row status
+            raw_data["status"] = DiagramAnnotationStatus.REJECTED.value
+
+        else:  # Ambiguous - multiple matches found
+            self.logger.info(
+                f"⚠ Multiple matches found for '{edge_props.get('startNodeText')}'.\n\t- Ambiguous edge: ({edge.space}, {edge.external_id})\n\t- Start node: ({edge.start_node.space}, {edge.start_node.external_id})."
+ ) + updated_tags.extend(["PromoteAttempted", "AmbiguousMatch"]) + + # Don't change status, just add tags to RAW + raw_data["status"] = edge_props.get("status", DiagramAnnotationStatus.SUGGESTED.value) + + # Update edge properties + update_properties["tags"] = updated_tags + raw_data["tags"] = updated_tags + edge_apply.sources[0] = NodeOrEdgeData( + source=self.core_annotation_view.as_view_id(), properties=update_properties + ) + + # Create RowWrite object for RAW table update + raw_row: RowWrite | None = RowWrite(key=edge.external_id, columns=raw_data) if raw_data else None + + return edge_apply, raw_row diff --git a/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/utils/DataStructures.py b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/utils/DataStructures.py new file mode 100644 index 00000000..3f817694 --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/functions/fn_file_annotation_promote/utils/DataStructures.py @@ -0,0 +1,442 @@ +from dataclasses import dataclass, asdict, field +from typing import Literal, cast +from enum import Enum +from datetime import datetime, timezone, timedelta + +from cognite.client.data_classes.data_modeling import ( + Node, + NodeId, + NodeApply, + NodeOrEdgeData, + ViewId, +) +from cognite.client.data_classes.contextualization import ( + FileReference, +) + + +@dataclass +class EnvConfig: + """ + Data structure holding the configs to connect to CDF client locally + """ + + cdf_project: str + cdf_cluster: str + tenant_id: str + client_id: str + client_secret: str + + +class DiagramAnnotationStatus(str, Enum): + SUGGESTED = "Suggested" + APPROVED = "Approved" + REJECTED = "Rejected" + + +class AnnotationStatus(str, Enum): + """ + Defines the types of values that the annotationStatus property can be for the Annotation State Instances. 
+    Inherits from 'str' so that the enum members are also string instances,
+    making them directly usable where a string is expected (e.g., serialization).
+    """
+
+    NEW = "New"
+    RETRY = "Retry"
+    PROCESSING = "Processing"
+    FINALIZING = "Finalizing"
+    ANNOTATED = "Annotated"
+    FAILED = "Failed"
+
+
+class FilterOperator(str, Enum):
+    """
+    Defines the types of filter operations that can be specified in the configuration.
+    Inherits from 'str' so that the enum members are also string instances,
+    making them directly usable where a string is expected (e.g., serialization).
+    """
+
+    EQUALS = "Equals"  # Checks for equality against a single value.
+    EXISTS = "Exists"  # Checks if a property exists (is not null).
+    CONTAINSALL = "ContainsAll"  # Checks if an item contains all specified values for a given property
+    IN = "In"  # Checks if a value is within a list of specified values. Not implementing CONTAINSANY b/c IN is usually more suitable
+    SEARCH = "Search"  # Performs full text search on a specified property
+
+
+@dataclass
+class AnnotationState:
+    """
+    Data structure holding the mpcAnnotationState view properties. Time strings are converted to Timestamps when ingested into CDF.
+ """ + + annotationStatus: AnnotationStatus + linkedFile: dict[str, str] = field(default_factory=dict) + attemptCount: int = 0 + annotationMessage: str | None = None + diagramDetectJobId: int | None = None + sourceCreatedTime: str = field( + default_factory=lambda: datetime.now(timezone.utc).replace(microsecond=0).isoformat() + ) + sourceUpdatedTime: str = field( + default_factory=lambda: datetime.now(timezone.utc).replace(microsecond=0).isoformat() + ) + sourceCreatedUser: str = "fn_dm_context_annotation_launch" + sourceUpdatedUser: str = "fn_dm_context_annotation_launch" + + def _create_external_id(self) -> str: + """ + Create a deterministic external ID so that we can replace mpcAnnotationState of files that have been updated and aren't new + """ + prefix = "an_state" + linked_file_space = self.linkedFile["space"] + linked_file_id = self.linkedFile["externalId"] + return f"{prefix}_{linked_file_space}_{linked_file_id}" + + def to_dict(self) -> dict: + return asdict(self) + + def to_node_apply(self, node_space: str, annotation_state_view: ViewId) -> NodeApply: + external_id: str = self._create_external_id() + + return NodeApply( + space=node_space, + external_id=external_id, + sources=[ + NodeOrEdgeData( + source=annotation_state_view, + properties=self.to_dict(), + ) + ], + ) + + +@dataclass +class FileProcessingBatch: + primary_scope_value: str + secondary_scope_value: str | None + files: list[Node] + + +@dataclass +class entity: + """ + data structure for the 'entities' fed into diagram detect, + { + "external_id": file.external_id, + "name": file.properties[job_config.file_view.as_view_id()]["name"], + "space": file.space, + "annotation_type": job_config.file_view.type, + "resource_type": file.properties[job_config.file_view.as_view_id()][{resource_type}], + "search_property": file.properties[job_config.file_view.as_view_id()][{search_property}], + } + """ + + external_id: str + name: str + space: str + annotation_type: Literal["diagrams.FileLink", 
"diagrams.AssetLink"] | None
+    resource_type: str
+    search_property: list[str] = field(default_factory=list)
+
+    def to_dict(self):
+        return asdict(self)
+
+
+@dataclass
+class BatchOfNodes:
+    nodes: list[Node] = field(default_factory=list)
+    ids: list[NodeId] = field(default_factory=list)
+    apply: list[NodeApply] = field(default_factory=list)
+
+    def add(self, node: Node):
+        self.nodes.append(node)
+        node_id = node.as_id()
+        self.ids.append(node_id)
+        return
+
+    def clear(self):
+        self.nodes.clear()
+        self.ids.clear()
+        self.apply.clear()
+        return
+
+    def update_node_properties(self, new_properties: dict, view_id: ViewId):
+        for node in self.nodes:
+            node_apply = NodeApply(
+                space=node.space,
+                external_id=node.external_id,
+                existing_version=None,
+                sources=[
+                    NodeOrEdgeData(
+                        source=view_id,
+                        properties=new_properties,
+                    )
+                ],
+            )
+            self.apply.append(node_apply)
+        return
+
+
+@dataclass
+class BatchOfPairedNodes:
+    """
+    Where nodeA is an instance of the file view and nodeB is an instance of the annotation state view
+    """
+
+    file_to_state_map: dict[NodeId, Node]
+    batch_files: BatchOfNodes = field(default_factory=BatchOfNodes)
+    batch_states: BatchOfNodes = field(default_factory=BatchOfNodes)
+    file_references: list[FileReference] = field(default_factory=list)
+
+    def add_pair(self, file_node: Node, file_reference: FileReference):
+        self.file_references.append(file_reference)
+        self.batch_files.add(file_node)
+        file_node_id: NodeId = file_node.as_id()
+        state_node: Node = self.file_to_state_map[file_node_id]
+        self.batch_states.add(state_node)
+
+    def create_file_reference(
+        self,
+        file_node_id: NodeId,
+        page_range: int,
+        annotation_state_view_id: ViewId,
+    ) -> FileReference:
+        """
+        Create a file reference that has a page range for annotation.
+        The current implementation of the detect API (20230101-beta) only allows annotation of files up to 50 pages.
+        Thus, this is my idea of how we can enable annotating files that are more than 50 pages long.
+
+        The annotatedPageCount and pageCount properties won't be set in the initial creation of the annotation state nodes.
+        That's because we don't know how many pages are in the pdf until we run the diagram detect job, which returns the page count in its results.
+        Thus, annotatedPageCount and pageCount get set in the finalize function.
+        The finalize function will set the page count properties based on the page count returned from the diagram detect job results.
+        - If the pdf has less than 50 pages, say 3 pages, then...
+            - annotationStatus property will get set to 'Annotated'
+            - annotatedPageCount and pageCount properties will be set to 3.
+        - Elif the pdf has more than 50 pages, say 80, then...
+            - annotationStatus property will get set to 'New'
+            - annotatedPageCount set to 50
+            - pageCount set to 80
+            - attemptCount doesn't get incremented
+
+        NOTE: Chose to create the file_reference here b/c I already have access to the file node and state node.
+        If I chose to have this logic in the launchService then we'd have to iterate on all of the nodes that have already been added.
+        Thus -> O(N) + O(N) to create the BatchOfPairedNodes and then to create the file references
+        Instead, this approach makes it just O(N)
+        """
+        annotation_state_node: Node = self.file_to_state_map[file_node_id]
+        annotated_page_count: int | None = cast(
+            int,
+            annotation_state_node.properties[annotation_state_view_id].get("annotatedPageCount"),
+        )
+        page_count: int | None = cast(
+            int,
+            annotation_state_node.properties[annotation_state_view_id].get("pageCount"),
+        )
+        if not annotated_page_count or not page_count:
+            file_reference: FileReference = FileReference(
+                file_instance_id=file_node_id,
+                first_page=1,
+                last_page=page_range,
+            )
+        else:
+            # NOTE: adding 1 here since the annotated_page_count variable holds the last page that was annotated. Thus we want to annotate the following page
+            # e.g.) 
first run annotates pages 1-50 second run would annotate 51-100 + first_page = annotated_page_count + 1 + last_page = annotated_page_count + page_range + if page_count <= last_page: + last_page = page_count + file_reference: FileReference = FileReference( + file_instance_id=file_node_id, + first_page=first_page, + last_page=last_page, + ) + + return file_reference + + def clear_pair(self): + self.batch_files.clear() + self.batch_states.clear() + self.file_references.clear() + + def size(self) -> int: + return len(self.file_references) + + def is_empty(self) -> bool: + if self.file_references: + return False + return True + + +@dataclass +class PerformanceTracker: + """ + Keeps track of metrics + """ + + files_success: int = 0 + files_failed: int = 0 + total_runs: int = 0 + total_time_delta: timedelta = timedelta(0) + latest_run_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc)) + + def _run_time(self) -> timedelta: + time_delta = datetime.now(timezone.utc) - self.latest_run_time + return time_delta + + def _average_run_time(self) -> timedelta: + if self.total_runs == 0: + return timedelta(0) + return self.total_time_delta / self.total_runs + + def add_files(self, success: int, failed: int = 0): + self.files_success += success + self.files_failed += failed + + def generate_local_report(self) -> str: + self.total_runs += 1 + time_delta = self._run_time() + self.total_time_delta += time_delta + self.latest_run_time = datetime.now(timezone.utc) + + report = f"run time: {time_delta}" + return report + + def generate_overall_report(self) -> str: + report = f" Run started {datetime.now(timezone.utc)}\n- total runs: {self.total_runs}\n- total files processed: {self.files_success+self.files_failed}\n- successful files: {self.files_success}\n- failed files: {self.files_failed}\n- total run time: {self.total_time_delta}\n- average run time: {self._average_run_time()}" + return report + + def generate_ep_run( + self, + caller: Literal["Launch", 
"Finalize"], + function_id: str | None, + call_id: str | None, + ) -> str: + """Generates the report string for the extraction pipeline run.""" + report = ( + f"(caller:{caller}, function_id:{function_id}, call_id:{call_id}) - " + f"total files processed: {self.files_success + self.files_failed} - " + f"successful files: {self.files_success} - " + f"failed files: {self.files_failed}" + ) + return report + + def reset(self) -> None: + self.files_success = 0 + self.files_failed = 0 + self.total_runs: int = 0 + self.total_time_delta = timedelta(0) + self.latest_run_time = datetime.now(timezone.utc) + print("PerformanceTracker state has been reset") + + +@dataclass +class PromoteTracker: + """ + Tracks metrics for the promote function. + + Metrics: + - edges_promoted: Edges successfully promoted (single match found) + - edges_rejected: Edges rejected (no match found) + - edges_ambiguous: Edges with ambiguous matches (multiple entities found) + - total_runs: Number of batches processed + - total_time_delta: Cumulative runtime + """ + + edges_promoted: int = 0 + edges_rejected: int = 0 + edges_ambiguous: int = 0 + total_runs: int = 0 + total_time_delta: timedelta = field(default_factory=lambda: timedelta(0)) + latest_run_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc)) + + def _run_time(self) -> timedelta: + """Calculates time since last run started.""" + time_delta: timedelta = datetime.now(timezone.utc) - self.latest_run_time + return time_delta + + def _average_run_time(self) -> timedelta: + """Calculates average time per batch.""" + if self.total_runs == 0: + return timedelta(0) + return self.total_time_delta / self.total_runs + + def add_edges(self, promoted: int = 0, rejected: int = 0, ambiguous: int = 0) -> None: + """ + Adds edge counts to the tracker. 
+
+        Args:
+            promoted: Number of edges successfully promoted
+            rejected: Number of edges rejected (no match)
+            ambiguous: Number of edges with ambiguous matches
+        """
+        self.edges_promoted += promoted
+        self.edges_rejected += rejected
+        self.edges_ambiguous += ambiguous
+
+    def generate_local_report(self) -> str:
+        """
+        Generates a report for the current batch.
+
+        Returns:
+            String report with run time
+        """
+        self.total_runs += 1
+        time_delta: timedelta = self._run_time()
+        self.total_time_delta += time_delta
+        self.latest_run_time = datetime.now(timezone.utc)
+
+        report: str = f"Batch run time: {time_delta}"
+        return report
+
+    def generate_overall_report(self) -> str:
+        """
+        Generates a comprehensive report for all runs.
+
+        Returns:
+            String report with all metrics
+        """
+        total_edges: int = self.edges_promoted + self.edges_rejected + self.edges_ambiguous
+        report: str = (
+            f"Promote Function Summary\n"
+            f"- Total runs: {self.total_runs}\n"
+            f"- Total edges processed: {total_edges}\n"
+            f"  ├─ Promoted (auto): {self.edges_promoted}\n"
+            f"  ├─ Rejected (no match): {self.edges_rejected}\n"
+            f"  └─ Ambiguous (multiple matches): {self.edges_ambiguous}\n"
+            f"- Total run time: {self.total_time_delta}\n"
+            f"- Average run time: {self._average_run_time()}"
+        )
+        return report
+
+    def generate_ep_run(self, function_id: str | None, call_id: str | None) -> str:
+        """
+        Generates a report string for extraction pipeline logging.
+ + Args: + function_id: Cognite Function ID + call_id: Cognite Function call ID + + Returns: + String report for extraction pipeline + """ + total_edges: int = self.edges_promoted + self.edges_rejected + self.edges_ambiguous + report: str = ( + f"(caller:Promote, function_id:{function_id}, call_id:{call_id}) - " + f"total edges processed: {total_edges} - " + f"promoted: {self.edges_promoted} - " + f"rejected: {self.edges_rejected} - " + f"ambiguous: {self.edges_ambiguous}" + ) + return report + + def reset(self) -> None: + """Resets all tracker metrics to initial state.""" + self.edges_promoted = 0 + self.edges_rejected = 0 + self.edges_ambiguous = 0 + self.total_runs = 0 + self.total_time_delta = timedelta(0) + self.latest_run_time = datetime.now(timezone.utc) + print("PromoteTracker state has been reset") diff --git a/modules/contextualization/cdf_file_annotation/functions/functions.Function.yaml b/modules/contextualization/cdf_file_annotation/functions/functions.Function.yaml index 11e711bf..f6b00037 100644 --- a/modules/contextualization/cdf_file_annotation/functions/functions.Function.yaml +++ b/modules/contextualization/cdf_file_annotation/functions/functions.Function.yaml @@ -1,3 +1,13 @@ +- name: Prepare File Annotations + externalId: {{ prepareFunctionExternalId }} + owner: "Anonymous" + description: "Create annotation state instances for files marked ToAnnotate." + metadata: + version: {{ prepareFunctionVersion }} + + runtime: "py311" + functionPath: "handler.py" + - name: Launch File Annotations externalId: {{ launchFunctionExternalId }} owner: "Anonymous" @@ -18,3 +28,13 @@ runtime: "py311" functionPath: "handler.py" +- name: Promote File Annotations + externalId: {{ promoteFunctionExternalId }} + owner: "Anonymous" + description: "Automatically promote suggested pattern mode annotations created by the finalize function if it exists." 
+      metadata:
+        version: {{ promoteFunctionVersion }}
+
+      runtime: "py311"
+      functionPath: "handler.py"
+
diff --git a/modules/contextualization/cdf_file_annotation/local_setup/quickstart_setup.ipynb b/modules/contextualization/cdf_file_annotation/local_setup/quickstart_setup.ipynb
index 195933a6..fbccf141 100644
--- a/modules/contextualization/cdf_file_annotation/local_setup/quickstart_setup.ipynb
+++ b/modules/contextualization/cdf_file_annotation/local_setup/quickstart_setup.ipynb
@@ -16,9 +16,12 @@
     "\n",
     "from cognite.client.data_classes.data_modeling import (\n",
     "    Node,\n",
+    "    NodeId,\n",
     "    NodeList,\n",
     "    NodeApplyList,\n",
     "    ViewId,\n",
+    "    NodeApply,\n",
+    "    NodeOrEdgeData,\n",
     ")"
    ]
   },
@@ -101,7 +104,7 @@
    "outputs": [],
    "source": [
     "# Replace the value of organization with the one used in config..yaml\n",
-    "organization: str = \"tx\"\n",
+    "organization: str = \"\"  # <-- fill in your organization before running\n",
     "file_view_name: str = f\"{organization}File\"\n",
     "\n",
     "# Create a view class\n",
@@ -122,7 +125,7 @@
    "source": [
     "# retrieve instances of txFile\n",
     "files: NodeList[Node] = cdf_client.data_modeling.instances.list(instance_type=\"node\", sources=file_view.as_view_id(), limit=-1)\n",
-    "print(files[0])"
+    "print(files[1])"
    ]
   },
   {
@@ -137,6 +140,11 @@
     "\n",
     "for file in file_node_apply_list:\n",
     "    file.sources[0].properties[\"tags\"] = [\"ToAnnotate\", \"DetectInDiagrams\"]\n",
+    "    alias = []\n",
+    "    name = file.sources[0].properties[\"name\"]\n",
+    "    alias.append(name.replace(\".pdf\", \"\"))\n",
+    "    file.sources[0].properties[\"aliases\"] = alias\n",
+    "\n",
     "print(file_node_apply_list[0])"
    ]
   },
@@ -156,7 +164,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "701df86b",
+   "id": "9579ff52",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -167,65 +175,84 @@
     "    external_id=equipment_view_name,\n",
     "    version=\"v1\",\n",
     "    instance_space=\"springfield_instances\",\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "9c40f234",
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    ")\n",
+    "\n",
     "# retrieve instances of txEquipment\n",
-    "equipments: NodeList[Node] = cdf_client.data_modeling.instances.list(instance_type=\"node\", sources=txEquipment_view.as_view_id(), limit=-1)\n",
-    "print(equipments[0])"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "969ae758",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "equipment_node_apply_list: NodeApplyList = equipments.as_write()\n",
+    "equipments: NodeList[Node] = cdf_client.data_modeling.instances.list(instance_type=\"node\", sources=equipment_view.as_view_id(), limit=-1)\n",
     "\n",
-    "for equipment in equipment_node_apply_list:\n",
-    "    equipment.sources[0].properties[\"tags\"] = [\"DetectInDiagrams\"]\n",
-    "print(equipment_node_apply_list[0])"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "01eacbe1",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "cdf_client.data_modeling.instances.apply(equipment_node_apply_list)"
+    "# Now let's do the same with the equipment nodes in the project so that we have entities to match against\n",
+    "asset_view_name: str = f\"{organization}Asset\"\n",
+    "asset_view: ViewPropertyConfig = ViewPropertyConfig(\n",
+    "    schema_space=\"sp_enterprise_process_industry\",\n",
+    "    external_id=asset_view_name,\n",
+    "    version=\"v1\",\n",
+    "    instance_space=\"springfield_instances\",\n",
+    ")\n",
+    "\n",
+    "asset_node_apply_list = []\n",
+    "for equipment in equipments:\n",
+    "    external_id = \"asset:\" + equipment.external_id\n",
+    "    space = equipment.space\n",
+    "\n",
+    "    properties: dict = {}\n",
+    "    equipment_name = equipment.properties[equipment_view.as_view_id()][\"name\"]\n",
+    "\n",
+    "    properties[\"tags\"] = [\"DetectInDiagrams\"]\n",
+    "    properties[\"name\"] = equipment_name\n",
+    "    properties[\"description\"] = equipment.properties[equipment_view.as_view_id()][\"description\"]\n",
+    "    properties[\"sourceId\"] = equipment.properties[equipment_view.as_view_id()][\"sourceId\"]\n",
+    "    properties[\"sourceUpdatedUser\"] = equipment.properties[equipment_view.as_view_id()][\"sourceUpdatedUser\"]\n",
+    "\n",
+    "    aliases = []\n",
+    "    name_tokens = equipment_name.split(\"-\")\n",
+    "    alt_alias = \"\"\n",
+    "    aliases.append(equipment_name)\n",
+    "    for index, token in enumerate(name_tokens):\n",
+    "        if index == 0:\n",
+    "            continue\n",
+    "        if index == 1:\n",
+    "            alt_alias = token\n",
+    "        else:\n",
+    "            alt_alias = alt_alias + \"-\" + token\n",
+    "        aliases.append(alt_alias)\n",
+    "\n",
+    "    properties[\"aliases\"] = aliases\n",
+    "    asset_node_apply_list.append(\n",
+    "        NodeApply(\n",
+    "            space=equipment.space,\n",
+    "            external_id=\"asset:\" + equipment.external_id,\n",
+    "            sources=[\n",
+    "                NodeOrEdgeData(\n",
+    "                    source=asset_view.as_view_id(),\n",
+    "                    properties=properties,\n",
+    "                )\n",
+    "            ],\n",
+    "        )\n",
+    "    )\n",
+    "\n",
+    "print(len(asset_node_apply_list))\n",
+    "print(asset_node_apply_list[0])\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "17ee6c05",
+   "id": "aad9ad6b",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# In case you're interested in seeing the instances of file annotation state\n",
-    "fileAnnotationState_view: ViewPropertyConfig = ViewPropertyConfig(\n",
-    "    schema_space= \"sp_hdm\",\n",
-    "    external_id=\"FileAnnotationState\",\n",
-    "    version = \"v1.0.0\",\n",
+    "update_results = cdf_client.data_modeling.instances.apply(\n",
+    "    nodes=asset_node_apply_list,\n",
+    "    auto_create_direct_relations=True,\n",
+    "    replace=True,  # ensures we reset the properties of the node\n",
     ")\n",
-    "cdf_client.data_modeling.instances.list(sources=fileAnnotationState_view.as_view_id())"
+    "print(update_results)"
    ]
   }
  ],
 "metadata": {
  "kernelspec": {
-   "display_name": ".venv",
+   "display_name": ".venv (3.12.9)",
   "language": "python",
   "name": "python3"
  },
diff --git a/modules/contextualization/cdf_file_annotation/raw/tbl_file_annotation.Tables.yaml b/modules/contextualization/cdf_file_annotation/raw/tbl_file_annotation.Tables.yaml
index 5c63a3e1..6a3a5abb 100644
--- a/modules/contextualization/cdf_file_annotation/raw/tbl_file_annotation.Tables.yaml
+++ b/modules/contextualization/cdf_file_annotation/raw/tbl_file_annotation.Tables.yaml
@@ -5,4 +5,13 @@
   tableName: {{ rawTableDocDoc }}
 - dbName: {{ rawDb }}
-  tableName: {{ rawTableCache }}
\ No newline at end of file
+  tableName: {{ rawTableDocPattern }}
+
+- dbName: {{ rawDb }}
+  tableName: {{ rawTableCache }}
+
+- dbName: {{ rawDb }}
+  tableName: {{ rawManualPatternsCatalog }}
+
+- dbName: {{ rawDb }}
+  tableName: {{ rawTablePromoteCache }}
\ No newline at end of file
diff --git a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard.Streamlit.yaml b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard.Streamlit.yaml
index fbb9e422..7a60f61e 100644
--- a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard.Streamlit.yaml
+++ b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard.Streamlit.yaml
@@ -5,4 +5,4 @@ description: Dashboard to inspect the health of the File Annotation Pipeline
 published: true
 theme: Light
 dataSetExternalId: {{ annotationDatasetExternalId }}
-entrypoint: Extraction_Pipeline.py
\ No newline at end of file
+entrypoint: Pipeline_Health.py
\ No newline at end of file
diff --git a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/Extraction_Pipeline.py b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/Extraction_Pipeline.py
deleted file mode 100644
index bbb18a2b..00000000
--- a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/Extraction_Pipeline.py
+++ /dev/null
@@ -1,262 +0,0 @@
-import streamlit as st
-import pandas as pd
-import altair as alt
-from cognite.client import CogniteClient
-from datetime import datetime, timedelta
-from helper import (
-    fetch_annotation_states,
-    fetch_pipeline_run_history,
-
process_runs_for_graphing, - fetch_extraction_pipeline_config, - calculate_success_failure_stats, - fetch_function_logs, - parse_run_message, -) - - -# --- Page Configuration --- -st.set_page_config( - page_title="Pipeline Run History", - page_icon="πŸ“ˆ", - layout="wide", -) - -# --- Data Fetching --- -pipeline_runs = fetch_pipeline_run_history() - -# --- Main Application --- -st.title("Pipeline Run History") -st.markdown("This page provides statistics and detailed history for all extraction pipeline runs.") - - -# --- Pipeline Statistics Section --- -if pipeline_runs: - # Time window selection - time_window_map = { - "All": None, - "Last 24 Hours": 24, - "Last 7 Days": 7 * 24, - "Last 30 Days": 30 * 24, - } - time_window_option = st.sidebar.selectbox( - "Filter by Time Window:", - options=list(time_window_map.keys()), - ) - window_hours = time_window_map[time_window_option] - - if window_hours is not None: - now = pd.Timestamp.now(tz="UTC") - filter_start_time = now - timedelta(hours=window_hours) - # Filter runs based on the time window - recent_pipeline_runs = [ - run - for run in pipeline_runs - if pd.to_datetime(run.created_time, unit="ms").tz_localize("UTC") > filter_start_time - ] - else: - # If 'All' is selected, use the original unfiltered list of runs - recent_pipeline_runs = pipeline_runs - - # MODIFICATION: Check if 'recent_pipeline_runs' has data BEFORE processing. - # If it's empty, display a message. Otherwise, proceed with stats and graphs. 
- if not recent_pipeline_runs: - st.warning("No pipeline runs found in the selected time window.") - else: - # --- Calculate detailed stats for the selected time window --- - df_runs_for_graphing = process_runs_for_graphing(recent_pipeline_runs) - - launch_success = 0 - launch_failure = 0 - finalize_success = 0 - finalize_failure = 0 - - for run in recent_pipeline_runs: - # We need to parse the message to determine the caller type - parsed_message = parse_run_message(run.message) - caller = parsed_message.get("caller") - - if caller == "Launch": - if run.status == "success": - launch_success += 1 - elif run.status == "failure": - launch_failure += 1 - elif caller == "Finalize": - if run.status == "success": - finalize_success += 1 - elif run.status == "failure": - finalize_failure += 1 - - total_launched_recent = int(df_runs_for_graphing[df_runs_for_graphing["type"] == "Launch"]["count"].sum()) - total_finalized_recent = int(df_runs_for_graphing[df_runs_for_graphing["type"] == "Finalize"]["count"].sum()) - - # --- Display Metrics and Graphs in two columns --- - g_col1, g_col2 = st.columns(2) - - with g_col1: - st.subheader("Launch Runs") - m_col1, m_col2, m_col3 = st.columns(3) - m_col1.metric( - f"Files Launched", - f"{total_launched_recent:,}", - ) - m_col2.metric( - "Successful Runs", - f"{launch_success:,}", - ) - m_col3.metric( - "Failed Runs", - f"{launch_failure:,}", - delta=f"{launch_failure:,}" if launch_failure > 0 else "0", - delta_color="inverse", - ) - - with g_col2: - st.subheader("Finalize Runs") - m_col4, m_col5, m_col6 = st.columns(3) - m_col4.metric( - f"Files Finalized", - f"{total_finalized_recent:,}", - ) - m_col5.metric( - "Successful Runs", - f"{finalize_success:,}", - ) - m_col6.metric( - "Failed Runs", - f"{finalize_failure:,}", - delta=f"{finalize_failure:,}" if finalize_failure > 0 else "0", - delta_color="inverse", - ) - - # --- Graphs --- - base_chart = ( - alt.Chart(df_runs_for_graphing) - .mark_circle(size=60, opacity=0.7) - .encode( 
- x=alt.X("timestamp:T", title="Time of Run"), - y=alt.Y("count:Q", title="Files Processed"), - tooltip=["timestamp:T", "count:Q", "type:N"], - ) - .interactive() - ) - - chart_col1, chart_col2 = st.columns(2) - with chart_col1: - launch_chart = base_chart.transform_filter(alt.datum.type == "Launch").properties( - title="Files Processed per Launch Run" - ) - st.altair_chart(launch_chart, use_container_width=True) - with chart_col2: - finalize_chart = base_chart.transform_filter(alt.datum.type == "Finalize").properties( - title="Files Processed per Finalize Run" - ) - st.altair_chart(finalize_chart, use_container_width=True) - - # --- UNIFIED DETAILED RUN HISTORY --- - with st.expander("View recent runs and fetch logs", expanded=True): - if not recent_pipeline_runs: - st.info("No runs in the selected time window.") - else: - f_col1, f_col2 = st.columns(2) - with f_col1: - run_status_filter = st.radio( - "Filter by run status:", - ("All", "Success", "Failure"), - horizontal=True, - key="run_status_filter", - ) - with f_col2: - caller_type_filter = st.radio( - "Filter by caller type:", - ("All", "Launch", "Finalize"), - horizontal=True, - key="caller_type_filter", - ) - - st.divider() - - filtered_runs = recent_pipeline_runs - if run_status_filter != "All": - filtered_runs = [run for run in filtered_runs if run.status.lower() == run_status_filter.lower()] - - if caller_type_filter != "All": - filtered_runs = [ - run for run in filtered_runs if parse_run_message(run.message).get("caller") == caller_type_filter - ] - - if not filtered_runs: - st.warning(f"No runs match the selected filters.") - else: - # Pagination state - if "page_num" not in st.session_state: - st.session_state.page_num = 0 - - items_per_page = 3 - start_idx = st.session_state.page_num * items_per_page - end_idx = start_idx + items_per_page - paginated_runs = filtered_runs[start_idx:end_idx] - - # Display logic for each run - for run in paginated_runs: - - if run.status == "success": - 
st.markdown(f"**Status:** Success") - st.success( - f"Timestamp: {pd.to_datetime(run.created_time, unit='ms').tz_localize('UTC').strftime('%Y-%m-%d %H:%M:%S %Z')}" - ) - else: - st.markdown(f"**Status:** Failure") - st.error( - f"Timestamp: {pd.to_datetime(run.created_time, unit='ms').tz_localize('UTC').strftime('%Y-%m-%d %H:%M:%S %Z')}" - ) - - parsed_message = parse_run_message(run.message) - if run.message: - st.code(run.message, language="text") - - function_id = int(parsed_message.get("function_id")) - call_id = int(parsed_message.get("call_id")) - - if function_id and call_id: - button_key = f"log_btn_all_{call_id}" - if st.button("Fetch Function Logs", key=button_key): - with st.spinner("Fetching logs..."): - logs = fetch_function_logs(function_id=function_id, call_id=call_id) - if logs: - st.text_area( - "Function Logs", - "".join(logs), - height=300, - key=f"log_area_all_{call_id}", - ) - else: - st.warning("No logs found for this run.") - st.divider() - - # Pagination controls - total_pages = (len(filtered_runs) + items_per_page - 1) // items_per_page - if total_pages > 1: - p_col1, p_col2, p_col3 = st.columns([1, 2, 1]) - with p_col1: - if st.button( - "Previous", - disabled=(st.session_state.page_num == 0), - use_container_width=True, - ): - st.session_state.page_num -= 1 - st.rerun() - with p_col2: - st.markdown( - f"
Page {st.session_state.page_num + 1} of {total_pages}
-                        ",
-                        unsafe_allow_html=True,
-                    )
-                with p_col3:
-                    if st.button(
-                        "Next",
-                        disabled=(st.session_state.page_num >= total_pages - 1),
-                        use_container_width=True,
-                    ):
-                        st.session_state.page_num += 1
-                        st.rerun()
-else:
-    st.info("No data returned from Cognite Data Fusion. Please check your settings and data model.")
diff --git a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/Pipeline_Health.py b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/Pipeline_Health.py
new file mode 100644
index 00000000..826d7b0c
--- /dev/null
+++ b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/Pipeline_Health.py
@@ -0,0 +1,522 @@
+import streamlit as st
+import pandas as pd
+import altair as alt
+from datetime import timedelta
+from helper import (
+    fetch_annotation_states,
+    fetch_pipeline_run_history,
+    process_runs_for_graphing,
+    fetch_extraction_pipeline_config,
+    fetch_function_logs,
+    parse_run_message,
+    find_pipelines,
+    get_files_by_call_id,
+    calculate_overview_kpis,
+    filter_log_lines,
+)
+
+# --- Page Configuration ---
+st.set_page_config(
+    page_title="Pipeline Health",
+    page_icon="🩺",
+    layout="wide",
+)
+
+# --- Session State and Callbacks ---
+if "selected_pipeline" not in st.session_state:
+    st.session_state.selected_pipeline = None
+if "selected_status_file_index" not in st.session_state:
+    st.session_state.selected_status_file_index = None
+if "page_num" not in st.session_state:  # For run history pagination
+    st.session_state.page_num = 0
+
+
+def reset_table_selection():
+    st.session_state.selected_status_file_index = None
+
+
+# --- Sidebar ---
+st.sidebar.title("Pipeline Selection")
+pipeline_ids = find_pipelines()
+
+if not pipeline_ids:
+    st.info("No active file annotation pipelines found to monitor.")
+    st.stop()
+
+if st.session_state.selected_pipeline not in pipeline_ids:
+    st.session_state.selected_pipeline = pipeline_ids[0]
+
+selected_pipeline =
st.sidebar.selectbox("Select a pipeline to monitor:", options=pipeline_ids, key="selected_pipeline") + +# --- Main Application --- +st.title("Pipeline Health Dashboard") + +# --- Data Fetching --- +config_result = fetch_extraction_pipeline_config(selected_pipeline) +if not config_result: + st.error(f"Could not fetch configuration for pipeline: {selected_pipeline}") + st.stop() + +ep_config, view_config = config_result + +annotation_state_view = view_config["annotation_state"] +file_view = view_config["file"] + +df_annotation_states = fetch_annotation_states(annotation_state_view, file_view) +pipeline_runs = fetch_pipeline_run_history(selected_pipeline) + +# --- Create Tabs --- +overview_tab, explorer_tab, history_tab = st.tabs(["Overview", "File Explorer", "Run History"]) + +# ========================================== +# OVERVIEW TAB +# ========================================== +with overview_tab: + st.subheader( + "Live Pipeline KPIs", + help="Provides a high-level summary of the pipeline's current state and historical throughput. 
The KPIs are calculated directly from the AnnotationState data model for real-time accuracy.", + ) + + kpis = calculate_overview_kpis(df_annotation_states) + + kpi_col1, kpi_col2, kpi_col3 = st.columns(3) + kpi_col1.metric("Files Awaiting Processing", f"{kpis['awaiting_processing']:,}") + kpi_col2.metric("Total Files Processed", f"{kpis['processed_total']:,}") + kpi_col3.metric( + "Overall Failure Rate", + f"{kpis['failure_rate_total']:.2f}%", + delta=f"{kpis['failed_total']:,} failed files", + delta_color="inverse" if kpis["failed_total"] > 0 else "off", + ) + + st.divider() + st.subheader("Pipeline Throughput") + time_agg = st.radio("Aggregate by:", options=["Daily", "Hourly", "Weekly"], horizontal=True, key="time_agg_radio") + + if not df_annotation_states.empty: + df_finalized = df_annotation_states[df_annotation_states["status"].isin(["Annotated", "Failed"])].copy() + if not df_finalized.empty: + if time_agg == "Hourly": + df_finalized["time_bucket"] = df_finalized["lastUpdatedTime"].dt.floor("H") + elif time_agg == "Weekly": + df_finalized["time_bucket"] = ( + df_finalized["lastUpdatedTime"].dt.to_period("W").apply(lambda p: p.start_time) + ) + else: # Daily + df_finalized["time_bucket"] = df_finalized["lastUpdatedTime"].dt.date + + daily_counts = df_finalized.groupby("time_bucket").size().reset_index(name="count") + + throughput_chart = ( + alt.Chart(daily_counts) + .mark_bar() + .encode( + x=alt.X("time_bucket:T", title=f"Time ({time_agg})"), + y=alt.Y("count:Q", title="Number of Files Finalized"), + tooltip=["time_bucket:T", "count:Q"], + ) + .properties(title=f"Files Finalized {time_agg}") + .interactive() + ) + st.altair_chart(throughput_chart, use_container_width=True) + else: + st.info("No files have been finalized yet.") + +# ========================================== +# FILE EXPLORER TAB +# ========================================== +with explorer_tab: + st.subheader( + "File-Centric Debugging", + help="A file-centric debugging tool for deep-dive 
analysis. Filter and select any file to view its current status, metadata, and the specific Launch and Finalize function logs associated with it.", + ) + if df_annotation_states.empty: + st.info("No annotation state data found for this pipeline.") + else: + with st.expander("Filter and Slice Data"): + # ... (your existing filter logic remains unchanged here) ... + excluded_columns = [ + "externalId", + "space", + "annotationMessage", + "fileAliases", + "fileAssets", + "fileIsuploaded", + "diagramDetectJobId", + "linkedFile", + "patternModeJobId", + "sourceCreatedUser", + "sourceCreatedTime", + "sourceUpdatedTime", + "sourceUpdatedUser", + "fileSpace", + "fileSourceupdateduser", + "fileSourcecreatedUser", + "fileSourceId", + "createdTime", + "fileSourcecreateduser", + "patternModeMessage", + "fileSourceupdatedtime", + "fileSourcecreatedtime", + "fileUploadedtime", + ] + potential_columns = [col for col in df_annotation_states.columns if col not in excluded_columns] + filterable_columns = [] + for col in potential_columns: + # Skip empty columns or columns where the first item is a list/dict + if df_annotation_states[col].dropna().empty or isinstance( + df_annotation_states[col].dropna().iloc[0], (list, dict) + ): + continue + + # Final check to ensure the column is suitable for filtering + if df_annotation_states[col].nunique() < 100: + filterable_columns.append(col) + + filterable_columns = sorted(filterable_columns) + + filter_col1, filter_col2 = st.columns(2) + selected_column = filter_col1.selectbox( + "Filter by Metadata Property", + ["None"] + filterable_columns, + on_change=reset_table_selection, + key="meta_filter", + ) + + selected_values = [] + if selected_column != "None": + unique_values = sorted(df_annotation_states[selected_column].dropna().unique().tolist()) + selected_values = filter_col2.multiselect( + f"Select Value(s) for {selected_column}", + unique_values, + on_change=reset_table_selection, + key="value_filter", + ) + + df_display = 
df_annotation_states.copy() + if selected_column != "None" and selected_values: + df_display = df_display[df_display[selected_column].isin(selected_values)] + + df_display = df_display.sort_values(by="lastUpdatedTime", ascending=False).reset_index(drop=True) + df_display.insert(0, "Select", False) + + if ( + st.session_state.selected_status_file_index is not None + and st.session_state.selected_status_file_index < len(df_display) + ): + df_display.at[st.session_state.selected_status_file_index, "Select"] = True + + # --- START: New additions for customizable and readable columns --- + default_columns = [ + "Select", + "fileName", + "fileExternalId", + "fileSourceid", + "status", + "fileMimetype", + "pageCount", + "annotatedPageCount", + ] + all_columns = df_display.columns.tolist() + + with st.popover("Customize Table Columns"): + selected_columns = st.multiselect( + "Select columns to display:", + options=all_columns, + default=[col for col in default_columns if col in all_columns], + ) + + if not selected_columns: + st.warning("Please select at least one column to display.") + st.stop() + + edited_df = st.data_editor( + df_display[selected_columns], # Display only selected columns + key="status_table_editor", + column_config={ + "Select": st.column_config.CheckboxColumn(required=True), + "fileName": "File Name", + "fileExternalId": "File External ID", + "status": "Annotation Status", + "retries": "Retries", + "fileSourceid": "Source ID", + "fileMimetype": "Mime Type", + "annotationMessage": "Annotation Message", + "patternModeMessage": "Pattern Mode Message", + "pageCount": "Page Count", + "annotatedPageCount": "Annotated Page Count", + }, + use_container_width=True, + hide_index=True, + disabled=df_display.columns.difference(["Select"]), + ) + + selected_indices = edited_df[edited_df.Select].index.tolist() + if len(selected_indices) > 1: + new_selection = [ + idx for idx in selected_indices if idx != st.session_state.get("selected_status_file_index") + ] + 
st.session_state.selected_status_file_index = new_selection[0] if new_selection else None + st.rerun() + elif len(selected_indices) == 1: + st.session_state.selected_status_file_index = selected_indices[0] + elif len(selected_indices) == 0 and st.session_state.selected_status_file_index is not None: + st.session_state.selected_status_file_index = None + st.rerun() + + if ( + st.session_state.selected_status_file_index is not None + and st.session_state.selected_status_file_index < len(df_display) + ): + st.divider() + st.subheader("Function Log Viewer") + selected_row = df_display.iloc[st.session_state.selected_status_file_index] + file_ext_id = selected_row.get("fileExternalId", "") + + finalize_tab, launch_tab = st.tabs(["Finalize Log", "Launch Log"]) + with finalize_tab: + finalize_func_id = selected_row.get("finalizeFunctionId") + finalize_call_id = selected_row.get("finalizeFunctionCallId") + if pd.notna(finalize_func_id) and pd.notna(finalize_call_id): + with st.spinner("Fetching finalize log..."): + finalize_logs_raw = "".join( + fetch_function_logs(function_id=int(finalize_func_id), call_id=int(finalize_call_id)) + ) + if finalize_logs_raw: + st.download_button( + "Download Full Log", finalize_logs_raw, f"{file_ext_id}_finalize_log.txt" + ) + filtered_log = filter_log_lines(finalize_logs_raw, file_ext_id) + st.write("**Relevant Log Entries:**") + st.code( + filtered_log if filtered_log else "No log entries found for this specific file.", + language="log", + ) + with st.expander("View Full Log"): + st.code(finalize_logs_raw, language="log") + else: + st.warning("No finalize logs found.") + else: + st.info("No Finalize Function call information available for this file.") + + with launch_tab: + launch_func_id = selected_row.get("launchFunctionId") + launch_call_id = selected_row.get("launchFunctionCallId") + if pd.notna(launch_func_id) and pd.notna(launch_call_id): + with st.spinner("Fetching launch log..."): + launch_logs_raw = "".join( + 
fetch_function_logs(function_id=int(launch_func_id), call_id=int(launch_call_id)) + ) + # NOTE: launch log doesn't provide log lines with individual Node Id's of files processed + if launch_logs_raw: + st.download_button("Download Full Log", launch_logs_raw, f"{file_ext_id}_launch_log.txt") + with st.expander("View Full Log"): + st.code(launch_logs_raw, language="log") + else: + st.warning("No launch logs found.") + else: + st.info("No Launch Function call information available for this file.") + +# ========================================== +# RUN HISTORY TAB +# ========================================== +with history_tab: + st.subheader( + "Run-Centric Analysis", + help="A run-centric view for analyzing the execution history of the pipeline functions. Review the status, logs, and a list of files processed for each individual pipeline run.", + ) + if not pipeline_runs: + st.info("No pipeline runs found for this pipeline.") + else: + time_window_map = {"All": None, "Last 24 Hours": 24, "Last 7 Days": 7 * 24, "Last 30 Days": 30 * 24} + time_window_option = st.selectbox( + "Filter by Time Window:", options=list(time_window_map.keys()), key="time_window_history" + ) + window_hours = time_window_map[time_window_option] + + if window_hours is not None: + now = pd.Timestamp.now(tz="UTC") + filter_start_time = now - timedelta(hours=window_hours) + recent_pipeline_runs = [ + run + for run in pipeline_runs + if pd.to_datetime(run.created_time, unit="ms").tz_localize("UTC") > filter_start_time + ] + else: + recent_pipeline_runs = pipeline_runs + + if not recent_pipeline_runs: + st.warning("No pipeline runs found in the selected time window.") + else: + df_runs_for_graphing = process_runs_for_graphing(recent_pipeline_runs) + launch_success, launch_failure, finalize_success, finalize_failure = 0, 0, 0, 0 + for run in recent_pipeline_runs: + parsed_message = parse_run_message(run.message) + caller = parsed_message.get("caller") + if caller == "Launch": + if run.status == 
"success": + launch_success += 1 + elif run.status == "failure": + launch_failure += 1 + elif caller == "Finalize": + if run.status == "success": + finalize_success += 1 + elif run.status == "failure": + finalize_failure += 1 + + total_launched_recent = int(df_runs_for_graphing[df_runs_for_graphing["type"] == "Launch"]["count"].sum()) + total_finalized_recent = int( + df_runs_for_graphing[df_runs_for_graphing["type"] == "Finalize"]["count"].sum() + ) + + g_col1, g_col2 = st.columns(2) + with g_col1: + st.subheader("Launch Runs") + m_col1, m_col2, m_col3 = st.columns(3) + m_col1.metric("Files Launched", f"{total_launched_recent:,}") + m_col2.metric("Successful Runs", f"{launch_success:,}") + m_col3.metric( + "Failed Runs", + f"{launch_failure:,}", + delta=f"{launch_failure:,}" if launch_failure > 0 else "0", + delta_color="inverse", + ) + with g_col2: + st.subheader("Finalize Runs") + m_col4, m_col5, m_col6 = st.columns(3) + m_col4.metric("Files Finalized", f"{total_finalized_recent:,}") + m_col5.metric("Successful Runs", f"{finalize_success:,}") + m_col6.metric( + "Failed Runs", + f"{finalize_failure:,}", + delta=f"{finalize_failure:,}" if finalize_failure > 0 else "0", + delta_color="inverse", + ) + + st.divider() + + base_chart = ( + alt.Chart(df_runs_for_graphing) + .mark_circle(size=60, opacity=0.7) + .encode( + x=alt.X("timestamp:T", title="Time of Run"), + y=alt.Y("count:Q", title="Files Processed"), + tooltip=["timestamp:T", "count:Q", "type:N"], + ) + .interactive() + ) + + chart_col1, chart_col2 = st.columns(2) + with chart_col1: + st.altair_chart( + base_chart.transform_filter(alt.datum.type == "Launch").properties( + title="Files Processed per Launch Run" + ), + use_container_width=True, + ) + with chart_col2: + st.altair_chart( + base_chart.transform_filter(alt.datum.type == "Finalize").properties( + title="Files Processed per Finalize Run" + ), + use_container_width=True, + ) + + st.divider() + st.subheader("Detailed Run History") + + f_col1, f_col2 = 
st.columns(2) + run_status_filter = f_col1.radio( + "Filter by run status:", ("All", "Success", "Failure"), horizontal=True, key="run_status_filter" + ) + caller_type_filter = f_col2.radio( + "Filter by caller type:", ("All", "Launch", "Finalize"), horizontal=True, key="caller_type_filter" + ) + + filtered_runs = recent_pipeline_runs + if run_status_filter != "All": + filtered_runs = [run for run in filtered_runs if run.status.lower() == run_status_filter.lower()] + if caller_type_filter != "All": + filtered_runs = [ + run for run in filtered_runs if parse_run_message(run.message).get("caller") == caller_type_filter + ] + + if not filtered_runs: + st.warning("No runs match the selected filters.") + else: + items_per_page = 5 + start_idx = st.session_state.page_num * items_per_page + end_idx = start_idx + items_per_page + paginated_runs = filtered_runs[start_idx:end_idx] + + if not paginated_runs: + st.warning("No runs match the selected filters.") + else: + for run in paginated_runs: + st.markdown( + f"**Status:** {run.status.capitalize()} at {pd.to_datetime(run.created_time, unit='ms').tz_localize('UTC').strftime('%Y-%m-%d %H:%M:%S')}" + ) + st.code(run.message, language="text") + + parsed_message = parse_run_message(run.message) + function_id_str = parsed_message.get("function_id") + call_id_str = parsed_message.get("call_id") + + expander_col1, expander_col2 = st.columns(2) + + with expander_col1: + with st.expander("View Function Log"): + st.write("**Function Log**") + log_key = f"log_{run.id}" + + if function_id_str and call_id_str: + # Show the log if it has been fetched, otherwise show the load button + if log_key in st.session_state: + st.download_button( + "Download Log", st.session_state[log_key], f"run_{run.id}_log.txt" + ) + st.code(st.session_state[log_key], language="log") + else: + if st.button("Load Log", key=f"load_btn_{run.id}"): + with st.spinner("Fetching logs..."): + logs = "".join( + fetch_function_logs( + function_id=int(function_id_str), 
call_id=int(call_id_str) + ) + ) + st.session_state[log_key] = ( + logs if logs else "No logs found for this run." + ) + st.rerun() + else: + st.info("No log information in run message.") + + with expander_col2: + with st.expander("View Files Processed"): + st.write("External ID(s):") + if call_id_str: + df_files_in_run = get_files_by_call_id(int(call_id_str), annotation_state_view) + if not df_files_in_run.empty: + file_list = df_files_in_run["File External ID"].tolist() + st.text("\n".join(file_list)) + else: + st.write("No associated files found.") + else: + st.info("No call_id found in run message.") + st.divider() + + total_pages = (len(filtered_runs) + items_per_page - 1) // items_per_page + if total_pages > 1: + p_col1, p_col2, p_col3 = st.columns([1, 2, 1]) + if p_col1.button("Previous", disabled=(st.session_state.page_num == 0), use_container_width=True): + st.session_state.page_num -= 1 + st.rerun() + p_col2.markdown( + f"
Page {st.session_state.page_num + 1} of {total_pages}
", + unsafe_allow_html=True, + ) + if p_col3.button( + "Next", disabled=(st.session_state.page_num >= total_pages - 1), use_container_width=True + ): + st.session_state.page_num += 1 + st.rerun() diff --git a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/canvas.py b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/canvas.py new file mode 100644 index 00000000..af13b0cf --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/canvas.py @@ -0,0 +1,238 @@ +from cognite.client import CogniteClient +from cognite.client.data_classes.data_modeling import ( + NodeOrEdgeData, + NodeApply, + EdgeApply, + ContainerId, + ViewId, + NodeId, + EdgeId, + Node, + Edge, +) +from cognite.client.data_classes.filters import Equals, And, Not +import datetime +import uuid +import streamlit as st + +# Settings for the Industrial Canvas Data Model +CANVAS_SPACE_CANVAS = "cdf_industrial_canvas" +CANVAS_SPACE_INSTANCE = "IndustrialCanvasInstanceSpace" +CANVAS_CONTAINER_CANVAS = "Canvas" +CANVAS_CONTAINER_INSTANCE = "FdmInstanceContainerReference" +CANVAS_CONTAINER_ANNOTATION = "CanvasAnnotation" + + +def get_time(): + now = datetime.datetime.now(datetime.timezone.utc) + return now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z" + + +def get_user_id(client: CogniteClient): + if client: + return client.iam.user_profiles.me().user_identifier + return None + + +def generate_id(): + return str(uuid.uuid4()) + + +def generate_properties(file_node: Node, file_view_id: ViewId, node_id: str, offset_x: int = 0, offset_y: int = 0): + """Generates the property dictionary for a file node to be displayed on the canvas.""" + return { + "id": node_id, + "containerReferenceType": "fdmInstance", + "label": file_node.properties[file_view_id].get("name", file_node.external_id), + "x": offset_x, + "y": offset_y, + "width": 800, # Increased default size for better viewing + "height": 600, + "maxWidth": 
1600, + "maxHeight": 1200, + "instanceExternalId": file_node.external_id, + "instanceSpace": file_node.space, + "viewExternalId": file_view_id.external_id, + "viewSpace": file_view_id.space, + "viewVersion": file_view_id.version, + "properties": {"zIndex": 0}, + } + + +def create_canvas(name: str, file_node: Node, client: CogniteClient): + """Creates the main canvas node.""" + canvas_id = f"file_annotation_canvas_{file_node.external_id}" + file_annotation_label = {"externalId": "file_annotations_solution_tag", "space": "SolutionTagsInstanceSpace"} + canvas = NodeApply( + space=CANVAS_SPACE_INSTANCE, + external_id=canvas_id, + sources=[ + NodeOrEdgeData( + source=ContainerId(CANVAS_SPACE_CANVAS, CANVAS_CONTAINER_CANVAS), + properties={ + "name": name, + "visibility": "private", + "updatedAt": get_time(), + "createdBy": get_user_id(client), + "updatedBy": get_user_id(client), + "solutionTags": [file_annotation_label], + }, + ) + ], + ) + return canvas + + +def fetch_existing_canvas(name: str, file_node: Node, client: CogniteClient): + existing_canvas = client.data_modeling.instances.retrieve( + nodes=NodeId(space=CANVAS_SPACE_INSTANCE, external_id=f"file_annotation_canvas_{file_node.external_id}") + ) + + return existing_canvas.nodes[0] if existing_canvas.nodes else None + + +def create_objects(canvas_id: str, file_node: Node, file_view_id: ViewId): + """Creates the node and edge for the file container, returning its ID.""" + file_container_id = f"file_annotation_file_container_{file_node.external_id}" + properties = generate_properties(file_node, file_view_id, file_container_id) + + node_apply = NodeApply( + space=CANVAS_SPACE_INSTANCE, + external_id=f"{canvas_id}_{file_container_id}", + sources=[ + NodeOrEdgeData( + source=ContainerId(CANVAS_SPACE_CANVAS, CANVAS_CONTAINER_INSTANCE), + properties=properties, + ) + ], + ) + + edge_apply = EdgeApply( + space=CANVAS_SPACE_INSTANCE, + external_id=f"{canvas_id}_{canvas_id}_{file_container_id}", + 
type=(CANVAS_SPACE_CANVAS, "referencesFdmInstanceContainerReference"), + start_node=(CANVAS_SPACE_INSTANCE, canvas_id), + end_node=(CANVAS_SPACE_INSTANCE, f"{canvas_id}_{file_container_id}"), + ) + return [node_apply], [edge_apply], file_container_id + + +def create_bounding_box_annotations(canvas_id: str, file_container_id: str, unmatched_tags: list[dict]): + """Creates annotation nodes and edges for unmatched tags.""" + annotation_nodes = [] + annotation_edges = [] + + for tag_info in unmatched_tags: + tag_text = tag_info["text"] + regions = tag_info.get("regions", []) + + for region in regions: + vertices = region.get("vertices", []) + if not vertices: + continue + + x_coords = [v["x"] for v in vertices] + y_coords = [v["y"] for v in vertices] + x_min, x_max = min(x_coords), max(x_coords) + y_min, y_max = min(y_coords), max(y_coords) + + annotation_id = generate_id() + properties = { + "id": annotation_id, + "annotationType": "rectangle", + "containerId": file_container_id, # <-- This is the crucial link + "isSelectable": True, + "isDraggable": True, + "isResizable": True, + "properties": { + "x": x_min, + "y": y_min, + "width": x_max - x_min, + "height": y_max - y_min, + "label": tag_text, + "zIndex": 10, + "style": { + "fill": "rgba(40, 167, 69, 0.3)", # Semi-transparent vibrant green + "stroke": "rgb(40, 167, 69)", # Solid vibrant green for the border + "strokeWidth": 1, + "opacity": 1, + }, + }, + } + + annotation_node = NodeApply( + space=CANVAS_SPACE_INSTANCE, + external_id=f"{canvas_id}_{annotation_id}", + sources=[ + NodeOrEdgeData( + source=ContainerId(CANVAS_SPACE_CANVAS, CANVAS_CONTAINER_ANNOTATION), + properties=properties, + ) + ], + ) + annotation_nodes.append(annotation_node) + + annotation_edge = EdgeApply( + space=CANVAS_SPACE_INSTANCE, + external_id=f"{canvas_id}_{canvas_id}_{annotation_id}", + type=(CANVAS_SPACE_CANVAS, "referencesCanvasAnnotation"), + start_node=(CANVAS_SPACE_INSTANCE, canvas_id), + end_node=(CANVAS_SPACE_INSTANCE, 
f"{canvas_id}_{annotation_id}"), + ) + annotation_edges.append(annotation_edge) + + return annotation_nodes, annotation_edges + + +def dm_generate( + name: str, file_node: Node, file_view_id: ViewId, client: CogniteClient, unmatched_tags_with_regions: list = [] +): + """Orchestrates the creation of the canvas, its objects, and bounding box annotations.""" + canvas = fetch_existing_canvas(name, file_node, client) + + if canvas: + file_container_id = f"file_annotation_file_container_{file_node.external_id}" + reset_canvas_annotations(canvas.external_id, client) + nodes = [] + edges = [] + else: + canvas = create_canvas(name, file_node, client) + nodes, edges, file_container_id = create_objects(canvas.external_id, file_node, file_view_id) + + canvas_id = canvas.external_id + + if unmatched_tags_with_regions: + annotation_nodes, annotation_edges = create_bounding_box_annotations( + canvas_id, file_container_id, unmatched_tags_with_regions + ) + nodes.extend(annotation_nodes) + edges.extend(annotation_edges) + + client.data_modeling.instances.apply(nodes=[canvas] + nodes, edges=edges) + st.session_state["canvas_id"] = canvas_id + return canvas_id + + +def reset_canvas_annotations(canvas_id: str, client: CogniteClient): + """Deletes all canvas annotations, which includes nodes and edges""" + edge_filter = And( + Equals( + property=["edge", "type"], value={"space": CANVAS_SPACE_CANVAS, "externalId": "referencesCanvasAnnotation"} + ), + Equals(property=["edge", "startNode"], value={"space": CANVAS_SPACE_INSTANCE, "externalId": canvas_id}), + ) + + edges_to_delete = client.data_modeling.instances.list( + instance_type="edge", + filter=edge_filter, + limit=-1, + ) + + edges_to_delete_ids = [EdgeId(space=e.space, external_id=e.external_id) for e in edges_to_delete] + nodes_to_delete_ids = [NodeId(space=e.end_node.space, external_id=e.end_node.external_id) for e in edges_to_delete] + + if edges_to_delete_ids: + client.data_modeling.instances.delete(edges=edges_to_delete_ids) + 
+ if nodes_to_delete_ids: + client.data_modeling.instances.delete(nodes=nodes_to_delete_ids) diff --git a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/data_structures.py b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/data_structures.py index 400cd560..f5fe0f4b 100644 --- a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/data_structures.py +++ b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/data_structures.py @@ -1,18 +1,18 @@ -import streamlit as st -from cognite.client.data_classes.data_modeling import ViewId -from dataclasses import dataclass - - -# Configuration Classes -@dataclass -class ViewPropertyConfig: - schema_space: str - external_id: str - version: str - instance_space: str | None = None - - def as_view_id(self) -> ViewId: - return ViewId(space=self.schema_space, external_id=self.external_id, version=self.version) - - def as_property_ref(self, property) -> list[str]: - return [self.schema_space, f"{self.external_id}/{self.version}", property] +import streamlit as st +from cognite.client.data_classes.data_modeling import ViewId +from dataclasses import dataclass + + +# Configuration Classes +@dataclass +class ViewPropertyConfig: + schema_space: str + external_id: str + version: str + instance_space: str | None = None + + def as_view_id(self) -> ViewId: + return ViewId(space=self.schema_space, external_id=self.external_id, version=self.version) + + def as_property_ref(self, property) -> list[str]: + return [self.schema_space, f"{self.external_id}/{self.version}", property] diff --git a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/helper.py b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/helper.py index 542efa81..794f31dd 100644 --- a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/helper.py +++ 
b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/helper.py @@ -5,13 +5,106 @@ import pandas as pd from datetime import datetime, timedelta from cognite.client import CogniteClient -from cognite.client.data_classes.data_modeling import ViewId, NodeId +from cognite.client.data_classes import RowWrite, Asset, AssetFilter +from cognite.client.data_classes.data_modeling import ( + ViewId, + NodeId, + Node, + filters, + EdgeApply, + NodeOrEdgeData, + DirectRelationReference, +) from cognite.client.data_classes.functions import FunctionCallLog from data_structures import ViewPropertyConfig +from canvas import dm_generate client = CogniteClient() -PIPELINE_EXT_ID = "ep_file_annotation" + +@st.cache_data(ttl=3600) +def get_file_node(file_id: NodeId, file_view: ViewPropertyConfig) -> Node | None: + """Fetches a single file node from CDF.""" + try: + node = client.data_modeling.instances.retrieve_nodes(nodes=file_id, sources=file_view.as_view_id()) + return node + except Exception as e: + st.error(f"Failed to retrieve file node {file_id}: {e}") + return None + + +def generate_file_canvas( + file_id: NodeId, file_view: ViewPropertyConfig, ep_config: dict, unmatched_tags_with_regions: list = [] +): + """ + Generates an Industrial Canvas, including bounding boxes for unmatched tags, + and returns the canvas URL. 
+ """ + file_node = get_file_node(file_id, file_view) + if not file_node: + st.error("Could not generate canvas because the file node could not be retrieved.") + return None + + canvas_name = f"Annotation Quality Analysis - {file_node.external_id}" + + try: + domain = os.getenv("COGNITE_ORGANIZATION") + project = client.config.project + cluster = client.config.cdf_cluster + + canvas_id = dm_generate( + name=canvas_name, + file_node=file_node, + file_view_id=file_view.as_view_id(), + client=client, + unmatched_tags_with_regions=unmatched_tags_with_regions, + ) + st.success(f"Successfully generated canvas: {canvas_name}") + + canvas_url = f"https://{domain}.fusion.cognite.com/{project}/industrial-canvas/canvas?canvasId={canvas_id}&cluster={cluster}.cognitedata.com&env={cluster}&workspace=industrial-tools" + return canvas_url + + except Exception as e: + st.error(f"Failed to generate canvas: {e}") + return None + + +@st.cache_data(ttl=600) +def find_pipelines(name_filter: str = "file_annotation") -> list[str]: + """ + Finds the external IDs of all extraction pipelines in the project, + filtered by a substring in their external ID. 
+ """ + try: + all_pipelines = client.extraction_pipelines.list(limit=-1) + if not all_pipelines: + st.warning(f"No extraction pipelines found in the project.") + return [] + + filtered_ids = [p.external_id for p in all_pipelines if name_filter in p.external_id] + + if not filtered_ids: + st.warning(f"No pipelines matching the filter '*{name_filter}*' found in the project.") + return [] + + return sorted(filtered_ids) + except Exception as e: + st.error(f"An error occurred while searching for extraction pipelines: {e}") + return [] + + +@st.cache_data(ttl=3600) +def fetch_raw_table_data(db_name: str, table_name: str) -> pd.DataFrame: + """Fetches all rows from a specified RAW table and returns as a DataFrame.""" + try: + rows = client.raw.rows.list(db_name=db_name, table_name=table_name, limit=-1) + if not rows: + return pd.DataFrame() + data = [row.columns for row in rows] + return pd.DataFrame(data) + except Exception as e: + st.error(f"Failed to fetch data from RAW table '{table_name}': {e}") + return pd.DataFrame() def parse_run_message(message: str) -> dict: @@ -19,7 +112,6 @@ def parse_run_message(message: str) -> dict: if not message: return {} - # Regex to capture all key-value pairs from the new format pattern = re.compile( r"\(caller:(?P\w+), function_id:(?P[\w\.-]+), call_id:(?P[\w\.-]+)\) - " r"total files processed: (?P\d+) - " @@ -29,7 +121,6 @@ def parse_run_message(message: str) -> dict: match = pattern.search(message) if match: data = match.groupdict() - # Convert numeric strings to integers for key in ["total", "success", "failed"]: if key in data: data[key] = int(data[key]) @@ -38,11 +129,11 @@ def parse_run_message(message: str) -> dict: @st.cache_data(ttl=3600) -def fetch_extraction_pipeline_config() -> tuple[dict, ViewPropertyConfig, ViewPropertyConfig]: +def fetch_extraction_pipeline_config(pipeline_ext_id: str) -> tuple[dict, ViewPropertyConfig, ViewPropertyConfig]: """ Fetch configurations from the latest extraction """ - ep_configuration = 
client.extraction_pipelines.config.retrieve(external_id=PIPELINE_EXT_ID) + ep_configuration = client.extraction_pipelines.config.retrieve(external_id=pipeline_ext_id) config_dict = yaml.safe_load(ep_configuration.config) local_annotation_state_view = config_dict["dataModelViews"]["annotationStateView"] @@ -61,7 +152,21 @@ def fetch_extraction_pipeline_config() -> tuple[dict, ViewPropertyConfig, ViewPr local_file_view.get("instanceSpace"), ) - return (config_dict, annotation_state_view, file_view) + local_target_entities_view = config_dict["dataModelViews"]["targetEntitiesView"] + target_entities_view = ViewPropertyConfig( + local_target_entities_view["schemaSpace"], + local_target_entities_view["externalId"], + local_target_entities_view["version"], + local_target_entities_view.get("instanceSpace"), + ) + + views_dict = { + "annotation_state": annotation_state_view, + "file": file_view, + "target_entities": target_entities_view, + } + + return (config_dict, views_dict) @st.cache_data(ttl=3600) @@ -70,7 +175,6 @@ def fetch_annotation_states(annotation_state_view: ViewPropertyConfig, file_view Fetches annotation state instances from the specified data model view and joins them with their corresponding file instances. """ - # 1. Fetch all annotation state instances annotation_instances = client.data_modeling.instances.list( instance_type="node", space=annotation_state_view.instance_space, @@ -78,10 +182,8 @@ def fetch_annotation_states(annotation_state_view: ViewPropertyConfig, file_view limit=-1, ) if not annotation_instances: - st.info("No annotation state instances found in the specified view.") return pd.DataFrame() - # 2. Process annotation states and collect NodeIds for linked files annotation_data = [] nodes_to_fetch = [] for instance in annotation_instances: @@ -106,91 +208,41 @@ def fetch_annotation_states(annotation_state_view: ViewPropertyConfig, file_view if df_annotations.empty or not nodes_to_fetch: return df_annotations - # 3. 
Fetch corresponding file instances using the collected NodeIds - # Remove duplicates before fetching unique_nodes_to_fetch = list(set(nodes_to_fetch)) file_instances = client.data_modeling.instances.retrieve_nodes( nodes=unique_nodes_to_fetch, sources=file_view.as_view_id() ) - # 4. Process file instances into a DataFrame file_data = [] for instance in file_instances: - node_data = { - "fileExternalId": instance.external_id, - "fileSpace": instance.space, - } + node_data = {"fileExternalId": instance.external_id, "fileSpace": instance.space} properties = instance.properties[file_view.as_view_id()] for prop_key, prop_value in properties.items(): - if isinstance(prop_value, list): - string_values = [] - for value in prop_value: - string_values.append(value) - node_data[f"file{prop_key.capitalize()}"] = ", ".join(filter(None, string_values)) - else: - node_data[f"file{prop_key.capitalize()}"] = prop_value + node_data[f"file{prop_key.capitalize()}"] = ( + ", ".join(map(str, prop_value)) if isinstance(prop_value, list) else prop_value + ) file_data.append(node_data) if not file_data: return df_annotations df_files = pd.DataFrame(file_data) - - # 5. Merge annotation data with file data df_merged = pd.merge(df_annotations, df_files, on=["fileExternalId", "fileSpace"], how="left") - # 6. 
Final data cleaning and preparation - if "createdTime" in df_merged.columns: - df_merged["createdTime"] = df_merged["createdTime"].dt.tz_localize("UTC") - if "lastUpdatedTime" in df_merged.columns: - df_merged["lastUpdatedTime"] = df_merged["lastUpdatedTime"].dt.tz_localize("UTC") - - df_merged.rename( - columns={ - "annotationStatus": "status", - "attemptCount": "retries", - "diagramDetectJobId": "jobId", - }, - inplace=True, - ) + for col in ["createdTime", "lastUpdatedTime"]: + if col in df_merged.columns: + df_merged[col] = df_merged[col].dt.tz_localize("UTC") - for col in ["status", "fileExternalId", "retries", "jobId"]: - if col not in df_merged.columns: - df_merged[col] = None + df_merged.rename(columns={"annotationStatus": "status", "attemptCount": "retries"}, inplace=True) return df_merged @st.cache_data(ttl=3600) -def fetch_pipeline_run_history(): +def fetch_pipeline_run_history(pipeline_ext_id: str): """Fetches the full run history for a given extraction pipeline.""" - return client.extraction_pipelines.runs.list(external_id=PIPELINE_EXT_ID, limit=-1) - - -def calculate_success_failure_stats(runs): - """Calculates success and failure counts from a list of pipeline runs.""" - success_count = sum(1 for run in runs if run.status == "success") - failure_count = sum(1 for run in runs if run.status == "failure") - return success_count, failure_count - - -def get_failed_run_details(runs): - """Filters for failed runs and extracts their details, including IDs.""" - failed_runs = [] - for run in runs: - if run.status == "failure": - parsed_message = parse_run_message(run.message) - failed_runs.append( - { - "timestamp": pd.to_datetime(run.created_time, unit="ms").tz_localize("UTC"), - "message": run.message, - "status": run.status, - "function_id": parsed_message.get("function_id"), - "call_id": parsed_message.get("call_id"), - } - ) - return sorted(failed_runs, key=lambda x: x["timestamp"], reverse=True) + return 
client.extraction_pipelines.runs.list(external_id=pipeline_ext_id, limit=-1) @st.cache_data(ttl=3600) @@ -205,57 +257,530 @@ def fetch_function_logs(function_id: int, call_id: int): def process_runs_for_graphing(runs): """Transforms pipeline run data into a DataFrame for graphing.""" - launch_data = [] - finalize_runs_to_agg = [] - + launch_data, finalize_runs_to_agg = [], [] for run in runs: if run.status != "success": continue - parsed = parse_run_message(run.message) if not parsed: continue - - timestamp = pd.to_datetime(run.created_time, unit="ms").tz_localize("UTC") - count = parsed.get("total", 0) - caller = parsed.get("caller") - + timestamp, count, caller = ( + pd.to_datetime(run.created_time, unit="ms").tz_localize("UTC"), + parsed.get("total", 0), + parsed.get("caller"), + ) if caller == "Launch": launch_data.append({"timestamp": timestamp, "count": count, "type": "Launch"}) elif caller == "Finalize": finalize_runs_to_agg.append({"timestamp": timestamp, "count": count}) - # --- Aggregate Finalize Runs --- aggregated_finalize_data = [] if finalize_runs_to_agg: finalize_runs_to_agg.sort(key=lambda x: x["timestamp"]) - current_group_start_time = finalize_runs_to_agg[0]["timestamp"] - current_group_count = 0 - + current_group_start_time, current_group_count = finalize_runs_to_agg[0]["timestamp"], 0 for run in finalize_runs_to_agg: if run["timestamp"] < current_group_start_time + timedelta(minutes=10): current_group_count += run["count"] else: aggregated_finalize_data.append( - { - "timestamp": current_group_start_time, - "count": current_group_count, - "type": "Finalize", - } + {"timestamp": current_group_start_time, "count": current_group_count, "type": "Finalize"} ) - current_group_start_time = run["timestamp"] - current_group_count = run["count"] - + current_group_start_time, current_group_count = run["timestamp"], run["count"] if current_group_count > 0: aggregated_finalize_data.append( - { - "timestamp": current_group_start_time, - "count": 
current_group_count, - "type": "Finalize", - } + {"timestamp": current_group_start_time, "count": current_group_count, "type": "Finalize"} ) - df_launch = pd.DataFrame(launch_data) - df_finalize = pd.DataFrame(aggregated_finalize_data) + return pd.concat([pd.DataFrame(launch_data), pd.DataFrame(aggregated_finalize_data)], ignore_index=True) + + +@st.cache_data(ttl=3600) +def fetch_pattern_catalog(db_name: str, table_name: str) -> pd.DataFrame: + """ + Fetches the entity cache and explodes it to create a complete + catalog of all generated patterns, indexed by resourceType. + """ + try: + rows = client.raw.rows.list(db_name=db_name, table_name=table_name, limit=-1) + if not rows: + return pd.DataFrame() + all_patterns = [] + for row in pd.DataFrame([row.columns for row in rows]).itertuples(): + for sample_list in ["AssetPatternSamples", "FilePatternSamples"]: + if hasattr(row, sample_list) and isinstance(getattr(row, sample_list), list): + for item in getattr(row, sample_list): + if item.get("sample") and item.get("resource_type"): + all_patterns.extend( + [ + {"resourceType": item["resource_type"], "pattern": pattern} + for pattern in item["sample"] + ] + ) + return pd.DataFrame(all_patterns) + except Exception as e: + st.error(f"Failed to fetch pattern catalog from '{table_name}': {e}") + return pd.DataFrame() + + +def fetch_manual_patterns(db_name: str, table_name: str) -> pd.DataFrame: + """ + Fetches all manual patterns from the RAW table and explodes them + into a tidy DataFrame for display and editing. 
+ """ + all_patterns = [] + try: + for row in client.raw.rows.list(db_name=db_name, table_name=table_name, limit=-1): + key, patterns_list = row.key, row.columns.get("patterns", []) + scope_level, primary_scope, secondary_scope = "Global", "", "" + if key != "GLOBAL": + parts = key.split("_") + if len(parts) == 2: + scope_level, primary_scope, secondary_scope = "Secondary Scope", parts[0], parts[1] + else: + scope_level, primary_scope = "Primary Scope", key + all_patterns.extend( + [ + { + "key": key, + "scope_level": scope_level, + "annotation_type": p.get("annotation_type"), + "primary_scope": primary_scope, + "secondary_scope": secondary_scope, + "sample": p.get("sample"), + "resource_type": p.get("resource_type"), + "created_by": p.get("created_by"), + } + for p in patterns_list + ] + ) + + df = ( + pd.DataFrame(all_patterns) + if all_patterns + else pd.DataFrame( + columns=[ + "key", + "scope_level", + "annotation_type", + "primary_scope", + "secondary_scope", + "sample", + "resource_type", + "created_by", + ] + ) + ) + return df.fillna("").astype(str) + except Exception as e: + if "NotFoundError" not in str(type(e)): + st.error(f"Failed to fetch manual patterns: {e}") + return pd.DataFrame( + columns=[ + "key", + "scope_level", + "annotation_type", + "primary_scope", + "secondary_scope", + "sample", + "resource_type", + "created_by", + ] + ) + + +def save_manual_patterns(df: pd.DataFrame, db_name: str, table_name: str): + """ + Takes a tidy DataFrame of patterns, groups them by scope key, + and writes them back to the RAW table. 
+ """ + + def create_key(row): + if row["scope_level"] == "Global": + return "GLOBAL" + if row["scope_level"] == "Primary Scope" and row["primary_scope"]: + return row["primary_scope"] + if row["scope_level"] == "Secondary Scope" and row["primary_scope"] and row["secondary_scope"]: + return f"{row['primary_scope']}_{row['secondary_scope']}" + return None + + df["key"] = df.apply(create_key, axis=1) + df.dropna(subset=["key"], inplace=True) + rows_to_write = [ + RowWrite( + key=key, + columns={ + "patterns": group[["sample", "resource_type", "annotation_type", "created_by"]].to_dict("records") + }, + ) + for key, group in df.groupby("key") + ] + + existing_keys = {r.key for r in client.raw.rows.list(db_name, table_name, limit=-1)} + keys_to_delete = list(existing_keys - {r.key for r in rows_to_write}) + if keys_to_delete: + client.raw.rows.delete(db_name=db_name, table_name=table_name, key=keys_to_delete) + if rows_to_write: + client.raw.rows.insert(db_name=db_name, table_name=table_name, row=rows_to_write, ensure_parent=True) + + +@st.cache_data(ttl=600) +def get_files_by_call_id(call_id: int, annotation_state_view: ViewPropertyConfig) -> pd.DataFrame: + """ + Finds all files associated with a specific function call ID by querying + the AnnotationState data model. 
+ """ + if not call_id: + return pd.DataFrame() + try: + call_id_filter = filters.Or( + filters.Equals(annotation_state_view.as_property_ref("launchFunctionCallId"), call_id), + filters.Equals(annotation_state_view.as_property_ref("finalizeFunctionCallId"), call_id), + ) + instances = client.data_modeling.instances.list( + instance_type="node", sources=annotation_state_view.as_view_id(), filter=call_id_filter, limit=-1 + ) + if not instances: + return pd.DataFrame() + + view_id_tuple = annotation_state_view.as_view_id() + file_ids = [ + instance.properties.get(view_id_tuple, {}).get("linkedFile", {}).get("externalId") + for instance in instances + if instance.properties.get(view_id_tuple, {}).get("linkedFile", {}).get("externalId") + ] + return pd.DataFrame(file_ids, columns=["File External ID"]) + except Exception as e: + st.error(f"Failed to query files by call ID: {e}") + return pd.DataFrame() + + +def calculate_overview_kpis(df: pd.DataFrame) -> dict: + """Calculates high-level KPIs from the AnnotationState dataframe.""" + kpis = {"awaiting_processing": 0, "processed_total": 0, "failed_total": 0, "failure_rate_total": 0} + if df.empty: + return kpis + kpis["awaiting_processing"] = len(df[df["status"].isin(["New", "Retry", "Processing", "Finalizing"])]) + finalized_all_time = df[df["status"].isin(["Annotated", "Failed"])] + kpis["processed_total"] = len(finalized_all_time) + kpis["failed_total"] = len(finalized_all_time[finalized_all_time["status"] == "Failed"]) + if kpis["processed_total"] > 0: + kpis["failure_rate_total"] = (kpis["failed_total"] / kpis["processed_total"]) * 100 + return kpis + + +def filter_log_lines(log_text: str, search_string: str) -> str: + """ + Takes a block of log text and a search string, returning a new string + containing the lines that include the search string, plus the subsequent + indented lines that provide context. 
+ """ + if not log_text or not isinstance(log_text, str): + return "Log content is not available or in an invalid format." + relevant_blocks, lines = [], log_text.splitlines() + for i, line in enumerate(lines): + if search_string in line: + current_block = [line] + next_line_index = i + 1 + while next_line_index < len(lines): + next_line = lines[next_line_index] + if next_line.strip().startswith("-") or "\t" in next_line or " " in next_line: + current_block.append(next_line) + next_line_index += 1 + else: + break + relevant_blocks.append("\n".join(current_block)) + return "\n\n".join(relevant_blocks) + + +# --- Remove all non-alphanumeric characters, convert to lowercase, and strip leading zeros from numbers --- +def normalize(s): + """ + Normalizes a string by: + 1. Ensuring it's a string. + 2. Removing all non-alphanumeric characters. + 3. Converting to lowercase. + 4. Removing leading zeros from any sequence of digits found within the string. + """ + if not isinstance(s, str): + return "" + + # Step 1: Basic cleaning (e.g., "V-0912" -> "v0912") + s = re.sub(r"[^a-zA-Z0-9]", "", s).lower() + + # Step 2: Define a replacer function that converts any matched number to an int and back to a string + def strip_leading_zeros(match): + # match.group(0) is the matched string (e.g., "0912") + return str(int(match.group(0))) + + # Step 3: Apply the replacer function to all sequences of digits (\d+) in the string + # This turns "v0912" into "v912" + return re.sub(r"\d+", strip_leading_zeros, s) + + +@st.cache_data(ttl=600) +def fetch_potential_annotations(db_name: str, table_name: str, file_external_id: str) -> pd.DataFrame: + """Fetches potential annotations for a specific file from the patterns RAW table.""" + try: + rows = client.raw.rows.list( + db_name=db_name, table_name=table_name, limit=-1, filter={"startNode": file_external_id} + ) + if not rows: + return pd.DataFrame() + return pd.DataFrame([row.columns for row in rows]) + except Exception as e: + st.error(f"Failed 
to fetch potential annotations: {e}") + return pd.DataFrame() + + +@st.cache_data(ttl=3600) +def fetch_entities(entity_view: ViewPropertyConfig, resource_property: str) -> pd.DataFrame: + """ + Fetches entity instances from the specified data model view and returns a tidy DataFrame. + """ + instances = client.data_modeling.instances.list( + instance_type="node", space=entity_view.instance_space, sources=entity_view.as_view_id(), limit=-1 + ) + + if not instances: + return pd.DataFrame() + + data = [] + + for instance in instances: + props = instance.properties.get(entity_view.as_view_id(), {}) or {} + row = {"externalId": instance.external_id, "space": instance.space} + + row["name"] = props.get("name") + row["resourceType"] = props.get(resource_property) + row["sysUnit"] = props.get("sysUnit") + + for k, v in props.items(): + if k not in row: + row[k] = v + + data.append(row) + + return pd.DataFrame(data) + + +def show_connect_unmatched_ui( + tag_text, + file_view, + target_entities_view, + file_resource_property, + target_entities_resource_property, + associated_files, + tab, + db_name, + pattern_table, + apply_config, +): + """ + Displays the UI to connect a single unmatched tag to either an Asset or a File. 
+ """ + st.markdown(f"### Tag to Connect: `{tag_text}`") + st.markdown(f"Associated Files: `{associated_files}`") + col1, col2 = st.columns(2) + entity_type = None + + with col1: + if st.button("Retrieve Assets", key=f"btn_retrieve_assets_{tab}"): + st.session_state.selected_entity_type_to_connect = "asset" + st.session_state.selected_entity_to_connect_index = None + with col2: + if st.button("Retrieve Files", key=f"btn_retrieve_files_{tab}"): + st.session_state.selected_entity_type_to_connect = "file" + st.session_state.selected_entity_to_connect_index = None + + entity_type = st.session_state.selected_entity_type_to_connect + + if not entity_type: + return + + if entity_type == "file": + entity_view = file_view + resource_property = file_resource_property + annotation_type = "diagrams.FileLink" + else: + entity_view = target_entities_view + resource_property = target_entities_resource_property + annotation_type = "diagrams.AssetLink" + + df_entities = fetch_entities(entity_view, resource_property) + + if df_entities.empty: + st.warning(f"No {entity_type}s found.") + return + + df_entities_display = df_entities.copy() + df_entities_display.insert(0, "Select", False) + + if st.session_state.selected_entity_to_connect_index is not None: + idx = st.session_state.selected_entity_to_connect_index + + if idx in df_entities_display.index: + df_entities_display.loc[:, "Select"] = False + df_entities_display.at[idx, "Select"] = True + + filterable_columns = [col for col in ["sysUnit", "resourceType"] if col in df_entities_display.columns] + + for filterable_column in filterable_columns: + unique_values = sorted(df_entities_display[filterable_column].dropna().unique().tolist()) + + selected_value = st.selectbox( + f"Filter by {filterable_column}", + key=f"sb_filterable_column_{filterable_column}_{tab}", + options=[None] + unique_values, + index=0, + ) + + if selected_value: + df_entities_display = df_entities_display[df_entities_display[filterable_column] == selected_value] 
+ + all_columns = df_entities_display.columns.tolist() + default_columns = ["Select", "name", "resourceType", "sysUnit", "externalId"] + + with st.popover("Customize Table Columns"): + selected_columns = st.multiselect( + f"Select columns to display ({entity_type}s)", + options=all_columns, + default=[col for col in default_columns if col in all_columns], + key=f"ms_selected_columns_{tab}_{entity_type}", + ) + + entity_editor_key = f"{entity_type}_editor_{tag_text}_{tab}" + edited_entities = st.data_editor( + df_entities_display[selected_columns], + key=entity_editor_key, + column_config={ + "Select": st.column_config.CheckboxColumn(required=True), + "name": "Name", + "externalId": "External ID", + "resourceType": "Resource Type", + "sysUnit": "Sys Unit", + }, + use_container_width=True, + hide_index=True, + disabled=df_entities_display.columns.difference(["Select"]), + ) + + selected_indices = edited_entities[edited_entities.Select].index.tolist() + + if len(selected_indices) > 1: + new_selection = [idx for idx in selected_indices if idx != st.session_state.selected_entity_to_connect_index] + st.session_state.selected_entity_to_connect_index = new_selection[0] if new_selection else None + st.rerun() + elif len(selected_indices) == 1: + st.session_state.selected_entity_to_connect_index = selected_indices[0] + elif len(selected_indices) == 0 and st.session_state.selected_entity_to_connect_index is not None: + st.session_state.selected_entity_to_connect_index = None + st.rerun() + + if st.session_state.selected_entity_to_connect_index is not None: + selected_entity = df_entities.loc[st.session_state.selected_entity_to_connect_index] + if st.button( + f"Connect '{tag_text}' to '{selected_entity['name']}' in {str(len(associated_files)) + ' files' if len(associated_files) > 1 else str(len(associated_files)) + ' file'}", + key=f"btn_connect_tag_to_entities_{tab}", + ): + success, count, error = create_tag_connection( + client, + db_name, + pattern_table, + tag_text, + 
associated_files, + selected_entity, + annotation_type, + apply_config, + entity_view, + ) + + if success: + st.toast( + f"{count} annotation{'s' if count > 1 else ''} created from tag '{tag_text}' to {entity_type} '{selected_entity['name']}' " + f"in {len(associated_files)} file{'s' if len(associated_files) > 1 else ''}!", + icon=":material/check_small:", + ) + st.cache_data.clear() + else: + st.toast(body=f"Failed to connect tag '{tag_text}': {error}", icon=":material/error:") + + +def create_tag_connection( + client: CogniteClient, + db_name: str, + table_name: str, + tag_text: str, + associated_files: list[str], + selected_entity: pd.Series, + annotation_type: str, + apply_config: dict, + entity_view: ViewPropertyConfig, +): + updated_rows = [] + updated_edges = [] + + try: + rows = client.raw.rows.list(db_name=db_name, table_name=table_name, limit=-1) + + sink_node_space = apply_config["sinkNode"]["space"] + + for row in rows: + row_data = row.columns + + if row_data.get("startNodeText") == tag_text and row_data.get("startNode") in associated_files: + edge_external_id = row.key + file_id = row_data.get("startNode") + + updated_edges.append( + EdgeApply( + space=sink_node_space, + external_id=edge_external_id, + type=DirectRelationReference(space=row_data.get("viewSpace"), external_id=annotation_type), + start_node=DirectRelationReference(space=row_data.get("startNodeSpace"), external_id=file_id), + end_node=DirectRelationReference( + space=selected_entity.get("space"), external_id=selected_entity.get("externalId") + ), + ) + ) + + row_data["endNode"] = selected_entity["externalId"] + row_data["endNodeSpace"] = selected_entity["space"] + + resource_type = ( + selected_entity["resourceType"] if selected_entity["resourceType"] else entity_view.external_id + ) + + row_data["endNodeResourceType"] = resource_type + row_data["status"] = "Approved" + + updated_rows.append(RowWrite(key=edge_external_id, columns=row_data)) + + if updated_rows: + 
client.raw.rows.insert(db_name=db_name, table_name=table_name, row=updated_rows, ensure_parent=True)
+
+        if updated_edges:
+            client.data_modeling.instances.apply(edges=updated_edges, replace=False)
+
+        return True, len(updated_rows), None
+    except Exception as e:
+        return False, 0, str(e)
+
+
+def build_unmatched_tags_with_regions(df: pd.DataFrame, file_id: str, potential_new_annotations: list[str]):
+    df_filtered = df[(df["startNode"] == file_id) & (df["startNodeText"].isin(potential_new_annotations))]
+
+    unmatched_tags_with_regions = []
+
+    for _, row in df_filtered.iterrows():
+        region = {
+            "vertices": [
+                {"x": row["startNodeXMin"], "y": row["startNodeYMin"]},
+                {"x": row["startNodeXMax"], "y": row["startNodeYMin"]},
+                {"x": row["startNodeXMax"], "y": row["startNodeYMax"]},
+                {"x": row["startNodeXMin"], "y": row["startNodeYMax"]},
+            ]
+        }
+
+        unmatched_tags_with_regions.append({"text": row["startNodeText"], "regions": [region]})
+
-    return pd.concat([df_launch, df_finalize], ignore_index=True)
+    return unmatched_tags_with_regions
diff --git a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/pages/Annotation_Quality.py b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/pages/Annotation_Quality.py
new file mode 100644
index 00000000..a9757f56
--- /dev/null
+++ b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/pages/Annotation_Quality.py
@@ -0,0 +1,790 @@
+import streamlit as st
+import pandas as pd
+import altair as alt
+from datetime import datetime, timezone
+from helper import (
+    fetch_extraction_pipeline_config,
+    fetch_raw_table_data,
+    find_pipelines,
+    generate_file_canvas,
+    fetch_pattern_catalog,
+    fetch_manual_patterns,
+    fetch_annotation_states,
+    save_manual_patterns,
+    normalize,
+    show_connect_unmatched_ui,
+    build_unmatched_tags_with_regions,
+)
+from cognite.client.data_classes.data_modeling import NodeId
+
+# --- Page Configuration ---
+st.set_page_config( + page_title="Annotation Quality", + page_icon="🎯", + layout="wide", +) + + +# --- Callback function to reset selection --- +def reset_selection(): + st.session_state.selected_row_index = None + + +# --- Initialize Session State --- +if "selected_row_index" not in st.session_state: + st.session_state.selected_row_index = None +if "selected_unmatched_per_file_index" not in st.session_state: + st.session_state.selected_unmatched_per_file_index = None +if "selected_unmatched_overall_index" not in st.session_state: + st.session_state.selected_unmatched_overall_index = None +if "selected_entity_to_connect_index" not in st.session_state: + st.session_state.selected_entity_to_connect_index = None +if "selected_entity_type_to_connect" not in st.session_state: + st.session_state.selected_entity_type_to_connect = None + +# --- Sidebar for Pipeline Selection --- +st.sidebar.title("Pipeline Selection") +pipeline_ids = find_pipelines() + +if not pipeline_ids: + st.info("No active file annotation pipelines found to monitor.") + st.stop() + +selected_pipeline = st.sidebar.selectbox("Select a pipeline:", options=pipeline_ids, key="quality_pipeline_selector") + +# --- Data Fetching & Processing --- +config_result = fetch_extraction_pipeline_config(selected_pipeline) +if not config_result: + st.error(f"Could not fetch configuration for pipeline: {selected_pipeline}") + st.stop() + +ep_config, view_config = config_result + +annotation_state_view = view_config["annotation_state"] +file_view = view_config["file"] +target_entities_view = view_config["target_entities"] + +apply_config = ep_config.get("finalizeFunction", {}).get("applyService", {}) +cache_config = ep_config.get("launchFunction", {}).get("cacheService", {}) +db_name = apply_config.get("rawDb") +pattern_table = apply_config.get("rawTableDocPattern") +tag_table = apply_config.get("rawTableDocTag") +doc_table = apply_config.get("rawTableDocDoc") +cache_table = cache_config.get("rawTableCache") 
+manual_patterns_table = cache_config.get("rawManualPatternsCatalog") +file_resource_property = ep_config.get("launchFunction", {}).get("fileResourceProperty", "") +target_entities_resource_property = ep_config.get("launchFunction", {}).get("targetEntitiesResourceProperty", "") + +if not all([db_name, pattern_table, tag_table, doc_table, cache_table, manual_patterns_table]): + st.error("Could not find all required RAW table names in the pipeline configuration.") + st.stop() + +# --- Main Application --- +st.title("Annotation Quality Dashboard") + +# --- Create Tabs --- +overall_tab, per_file_tab, management_tab = st.tabs( + ["Overall Quality Metrics", "Per-File Analysis", "Pattern Management"] +) + +# ========================================== +# OVERALL QUALITY METRICS TAB +# ========================================== +with overall_tab: + df_patterns = fetch_raw_table_data(db_name, pattern_table) + df_tags = fetch_raw_table_data(db_name, tag_table) + df_docs = fetch_raw_table_data(db_name, doc_table) + + if df_patterns.empty: + st.info("The pattern catalog is empty. Run the pipeline with patternMode enabled to generate data.") + else: + df_annotations = pd.concat([df_tags, df_docs], ignore_index=True) + + st.subheader( + "Overall Annotation Quality", + help="Provides a high-level summary of pattern performance across all files. 
Use these aggregate metrics, charts, and tag lists to understand the big picture and identify systemic trends or gaps in the pattern catalog.", + ) + all_resource_types = ["All"] + sorted(df_patterns["endNodeResourceType"].unique().tolist()) + selected_resource_type = st.selectbox( + "Filter by Resource Type:", + options=all_resource_types, + on_change=reset_selection, + key="resource_type_filter", + ) + + if selected_resource_type == "All": + df_metrics_input = df_patterns + df_annotations_input = df_annotations + else: + df_metrics_input = df_patterns[df_patterns["endNodeResourceType"] == selected_resource_type] + if not df_annotations.empty and "endNodeResourceType" in df_annotations.columns: + df_annotations_input = df_annotations[df_annotations["endNodeResourceType"] == selected_resource_type] + else: + df_annotations_input = pd.DataFrame() + + # 1. Get the original, un-normalized sets of strings + potential_tags_original = set(df_metrics_input["startNodeText"]) + actual_annotations_original = ( + set(df_annotations_input["startNodeText"]) + if not df_annotations_input.empty and "startNodeText" in df_annotations_input.columns + else set() + ) + + # 2. Create a mapping from normalized text back to an original version for display + text_map = { + normalize(text): text for text in potential_tags_original.union(actual_annotations_original) if text + } + + # 3. Create fully normalized sets for logical comparison + normalized_potential_set = {normalize(t) for t in potential_tags_original} + normalized_actual_set = {normalize(t) for t in actual_annotations_original} + + # 4. Perform all set operations on the normalized data for accurate logic + normalized_unmatched = normalized_potential_set - normalized_actual_set + + # 5. 
Use the map to get the final sets with original text for display + actual_annotations_set = {text_map[t] for t in normalized_actual_set if t in text_map} + potential_new_annotations_set = {text_map[t] for t in normalized_unmatched if t in text_map} + + total_actual = len(actual_annotations_set) + total_potential = len(potential_new_annotations_set) + + overall_coverage = ( + (total_actual / (total_actual + total_potential)) * 100 if (total_actual + total_potential) > 0 else 0 + ) + + st.metric( + "Overall Annotation Coverage", + f"{overall_coverage:.2f}%", + help="The percentage of all unique tags (both actual and potential) that have been successfully annotated. Formula: Total Actual Annotations / (Total Actual Annotations + Total Potential New Annotations)", + ) + + st.divider() + chart_data = [] + for resource_type in all_resource_types[1:]: + df_patterns_filtered = df_patterns[df_patterns["endNodeResourceType"] == resource_type] + df_annotations_filtered = ( + df_annotations[df_annotations["endNodeResourceType"] == resource_type] + if not df_annotations.empty and "endNodeResourceType" in df_annotations.columns + else pd.DataFrame() + ) + potential = set(df_patterns_filtered["startNodeText"]) + actual = ( + set(df_annotations_filtered["startNodeText"]) + if not df_annotations_filtered.empty and "startNodeText" in df_annotations_filtered.columns + else set() + ) + # Use normalized comparison for chart data as well + norm_potential = {normalize(p) for p in potential} + norm_actual = {normalize(a) for a in actual} + + total_actual_rt = len(norm_actual) + total_potential_rt = len(norm_potential - norm_actual) + + coverage = ( + (total_actual_rt / (total_actual_rt + total_potential_rt)) * 100 + if (total_actual_rt + total_potential_rt) > 0 + else 0 + ) + chart_data.append( + { + "resourceType": resource_type, + "coverageRate": coverage, + "actualAnnotations": total_actual_rt, + "potentialNewAnnotations": total_potential_rt, + } + ) + + df_chart_data = 
pd.DataFrame(chart_data) + df_chart_display = ( + df_chart_data[df_chart_data["resourceType"] == selected_resource_type] + if selected_resource_type != "All" + else df_chart_data + ) + + if not df_chart_display.empty: + coverage_chart = ( + alt.Chart(df_chart_display) + .mark_bar() + .encode( + x=alt.X("resourceType:N", title="Resource Type", sort="-y"), + y=alt.Y("coverageRate:Q", title="Annotation Coverage (%)", scale=alt.Scale(domain=[0, 100])), + tooltip=["resourceType", "coverageRate", "actualAnnotations", "potentialNewAnnotations"], + ) + .properties(title="Annotation Coverage by Resource Type") + ) + st.altair_chart(coverage_chart, use_container_width=True) + + else: + st.info("No data available for the selected resource type to generate charts.") + + st.divider() + # --- Pattern Catalog --- + with st.expander("View Full Pattern Catalog"): + df_auto_patterns = fetch_pattern_catalog(db_name, cache_table) + df_manual_patterns = fetch_manual_patterns(db_name, manual_patterns_table) + + df_auto_patterns.rename(columns={"resourceType": "resource_type", "pattern": "sample"}, inplace=True) + df_combined_patterns = ( + pd.concat( + [df_auto_patterns[["resource_type", "sample"]], df_manual_patterns[["resource_type", "sample"]]] + ) + .drop_duplicates() + .sort_values(by=["resource_type", "sample"]) + ) + + if df_combined_patterns.empty: + st.info("Pattern catalog is empty or could not be loaded.") + else: + resource_types = sorted(df_combined_patterns["resource_type"].unique()) + tabs = st.tabs(resource_types) + for i, resource_type in enumerate(resource_types): + with tabs[i]: + df_filtered_patterns = df_combined_patterns[ + df_combined_patterns["resource_type"] == resource_type + ] + st.dataframe( + df_filtered_patterns[["sample"]], + use_container_width=True, + hide_index=True, + column_config={"sample": "Pattern"}, + ) + tag_col1, tag_col2 = st.columns(2) + with tag_col1: + st.metric( + "βœ… Actual Annotations", + f"{total_actual}", + help="A list of all unique 
tags that have been successfully created. This is our ground truth.", + ) + st.dataframe( + pd.DataFrame(sorted(list(actual_annotations_set)), columns=["Tag"]), + use_container_width=True, + hide_index=True, + ) + with tag_col2: + st.metric( + "πŸ’‘ Potential New Annotations", + f"{total_potential}", + help="A list of all unique tags found by the pattern-mode job that do not yet exist as actual annotations. This is now a clean 'to-do list' of tags that could be promoted or used to create new patterns.", + ) + + unmatched_display = pd.DataFrame(sorted(list(potential_new_annotations_set)), columns=["text"]) + unmatched_display.insert(0, "Select", False) + + if st.session_state.selected_unmatched_overall_index is not None: + idx = st.session_state.selected_unmatched_overall_index + + if idx in unmatched_display.index: + unmatched_display.loc[:, "Select"] = False + unmatched_display.at[idx, "Select"] = True + + unmatched_tags_list = list(potential_new_annotations_set) + df_unmatched_filtered = df_metrics_input[df_metrics_input["startNodeText"].isin(unmatched_tags_list)] + + tag_to_files_unmatched = ( + df_unmatched_filtered.groupby("startNodeText")["startNode"].unique().apply(list).to_dict() + ) + + tag_occurrences = ( + df_unmatched_filtered.groupby("startNodeText")["startNode"] + .count() + .reset_index() + .rename(columns={"startNode": "occurrenceCount"}) + ) + + tag_file_counts = ( + df_unmatched_filtered.groupby("startNodeText")["startNode"] + .nunique() + .reset_index() + .rename(columns={"startNode": "fileCount"}) + ) + + tag_stats = tag_file_counts.merge(tag_occurrences, on="startNodeText", how="outer") + + unmatched_display = unmatched_display.merge(tag_stats, left_on="text", right_on="startNodeText", how="left") + unmatched_display.drop(columns=["startNodeText"], inplace=True) + + unmatched_editor_key = "overall_unmatched_tags_editor" + unmatched_data_editor = st.data_editor( + unmatched_display, + key=unmatched_editor_key, + column_config={ + "Select": 
st.column_config.CheckboxColumn(required=True), + "text": "Tag", + "fileCount": "Associated Files", + "occurrenceCount": "Occurrences", + }, + use_container_width=True, + hide_index=True, + disabled=unmatched_display.columns.difference(["Select"]), + ) + + selected_indices = unmatched_data_editor[unmatched_data_editor.Select].index.tolist() + + if len(selected_indices) > 1: + new_selection = [ + idx for idx in selected_indices if idx != st.session_state.selected_unmatched_overall_index + ] + st.session_state.selected_unmatched_overall_index = new_selection[0] if new_selection else None + st.rerun() + elif len(selected_indices) == 1: + st.session_state.selected_unmatched_overall_index = selected_indices[0] + elif len(selected_indices) == 0 and st.session_state.selected_unmatched_overall_index is not None: + st.session_state.selected_unmatched_overall_index = None + st.rerun() + + if st.session_state.selected_unmatched_overall_index is not None: + selected_tag_row = unmatched_display.loc[st.session_state.selected_unmatched_overall_index] + selected_tag_text = selected_tag_row["text"] + + show_connect_unmatched_ui( + selected_tag_text, + file_view, + target_entities_view, + file_resource_property, + target_entities_resource_property, + associated_files=tag_to_files_unmatched.get(selected_tag_text, []), + tab="overall", + db_name=db_name, + pattern_table=pattern_table, + apply_config=apply_config, + ) + + +# ========================================== +# PER-FILE ANALYSIS TAB +# ========================================== +with per_file_tab: + st.subheader( + "Per-File Annotation Quality", + help="A deep-dive tool for investigating the quality scores of individual files. 
Filter the table to find specific examples of high or low performance, then select a file to see a detailed breakdown of its specific matched, unmatched, and missed tags.", + ) + + df_patterns_file = fetch_raw_table_data(db_name, pattern_table) + df_tags_file = fetch_raw_table_data(db_name, tag_table) + df_docs_file = fetch_raw_table_data(db_name, doc_table) + + if df_patterns_file.empty: + st.info("The pattern catalog is empty. Run the pipeline with patternMode enabled to generate data.") + else: + df_annotations_file = pd.concat([df_tags_file, df_docs_file], ignore_index=True) + df_patterns_agg_file = ( + df_patterns_file.groupby("startNode")["startNodeText"].apply(set).reset_index(name="potentialTags") + ) + df_annotations_agg_file = ( + df_annotations_file.groupby("startNode")["startNodeText"].apply(set).reset_index(name="actualAnnotations") + if not df_annotations_file.empty + else pd.DataFrame(columns=["startNode", "actualAnnotations"]) + ) + + df_quality_file = pd.merge(df_patterns_agg_file, df_annotations_agg_file, on="startNode", how="left") + df_quality_file["actualAnnotations"] = df_quality_file["actualAnnotations"].apply( + lambda x: x if isinstance(x, set) else set() + ) + + # Apply normalized comparison for per-file metrics + def calculate_metrics(row): + potential = row["potentialTags"] + actual = row["actualAnnotations"] + norm_potential = {normalize(p) for p in potential} + norm_actual = {normalize(a) for a in actual} + + total_actual_pf = len(norm_actual) + total_potential_pf = len(norm_potential - norm_actual) + + return total_actual_pf, total_potential_pf + + metrics = df_quality_file.apply(calculate_metrics, axis=1, result_type="expand") + df_quality_file[["actualAnnotationsCount", "potentialNewAnnotationsCount"]] = metrics + + df_quality_file["coverageRate"] = ( + ( + df_quality_file["actualAnnotationsCount"] + / (df_quality_file["actualAnnotationsCount"] + df_quality_file["potentialNewAnnotationsCount"]) + ) + * 100 + ).fillna(0) + + 
df_file_meta = fetch_annotation_states(annotation_state_view, file_view) + df_display_unfiltered = ( + pd.merge(df_quality_file, df_file_meta, left_on="startNode", right_on="fileExternalId", how="left") + if not df_file_meta.empty + else df_quality_file + ) + + with st.expander("Filter Per-File Quality Table"): + excluded_columns = [ + "Select", + "startNode", + "potentialTags", + "actualAnnotations", + "actualAnnotationsCount", + "potentialNewAnnotationsCount", + "coverageRate", + "externalId", + "space", + "annotatedPageCount", + "annotationMessage", + "fileAliases", + "fileAssets", + "fileIsuploaded", + "jobId", + "linkedFile", + "pageCount", + "patternModeJobId", + "sourceCreatedUser", + "sourceCreatedTime", + "sourceUpdatedTime", + "sourceUpdatedUser", + "fileSourceupdateduser", + "fileSourcecreatedUser", + "fileSourceId", + "createdTime", + "fileSourcecreateduser", + "patternModeMessage", + "fileSourceupdatedtime", + "fileSourcecreatedtime", + "fileUploadedtime", + ] + filterable_columns = sorted([col for col in df_display_unfiltered.columns if col not in excluded_columns]) + filter_col1, filter_col2 = st.columns(2) + with filter_col1: + selected_column = st.selectbox( + "Filter by Metadata Property", + options=["None"] + filterable_columns, + on_change=reset_selection, + key="per_file_filter", + ) + selected_values = [] + if selected_column != "None": + unique_values = sorted(df_display_unfiltered[selected_column].dropna().unique().tolist()) + with filter_col2: + selected_values = st.multiselect( + f"Select Value(s) for {selected_column}", options=unique_values, on_change=reset_selection + ) + coverage_range = st.slider( + "Filter by Annotation Coverage (%)", 0, 100, (0, 100), on_change=reset_selection, key="coverage_slider" + ) + + df_display = df_display_unfiltered.copy() + if selected_column != "None" and selected_values: + df_display = df_display[df_display[selected_column].isin(selected_values)] + df_display = df_display[ + (df_display["coverageRate"] 
>= coverage_range[0]) & (df_display["coverageRate"] <= coverage_range[1]) + ] + df_display = df_display.reset_index(drop=True) + df_display.insert(0, "Select", False) + + default_columns = [ + "Select", + "fileName", + "fileSourceid", + "fileMimetype", + "coverageRate", + "annotationMessage", + "patternModeMessage", + "lastUpdatedTime", + ] + all_columns = df_display.columns.tolist() + + with st.popover("Customize Table Columns"): + selected_columns = st.multiselect( + "Select columns to display:", + options=all_columns, + default=[col for col in default_columns if col in all_columns], + ) + if not selected_columns: + st.warning("Please select at least one column to display.") + st.stop() + + if st.session_state.get("selected_row_index") is not None and st.session_state.selected_row_index < len( + df_display + ): + df_display.at[st.session_state.selected_row_index, "Select"] = True + + edited_df = st.data_editor( + df_display[selected_columns], + key="quality_table_editor", + column_config={ + "Select": st.column_config.CheckboxColumn(required=True), + "fileName": "File Name", + "fileSourceid": "Source ID", + "fileMimetype": "Mime Type", + "fileExternalId": "File External ID", + "coverageRate": st.column_config.ProgressColumn( + "Annotation Coverage ℹ️", + help="The percentage of all unique tags (both actual and potential) that have been successfully annotated. 
Formula: Total Actual Annotations / (Total Actual Annotations + Total Potential New Annotations)", + format="%.2f%%", + min_value=0, + max_value=100, + ), + "annotationMessage": "Annotation Message", + "patternModeMessage": "Pattern Mode Message", + "lastUpdatedTime": "Last Updated Time", + }, + use_container_width=True, + hide_index=True, + disabled=df_display.columns.difference(["Select"]), + ) + + selected_indices = edited_df[edited_df.Select].index.tolist() + if len(selected_indices) > 1: + new_selection = [idx for idx in selected_indices if idx != st.session_state.get("selected_row_index")] + st.session_state.selected_row_index = new_selection[0] if new_selection else None + st.rerun() + elif len(selected_indices) == 1: + st.session_state.selected_row_index = selected_indices[0] + elif len(selected_indices) == 0 and st.session_state.get("selected_row_index") is not None: + st.session_state.selected_row_index = None + st.rerun() + + st.divider() + if st.session_state.get("selected_row_index") is not None and st.session_state.selected_row_index < len( + df_display + ): + selected_file_data = df_display.iloc[st.session_state.selected_row_index] + selected_file = selected_file_data["startNode"] + st.markdown(f"**Displaying Tag Comparison for file:** `{selected_file}`") + file_space_series = df_patterns_file[df_patterns_file["startNode"] == selected_file]["startNodeSpace"] + if not file_space_series.empty: + file_space = file_space_series.iloc[0] + file_node_id = NodeId(space=file_space, external_id=selected_file) + df_potential_tags_details = df_patterns_file[df_patterns_file["startNode"] == selected_file][ + ["startNodeText", "endNodeResourceType"] + ] + df_actual_annotations_details = ( + df_annotations_file[df_annotations_file["startNode"] == selected_file][ + ["startNodeText", "endNodeResourceType"] + ] + if not df_annotations_file.empty + else pd.DataFrame(columns=["startNodeText", "endNodeResourceType"]) + ) + + # Use normalized comparison for per-file 
detail view
+                    potential_set = set(df_potential_tags_details["startNodeText"])
+                    actual_set = set(df_actual_annotations_details["startNodeText"])
+                    norm_potential = {normalize(p) for p in potential_set}
+                    norm_actual = {normalize(a) for a in actual_set}
+
+                    # We need a map from normalized text back to original for accurate filtering
+                    potential_map = {normalize(text): text for text in potential_set}
+
+                    norm_unmatched = norm_potential - norm_actual
+
+                    potential_new_annotations_set = {potential_map[t] for t in norm_unmatched if t in potential_map}
+
+                    actual_df = df_actual_annotations_details.drop_duplicates()
+                    potential_df = df_potential_tags_details[
+                        df_potential_tags_details["startNodeText"].isin({potential_map[t] for t in norm_unmatched})
+                    ].drop_duplicates(subset=["startNodeText", "endNodeResourceType"])
+
+                    if st.button("Create in Canvas", key=f"canvas_btn_{selected_file}"):
+                        with st.spinner("Generating Industrial Canvas with bounding boxes..."):
+                            # Rebuild bounding-box regions for the unmatched tags from the
+                            # pattern RAW table so the generated canvas can highlight each occurrence.
+ potential_tags_for_canvas = build_unmatched_tags_with_regions( + df=df_metrics_input, + file_id=selected_file, + potential_new_annotations=potential_new_annotations_set, + ) + canvas_url = generate_file_canvas( + file_id=file_node_id, + file_view=file_view, + ep_config=ep_config, + unmatched_tags_with_regions=potential_tags_for_canvas, + ) + if canvas_url: + st.session_state["generated_canvas_url"] = canvas_url + else: + st.session_state.pop("generated_canvas_url", None) + + if "generated_canvas_url" in st.session_state and st.session_state.generated_canvas_url: + st.markdown( + f"**[Open Last Generated Canvas]({st.session_state.generated_canvas_url})**", + unsafe_allow_html=True, + ) + + st.divider() + col1, col2 = st.columns(2) + with col1: + st.metric( + "βœ… Actual Annotations in this File", + len(actual_df), + ) + st.dataframe( + actual_df, + column_config={"startNodeText": "Tag", "endNodeResourceType": "Resource Type"}, + use_container_width=True, + hide_index=True, + ) + with col2: + st.metric( + "πŸ’‘ Potential New Annotations in this File", + len(potential_df), + ) + + unmatched_display = potential_df[["startNodeText", "endNodeResourceType"]].copy() + unmatched_display.insert(0, "Select", False) + + occurrences = ( + df_patterns_file[df_patterns_file["startNode"] == selected_file] + .groupby("startNodeText") + .size() + .reset_index(name="occurrenceCount") + ) + + unmatched_display = unmatched_display.merge(occurrences, on="startNodeText", how="left") + + if st.session_state.selected_unmatched_per_file_index is not None: + idx = st.session_state.selected_unmatched_per_file_index + + if idx in unmatched_display.index: + unmatched_display.loc[:, "Select"] = False + unmatched_display.at[idx, "Select"] = True + + unmatched_editor_key = "unmatched_tags_editor" + unmatched_data_editor = st.data_editor( + unmatched_display, + key=unmatched_editor_key, + column_config={ + "Select": st.column_config.CheckboxColumn(required=True), + "startNodeText": "Tag", + 
"endNodeResourceType": "Resource Type", + "occurrenceCount": "Occurrences", + }, + use_container_width=True, + hide_index=True, + disabled=unmatched_display.columns.difference(["Select"]), + ) + + selected_indices = unmatched_data_editor[unmatched_data_editor.Select].index.tolist() + + if len(selected_indices) > 1: + new_selection = [ + idx for idx in selected_indices if idx != st.session_state.selected_unmatched_per_file_index + ] + st.session_state.selected_unmatched_per_file_index = new_selection[0] if new_selection else None + st.rerun() + elif len(selected_indices) == 1: + st.session_state.selected_unmatched_per_file_index = selected_indices[0] + elif len(selected_indices) == 0 and st.session_state.selected_unmatched_per_file_index is not None: + st.session_state.selected_unmatched_per_file_index = None + st.rerun() + + if st.session_state.selected_unmatched_per_file_index is not None: + selected_tag_row = unmatched_display.loc[st.session_state.selected_unmatched_per_file_index] + selected_tag_text = selected_tag_row["startNodeText"] + show_connect_unmatched_ui( + selected_tag_text, + file_view, + target_entities_view, + file_resource_property, + target_entities_resource_property, + associated_files=[selected_file], + tab="per_file", + db_name=db_name, + pattern_table=pattern_table, + apply_config=apply_config, + ) + + else: + st.info("βœ”οΈ Select a file in the table above to see a detailed breakdown of its tags.") + +# ========================================== +# PATTERN MANAGEMENT TAB +# ========================================== +with management_tab: + primary_scope_prop = ep_config.get("launchFunction", {}).get("primaryScopeProperty") + secondary_scope_prop = ep_config.get("launchFunction", {}).get("secondaryScopeProperty") + + st.subheader( + "Existing Manual Patterns", + help="An action-oriented tool for improving pattern quality. 
After identifying missed tags in the other tabs, come here to add new manual patterns or edit existing ones to enhance the detection logic for future pipeline runs.", + ) + df_manual_patterns_manage = fetch_manual_patterns(db_name, manual_patterns_table) + + edited_df_manage = st.data_editor( + df_manual_patterns_manage, + num_rows="dynamic", + use_container_width=True, + column_config={ + "key": st.column_config.TextColumn("Scope Key", disabled=True), + "sample": st.column_config.TextColumn("Pattern String", required=True), + "annotation_type": st.column_config.SelectboxColumn( + "Annotation Type", options=["diagrams.FileLink", "diagrams.AssetLink"], required=True + ), + "resource_type": st.column_config.TextColumn("Resource Type", required=True), + "scope_level": st.column_config.SelectboxColumn( + "Scope Level", + options=["Global", "Primary Scope", "Secondary Scope"], + required=True, + ), + "primary_scope": st.column_config.TextColumn("Primary Scope"), + "secondary_scope": st.column_config.TextColumn("Secondary Scope"), + "created_by": st.column_config.TextColumn("Created By", required=True), + }, + ) + + if st.button("Save Changes", type="primary", key="save_patterns"): + with st.spinner("Saving changes to RAW..."): + try: + save_manual_patterns(edited_df_manage, db_name, manual_patterns_table) + st.success("Changes saved successfully!") + st.cache_data.clear() + st.rerun() + except Exception as e: + st.error(f"Failed to save changes: {e}") + + st.divider() + + st.subheader("Add a New Pattern") + scope_level = st.selectbox( + "1. Select Scope Level", ["Global", "Primary Scope", "Secondary Scope"], key="scope_level_selector" + ) + + with st.form(key="new_pattern_form", clear_on_submit=True): + st.write("2. 
Enter Pattern Details")
+        new_pattern = st.text_input("Pattern String", placeholder="e.g., [PI]-00000")
+        new_annotation_type = st.selectbox(
+            "Annotation Type", ["diagrams.FileLink", "diagrams.AssetLink"], key="new_annotation_type_selector"
+        )
+        new_resource_type = st.text_input("Resource Type", placeholder="e.g., Asset")
+
+        primary_scope_value = ""
+        if scope_level in ["Primary Scope", "Secondary Scope"]:
+            primary_scope_value = st.text_input(f"Primary Scope Value ({primary_scope_prop or 'not configured'})")
+
+        secondary_scope_value = ""
+        if scope_level == "Secondary Scope":
+            secondary_scope_value = st.text_input(f"Secondary Scope Value ({secondary_scope_prop or 'not configured'})")
+
+        submit_button = st.form_submit_button(label="Add New Pattern")
+
+        if submit_button:
+            if not all([new_pattern, new_resource_type]):
+                st.warning("Pattern String and Resource Type are required.")
+            else:
+                with st.spinner("Adding new pattern..."):
+                    try:
+                        new_row = pd.DataFrame(
+                            [
+                                {
+                                    "sample": new_pattern,
+                                    "resource_type": new_resource_type,
+                                    "scope_level": scope_level,
+                                    "annotation_type": new_annotation_type,
+                                    "primary_scope": primary_scope_value,
+                                    "secondary_scope": secondary_scope_value,
+                                    "created_by": "streamlit",
+                                }
+                            ]
+                        )
+                        updated_df = pd.concat([edited_df_manage, new_row], ignore_index=True)
+
+                        save_manual_patterns(updated_df, db_name, manual_patterns_table)
+                        st.success("New pattern added successfully!")
+                        st.cache_data.clear()
+                        st.rerun()
+                    except Exception as e:
+                        st.error(f"Failed to add pattern: {e}")
diff --git a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/pages/Status_Overview.py b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/pages/Status_Overview.py
deleted file mode 100644
index 0a8c9bdc..00000000
--- a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/pages/Status_Overview.py
+++ /dev/null
@@ -1,127 +0,0 @@
-import streamlit as st
-import pandas as pd -from helper import ( - fetch_annotation_states, - fetch_extraction_pipeline_config, -) - -# --- Page Configuration --- -st.set_page_config( - page_title="Annotation Status Overview", - page_icon="πŸ“„", - layout="wide", -) - -# --- Data Fetching --- -ep_config, annotation_state_view, file_view = fetch_extraction_pipeline_config() -df_raw = fetch_annotation_states(annotation_state_view, file_view) - - -# --- Main Application --- -st.title("Annotation Status Overview") -st.markdown("This page provides an audit trail and overview of the file annotation process.") - -if not df_raw.empty: - # --- Sidebar Filters --- - st.sidebar.title("Filters") - - # Status Filter - all_statuses = ["All"] + sorted(df_raw["status"].unique().tolist()) - selected_status = st.sidebar.selectbox("Filter by Status", options=all_statuses) - - # Date Range Filter - min_date = df_raw["lastUpdatedTime"].min().date() - max_date = df_raw["lastUpdatedTime"].max().date() - # THE FIX IS HERE: Changed max_date to max_value - date_range = st.sidebar.date_input( - "Filter by Last Updated Date", - value=(min_date, max_date), - min_value=min_date, - max_value=max_date, - ) - - # Dynamic Scope Property Filters - primary_scope_property = ep_config["launchFunction"].get("primaryScopeProperty") - secondary_scope_property = ep_config["launchFunction"].get("secondaryScopeProperty") - - selected_primary_scope = "All" - if primary_scope_property and f"file{primary_scope_property.capitalize()}" in df_raw.columns: - primary_scope_options = ["All"] + df_raw[f"file{primary_scope_property.capitalize()}"].unique().tolist() - selected_primary_scope = st.sidebar.selectbox( - f"Filter by {primary_scope_property}", options=primary_scope_options - ) - - selected_secondary_scope = "All" - if secondary_scope_property and f"file{secondary_scope_property.capitalize()}" in df_raw.columns: - secondary_scope_options = ["All"] + df_raw[f"file{secondary_scope_property.capitalize()}"].unique().tolist() - 
selected_secondary_scope = st.sidebar.selectbox( - f"Filter by {secondary_scope_property}", options=secondary_scope_options - ) - - # Apply all filters - df_filtered = df_raw.copy() - if selected_status != "All": - df_filtered = df_filtered[df_filtered["status"] == selected_status] - - if len(date_range) == 2: - start_date, end_date = date_range - df_filtered = df_filtered[ - (df_filtered["lastUpdatedTime"].dt.date >= start_date) - & (df_filtered["lastUpdatedTime"].dt.date <= end_date) - ] - - if selected_primary_scope != "All": - df_filtered = df_filtered[df_filtered[f"file{primary_scope_property.capitalize()}"] == selected_primary_scope] - - if selected_secondary_scope != "All": - df_filtered = df_filtered[ - df_filtered[f"file{secondary_scope_property.capitalize()}"] == selected_secondary_scope - ] - - # --- Dashboard Metrics --- - st.subheader("Status Overview") - - status_counts = df_filtered["status"].value_counts() - - col1, col2, col3, col4 = st.columns(4) - with col1: - st.metric("Total Files", len(df_filtered)) - with col2: - st.metric("Annotated", status_counts.get("Annotated", 0)) - with col3: - st.metric("New", status_counts.get("New", 0)) - st.metric("Processing", status_counts.get("Processing", 0)) - with col4: - st.metric("Finalizing", status_counts.get("Finalizing", 0)) - st.metric("Failed", status_counts.get("Failed", 0)) - - # --- Detailed Data View --- - default_columns = [ - "fileName", - "status", - "jobId", - "annotationMessage", - "filePageCount", - "retries", - "fileTags", - "lastUpdatedTime", - ] - - available_columns = df_filtered.columns.tolist() - default_selection = [col for col in default_columns if col in available_columns] - - with st.popover("Customize Columns"): - selected_columns = st.multiselect( - "Select columns to display:", - options=available_columns, - default=default_selection, - label_visibility="collapsed", - ) - - if selected_columns: - st.dataframe(df_filtered[selected_columns], use_container_width=True) - else: - 
st.warning("Please select at least one column to display.") - -else: - st.info("No annotation state data returned from Cognite Data Fusion. Please check your settings and data model.") diff --git a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/requirements.txt b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/requirements.txt index be55ec1a..e3938ef4 100644 --- a/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/requirements.txt +++ b/modules/contextualization/cdf_file_annotation/streamlit/file_annotation_dashboard/requirements.txt @@ -1,5 +1,5 @@ -pandas -altair -PyYaml -pyodide-http==0.2.1 +pandas +altair +PyYaml +pyodide-http==0.2.1 cognite-sdk==7.73.4 \ No newline at end of file diff --git a/modules/contextualization/cdf_file_annotation/workflows/TRIGGER_ARCHITECTURE.md b/modules/contextualization/cdf_file_annotation/workflows/TRIGGER_ARCHITECTURE.md new file mode 100644 index 00000000..8ec6ee7e --- /dev/null +++ b/modules/contextualization/cdf_file_annotation/workflows/TRIGGER_ARCHITECTURE.md @@ -0,0 +1,401 @@ +# Data Modeling Event Triggers Architecture + +## Overview + +The file annotation workflows now use **data modeling event triggers** instead of scheduled triggers to eliminate wasteful serverless function executions. Triggers fire only when there's actual work to process, dramatically improving cost efficiency and responsiveness. 
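The tag- and status-based trigger filters documented in the sections that follow amount to a small state machine. As a rough mental model, the hypothetical sketch below (plain Python, not part of the module; the authoritative logic is the data modeling queries in `wf_file_annotation.WorkflowTrigger.yaml`) shows which trigger would pick up an instance in a given state:

```python
def next_trigger(tags, annotation_status=None, has_job_id=False):
    """Hypothetical sketch of which event trigger would fire for an instance.

    tags: the file's tags; annotation_status / has_job_id: the linked
    AnnotationState's annotationStatus and whether diagramDetectJobId exists.
    """
    blocked = {"AnnotationInProcess", "Annotated", "AnnotationFailed"}
    # Prepare: "ToAnnotate" present, no processing/terminal tag yet
    if "ToAnnotate" in tags and not blocked & set(tags):
        return "wf_prepare_trigger"
    # Launch: AnnotationState is ready for a diagram detect job
    if annotation_status in ("New", "Retry"):
        return "wf_launch_trigger"
    # Finalize: a detect job is in flight and has an ID to poll
    if annotation_status == "Processing" and has_job_id:
        return "wf_finalize_trigger"
    return None  # no work: no trigger fires, no function executes


# A freshly uploaded file matches only the prepare trigger:
assert next_trigger({"ToAnnotate"}) == "wf_prepare_trigger"
# Once tagged AnnotationInProcess, it no longer re-triggers prepare:
assert next_trigger({"ToAnnotate", "AnnotationInProcess"}) is None
```

Each function's first action moves the instance out of the state its own trigger matches, which is what prevents re-triggering loops.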
+ +## Architecture + +### Trigger Flow + +``` +Files uploaded with "ToAnnotate" tag + ↓ (triggers v1_prepare) +Prepare Function creates AnnotationState with status="New" + ↓ (triggers v1_launch) +Launch Function creates diagram detect jobs, sets status="Processing" + ↓ (triggers v1_finalize) +Finalize Function processes results, sets status="Annotated"/"Failed" + └─ (if pattern-mode enabled) Creates annotation edges with status="Suggested" + ↓ (triggers v1_promote) +Promote Function attempts to resolve pattern-mode annotations to actual entities +``` + +### Trigger Configurations + +#### 1. Prepare Trigger (`wf_prepare_trigger`) + +**Fires when:** Files have `tags` containing "ToAnnotate" WITHOUT ["AnnotationInProcess", "Annotated", "AnnotationFailed"] + +**Batch Config:** + +- Size: 100 files +- Timeout: 60 seconds + +**Query:** + +```yaml +with: + files_to_prepare: + nodes: + filter: + and: + - equals: + property: ["node", "space"] + value: fileInstanceSpace + - in: + property: [fileSchemaSpace, "fileExternalId/version", "tags"] + values: ["ToAnnotate"] + - not: + in: + property: [fileSchemaSpace, "fileExternalId/version", "tags"] + values: ["AnnotationInProcess", "Annotated", "AnnotationFailed"] +``` + +**Function Input:** `${workflow.input.items}` - Array of file instances + +**Loop Prevention:** Once processed, files get "AnnotationInProcess" tag, preventing re-triggering + +--- + +#### 2. 
Launch Trigger (`wf_launch_trigger`) + +**Fires when:** AnnotationState instances have `annotationStatus` IN ["New", "Retry"] AND `linkedFile` exists + +**Batch Config:** + +- Size: 50 instances +- Timeout: 30 seconds + +**Query:** + +```yaml +with: + states_to_launch: + nodes: + filter: + and: + - equals: + property: ["node", "space"] + value: fileInstanceSpace + - in: + property: + [ + annotationStateSchemaSpace, + "annotationStateExternalId/version", + "annotationStatus", + ] + values: ["New", "Retry"] + - exists: + property: + [ + annotationStateSchemaSpace, + "annotationStateExternalId/version", + "linkedFile", + ] +``` + +**Function Input:** `${workflow.input.items}` - Array of AnnotationState instances + +**Loop Prevention:** Function updates `annotationStatus="Processing"`, preventing re-triggering + +--- + +#### 3. Finalize Trigger (`wf_finalize_trigger`) + +**Fires when:** AnnotationState instances have `annotationStatus="Processing"` AND `diagramDetectJobId` exists + +**Batch Config:** + +- Size: 20 instances +- Timeout: 60 seconds + +**Query:** + +```yaml +with: + jobs_to_finalize: + nodes: + filter: + and: + - equals: + property: ["node", "space"] + value: fileInstanceSpace + - equals: + property: + [ + annotationStateSchemaSpace, + "annotationStateExternalId/version", + "annotationStatus", + ] + value: "Processing" + - exists: + property: + [ + annotationStateSchemaSpace, + "annotationStateExternalId/version", + "diagramDetectJobId", + ] +``` + +**Function Input:** `${workflow.input.items}` - Array of AnnotationState instances with job IDs + +**Loop Prevention:** Function updates `annotationStatus="Annotated"/"Failed"`, preventing re-triggering + +--- + +#### 4. 
Promote Trigger (`wf_promote_trigger`)
+
+**Fires when:** Annotation edges have `status="Suggested"` AND `tags` does NOT contain `"PromoteAttempted"`
+
+**Batch Config:**
+
+- Size: 100 edges
+- Timeout: 300 seconds (5 minutes)
+
+**Query:**
+
+```yaml
+with:
+  edges_to_promote:
+    edges:
+      filter:
+        and:
+          - equals:
+              property: ["edge", "space"]
+              value: patternModeInstanceSpace
+          - equals:
+              property: [cdf_cdm, "CogniteDiagramAnnotation/v1", "status"]
+              value: "Suggested"
+          - not:
+              in:
+                property: [cdf_cdm, "CogniteDiagramAnnotation/v1", "tags"]
+                values: ["PromoteAttempted"]
+```
+
+**Function Input:** `${workflow.input.items}` - Array of annotation edges (pattern-mode annotations)
+
+**Loop Prevention:** Function adds the `"PromoteAttempted"` tag to edges, preventing re-triggering
+
+**Note:** This trigger queries **edges** (not nodes), since promote processes annotation relationships. The trigger fires when the finalize function creates pattern-mode annotations (edges pointing to the sink node with `status="Suggested"`).
+
+---
+
+## Instance Space Filtering
+
+**All triggers include instance space filtering** to ensure they only fire for instances in the configured `{{fileInstanceSpace}}`.
+
+### Node-based Triggers (Prepare, Launch, Finalize)
+
+For triggers that query nodes, filtering is achieved by checking the node's space property:
+
+```yaml
+- equals:
+    property: ["node", "space"]
+    value: {{ fileInstanceSpace }}
+```
+
+**Example from Prepare Trigger:**
+
+```yaml
+filter:
+  and:
+    - equals:
+        property: ["node", "space"]
+        value: {{ fileInstanceSpace }}
+    - in:
+        property: [{{ fileSchemaSpace }}, "{{fileExternalId}}/{{fileVersion}}", "tags"]
+        values: ["ToAnnotate"]
+    - # ...
other filters
+```
+
+### Edge-based Triggers (Promote)
+
+For the promote trigger, which queries edges, filtering is achieved by checking the edge's own space property:
+
+```yaml
+- equals:
+    property: ["edge", "space"]
+    value: {{ patternModeInstanceSpace }}
+```
+
+This ensures that the promote workflow is triggered only by pattern-mode annotation edges stored in the configured pattern-mode results instance space. These edges are created by the finalize function and live in a dedicated instance space (`patternModeInstanceSpace`, typically `sp_dat_pattern_mode_results`).
+
+### Benefits
+
+This approach ensures:
+
+- ✅ **Isolation**: Triggers only fire for instances in the configured instance space
+- ✅ **Consistency**: Matches the behavior of scheduled functions using the extraction pipeline config
+- ✅ **Multi-tenancy**: Supports multiple isolated environments using the same data model
+- ✅ **Performance**: Reduces query scope to only the relevant instances
+
+The `fileInstanceSpace` and `patternModeInstanceSpace` variables are configured in `default.config.yaml`:
+
+- `fileInstanceSpace`: Used by the node-based triggers (prepare, launch, finalize) to filter files and annotation states
+- `patternModeInstanceSpace`: Used by the edge-based trigger (promote) to filter pattern-mode annotation edges
+
+---
+
+## How Triggers Work
+
+According to the [Cognite documentation](https://docs.cognite.com/cdf/data_workflows/triggers/), data modeling triggers use a **change-based polling mechanism**:
+
+1. **Polling**: The system periodically checks for instances matching the filter criteria
+2. **Change Detection**: Triggers detect changes based on the `lastUpdatedTime` of instances
+3. **Batching**: Multiple matching instances are collected into batches
+4.
**Execution**: When the batch criteria are met (size or timeout), the workflow starts with the collected instances as input
+
+### Trigger Input Format
+
+The trigger passes data to the workflow via `${workflow.input.items}`:
+
+```json
+{
+  "version": "v1_prepare",
+  "items": [
+    {
+      "instanceType": "node",
+      "externalId": "file123",
+      "space": "mySpace",
+      "properties": {
+        "mySpace": {
+          "FileView/v1": {
+            "name": "diagram.pdf",
+            "tags": ["ToAnnotate"],
+            "externalId": "file123"
+          }
+        }
+      }
+    }
+  ]
+}
+```
+
+## Benefits
+
+| Benefit             | Impact                                                 |
+| ------------------- | ------------------------------------------------------ |
+| **Cost Efficiency** | 50-90% reduction in wasted function executions         |
+| **Responsiveness**  | <2 min latency (vs 0-15 min with scheduled triggers)   |
+| **Scalability**     | Automatic batching handles bursts of files efficiently |
+| **Architecture**    | Clean separation of prepare/launch/finalize phases     |
+| **Observability**   | Built-in trigger run history for monitoring            |
+
+### Cost Comparison
+
+**Before (Scheduled):**
+
+- 96 function executions per day (4 runs/hour × 24 h)
+- 60-90% exit early with no work done
+- **Wasted: ~60-85 executions/day**
+
+**After (Event-Driven):**
+
+- Functions only execute when data is ready
+- Zero wasted cold starts
+- **Savings: 50-90% reduction**
+
+## State Machine & Re-triggering Prevention
+
+The architecture prevents infinite loops through careful state management:
+
+```
+Prepare Trigger:
+  Fires on → files.tags contains "ToAnnotate" without "AnnotationInProcess"
+  Function → adds "AnnotationInProcess" tag
+  Result   → ✅ Won't re-trigger (tags changed)
+
+Launch Trigger:
+  Fires on → AnnotationState.status IN ["New", "Retry"] AND linkedFile exists
+  Function → updates status="Processing"
+  Result   → ✅ Won't re-trigger (status changed)
+
+Finalize Trigger:
+  Fires on → AnnotationState.status="Processing"
+  Function → updates status="Annotated"/"Failed"
+  Result   → ✅ Won't re-trigger (status
changed)
+```
+
+**No additional flags needed** - the existing `annotationStatus` property and file `tags` handle the state transitions on their own.
+
+## Function Behavior
+
+### Current Implementation
+
+Functions currently **poll for data internally** using the same queries that the triggers use. This means:
+
+1. **Trigger fires** when data matches the criteria (e.g., files with the "ToAnnotate" tag)
+2. **Function receives** a `triggerInput` parameter with the matching instances
+3. **Function can use** the trigger input OR continue polling (flexible approach)
+
+### Migration Path
+
+**Phase 1 (Current):** Functions receive `triggerInput` but continue internal polling
+
+- Zero code changes required in function logic
+- Triggers ensure functions only run when work exists
+- Already eliminates 50-90% of wasteful executions
+
+**Phase 2 (Future Optimization):** Update functions to process only `triggerInput`
+
+- Remove internal polling/querying logic
+- Process only the instances provided by the trigger
+- Further improve efficiency and reduce query costs
+
+## Monitoring
+
+Track trigger performance using the trigger run history API:
+
+- **Fire time**: When the trigger executed
+- **Status**: Success or failure
+- **Workflow execution ID**: Link to the workflow run
+- **Failure reason**: Debugging information
+
+Example query:
+
+```python
+trigger_runs = client.workflows.triggers.runs.list(
+    external_id="wf_prepare_trigger",
+    limit=100
+)
+```
+
+## Configuration Variables
+
+The following variables in `default.config.yaml` control trigger behavior:
+
+```yaml
+# Workflow versions
+prepareWorkflowVersion: v1_prepare
+launchWorkflowVersion: v1_launch
+finalizeWorkflowVersion: v1_finalize
+promoteWorkflowVersion: v1_promote
+
+# Trigger external IDs
+prepareWorkflowTrigger: wf_prepare_trigger
+launchWorkflowTrigger: wf_launch_trigger
+finalizeWorkflowTrigger: wf_finalize_trigger
+promoteWorkflowTrigger: wf_promote_trigger
+
+# Data model configuration
+fileSchemaSpace:
+fileInstanceSpace: # IMPORTANT: Filters trigger scope
+fileExternalId:
+fileVersion:
+
+annotationStateSchemaSpace: sp_hdm +annotationStateExternalId: FileAnnotationState +annotationStateVersion: v1.0.0 +``` + +**Note:** The `fileInstanceSpace` variable is critical for ensuring triggers only fire for instances in your configured space. This must match the instance space used in your extraction pipeline configuration. + +## References + +- [Cognite Workflows Triggers Documentation](https://docs.cognite.com/cdf/data_workflows/triggers/) +- [Data Modeling Queries](https://docs.cognite.com/cdf/data_workflows/triggers/#trigger-on-data-modeling-events) +- [Prevent Excessive Trigger Runs](https://docs.cognite.com/cdf/data_workflows/triggers/#prevent-excessive-data-modeling-trigger-runs) diff --git a/modules/contextualization/cdf_file_annotation/workflows/wf_file_annotation.WorkflowTrigger.yaml b/modules/contextualization/cdf_file_annotation/workflows/wf_file_annotation.WorkflowTrigger.yaml index 18d1c21f..681b84d3 100644 --- a/modules/contextualization/cdf_file_annotation/workflows/wf_file_annotation.WorkflowTrigger.yaml +++ b/modules/contextualization/cdf_file_annotation/workflows/wf_file_annotation.WorkflowTrigger.yaml @@ -1,9 +1,159 @@ -externalId: {{workflowExternalId}} -triggerRule: - triggerType: schedule - cronExpression: "{{workflowSchedule}}" -workflowExternalId: {{workflowExternalId}} -workflowVersion: {{workflowVersion}} -authentication: - clientId: {{functionClientId}} - clientSecret: {{functionClientSecret}} \ No newline at end of file +# Prepare Trigger: Fires when files have "ToAnnotate" tag without annotation processing tags +- externalId: {{prepareWorkflowTrigger}} + triggerRule: + triggerType: dataModeling + dataModelingQuery: + with: + files_to_prepare: + nodes: + filter: + and: + - equals: + property: ["node", "space"] + value: {{fileInstanceSpace}} + - in: + property: [{{fileSchemaSpace}}, '{{fileExternalId}}/{{fileVersion}}', 'tags'] + values: ['ToAnnotate'] + - not: + in: + property: [{{fileSchemaSpace}}, 
'{{fileExternalId}}/{{fileVersion}}', 'tags'] + values: ['AnnotationInProcess', 'Annotated', 'AnnotationFailed'] + limit: 100 + select: + files_to_prepare: + sources: + - source: + type: view + space: {{fileSchemaSpace}} + externalId: {{fileExternalId}} + version: {{fileVersion}} + properties: + - name + - tags + batchSize: 100 + batchTimeout: 60 + workflowExternalId: {{workflowExternalId}} + workflowVersion: {{prepareWorkflowVersion}} + authentication: + clientId: {{functionClientId}} + clientSecret: {{functionClientSecret}} + +# Launch Trigger: Fires when AnnotationState instances have status "New" or "Retry" AND linkedFile exists +- externalId: {{launchWorkflowTrigger}} + triggerRule: + triggerType: dataModeling + dataModelingQuery: + with: + states_to_launch: + nodes: + filter: + and: + - equals: + property: ["node", "space"] + value: {{fileInstanceSpace}} + - in: + property: [{{annotationStateSchemaSpace}}, '{{annotationStateExternalId}}/{{annotationStateVersion}}', 'annotationStatus'] + values: ['New', 'Retry'] + - exists: + property: [{{annotationStateSchemaSpace}}, '{{annotationStateExternalId}}/{{annotationStateVersion}}', 'linkedFile'] + limit: 50 + select: + states_to_launch: + sources: + - source: + type: view + space: {{annotationStateSchemaSpace}} + externalId: {{annotationStateExternalId}} + version: {{annotationStateVersion}} + properties: + - annotationStatus + - linkedFile + - attemptCount + batchSize: 50 + batchTimeout: 30 + workflowExternalId: {{workflowExternalId}} + workflowVersion: {{launchWorkflowVersion}} + authentication: + clientId: {{functionClientId}} + clientSecret: {{functionClientSecret}} + +# Finalize Trigger: Fires when AnnotationState instances have status "Processing" TODO: may need to only make one thread +- externalId: {{finalizeWorkflowTrigger}} + triggerRule: + triggerType: dataModeling + dataModelingQuery: + with: + jobs_to_finalize: + nodes: + filter: + and: + - equals: + property: ["node", "space"] + value: 
{{fileInstanceSpace}} + - equals: + property: [{{annotationStateSchemaSpace}}, '{{annotationStateExternalId}}/{{annotationStateVersion}}', 'annotationStatus'] + value: 'Processing' + - exists: + property: [{{annotationStateSchemaSpace}}, '{{annotationStateExternalId}}/{{annotationStateVersion}}', 'diagramDetectJobId'] + limit: 20 + select: + jobs_to_finalize: + sources: + - source: + type: view + space: {{annotationStateSchemaSpace}} + externalId: {{annotationStateExternalId}} + version: {{annotationStateVersion}} + properties: + - annotationStatus + - diagramDetectJobId + - patternModeJobId + - linkedFile + batchSize: 20 + batchTimeout: 60 + workflowExternalId: {{workflowExternalId}} + workflowVersion: {{finalizeWorkflowVersion}} + authentication: + clientId: {{functionClientId}} + clientSecret: {{functionClientSecret}} + +# Promote Trigger: Fires when annotation edges have status "Suggested" and haven't been promoted yet +- externalId: {{promoteWorkflowTrigger}} + triggerRule: + triggerType: dataModeling + dataModelingQuery: + with: + edges_to_promote: + edges: + filter: + and: + - equals: + property: ["edge", "space"] + value: {{patternModeInstanceSpace}} + - equals: + property: [cdf_cdm, 'CogniteDiagramAnnotation/v1', 'status'] + value: 'Suggested' + - not: + in: + property: [cdf_cdm, 'CogniteDiagramAnnotation/v1', 'tags'] + values: ['PromoteAttempted'] + limit: 100 + select: + edges_to_promote: + sources: + - source: + type: view + space: cdf_cdm + externalId: CogniteDiagramAnnotation + version: v1 + properties: + - status + - tags + - startNodeText + batchSize: 100 + batchTimeout: 300 + workflowExternalId: {{workflowExternalId}} + workflowVersion: {{promoteWorkflowVersion}} + authentication: + clientId: {{functionClientId}} + clientSecret: {{functionClientSecret}} diff --git a/modules/contextualization/cdf_file_annotation/workflows/wf_file_annotation.WorkflowVersion.yaml 
b/modules/contextualization/cdf_file_annotation/workflows/wf_file_annotation.WorkflowVersion.yaml index a1d6d30d..94e366e8 100644 --- a/modules/contextualization/cdf_file_annotation/workflows/wf_file_annotation.WorkflowVersion.yaml +++ b/modules/contextualization/cdf_file_annotation/workflows/wf_file_annotation.WorkflowVersion.yaml @@ -1,106 +1,163 @@ -workflowExternalId: {{ workflowExternalId }} -version: "v1" -workflowDefinition: - description: "A workflow for annotating P&ID and documents." - tasks: - - externalId: fn_launch - type: "function" - parameters: - function: - externalId: {{ launchFunctionExternalId }} - data: - { - "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, - "logLevel": "INFO", - } - isAsyncComplete: false - name: Launch File Annotations - description: Launch - retries: 0 - timeout: 600 - onFailure: "abortWorkflow" +- workflowExternalId: {{ workflowExternalId }} + version: {{ prepareWorkflowVersion }} + workflowDefinition: + description: "Create annotation state instances for files marked to annotate." 
+ tasks: + - externalId: fn_prepare + type: "function" + parameters: + function: + externalId: {{ prepareFunctionExternalId }} + data: + { + "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, + "logLevel": "INFO", + "triggerInput": "${workflow.input.items}" + } + isAsyncComplete: false + name: Prepare File Annotations + description: Prepare + retries: 0 + timeout: 600 + onFailure: "abortWorkflow" - - externalId: fn_finalize_thread_1 - type: "function" - parameters: - function: - externalId: {{ finalizeFunctionExternalId }} - data: - { - "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, - "logLevel": "INFO", - } - isAsyncComplete: false - name: Finalize File Annotations - Thread 1 - description: Finalize - retries: 0 - timeout: 600 - onFailure: "abortWorkflow" +- workflowExternalId: {{ workflowExternalId }} + version: {{ launchWorkflowVersion }} + workflowDefinition: + description: "Create diagram detect jobs for annotation state instances marked new or retry." + tasks: + - externalId: fn_launch + type: "function" + parameters: + function: + externalId: {{ launchFunctionExternalId }} + data: + { + "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, + "logLevel": "INFO", + "triggerInput": "${workflow.input.items}" + } + isAsyncComplete: false + name: Launch File Annotations + description: Launch + retries: 0 + timeout: 600 + onFailure: "abortWorkflow" - - externalId: fn_finalize_thread_2 - type: "function" - parameters: - function: - externalId: {{ finalizeFunctionExternalId }} - data: - { - "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, - "logLevel": "INFO", - } - isAsyncComplete: false - name: Finalize File Annotations - Thread 2 - description: Finalize - retries: 0 - timeout: 600 - onFailure: "abortWorkflow" +- workflowExternalId: {{ workflowExternalId }} + version: {{ finalizeWorkflowVersion }} + workflowDefinition: + description: "Process the diagram detect jobs created by the launch workflow" + tasks: + - 
externalId: fn_finalize_thread_1 + type: "function" + parameters: + function: + externalId: {{ finalizeFunctionExternalId }} + data: + { + "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, + "logLevel": "INFO", + "triggerInput": "${workflow.input.items}" + } + isAsyncComplete: false + name: Finalize File Annotations - Thread 1 + description: Finalize + retries: 0 + timeout: 600 + onFailure: "abortWorkflow" - - externalId: fn_finalize_thread_3 - type: "function" - parameters: - function: - externalId: {{ finalizeFunctionExternalId }} - data: - { - "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, - "logLevel": "INFO", - } - isAsyncComplete: false - name: Finalize File Annotations - Thread 3 - description: Finalize - retries: 0 - timeout: 600 - onFailure: "abortWorkflow" + - externalId: fn_finalize_thread_2 + type: "function" + parameters: + function: + externalId: {{ finalizeFunctionExternalId }} + data: + { + "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, + "logLevel": "INFO", + "triggerInput": "${workflow.input.items}" + } + isAsyncComplete: false + name: Finalize File Annotations - Thread 2 + description: Finalize + retries: 0 + timeout: 600 + onFailure: "abortWorkflow" - - externalId: fn_finalize_thread_4 - type: "function" - parameters: - function: - externalId: {{ finalizeFunctionExternalId }} - data: - { - "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, - "logLevel": "INFO", - } - isAsyncComplete: false - name: Finalize File Annotations - Thread 4 - description: Finalize - retries: 0 - timeout: 600 - onFailure: "abortWorkflow" + - externalId: fn_finalize_thread_3 + type: "function" + parameters: + function: + externalId: {{ finalizeFunctionExternalId }} + data: + { + "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, + "logLevel": "INFO", + "triggerInput": "${workflow.input.items}" + } + isAsyncComplete: false + name: Finalize File Annotations - Thread 3 + description: Finalize + retries: 0 
+ timeout: 600 + onFailure: "abortWorkflow" - - externalId: fn_finalize_thread_5 - type: "function" - parameters: - function: - externalId: {{ finalizeFunctionExternalId }} - data: - { - "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, - "logLevel": "INFO", - } - isAsyncComplete: false - name: Finalize File Annotations - Thread 5 - description: Finalize - retries: 0 - timeout: 600 - onFailure: "abortWorkflow" + - externalId: fn_finalize_thread_4 + type: "function" + parameters: + function: + externalId: {{ finalizeFunctionExternalId }} + data: + { + "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, + "logLevel": "INFO", + "triggerInput": "${workflow.input.items}" + } + isAsyncComplete: false + name: Finalize File Annotations - Thread 4 + description: Finalize + retries: 0 + timeout: 600 + onFailure: "abortWorkflow" + + - externalId: fn_finalize_thread_5 + type: "function" + parameters: + function: + externalId: {{ finalizeFunctionExternalId }} + data: + { + "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, + "logLevel": "INFO", + "triggerInput": "${workflow.input.items}" + } + isAsyncComplete: false + name: Finalize File Annotations - Thread 5 + description: Finalize + retries: 0 + timeout: 600 + onFailure: "abortWorkflow" + +- workflowExternalId: {{ workflowExternalId }} + version: {{ promoteWorkflowVersion }} + workflowDefinition: + description: "Attempt to automatically promote annotation edges created from the pattern mode results in the finalize workflow." + tasks: + - externalId: fn_promote + type: "function" + parameters: + function: + externalId: {{ promoteFunctionExternalId }} + data: + { + "ExtractionPipelineExtId": {{ extractionPipelineExternalId }}, + "logLevel": "INFO", + "triggerInput": "${workflow.input.items}" + } + isAsyncComplete: false + name: Promote File Annotations + description: Auto promote tags + retries: 0 + timeout: 600 + onFailure: "abortWorkflow"
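To illustrate the `"triggerInput": "${workflow.input.items}"` contract wired into every task above, here is a minimal, hypothetical sketch of a function consuming the trigger batch in the Phase-2 style described in TRIGGER_ARCHITECTURE.md. It assumes the standard Cognite Functions `handle(client, data)` entry point; the body and return shape are illustrative, not the module's actual prepare/launch/finalize code:

```python
def handle(client, data):
    # `data` is the task's `data` payload, so "triggerInput" resolves to the
    # array of instances the data modeling trigger collected into this batch.
    items = data.get("triggerInput") or []

    # Phase-2 style: act only on the delivered instances instead of
    # re-querying the data model internally.
    external_ids = [item["externalId"] for item in items]

    # ... process each instance here (create states, launch jobs, etc.) ...
    return {"processed": external_ids}
```

For a prepare batch of one file, `handle(None, {"triggerInput": [{"externalId": "file123", "space": "mySpace"}]})` would return `{"processed": ["file123"]}`; an empty or missing `triggerInput` yields an empty list rather than an error.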