Skip to content

Query Performance Optimization for Large-Scale Entity Retrieval #122

@dude-with-a-mug

Description

@dude-with-a-mug

Description

The File Annotation Pipeline's get_instances_entities() method in DataModelService.py frequently encounters 408 timeout errors when querying data modeling instances on projects with large datasets (1,000+ entities per scope).

Problem Statement

When the pipeline retrieves entities for diagram detection caching, the query:

target_entities = client.data_modeling.instances.list(
    instance_type="node",
    sources=target_entities_view.as_view_id(),
    space=target_entities_view.instance_space,
    filter=target_filter,  # Equals on primary + secondary scope
    limit=-1,
)

Consistently times out with the error:

Graph query timed out. Reduce load or contention, or optimise your query.
code: 408

Root Cause Analysis

Finding Details
Bottleneck PostgreSQL filter execution on multi-property Equals filters
Affected endpoints Both /list and /query (same PostgreSQL backend)
Why /search won't work Limited to 1,000 results max with no pagination support

Endpoint Comparison

Endpoint Backend Pagination Max Results Status
/list PostgreSQL ✅ Cursor-based Unlimited ❌ Times out
/query PostgreSQL ✅ Cursor-based Unlimited ❌ Times out
/search Elasticsearch ❌ Not supported 1,000 max ❌ Insufficient limit

Proposed Solutions

Solution 1: Create Composite Indexes (Recommended)

Add composite indexes on frequently-filtered properties:

Index Type Properties Rationale
Composite BTree primaryScopeProperty + secondaryScopeProperty Most common filter combination
Single BTree primaryScopeProperty Fallback when no secondary scope

Solution 2: Extend Cache TTL

Current cache invalidation may be too aggressive. Consider extending cache validity:

# Current: Configurable, often 24 hours
# Recommended: 48-72 hours for entity data that rarely changes
cache_time_limit: int = 72  # hours

The query performance at scale really highlights the important of the raw table cache.

If in the future the /search endpoint has pagination, it likely will be a more consistent solution for retrieving the list of entities used in the diagram detect job.

Related Files

  • services/DataModelService.py - get_instances_entities() method
  • services/CacheService.py - Entity caching implementation

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions