-
Notifications
You must be signed in to change notification settings - Fork 0
Query Performance Optimization for Large-Scale Entity Retrieval #122
Copy link
Copy link
Open
Description
Description
The File Annotation Pipeline's get_instances_entities() method in DataModelService.py frequently encounters 408 timeout errors when querying data modeling instances on projects with large datasets (1,000+ entities per scope).
Problem Statement
When the pipeline retrieves entities for diagram detection caching, the query:
target_entities = client.data_modeling.instances.list(
instance_type="node",
sources=target_entities_view.as_view_id(),
space=target_entities_view.instance_space,
filter=target_filter, # Equals on primary + secondary scope
limit=-1,
)Consistently times out with the error:
Graph query timed out. Reduce load or contention, or optimise your query.
code: 408
Root Cause Analysis
| Finding | Details |
|---|---|
| Bottleneck | PostgreSQL filter execution on multi-property Equals filters |
| Affected endpoints | Both /list and /query (same PostgreSQL backend) |
Why /search won't work |
Limited to 1,000 results max with no pagination support |
Endpoint Comparison
| Endpoint | Backend | Pagination | Max Results | Status |
|---|---|---|---|---|
/list |
PostgreSQL | ✅ Cursor-based | Unlimited | ❌ Times out |
/query |
PostgreSQL | ✅ Cursor-based | Unlimited | ❌ Times out |
/search |
Elasticsearch | ❌ Not supported | 1,000 max | ❌ Insufficient limit |
Proposed Solutions
Solution 1: Create Composite Indexes (Recommended)
Add composite indexes on frequently-filtered properties:
| Index Type | Properties | Rationale |
|---|---|---|
| Composite BTree | primaryScopeProperty + secondaryScopeProperty |
Most common filter combination |
| Single BTree | primaryScopeProperty |
Fallback when no secondary scope |
Solution 2: Extend Cache TTL
Current cache invalidation may be too aggressive. Consider extending cache validity:
# Current: Configurable, often 24 hours
# Recommended: 48-72 hours for entity data that rarely changes
cache_time_limit: int = 72 # hoursThe query performance at scale really highlights the important of the raw table cache.
If in the future the /search endpoint has pagination, it likely will be a more consistent solution for retrieving the list of entities used in the diagram detect job.
Related Files
services/DataModelService.py-get_instances_entities()methodservices/CacheService.py- Entity caching implementation
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels