Merged

Chunk #195

29 commits
c0bd99a
Add elbow detection and chunk deduplication utilities
m1rl0k Jan 24, 2026
d9ee48f
Add improved O(n log n) chunk deduplication with substring detection
m1rl0k Jan 24, 2026
4da2177
Create termination.py
m1rl0k Jan 24, 2026
19993f5
Integrate unified language mappings and improve analysis
m1rl0k Jan 24, 2026
636bd52
Add postinstall script to set execute permission
m1rl0k Jan 24, 2026
4512de2
Expand __all__ exports in qdrant.py and update shim
m1rl0k Jan 24, 2026
bbf0a1d
Fix score handling and concept type casing issues
m1rl0k Jan 24, 2026
fa78782
Add tests for chunking, deduplication, elbow, and termination
m1rl0k Jan 24, 2026
558b08d
Update ci.yml
m1rl0k Jan 24, 2026
ca0f533
Handle TOON-formatted results in search command
m1rl0k Jan 24, 2026
d4e3685
Make symbol graph edges always enabled and update elbow detection
m1rl0k Jan 24, 2026
56e7522
Refine Page-Hinkley test and update related tests
m1rl0k Jan 24, 2026
1e58047
Refactor graph backend selection and fallback logic
m1rl0k Jan 24, 2026
2df5afa
Add xxhash to project dependencies
m1rl0k Jan 24, 2026
7690217
Fix min_results calculation for zero limit in hybrid search
m1rl0k Jan 24, 2026
e4143a5
Lazy load elbow detection to avoid hard numpy dependency
m1rl0k Jan 24, 2026
9dd2c0d
Enable and document deferred pseudo-tag generation
m1rl0k Jan 24, 2026
d6b1544
Add Cursor MCP config support to VSCode extension
m1rl0k Jan 24, 2026
aa67bf7
Update README.md
m1rl0k Jan 24, 2026
d1453b1
Improve robustness in termination and deduplication logic
m1rl0k Jan 24, 2026
50caf65
Update termination.py
m1rl0k Jan 24, 2026
0121701
Add INDEX_WORKERS support and optimize fresh collection indexing
m1rl0k Jan 24, 2026
4390e03
Fix indentation and cleanup in termination logic
m1rl0k Jan 24, 2026
ef1264f
Improve symbol lookup and simplify Mann-Whitney U ranking
m1rl0k Jan 24, 2026
d7c8513
Improve import parsing and path filtering logic
m1rl0k Jan 24, 2026
64901f2
Add embedding, exception, cache, and model modules
m1rl0k Jan 24, 2026
c569e49
Improve context hit extraction and logging
m1rl0k Jan 24, 2026
c85a054
Fix indentation and update debug messages in scripts
m1rl0k Jan 24, 2026
9e85f08
Restrict call attribution to valid symbol kinds
m1rl0k Jan 24, 2026
5 changes: 5 additions & 0 deletions .env.example
@@ -149,6 +149,11 @@ SEMANTIC_EXPANSION_CACHE_TTL=3600
# HYBRID_RECENCY_WEIGHT=0.1
# RERANK_EXPAND=1

# Elbow detection filter: adaptive threshold based on score distribution (Kneedle algorithm)
# Filters out low-relevance results by detecting the "elbow" point in the score curve
# Improves precision by only returning results above the natural relevance drop-off
# HYBRID_ELBOW_FILTER=0

# Caching (embeddings and search results)
# MAX_EMBED_CACHE=16384
# HYBRID_RESULTS_CACHE=128
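
The `HYBRID_ELBOW_FILTER` comment above describes cutting results at the "elbow" of the score curve. The sketch below illustrates that idea with a simplified chord-distance heuristic, not the full Kneedle algorithm and not the PR's actual implementation. The function names, the `{"score": ...}` result shape, and the convention of keeping the elbow point itself are assumptions made for illustration.

```python
import os


def elbow_cutoff(scores):
    """Return how many leading results to keep, given scores sorted descending.

    Simplified heuristic (assumption, not the PR's code): normalise the curve
    to the unit square and pick the point farthest from the chord that joins
    the first and last score. That point is kept; everything after is dropped.
    """
    n = len(scores)
    if n < 3:
        return n  # too few points to define an elbow
    hi, lo = scores[0], scores[-1]
    if hi == lo:
        return n  # flat curve: nothing to cut
    best_i, best_d = n - 1, 0.0
    for i, s in enumerate(scores):
        x = i / (n - 1)
        y = (s - lo) / (hi - lo)
        # The chord runs from (0, 1) to (1, 0), i.e. x + y = 1.
        d = abs(x + y - 1.0)
        if d > best_d:
            best_i, best_d = i, d
    return best_i + 1


def apply_elbow_filter(results, env=None):
    """Filter results only when HYBRID_ELBOW_FILTER=1 (default is off)."""
    env = os.environ if env is None else env
    if env.get("HYBRID_ELBOW_FILTER", "0") != "1":
        return results
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return ranked[: elbow_cutoff([r["score"] for r in ranked])]
```

With scores such as 0.95, 0.93, 0.91, 0.40, 0.38, 0.35, the cut falls after the third result, just before the sharp drop. A linear score curve has no elbow under this heuristic, so all results are kept.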
14 changes: 9 additions & 5 deletions .github/workflows/ci.yml
@@ -65,14 +65,18 @@ jobs:
python -c "from fastembed import TextEmbedding; m = TextEmbedding(model_name='BAAI/bge-base-en-v1.5'); list(m.embed(['test']))"

- name: Run tests
run: pytest -q
run: pytest -q --junitxml=test-results.xml

- name: Upload test results
uses: actions/upload-artifact@v4
if: always()
with:
name: test-results
path: |
.pytest_cache/
test-results.xml
path: test-results.xml
retention-days: 7

- name: Test Summary
uses: test-summary/action@v2
if: always()
with:
paths: test-results.xml
7 changes: 4 additions & 3 deletions ctx-mcp-bridge/package.json
@@ -1,14 +1,15 @@
{
"name": "@context-engine-bridge/context-engine-mcp-bridge",
"version": "0.0.15",
"version": "0.0.16",
"description": "Context Engine MCP bridge (http/stdio proxy combining indexer + memory servers)",
"bin": {
"ctxce": "bin/ctxce.js",
"ctxce-bridge": "bin/ctxce.js"
},
"type": "module",
"scripts": {
"start": "node bin/ctxce.js"
"start": "node bin/ctxce.js",
"postinstall": "node -e \"try{require('fs').chmodSync('bin/ctxce.js',0o755)}catch(e){}\""
},
"dependencies": {
"@modelcontextprotocol/sdk": "^1.24.3",
@@ -20,4 +21,4 @@
"engines": {
"node": ">=18.0.0"
}
}
}
1 change: 1 addition & 0 deletions deploy/kubernetes/configmap.yaml
@@ -151,3 +151,4 @@ data:
USE_GPU_DECODER: '0'
USE_TREE_SITTER: '1'
WATCH_DEBOUNCE_SECS: '4'
PSEUDO_DEFER_TO_WORKER: '1'
14 changes: 9 additions & 5 deletions docker-compose.yml
@@ -453,9 +453,12 @@ services:
- LEX_SPARSE_NAME=${LEX_SPARSE_NAME:-}
# Pattern vectors for structural code similarity
- PATTERN_VECTORS=${PATTERN_VECTORS:-}
# Graph edges for symbol relationships
- INDEX_GRAPH_EDGES=${INDEX_GRAPH_EDGES:-1}
# Graph edges for symbol relationships (always on)
- INDEX_GRAPH_EDGES_MODE=${INDEX_GRAPH_EDGES_MODE:-symbol}
# Defer pseudo-tag generation to watcher worker for faster initial indexing
- PSEUDO_DEFER_TO_WORKER=${PSEUDO_DEFER_TO_WORKER:-1}
# Parallel indexing - number of worker threads (default: 4, use -1 for CPU count)
- INDEX_WORKERS=${INDEX_WORKERS:-4}
volumes:
- workspace_pvc:/work:rw
- codebase_pvc:/work/.codebase:rw
@@ -514,12 +517,13 @@ services:
- LEX_SPARSE_NAME=${LEX_SPARSE_NAME:-}
# Pattern vectors for structural code similarity
- PATTERN_VECTORS=${PATTERN_VECTORS:-}
# Graph edges for symbol relationships
- INDEX_GRAPH_EDGES=${INDEX_GRAPH_EDGES:-1}
# Graph edges for symbol relationships (always on - Qdrant flat graph)
- INDEX_GRAPH_EDGES_MODE=${INDEX_GRAPH_EDGES_MODE:-symbol}
- GRAPH_BACKFILL_ENABLED=${GRAPH_BACKFILL_ENABLED:-1}
# Neo4j graph backend (when set, edges go to Neo4j instead of Qdrant _graph collection)
# Neo4j graph backend (optional - takes precedence over Qdrant flat graph)
- NEO4J_GRAPH=${NEO4J_GRAPH:-}
# Defer pseudo-tag generation - watcher runs backfill worker thread
- PSEUDO_DEFER_TO_WORKER=${PSEUDO_DEFER_TO_WORKER:-1}
volumes:
- workspace_pvc:/work:rw
- codebase_pvc:/work/.codebase:rw
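
The `INDEX_WORKERS` comment in the first hunk says the default is 4 and that -1 means "use the CPU count". A minimal sketch of how an indexer might resolve that value follows. The function name and the fallback for invalid or non-positive values are assumptions for illustration, not the PR's code.

```python
import os

DEFAULT_INDEX_WORKERS = 4  # matches the documented default in docker-compose.yml


def resolve_index_workers(env=None):
    """Turn the INDEX_WORKERS setting into a concrete thread count."""
    env = os.environ if env is None else env
    raw = env.get("INDEX_WORKERS", "").strip()
    if not raw:
        return DEFAULT_INDEX_WORKERS
    try:
        value = int(raw)
    except ValueError:
        # Assumption: unparseable input falls back to the default.
        return DEFAULT_INDEX_WORKERS
    if value == -1:
        # os.cpu_count() may return None on some platforms.
        return os.cpu_count() or 1
    if value < 1:
        # Assumption: 0 and other negatives are treated as invalid.
        return DEFAULT_INDEX_WORKERS
    return value
```

Because Compose expands `${INDEX_WORKERS:-4}` to 4 when the variable is unset or empty, the container normally sees an explicit value. The in-code default only matters when the indexer runs outside Compose.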
19 changes: 17 additions & 2 deletions docs/CONFIGURATION.md
@@ -377,12 +377,26 @@ REFRAG_RUNTIME=glm # or openai, minimax, llamacpp

### Pseudo Backfill Worker

Deferred pseudo/tag generation runs asynchronously after initial indexing.
Deferred pseudo/tag generation runs asynchronously after initial indexing. This significantly speeds up initial indexing by skipping LLM-based pseudo-tag generation during the indexer run, deferring it to a background worker thread in the watcher service.

| Name | Description | Default |
|------|-------------|---------|
| PSEUDO_BACKFILL_ENABLED | Enable async pseudo/tag backfill worker | 0 (disabled) |
| PSEUDO_DEFER_TO_WORKER | Skip inline pseudo, defer to backfill worker | 0 (disabled) |
| PSEUDO_DEFER_TO_WORKER | Skip inline pseudo, defer to backfill worker | 1 (enabled) |
| GRAPH_BACKFILL_ENABLED | Enable graph edge backfill in watcher worker | 1 (enabled) |

**How it works:**
1. When `PSEUDO_DEFER_TO_WORKER=1`, the indexer generates only base chunks (no pseudo-tags)
2. The watcher service starts a `_start_pseudo_backfill_worker` daemon thread
3. This thread periodically calls `pseudo_backfill_tick()` to enrich chunks with LLM-generated tags
4. If `GRAPH_BACKFILL_ENABLED=1`, it also calls `graph_backfill_tick()` to populate symbol graph edges

**Benefits:**
- Initial indexing is 2-5x faster (no LLM calls blocking indexer)
- Background enrichment happens continuously without blocking searches
- Failed LLM calls don't break indexing; worker retries automatically

**Recommended for production:** Enable both for fastest initial indexing with eventual enrichment.
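
The "How it works" steps above describe a daemon thread that periodically runs backfill ticks. The sketch below shows one way such a loop could look. It is not the watcher's actual `_start_pseudo_backfill_worker`: the function name, parameters, and interval are hypothetical, and the tick callables stand in for `pseudo_backfill_tick()` and `graph_backfill_tick()`.

```python
import threading


def start_backfill_worker(pseudo_tick, graph_tick=None,
                          interval_secs=30.0, stop_event=None):
    """Start a daemon thread that runs the given ticks on a fixed interval.

    pseudo_tick and graph_tick are placeholders for the real tick functions.
    Pass graph_tick=None to mimic GRAPH_BACKFILL_ENABLED=0.
    """
    stop_event = stop_event or threading.Event()

    def _loop():
        while not stop_event.is_set():
            for tick in (pseudo_tick, graph_tick):
                if tick is None:
                    continue
                try:
                    tick()
                except Exception:
                    # A failed tick (e.g. an LLM error) must not kill the
                    # worker; the work is simply retried on the next pass.
                    pass
            # wait() returns early once stop_event is set.
            stop_event.wait(interval_secs)

    thread = threading.Thread(target=_loop, name="pseudo-backfill", daemon=True)
    thread.start()
    return thread, stop_event
```

Using `Event.wait` instead of `time.sleep` lets the caller stop the worker promptly by setting the event, and the daemon flag keeps the thread from blocking process shutdown.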

### Adaptive Span Sizing

@@ -523,6 +537,7 @@ Useful for Kubernetes deployments where a shared filesystem is not reliable.
| CODEBASE_STATE_REDIS_LOCK_WAIT_MS | Redis lock wait in ms | 2000 |
| CODEBASE_STATE_REDIS_SOCKET_TIMEOUT | Redis socket timeout in seconds | 2 |
| CODEBASE_STATE_REDIS_CONNECT_TIMEOUT | Redis connect timeout in seconds | 2 |
| CODEBASE_STATE_REDIS_MAX_CONNECTIONS | Redis connection pool size limit | 10 |

### Semantic Expansion

1 change: 1 addition & 0 deletions pyproject.toml
@@ -58,6 +58,7 @@ dependencies = [
"rich>=13.0.0",
"typer>=0.9.0",
"requests>=2.28.0",
"xxhash>=3.0.0",
]

[project.optional-dependencies]
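
The diff does not show where `xxhash` is used. Given the chunk deduplication commits in this PR, one plausible use is fast content fingerprinting, and the sketch below assumes that. The function names and the whitespace normalisation are hypothetical, and the `hashlib` fallback exists only so the example runs without `xxhash` installed.

```python
import hashlib

try:
    import xxhash  # added as a dependency by this PR
except ImportError:  # fallback is only for running this sketch standalone
    xxhash = None


def chunk_fingerprint(text):
    """Hash a chunk after collapsing whitespace (normalisation is an assumption)."""
    data = " ".join(text.split()).encode("utf-8")
    if xxhash is not None:
        return xxhash.xxh64(data).hexdigest()
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def dedupe_chunks(chunks):
    """Keep the first occurrence of each fingerprint, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        fp = chunk_fingerprint(chunk)
        if fp not in seen:
            seen.add(fp)
            unique.append(chunk)
    return unique
```

This covers only exact duplicates after normalisation. The substring detection mentioned in the commit list would need additional logic that is not sketched here.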