feat: add diagram awareness for visual content extraction #65

Open

m2ux wants to merge 24 commits into main from feat/diagram-awareness

Conversation

m2ux (Owner) commented Dec 29, 2025

Summary

Implements the diagram awareness feature to extract, describe, and integrate visual content (diagrams, charts, tables, and figures) into the RAG system for enhanced semantic search.

🎫 Issue #51

ADR: adr0056-diagram-awareness


Motivation

Technical documents often contain diagrams that provide critical semantic context not captured in text. This feature enables:

  • Semantic descriptions of diagrams via Vision LLM
  • Searchable diagram metadata integrated with text content
  • Concept-based diagram discovery
  • Non-destructive database migration (augments existing production data)

Changes

M1: Infrastructure

  • Visual domain model (src/domain/models/visual.ts) - Visual type with bounding box, description, concepts
  • VisualRepository interface (src/domain/interfaces/repositories/visual-repository.ts) - Full CRUD + search operations
  • LanceDB implementation (src/infrastructure/lancedb/repositories/lancedb-visual-repository.ts)
  • Migration script (scripts/add-visuals-table.ts) - Non-destructive table addition
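
A minimal sketch of what the Visual domain model and its bounding-box helpers might look like (field names and the string encoding are assumptions, not the actual source):

```typescript
// Hypothetical sketch of the Visual domain model; field names are assumptions.
interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Visual {
  id: string;
  catalogId: number;
  pageNumber: number;
  visualType: "diagram" | "flowchart" | "chart" | "table" | "figure";
  boundingBox: BoundingBox;
  description: string;
  concepts: string[];
  imagePath: string;
}

// Serialize/parse helpers so the box can round-trip through a flat string column.
function serializeBoundingBox(box: BoundingBox): string {
  return [box.x, box.y, box.width, box.height].join(",");
}

function parseBoundingBox(raw: string): BoundingBox {
  const [x, y, width, height] = raw.split(",").map(Number);
  return { x, y, width, height };
}
```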

M2: Extraction Pipeline

  • Visual extraction types (src/infrastructure/visual-extraction/types.ts)
  • PDF page renderer (src/infrastructure/visual-extraction/pdf-page-renderer.ts)
  • Image processor (src/infrastructure/visual-extraction/image-processor.ts) - Sharp-based grayscale conversion
  • Vision LLM service (src/infrastructure/visual-extraction/vision-llm-service.ts) - Classification + description
  • Visual extractor (src/infrastructure/visual-extraction/visual-extractor.ts)
  • Extraction script (scripts/extract-visuals.ts)

M3: Description & Embedding

  • Description script (scripts/describe-visuals.ts) - Vision LLM semantic descriptions
  • Prompts (prompts/visual-classification.txt, prompts/visual-description.txt)
  • Concept extraction from descriptions
  • Chunk-to-visual linking by page number

M4: Search Integration

  • get_visuals MCP tool (src/tools/operations/get-visuals-tool.ts)
  • Container integration with optional table detection
  • Updated tool selection guide with visual workflows

M5: Finalization

  • ADR status updated to Accepted
  • Documentation updated

Testing

  • ✅ Build succeeds (tsc --noEmit)
  • ✅ Unit tests: 1423/1424 passing (1 pre-existing timeout in catalog-repository integration test)
  • ✅ Migration script tested on test database

Usage

# 1. Migrate database (non-destructive)
npx tsx scripts/add-visuals-table.ts --dbpath ~/.concept_rag

# 2. Extract visuals from documents
npx tsx scripts/extract-visuals.ts --dbpath ~/.concept_rag

# 3. Generate descriptions
OPENROUTER_API_KEY=... npx tsx scripts/describe-visuals.ts --dbpath ~/.concept_rag

# 4. Query via MCP
# get_visuals(concept: "architecture")
# get_visuals(catalog_id: 12345678)

Submission Checklist

  • Code compiles without errors
  • All new code has appropriate documentation
  • ADR created and status updated to Accepted
  • Tool selection guide updated
  • No debug code or temporary files committed

Fork Strategy

  • Build succeeds
  • No formatting errors
  • Linting passes

- Vision LLM approach for semantic diagram understanding
- Store only semantic diagrams (not photos/decorative)
- Grayscale storage with color analysis
- New visuals table with external image storage
- Non-destructive database migration

Issue: #51
@m2ux m2ux self-assigned this Dec 29, 2025
m2ux added 8 commits December 29, 2025 16:43
- Visual domain model for diagrams, charts, tables, figures
- VisualType enum: diagram, flowchart, chart, table, figure
- BoundingBox type with parse/serialize helpers
- VisualRepository interface with full CRUD operations
- Export from domain/models and domain/interfaces/repositories

WP: Diagram Awareness (M1: Infrastructure)
- Full CRUD operations for visuals table
- Vector search for semantic queries
- Query by catalog, type, page, concept, chunk associations
- Batch add/update operations
- Arrow Vector and JSON field parsing
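
The Arrow Vector and JSON field parsing above might look roughly like this (duck-typed rather than importing Arrow; all names are assumptions):

```typescript
// Hypothetical sketch: LanceDB rows can surface list columns as Arrow
// Vector-like objects and store structured fields as JSON strings.
function toArray(value: unknown): unknown[] {
  if (Array.isArray(value)) return value;
  // Arrow Vectors expose toArray(); duck-type instead of importing arrow here.
  const maybe = value as { toArray?: () => unknown[] } | null;
  if (maybe && typeof maybe.toArray === "function") return maybe.toArray();
  return [];
}

function parseJsonField<T>(raw: unknown, fallback: T): T {
  if (typeof raw !== "string") return fallback;
  try {
    return JSON.parse(raw) as T;
  } catch {
    // Malformed JSON in a row should not crash a whole query.
    return fallback;
  }
}
```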

WP: Diagram Awareness (M1: Infrastructure)
- Safe migration that augments existing database
- Creates visuals table with proper schema
- Creates images/ directory for extracted diagrams
- --force flag to recreate if table exists
- Does NOT modify existing tables (catalog, chunks, concepts, categories)

Usage: npx tsx scripts/add-visuals-table.ts --dbpath ~/.concept_rag

WP: Diagram Awareness (M1: Infrastructure)
Visual extraction infrastructure:
- PDFPageRenderer: Renders PDF pages using pdftoppm
- ImageProcessor: Crop, grayscale conversion using sharp
- VisionLLMService: Classification (diagram vs photo) via OpenRouter
- VisualExtractor: Orchestrates extraction pipeline

Classification filters non-semantic content:
- Stores only: diagram, flowchart, chart, table, figure
- Filters out: photos, screenshots, decorative images

Dependencies:
- Added sharp for image processing

Scripts:
- extract-visuals.ts: Extract diagrams from catalog documents

WP: Diagram Awareness (M2: Extraction Pipeline)
Scripts:
- describe-visuals.ts: Generate semantic descriptions via Vision LLM
  - Updates visuals with descriptions and embeddings
  - Extracts concepts from descriptions
  - Links visuals to chunks on same page
  - Rate limiting for API calls
  - --redescribe flag to regenerate
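
The rate limiting above could be as simple as enforcing a minimum interval between successive Vision LLM calls; a sketch (the interval value and names are assumptions):

```typescript
// Pure helper: how long to wait before the next call is allowed.
function nextDelay(lastCallMs: number, nowMs: number, minIntervalMs: number): number {
  const elapsed = nowMs - lastCallMs;
  return elapsed >= minIntervalMs ? 0 : minIntervalMs - elapsed;
}

// Async wrapper: await this before each API call.
function createRateLimiter(minIntervalMs: number) {
  let last = 0;
  return async function wait(): Promise<void> {
    const delay = nextDelay(last, Date.now(), minIntervalMs);
    if (delay > 0) await new Promise((resolve) => setTimeout(resolve, delay));
    last = Date.now();
  };
}
```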

Prompts:
- visual-classification.txt: Diagram vs photo classification
- visual-description.txt: Semantic description generation

Features:
- Concept extraction from descriptions
- Chunk-to-visual linking by page number
- Dry-run mode for testing
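
The page-based chunk-to-visual linking might be sketched like this (types and names are assumptions): each visual is associated with the ids of text chunks on the same page of the same document.

```typescript
interface ChunkRef { id: string; catalogId: number; pageNumber: number }
interface VisualRef { id: string; catalogId: number; pageNumber: number }

// Map each visual id to the ids of chunks sharing its document and page.
function linkVisualsToChunks(visuals: VisualRef[], chunks: ChunkRef[]): Map<string, string[]> {
  const links = new Map<string, string[]>();
  for (const visual of visuals) {
    const ids = chunks
      .filter((c) => c.catalogId === visual.catalogId && c.pageNumber === visual.pageNumber)
      .map((c) => c.id);
    links.set(visual.id, ids);
  }
  return links;
}
```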

WP: Diagram Awareness (M3: Description & Embedding)
New MCP Tool:
- get_visuals: Retrieve diagrams, charts, tables, figures from documents
  - Filter by catalog_id, visual_type, page_number, concept
  - Returns description, image path, concept associations

Repository Enhancements:
- findByConceptName: Search visuals by concept name (case-insensitive)
- Updated interface and LanceDB implementation

Container Integration:
- Visuals table detection on startup
- Conditional tool registration when table exists
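
The conditional registration above could look roughly like this: only register get_visuals when the visuals table exists, so databases that have not been migrated keep working (the registry shape is an assumption):

```typescript
interface ToolRegistry { register(name: string): void }

// Register get_visuals only when the visuals table is present.
function registerVisualTools(tableNames: string[], tools: ToolRegistry): boolean {
  if (!tableNames.includes("visuals")) return false; // un-migrated DB: skip
  tools.register("get_visuals");
  return true;
}
```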

WP: Diagram Awareness (M4: Search Integration)
Updates:
- Added get_visuals to tool overview table (12 tools now)
- Added detailed get_visuals selection criteria section
- Added visual enrichment workflows (5. Enrich Search with Diagrams, 6. Browse Diagrams)
- Added test cases for visual queries

WP: Diagram Awareness (M4: Tool Documentation)
WP: Diagram Awareness (M5: Finalization)
@m2ux m2ux marked this pull request as ready for review December 29, 2025 17:06
m2ux added 15 commits December 29, 2025 17:11
Scripts:
- seed-test-visuals.ts: Populates test database with 8 sample visuals
  - Covers all visual types: diagram, flowchart, chart, table, figure
  - Links to existing catalog entries and concepts
  - Creates embeddings for semantic search

- test-get-visuals.ts: Verifies get_visuals functionality
  - Tests concept name search
  - Tests visual type filtering
  - Tests catalog ID filtering
  - Validates all repository methods work correctly

WP: Diagram Awareness (Test Database)
Breaking change from page-based to image-based extraction:
- Use pdfimages (poppler-utils) to extract actual embedded images
- Individual diagrams/figures now extracted, not full pages
- Image dimensions vary based on actual content (e.g., 725x493, 450x206)

Configuration:
- Add visionModel to LLMConfig (OPENROUTER_VISION_MODEL env var)
- Default: qwen/qwen2.5-vl-72b-instruct (configurable)
- Vision model no longer hardcoded in source

PDF extraction improvements:
- extractPdfImages() function in pdf-page-renderer.ts
- Minimum size filtering (100x100 default)
- Page number tracking from pdfimages -list output
- cleanupExtractedImages() for temp file cleanup
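
Parsing the `pdfimages -list` output to recover page numbers and dimensions might be sketched as follows (column positions follow poppler's table layout: page, num, type, width, height, ...):

```typescript
interface ImageListEntry { page: number; index: number; width: number; height: number }

// Extract page/index/width/height from `pdfimages -list` output, skipping the
// header and separator rows (data rows start with two integers).
function parsePdfImagesList(output: string): ImageListEntry[] {
  const entries: ImageListEntry[] = [];
  for (const line of output.split("\n")) {
    const cols = line.trim().split(/\s+/);
    if (cols.length < 5 || !/^\d+$/.test(cols[0]) || !/^\d+$/.test(cols[1])) continue;
    entries.push({
      page: Number(cols[0]),
      index: Number(cols[1]),
      width: Number(cols[3]),
      height: Number(cols[4]),
    });
  }
  return entries;
}
```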

Test results (23 documents):
- 268 semantic visuals extracted
- 199 non-semantic images filtered
- Individual diagram extraction verified
New naming scheme: {author}_{short-title}_{year}
Examples:
  - martin_clean-architecture_2017
  - gamma_design-patterns_1994
  - unknown_cosmos-blockchain_2023

Changes:
- Add slugify.ts utility with slugifyDocument(), formatVisualFilename()
- Update VisualExtractor to accept DocumentInfo and generate folder slug
- Update extract-visuals.ts to pass document metadata
- Add --cleanup flag to describe-visuals.ts for stale record removal
- Silently skip missing images instead of warning spam
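
A sketch of the `{author}_{short-title}_{year}` naming scheme; the exact truncation rule (first two title words) is an assumption, but the output matches the examples above:

```typescript
// Lowercase, collapse non-alphanumerics to hyphens, trim edge hyphens.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Build {author}_{short-title}_{year}; missing fields fall back to "unknown".
function slugifyDocument(
  author: string | undefined,
  title: string,
  year: string | number | undefined
): string {
  const authorSlug = author ? slugify(author.split(/[,\s]+/)[0]) : "unknown";
  const titleSlug = slugify(title).split("-").slice(0, 2).join("-"); // short title
  return `${authorSlug}_${titleSlug}_${year ?? "unknown"}`;
}
```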
…suals

- catalog_search: add catalog_id, replace source with title
- chunks_search: use catalog_id input instead of source path
- broad_chunks_search: add catalog_id, title, page_number, concepts
- concept_search: rename source_filter to title_filter, add image_ids
- get_visuals: add ids[] input for batch retrieval, remove chunk_ids

All tool workflows verified for interoperability:
- catalog_search → chunks_search (via catalog_id)
- catalog_search → get_visuals (via catalog_id)
- concept_search → get_visuals (via image_ids → ids)
- catalog_search: output now includes catalog_id and title (was source)
- chunks_search: input uses catalog_id (was source path)
- broad_chunks_search: output includes catalog_id, title, page_number, concepts
- concept_search: input uses title_filter (was source_filter), output includes image_ids
- get_visuals: add ids[] parameter, document full schema
- Update workflows to show catalog_id-based navigation
- Bump schema version to v8
- GetVisualsTool: basic retrieval, by IDs, by catalog_id, by type
- ConceptSearchTool: verify image_ids and catalog_id in output
- CatalogSearchTool: verify catalog_id in output
- Workflow: concept_search → get_visuals via image_ids
- Workflow: catalog_search → get_visuals via catalog_id
- Schema compliance: required fields, no deprecated fields

14 tests, all passing against db/test
- Verify images have descriptions relevant to searched concept
- Check image concepts match search terms (architecture, dependency, software)
- Validate diagram descriptions are meaningful (>20 chars, not errors)
- 100% relevance achieved on test database

18 tests, all passing
Empty responses from Vision LLM are expected for rate-limited or
simple images. Only log warnings when there's actual response content
to debug.
- Add ImageEmbeddedMetadata interface and embedMetadataInPng() function
- Update convertToGrayscale() to accept optional embedded metadata
- Visual extractor now passes document metadata (title, author, year,
  page, index, catalogId) when saving images
- Add --resume flag to extract-visuals.ts to skip already-processed docs
- Create update-image-metadata.ts script to backfill metadata on
  existing images

Metadata embedded includes: Title, Author, Year, Page, ImageIndex,
CatalogId, Software identifier
Resolve merge conflicts:
- Keep sharp ^0.34.5 (newer version)
- Remove @types/sharp (sharp 0.33+ includes own types)
- Fix TypeScript errors in image-processor.ts and vision-llm-service.ts
- Update ImageEmbeddedMetadata.year to accept string | number
Add high-performance pre-filter to skip page-sized images before LLM
classification. This dramatically improves processing of OCR-scanned
documents by avoiding expensive API calls for full-page scans.

Pre-filter rules:
- Skip images covering >70% of page area (full-page scans)
- Skip images matching page dimensions (>95% width AND height)
- Skip horizontal page-width strips (headers/footers)
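
The pre-filter rules above reduce to a few ratio checks; a sketch (thresholds for area and width mirror the rules, the strip-height cutoff is an assumption):

```typescript
interface Dimensions { width: number; height: number }

// True when an extracted image is likely a full-page scan or a header/footer
// strip and should be skipped before any LLM classification.
function isPageSizedImage(image: Dimensions, page: Dimensions): boolean {
  const areaRatio = (image.width * image.height) / (page.width * page.height);
  if (areaRatio > 0.7) return true; // covers >70% of page area
  const wideAsPage = image.width / page.width > 0.95;
  const tallAsPage = image.height / page.height > 0.95;
  if (wideAsPage && tallAsPage) return true; // matches page dimensions
  if (wideAsPage && image.height / page.height < 0.15) return true; // strip (assumed cutoff)
  return false;
}
```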

Performance improvement:
- OCR-scanned 'Mastering Elliott Wave': 2873 images → 0 LLM calls
- Native PDFs with diagrams: all legitimate images pass to LLM

Additional changes:
- Add getPdfPageDimensions() using pdfinfo
- Add analyzeImageVsPageSize() for pre-filter logic
- Add parallel batch processing (5 concurrent LLM calls)
- Update progress reporting with pre-filter stats
- Update classification prompt to reject scanned pages
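
The parallel batch processing could be sketched as a small concurrency-limited map (the worker-pool shape is an assumption; results keep input order):

```typescript
// Run at most `limit` calls to `fn` in parallel, preserving input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim the next unprocessed item
      results[index] = await fn(items[index]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

With `limit = 5` this matches the "5 concurrent LLM calls" described above.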
Replace Vision LLM classification with local LayoutParser model for
diagram detection. This eliminates API costs and enables offline
operation while maintaining high accuracy (95%+ on test images).

New components:
- scripts/python/classify_visual.py: Python classifier with two modes
  - classify: single image classification (native PDFs)
  - detect: region detection with bounding boxes (scanned PDFs)
- local-classifier.ts: TypeScript wrapper for Python script
- document-analyzer.ts: Auto-detect native vs scanned documents
- region-cropper.ts: Crop detected regions from page images
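
On the TypeScript side, the wrapper presumably parses the Python script's stdout and keeps only semantic types; a sketch (the JSON shape produced by classify_visual.py is an assumption):

```typescript
interface Region { label: string; box: [number, number, number, number]; score: number }
interface ClassifierResult { label?: string; confidence?: number; regions?: Region[] }

// Tolerant parse: malformed output means "no detection", not a crash.
function parseClassifierOutput(stdout: string): ClassifierResult {
  try {
    return JSON.parse(stdout) as ClassifierResult;
  } catch {
    return {};
  }
}

const SEMANTIC_TYPES = new Set(["diagram", "flowchart", "chart", "table", "figure"]);

// Keep only confident detections of the stored semantic types.
function isSemanticVisual(result: ClassifierResult, minConfidence = 0.5): boolean {
  return result.label !== undefined
    && SEMANTIC_TYPES.has(result.label)
    && (result.confidence ?? 0) >= minConfidence;
}
```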

Changes:
- visual-extractor.ts: Unified pipeline using local classifier
- extract-visuals.ts: No longer requires OPENROUTER_API_KEY
- index.ts: Export new modules

Performance:
- Classification cost: $0 (was ~$0.002/image)
- Classification speed: ~0.1s/image (was ~0.5s API latency)
- Accuracy: ~95% (verified on Clean Architecture diagrams)

Prerequisites:
- Python 3.8+ with LayoutParser + Detectron2
- Setup: cd scripts/python && ./setup.sh
- Scanned PDFs are now skipped entirely during visual extraction
- Native PDFs with all page-sized images are detected as scanned and skipped
- This avoids unreliable text-vs-diagram classification in OCR documents
- Added opencv-python to Python dependencies for future use
- Create EpubImageExtractor class for extracting images from EPUB files
- Add extractFromEpub() method to VisualExtractor
- Add unified extract() entry point that auto-detects format (PDF/EPUB)
- Update types with chapterIndex and chapterTitle fields for EPUB context
- Update extract-visuals.ts script to support both PDF and EPUB formats
- Include pre-filtering for cover images, icons, and decorative elements

Tested with 'Thinking in Systems' EPUB, successfully extracted 83 diagrams.
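
The cover/icon/decorative pre-filter might be sketched as a name-and-size check (the name patterns and size cutoff are assumptions):

```typescript
interface EpubImage { href: string; width: number; height: number }

// Drop likely-decorative EPUB images before classification.
function isLikelyDecorative(image: EpubImage): boolean {
  const name = image.href.toLowerCase();
  if (/cover|logo|icon|ornament|bullet/.test(name)) return true;
  if (image.width < 100 || image.height < 100) return true; // too small to be a diagram
  return false;
}
```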
m2ux (Owner, Author) commented Jan 18, 2026

Use this for local inference. Looks good! https://github.com/rednote-hilab/dots.ocr
