feat: add diagram awareness for visual content extraction #65
Conversation
- Vision LLM approach for semantic diagram understanding
- Store only semantic diagrams (not photos/decorative)
- Grayscale storage with color analysis
- New visuals table with external image storage
- Non-destructive database migration

Issue: #51
- Visual domain model for diagrams, charts, tables, figures
- VisualType enum: diagram, flowchart, chart, table, figure
- BoundingBox type with parse/serialize helpers
- VisualRepository interface with full CRUD operations
- Export from domain/models and domain/interfaces/repositories

WP: Diagram Awareness (M1: Infrastructure)
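The BoundingBox parse/serialize helpers mentioned above could look something like the following minimal sketch (the field names and the comma-separated string format are assumptions for illustration, not the actual implementation):

```typescript
// Hypothetical sketch: a bounding box serialized as "x,y,width,height"
// for storage, and parsed back into a structured object.
export interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

export function serializeBoundingBox(box: BoundingBox): string {
  return [box.x, box.y, box.width, box.height].join(",");
}

export function parseBoundingBox(value: string): BoundingBox {
  const parts = value.split(",").map(Number);
  if (parts.length !== 4 || parts.some(Number.isNaN)) {
    throw new Error(`Invalid bounding box: ${value}`);
  }
  const [x, y, width, height] = parts;
  return { x, y, width, height };
}
```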
- Full CRUD operations for visuals table
- Vector search for semantic queries
- Query by catalog, type, page, concept, chunk associations
- Batch add/update operations
- Arrow Vector and JSON field parsing

WP: Diagram Awareness (M1: Infrastructure)
- Safe migration that augments existing database
- Creates visuals table with proper schema
- Creates images/ directory for extracted diagrams
- --force flag to recreate if table exists
- Does NOT modify existing tables (catalog, chunks, concepts, categories)

Usage: npx tsx scripts/add-visuals-table.ts --dbpath ~/.concept_rag

WP: Diagram Awareness (M1: Infrastructure)
Visual extraction infrastructure:
- PDFPageRenderer: Renders PDF pages using pdftoppm
- ImageProcessor: Crop, grayscale conversion using sharp
- VisionLLMService: Classification (diagram vs photo) via OpenRouter
- VisualExtractor: Orchestrates extraction pipeline

Classification filters non-semantic content:
- Stores only: diagram, flowchart, chart, table, figure
- Filters out: photos, screenshots, decorative images

Dependencies:
- Added sharp for image processing

Scripts:
- extract-visuals.ts: Extract diagrams from catalog documents

WP: Diagram Awareness (M2: Extraction Pipeline)
Scripts:
- describe-visuals.ts: Generate semantic descriptions via Vision LLM
  - Updates visuals with descriptions and embeddings
  - Extracts concepts from descriptions
  - Links visuals to chunks on same page
  - Rate limiting for API calls
  - --redescribe flag to regenerate

Prompts:
- visual-classification.txt: Diagram vs photo classification
- visual-description.txt: Semantic description generation

Features:
- Concept extraction from descriptions
- Chunk-to-visual linking by page number
- Dry-run mode for testing

WP: Diagram Awareness (M3: Description & Embedding)
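The chunk-to-visual linking described above is essentially a join on catalog id and page number. A minimal sketch of that idea (the record shapes are assumptions, not the project's actual types):

```typescript
// Hypothetical sketch: associate each visual with the text chunks that
// appear on the same page of the same document.
interface ChunkRef {
  id: string;
  catalogId: string;
  pageNumber: number;
}

interface VisualRef {
  id: string;
  catalogId: string;
  pageNumber: number;
  chunkIds: string[];
}

function linkVisualsToChunks(visuals: VisualRef[], chunks: ChunkRef[]): void {
  for (const visual of visuals) {
    visual.chunkIds = chunks
      .filter(c => c.catalogId === visual.catalogId && c.pageNumber === visual.pageNumber)
      .map(c => c.id);
  }
}
```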
New MCP Tool:
- get_visuals: Retrieve diagrams, charts, tables, figures from documents
- Filter by catalog_id, visual_type, page_number, concept
- Returns description, image path, concept associations

Repository Enhancements:
- findByConceptName: Search visuals by concept name (case-insensitive)
- Updated interface and LanceDB implementation

Container Integration:
- Visuals table detection on startup
- Conditional tool registration when table exists

WP: Diagram Awareness (M4: Search Integration)
Updates:
- Added get_visuals to tool overview table (12 tools now)
- Added detailed get_visuals selection criteria section
- Added visual enrichment workflows (5. Enrich Search with Diagrams, 6. Browse Diagrams)
- Added test cases for visual queries

WP: Diagram Awareness (M4: Tool Documentation)
WP: Diagram Awareness (M5: Finalization)
Scripts:
- seed-test-visuals.ts: Populates test database with 8 sample visuals
  - Covers all visual types: diagram, flowchart, chart, table, figure
  - Links to existing catalog entries and concepts
  - Creates embeddings for semantic search
- test-get-visuals.ts: Verifies get_visuals functionality
  - Tests concept name search
  - Tests visual type filtering
  - Tests catalog ID filtering
  - Validates all repository methods work correctly

WP: Diagram Awareness (Test Database)
Breaking change from page-based to image-based extraction:
- Use pdfimages (poppler-utils) to extract actual embedded images
- Individual diagrams/figures now extracted, not full pages
- Image dimensions vary based on actual content (e.g., 725x493, 450x206)

Configuration:
- Add visionModel to LLMConfig (OPENROUTER_VISION_MODEL env var)
- Default: qwen/qwen2.5-vl-72b-instruct (configurable)
- Vision model no longer hardcoded in source

PDF extraction improvements:
- extractPdfImages() function in pdf-page-renderer.ts
- Minimum size filtering (100x100 default)
- Page number tracking from pdfimages -list output
- cleanupExtractedImages() for temp file cleanup

Test results (23 documents):
- 268 semantic visuals extracted
- 199 non-semantic images filtered
- Individual diagram extraction verified
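The page-number tracking and minimum-size filtering above can be sketched as a parser over `pdfimages -list` output. The column layout below follows poppler's `pdfimages`, but treat the exact format as an assumption to verify against your poppler version; the record shape is invented for illustration:

```typescript
// Sketch: recover page number and dimensions for each embedded image
// from `pdfimages -list` output, dropping images below a minimum size.
interface ExtractedImageInfo {
  page: number;
  index: number;
  width: number;
  height: number;
}

function parsePdfImagesList(output: string, minSize = 100): ExtractedImageInfo[] {
  const images: ExtractedImageInfo[] = [];
  for (const line of output.split("\n")) {
    const fields = line.trim().split(/\s+/);
    // Data rows look like "<page> <num> image <width> <height> ...";
    // header and separator lines are skipped.
    if (fields.length < 5 || fields[2] !== "image") continue;
    const info = {
      page: Number(fields[0]),
      index: Number(fields[1]),
      width: Number(fields[3]),
      height: Number(fields[4]),
    };
    if (info.width >= minSize && info.height >= minSize) images.push(info);
  }
  return images;
}
```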
New naming scheme: {author}_{short-title}_{year}
Examples:
- martin_clean-architecture_2017
- gamma_design-patterns_1994
- unknown_cosmos-blockchain_2023
Changes:
- Add slugify.ts utility with slugifyDocument(), formatVisualFilename()
- Update VisualExtractor to accept DocumentInfo and generate folder slug
- Update extract-visuals.ts to pass document metadata
- Add --cleanup flag to describe-visuals.ts for stale record removal
- Silently skip missing images instead of warning spam
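The naming scheme above suggests a slugifier along these lines. This is a sketch only; the real slugifyDocument() in slugify.ts may differ in details such as short-title truncation and author-name handling:

```typescript
// Sketch: build "{author}_{short-title}_{year}" slugs such as
// "martin_clean-architecture_2017".
interface DocumentInfo {
  author?: string;
  title: string;
  year?: string | number;
}

function slugify(text: string): string {
  return text
    .toLowerCase()
    .normalize("NFKD")
    .replace(/[^a-z0-9]+/g, "-") // collapse non-alphanumerics into hyphens
    .replace(/^-+|-+$/g, "");    // trim leading/trailing hyphens
}

function slugifyDocument(doc: DocumentInfo): string {
  // Last name only for the author; "unknown" when metadata is missing.
  const author = doc.author ? slugify(doc.author.split(/\s+/).pop() ?? "") : "unknown";
  // Keep the title short (first three words is an assumed cutoff).
  const title = slugify(doc.title).split("-").slice(0, 3).join("-");
  const year = doc.year ?? "unknown";
  return `${author}_${title}_${year}`;
}
```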
…suals
- catalog_search: add catalog_id, replace source with title
- chunks_search: use catalog_id input instead of source path
- broad_chunks_search: add catalog_id, title, page_number, concepts
- concept_search: rename source_filter to title_filter, add image_ids
- get_visuals: add ids[] input for batch retrieval, remove chunk_ids

All tool workflows verified for interoperability:
- catalog_search → chunks_search (via catalog_id)
- catalog_search → get_visuals (via catalog_id)
- concept_search → get_visuals (via image_ids → ids)
- catalog_search: output now includes catalog_id and title (was source)
- chunks_search: input uses catalog_id (was source path)
- broad_chunks_search: output includes catalog_id, title, page_number, concepts
- concept_search: input uses title_filter (was source_filter), output includes image_ids
- get_visuals: add ids[] parameter, document full schema
- Update workflows to show catalog_id-based navigation
- Bump schema version to v8
- GetVisualsTool: basic retrieval, by IDs, by catalog_id, by type
- ConceptSearchTool: verify image_ids and catalog_id in output
- CatalogSearchTool: verify catalog_id in output
- Workflow: concept_search → get_visuals via image_ids
- Workflow: catalog_search → get_visuals via catalog_id
- Schema compliance: required fields, no deprecated fields

14 tests, all passing against db/test
- Verify images have descriptions relevant to searched concept
- Check image concepts match search terms (architecture, dependency, software)
- Validate diagram descriptions are meaningful (>20 chars, not errors)
- 100% relevance achieved on test database

18 tests, all passing
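The "meaningful description" check above can be expressed as a small predicate. A sketch; the length threshold comes from the commit message, while the error markers are assumptions about what a failed Vision LLM response might look like:

```typescript
// Sketch: treat a description as meaningful if it is long enough and
// does not look like an error payload from the Vision LLM.
function isMeaningfulDescription(description: string): boolean {
  const text = description.trim();
  if (text.length <= 20) return false; // too short to be semantic (>20 chars required)
  // Assumed error markers; the real checks may differ.
  if (/^(error|rate limit|unable to)/i.test(text)) return false;
  return true;
}
```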
Empty responses from the Vision LLM are expected for rate-limited or simple images. Only log warnings when there is actual response content to debug.
- Add ImageEmbeddedMetadata interface and embedMetadataInPng() function
- Update convertToGrayscale() to accept optional embedded metadata
- Visual extractor now passes document metadata (title, author, year, page, index, catalogId) when saving images
- Add --resume flag to extract-visuals.ts to skip already-processed docs
- Create update-image-metadata.ts script to backfill metadata on existing images

Metadata embedded includes: Title, Author, Year, Page, ImageIndex, CatalogId, Software identifier
Resolve merge conflicts:
- Keep sharp ^0.34.5 (newer version)
- Remove @types/sharp (sharp 0.33+ includes own types)
- Fix TypeScript errors in image-processor.ts and vision-llm-service.ts
- Update ImageEmbeddedMetadata.year to accept string | number
Add high-performance pre-filter to skip page-sized images before LLM classification. This dramatically improves processing of OCR-scanned documents by avoiding expensive API calls for full-page scans.

Pre-filter rules:
- Skip images covering >70% of page area (full-page scans)
- Skip images matching page dimensions (>95% width AND height)
- Skip horizontal page-width strips (headers/footers)

Performance improvement:
- OCR-scanned 'Mastering Elliott Wave': 2873 images → 0 LLM calls
- Native PDFs with diagrams: all legitimate images pass to LLM

Additional changes:
- Add getPdfPageDimensions() using pdfinfo
- Add analyzeImageVsPageSize() for pre-filter logic
- Add parallel batch processing (5 concurrent LLM calls)
- Update progress reporting with pre-filter stats
- Update classification prompt to reject scanned pages
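The pre-filter rules above reduce to simple geometry. A sketch of the decision logic; the 70% and 95% thresholds come from the commit message, while the 15% strip-height cutoff and the record shape are assumptions:

```typescript
// Sketch: decide whether an image is likely a full-page scan or a
// header/footer strip, so it can be skipped before any LLM call.
interface Dimensions {
  width: number;
  height: number;
}

function shouldSkipImage(image: Dimensions, page: Dimensions): boolean {
  const areaRatio = (image.width * image.height) / (page.width * page.height);
  if (areaRatio > 0.7) return true; // covers >70% of page area: full-page scan
  const widthRatio = image.width / page.width;
  const heightRatio = image.height / page.height;
  if (widthRatio > 0.95 && heightRatio > 0.95) return true; // matches page dimensions
  if (widthRatio > 0.95 && heightRatio < 0.15) return true; // page-width strip (assumed 15% height cutoff)
  return false;
}
```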
Replace Vision LLM classification with local LayoutParser model for diagram detection. This eliminates API costs and enables offline operation while maintaining high accuracy (95%+ on test images).

New components:
- scripts/python/classify_visual.py: Python classifier with two modes
  - classify: single image classification (native PDFs)
  - detect: region detection with bounding boxes (scanned PDFs)
- local-classifier.ts: TypeScript wrapper for Python script
- document-analyzer.ts: Auto-detect native vs scanned documents
- region-cropper.ts: Crop detected regions from page images

Changes:
- visual-extractor.ts: Unified pipeline using local classifier
- extract-visuals.ts: No longer requires OPENROUTER_API_KEY
- index.ts: Export new modules

Performance:
- Classification cost: $0 (was ~$0.002/image)
- Classification speed: ~0.1s/image (was ~0.5s API latency)
- Accuracy: ~95% (verified on Clean Architecture diagrams)

Prerequisites:
- Python 3.8+ with LayoutParser + Detectron2
- Setup: cd scripts/python && ./setup.sh
- Scanned PDFs are now skipped entirely during visual extraction
- Native PDFs with all page-sized images are detected as scanned and skipped
- This avoids unreliable text-vs-diagram classification in OCR documents
- Added opencv-python to Python dependencies for future use
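The "all page-sized images" heuristic above can be sketched at the document level. The 95% threshold mirrors the earlier pre-filter commit; the record shape is an assumption:

```typescript
// Sketch: a document whose extracted images are all (near) page-sized
// is treated as a scanned document and skipped entirely.
interface ImageOnPage {
  width: number;
  height: number;
  pageWidth: number;
  pageHeight: number;
}

function looksScanned(images: ImageOnPage[]): boolean {
  if (images.length === 0) return false; // no images: nothing to decide
  return images.every(img =>
    img.width / img.pageWidth > 0.95 && img.height / img.pageHeight > 0.95
  );
}
```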
- Create EpubImageExtractor class for extracting images from EPUB files
- Add extractFromEpub() method to VisualExtractor
- Add unified extract() entry point that auto-detects format (PDF/EPUB)
- Update types with chapterIndex and chapterTitle fields for EPUB context
- Update extract-visuals.ts script to support both PDF and EPUB formats
- Include pre-filtering for cover images, icons, and decorative elements

Tested with 'Thinking in Systems' EPUB, successfully extracted 83 diagrams.
Use this for local inference. Looks good! https://github.com/rednote-hilab/dots.ocr
Summary
Implements diagram awareness feature to extract, describe, and integrate visual content (diagrams, charts, tables, figures) into the RAG system for enhanced semantic search.
🎫 Issue #51
ADR: adr0056-diagram-awareness
Motivation
Technical documents often contain diagrams that provide critical semantic context not captured in text. This feature enables the system to extract, describe, and search that visual content alongside the text it accompanies.
Changes
M1: Infrastructure
- Visual domain model (src/domain/models/visual.ts) - Visual type with bounding box, description, concepts
- VisualRepository interface (src/domain/interfaces/repositories/visual-repository.ts) - Full CRUD + search operations
- LanceDB implementation (src/infrastructure/lancedb/repositories/lancedb-visual-repository.ts)
- Migration script (scripts/add-visuals-table.ts) - Non-destructive table addition

M2: Extraction Pipeline
- Extraction types (src/infrastructure/visual-extraction/types.ts)
- PDFPageRenderer (src/infrastructure/visual-extraction/pdf-page-renderer.ts)
- ImageProcessor (src/infrastructure/visual-extraction/image-processor.ts) - Sharp-based grayscale conversion
- VisionLLMService (src/infrastructure/visual-extraction/vision-llm-service.ts) - Classification + description
- VisualExtractor (src/infrastructure/visual-extraction/visual-extractor.ts)
- Extraction script (scripts/extract-visuals.ts)

M3: Description & Embedding
- Description script (scripts/describe-visuals.ts) - Vision LLM semantic descriptions
- Prompts (prompts/visual-classification.txt, prompts/visual-description.txt)

M4: Search Integration
- get_visuals tool (src/tools/operations/get-visuals-tool.ts)

M5: Finalization
Testing
- Typecheck passes (tsc --noEmit)

Usage
Submission Checklist
Fork Strategy