Skip to content

(retriever) Add .store() task for persisting extracted images (#1675)#1714

Open
jioffe502 wants to merge 6 commits intomainfrom
jioffe/store-extracted-images
Open

(retriever) Add .store() task for persisting extracted images (#1675)#1714
jioffe502 wants to merge 6 commits intomainfrom
jioffe/store-extracted-images

Conversation

@jioffe502
Copy link
Copy Markdown
Collaborator

@jioffe502 jioffe502 commented Mar 24, 2026

Summary

Adds a new .store() pipeline task to nemo_retriever that persists extracted images (full-page + sub-page table/chart/infographic/image assets) to local or cloud storage via fsspec/UPath.

  • Wire .store() into InProcessIngestor, BatchIngestor (map_batches), and batch CLI via --store-images-uri
  • Add StoreParams with per-content toggles, image_format, and strip_base64 (default False — opt-in memory reduction, preserves multimodal embedding compatibility)
  • Ensure direct-write file extensions match actual encoded bytes (declared encoding field + magic-byte sniff fallback); cropped outputs follow configured image_format
  • Fine-grained control (per-content-type toggles, public_base_url, storage_options) available via StoreParams in library use; CLI exposes --store-images-uri for common case

Test plan

  • test_io_image_store.py: page images, structured crops, natural images, format consistency, stripping, edge cases
  • test_multimodal_embed.py: verifies store → embed behavior for multimodal with/without stripping
  • Integration: bo20 dataset (20 PDFs, 496 pages) → 842 stored assets (496 page images + 346 table/chart crops)
uv run pytest tests/test_io_image_store.py tests/test_multimodal_embed.py -q

Known limitations (follow-ups)

  • Schema coupling: store targets singular structured columns (table, chart, infographic) matching OCR stage output
  • Corrupt image_b64 skip: if direct image_b64 exists but decode fails, we skip rather than falling back to bbox_xyxy_norm crop from page image
  • _safe_stem collision: two files with same basename but different directories share output subdirectory; hash-based naming is a follow-up
  • Observability: per-type write counters (pages, tables, charts, failures) deferred

Closes #1675

🤖 Generated with Claude Code

jioffe502 and others added 2 commits March 24, 2026 18:41
- Add store_extracted_images() with fsspec/UPath support for local and cloud storage
- Wire .store() into InProcessIngestor, BatchIngestor, and batch_pipeline CLI
- Add StoreParams model, unit tests, and fsspec/universal-pathlib deps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
…mn name fix

- Fix column names (table/chart/infographic not plural) to match OCR output
- Add magic byte sniffing so file extension matches actual image encoding
- Make base64 stripping opt-in (strip_base64=False) to preserve multimodal compat
- Add multimodal embed interaction tests and format consistency tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
@jioffe502 jioffe502 requested review from a team as code owners March 24, 2026 20:21
@jioffe502 jioffe502 requested a review from edknv March 24, 2026 20:21
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
  storage_uri to absolute

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Copy link
Copy Markdown
Collaborator

@jperez999 jperez999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would love to see an actual graph unit test for this store behavior. Perhaps we could run a few of the stages in a graph locally get to the OCR stage and export. Might be worth having tests that show we can export at multiple stages in the pipeline and get the things we are targeting at each store call.

lancedb: LanceDbParams = Field(default_factory=LanceDbParams)


class StoreParams(_ParamsModel):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this allow me to store full chunks also... I think that is the only permutation you dont have here, text..?

from nemo_retriever.params import StoreParams


def _make_tiny_png_b64(width: int = 4, height: int = 4, color=(255, 0, 0)) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these functions are doubled up in other files. We are going to need to consolidate that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA]: nemo_retriever library: Support storing extracted images to disk

2 participants