Skip to content

Replace pdf-extract with pdf-inspector and add scanned PDF image extraction#453

Open
devin-ai-integration[bot] wants to merge 8 commits intomasterfrom
devin/1772577735-swap-pdf-inspector
Open

Replace pdf-extract with pdf-inspector and add scanned PDF image extraction#453
devin-ai-integration[bot] wants to merge 8 commits intomasterfrom
devin/1772577735-swap-pdf-inspector

Conversation

@devin-ai-integration
Copy link
Contributor

@devin-ai-integration devin-ai-integration bot commented Mar 3, 2026

Summary

Swaps the pdf-extract crate for firecrawl/pdf-inspector to improve PDF text extraction in the Tauri desktop app, and adds scanned PDF page image extraction for vision-model OCR (e.g. Qwen 3 VL).

Library swap (pdf-inspector)

  • Smart PDF type detection (text-based, scanned, image-based, mixed)
  • Markdown-formatted output (headers, tables, lists, code blocks, multi-column)
  • Encoding issue detection with logging
  • Better handling of CID fonts, ToUnicode CMaps, and linearized PDFs

The Tauri command API (extract_document_content) keeps the same request/response shape — no breaking TypeScript changes.

Scanned PDF image extraction (new)

Instead of returning an error for scanned/image-based PDFs, the code now extracts embedded page images and returns them as base64 JPEG data URLs in a new page_images field on DocumentData. This enables the frontend to send these images to a vision model for OCR.

How it works:

  1. pdf-inspector detects PDF type → text-based pages extracted as markdown (fast, free)
  2. For scanned/image-based PDFs with no extractable text → extract_page_images() parses the PDF structure via lopdf, finds Image XObjects, and converts them to JPEG data URLs
  3. Supported image encodings: DCTDecode (JPEG passthrough), CCITTFaxDecode (wrapped in minimal TIFF, decoded, re-encoded as JPEG), FlateDecode (zlib decompression → raw pixels → JPEG)
  4. The page_images field is Option<Vec<String>> with #[serde(skip_serializing_if = "Option::is_none")] — fully backward compatible, omitted from JSON for text-based PDFs

New dependencies: lopdf (git), image v0.25 (tiff/jpeg/png features), flate2

Frontend handling (UnifiedChat.tsx)

  • Added page_images?: string[] to RustDocumentResponse TypeScript interface
  • Scanned PDFs are detected via result.document?.page_images?.length check (runs before the text_content check, since empty string is falsy in JS and would silently drop the response)
  • Only a placeholder message is stored in documentText: "[Scanned PDF: N page image(s) extracted. OCR via vision model is not yet supported.]" — the actual base64 image data is intentionally not included in documentText to avoid sending multi-MB blobs as plain text to the AI model
  • Note: The actual vision-model OCR call (sending images to Qwen 3 VL) is NOT wired up yet — this PR provides the Rust extraction and frontend plumbing; the OCR integration is a follow-up

Key changes

  • DocumentData.page_images: New optional field containing base64 JPEG data URLs for scanned page images
  • extract_page_images(): Parses PDF page structure, extracts Image XObjects, handles 3 compression formats
  • PDF Filter array handling: Per PDF spec §7.3.4.2, Filter can be a single Name or an Array of Names — code now handles both (uses last array element, which describes the final data format)
  • wrap_ccitt_as_tiff(): Builds a minimal TIFF container (8-entry IFD) to decode CCITT Group 3/4 fax data via the image crate
  • decode_flate_to_data_url(): Handles zlib-compressed raw pixel data (DeviceGray 1-bit/8-bit, DeviceRGB 8-bit)
  • Test: extract_scanned_pdf_returns_page_images: Verifies scanned PDF returns page_images with valid JPEG data URL (>1KB decoded)
  • Test: test_wrap_ccitt_as_tiff_structure: Verifies TIFF wrapper byte layout (header, IFD offset, data appended at end)
  • 12 tests total (was 11): All pass including the new scanned PDF image extraction tests
  • Both test fixtures (test_fixtures/bitcoin_whitepaper.pdf and test_fixtures/scanned_letter.pdf) are read from disk at test time via std::fs::read() inside the #[cfg(test)] module — nothing is embedded in production builds

Updates since last revision

  • Fixed Devin Review finding (commit f1738be): Removed page_images from the documentText JSON to prevent sending multi-MB base64 blobs as plain text to the AI model. Only the placeholder message is now stored in documentText.
  • Merged master (commit ecca4d1): Picked up nix flake update for desktop login fix.

Review & Testing Checklist for Human

  • Multi-filter streams may silently produce corrupt images: The code recognizes Filter arrays like [/ASCII85Decode /DCTDecode] but uses raw stream.content without decoding through intermediate filters. DCTDecode path will base64-encode ASCII85 data (producing corrupt JPEG), other paths will error. Multi-filter image streams are rare in scanned PDFs but this is a legitimate bug for PDFs that use them. See Devin Review comment 2882169679.
  • FlateDecode PNG predictor not handled: Many FlateDecode images use PNG prediction (Predictor=10-15 in DecodeParms) which prepends a filter byte to each row. Current code will fail with buffer size mismatch. See Devin Review comment 2882169740.
  • TIFF wrapper only tested with one fixture: The hand-rolled TIFF IFD builder for CCITT fax data (wrap_ccitt_as_tiff) is only tested with a single 1-bit B&W scanned letter. Test with other scanned PDFs (multi-page, color scans, different CCITT variants, Group 3 vs Group 4) to verify image decoding works correctly.
  • Vision-model OCR not wired up: Users uploading scanned PDFs see a placeholder "[Scanned PDF: N page image(s) extracted. OCR via vision model is not yet supported.]" with no actual OCR. This is intentional for this PR but should be wired up in follow-up work.
  • End-to-end UI testing: Test with a variety of real PDFs (text-based, scanned, mixed, multi-column, table-heavy, multi-page scanned) on a real desktop build. The Rust tests cover extraction logic but not the full UI flow (file picker → base64 encode → Tauri invoke → display in chat).

Notes

  • Requested by @AnthonyRonning
  • Devin Session
  • The pdf-inspector library does not perform OCR — it extracts text operators from the PDF's internal structure. For truly image-only scanned PDFs, no text will be extracted and the new page_images field will contain the rasterized page images for external OCR (e.g. via Qwen 3 VL).
  • Git dependency on main branch: pdf-inspector is a git dep pinned to branch = "main". The lockfile pins a specific commit, but future cargo update could pull breaking changes. Consider pinning to a specific rev for stability.
  • Devin Review findings: Two edge-case bugs were identified in later review (multi-filter streams and PNG predictor handling). These are flagged as known limitations since they affect uncommon PDF variants and fail gracefully (errors instead of silent corruption, except for DCTDecode multi-filter case).

Open with Devin

Swap out the pdf-extract crate for pdf-inspector from firecrawl, which provides:
- Smart PDF type detection (text-based, scanned, image-based, mixed)
- Markdown output with headers, tables, lists, code blocks
- Multi-column layout support
- Encoding issue detection
- Better handling of various PDF types including CID fonts

The Tauri command API (extract_document_content) remains unchanged -
same request/response shape, so no TypeScript changes needed.

For scanned/image-based PDFs, returns a clear error message instead
of silently returning garbage text.

Co-Authored-By: tony@opensecret.cloud <TonyGiorgio@protonmail.com>
@devin-ai-integration
Copy link
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Mar 3, 2026

Deploying maple with  Cloudflare Pages  Cloudflare Pages

Latest commit: f1738be
Status: ✅  Deploy successful!
Preview URL: https://7c9c471a.maple-ca8.pages.dev
Branch Preview URL: https://devin-1772577735-swap-pdf-in.maple-ca8.pages.dev

View logs

Copy link
Contributor Author

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

devin-ai-integration bot and others added 4 commits March 4, 2026 00:42
Co-Authored-By: tony@opensecret.cloud <TonyGiorgio@protonmail.com>
Co-Authored-By: tony@opensecret.cloud <TonyGiorgio@protonmail.com>
…traction on all types

Co-Authored-By: tony@opensecret.cloud <TonyGiorgio@protonmail.com>
- Add page_images field to DocumentData (Option<Vec<String>>)
- Extract embedded images from scanned PDFs using lopdf
- Handle CCITTFaxDecode (TIFF wrapper), DCTDecode (JPEG passthrough), FlateDecode (zlib)
- Return JPEG data URLs for scanned pages instead of erroring
- Add flate2 dependency for zlib decompression
- Add image crate with tiff/jpeg/png features for image decoding
- New test: extract_scanned_pdf_returns_page_images verifies JPEG output
- New test: test_wrap_ccitt_as_tiff_structure verifies TIFF wrapper format
- Text-based PDFs unchanged (page_images=None, skipped in JSON via serde)

Co-Authored-By: tony@opensecret.cloud <TonyGiorgio@protonmail.com>
@devin-ai-integration devin-ai-integration bot changed the title Replace pdf-extract with firecrawl/pdf-inspector for PDF parsing Replace pdf-extract with pdf-inspector and add scanned PDF image extraction Mar 4, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

- Handle PDF Filter as array per spec §7.3.4.2 (e.g. [/ASCII85Decode /DCTDecode])
- Use last element of filter array (describes final data format)
- Add page_images to RustDocumentResponse TypeScript interface
- Handle scanned PDF response: check page_images before text_content
- Show placeholder text for scanned PDFs pending vision-model OCR

Addresses Devin Review findings on PR #453.

Co-Authored-By: tony@opensecret.cloud <TonyGiorgio@protonmail.com>
Copy link
Contributor Author

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

View 13 additional findings in Devin Review.

Open in Devin Review

Comment on lines +187 to +201
f.as_name().map(|n| n.to_vec()).ok().or_else(|| {
f.as_array()
.ok()
.and_then(|arr| arr.last())
.and_then(|last| last.as_name().ok())
.map(|n| n.to_vec())
})
})
.unwrap_or_default();

let data_url = match filter_name.as_slice() {
b"DCTDecode" => {
// Raw JPEG -- pass through directly
let b64 = BASE64.encode(&stream.content);
format!("data:image/jpeg;base64,{b64}")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Multi-filter streams decoded using raw bytes without applying preceding filters

When the PDF Filter entry is an array (e.g. [/ASCII85Decode /DCTDecode]), the code correctly identifies the last filter as the image format but then uses stream.content (the raw, fully-encoded bytes) as if only the last filter was applied. The raw bytes still need to be decoded through all preceding filters first.

Root Cause and Impact

At frontend/src-tauri/src/pdf_extractor.rs:182-201, the code extracts the last filter name from the array to determine the image type, but at lines 200, 205, and 209 it passes stream.content directly to the handler functions. For a stream with Filter: [/ASCII85Decode /DCTDecode], stream.content contains ASCII85-encoded JPEG data — not raw JPEG bytes.

  • DCTDecode case (line 200): The ASCII85-encoded bytes are silently base64-encoded and returned as a JPEG data URL. The resulting image will be silently corrupt — no error is raised, but the image is garbage.
  • FlateDecode case (line 254): flate2::ZlibDecoder will fail because the input isn't valid zlib data, producing an error.
  • CCITTFaxDecode case (line 236): The TIFF wrapper will contain incorrectly-encoded data, and image::load_from_memory will likely fail.

While multi-filter streams are less common in scanned documents, the code explicitly handles the array case (lines 188-193), which creates an expectation that it works correctly. The silent corruption in the DCTDecode path is particularly dangerous.

Prompt for agents
In frontend/src-tauri/src/pdf_extractor.rs, the extract_page_images function (around lines 182-216) handles multi-filter streams incorrectly. When Filter is an array like [/ASCII85Decode /DCTDecode], the code identifies the last filter but uses stream.content (raw encoded bytes) without decoding through the preceding filters.

Two possible fixes:

1. (Simpler, safer) Only handle single-filter streams. When Filter is an array with more than one element, log a warning and skip the image (similar to the unsupported filter case). Change the filter detection logic around lines 182-195 to check if the array has length > 1 and skip/warn.

2. (More complete) Use lopdf's built-in stream decoding (e.g., Stream::decompressed_content() or equivalent) to get fully-decoded bytes, and then only use the last filter to determine the final image format. This would require checking if lopdf's decoding handles all the intermediate filters correctly.

Option 1 is recommended for safety since multi-filter image streams are uncommon in scanned PDFs, and silently producing corrupt data is worse than skipping.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — this is a valid edge case. For now, multi-filter streams are very rare in scanned documents (scanners typically use single-filter encoding like /CCITTFaxDecode or /DCTDecode). If we encounter one in practice, the non-DCTDecode paths will error out (caught gracefully), and the DCTDecode path would produce a corrupt image.

I've left the array handling in place so single-element arrays like [/DCTDecode] work correctly, which is the common case per the spec. For true multi-filter pipelines, this would need lopdf's stream decoding to apply intermediate filters first. Flagging as a known limitation for human review.

Comment on lines +245 to +258
fn decode_flate_to_data_url(stream: &lopdf::Stream) -> Result<String, String> {
let width =
get_stream_int(&stream.dict, b"Width").ok_or("FlateDecode image missing Width")? as u32;
let height =
get_stream_int(&stream.dict, b"Height").ok_or("FlateDecode image missing Height")? as u32;
let bpc = get_stream_int(&stream.dict, b"BitsPerComponent").unwrap_or(8) as u32;

// Decompress zlib data
use std::io::Read;
let mut decoder = flate2::read::ZlibDecoder::new(&stream.content[..]);
let mut raw_pixels = Vec::new();
decoder
.read_to_end(&mut raw_pixels)
.map_err(|e| format!("Failed to decompress FlateDecode image: {e}"))?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 FlateDecode handler ignores PNG/TIFF Predictor from DecodeParms, causing extraction failures

The decode_flate_to_data_url function decompresses zlib data and assumes the result is pure pixel data, but many PDF FlateDecode image streams use a PNG predictor (Predictor=10–15 in DecodeParms) that prepends a filter byte to each row.

Root Cause and Impact

At frontend/src-tauri/src/pdf_extractor.rs:244-258, after zlib decompression the raw bytes are passed directly to image::GrayImage::from_raw() or image::RgbImage::from_raw(). These functions expect exactly width * height * channels bytes.

When a PNG predictor is used, each row has an extra filter byte prepended, so the decompressed data is height * (width * channels + 1) bytes instead of height * width * channels. For example, for a 100×100 grayscale image:

  • Expected by from_raw(): 10,000 bytes
  • Actual with PNG predictor: 10,100 bytes (100 extra filter bytes)

from_raw() returns None when the buffer size doesn't match, which gets converted to an error like "Failed to construct grayscale image from raw pixels".

PNG predictors with FlateDecode are very common in real-world PDFs — many PDF generators (including popular scanners) use them for better compression. The DecodeParms dictionary's Predictor key (accessible via the existing get_decode_parm_int helper at line 393) would indicate this, but it is never checked in the FlateDecode path.

Impact: Many valid FlateDecode images in scanned PDFs will fail to extract, returning an error instead of image data.

Prompt for agents
In frontend/src-tauri/src/pdf_extractor.rs, the decode_flate_to_data_url function (lines 245-293) needs to handle PNG prediction (Predictor values 10-15 in DecodeParms) after zlib decompression.

After decompressing the zlib data at line 258, add logic to:
1. Read the Predictor value from DecodeParms using the existing get_decode_parm_int helper: get_decode_parm_int(&stream.dict, b"Predictor").unwrap_or(1)
2. If Predictor is 1 (or absent), proceed as-is (no prediction).
3. If Predictor is 10-15 (PNG predictors), apply PNG row-filter reversal: each row starts with a 1-byte filter type (0=None, 1=Sub, 2=Up, 3=Average, 4=Paeth) followed by width*channels bytes. You need to strip the filter bytes and undo the prediction to get raw pixel data. The Columns parameter from DecodeParms gives the number of pixel columns per row.
4. If Predictor is 2 (TIFF predictor), apply horizontal differencing reversal.

For a simpler initial fix, you could at minimum handle Predictor=10 (PNG None filter) by just stripping the leading byte from each row, since that's the most common PNG predictor variant in PDFs. Log a warning and skip images with other predictor values.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point — PNG predictor handling is missing from the FlateDecode path. In practice, our scanned PDF test fixture uses CCITTFaxDecode (the most common encoding for B&W scanned documents), and the FlateDecode path is mainly a fallback for color scanned images.

If a FlateDecode image with PNG predictor is encountered, from_raw() will return None due to the size mismatch, and the error is caught gracefully (that image is skipped/errors out, but doesn't crash the whole extraction).

Adding full PNG predictor support (None/Sub/Up/Average/Paeth filter reversal) would be worthwhile if we see FlateDecode images failing in real-world use. Flagging as a known limitation for human review.

devin-ai-integration[bot]

This comment was marked as resolved.

…MB blobs to AI model

Co-Authored-By: tony@opensecret.cloud <TonyGiorgio@protonmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant