
Fix PP-DocLayoutV3 head aliasing in layout loader #180

Open
VooDisss wants to merge 7 commits into zai-org:main from VooDisss:ppdoclayout-head-remap-minimal

Conversation

@VooDisss
Contributor

@VooDisss VooDisss commented Apr 1, 2026

Summary

Fixes #179

  • fix PP-DocLayoutV3 checkpoint loading in GLM-OCR by aliasing tied enc_* detection-head weights to the decoder head names expected by PPDocLayoutV3ForObjectDetection
  • load the layout model from a prepared state dict so transformers no longer initializes missing decoder detection heads
  • replace deprecated PPDocLayoutV3ImageProcessorFast usage with PPDocLayoutV3ImageProcessor
  • add focused unit coverage for the aliasing logic and update detector startup mocks for the new load path and processor rename

Why

The published PaddlePaddle/PP-DocLayoutV3_safetensors checkpoint stores trained detection-head weights under:

  • model.enc_score_head.*
  • model.enc_bbox_head.layers.*

but the object-detection wrapper used by GLM-OCR expects:

  • model.decoder.class_embed.*
  • model.decoder.bbox_embed.layers.*

Without aliasing those keys before model load, the decoder detection heads are treated as missing and newly initialized, which degrades layout detection in practice.
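The aliasing step can be sketched as a plain state-dict key remap. This is a sketch only: the helper name and exact prefix handling are assumptions for illustration, not the repository's actual code.

```python
def alias_pp_doclayout_heads(state_dict):
    """Copy tied enc_* detection-head entries onto the decoder-head key
    names expected by PPDocLayoutV3ForObjectDetection (illustrative helper)."""
    aliases = {
        "model.enc_score_head.": "model.decoder.class_embed.",
        "model.enc_bbox_head.layers.": "model.decoder.bbox_embed.layers.",
    }
    prepared = dict(state_dict)
    for key, tensor in state_dict.items():
        for src, dst in aliases.items():
            if key.startswith(src):
                # Add the decoder-named alias; keep the original enc_* key,
                # since these are declared as tied/shared weight groups.
                prepared[dst + key[len(src):]] = tensor
    return prepared
```

Because the implementation treats these as tied weight groups, the sketch copies the tensors under the decoder names while leaving the original enc_* keys in place.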

The runtime also still used the deprecated PPDocLayoutV3ImageProcessorFast entry point, which produced a warning on every worker startup under transformers 5.4.0.

Validation

  • pytest glmocr/tests/test_unit.py -k "detector_device_selection or detector_prepares_pp_doclayout_decoder_head_aliases"
  • real self-hosted OCR pipeline run on local PDFs with successful processing after the fix

Scope

This PR intentionally keeps the fix narrow:

  • no meta-tensor recovery logic
  • no broader layout runtime refactor
  • no broader processor refactor beyond replacing the deprecated entry point used by the detector

Those can be handled separately if needed.

Route PP-DocLayoutV3 'number' regions through OCR instead of dropping them, then extract printed page number evidence from recognized number blocks in the result formatter. Preserve the feature as number-only, support both numeric and Roman labels, and derive three explicit output layers: page_number_candidates, document_page_numbering, and page_metadata.

Keep the general json_result contract lean by stripping transient layout_index and layout_score fields from final blocks while retaining native_label, and wrap saved paper.json only when real printed-page data exists. Also expose detect_printed_page_numbers through config, constructor, and environment overrides, align MaaS output with self-hosted behavior, add contract-focused tests for legacy-vs-wrapped save behavior and lean json_result output, and document the exact save contract in the English and Chinese READMEs.

Add an SDK-owned image asset export path for PP-DocLayoutV3 image regions. Rendered region assets are now the base behavior of this feature and are written to imgs_rendered/. When enable_image_asset_export=True, the SDK additionally inspects embedded PDF images via PyMuPDF, matches them to layout image regions using same-page geometry, IoU, containment, aspect-ratio plausibility, and one-to-one assignment, and writes matched assets to imgs_embedded/. Markdown selection is controlled by markdown_image_preference ('embedded' or 'rendered').

The image block contract is explicit and stable: image_path is the selected asset path, rendered_image_path reflects the rendered asset when one exists, embedded_image_path is null when no embedded match exists, and image_asset_source records whether the selected asset is rendered or embedded. The implementation avoids center-distance-only matching, preserves formatter-produced rendered assets in self-hosted mode instead of re-deriving them, and aggressively prevents stale asset advertisement: if a rendered asset cannot actually be preserved or regenerated, final JSON and Markdown do not continue to reference it. Focused regression tests cover rendered-only mode, embedded preference, rendered preference, nested asset persistence, preservation misses, no-render-pages recovery, crop-failure recovery, and both rendered-origin and embedded-origin stale-markdown cleanup.

Close the remaining stale-asset recovery gaps in the SDK image export path so final JSON and Markdown do not advertise rendered assets unless they were actually preserved or produced. This covers both no-render-pages and explicit crop-failure branches, including cases where stale markdown originally pointed at embedded assets.

Also document the image asset export feature in README.md and README_zh.md, including the exposed configuration surface, the imgs_rendered/ and imgs_embedded/ directory contract, and the stable image block fields: image_path, rendered_image_path, embedded_image_path, and image_asset_source.

GLM-OCR loaded PPDocLayoutV3ForObjectDetection directly from the published Hugging Face checkpoint, but the checkpoint stores the tied detection head weights under model.enc_score_head.* and model.enc_bbox_head.layers.* while the object-detection wrapper expects model.decoder.class_embed.* and model.decoder.bbox_embed.layers.*. In practice this caused the decoder detection heads to be treated as missing and newly initialized, which surfaced as startup warnings, unstable layout behavior, and degraded self-hosted OCR results.

The fix keeps the change narrow: load the PP-DocLayoutV3 config separately, load model.safetensors directly, alias the tied encoder-head keys onto the decoder-head names before model construction, and instantiate the model with from_pretrained(None, config=..., state_dict=...). This avoids broader runtime recovery logic and keeps the compatibility repair at the checkpoint-loading boundary where the mismatch actually occurs.

The background investigation included local inspection of the cached safetensors checkpoint, installed transformers 5.4.0 PP-DocLayoutV3 source, Paddle inference artifacts, and upstream release context. The key finding was that the checkpoint is not headless: the trained head weights are present under enc_* names, and the local transformers implementation explicitly declares decoder.class_embed <-> enc_score_head and decoder.bbox_embed <-> enc_bbox_head as tied/shared weight groups. That made aliasing the minimal defensible fix for GLM-OCR rather than reworking the full layout runtime.

Tests were updated only as needed for the new load path. Existing detector device-selection tests now stub the config and prepared state-dict helpers, and a focused unit test verifies that _prepare_pp_doclayout_state_dict aliases encoder-head weights into the decoder-head keys expected by the object-detection wrapper. Validation also included a real self-hosted pipeline run over local PDFs, where the old missing decoder-head load report disappeared and processing completed successfully after the fix.

Follow up the PP-DocLayoutV3 checkpoint aliasing fix by switching the layout detector from PPDocLayoutV3ImageProcessorFast to PPDocLayoutV3ImageProcessor. Under the current transformers 5.4.0 runtime, the Fast-suffixed processor emits a deprecation warning on every worker startup even though the rest of the layout path is functioning correctly.

This change is intentionally narrow. It does not alter the checkpoint aliasing logic, model loading strategy, layout post-processing, or device-selection behavior. The only production change is to use the non-deprecated image processor entry point that transformers now expects. Tests were updated only where the detector startup path mocks the image processor loader.

The need for this cleanup was confirmed by a real self-hosted OCR pipeline run after the head-aliasing fix landed. That run showed successful PP-DocLayoutV3 startup and processing, but still printed the deprecation warning telling callers to use PPDocLayoutV3ImageProcessor instead of the Fast variant. Replacing the import and matching test patches removes that remaining startup warning without widening the scope of the loader fix.

Validation included re-running the focused detector test slice covering detector device selection and PP-DocLayout decoder-head aliasing, which passed after the rename. A subsequent real OCR pipeline run on local PDFs also started and processed documents without the previous deprecation warning, confirming that the cleanup behaves correctly in the actual self-hosted path.

@VooDisss
Contributor Author

VooDisss commented Apr 1, 2026

@JaredforReal note that my PRs are stacked and not based directly on main:

For #180 specifically, I spent some time ruling out user/config error vs an actual GLM-OCR integration issue.

What I found is that under the current transformers 5.x PP-DocLayoutV3 load path, the layout detector comes up with missing decoder detection-head weights and then fails during startup/device move. In practice this showed up when running the self-hosted OCR pipeline with multi-worker layout initialization.

Minimal symptom:

PPDocLayoutV3ForObjectDetection LOAD REPORT
model.decoder.bbox_embed.layers.{0,1,2}.bias   | MISSING
model.decoder.bbox_embed.layers.{0,1,2}.weight | MISSING
model.decoder.class_embed.weight               | MISSING
model.decoder.class_embed.bias                 | MISSING

NotImplementedError: Cannot copy out of meta tensor; no data!
Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to()
when moving module from meta to a different device.

After digging into the checkpoint and local transformers implementation, the issue appears to be a checkpoint/load-path mismatch rather than simple user error:

  • the published PaddlePaddle/PP-DocLayoutV3_safetensors checkpoint stores the tied prediction head weights under enc_* names
  • PPDocLayoutV3ForObjectDetection expects the corresponding decoder.* names at load time
  • without aliasing those keys before model construction, the decoder detection heads are treated as missing and newly initialized

That is what #180 fixes:

  • alias enc_score_head.* -> decoder.class_embed.*
  • alias enc_bbox_head.layers.* -> decoder.bbox_embed.layers.*
  • construct the model from the prepared state dict
  • also switch from the deprecated PPDocLayoutV3ImageProcessorFast entry point to PPDocLayoutV3ImageProcessor

I opened the upstream issue with the evidence here:

So from my side this does not look like just a local environment mistake; it looks like a real PP-DocLayoutV3 HF checkpoint/load-path compatibility bug that GLM-OCR needs to bridge explicitly.

@JaredforReal
Collaborator

I cannot reproduce the load-weight mismatch error, and the PPDocLayoutV3ImageProcessorFast to PPDocLayoutV3ImageProcessor issue has been solved; please double-check whether we still need this PR.
Thanks for your contributions! @VooDisss

@VooDisss
Contributor Author

VooDisss commented Apr 8, 2026

@JaredforReal I spent a few hours reproducing it. The attached markdown file below contains all the instructions to reproduce this: just give your agent the markdown file and ask it to reproduce the issue in a clean environment (a new git worktree and a new virtual environment within it).

It's a 100% real issue and is not fixed by the transformers 5.5.0 package. It must get fixed if the GLM-OCR SDK wants to progress further, and I know of no other way to fix it than remapping. In simpler single-file or direct-SDK runs the problem was easy to miss; under heavier layout-sensitive workflows, unstable or degraded behavior became much easier to observe.

pr180-repro-notes.md

Please give it some attention.

@VooDisss
Contributor Author

VooDisss commented Apr 8, 2026

@JaredforReal I have some more context now, as I have also tested on transformers 5.3.0:

In #179, @zRzRzRzRzRzRzR commented to use transformers 5.3.0. I have tried to reproduce this issue with 5.3.0, 5.4.0, and 5.5.0. Let's aggregate the information and discussion in this PR thread rather than in the #179 issue discussion, okay?

We have somewhat of a paradox: using transformers 5.3.0 is not really a solution to the problem, as it brings another problem that needs hot-fixing...


I validated this in a clean clone of zai-org/GLM-OCR with a fresh Python 3.12 venv.

1. Current latest GLM-OCR resolves to transformers==5.5.0

With a fresh editable install of the current repo:

python -m pip install -e ".[layout,server]"

the environment resolved to transformers==5.5.0 for me, since the repo currently requires transformers>=5.3.0 rather than pinning to a narrower version.

So the practical default for a fresh user is now effectively 5.5.0, not 5.3.0.

2. The structural checkpoint/model mismatch is directly visible on 5.3.0, 5.4.0, and 5.5.0

I checked the checkpoint keys and the model expected keys in the clean clone.

Checkpoint contains:

model.denoising_class_embed.weight
model.enc_score_head.weight
model.enc_score_head.bias
model.enc_bbox_head.layers.0.weight
model.enc_bbox_head.layers.0.bias
model.enc_bbox_head.layers.1.weight
model.enc_bbox_head.layers.1.bias
model.enc_bbox_head.layers.2.weight
model.enc_bbox_head.layers.2.bias

Model class expects:

model.denoising_class_embed.weight
model.decoder.class_embed.weight
model.decoder.class_embed.bias
model.decoder.bbox_embed.layers.0.weight
model.decoder.bbox_embed.layers.0.bias
model.decoder.bbox_embed.layers.1.weight
model.decoder.bbox_embed.layers.1.bias
model.decoder.bbox_embed.layers.2.weight
model.decoder.bbox_embed.layers.2.bias

I directly rechecked this on:

  • transformers==5.3.0
  • transformers==5.4.0
  • transformers==5.5.0

and the same structural mismatch remained.
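The comparison can be reproduced with plain set arithmetic over the two key lists above. This sketch just encodes the lists copied from the inspection; no torch or transformers install is needed.

```python
# Key lists copied from the checkpoint/model inspection above.
head_layers = range(3)

checkpoint_keys = {
    "model.denoising_class_embed.weight",
    "model.enc_score_head.weight",
    "model.enc_score_head.bias",
}
checkpoint_keys |= {f"model.enc_bbox_head.layers.{i}.{p}"
                    for i in head_layers for p in ("weight", "bias")}

expected_keys = {
    "model.denoising_class_embed.weight",
    "model.decoder.class_embed.weight",
    "model.decoder.class_embed.bias",
}
expected_keys |= {f"model.decoder.bbox_embed.layers.{i}.{p}"
                  for i in head_layers for p in ("weight", "bias")}

missing = sorted(expected_keys - checkpoint_keys)     # reported MISSING, newly initialized
unexpected = sorted(checkpoint_keys - expected_keys)  # enc_* keys the model never consumes
```

Only model.denoising_class_embed.weight overlaps; every decoder head key shows up as missing, matching the load report in the earlier comment.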

3. transformers==5.3.0 is not a drop-in workaround for current latest GLM-OCR code

With the current clean-clone code as-is, pinning to:

transformers==5.3.0

fails earlier because the current code imports:

PPDocLayoutV3ImageProcessor

but under 5.3.0 that symbol is not available, so the SDK errors with:

ImportError: cannot import name 'PPDocLayoutV3ImageProcessor' from 'transformers'

4. After a tiny compatibility hot-fix, 5.3.0 can run

I then hot-fixed the clean clone locally just for testing by making the import version-compatible:

  • try PPDocLayoutV3ImageProcessor
  • fallback to PPDocLayoutV3ImageProcessorFast when unavailable

After that small compatibility patch, the direct SDK self-hosted parse did get through startup and into real parsing on 5.3.0.

So I agree 5.3.0 can be a practical workaround path if that import compatibility is restored first.

5. But that still does not remove the PR #180 issue

Even after the 5.3.0 compatibility hot-fix, the same checkpoint/model head-name mismatch remained.

So from my side, the clean-clone evidence suggests there are actually two separate issues:

  1. Import compatibility issue

    • current latest code uses PPDocLayoutV3ImageProcessor
    • 5.3.0 still needs PPDocLayoutV3ImageProcessorFast (or a fallback)
  2. Head-name/load-path mismatch

    • checkpoint stores the trained prediction head under enc_*
    • model expects the corresponding decoder.* keys
    • this mismatch still exists on 5.3.0, 5.4.0, and 5.5.0

6. Practical options

From the clean-clone validation, “use 5.3.0” by itself is not enough for the current latest GLM-OCR code.

If helpful, I can also paste the exact clean-clone reproduction commands and outputs directly into this PR thread.

@VooDisss
Contributor Author

I can see that this PR/issue is getting no further attention, so I will just move on with my local fork of GLM-OCR instead of putting effort into PRs...

Successfully merging this pull request may close these issues.

PP-DocLayoutV3 checkpoint/load-path mismatch under transformers 5.4.0

2 participants