
@lfoppiano lfoppiano commented Jan 8, 2026

This PR introduces pure Java ONNX inference for GROBID's deep learning models, opening the possibility of eliminating the runtime dependency on Python/JEP/TensorFlow (the existing JEP approach for calling DeLFT is not touched). The implementation enables a lighter Docker image that ships deep learning models, keeping their accuracy benefits while reducing complexity and resource requirements.


Architecture Overview

flowchart TB
    subgraph Input
        PDF[PDF Document]
    end
    
    subgraph "GROBID Processing"
        PDF --> Parser[pdfalto]
        Parser --> Features[Feature Extraction]
        Features --> Engine{Engine Selection}
        
        Engine -->|engine: wapiti| Wapiti[Wapiti CRF]
        Engine -->|engine: onnx| ONNX[ONNX Runtime]
        Engine -->|engine: delft| DeLFT[DeLFT/Python]
    end
    
    subgraph "ONNX Inference Stack"
        ONNX --> Runner[OnnxModelRunner]
        Runner --> RT[ONNX Runtime 1.23.2]
        
        ONNX --> Embeddings[WordEmbeddings]
        Embeddings --> LMDB[(LMDB Database)]
        
        ONNX --> CRF[CRFDecoder]
        CRF --> Viterbi[Viterbi Algorithm]
    end
    
    subgraph Output
        Wapiti --> TEI[TEI XML]
        ONNX --> TEI
        DeLFT --> TEI
    end

Implementation Components

Core Java Classes

| File | Lines | Purpose |
|---|---|---|
| DeLFTOnnxModel.java | 588 | Main entry point for sequence labeling inference |
| OnnxClassificationModel.java | 286 | Text classification (copyright, license) |
| OnnxTagger.java | 69 | GenericTagger implementation |
| OnnxModelRunner.java | 214 | ONNX Runtime wrapper |
| CRFDecoder.java | 187 | Viterbi decoding for the CRF layer |
| WordEmbeddings.java | 263 | LMDB-based embedding lookup |
| Preprocessor.java | 267 | Tokenization & feature handling |
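The CRFDecoder applies Viterbi decoding over the emission scores produced by the ONNX encoder. A minimal, self-contained sketch of the algorithm follows; the class and method names here are illustrative, not the actual CRFDecoder API:

```java
// Illustrative Viterbi decoding for a linear-chain CRF layer.
// emissions[t][k]: score of tag k at position t
// transitions[i][j]: score of moving from tag i to tag j
public class ViterbiSketch {
    static int[] decode(float[][] emissions, float[][] transitions) {
        int T = emissions.length, K = emissions[0].length;
        float[][] score = new float[T][K];
        int[][] backPtr = new int[T][K];
        score[0] = emissions[0].clone();
        for (int t = 1; t < T; t++) {
            for (int j = 0; j < K; j++) {
                float best = Float.NEGATIVE_INFINITY;
                int arg = 0;
                for (int i = 0; i < K; i++) {
                    float s = score[t - 1][i] + transitions[i][j];
                    if (s > best) { best = s; arg = i; }
                }
                score[t][j] = best + emissions[t][j];
                backPtr[t][j] = arg;
            }
        }
        // pick the best final tag, then follow back-pointers to recover the path
        int[] path = new int[T];
        int bestLast = 0;
        for (int k = 1; k < K; k++) if (score[T - 1][k] > score[T - 1][bestLast]) bestLast = k;
        path[T - 1] = bestLast;
        for (int t = T - 1; t > 0; t--) path[t - 1] = backPtr[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        float[][] emissions = {{2f, 0f}, {0f, 2f}};
        float[][] zeroTransitions = {{0f, 0f}, {0f, 0f}};
        int[] path = decode(emissions, zeroTransitions);
        System.out.println(java.util.Arrays.toString(path)); // [0, 1]
    }
}
```

With zero transition scores the decoder simply follows the per-position argmax; the transition matrix (loaded from crf_params.json in the real bundle) is what lets the CRF penalize invalid tag sequences.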

Factory Integration

| File | Change |
|---|---|
| TaggerFactory.java | Added ONNX case → OnnxTagger |
| ClassifierFactory.java | Added ONNX case → OnnxClassificationModel |
| GrobidCRFEngine.java | New ONNX("onnx") enum value |

Supported Models

Sequence Labeling (BidLSTM_CRF_FEATURES)

| Model | ONNX Directory | Status |
|---|---|---|
| header | header-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| citation | citation-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| affiliation-address | affiliation-address-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| date | date-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| reference-segmenter | reference-segmenter-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| funding-acknowledgement | funding-acknowledgement-BidLSTM_CRF_FEATURES.onnx/ | More testing needed |
| header-coi-ac | header-coi-ac-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |

Text Classification (GRU)

| Model | ONNX Directory | Status |
|---|---|---|
| copyright | copyright-gru.onnx/ | ✅ Ready |
| license | license-gru.onnx/ | ✅ Ready |

Model Bundle Structure

Each ONNX model directory contains:

{model}-{architecture}.onnx/
├── encoder.onnx      # or classifier.onnx for classification
├── config.json       # Model configuration (maxSequenceLength, embeddingSize, etc.)
├── vocab.json        # Character vocabulary, tag index, feature mappings
└── crf_params.json   # CRF transition matrices (sequence labeling only)
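To give a feel for how config.json drives inference setup, here is a hypothetical fragment and a stdlib-only way to read one field. The regex-based parsing is purely illustrative (the actual implementation presumably uses a proper JSON library), and only the field names listed above are taken from the bundle description:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract an integer field from a config.json-style string.
public class ConfigSketch {
    static int readIntField(String json, String field) {
        Matcher m = Pattern.compile("\"" + field + "\"\\s*:\\s*(\\d+)").matcher(json);
        if (!m.find()) throw new IllegalArgumentException("missing field: " + field);
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        // Hypothetical config.json content for a BidLSTM_CRF_FEATURES bundle
        String json = "{ \"maxSequenceLength\": 3000, \"embeddingSize\": 300 }";
        System.out.println(readIntField(json, "maxSequenceLength")); // 3000
    }
}
```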

Configuration

Engine Selection (grobid.yaml)

grobid:
  models:
    - name: "header"
      engine: "onnx"                          # Use ONNX instead of delft/wapiti
      onnx:
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "segmentation"
      engine: "wapiti"                        # Keep Wapiti for long sequences

ONNX-Only Configuration

A dedicated grobid-onnx.yaml provides a complete configuration for an ONNX+Wapiti deployment, with no need for Python or TensorFlow.


Docker Image

Dockerfile.onnx

Dockerfile.onnx creates a lightweight image:

# Key differences from standard Dockerfile:
- No Python/TensorFlow/DeLFT installation
- Uses grobid-onnx.yaml configuration
- Preloads embeddings with standalone script (no DeLFT dependency)
- Runtime: eclipse-temurin:21-jre + libxml2/libfontconfig only

Build & Run:

docker build -t grobid/grobid:0.8.0-onnx --file Dockerfile.onnx .
docker run -t --rm --init -p 8070:8070 -p 8071:8071 grobid/grobid:0.8.0-onnx

CI/CD Workflow

ci-build-manual-onnx.yml:

  • Trigger: Push to onnx-docker branch or manual dispatch
  • Output: lfoppiano/grobid:latest-onnx on Docker Hub
  • Tags: latest-onnx, custom tag, or commit SHA

Embedding Preloading

Standalone Script

preload_embeddings_standalone.py (283 lines):

  • No DeLFT dependency - uses only lmdb and requests
  • Downloads GloVe embeddings from registry URL
  • Stores as raw float32 bytes (little-endian) in LMDB
  • Compatible with Java WordEmbeddings.java

pip install lmdb requests
python3 preload_embeddings_standalone.py --registry ./resources-registry.json

Format Compatibility

Warning

The LMDB database must contain raw float32 arrays, not pickled numpy arrays. The WordEmbeddings.java class validates this at startup.
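The expected value layout can be sketched in plain Java: each LMDB value is the concatenation of the vector's components as 4-byte little-endian floats. The helper names below are illustrative, not the actual WordEmbeddings API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the raw float32 (little-endian) value format expected in LMDB.
public class EmbeddingCodec {
    // Encode an embedding vector the way the preload script stores it.
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) buf.putFloat(v);
        return buf.array();
    }

    // Decode a stored value back into a float[] on the Java side.
    static float[] decode(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[raw.length / 4];
        for (int i = 0; i < out.length; i++) out[i] = buf.getFloat();
        return out;
    }

    public static void main(String[] args) {
        float[] v = {0.1f, -2.5f, 3.0f};
        float[] back = decode(encode(v));
        for (int i = 0; i < v.length; i++) assert v[i] == back[i];
        System.out.println(back.length); // 3
    }
}
```

A pickled numpy array, by contrast, starts with pickle opcodes rather than IEEE-754 float bytes, which is why a startup validation can cheaply detect the wrong format.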


Technical Details

Inference Pipeline

sequenceDiagram
    participant Client
    participant OnnxTagger
    participant DeLFTOnnxModel
    participant Preprocessor
    participant WordEmbeddings
    participant OnnxModelRunner
    participant CRFDecoder

    Client->>OnnxTagger: label(data)
    OnnxTagger->>DeLFTOnnxModel: labelGrobidInput(data)
    
    DeLFTOnnxModel->>Preprocessor: tokensToCharIndices()
    DeLFTOnnxModel->>Preprocessor: tokensToFeatureIndices()
    DeLFTOnnxModel->>WordEmbeddings: getEmbeddings(words)
    WordEmbeddings-->>DeLFTOnnxModel: float[][] embeddings
    
    DeLFTOnnxModel->>OnnxModelRunner: runInference(embs, chars, features)
    OnnxModelRunner-->>DeLFTOnnxModel: float[][][] emissions
    
    DeLFTOnnxModel->>CRFDecoder: decode(emissions, mask)
    CRFDecoder-->>DeLFTOnnxModel: int[] tagIndices
    
    DeLFTOnnxModel-->>OnnxTagger: labeled output
    OnnxTagger-->>Client: result

Long Sequence Handling

DeLFTOnnxModel.labelSequenceWithChunking() automatically chunks sequences exceeding maxSequenceLength (typically 3000 tokens), processes each chunk independently, and concatenates results.
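The chunk-then-concatenate strategy can be sketched as follows; the class and method names are assumptions for illustration, not the actual DeLFTOnnxModel internals:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a long token sequence into fixed-size chunks, so each chunk
// can be run through the model independently and the results concatenated.
public class ChunkingSketch {
    static List<String[]> chunk(String[] tokens, int maxLen) {
        List<String[]> chunks = new ArrayList<>();
        for (int start = 0; start < tokens.length; start += maxLen) {
            int end = Math.min(start + maxLen, tokens.length);
            String[] c = new String[end - start];
            System.arraycopy(tokens, start, c, 0, c.length);
            chunks.add(c);
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = new String[7000];
        java.util.Arrays.fill(tokens, "tok");
        List<String[]> chunks = chunk(tokens, 3000);
        System.out.println(chunks.size()); // 3: chunks of 3000 + 3000 + 1000 tokens
    }
}
```

Because a BidLSTM conditions each prediction on its chunk only, tokens near a chunk boundary lose some context; chunking at a large maxSequenceLength keeps that effect rare.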

Dependencies

// build.gradle
implementation 'com.microsoft.onnxruntime:onnxruntime:1.23.2'  // CPU
implementation 'org.lmdbjava:lmdbjava:0.9.0'                   // LMDB

Note

GPU support available via onnxruntime_gpu:1.23.2 (Linux with NVIDIA CUDA only).


Testing

Integration Test

HeaderOnnxIntegrationTest.java (213 lines):

| Test | Description |
|---|---|
| testModelCanBeLoaded | Verifies ONNX model loads successfully |
| testMaxSequenceLength | Confirms config.json parsing |
| testAnnotateSimpleHeader | End-to-end inference with features |
| testLabelGrobidInput | GROBID format input/output |
| testLabelMultipleSequences | Multi-sequence handling |

Prerequisites:

  1. ONNX model at grobid-home/models/header-BidLSTM_CRF_FEATURES.onnx/
  2. Embeddings at {delft}/data/db/glove-840B

Comparison: ONNX vs DeLFT

| Aspect | ONNX | DeLFT (Python/JEP) |
|---|---|---|
| Python required | ❌ No | ✅ Yes |
| Docker image size | ~1.5 GB | ~5.5 GB |
| Startup time | Fast | Slow (Python init) |
| Memory footprint | Lower | Higher |
| GPU support | Optional | Built-in |
| Model accuracy | Identical* | Identical |
| Transformer models | ❌ Not yet | ✅ Yes |

* ONNX models are exported from trained DeLFT models, ensuring identical predictions.


Limitations

Caution

The following are known limitations of the current ONNX implementation:

  1. Not yet tested with transformer models: BERT/SciBERT sequence labeling remains to be validated
  2. CPU only by default: GPU requires Linux + NVIDIA CUDA + onnxruntime_gpu
  3. No training: ONNX is inference-only; training still requires DeLFT/Python
  4. Embedding format: requires raw float32 LMDB (not pickled numpy arrays)
  5. ONNX models must be exported from DeLFT in the exact format expected on the GROBID side

Future Work

  • Transformer model export (BERT_CRF, SciBERT_CRF)
  • Java BERT tokenizer integration
  • GPU support documentation
  • Performance benchmarks vs DeLFT

Files Changed Summary

New Files

| Path | Lines | Description |
|---|---|---|
| grobid-core/.../delft/DeLFTOnnxModel.java | 588 | Sequence labeling inference |
| grobid-core/.../delft/OnnxClassificationModel.java | 286 | Classification inference |
| grobid-core/.../tagging/OnnxTagger.java | 69 | GenericTagger implementation |
| grobid-core/.../delft/OnnxModelRunner.java | 214 | ONNX Runtime wrapper |
| grobid-core/.../delft/CRFDecoder.java | 187 | Viterbi CRF decoder |
| grobid-core/.../delft/WordEmbeddings.java | 263 | LMDB embedding lookup |
| grobid-core/.../delft/Preprocessor.java | 267 | Feature preprocessing |
| grobid-core/.../tagging/ClassifierFactory.java | 93 | Classification factory |
| grobid-core/.../tagging/GenericClassifier.java | ~20 | Classifier interface |
| grobid-home/config/grobid-onnx.yaml | 254 | ONNX-only config |
| grobid-home/scripts/preload_embeddings_standalone.py | 283 | Standalone embedding script |
| Dockerfile.onnx | 121 | Lightweight Docker image |
| .github/workflows/ci-build-manual-onnx.yml | 80 | CI workflow |
| grobid-core/.../HeaderOnnxIntegrationTest.java | 213 | Integration tests |

Modified Files

| Path | Change |
|---|---|
| grobid-core/.../GrobidCRFEngine.java | Added ONNX enum |
| grobid-core/.../TaggerFactory.java | Added ONNX case |
| build.gradle | Added onnxruntime, lmdbjava dependencies |

Model Directories Added

11 ONNX model bundles under grobid-home/models/:

  • header-BidLSTM_CRF_FEATURES.onnx/
  • citation-BidLSTM_CRF_FEATURES.onnx/
  • affiliation-address-BidLSTM_CRF_FEATURES.onnx/
  • date-BidLSTM_CRF_FEATURES.onnx/
  • reference-segmenter-BidLSTM_CRF_FEATURES.onnx/
  • funding-acknowledgement-BidLSTM_CRF_FEATURES.onnx/
  • header-BidLSTM_ChainCRF_FEATURES.onnx/
  • header-coi-ac-BidLSTM_CRF_FEATURES.onnx/
  • header-coi-ac-BidLSTM_ChainCRF_FEATURES.onnx/
  • copyright-gru.onnx/
  • license-gru.onnx/

Usage

Enable ONNX for a Model

# grobid.yaml
grobid:
  models:
    - name: "header"
      engine: "onnx"
      onnx:
        architecture: "BidLSTM_CRF_FEATURES"

Run with Docker (ONNX-only)

docker pull lfoppiano/grobid:latest-onnx
docker run -t --rm --init -p 8070:8070 -p 8071:8071 lfoppiano/grobid:latest-onnx

Local Development

  1. Ensure embeddings are preloaded:

    python3 grobid-home/scripts/preload_embeddings.py --embedding glove-840B
  2. Update grobid.yaml to use engine: "onnx"

  3. Run GROBID:

    ./gradlew run

Comment on lines 37 to 79
needs: [ build ]
runs-on: ubuntu-latest

steps:
  - name: Create more disk space
    run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
  - uses: actions/checkout@v4
  - name: Build and push
    id: docker_build
    uses: mr-smithers-excellent/docker-build-push@v5
    with:
      username: ${{ secrets.DOCKERHUB_USERNAME_LFOPPIANO }}
      password: ${{ secrets.DOCKERHUB_TOKEN_LFOPPIANO }}
      image: lfoppiano/grobid
      registry: docker.io
      pushImage: true
      tags: latest-onnx, ${{ github.event.inputs.custom_tag }}
      dockerfile: Dockerfile.onnx
  - name: Image digest
    run: echo ${{ steps.docker_build.outputs.digest }}
  - name: Docker Image Summary
    run: |
      echo "## 🐳 Docker Image Uploaded Successfully" >> $GITHUB_STEP_SUMMARY
      echo "" >> $GITHUB_STEP_SUMMARY
      echo "**Image Details:**" >> $GITHUB_STEP_SUMMARY
      echo "- **Registry:** docker.io" >> $GITHUB_STEP_SUMMARY
      echo "- **Image:** lfoppiano/grobid" >> $GITHUB_STEP_SUMMARY
      echo "- **Type:** ONNX/Wapiti only (lightweight, no Python/DeLFT)" >> $GITHUB_STEP_SUMMARY
      echo "- **Tags:**" >> $GITHUB_STEP_SUMMARY
      echo " - \`latest-onnx\`" >> $GITHUB_STEP_SUMMARY
      echo " - \`${{ github.event.inputs.custom_tag }}\`" >> $GITHUB_STEP_SUMMARY
      echo "- **Digest:** \`${{ steps.docker_build.outputs.digest }}\`" >> $GITHUB_STEP_SUMMARY
      echo "" >> $GITHUB_STEP_SUMMARY
      echo "**Features:**" >> $GITHUB_STEP_SUMMARY
      echo "- ONNX Runtime for deep learning models (CPU only)" >> $GITHUB_STEP_SUMMARY
      echo "- Wapiti CRF for traditional models" >> $GITHUB_STEP_SUMMARY
      echo "- No Python, TensorFlow, or DeLFT dependencies" >> $GITHUB_STEP_SUMMARY
      echo "" >> $GITHUB_STEP_SUMMARY
      echo "**Usage:**" >> $GITHUB_STEP_SUMMARY
      echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
      echo "docker pull lfoppiano/grobid:${{ github.event.inputs.custom_tag }}" >> $GITHUB_STEP_SUMMARY
      echo "docker run -t --rm --init -p 8070:8070 -p 8071:8071 lfoppiano/grobid:${{ github.event.inputs.custom_tag }}" >> $GITHUB_STEP_SUMMARY
      echo "\`\`\`" >> $GITHUB_STEP_SUMMARY

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI 11 days ago

In general, this problem is fixed by adding an explicit permissions section to the workflow (either at the root or per job) that grants only the minimal scopes required. Since this workflow only checks out code, builds it, pushes Docker images to Docker Hub (using Docker Hub credentials), and writes to the step summary, it does not need write access to the repository; read-only access to contents is sufficient.

The best minimal fix, without changing any functionality, is to add a top‑level permissions block applying to all jobs. This should be placed between the on: block (ending at line 16) and the jobs: block (starting at line 18). The block will set contents: read, which allows actions/checkout to function but prevents unnecessary write access to repo contents. No additional imports or methods are needed, as this is purely a YAML configuration change.

Suggested changeset 1
.github/workflows/ci-build-manual-onnx.yml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/ci-build-manual-onnx.yml b/.github/workflows/ci-build-manual-onnx.yml
--- a/.github/workflows/ci-build-manual-onnx.yml
+++ b/.github/workflows/ci-build-manual-onnx.yml
@@ -15,6 +15,9 @@
         required: true
         default: "latest-onnx"
 
+permissions:
+  contents: read
+
 jobs:
   build:
     runs-on: ubuntu-latest
EOF
Comment on lines +20 to +36
runs-on: ubuntu-latest

steps:
  - uses: actions/checkout@v5
    with:
      fetch-tags: true
      fetch-depth: 0
  - name: Set up JDK 21
    uses: actions/setup-java@v5
    with:
      java-version: '21'
      distribution: 'temurin'
      cache: 'gradle'
  - name: Build with Gradle
    run: ./gradlew build -x test

docker-build-onnx:

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI 11 days ago

To fix the issue, add an explicit permissions block so that the GITHUB_TOKEN has only the minimal scopes required. In this workflow, jobs only need to read the repository (for actions/checkout) and write to the GitHub Actions step summary (which does not require contents: write). They do not modify code, issues, or pull requests, nor do they appear to require id-token or other elevated scopes. Therefore, we can safely set contents: read at the workflow level, which applies to both build and docker-build-onnx jobs.

Concretely, edit .github/workflows/ci-build-manual-onnx.yml and insert a permissions section near the top, after the name: and before on: (or immediately after on:—GitHub accepts either; placing it after name is conventional). The block should be:

permissions:
  contents: read

No additional imports, methods, or definitions are required because this is just a YAML configuration change. Existing functionality—checkout, Gradle build, Docker build and push using Docker Hub credentials, and writing to $GITHUB_STEP_SUMMARY—will continue to work under these permissions.

Suggested changeset 1
.github/workflows/ci-build-manual-onnx.yml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/ci-build-manual-onnx.yml b/.github/workflows/ci-build-manual-onnx.yml
--- a/.github/workflows/ci-build-manual-onnx.yml
+++ b/.github/workflows/ci-build-manual-onnx.yml
@@ -3,6 +3,9 @@
 # This workflow builds the lightweight ONNX/Wapiti-only Docker image
 # (no Python/DeLFT/TensorFlow dependencies)
 
+permissions:
+  contents: read
+
 on:
   push:
     branches:
EOF
@@ -3,6 +3,9 @@
# This workflow builds the lightweight ONNX/Wapiti-only Docker image
# (no Python/DeLFT/TensorFlow dependencies)

permissions:
contents: read

on:
push:
branches:
Copilot is powered by AI and may make mistakes. Always verify output.