
@lfoppiano lfoppiano commented Jan 8, 2026

This PR introduces pure Java ONNX inference for GROBID's deep learning models, opening the possibility of eliminating the runtime dependency on Python/JEP/TensorFlow (the existing JEP approach for calling DeLFT is not touched). The implementation enables a lighter Docker image that ships deep learning models, keeping their accuracy benefits while reducing complexity and resource requirements.


Architecture Overview

flowchart TB
    subgraph Input
        PDF[PDF Document]
    end
    
    subgraph "GROBID Processing"
        PDF --> Parser[pdfalto]
        Parser --> Features[Feature Extraction]
        Features --> Engine{Engine Selection}
        
        Engine -->|engine: wapiti| Wapiti[Wapiti CRF]
        Engine -->|engine: onnx| ONNX[ONNX Runtime]
        Engine -->|engine: delft| DeLFT[DeLFT/Python]
    end
    
    subgraph "ONNX Inference Stack"
        ONNX --> Runner[OnnxModelRunner]
        Runner --> RT[ONNX Runtime 1.23.2]
        
        ONNX --> Embeddings[WordEmbeddings]
        Embeddings --> LMDB[(LMDB Database)]
        
        ONNX --> CRF[CRFDecoder]
        CRF --> Viterbi[Viterbi Algorithm]
    end
    
    subgraph Output
        Wapiti --> TEI[TEI XML]
        ONNX --> TEI
        DeLFT --> TEI
    end

Implementation Components

Core Java Classes

| File | Lines | Purpose |
|---|---|---|
| DeLFTOnnxModel.java | 588 | Main entry point for sequence labeling inference |
| OnnxClassificationModel.java | 286 | Text classification (copyright, license) |
| OnnxTagger.java | 69 | GenericTagger implementation |
| OnnxModelRunner.java | 214 | ONNX Runtime wrapper |
| CRFDecoder.java | 187 | Viterbi decoding for the CRF layer |
| WordEmbeddings.java | 263 | LMDB-based embedding lookup |
| Preprocessor.java | 267 | Tokenization & feature handling |
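The CRFDecoder applies Viterbi decoding over the emission scores produced by the ONNX encoder. A minimal, self-contained sketch of the algorithm follows; the class and method names here are illustrative, not the actual CRFDecoder API:

```java
// Illustrative Viterbi decoding for a linear-chain CRF layer.
// emissions[t][k]: score of tag k at position t
// transitions[i][j]: score of moving from tag i to tag j
public class ViterbiSketch {
    static int[] decode(float[][] emissions, float[][] transitions) {
        int T = emissions.length, K = emissions[0].length;
        float[][] score = new float[T][K];
        int[][] backPtr = new int[T][K];
        score[0] = emissions[0].clone();
        for (int t = 1; t < T; t++) {
            for (int j = 0; j < K; j++) {
                float best = Float.NEGATIVE_INFINITY;
                int arg = 0;
                for (int i = 0; i < K; i++) {
                    float s = score[t - 1][i] + transitions[i][j];
                    if (s > best) { best = s; arg = i; }
                }
                score[t][j] = best + emissions[t][j];
                backPtr[t][j] = arg;
            }
        }
        // pick the best final tag, then follow back-pointers to recover the path
        int[] path = new int[T];
        int bestLast = 0;
        for (int k = 1; k < K; k++) if (score[T - 1][k] > score[T - 1][bestLast]) bestLast = k;
        path[T - 1] = bestLast;
        for (int t = T - 1; t > 0; t--) path[t - 1] = backPtr[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        float[][] emissions = {{2f, 0f}, {0f, 2f}};
        float[][] zeroTransitions = {{0f, 0f}, {0f, 0f}};
        int[] path = decode(emissions, zeroTransitions);
        System.out.println(java.util.Arrays.toString(path)); // [0, 1]
    }
}
```

With zero transition scores the decoder simply follows the per-position argmax; the transition matrix (loaded from crf_params.json in the real bundle) is what lets the CRF penalize invalid tag sequences.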

Factory Integration

| File | Change |
|---|---|
| TaggerFactory.java | Added ONNX case → OnnxTagger |
| ClassifierFactory.java | Added ONNX case → OnnxClassificationModel |
| GrobidCRFEngine.java | New ONNX("onnx") enum value |

Supported Models

Sequence Labeling (BidLSTM_CRF_FEATURES)

| Model | ONNX Directory | Status |
|---|---|---|
| header | header-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| citation | citation-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| affiliation-address | affiliation-address-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| date | date-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| reference-segmenter | reference-segmenter-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |
| funding-acknowledgement | funding-acknowledgement-BidLSTM_CRF_FEATURES.onnx/ | More testing needed |
| header-coi-ac | header-coi-ac-BidLSTM_CRF_FEATURES.onnx/ | ✅ Ready |

Text Classification (GRU)

| Model | ONNX Directory | Status |
|---|---|---|
| copyright | copyright-gru.onnx/ | ✅ Ready |
| license | license-gru.onnx/ | ✅ Ready |

Model Bundle Structure

Each ONNX model directory contains:

{model}-{architecture}.onnx/
├── encoder.onnx      # or classifier.onnx for classification
├── config.json       # Model configuration (maxSequenceLength, embeddingSize, etc.)
├── vocab.json        # Character vocabulary, tag index, feature mappings
└── crf_params.json   # CRF transition matrices (sequence labeling only)
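To give a feel for how config.json drives inference setup, here is a hypothetical fragment and a stdlib-only way to read one field. The regex-based parsing is purely illustrative (the actual implementation presumably uses a proper JSON library), and only the field names listed above are taken from the bundle description:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract an integer field from a config.json-style string.
public class ConfigSketch {
    static int readIntField(String json, String field) {
        Matcher m = Pattern.compile("\"" + field + "\"\\s*:\\s*(\\d+)").matcher(json);
        if (!m.find()) throw new IllegalArgumentException("missing field: " + field);
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        // Hypothetical config.json content for a BidLSTM_CRF_FEATURES bundle
        String json = "{ \"maxSequenceLength\": 3000, \"embeddingSize\": 300 }";
        System.out.println(readIntField(json, "maxSequenceLength")); // 3000
    }
}
```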

Configuration

Engine Selection (grobid.yaml)

grobid:
  models:
    - name: "header"
      engine: "onnx"                          # Use ONNX instead of delft/wapiti
      onnx:
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "segmentation"
      engine: "wapiti"                        # Keep Wapiti for long sequences

ONNX-Only Configuration

A dedicated grobid-onnx.yaml provides a complete configuration for an ONNX+Wapiti deployment, with no need for Python or TensorFlow.


Docker Image

Dockerfile.onnx

Dockerfile.onnx creates a lightweight image:

# Key differences from standard Dockerfile:
- No Python/TensorFlow/DeLFT installation
- Uses grobid-onnx.yaml configuration
- Preloads embeddings with standalone script (no DeLFT dependency)
- Runtime: eclipse-temurin:21-jre + libxml2/libfontconfig only

Build & Run:

docker build -t grobid/grobid:0.8.0-onnx --file Dockerfile.onnx .
docker run -t --rm --init -p 8070:8070 -p 8071:8071 grobid/grobid:0.8.0-onnx

CI/CD Workflow

ci-build-manual-onnx.yml:

  • Trigger: Push to onnx-docker branch or manual dispatch
  • Output: lfoppiano/grobid:latest-onnx on Docker Hub
  • Tags: latest-onnx, custom tag, or commit SHA

Embedding Preloading

Standalone Script

preload_embeddings_standalone.py (283 lines):

  • No DeLFT dependency - uses only lmdb and requests
  • Downloads GloVe embeddings from registry URL
  • Stores as raw float32 bytes (little-endian) in LMDB
  • Compatible with Java WordEmbeddings.java

pip install lmdb requests
python3 preload_embeddings_standalone.py --registry ./resources-registry.json

Format Compatibility

Warning

The LMDB database must contain raw float32 arrays, not pickled numpy arrays. The WordEmbeddings.java class validates this at startup.
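The expected value layout can be sketched in plain Java: each LMDB value is the concatenation of the vector's components as 4-byte little-endian floats. The helper names below are illustrative, not the actual WordEmbeddings API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the raw float32 (little-endian) value format expected in LMDB.
public class EmbeddingCodec {
    // Encode an embedding vector the way the preload script stores it.
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) buf.putFloat(v);
        return buf.array();
    }

    // Decode a stored value back into a float[] on the Java side.
    static float[] decode(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[raw.length / 4];
        for (int i = 0; i < out.length; i++) out[i] = buf.getFloat();
        return out;
    }

    public static void main(String[] args) {
        float[] v = {0.1f, -2.5f, 3.0f};
        float[] back = decode(encode(v));
        for (int i = 0; i < v.length; i++) assert v[i] == back[i];
        System.out.println(back.length); // 3
    }
}
```

A pickled numpy array, by contrast, starts with pickle opcodes rather than IEEE-754 float bytes, which is why a startup validation can cheaply detect the wrong format.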


Technical Details

Inference Pipeline

sequenceDiagram
    participant Client
    participant OnnxTagger
    participant DeLFTOnnxModel
    participant Preprocessor
    participant WordEmbeddings
    participant OnnxModelRunner
    participant CRFDecoder

    Client->>OnnxTagger: label(data)
    OnnxTagger->>DeLFTOnnxModel: labelGrobidInput(data)
    
    DeLFTOnnxModel->>Preprocessor: tokensToCharIndices()
    DeLFTOnnxModel->>Preprocessor: tokensToFeatureIndices()
    DeLFTOnnxModel->>WordEmbeddings: getEmbeddings(words)
    WordEmbeddings-->>DeLFTOnnxModel: float[][] embeddings
    
    DeLFTOnnxModel->>OnnxModelRunner: runInference(embs, chars, features)
    OnnxModelRunner-->>DeLFTOnnxModel: float[][][] emissions
    
    DeLFTOnnxModel->>CRFDecoder: decode(emissions, mask)
    CRFDecoder-->>DeLFTOnnxModel: int[] tagIndices
    
    DeLFTOnnxModel-->>OnnxTagger: labeled output
    OnnxTagger-->>Client: result

Long Sequence Handling

DeLFTOnnxModel.labelSequenceWithChunking() automatically chunks sequences exceeding maxSequenceLength (typically 3000 tokens), processes each chunk independently, and concatenates results.
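The chunk-then-concatenate strategy can be sketched as follows; the class and method names are assumptions for illustration, not the actual DeLFTOnnxModel internals:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a long token sequence into fixed-size chunks, so each chunk
// can be run through the model independently and the results concatenated.
public class ChunkingSketch {
    static List<String[]> chunk(String[] tokens, int maxLen) {
        List<String[]> chunks = new ArrayList<>();
        for (int start = 0; start < tokens.length; start += maxLen) {
            int end = Math.min(start + maxLen, tokens.length);
            String[] c = new String[end - start];
            System.arraycopy(tokens, start, c, 0, c.length);
            chunks.add(c);
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = new String[7000];
        java.util.Arrays.fill(tokens, "tok");
        List<String[]> chunks = chunk(tokens, 3000);
        System.out.println(chunks.size()); // 3: chunks of 3000 + 3000 + 1000 tokens
    }
}
```

Because a BidLSTM conditions each prediction on its chunk only, tokens near a chunk boundary lose some context; chunking at a large maxSequenceLength keeps that effect rare.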

Dependencies

// build.gradle
implementation 'com.microsoft.onnxruntime:onnxruntime:1.23.2'  // CPU
implementation 'org.lmdbjava:lmdbjava:0.9.0'                   // LMDB

Note

GPU support available via onnxruntime_gpu:1.23.2 (Linux with NVIDIA CUDA only).


Testing

Integration Test

HeaderOnnxIntegrationTest.java (213 lines):

| Test | Description |
|---|---|
| testModelCanBeLoaded | Verifies ONNX model loads successfully |
| testMaxSequenceLength | Confirms config.json parsing |
| testAnnotateSimpleHeader | End-to-end inference with features |
| testLabelGrobidInput | GROBID format input/output |
| testLabelMultipleSequences | Multi-sequence handling |

Prerequisites:

  1. ONNX model at grobid-home/models/header-BidLSTM_CRF_FEATURES.onnx/
  2. Embeddings at {delft}/data/db/glove-840B

Comparison: ONNX vs DeLFT

| Aspect | ONNX | DeLFT (Python/JEP) |
|---|---|---|
| Python required | ❌ No | ✅ Yes |
| Docker image size | ~1.5 GB | ~5.5 GB |
| Startup time | Fast | Slow (Python init) |
| Memory footprint | Lower | Higher |
| GPU support | Optional | Built-in |
| Model accuracy | Identical* | Identical |
| Transformer models | ❌ Not yet | ✅ Yes |

* ONNX models are exported from trained DeLFT models, ensuring identical predictions.


Limitations

Caution

The following are known limitations of the current ONNX implementation:

  1. Not yet tested with transformer models: BERT/SciBERT sequence labeling remains to be validated
  2. CPU only by default: GPU requires Linux + NVIDIA CUDA + onnxruntime_gpu
  3. No training: ONNX is inference-only; training still requires DeLFT/Python
  4. Embedding format: requires raw float32 LMDB (not pickled numpy arrays)
  5. ONNX models must be exported from DeLFT in the exact format expected on the GROBID side

Future Work

  • Transformer model export (BERT_CRF, SciBERT_CRF)
  • Java BERT tokenizer integration
  • GPU support documentation
  • Performance benchmarks vs DeLFT

Files Changed Summary

New Files

| Path | Lines | Description |
|---|---|---|
| grobid-core/.../delft/DeLFTOnnxModel.java | 588 | Sequence labeling inference |
| grobid-core/.../delft/OnnxClassificationModel.java | 286 | Classification inference |
| grobid-core/.../tagging/OnnxTagger.java | 69 | GenericTagger implementation |
| grobid-core/.../delft/OnnxModelRunner.java | 214 | ONNX Runtime wrapper |
| grobid-core/.../delft/CRFDecoder.java | 187 | Viterbi CRF decoder |
| grobid-core/.../delft/WordEmbeddings.java | 263 | LMDB embedding lookup |
| grobid-core/.../delft/Preprocessor.java | 267 | Feature preprocessing |
| grobid-core/.../tagging/ClassifierFactory.java | 93 | Classification factory |
| grobid-core/.../tagging/GenericClassifier.java | ~20 | Classifier interface |
| grobid-home/config/grobid-onnx.yaml | 254 | ONNX-only config |
| grobid-home/scripts/preload_embeddings_standalone.py | 283 | Standalone embedding script |
| Dockerfile.onnx | 121 | Lightweight Docker image |
| .github/workflows/ci-build-manual-onnx.yml | 80 | CI workflow |
| grobid-core/.../HeaderOnnxIntegrationTest.java | 213 | Integration tests |

Modified Files

| Path | Change |
|---|---|
| grobid-core/.../GrobidCRFEngine.java | Added ONNX enum |
| grobid-core/.../TaggerFactory.java | Added ONNX case |
| build.gradle | Added onnxruntime, lmdbjava dependencies |

Model Directories Added

11 ONNX model bundles under grobid-home/models/:

  • header-BidLSTM_CRF_FEATURES.onnx/
  • citation-BidLSTM_CRF_FEATURES.onnx/
  • affiliation-address-BidLSTM_CRF_FEATURES.onnx/
  • date-BidLSTM_CRF_FEATURES.onnx/
  • reference-segmenter-BidLSTM_CRF_FEATURES.onnx/
  • funding-acknowledgement-BidLSTM_CRF_FEATURES.onnx/
  • header-BidLSTM_ChainCRF_FEATURES.onnx/
  • header-coi-ac-BidLSTM_CRF_FEATURES.onnx/
  • header-coi-ac-BidLSTM_ChainCRF_FEATURES.onnx/
  • copyright-gru.onnx/
  • license-gru.onnx/

Usage

Enable ONNX for a Model

# grobid.yaml
grobid:
  models:
    - name: "header"
      engine: "onnx"
      onnx:
        architecture: "BidLSTM_CRF_FEATURES"

Run with Docker (ONNX-only)

docker pull lfoppiano/grobid:latest-onnx
docker run -t --rm --init -p 8070:8070 -p 8071:8071 lfoppiano/grobid:latest-onnx

Local Development

  1. Ensure embeddings are preloaded:

    python3 grobid-home/scripts/preload_embeddings.py --embedding glove-840B
  2. Update grobid.yaml to use engine: "onnx"

  3. Run GROBID:

    ./gradlew run

Comment on lines 37 to 79
needs: [ build ]
runs-on: ubuntu-latest

steps:
  - name: Create more disk space
    run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
  - uses: actions/checkout@v4
  - name: Build and push
    id: docker_build
    uses: mr-smithers-excellent/docker-build-push@v5
    with:
      username: ${{ secrets.DOCKERHUB_USERNAME_LFOPPIANO }}
      password: ${{ secrets.DOCKERHUB_TOKEN_LFOPPIANO }}
      image: lfoppiano/grobid
      registry: docker.io
      pushImage: true
      tags: latest-onnx, ${{ github.event.inputs.custom_tag }}
      dockerfile: Dockerfile.onnx
  - name: Image digest
    run: echo ${{ steps.docker_build.outputs.digest }}
  - name: Docker Image Summary
    run: |
      echo "## 🐳 Docker Image Uploaded Successfully" >> $GITHUB_STEP_SUMMARY
      echo "" >> $GITHUB_STEP_SUMMARY
      echo "**Image Details:**" >> $GITHUB_STEP_SUMMARY
      echo "- **Registry:** docker.io" >> $GITHUB_STEP_SUMMARY
      echo "- **Image:** lfoppiano/grobid" >> $GITHUB_STEP_SUMMARY
      echo "- **Type:** ONNX/Wapiti only (lightweight, no Python/DeLFT)" >> $GITHUB_STEP_SUMMARY
      echo "- **Tags:**" >> $GITHUB_STEP_SUMMARY
      echo " - \`latest-onnx\`" >> $GITHUB_STEP_SUMMARY
      echo " - \`${{ github.event.inputs.custom_tag }}\`" >> $GITHUB_STEP_SUMMARY
      echo "- **Digest:** \`${{ steps.docker_build.outputs.digest }}\`" >> $GITHUB_STEP_SUMMARY
      echo "" >> $GITHUB_STEP_SUMMARY
      echo "**Features:**" >> $GITHUB_STEP_SUMMARY
      echo "- ONNX Runtime for deep learning models (CPU only)" >> $GITHUB_STEP_SUMMARY
      echo "- Wapiti CRF for traditional models" >> $GITHUB_STEP_SUMMARY
      echo "- No Python, TensorFlow, or DeLFT dependencies" >> $GITHUB_STEP_SUMMARY
      echo "" >> $GITHUB_STEP_SUMMARY
      echo "**Usage:**" >> $GITHUB_STEP_SUMMARY
      echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
      echo "docker pull lfoppiano/grobid:${{ github.event.inputs.custom_tag }}" >> $GITHUB_STEP_SUMMARY
      echo "docker run -t --rm --init -p 8070:8070 -p 8071:8071 lfoppiano/grobid:${{ github.event.inputs.custom_tag }}" >> $GITHUB_STEP_SUMMARY
      echo "\`\`\`" >> $GITHUB_STEP_SUMMARY

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI 11 days ago

In general, this problem is fixed by adding an explicit permissions section to the workflow (either at the root or per job) that grants only the minimal scopes required. Since this workflow only checks out code, builds it, pushes Docker images to Docker Hub (using Docker Hub credentials), and writes to the step summary, it does not need write access to the repository; read-only access to contents is sufficient.

The best minimal fix, without changing any functionality, is to add a top‑level permissions block applying to all jobs. This should be placed between the on: block (ending at line 16) and the jobs: block (starting at line 18). The block will set contents: read, which allows actions/checkout to function but prevents unnecessary write access to repo contents. No additional imports or methods are needed, as this is purely a YAML configuration change.

Suggested changeset 1
.github/workflows/ci-build-manual-onnx.yml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/ci-build-manual-onnx.yml b/.github/workflows/ci-build-manual-onnx.yml
--- a/.github/workflows/ci-build-manual-onnx.yml
+++ b/.github/workflows/ci-build-manual-onnx.yml
@@ -15,6 +15,9 @@
         required: true
         default: "latest-onnx"
 
+permissions:
+  contents: read
+
 jobs:
   build:
     runs-on: ubuntu-latest
EOF
Comment on lines +20 to +36
runs-on: ubuntu-latest

steps:
  - uses: actions/checkout@v5
    with:
      fetch-tags: true
      fetch-depth: 0
  - name: Set up JDK 21
    uses: actions/setup-java@v5
    with:
      java-version: '21'
      distribution: 'temurin'
      cache: 'gradle'
  - name: Build with Gradle
    run: ./gradlew build -x test

docker-build-onnx:

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI 11 days ago

To fix the issue, add an explicit permissions block so that the GITHUB_TOKEN has only the minimal scopes required. In this workflow, jobs only need to read the repository (for actions/checkout) and write to the GitHub Actions step summary (which does not require contents: write). They do not modify code, issues, or pull requests, nor do they appear to require id-token or other elevated scopes. Therefore, we can safely set contents: read at the workflow level, which applies to both build and docker-build-onnx jobs.

Concretely, edit .github/workflows/ci-build-manual-onnx.yml and insert a permissions section near the top, after the name: and before on: (or immediately after on:—GitHub accepts either; placing it after name is conventional). The block should be:

permissions:
  contents: read

No additional imports, methods, or definitions are required because this is just a YAML configuration change. Existing functionality—checkout, Gradle build, Docker build and push using Docker Hub credentials, and writing to $GITHUB_STEP_SUMMARY—will continue to work under these permissions.

Suggested changeset 1
.github/workflows/ci-build-manual-onnx.yml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/ci-build-manual-onnx.yml b/.github/workflows/ci-build-manual-onnx.yml
--- a/.github/workflows/ci-build-manual-onnx.yml
+++ b/.github/workflows/ci-build-manual-onnx.yml
@@ -3,6 +3,9 @@
 # This workflow builds the lightweight ONNX/Wapiti-only Docker image
 # (no Python/DeLFT/TensorFlow dependencies)
 
+permissions:
+  contents: read
+
 on:
   push:
     branches:
EOF
@@ -3,6 +3,9 @@
# This workflow builds the lightweight ONNX/Wapiti-only Docker image
# (no Python/DeLFT/TensorFlow dependencies)

permissions:
contents: read

on:
push:
branches:
Copilot is powered by AI and may make mistakes. Always verify output.