Add support for ONNX models as a new engine provider for Deep Learning models #1353
base: master
Conversation
… Java versions greater than 1.8.
…ions during open and access operations and updating Javadoc.
… into FeaturesVectorHeader
…and prevent issues with pickled numpy data.
```yaml
    needs: [ build ]
    runs-on: ubuntu-latest

    steps:
      - name: Create more disk space
        run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
      - uses: actions/checkout@v4
      - name: Build and push
        id: docker_build
        uses: mr-smithers-excellent/docker-build-push@v5
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME_LFOPPIANO }}
          password: ${{ secrets.DOCKERHUB_TOKEN_LFOPPIANO }}
          image: lfoppiano/grobid
          registry: docker.io
          pushImage: true
          tags: latest-onnx, ${{ github.event.inputs.custom_tag}}
          dockerfile: Dockerfile.onnx
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
      - name: Docker Image Summary
        run: |
          echo "## 🐳 Docker Image Uploaded Successfully" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "**Image Details:**" >> $GITHUB_STEP_SUMMARY
          echo "- **Registry:** docker.io" >> $GITHUB_STEP_SUMMARY
          echo "- **Image:** lfoppiano/grobid" >> $GITHUB_STEP_SUMMARY
          echo "- **Type:** ONNX/Wapiti only (lightweight, no Python/DeLFT)" >> $GITHUB_STEP_SUMMARY
          echo "- **Tags:**" >> $GITHUB_STEP_SUMMARY
          echo " - \`latest-onnx\`" >> $GITHUB_STEP_SUMMARY
          echo " - \`${{ github.event.inputs.custom_tag }}\`" >> $GITHUB_STEP_SUMMARY
          echo "- **Digest:** \`${{ steps.docker_build.outputs.digest }}\`" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "**Features:**" >> $GITHUB_STEP_SUMMARY
          echo "- ONNX Runtime for deep learning models (CPU only)" >> $GITHUB_STEP_SUMMARY
          echo "- Wapiti CRF for traditional models" >> $GITHUB_STEP_SUMMARY
          echo "- No Python, TensorFlow, or DeLFT dependencies" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "**Usage:**" >> $GITHUB_STEP_SUMMARY
          echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
          echo "docker pull lfoppiano/grobid:${{ github.event.inputs.custom_tag }}" >> $GITHUB_STEP_SUMMARY
          echo "docker run -t --rm --init -p 8070:8070 -p 8071:8071 lfoppiano/grobid:${{ github.event.inputs.custom_tag }}" >> $GITHUB_STEP_SUMMARY
          echo "\`\`\`" >> $GITHUB_STEP_SUMMARY
```
Check warning

Code scanning / CodeQL: Workflow does not contain permissions (Medium)

Copilot Autofix (AI, 11 days ago)
In general, this problem is fixed by adding an explicit permissions section to the workflow (either at the root or per job) that grants only the minimal scopes required. Since this workflow only checks out code, builds it, pushes Docker images to Docker Hub (using Docker Hub credentials), and writes to the step summary, it does not need write access to the repository; read-only access to contents is sufficient.
The best minimal fix, without changing any functionality, is to add a top‑level permissions block applying to all jobs. This should be placed between the on: block (ending at line 16) and the jobs: block (starting at line 18). The block will set contents: read, which allows actions/checkout to function but prevents unnecessary write access to repo contents. No additional imports or methods are needed, as this is purely a YAML configuration change.
```diff
@@ -15,6 +15,9 @@
         required: true
         default: "latest-onnx"
 
+permissions:
+  contents: read
+
 jobs:
   build:
     runs-on: ubuntu-latest
```
```yaml
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v5
        with:
          fetch-tags: true
          fetch-depth: 0
      - name: Set up JDK 21
        uses: actions/setup-java@v5
        with:
          java-version: '21'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build with Gradle
        run: ./gradlew build -x test

  docker-build-onnx:
```
Check warning

Code scanning / CodeQL: Workflow does not contain permissions (Medium)

Copilot Autofix (AI, 11 days ago)
To fix the issue, add an explicit permissions block so that the GITHUB_TOKEN has only the minimal scopes required. In this workflow, jobs only need to read the repository (for actions/checkout) and write to the GitHub Actions step summary (which does not require contents: write). They do not modify code, issues, or pull requests, nor do they appear to require id-token or other elevated scopes. Therefore, we can safely set contents: read at the workflow level, which applies to both build and docker-build-onnx jobs.
Concretely, edit .github/workflows/ci-build-manual-onnx.yml and insert a permissions section near the top, after the name: and before on: (or immediately after on:; GitHub accepts either, and placing it after name: is conventional). The block should be:

```yaml
permissions:
  contents: read
```

No additional imports, methods, or definitions are required because this is just a YAML configuration change. Existing functionality (checkout, Gradle build, Docker build and push using Docker Hub credentials, and writing to $GITHUB_STEP_SUMMARY) will continue to work under these permissions.
```diff
@@ -3,6 +3,9 @@
 # This workflow builds the lightweight ONNX/Wapiti-only Docker image
 # (no Python/DeLFT/TensorFlow dependencies)
 
+permissions:
+  contents: read
+
 on:
   push:
     branches:
```
This PR introduces pure-Java ONNX inference for GROBID's deep learning models, opening the possibility of eliminating the runtime dependency on Python/JEP/TensorFlow (the existing JEP approach for calling DeLFT is left untouched). The implementation enables a lighter Docker image that ships deep learning models, keeping their accuracy benefits while reducing complexity and resource requirements.
Architecture Overview
```mermaid
flowchart TB
    subgraph Input
        PDF[PDF Document]
    end
    subgraph "GROBID Processing"
        PDF --> Parser[pdfalto]
        Parser --> Features[Feature Extraction]
        Features --> Engine{Engine Selection}
        Engine -->|engine: wapiti| Wapiti[Wapiti CRF]
        Engine -->|engine: onnx| ONNX[ONNX Runtime]
        Engine -->|engine: delft| DeLFT[DeLFT/Python]
    end
    subgraph "ONNX Inference Stack"
        ONNX --> Runner[OnnxModelRunner]
        Runner --> RT[ONNX Runtime 1.23.2]
        ONNX --> Embeddings[WordEmbeddings]
        Embeddings --> LMDB[(LMDB Database)]
        ONNX --> CRF[CRFDecoder]
        CRF --> Viterbi[Viterbi Algorithm]
    end
    subgraph Output
        Wapiti --> TEI[TEI XML]
        ONNX --> TEI
        DeLFT --> TEI
    end
```

Implementation Components
Core Java Classes
- `DeLFTOnnxModel.java`
- `OnnxClassificationModel.java`
- `OnnxTagger.java` (`GenericTagger` implementation)
- `OnnxModelRunner.java` (see the ONNX Runtime sketch below)
- `CRFDecoder.java`
- `WordEmbeddings.java`
- `Preprocessor.java`
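As a rough illustration of what the ONNX stack does under the hood, here is a minimal sketch of a forward pass with the ONNX Runtime Java API, which the architecture diagram shows `OnnxModelRunner` wrapping. The tensor names `input_embeddings` and `logits`, the input shape, and the class name are placeholders, not the actual model signature.

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.util.Map;

// Minimal sketch of a forward pass with the ONNX Runtime Java API.
// Tensor names and shapes are placeholders; the real OnnxModelRunner handles the
// model-specific inputs (embeddings, char indices, features) and batching.
public class OnnxForwardPassSketch {

    public static float[][][] run(String modelPath, float[][][] embeddings) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession session = env.createSession(modelPath, new OrtSession.SessionOptions());
             // Wrap the [batch, sequence, embeddingDim] array in an ONNX tensor
             OnnxTensor input = OnnxTensor.createTensor(env, embeddings)) {
            try (OrtSession.Result result = session.run(Map.of("input_embeddings", input))) {
                // Per-token label scores ("emissions"), later handed to the CRF decoder
                return (float[][][]) result.get("logits").get().getValue();
            }
        }
    }
}
```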
Factory Integration

- `TaggerFactory.java`: `ONNX` case → `OnnxTagger` (dispatch sketched below)
- `ClassifierFactory.java`: `ONNX` case → `OnnxClassificationModel`
- `GrobidCRFEngine.java`: `ONNX("onnx")` enum value
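To make the wiring concrete, here is a simplified sketch of the kind of switch the factories perform. It is illustrative rather than the actual `TaggerFactory` code, and the `OnnxTagger(model)` constructor is an assumed signature.

```java
// Illustrative dispatch, not the actual TaggerFactory implementation.
// The OnnxTagger(model) constructor is an assumed signature.
public static GenericTagger getTagger(GrobidModel model, GrobidCRFEngine engine) {
    switch (engine) {
        case WAPITI:
            return new WapitiTagger(model);
        case DELFT:
            return new DeLFTTagger(model);
        case ONNX:
            // New in this PR: route the model to the pure-Java ONNX stack
            return new OnnxTagger(model);
        default:
            throw new IllegalArgumentException("Unsupported engine: " + engine);
    }
}
```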
Supported Models

Sequence Labeling (BidLSTM_CRF_FEATURES)
- `header-BidLSTM_CRF_FEATURES.onnx/`
- `citation-BidLSTM_CRF_FEATURES.onnx/`
- `affiliation-address-BidLSTM_CRF_FEATURES.onnx/`
- `date-BidLSTM_CRF_FEATURES.onnx/`
- `reference-segmenter-BidLSTM_CRF_FEATURES.onnx/`
- `funding-acknowledgement-BidLSTM_CRF_FEATURES.onnx/`
- `header-coi-ac-BidLSTM_CRF_FEATURES.onnx/`

Text Classification (GRU)
- `copyright-gru.onnx/`
- `license-gru.onnx/`

Model Bundle Structure
Each ONNX model directory contains:
Configuration
Engine Selection (`grobid.yaml`)

ONNX-Only Configuration
A dedicated `grobid-onnx.yaml` provides a complete configuration for ONNX+Wapiti deployment (no need for Python/TensorFlow).

Docker Image
Dockerfile.onnx
`Dockerfile.onnx` creates a lightweight image. Key differences from the standard Dockerfile:

- No Python/TensorFlow/DeLFT installation
- Uses `grobid-onnx.yaml` configuration
- Preloads embeddings with standalone script (no DeLFT dependency)
- Runtime: `eclipse-temurin:21-jre` + libxml2/libfontconfig only

Build & Run:
```bash
docker build -t grobid/grobid:0.8.0-onnx --file Dockerfile.onnx .
docker run -t --rm --init -p 8070:8070 -p 8071:8071 grobid/grobid:0.8.0-onnx
```

CI/CD Workflow
`ci-build-manual-onnx.yml`:

- Triggered by pushes to the `onnx-docker` branch or manual dispatch
- Publishes `lfoppiano/grobid:latest-onnx` on Docker Hub
- Tags: `latest-onnx`, custom tag, or commit SHA

Embedding Preloading
Standalone Script
`preload_embeddings_standalone.py` (283 lines):

- Uses `lmdb` and `requests`
- Produces an embeddings database compatible with `WordEmbeddings.java`

Format Compatibility
Warning

The LMDB database must contain raw float32 arrays, not pickled numpy arrays. The `WordEmbeddings.java` class validates this at startup.
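To make the format requirement concrete, a raw float32 record is just 4 × dimension bytes that can be read directly into a `float[]`, whereas a pickled numpy array carries pickle framing around the data. The sketch below shows one way such a record could be decoded and checked; the little-endian byte order and the validation strategy are assumptions, not the actual `WordEmbeddings.java` logic.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of decoding one raw-float32 LMDB value into an embedding vector.
// Byte order (little-endian, as produced by numpy's tobytes() on common platforms)
// and the validation strategy are assumptions, not the actual WordEmbeddings.java code.
public final class RawEmbeddingDecoder {

    public static float[] decode(byte[] rawValue, int expectedDimension) {
        if (rawValue.length != expectedDimension * Float.BYTES) {
            // A pickled numpy array is wrapped in pickle framing, so its length will
            // generally not be exactly 4 * dimension bytes.
            throw new IllegalStateException("Unexpected record size (" + rawValue.length
                    + " bytes); the embeddings database may contain pickled numpy data");
        }
        ByteBuffer buffer = ByteBuffer.wrap(rawValue).order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[expectedDimension];
        for (int i = 0; i < expectedDimension; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }
}
```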
Technical Details

Inference Pipeline
```mermaid
sequenceDiagram
    participant Client
    participant OnnxTagger
    participant DeLFTOnnxModel
    participant Preprocessor
    participant WordEmbeddings
    participant OnnxModelRunner
    participant CRFDecoder
    Client->>OnnxTagger: label(data)
    OnnxTagger->>DeLFTOnnxModel: labelGrobidInput(data)
    DeLFTOnnxModel->>Preprocessor: tokensToCharIndices()
    DeLFTOnnxModel->>Preprocessor: tokensToFeatureIndices()
    DeLFTOnnxModel->>WordEmbeddings: getEmbeddings(words)
    WordEmbeddings-->>DeLFTOnnxModel: float[][] embeddings
    DeLFTOnnxModel->>OnnxModelRunner: runInference(embs, chars, features)
    OnnxModelRunner-->>DeLFTOnnxModel: float[][][] emissions
    DeLFTOnnxModel->>CRFDecoder: decode(emissions, mask)
    CRFDecoder-->>DeLFTOnnxModel: int[] tagIndices
    DeLFTOnnxModel-->>OnnxTagger: labeled output
    OnnxTagger-->>Client: result
```
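The `CRFDecoder` step above turns the per-token emission scores into a final label sequence. For orientation, a generic Viterbi decode over emissions and a label-transition matrix looks roughly like the sketch below; the real decoder additionally applies the mask and the CRF start/end transition scores, which are omitted here.

```java
// Generic Viterbi decoding sketch (not the actual CRFDecoder code).
// emissions[t][j]  : score of label j at token t (from the ONNX model)
// transitions[i][j]: score of moving from label i to label j (from the CRF layer)
public final class ViterbiSketch {

    public static int[] decode(float[][] emissions, float[][] transitions) {
        int seqLen = emissions.length;
        int numLabels = emissions[0].length;
        float[][] score = new float[seqLen][numLabels];
        int[][] backPointer = new int[seqLen][numLabels];

        // Initialization: the first token is scored by its emissions alone
        System.arraycopy(emissions[0], 0, score[0], 0, numLabels);

        // Recursion: for each token and label, keep the best-scoring previous label
        for (int t = 1; t < seqLen; t++) {
            for (int j = 0; j < numLabels; j++) {
                float best = Float.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int i = 0; i < numLabels; i++) {
                    float candidate = score[t - 1][i] + transitions[i][j];
                    if (candidate > best) {
                        best = candidate;
                        bestPrev = i;
                    }
                }
                score[t][j] = best + emissions[t][j];
                backPointer[t][j] = bestPrev;
            }
        }

        // Termination: pick the best final label, then follow the back-pointers
        int[] tags = new int[seqLen];
        int bestLast = 0;
        for (int j = 1; j < numLabels; j++) {
            if (score[seqLen - 1][j] > score[seqLen - 1][bestLast]) {
                bestLast = j;
            }
        }
        tags[seqLen - 1] = bestLast;
        for (int t = seqLen - 1; t > 0; t--) {
            tags[t - 1] = backPointer[t][tags[t]];
        }
        return tags;
    }
}
```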
Long Sequence Handling

`DeLFTOnnxModel.labelSequenceWithChunking()` automatically chunks sequences exceeding `maxSequenceLength` (typically 3000 tokens), processes each chunk independently, and concatenates the results.
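A rough sketch of this chunk-and-concatenate strategy is shown below; the class and method names, and the functional parameter standing in for a single forward pass, are illustrative rather than the actual `DeLFTOnnxModel` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative chunking sketch (not the actual DeLFTOnnxModel code).
// 'labelChunk' stands in for one ONNX forward pass limited to maxSequenceLength tokens.
public final class ChunkedLabellingSketch {

    public static List<String> label(List<String> tokens,
                                     int maxSequenceLength,
                                     Function<List<String>, List<String>> labelChunk) {
        List<String> labels = new ArrayList<>(tokens.size());
        for (int start = 0; start < tokens.size(); start += maxSequenceLength) {
            int end = Math.min(start + maxSequenceLength, tokens.size());
            // Each chunk is labelled independently...
            List<String> chunkLabels = labelChunk.apply(tokens.subList(start, end));
            // ...and the per-chunk results are concatenated back in order
            labels.addAll(chunkLabels);
        }
        return labels;
    }
}
```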
Dependencies

Note

GPU support is available via `onnxruntime_gpu:1.23.2` (Linux with NVIDIA CUDA only).

Testing
Integration Test
`HeaderOnnxIntegrationTest.java` (213 lines):

- `testModelCanBeLoaded`
- `testMaxSequenceLength`
- `testAnnotateSimpleHeader`
- `testLabelGrobidInput`
- `testLabelMultipleSequences`

Prerequisites:
- `grobid-home/models/header-BidLSTM_CRF_FEATURES.onnx/`
- `{delft}/data/db/glove-840B`

Comparison: ONNX vs DeLFT
* ONNX models are exported from trained DeLFT models, ensuring identical predictions.
Limitations
Caution
The following are known limitations of the current ONNX implementation:
Future Work
Files Changed Summary
New Files
- `grobid-core/.../delft/DeLFTOnnxModel.java`
- `grobid-core/.../delft/OnnxClassificationModel.java`
- `grobid-core/.../tagging/OnnxTagger.java`
- `grobid-core/.../delft/OnnxModelRunner.java`
- `grobid-core/.../delft/CRFDecoder.java`
- `grobid-core/.../delft/WordEmbeddings.java`
- `grobid-core/.../delft/Preprocessor.java`
- `grobid-core/.../tagging/ClassifierFactory.java`
- `grobid-core/.../tagging/GenericClassifier.java`
- `grobid-home/config/grobid-onnx.yaml`
- `grobid-home/scripts/preload_embeddings_standalone.py`
- `Dockerfile.onnx`
- `.github/workflows/ci-build-manual-onnx.yml`
- `grobid-core/.../HeaderOnnxIntegrationTest.java`

Modified Files
- `grobid-core/.../GrobidCRFEngine.java`: `ONNX` enum
- `grobid-core/.../TaggerFactory.java`
- `build.gradle`

Model Directories Added
12 ONNX model bundles under `grobid-home/models/`:

- `header-BidLSTM_CRF_FEATURES.onnx/`
- `citation-BidLSTM_CRF_FEATURES.onnx/`
- `affiliation-address-BidLSTM_CRF_FEATURES.onnx/`
- `date-BidLSTM_CRF_FEATURES.onnx/`
- `reference-segmenter-BidLSTM_CRF_FEATURES.onnx/`
- `funding-acknowledgement-BidLSTM_CRF_FEATURES.onnx/`
- `header-BidLSTM_ChainCRF_FEATURES.onnx/`
- `header-coi-ac-BidLSTM_CRF_FEATURES.onnx/`
- `header-coi-ac-BidLSTM_ChainCRF_FEATURES.onnx/`
- `copyright-gru.onnx/`
- `license-gru.onnx/`

Usage
Enable ONNX for a Model
Run with Docker (ONNX-only)
Local Development
Ensure embeddings are preloaded:
Update `grobid.yaml` to use `engine: "onnx"`.

Run GROBID: