Table Extraction Pipeline

This repo turns document images (like receipts or invoices) into clean, structured JSON and optional visuals for your frontend. It combines PaddleX for text extraction, Table Transformer for layout detection, and LLaMA2 for smart header matching.

How It Works

Upload & Prep
- Drop in images or PDFs.
- Auto-convert to JPG if needed.
- Quick format and error checks.

Table Processing
- OCR: PaddleOCR grabs text and bounding boxes.
- Structure Detection: Table Transformer finds rows, columns, and cells.

Smart Parsing
- LLaMA2 links headers to your target fields (think “fee” vs. “price”).

Output: JSON file with only the fields you need (currently service_date, item_code, unit_price, quantity, and gst).
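
To make the parsing and output steps concrete, here is a minimal sketch of how detected column headers could be mapped to the target fields and reduced to the final JSON. It is not the repo's implementation: the LLaMA2 matching step is replaced by a hypothetical static alias table (HEADER_ALIASES), and the function names and sample row are invented for illustration.

```python
import json

# Target fields listed in this README.
TARGET_FIELDS = ["service_date", "item_code", "unit_price", "quantity", "gst"]

# Hypothetical alias table standing in for the LLaMA2 header-matching step.
HEADER_ALIASES = {
    "date": "service_date",
    "service date": "service_date",
    "code": "item_code",
    "item code": "item_code",
    "fee": "unit_price",
    "price": "unit_price",
    "unit price": "unit_price",
    "qty": "quantity",
    "quantity": "quantity",
    "gst": "gst",
}


def normalize(header):
    """Lowercase a header and collapse underscores and repeated spaces."""
    return " ".join(header.lower().replace("_", " ").split())


def map_headers(headers):
    """Return {column index: target field} for headers with a known alias."""
    mapping = {}
    for index, header in enumerate(headers):
        field = HEADER_ALIASES.get(normalize(header))
        # Keep only the first column that matches each target field.
        if field is not None and field not in mapping.values():
            mapping[index] = field
    return mapping


def rows_to_records(headers, rows):
    """Keep only the target fields; fields with no matching column stay None."""
    mapping = map_headers(headers)
    records = []
    for row in rows:
        record = {field: None for field in TARGET_FIELDS}
        for index, field in mapping.items():
            if index < len(row):
                record[field] = row[index]
        records.append(record)
    return records


# Invented sample table, as it might look after OCR and structure detection.
headers = ["Date", "Item Code", "Description", "Fee", "Qty", "GST"]
rows = [["2024-01-05", "A100", "Consultation", "80.00", "1", "8.00"]]
print(json.dumps(rows_to_records(headers, rows), indent=2))
```

In this sketch the "Description" column is dropped because it has no target field, and "Fee" lands in unit_price, which is the kind of synonym matching the LLaMA2 step is meant to handle more flexibly.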

Service Usage

cd Invoice_service/

Project (Service) Structure

.
├── bin
├── image_analysis
│   ├── config
│   ├── models # (place the 2 required models)
│   ├── schemas
│   ├── servces
│   └── views
│       └── html
└── table_detection
    ├── config
    ├── detr
    │   ├── d2
    │   │   ├── configs
    │   │   └── detr
    │   ├── datasets
    │   ├── models
    │   └── util
    └── src

Under models/

1. Download new_model_with_header.pt from Google Drive: https://drive.google.com/file/d/1MBOiAizY6_4m8py8ziaS5k0Q67Dw3fg4/view?usp=sharing
2. Download the Table Transformer detection weights: wget https://huggingface.co/bsmock/tatr-pubtables1m-v1.0/resolve/main/pubtables1m_detection_detr_r18.pth

Docker (Recommended)

Under Invoice_service/

sudo docker compose up -d --build

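Note: The UI service listens on port 8080. If it runs on a remote host, you can reach it through SSH local port forwarding (for example, ssh -L 8080:localhost:8080 <user>@<host>).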

What's in tests/

- Input processing
  - Handling of bad examples
  - Handling of different input file types
  - Training, testing, and validation set partitioning

- Model training
  - Training scripts and configurations
  - Training logs

- Other
  - Ollama prompt tests
  - PaddleX and PaddleOCR configuration tests
  - OCR token formatter
  - Output CSV similarity tests (CER, WER, cosine similarity)
  - Column header handler
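
For readers unfamiliar with the similarity metrics named above, the following is a minimal sketch of how CER, WER, and cosine similarity are commonly computed. It is an illustration under assumptions, not the code in the test folder: the function names are hypothetical, the error rates use plain Levenshtein distance, and cosine similarity is taken over bag-of-words counts.

```python
import math
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cosine_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words count vectors; 0.0 if either is empty."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


print(cer("price", "prise"))
print(wer("unit price 80", "unit prise 80"))
print(cosine_similarity("unit price 80", "unit price 80"))
```

Lower CER and WER mean the extracted CSV is closer to the reference, while cosine similarity moves in the opposite direction, with values near 1 indicating near-identical word content.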

You can refer to main.ipynb to explore some intermediate steps.
