Anzhu-W/Invoice-detection

Table Extraction Pipeline

This repo turns document images (like receipts or invoices) into clean, structured JSON and optional visuals for your frontend. It combines PaddleX for text extraction, Table Transformer for layout detection, and LLaMA2 for smart header matching.

How It Works

  1. Upload & Prep
    • Drop in images or PDFs.
    • Auto-convert to JPG if needed.
    • Quick format and error checks.
  2. Table Processing
    • OCR: PaddleOCR grabs text and bounding boxes.
    • Structure Detection: Table Transformer finds rows, columns and cells.
  3. Smart Parsing
    • LLaMA2 maps extracted column headers to your target fields, so variants such as “fee” and “price” resolve to the same field.

Output: A JSON file with only the fields you need (currently service_date, item_code, unit_price, quantity, and gst).
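
As a rough illustration of the Smart Parsing and Output steps, the sketch below maps raw column headers to the target fields and keeps only those fields in the JSON. It uses a small hand-written synonym table in place of the LLaMA2 matching step, so HEADER_SYNONYMS, match_header, and rows_to_json are hypothetical names and not part of this repo.

```python
import json

# Target fields listed in this README.
TARGET_FIELDS = ["service_date", "item_code", "unit_price", "quantity", "gst"]

# Hypothetical synonym table standing in for the LLaMA2 header matcher.
HEADER_SYNONYMS = {
    "service_date": {"date", "service date", "date of service"},
    "item_code": {"code", "item", "item code", "item no"},
    "unit_price": {"price", "fee", "unit price", "rate"},
    "quantity": {"qty", "quantity", "units"},
    "gst": {"gst", "tax"},
}


def match_header(header):
    """Return the target field for a raw header, or None if nothing matches."""
    # Normalise case, underscores, and repeated whitespace before the lookup.
    key = " ".join(header.lower().replace("_", " ").split())
    for field, synonyms in HEADER_SYNONYMS.items():
        if key in synonyms:
            return field
    return None


def rows_to_json(headers, rows):
    """Keep only columns that map to a target field and dump them as JSON."""
    mapping = {i: match_header(h) for i, h in enumerate(headers)}
    records = []
    for row in rows:
        # Fields with no matching column stay as None (null in the JSON).
        record = {field: None for field in TARGET_FIELDS}
        for i, cell in enumerate(row):
            field = mapping.get(i)
            if field is not None:
                record[field] = cell
        records.append(record)
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    headers = ["Date", "Item Code", "Fee", "Qty", "GST", "Notes"]
    rows = [["2024-01-05", "A12", "80.00", "2", "16.00", "n/a"]]
    print(rows_to_json(headers, rows))
```

In this sketch the "Notes" column is dropped because it does not map to any target field, while "Fee" lands in unit_price.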


Service Usage

cd Invoice_service/

Project (Service) Structure

.
├── bin
├── image_analysis
│   ├── config
│   ├── models # (place the 2 required models)
│   ├── schemas
│   ├── servces
│   └── views
│       └── html
└── table_detection
    ├── config
    ├── detr
    │   ├── d2
    │   │   ├── configs
    │   │   └── detr
    │   ├── datasets
    │   ├── models
    │   └── util
    └── src

Under image_analysis/models/

1. Download new_model_with_header.pt from Google Drive: https://drive.google.com/file/d/1MBOiAizY6_4m8py8ziaS5k0Q67Dw3fg4/view?usp=sharing
2. Download pubtables1m_detection_detr_r18.pth from Hugging Face: wget https://huggingface.co/bsmock/tatr-pubtables1m-v1.0/resolve/main/pubtables1m_detection_detr_r18.pth
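
Before building the service, it may help to confirm that both files are in place. The sketch below is a small standalone check; the missing_models helper is hypothetical, and the image_analysis/models path is an assumption based on the project tree above.

```python
from pathlib import Path

# File names taken from the two download steps above.
REQUIRED_MODELS = (
    "new_model_with_header.pt",
    "pubtables1m_detection_detr_r18.pth",
)


def missing_models(models_dir):
    """Return the required model files that are absent from models_dir."""
    models_dir = Path(models_dir)
    return [name for name in REQUIRED_MODELS if not (models_dir / name).is_file()]


if __name__ == "__main__":
    # Assumed location, based on the project tree; adjust if yours differs.
    missing = missing_models("image_analysis/models")
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("Both model files are in place.")
```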

Docker (Recommended)

Under Invoice_service/

sudo docker compose up -d --build

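Note: The UI service listens on port 8080. If the service runs on a remote machine, you can reach it through SSH local port forwarding (for example, ssh -L 8080:localhost:8080 <user>@<server>, where <user> and <server> are placeholders for your own login) and then open localhost:8080 locally.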


What's in test/

- Input processing
  - Handling of bad examples
  - Handling of different input file types
  - Partitioning into training, testing, and validation sets

- Model training
  - Training scripts and configurations
  - Training logs

- Other
  - Ollama prompt tests
  - PaddleX and PaddleOCR configuration tests
  - OCR token formatter
  - Output CSV similarity tests (CER, WER, cosine similarity)
  - Column header handler
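
For readers unfamiliar with the similarity metrics named above, here is a minimal sketch of how CER, WER, and cosine similarity are commonly computed. It is not the code in test/: the function names are hypothetical, and the bag-of-words cosine is an assumption, since the repo's tests may vectorize text differently.

```python
import math
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: word edits divided by reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def cosine_similarity(a, b):
    """Cosine similarity over whitespace-tokenized bag-of-words counts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    reference = "unit price 10"
    extracted = "unit fee 10"
    print(cer(reference, extracted), wer(reference, extracted),
          cosine_similarity(reference, extracted))
```

Lower CER and WER mean the extracted text is closer to the reference, while a cosine similarity near 1 means the two texts share most of their tokens.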

You can refer to main.ipynb to explore some intermediate steps.
