
rajdeepbanerjee-git/StructuralTFIDF


Structural TF-IDF

A modified TF-IDF algorithm for extracting features from invoices, receipts, and bills where layout matters more than word count.

Topics: Information Retrieval, Document Classification, Vector Space Model


Abstract

Traditional information retrieval (IR) algorithms rely on the bag-of-words assumption, where semantic relevance is reflected in term frequency and global rarity. While effective for unstructured prose, this assumption fails for semi-structured transactional documents (e.g., invoices, forms), where high-frequency template terms indicate noise rather than relevance and structural importance is defined by spatial topology.

We introduce Structural TF-IDF (STF-IDF), a vector space model that incorporates layout information via three mechanisms: a penalized term frequency to suppress repetitive structural noise, a dampened IDF to modulate ubiquitous structural terms, and a Spatial Beta-Boost to project tokens onto distinct vector manifolds based on adjacency to variable entities. Hyperparameters α and β are jointly optimized via an unsupervised signal-to-noise ratio (SNR) loop.

| Metric | Value | Context |
| --- | --- | --- |
| Relative improvement | +11.1% | DocLayNet dense layouts |
| Spatial boost recovery | 71.5% | Adversarial ablation accuracy |
| Baseline collapse | 17.5% | Non-discriminative vocabulary |

Note: Details of the experiments are available on request.


Key Mechanisms

  1. Penalized term frequency (TF) — suppresses repetitive structural noise common in templated transactional documents.
  2. Dampened inverse document frequency (IDF) — modulates the penalty applied to ubiquitous structural terms across the corpus.
  3. Spatial Beta-Boost — projects tokens onto distinct vector manifolds based on their adjacency to variable entities, encoding layout topology.
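The three mechanisms above can be illustrated with a minimal sketch. The README does not give the actual formulas, so everything here is an assumption made for illustration: the saturating form of `penalized_tf`, the `log(1 + N/df)` form of `dampened_idf`, the rule that a "variable entity" is any token containing a digit, and the `@adj` suffix used to place boosted tokens on their own dimension. None of these names come from the `structural_tfidf` package.

```python
import math
import re
from collections import Counter

# Assumption: a "variable entity" is any token containing a digit (amounts, dates, IDs).
VARIABLE = re.compile(r"\d")


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tag_tokens(tokens):
    """Append '@adj' to non-variable tokens that sit next to a variable entity."""
    tagged = []
    for i, tok in enumerate(tokens):
        if VARIABLE.search(tok):
            tagged.append(tok)  # variable tokens are kept unchanged
            continue
        neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
        adjacent = any(VARIABLE.search(n) for n in neighbours)
        tagged.append(tok + "@adj" if adjacent else tok)
    return tagged


def penalized_tf(count, alpha):
    """Hypothetical saturating TF: equals 1 for count=1, grows sublinearly after."""
    return count / (1.0 + alpha * (count - 1))


def dampened_idf(n_docs, doc_freq):
    """Hypothetical dampened IDF: stays positive even for terms in every document."""
    return math.log(1.0 + n_docs / doc_freq)


def stf_idf(docs, alpha=0.5, beta=1.0):
    """Return one {term: weight} dict per document."""
    tagged_docs = [tag_tokens(tokenize(d)) for d in docs]
    doc_freq = Counter(t for doc in tagged_docs for t in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in tagged_docs:
        vec = {}
        for term, count in Counter(doc).items():
            weight = penalized_tf(count, alpha) * dampened_idf(n_docs, doc_freq[term])
            if term.endswith("@adj"):
                weight *= 1.0 + beta  # hypothetical spatial boost
            vec[term] = weight
        vectors.append(vec)
    return vectors


vectors = stf_idf(["Invoice Total $500", "Total estimate $500"])
```

In this sketch, "Total" in the first document sits next to "$500" and becomes the feature `total@adj`, while "Total" in the second document is not adjacent to a variable token and stays `total`. The same word therefore lands on two different dimensions depending on layout, which is one possible reading of the Spatial Beta-Boost description.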

Usage

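from structural_tfidf import StructuralTFIDF

# Two short documents that share the template word "Total" and a variable amount
docs = ["Invoice Total $500", "Total estimate $500"]

model = StructuralTFIDF(max_features=1000)

# Fit the model on the corpus
model.fit(docs)

# Auto-tune alpha and beta via the SNR optimization loop
model.optimize_parameters(docs)

# Embed the documents with the tuned model
embeddings = model.transform(docs)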
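The README does not describe how `optimize_parameters` searches for α and β or how the SNR is computed. As a rough sketch of what an unsupervised tuning loop of this kind could look like, the example below runs an exhaustive grid search that maximizes a caller-supplied score. The names `grid_search` and `toy_snr` are hypothetical, and `toy_snr` is only a placeholder objective with a known peak, not the library's SNR.

```python
from itertools import product


def grid_search(score, alphas, betas):
    """Return (best_alpha, best_beta, best_score) maximizing score(alpha, beta)."""
    best = None
    for alpha, beta in product(alphas, betas):
        value = score(alpha, beta)
        if best is None or value > best[2]:
            best = (alpha, beta, value)
    return best


def toy_snr(alpha, beta):
    """Placeholder objective (assumption): peaks at alpha=0.5, beta=1.0."""
    return -((alpha - 0.5) ** 2) - ((beta - 1.0) ** 2)


best_alpha, best_beta, best_value = grid_search(
    toy_snr,
    alphas=[0.0, 0.25, 0.5, 0.75, 1.0],
    betas=[0.0, 0.5, 1.0, 2.0],
)
```

A real objective would compute the score from the corpus itself, which is what makes the loop unsupervised: no labels are needed, only a function that maps a candidate (α, β) pair to a number.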

Paper

📄 Publication link — coming soon
