A modified TF-IDF algorithm for extracting features from invoices, receipts, and bills where layout matters more than word count.
Traditional IR algorithms rely on the bag-of-words assumption, where semantic relevance is reflected in term frequency and global rarity. While effective for unstructured prose, this assumption breaks down for semi-structured transactional documents (e.g., invoices, forms), where high-frequency template terms are noise and importance is determined by spatial layout rather than by counts.
We introduce Structural TF-IDF (STF-IDF) — a vector space model that incorporates layout information via three mechanisms: a penalized term frequency to suppress repetitive structural noise, a dampened IDF to modulate ubiquitous structural terms, and a Spatial Beta-Boost to project tokens onto distinct vector manifolds based on adjacency to variable entities. Hyperparameters α and β are jointly optimized via an unsupervised signal-to-noise ratio loop.
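The joint α/β optimization can be pictured as a grid search over a signal-to-noise objective. Everything below is an illustrative assumption: the toy weighting form, the SNR definition, and the helper names (`toy_weights`, `snr`, `tune`) are placeholders, not the library's actual implementation.

```python
import itertools
import math
from collections import Counter

def toy_weights(doc, flags, alpha, beta, n, df):
    """Toy STF-IDF-style weights: log-scaled TF, IDF dampened by alpha,
    and a beta boost for tokens adjacent to variable entities (flags)."""
    tf = Counter(doc)
    return {t: (1 + math.log(c))
               * math.log(n / df[t]) ** alpha
               * (beta if t in flags else 1.0)
            for t, c in tf.items()}

def snr(vectors, flags_list):
    """Mean weight of spatially boosted terms over mean weight of the rest."""
    sig = [w for v, f in zip(vectors, flags_list) for t, w in v.items() if t in f]
    rest = [w for v, f in zip(vectors, flags_list) for t, w in v.items() if t not in f]
    return (sum(sig) / max(len(sig), 1)) / (sum(rest) / max(len(rest), 1) + 1e-9)

def tune(docs, flags_list, alphas=(0.25, 0.5, 1.0), betas=(1.5, 2.0, 3.0)):
    """Grid search for the (alpha, beta) pair that maximizes SNR."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    def score(ab):
        vecs = [toy_weights(d, f, ab[0], ab[1], n, df)
                for d, f in zip(docs, flags_list)]
        return snr(vecs, flags_list)
    return max(itertools.product(alphas, betas), key=score)
```

Note that in this toy objective a larger β always increases SNR, so the search degenerates toward the largest β on the grid; a practical loop would regularize β or hold out validation documents.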
| Metric | Value | Context |
|---|---|---|
| Relative improvement | +11.1% | DocLayNet dense layouts |
| Spatial boost recovery | 71.5% | Adversarial ablation accuracy |
| Baseline collapse | 17.5% | Non-discriminative vocabulary |
- Penalized term frequency (TF) — suppresses repetitive structural noise common in templated transactional documents.
- Dampened inverse document frequency (IDF) — modulates the penalty applied to ubiquitous structural terms across the corpus.
- Spatial Beta-Boost — projects tokens onto distinct vector manifolds based on their adjacency to variable entities, encoding layout topology.
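Taken together, the three mechanisms can be sketched as a single weighting function. This is a minimal illustration under stated assumptions: the exact penalty, dampening, and boost forms, and the `near_variable` layout signal, are hypothetical stand-ins, not the model's published formulas.

```python
import math
from collections import Counter

def stf_idf_sketch(docs, near_variable, alpha=0.5, beta=2.0):
    """Illustrative combination of the three STF-IDF mechanisms.

    docs          -- list of token lists
    near_variable -- per-document sets of tokens adjacent to variable
                     entities (amounts, dates): the spatial signal
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc, flags in zip(docs, near_variable):
        tf = Counter(doc)
        vec = {}
        for term, count in tf.items():
            penalized_tf = 1 + math.log(count)              # 1. suppress repetition
            dampened_idf = math.log(n / df[term]) ** alpha  # 2. modulate ubiquity
            spatial_boost = beta if term in flags else 1.0  # 3. beta-boost by layout
            vec[term] = penalized_tf * dampened_idf * spatial_boost
        vectors.append(vec)
    return vectors

docs = [["invoice", "total", "500"], ["invoice", "amount", "due", "500"]]
flags = [{"total"}, {"due"}]
vecs = stf_idf_sketch(docs, flags)
# "invoice" occurs in every document, so its dampened IDF (and weight) is 0.0,
# while the spatially boosted "total" dominates its document's vector.
```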
```python
from structural_tfidf import StructuralTFIDF

docs = ["Invoice Total $500", "Total estimate $500"]

model = StructuralTFIDF(max_features=1000)
model.fit(docs)

# Auto-tune alpha and beta via the unsupervised SNR optimization loop
model.optimize_parameters(docs)

embeddings = model.transform(docs)
```

📄 Publication link — coming soon