A modified TF-IDF algorithm for extracting features from invoices, receipts, and bills where layout matters more than word count.
Traditional IR algorithms rely on the bag-of-words assumption, where semantic relevance is reflected in term frequency and global rarity. While effective for unstructured prose, this assumption breaks down for semi-structured transactional documents (e.g., invoices, forms), where high-frequency template terms are noise and importance is determined by spatial layout rather than by counts.
We introduce Structural TF-IDF (STF-IDF) — a vector space model that incorporates layout information via three mechanisms: a penalized term frequency to suppress repetitive structural noise, a dampened IDF to modulate ubiquitous structural terms, and a Spatial Beta-Boost to project tokens onto distinct vector manifolds based on adjacency to variable entities. Hyperparameters α and β are jointly optimized via an unsupervised signal-to-noise ratio loop.
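The joint α/β optimization can be pictured as a grid search over a signal-to-noise objective. Everything below is an illustrative assumption: the toy weighting form, the SNR definition, and the helper names (`toy_weights`, `snr`, `tune`) are placeholders, not the library's actual implementation.

```python
import itertools
import math
from collections import Counter

def toy_weights(doc, flags, alpha, beta, n, df):
    """Toy STF-IDF-style weights: log-scaled TF, IDF dampened by alpha,
    and a beta boost for tokens adjacent to variable entities (flags)."""
    tf = Counter(doc)
    return {t: (1 + math.log(c))
               * math.log(n / df[t]) ** alpha
               * (beta if t in flags else 1.0)
            for t, c in tf.items()}

def snr(vectors, flags_list):
    """Mean weight of spatially boosted terms over mean weight of the rest."""
    sig = [w for v, f in zip(vectors, flags_list) for t, w in v.items() if t in f]
    rest = [w for v, f in zip(vectors, flags_list) for t, w in v.items() if t not in f]
    return (sum(sig) / max(len(sig), 1)) / (sum(rest) / max(len(rest), 1) + 1e-9)

def tune(docs, flags_list, alphas=(0.25, 0.5, 1.0), betas=(1.5, 2.0, 3.0)):
    """Grid search for the (alpha, beta) pair that maximizes SNR."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    def score(ab):
        vecs = [toy_weights(d, f, ab[0], ab[1], n, df)
                for d, f in zip(docs, flags_list)]
        return snr(vecs, flags_list)
    return max(itertools.product(alphas, betas), key=score)
```

Note that in this toy objective a larger β always increases SNR, so the search degenerates toward the largest β on the grid; a practical loop would regularize β or hold out validation documents.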
| Metric | Value | Context |
|---|---|---|
| Relative improvement | +11.1% | DocLayNet dense layouts |
| Spatial boost recovery | 71.5% | Adversarial ablation accuracy |
| Baseline collapse | 17.5% | Non-discriminative vocabulary |
- Penalized term frequency (TF) — suppresses repetitive structural noise common in templated transactional documents.
- Dampened inverse document frequency (IDF) — modulates the penalty applied to ubiquitous structural terms across the corpus.
- Spatial Beta-Boost — projects tokens onto distinct vector manifolds based on their adjacency to variable entities, encoding layout topology.
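Taken together, the three mechanisms can be sketched as a single weighting function. This is a minimal illustration under stated assumptions: the exact penalty, dampening, and boost forms, and the `near_variable` layout signal, are hypothetical stand-ins, not the model's published formulas.

```python
import math
from collections import Counter

def stf_idf_sketch(docs, near_variable, alpha=0.5, beta=2.0):
    """Illustrative combination of the three STF-IDF mechanisms.

    docs          -- list of token lists
    near_variable -- per-document sets of tokens adjacent to variable
                     entities (amounts, dates): the spatial signal
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc, flags in zip(docs, near_variable):
        tf = Counter(doc)
        vec = {}
        for term, count in tf.items():
            penalized_tf = 1 + math.log(count)              # 1. suppress repetition
            dampened_idf = math.log(n / df[term]) ** alpha  # 2. modulate ubiquity
            spatial_boost = beta if term in flags else 1.0  # 3. beta-boost by layout
            vec[term] = penalized_tf * dampened_idf * spatial_boost
        vectors.append(vec)
    return vectors

docs = [["invoice", "total", "500"], ["invoice", "amount", "due", "500"]]
flags = [{"total"}, {"due"}]
vecs = stf_idf_sketch(docs, flags)
# "invoice" occurs in every document, so its dampened IDF (and weight) is 0.0,
# while the spatially boosted "total" dominates its document's vector.
```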
```python
from structural_tfidf import StructuralTFIDF

docs = ["Invoice Total $500", "Total estimate $500"]

model = StructuralTFIDF(max_features=1000)
model.fit(docs)

# Auto-tune alpha and beta via the unsupervised SNR optimization loop
model.optimize_parameters(docs)

embeddings = model.transform(docs)
```

📄 Publication link — coming soon