From 323a47daa95b153bcb6eeed01c02b512b7c4cf7e Mon Sep 17 00:00:00 2001
From: PiyushInt
Date: Fri, 27 Mar 2026 20:08:23 +0530
Subject: [PATCH] Docs: add model transparency section (fixes #73)

---
 README.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/README.md b/README.md
index 4389217..4982d01 100644
--- a/README.md
+++ b/README.md
@@ -277,6 +277,23 @@ Benchmarked with vLLM on a single NVIDIA H100 80GB GPU using a diverse mix of do
 |---|:---:|:---:|:---:|:---:|
 | vLLM, 96 concurrent sequences | 1.44 | 60s | 156s | 0% |
 
+## Model transparency
+
+### Training data
+
+Chandra OCR 2 is trained on large-scale collections of document images designed to cover math, tables, forms, handwriting, scans, and multilingual content. The goal of the training corpus is to match the distribution of the benchmarks and real‑world documents shown above. We do not list individual datasets or sources here, but as with most OCR systems, the model may inherit patterns, artifacts, and biases that exist in the underlying data.
+
+If we publish more detailed information about the training data composition or preprocessing, we will link it from this section.
+
+### Limitations and risks
+
+- Performance can degrade on extremely low‑resolution scans, heavy blur, or severe compression artifacts.
+- Rare scripts, highly stylized fonts, or unusual page layouts may be misrecognized or reordered.
+- The model may hallucinate or omit text, tables, or symbols, especially in very noisy or partially occluded regions.
+- Outputs can reflect social or cultural biases present in the training data and should not be treated as ground truth for high‑stakes decisions (e.g., legal, medical, or safety‑critical use) without human review.
+
+Users should carefully review outputs before use in production workflows and consider adding domain‑specific validation or post‑processing where appropriate.
+
 # Credits
 
 Thank you to the following open source projects: