A robust, modular Python pipeline and REST API designed to assess the quality of digital documents (primarily PDFs). This system validates document structure, extracts semantic metadata using LLM interfaces, and performs topic modeling to cluster documents by content.
The system is designed as a multi-stage pipeline that can be run via a CLI for batch processing or hosted as a FastAPI service for real-time validation.
- Loader: Efficient batch downloading of documents from JSON sources.
- Structural validator:
- Performs strict file signature checks to reject masked binary/video files.
- Analyzes text density, page count, and metadata integrity.
- Assigns a quality score to every document. The structural analysis relies on typology-based criteria (practice abstracts & policy briefs, project deliverables & reports, promotional content & newsletters, or scientific/technical papers).
- Metadata extractor: Interface for external LLM-based metadata extraction (title, summary, keywords, topics).
- Topic modeling: Implements a BERTopic model (with UMAP and HDBSCAN) to generate semantic clusters from the validated dataset.
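The strict file-signature check can be sketched with stdlib-only Python. Note this is an illustrative sketch, not the project's actual validator: the function name, the set of rejected signatures, and the return convention are all assumptions.

```python
# Sketch of a magic-bytes signature check: a genuine PDF must start with
# "%PDF-", which lets us reject binaries or videos renamed to .pdf.
# Function name and the signature table are illustrative assumptions.

PDF_MAGIC = b"%PDF-"
KNOWN_MASKS = {
    b"\x1a\x45\xdf\xa3": "webm/mkv video",   # EBML header
    b"MZ": "Windows executable",
    b"\x7fELF": "ELF binary",
}

def check_signature(path: str) -> tuple[bool, str]:
    """Return (is_pdf, reason) based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(PDF_MAGIC):
        return True, "valid PDF signature"
    for magic, label in KNOWN_MASKS.items():
        if head.startswith(magic):
            return False, f"masked {label}"
    return False, "unknown signature"
```

A masked video or executable is rejected before any text-density or metadata analysis runs, keeping the expensive stages of the pipeline for plausible PDFs only.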
- Python 3.10+
- Docker & Docker Compose (optional, for containerized deployment)
- NVIDIA GPU (recommended for topic modeling / Torch)
- Clone the repository:

  ```bash
  git clone https://github.com/adrmisty/doc-quality-app.git
  cd doc-quality-app
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The application uses a unified entry point, main.py, located in the app/ module. You can run individual pipeline stages (downloading, metadata extraction, or topic modeling) or the full workflow.
Note: Run these commands from the project root.
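The unified entry point can be pictured as an argparse subcommand dispatcher. The parser below is a hypothetical sketch that mirrors the flags of the commands that follow; it is not the project's actual main.py.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # One subcommand per pipeline stage, mirroring the CLI usage below.
    parser = argparse.ArgumentParser(prog="doc_quality.app.main")
    sub = parser.add_subparsers(dest="command", required=True)

    download = sub.add_parser("download", help="batch-download documents")
    download.add_argument("--n", type=int, default=100)
    download.add_argument("--output", default="./data/pdf")

    metadata = sub.add_parser("metadata", help="validate + extract metadata")
    metadata.add_argument("--n_meta", type=int, default=50)
    metadata.add_argument("--input_dir", default="./data/pdf")

    topics = sub.add_parser("topics", help="train the topic model")
    topics.add_argument("--input_dir", default="./data/metadata/valid")
    topics.add_argument("--output_dir", default="./data/models")

    run_all = sub.add_parser("all", help="run the full pipeline")
    run_all.add_argument("--n", type=int, default=500)
    return parser
```

Each stage then dispatches on `args.command`, so the same module serves batch CLI runs and can also expose a `serve` subcommand for the API.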
- Download up to `n` documents from a JSON file listing source URLs:

  ```bash
  python3 -m doc_quality.app.main download --n 100 --output ./data/pdf
  ```
- Extract metadata and validate document structure for up to `n_meta` files. Note: this can take a long time, depending on the number and type of input documents.

  ```bash
  python3 -m doc_quality.app.main metadata --n_meta 50 --input_dir ./data/pdf
  ```
- Train the topic model on extracted metadata and content:

  ```bash
  python3 -m doc_quality.app.main topics --input_dir ./data/metadata/valid --output_dir ./data/models
  ```
- Run the full pipeline sequentially:

  ```bash
  python3 -m doc_quality.app.main all --n 500
  ```
The project includes a production-ready FastAPI server that exposes the quality-assessment logic, with interactive docs available locally at http://localhost:8000/docs.

```bash
python3 -m doc_quality.app.main serve --host 0.0.0.0 --port 8000
```

The project is containerized for easy deployment, including (much-needed) GPU support for the machine-learning components.
- Build the container:

  ```bash
  docker-compose build
  ```
- Start the service:

  ```bash
  # NVIDIA Container Toolkit must be installed for GPU support
  docker-compose up -d
  ```
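For reference, GPU access in Compose is typically granted through a device reservation under `deploy.resources`, following the Compose specification. The snippet below is a hedged sketch only: the service name and port mapping are assumptions, not the project's actual docker-compose.yml.

```yaml
services:
  app:                      # service name is an assumption
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```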
doc-quality-app/
├── Dockerfile # containerization
├── docker-compose.yml
├── requirements.txt # python dependencies
└── doc_quality/
├── app/ # app entry points and API routers
├── config/ # configuration, prompt templates, .env definitions if needed
├── pipeline/
│ ├── loader/ # document ingestion
│ ├── metadata/ # outsourced metadata extraction
│ ├── quality/ # structural validation
│ └── topics/ # topic model training & inference
└── scripts/ # CLI command wrappers
Configuration is managed via the pydantic-settings library. You can override defaults (to be defined in config.py) using environment variables or a .env file.
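As a stdlib-only illustration of that override mechanism (the project itself uses pydantic-settings, and the setting names and `DQ_` prefix below are hypothetical), an environment variable beats the coded default:

```python
import os

# Stdlib-only sketch of environment-variable overrides; the real project
# uses pydantic-settings. Setting names and the DQ_ prefix are assumptions.
DEFAULTS = {"max_downloads": 100, "data_dir": "./data/pdf"}

def load_settings(prefix: str = "DQ_") -> dict:
    """Return defaults, overridden by matching environment variables."""
    settings = {}
    for key, default in DEFAULTS.items():
        raw = os.environ.get(prefix + key.upper())
        if raw is None:
            settings[key] = default
        elif isinstance(default, int):
            settings[key] = int(raw)
        else:
            settings[key] = raw
    return settings
```

With pydantic-settings, the same idea is declared once on a `BaseSettings` subclass, which also handles `.env` file loading and type validation.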
The system employs several software design patterns to ensure modularity, scalability, and maintainability:
- Singleton pattern: Used in the API's `global_state` to manage heavy resources (like the topic model), which are loaded only once at application startup, preventing memory shortages and reducing inference latency.
- Strategy / Factory pattern: A common interface for processing different file types, designed to be extended (e.g. to PowerPoint, HTML, or video).
- Heuristics: The PDF processor uses a multi-layered heuristic approach to score document quality before making expensive downstream LLM calls.
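A common way to get the singleton's load-once behavior is a cached accessor. The sketch below uses `functools.lru_cache` with a placeholder loader; the project's actual `global_state` is presumably more involved.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_topic_model():
    """Load the heavy topic model exactly once per process."""
    # Placeholder for the expensive load (e.g. deserializing a trained
    # BERTopic model from disk); cached, so later calls are free.
    model = object()
    return model
```

Every request handler calls `get_topic_model()`, but only the first call pays the loading cost; all subsequent calls return the same cached instance.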
Adriana R. Flórez · Computational Linguist & Software Engineer
GitHub Profile | LinkedIn
Built with ❤️ using Python, FastAPI, and BERTopic.