Document Analysis Project

A comprehensive document processing and analysis system that extracts, processes, and generates markdown outputs from various document formats using multiple AI models.

📁 Project Structure

├── Archive/                    # Archived files and backups
├── Md Results/                 # Markdown output results by document
│   ├── Doc1/                  # Results for Document 1
│   ├── Doc2/                  # Results for Document 2
│   ├── Doc3/                  # Results for Document 3
│   ├── Doc4/                  # Results for Document 4
│   └── Doc5/                  # Results for Document 5
├── Conclusive analysis.md      # Final analysis and conclusions
├── Data.xlsx                   # Source data in Excel format
├── Doc1.pdf                    # Source document 1
├── Doc2.pdf                    # Source document 2
├── Doc3.pdf                    # Source document 3
├── Doc4.pdf                    # Source document 4
├── Doc5.pdf                    # Source document 5
├── README.md                   # This file
├── structured_ocr.ipynb        # Jupyter notebook for OCR processing
└── test.ipynb                  # Testing and experimentation notebook

🎯 Purpose

This project processes a collection of PDF documents through various AI models to extract insights, generate structured analysis, and produce markdown-formatted results. Each document is individually processed and the results are organized by document for easy comparison and analysis.

🚀 Getting Started

Prerequisites

Python 3.8+
Jupyter Notebook
Required Python packages (install via requirements.txt if available)
Access to AI models/APIs used in the analysis

Installation

Clone this repository:

git clone <your-repo-url>
cd <your-repo-name>

Install dependencies:
```
pip install -r requirements.txt
```
Set up any required API keys or configuration files

Usage

Document Processing:
- Place your PDF documents in the root directory
- Run the structured OCR notebook: structured_ocr.ipynb
Analysis:
- Execute the analysis pipeline through the notebooks
- Results will be automatically organized in the Md Results/ folders
Review Results:
- Check individual document results in their respective folders under Md Results/
- Review the conclusive analysis in Conclusive analysis.md

📊 Data Sources

PDF Documents (Doc1-5.pdf): Primary source documents for analysis
Data.xlsx: Supporting data in Excel format
Processed Results: Markdown outputs organized by source document

🔬 Analysis Pipeline

The project follows this general workflow:

Document Ingestion: PDF documents are loaded and preprocessed
OCR Processing: Text extraction using structured OCR techniques
Model Processing: Documents are processed through multiple AI models
Result Generation: Structured markdown outputs are generated for each document
Comparative Analysis: Results are compiled into a conclusive analysis

📁 Output Structure

Each document's results are stored in dedicated folders under Md Results/:

Individual model outputs
Comparative analyses
Structured summaries
Key insights and findings

🛠️ Technical Details

OCR Processing: Implemented in structured_ocr.ipynb
Testing Framework: Available in test.ipynb
Data Format: Results output in Markdown for easy readability and version control
Archive System: Historical results and backups stored in Archive/

📈 Results

The analysis produces:

Individual document insights
Cross-document patterns and themes
Structured markdown reports
Comparative analysis across all documents
Final conclusive findings

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📝 Notes

All PDF source documents are included for reproducibility
Results are version-controlled as markdown files
The archive folder maintains historical versions
Notebooks provide interactive analysis capabilities

🔧 Troubleshooting

Ensure all required packages are installed
Check API keys and configuration settings
Verify PDF documents are not corrupted
Review notebook execution order

👥 Authors

Kah Meng Lee

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Document Analysis Project

📁 Project Structure

🎯 Purpose

🚀 Getting Started

Prerequisites

Installation

Usage

📊 Data Sources

🔬 Analysis Pipeline

📁 Output Structure

🛠️ Technical Details

📈 Results

🤝 Contributing

📝 Notes

🔧 Troubleshooting

👥 Authors

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
Archive		Archive
Md Results		Md Results
Conclusive analysis.md		Conclusive analysis.md
Data.xlsx		Data.xlsx
Doc1.pdf		Doc1.pdf
Doc2.pdf		Doc2.pdf
Doc3.pdf		Doc3.pdf
Doc4.pdf		Doc4.pdf
Doc5.pdf		Doc5.pdf
README.md		README.md
Report (final).pdf		Report (final).pdf
Report.docx		Report.docx
structured_ocr.ipynb		structured_ocr.ipynb

KahMeng2/PDF-Parser-Comparison

Folders and files

Latest commit

History

Repository files navigation

Document Analysis Project

📁 Project Structure

🎯 Purpose

🚀 Getting Started

Prerequisites

Installation

Usage

📊 Data Sources

🔬 Analysis Pipeline

📁 Output Structure

🛠️ Technical Details

📈 Results

🤝 Contributing

📝 Notes

🔧 Troubleshooting

👥 Authors

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages