Skip to content

aryaman108/pdf-summariser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Intelligent Document Agent: Hybrid Text Summarization System

A sophisticated NLP-powered document summarization system that combines extractive and abstractive techniques to deliver high-quality summaries of educational content. Built with modern machine learning frameworks and optimized for both performance and accuracy.

🚀 Key Highlights

  • Advanced NLP Pipeline: Leverages state-of-the-art transformer models for intelligent text analysis
  • Hybrid Approach: Combines extractive sentence selection with abstractive text generation
  • Production-Ready: Web application with REST API, file upload support, and comprehensive evaluation
  • Research-Driven: Addresses real-world challenges in educational content summarization
  • Scalable Architecture: Handles documents from short notes to long research papers

🎯 Research Gaps Addressed

Computational Feasibility

  • Uses distilroberta-base (66M parameters) and google/flan-t5-base (248M parameters)
  • Optimized for resource-constrained environments (8GB VRAM, 16GB RAM)
  • Implements efficient chunking and hierarchical summarization for long documents

Scalability & Long Document Handling

  • Divide-and-conquer strategy: Automatically chunks documents >2000 characters
  • Hierarchical summarization: Summarizes chunks then combines results
  • Global context preservation: Maintains document coherence across chunks

Hallucination Mitigation

  • Constrained decoding: Forces inclusion of keywords from extractive phase
  • Over-extraction strategy: Extracts comprehensive factual basis before abstraction
  • Factual consistency scoring: Evaluates hallucination rates and keyword retention

Robust Data Preprocessing

  • OCR error correction: Handles common misrecognitions (1→I, 0→O, etc.)
  • Educational content cleaning: Removes citations, headers, figure captions
  • Academic text segmentation: Proper handling of abbreviations and references

🧠 Agentic Framework

The system implements a cognitive loop of Perception → Planning → Action:

  • Perception: Analyzes document structure, preprocesses noisy content
  • Planning: Determines optimal chunking strategy based on document length
  • Action: Executes hybrid extractive-abstractive pipeline with constrained decoding

🚀 Features

  • Hybrid NLP Pipeline: DistilRoBERTa extractive summarization + FLAN-T5 abstractive generation
  • Constrained Generation: Keyword-grounded abstractive summaries to maintain factual accuracy
  • Long Document Support: Automatic chunking and hierarchical summarization for documents >2000 characters
  • Comprehensive Evaluation: ROUGE, METEOR, BLEU, and custom factual consistency metrics
  • Full-Stack Web Application: Flask-based UI with real-time processing and evaluation
  • REST API: Programmatic access for seamless integration
  • Resource Optimization: Designed for laptops with limited hardware (8GB VRAM, 16GB RAM)
  • Multi-format Support: PDF text extraction, OCR error correction, and plain text processing
  • Quality Modes: Fast, Balanced, and High-Quality processing options

🛠️ Technical Stack

Core Technologies

  • Python 3.8+: Primary programming language
  • PyTorch: Deep learning framework for model inference
  • Transformers: Hugging Face library for pre-trained models
  • Flask: Web framework for REST API and user interface
  • NLTK: Natural language processing toolkit

Machine Learning Models

  • DistilRoBERTa-base: Extractive summarization (66M parameters)
  • FLAN-T5-base: Abstractive summarization (248M parameters)

Supporting Libraries

  • scikit-learn: Machine learning utilities and evaluation metrics
  • rouge-score: ROUGE evaluation metrics implementation
  • sentencepiece: Text tokenization for T5 models
  • NumPy & Pandas: Data processing and analysis

📦 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Git (for cloning the repository)

Setup Instructions

# Clone the repository
git clone https://github.com/aryaman108/pdf-summariser.git
cd pdf-summariser

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate  # Windows
# source .venv/bin/activate  # Linux/Mac

# Install dependencies
pip install -r requirements.txt

Dependencies

transformers>=4.21.0
torch>=1.12.0
scikit-learn>=1.1.0
rouge-score>=0.1.2
nltk>=3.7
sentencepiece>=0.1.96
flask>=2.2.0
numpy>=1.21.0
pandas>=1.5.0

🎮 Usage

Web Interface (Recommended)

python app.py
# Access at http://localhost:4500

Features:

  • Interactive web interface with real-time processing
  • File upload support (PDF/TXT)
  • Quality mode selection (Fast/Balanced/High)
  • Live evaluation metrics display
  • REST API endpoint for programmatic access

Desktop Application

python main.py

Features:

  • Command-line interface for batch processing
  • Direct text input and file processing
  • Customizable summarization parameters

Programmatic Usage

from src.hybrid_summarizer import HybridSummarizer

# Initialize the summarizer
summarizer = HybridSummarizer()

# Generate summary with quality control
summary = summarizer.summarize(
    text="Your long document text here...",
    quality_mode="balanced",  # Options: "fast", "balanced", "high"
    verbose=True  # Enable detailed logging
)

print(f"Summary: {summary}")

API Usage

# POST request to REST API
curl -X POST http://localhost:4500/api/summarize \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text here", "quality_mode": "balanced"}'

📊 Evaluation Metrics

The system provides comprehensive evaluation:

  • ROUGE-1/2/L: N-gram overlap with reference summaries
  • METEOR: Semantic similarity metric
  • BLEU: N-gram precision-based metric
  • Factual Consistency: Keyword retention and hallucination detection
  • Compression Ratio: Summary length vs. original length

📸 Demo & Screenshots

Web Interface

The application provides an intuitive web interface for document summarization:

Main Features:

  • Text input area for direct text entry
  • File upload support for PDF and TXT documents
  • Quality mode selection (Fast/Balanced/High Quality)
  • Real-time processing with progress indicators
  • Comprehensive evaluation metrics display
  • REST API for programmatic access

Screenshot: (Interface shows input text area, quality options, and results with metrics)

API Integration

// Example frontend integration
fetch('/api/summarize', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    text: documentText,
    quality_mode: 'balanced'
  })
})
.then(response => response.json())
.then(data => {
  console.log('Summary:', data.summary);
  console.log('Compression Ratio:', data.compression_ratio);
});

🏗️ Project Structure

ext-summarizer-roberta-t5/
├── app.py                    # Flask web application
├── main.py                   # Tkinter desktop application
├── src/
│   ├── hybrid_summarizer.py  # Main agent implementation
│   ├── roberta_extractive.py # DistilRoBERTa extractive summarizer
│   ├── t5_abstractive.py     # FLAN-T5 abstractive summarizer
│   └── evaluation.py         # ROUGE/METEOR evaluation suite
├── utils/
│   └── preprocessing.py      # Educational content preprocessing
├── plan.md                   # Research plan and gap analysis
└── README.md                 # This file

🔬 Technical Implementation

Extractive Phase (DistilRoBERTa-base)

  • Over-extraction: Selects 2x more sentences than needed for comprehensive coverage
  • Multi-criteria scoring: Position, length, semantic centrality, lexical diversity, and connectivity
  • Keyword extraction: TF-IDF-based keyword identification with enhanced scoring
  • Sentence embeddings: CLS token embeddings for semantic representation

Abstractive Phase (FLAN-T5-base)

  • Constrained decoding: Forces inclusion of extracted keywords for factual consistency
  • Prompt engineering: Context-preserving prompts with keyword integration
  • Post-processing: Fluency enhancement and punctuation correction
  • Quality optimization: Length control and repetition penalty tuning

Chunking Strategy

  • Adaptive chunking: 1000-character chunks with sentence boundaries
  • Hierarchical combination: Summarizes chunk summaries for final output
  • Context preservation: Maintains document flow across chunks

📈 Performance Characteristics

  • Memory Usage: <4GB VRAM for typical documents
  • Processing Speed: ~10 seconds for 5000-word documents
  • Accuracy: ROUGE-1 F1 > 0.45 on educational datasets
  • Factual Consistency: >85% keyword retention

🎯 Skills Demonstrated

This project showcases expertise in:

🤖 Machine Learning & NLP

  • Transformer Models: Implementation of RoBERTa and T5 architectures
  • Fine-tuning & Optimization: Model selection and parameter optimization
  • Text Processing: Tokenization, embedding generation, and sequence processing
  • Evaluation Metrics: ROUGE, METEOR, BLEU, and custom factual consistency scoring

🏗️ Software Engineering

  • Full-Stack Development: Flask web application with REST API
  • Modular Architecture: Clean separation of extractive, abstractive, and hybrid components
  • Error Handling: Robust exception handling and fallback mechanisms
  • Performance Optimization: Memory-efficient processing for resource-constrained environments

📊 Data Science & Analytics

  • Feature Engineering: Sentence scoring with multiple criteria (position, length, semantics)
  • Statistical Analysis: Complexity analysis and content type detection
  • Evaluation Frameworks: Comprehensive metrics calculation and reporting
  • Data Preprocessing: OCR correction, text cleaning, and normalization

☁️ Web Development

  • REST API Design: Clean API endpoints with JSON responses
  • Frontend Integration: HTML/CSS/JavaScript interface with real-time updates
  • File Upload Handling: Secure file processing for PDF and text documents
  • User Experience: Intuitive interface with progress indicators and error handling

🔧 DevOps & Deployment

  • Environment Management: Virtual environment setup and dependency management
  • Cross-Platform Compatibility: Windows/Linux/Mac support
  • Model Management: Efficient loading and caching of large language models
  • Scalability: Chunking strategies for handling large documents

🎓 Educational Content Support

Optimized for:

  • Research papers and academic articles
  • Lecture notes and study materials
  • PDF documents (OCR-processed)
  • Noisy scanned content
  • Multi-column layouts
  • Citations and references

🚀 Future Enhancements

Planned Features

  • Multi-language Support: Extend beyond English to support major world languages
  • Domain-Specific Models: Fine-tuned models for legal, medical, and technical documents
  • Real-time Collaboration: Multi-user editing and shared document processing
  • Advanced Evaluation: Integration with human evaluation workflows
  • Cloud Deployment: Docker containerization and cloud-native deployment options

Technical Improvements

  • Model Optimization: Quantization and distillation for faster inference
  • GPU Acceleration: Enhanced CUDA optimization for better performance
  • Caching Layer: Intelligent caching of processed documents and embeddings
  • Batch Processing: Parallel processing for multiple documents

🤝 Contributing

This project welcomes contributions from developers, researchers, and NLP enthusiasts. Areas of interest:

Development Opportunities

  • New Model Integration: Support for newer transformer architectures (BERT, GPT, etc.)
  • Performance Optimization: Memory and speed improvements
  • UI/UX Enhancement: Better user interfaces and user experience
  • Testing & QA: Comprehensive test suites and quality assurance

Research Opportunities

  • Evaluation Metrics: Novel evaluation methods for summarization quality
  • Dataset Curation: Educational and domain-specific datasets
  • Model Fine-tuning: Domain adaptation and fine-tuning workflows
  • Comparative Studies: Benchmarking against other summarization systems

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

MIT License - see LICENSE file for details

📚 Citation

If you use this system in your research or applications, please cite:

Intelligent Document Agent: Hybrid Text Summarization System
Author: Aryaman Parashar Behera && Chandra Sai Badam
Repository: https://github.com/aryaman108/pdf-summariser
Year: 2024

📞 Contact

For questions, collaboration opportunities, or technical discussions:

  • GitHub Issues: Report bugs and request features
  • Pull Requests: Submit improvements and new features
  • Email: Contact through GitHub profile

Built with ❤️ using Python, PyTorch, and Transformers

About

An advanced NLP project showcasing expertise in machine learning, full-stack development, and AI engineering. Built with PyTorch, Transformers, and Flask, this production-ready document summarization system demonstrates proficiency in RoBERTa/T5 model integration, REST API development, and scalable web application architecture.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors