A sophisticated NLP-powered document summarization system that combines extractive and abstractive techniques to deliver high-quality summaries of educational content. Built with modern machine learning frameworks and optimized for both performance and accuracy.
- Advanced NLP Pipeline: Leverages state-of-the-art transformer models for intelligent text analysis
- Hybrid Approach: Combines extractive sentence selection with abstractive text generation
- Production-Ready: Web application with REST API, file upload support, and comprehensive evaluation
- Research-Driven: Addresses real-world challenges in educational content summarization
- Scalable Architecture: Handles documents from short notes to long research papers
- Uses `distilroberta-base` (66M parameters) and `google/flan-t5-base` (248M parameters)
- Optimized for resource-constrained environments (8GB VRAM, 16GB RAM)
- Implements efficient chunking and hierarchical summarization for long documents
- Divide-and-conquer strategy: Automatically chunks documents >2000 characters
- Hierarchical summarization: Summarizes chunks then combines results
- Global context preservation: Maintains document coherence across chunks
- Constrained decoding: Forces inclusion of keywords from extractive phase
- Over-extraction strategy: Extracts comprehensive factual basis before abstraction
- Factual consistency scoring: Evaluates hallucination rates and keyword retention
- OCR error correction: Handles common misrecognitions (1→I, 0→O, etc.)
- Educational content cleaning: Removes citations, headers, figure captions
- Academic text segmentation: Proper handling of abbreviations and references
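The OCR cleanup step can be illustrated with a few rule-based substitutions. This is a minimal sketch, not the project's actual rule set: `fix_ocr_errors` and the specific regexes are hypothetical, chosen to show the 1→I / 0→O class of fixes mentioned above.

```python
import re

# Common OCR misrecognitions in scanned educational text.
# Illustrative rules only -- a real pipeline would use a larger, tested set.
OCR_RULES = [
    (re.compile(r"(?<=[a-z])0(?=[a-z])"), "o"),  # "n0isy"    -> "noisy"
    (re.compile(r"\b1(?=[a-z]{2})"), "I"),       # "1ntro"    -> "Intro"
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),  # "ana1ysis" -> "analysis"
]

def fix_ocr_errors(text: str) -> str:
    """Apply each substitution rule in order; later rules see earlier fixes."""
    for pattern, replacement in OCR_RULES:
        text = pattern.sub(replacement, text)
    return text

print(fix_ocr_errors("1ntroduction to ana1ysis of n0isy text"))
# -> "Introduction to analysis of noisy text"
```

Context-sensitive lookarounds keep the rules from corrupting genuine digits (e.g. "10 pages" is untouched because the characters around each digit are not letters).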
The system implements a cognitive loop of Perception → Planning → Action:
- Perception: Analyzes document structure, preprocesses noisy content
- Planning: Determines optimal chunking strategy based on document length
- Action: Executes hybrid extractive-abstractive pipeline with constrained decoding
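The Planning step above can be sketched as a simple length-based dispatch. `plan_strategy` and its return labels are hypothetical names; the 2,000-character threshold comes from the chunking strategy described earlier.

```python
def plan_strategy(text: str, chunk_threshold: int = 2000) -> str:
    """Planning step: choose a processing strategy from document length.

    Short documents go through the pipeline in a single pass; anything over
    the threshold is split and summarized hierarchically. Illustrative only.
    """
    if len(text) <= chunk_threshold:
        return "single_pass"
    return "hierarchical_chunking"

print(plan_strategy("short lecture note"))  # single_pass
print(plan_strategy("x" * 5000))            # hierarchical_chunking
```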
- Hybrid NLP Pipeline: DistilRoBERTa extractive summarization + FLAN-T5 abstractive generation
- Constrained Generation: Keyword-grounded abstractive summaries to maintain factual accuracy
- Long Document Support: Automatic chunking and hierarchical summarization for documents >2000 characters
- Comprehensive Evaluation: ROUGE, METEOR, BLEU, and custom factual consistency metrics
- Full-Stack Web Application: Flask-based UI with real-time processing and evaluation
- REST API: Programmatic access for seamless integration
- Resource Optimization: Designed for laptops with limited hardware (8GB VRAM, 16GB RAM)
- Multi-format Support: PDF text extraction, OCR error correction, and plain text processing
- Quality Modes: Fast, Balanced, and High-Quality processing options
- Python 3.8+: Primary programming language
- PyTorch: Deep learning framework for model inference
- Transformers: Hugging Face library for pre-trained models
- Flask: Web framework for REST API and user interface
- NLTK: Natural language processing toolkit
- DistilRoBERTa-base: Extractive summarization (66M parameters)
- FLAN-T5-base: Abstractive summarization (248M parameters)
- scikit-learn: Machine learning utilities and evaluation metrics
- rouge-score: ROUGE evaluation metrics implementation
- sentencepiece: Text tokenization for T5 models
- NumPy & Pandas: Data processing and analysis
- Python 3.8 or higher
- pip package manager
- Git (for cloning the repository)
```bash
# Clone the repository
git clone https://github.com/aryaman108/pdf-summariser.git
cd pdf-summariser

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/Mac

# Install dependencies
pip install -r requirements.txt
```

The `requirements.txt` pins the core dependencies:

```
transformers>=4.21.0
torch>=1.12.0
scikit-learn>=1.1.0
rouge-score>=0.1.2
nltk>=3.7
sentencepiece>=0.1.96
flask>=2.2.0
numpy>=1.21.0
pandas>=1.5.0
```
```bash
python app.py
# Access at http://localhost:4500
```

Features:
- Interactive web interface with real-time processing
- File upload support (PDF/TXT)
- Quality mode selection (Fast/Balanced/High)
- Live evaluation metrics display
- REST API endpoint for programmatic access
```bash
python main.py
```

Features:
- Command-line interface for batch processing
- Direct text input and file processing
- Customizable summarization parameters
```python
from src.hybrid_summarizer import HybridSummarizer

# Initialize the summarizer
summarizer = HybridSummarizer()

# Generate summary with quality control
summary = summarizer.summarize(
    text="Your long document text here...",
    quality_mode="balanced",  # Options: "fast", "balanced", "high"
    verbose=True              # Enable detailed logging
)
print(f"Summary: {summary}")
```

POST request to the REST API:
```bash
curl -X POST http://localhost:4500/api/summarize \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text here", "quality_mode": "balanced"}'
```

The system provides comprehensive evaluation:
- ROUGE-1/2/L: N-gram overlap with reference summaries
- METEOR: Semantic similarity metric
- BLEU: N-gram precision-based metric
- Factual Consistency: Keyword retention and hallucination detection
- Compression Ratio: Summary length vs. original length
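Two of the lighter-weight metrics are easy to sketch in plain Python. The function names and the word-level definitions below are assumptions for illustration; the project's evaluation suite may compute them differently (e.g. at the character or token level).

```python
def compression_ratio(summary, original):
    """Summary length over original length, in words."""
    return len(summary.split()) / max(len(original.split()), 1)

def keyword_retention(summary, keywords):
    """Fraction of extracted keywords that survive into the summary --
    a simple proxy for factual consistency."""
    summary_lower = summary.lower()
    hits = sum(1 for kw in keywords if kw.lower() in summary_lower)
    return hits / max(len(keywords), 1)

original = "the quick brown fox jumps over the lazy dog near the river bank"
summary = "a fox jumps over a dog"
print(round(compression_ratio(summary, original), 2))              # 0.46
print(round(keyword_retention(summary, ["fox", "dog", "river"]), 2))  # 0.67
```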
The application provides an intuitive web interface for document summarization:
Main Features:
- Text input area for direct text entry
- File upload support for PDF and TXT documents
- Quality mode selection (Fast/Balanced/High Quality)
- Real-time processing with progress indicators
- Comprehensive evaluation metrics display
- REST API for programmatic access
Screenshot: (Interface shows input text area, quality options, and results with metrics)
```javascript
// Example frontend integration
fetch('/api/summarize', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    text: documentText,
    quality_mode: 'balanced'
  })
})
  .then(response => response.json())
  .then(data => {
    console.log('Summary:', data.summary);
    console.log('Compression Ratio:', data.compression_ratio);
  });
```

```
text-summarizer-roberta-t5/
├── app.py                      # Flask web application
├── main.py                     # Tkinter desktop application
├── src/
│   ├── hybrid_summarizer.py    # Main agent implementation
│   ├── roberta_extractive.py   # DistilRoBERTa extractive summarizer
│   ├── t5_abstractive.py       # FLAN-T5 abstractive summarizer
│   └── evaluation.py           # ROUGE/METEOR evaluation suite
├── utils/
│   └── preprocessing.py        # Educational content preprocessing
├── plan.md                     # Research plan and gap analysis
└── README.md                   # This file
```
- Over-extraction: Selects 2x more sentences than needed for comprehensive coverage
- Multi-criteria scoring: Position, length, semantic centrality, lexical diversity, and connectivity
- Keyword extraction: TF-IDF-based keyword identification with enhanced scoring
- Sentence embeddings: CLS token embeddings for semantic representation
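The TF-IDF keyword step above can be sketched in a few lines of plain Python. This is a bare-bones stand-in, not the project's extractor: `tfidf_keywords`, the length-based stopword filter, and the scikit-learn-style smoothed IDF are all illustrative choices.

```python
import math
import re
from collections import Counter

def tfidf_keywords(sentences, top_k=5):
    """Rank words by a simple TF-IDF score.

    Term frequency is counted over the whole document, document frequency
    over individual sentences; words under four letters are dropped as a
    crude stopword filter. Illustrative only.
    """
    tokenized = [re.findall(r"[a-z]{4,}", s.lower()) for s in sentences]
    tf = Counter(word for tokens in tokenized for word in tokens)
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Smoothed IDF (as in scikit-learn) so ubiquitous words are not zeroed out.
    scores = {w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

sents = [
    "Transformers process tokens in parallel.",
    "Attention lets transformers weigh tokens.",
    "Gradient descent training uses large corpora.",
]
print(tfidf_keywords(sents))  # repeated terms like "transformers" rank first
```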
- Constrained decoding: Forces inclusion of extracted keywords for factual consistency
- Prompt engineering: Context-preserving prompts with keyword integration
- Post-processing: Fluency enhancement and punctuation correction
- Quality optimization: Length control and repetition penalty tuning
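Keyword grounding can be sketched at the prompt level: the extracted keywords are written into the instruction the abstractive model sees. The template wording and `build_constrained_prompt` below are hypothetical; the project's actual prompt (and any decode-time constraints, e.g. Hugging Face's `force_words_ids` argument to `generate`) may differ.

```python
def build_constrained_prompt(chunk, keywords):
    """Compose a FLAN-T5 style instruction that names the keywords the
    summary must keep. Illustrative template, not the project's exact one."""
    keyword_list = ", ".join(keywords)
    return (
        "summarize: Preserve these key terms exactly: "
        f"{keyword_list}. Text: {chunk}"
    )

prompt = build_constrained_prompt(
    "Backpropagation computes gradients layer by layer.",
    ["backpropagation", "gradients"],
)
print(prompt)
```

Prompt-level grounding is cheap but soft; forcing tokens at decode time is stricter, at the cost of occasionally less fluent output.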
- Adaptive chunking: 1000-character chunks with sentence boundaries
- Hierarchical combination: Summarizes chunk summaries for final output
- Context preservation: Maintains document flow across chunks
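The adaptive chunking step above can be sketched as a greedy sentence packer. `chunk_text` and the naive regex sentence splitter are illustrative (the project uses proper academic-text segmentation), but the invariant is the same: chunks respect sentence boundaries and stay near the size limit.

```python
import re

def chunk_text(text, max_chars=1000):
    """Split text into chunks of at most ~max_chars on sentence boundaries.

    Sentences are kept whole so no chunk cuts mid-sentence; a single
    sentence longer than max_chars becomes its own chunk. Sketch only.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then summarized independently, and the chunk summaries are concatenated and summarized once more to produce the final hierarchical output.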
- Memory Usage: <4GB VRAM for typical documents
- Processing Speed: ~10 seconds for 5000-word documents
- Accuracy: ROUGE-1 F1 > 0.45 on educational datasets
- Factual Consistency: >85% keyword retention
This project showcases expertise in:
- Transformer Models: Implementation of RoBERTa and T5 architectures
- Fine-tuning & Optimization: Model selection and parameter optimization
- Text Processing: Tokenization, embedding generation, and sequence processing
- Evaluation Metrics: ROUGE, METEOR, BLEU, and custom factual consistency scoring
- Full-Stack Development: Flask web application with REST API
- Modular Architecture: Clean separation of extractive, abstractive, and hybrid components
- Error Handling: Robust exception handling and fallback mechanisms
- Performance Optimization: Memory-efficient processing for resource-constrained environments
- Feature Engineering: Sentence scoring with multiple criteria (position, length, semantics)
- Statistical Analysis: Complexity analysis and content type detection
- Evaluation Frameworks: Comprehensive metrics calculation and reporting
- Data Preprocessing: OCR correction, text cleaning, and normalization
- REST API Design: Clean API endpoints with JSON responses
- Frontend Integration: HTML/CSS/JavaScript interface with real-time updates
- File Upload Handling: Secure file processing for PDF and text documents
- User Experience: Intuitive interface with progress indicators and error handling
- Environment Management: Virtual environment setup and dependency management
- Cross-Platform Compatibility: Windows/Linux/Mac support
- Model Management: Efficient loading and caching of large language models
- Scalability: Chunking strategies for handling large documents
Optimized for:
- Research papers and academic articles
- Lecture notes and study materials
- PDF documents (OCR-processed)
- Noisy scanned content
- Multi-column layouts
- Citations and references
- Multi-language Support: Extend beyond English to support major world languages
- Domain-Specific Models: Fine-tuned models for legal, medical, and technical documents
- Real-time Collaboration: Multi-user editing and shared document processing
- Advanced Evaluation: Integration with human evaluation workflows
- Cloud Deployment: Docker containerization and cloud-native deployment options
- Model Optimization: Quantization and distillation for faster inference
- GPU Acceleration: Enhanced CUDA optimization for better performance
- Caching Layer: Intelligent caching of processed documents and embeddings
- Batch Processing: Parallel processing for multiple documents
This project welcomes contributions from developers, researchers, and NLP enthusiasts. Areas of interest:
- New Model Integration: Support for other transformer architectures
- Performance Optimization: Memory and speed improvements
- UI/UX Enhancement: Better user interfaces and user experience
- Testing & QA: Comprehensive test suites and quality assurance
- Evaluation Metrics: Novel evaluation methods for summarization quality
- Dataset Curation: Educational and domain-specific datasets
- Model Fine-tuning: Domain adaptation and fine-tuning workflows
- Comparative Studies: Benchmarking against other summarization systems
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
MIT License - see LICENSE file for details
If you use this system in your research or applications, please cite:
Intelligent Document Agent: Hybrid Text Summarization System
Authors: Aryaman Parashar Behera and Chandra Sai Badam
Repository: https://github.com/aryaman108/pdf-summariser
Year: 2024
For questions, collaboration opportunities, or technical discussions:
- GitHub Issues: Report bugs and request features
- Pull Requests: Submit improvements and new features
- Email: Contact through GitHub profile
Built with ❤️ using Python, PyTorch, and Transformers