- Introduction
- Features
- System Architecture
- Data Flow
- Workflow
- Module Architecture
- API Integration
- How to Use
- Running Locally
- Configuration
- Tech Stack
- Security Considerations
- Troubleshooting
The AI-Powered Web Article Summarizer is a sophisticated multi-page Streamlit application designed to automatically extract and summarize text from web articles. Leveraging Google Gemini AI for intelligent summarization, the app provides:
- Batch URL Summarization – Process multiple URLs simultaneously
- Keyword-Based Discovery – Automatically find and summarize top search results from Google
- Flexible Summary Types – Choose from concise, detailed, or key points formats
- Custom AI Instructions – Fine-tune summarization behavior with custom prompts
- Intuitive UI – Clean interface with collapsible sections for organized content viewing
This project is ideal for researchers, students, content curators, and professionals who need to efficiently digest large volumes of web content.
- Secure API key management with session-based storage
- Support for multiple API providers:
- Gemini AI API key for summarization
- Google API Key for search integration
- Google CSE ID for custom search engine
- Keys stored in session state for seamless cross-page access
- Batch processing of multiple URLs (one per line)
- Intelligent content extraction using Trafilatura
- Configurable summary types for each URL
- Real-time processing with progress indicators
- Collapsible sections displaying both extracted text and summaries
- Dynamic search result retrieval via Google Custom Search
- Configurable number of top websites (1-10)
- Automatic content extraction from search results
- Batch summarization with consistent formatting
- Organized display with per-website collapsible sections
- Natural language guidance for AI summarization
- Control over tone, focus areas, and detail level
- Applied consistently across all processed content
The application follows a modular architecture with clear separation of concerns:

```
┌─────────────────────────────────────────────────────────────────┐
│ Streamlit Frontend │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Home Page │ │ URL Summarizer│ │ Keyword Summarizer │ │
│ │ (API Setup) │ │ Page │ │ Page │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────────────┘ │
│ │ │ │ │
└─────────┼──────────────────┼──────────────────┼──────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Application Logic Layer │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐ │
│ │ Extractor │ │ Preprocessor │ │ Summarizer │ │
│ │ (extractor.py) │ │(preprocessor.py) │ │(summarizer.py) │ │
│ │ │ │ │ │ │ │
│ │ • URL fetching │ │ • Text cleaning │ │ • AI prompting │ │
│ │ • Content parse │ │ • Normalization │ │ • Response │ │
│ │ • Main text │ │ • Preprocessing │ │ handling │ │
│ │ extraction │ │ │ │ │ │
│ └────────┬────────┘ └────────┬─────────┘ └────────┬───────┘ │
│ │ │ │ │
└───────────┼────────────────────┼──────────────────────┼─────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Utility & Support Layer │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐ │
│ │ Keyword Search │ │ Utilities │ │ Configuration │ │
│ │(keyword_search) │ │ (utils.py) │ │ (config.py) │ │
│ │ │ │ │ │ │ │
│ │ • Google CSE │ │ • Logging setup │ │ • API keys │ │
│ │ • URL discovery │ │ • File I/O │ │ • Defaults │ │
│ │ • Result │ │ • Timestamps │ │ • Paths │ │
│ │ formatting │ │ │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬───────┘ │
│ │ │ │ │
└───────────┼────────────────────┼─────────────────────┼─────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ External Services │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────┐ │
│ │ Trafilatura │ │ Google Gemini │ │ Google CSE │ │
│ │ Web Scraping │ │ AI API │ │ Search API │ │
│ └──────────────────┘ └──────────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```

- Home Page: API key configuration and session management
- URL Summarizer Page: Multi-URL batch processing interface
- Keyword Summarizer Page: Search-based content discovery interface
- Session State Management: Persistent API key storage across pages
- Extractor Module: Handles web content fetching and extraction using Trafilatura
- Preprocessor Module: Cleans, normalizes, and prepares text for AI processing
- Summarizer Module: Interfaces with Google Gemini AI for intelligent summarization
- Keyword Search: Google Custom Search Engine integration
- Utilities: Logging, file operations, timestamp management
- Configuration: Centralized settings and environment management
- Trafilatura: Robust web scraping and content extraction
- Google Gemini AI: Advanced language model for summarization
- Google Custom Search: Keyword-based URL discovery
```mermaid
sequenceDiagram
participant User
participant UI as Streamlit UI
participant Session as Session State
participant Extractor
participant Preprocessor
participant Summarizer
participant Trafilatura
participant GeminiAI
participant GoogleCSE
Note over User,GoogleCSE: 1. API Setup Phase
User->>UI: Enter API Keys
UI->>Session: Store keys in session_state
Session-->>UI: Confirmation
UI-->>User: Keys saved successfully
Note over User,GoogleCSE: 2A. URL Summarization Flow
User->>UI: Enter URLs + Summary Type
UI->>Session: Retrieve Gemini API key
loop For each URL
UI->>Extractor: get_and_preprocess_text(url)
Extractor->>Trafilatura: fetch_url(url)
Trafilatura-->>Extractor: HTML content
Extractor->>Trafilatura: extract(content)
Trafilatura-->>Extractor: Raw text
Extractor->>Preprocessor: clean_text(raw_text)
Preprocessor-->>Extractor: Cleaned text
Extractor->>Preprocessor: preprocess_text(cleaned_text)
Preprocessor-->>Extractor: Processed text
Extractor-->>UI: Final text
UI->>Summarizer: summarize_text(text, type, instructions)
Summarizer->>GeminiAI: generate_content(prompt)
GeminiAI-->>Summarizer: Summary response
Summarizer-->>UI: Formatted summary
UI-->>User: Display extracted text + summary
end
Note over User,GoogleCSE: 2B. Keyword Summarization Flow
User->>UI: Enter Keyword + Number of Results
UI->>Session: Retrieve API keys
UI->>GoogleCSE: get_top_urls_from_keyword(keyword)
GoogleCSE->>GoogleCSE: search(keyword, num_results)
GoogleCSE-->>UI: List of top URLs
loop For each URL
UI->>Extractor: get_and_preprocess_text(url)
Note over Extractor,Preprocessor: Same extraction flow as above
Extractor-->>UI: Extracted text
UI->>Summarizer: summarize_text(text)
Summarizer->>GeminiAI: generate_content(prompt)
GeminiAI-->>Summarizer: Summary
Summarizer-->>UI: Formatted summary
UI-->>User: Display results per URL
end
```

```
┌──────────────────┐
│ Raw Web URL │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Trafilatura │──── Fetches HTML content
│ Fetch & Extract │──── Removes ads, navigation, boilerplate
└────────┬─────────┘──── Extracts main article text
│
▼
┌──────────────────┐
│ Preprocessor │──── clean_text(): Remove HTML entities
│ Clean Text │──── Normalize whitespace
└────────┬─────────┘──── Remove control characters
│
▼
┌──────────────────┐
│ Preprocessor │──── Optional: lowercase conversion
│ Preprocess Text │──── Optional: number removal
└────────┬─────────┘──── Punctuation normalization
│
▼
┌──────────────────┐
│ Summarizer │──── Build prompt based on summary type
│ Build Prompt │──── Apply custom instructions
└────────┬─────────┘──── Configure generation parameters
│
▼
┌──────────────────┐
│ Google Gemini │──── Process with gemini-2.5-flash
│ AI Processing │──── Temperature: 0.3
└────────┬─────────┘──── Max tokens: 20,000
│
▼
┌──────────────────┐
│ Final Summary │──── Formatted based on type
│ Output │──── Ready for display
└──────────────────┘
```

```mermaid
flowchart TD
Start([User Opens Application]) --> Home[Home Page: API Setup]
Home --> SaveKeys{Save API Keys?}
SaveKeys -->|No| WaitKeys[Display Warning]
WaitKeys --> Home
SaveKeys -->|Yes| StoreKeys[Store in Session State]
StoreKeys --> SelectPage{Select Page}
SelectPage -->|URL Summarizer| URLInput[Enter URLs]
SelectPage -->|Keyword Summarizer| KeywordInput[Enter Keyword]
URLInput --> URLConfig[Configure Summary Type]
URLConfig --> URLCustom[Optional: Custom Instructions]
URLCustom --> URLProcess[Process Each URL]
KeywordInput --> KeywordConfig[Select Number of Results]
KeywordConfig --> KeywordType[Configure Summary Type]
KeywordType --> KeywordCustom[Optional: Custom Instructions]
KeywordCustom --> SearchURLs[Fetch URLs from Google CSE]
SearchURLs --> KeywordProcess[Process Each URL]
URLProcess --> Extract[Extract Content]
KeywordProcess --> Extract
Extract --> Validate{Content Extracted?}
Validate -->|No| ShowError[Display Error]
Validate -->|Yes| Preprocess[Clean & Preprocess Text]
Preprocess --> BuildPrompt[Build AI Prompt]
BuildPrompt --> Summarize[Call Gemini AI]
Summarize --> FormatOutput[Format Summary]
FormatOutput --> Display[Display in Collapsible Section]
Display --> MoreURLs{More URLs?}
MoreURLs -->|Yes| Extract
MoreURLs -->|No| Complete([Summarization Complete])
ShowError --> MoreURLs
```
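
Taken together, the stages in the diagrams above reduce to a short script. The following is an illustrative sketch, not the app's actual entry point: the module and function names (extractor.get_and_preprocess_text, summarizer.summarize_text) follow the architecture described earlier, and the API key would come from session state in the real app.

```python
# Hypothetical end-to-end pass over a single URL, following the pipeline above.
from extractor import get_and_preprocess_text
from summarizer import summarize_text

url = "https://example.com/article"      # placeholder URL
text = get_and_preprocess_text(url)      # fetch -> extract -> clean -> preprocess
if text:
    summary = summarize_text(
        text,
        gemini_api_key="YOUR_GEMINI_KEY",  # kept in st.session_state in the app
        summary_type="concise",
        custom_instructions=None,
    )
    print(summary)
else:
    print("Failed to extract text from URL")
```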
1. Input Collection
   - User inputs one or multiple URLs (one per line)
   - Selects a summary type: concise, detailed, or key_points
   - Optionally provides custom instructions
2. Content Extraction
   - Trafilatura fetches the HTML content from each URL
   - Extracts the main article text, filtering out:
     - Navigation menus
     - Advertisements
     - Sidebars
     - Boilerplate content
3. Text Processing
   - clean_text(): Removes HTML entities and control characters
   - preprocess_text(): Normalizes whitespace and punctuation
   - Optionally converts to lowercase (configurable)
4. AI Summarization
   - Constructs a prompt based on the summary type and custom instructions
   - Sends it to Google Gemini AI (gemini-2.5-flash model)
   - Receives a structured summary response
5. Display Results
   - Each URL gets a collapsible expander section
   - Shows both the extracted text and the generated summary
   - Maintains a clean, organized interface
1. Search Configuration
   - User enters a search keyword
   - Specifies the number of top websites to fetch (1-10)
   - Selects a summary type and optional custom instructions
2. URL Discovery
   - The Google Custom Search API queries for the keyword
   - Retrieves the top N URLs based on the user's specification
   - Validates and formats the URL list
3. Batch Processing
   - Iterates through each discovered URL
   - Follows the same extraction → preprocessing → summarization pipeline
   - Displays results progressively as each URL completes
4. Result Organization
   - Each website gets a dedicated collapsible section
   - Shows the extracted text and summary
   - Numbered for easy reference
# streamlit_app.py: Application initialization
- Sets page configuration (title, icon, layout)
- Creates sidebar navigation menu
- Routes to appropriate page based on user selection
- Manages page imports and rendering

# Home page: Handles API key management
- Secure password-type input fields for API keys
- Validates all required keys are provided
- Stores keys in st.session_state for persistence
- Provides user feedback on successful save
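
As a rough sketch (widget labels and session keys here are illustrative, not the app's exact identifiers), the key-saving flow might look like this:

```python
import streamlit as st

# Password-type inputs keep keys masked while typing.
gemini_key = st.text_input("Gemini AI API Key", type="password")
google_key = st.text_input("Google API Key", type="password")
cse_id = st.text_input("Google CSE ID", type="password")

if st.button("Save API Keys"):
    if gemini_key and google_key and cse_id:
        # session_state persists across pages for the current browser session only.
        st.session_state["gemini_api_key"] = gemini_key
        st.session_state["google_api_key"] = google_key
        st.session_state["google_cse_id"] = cse_id
        st.success("API keys saved for this session.")
    else:
        st.warning("Please provide all three keys.")
```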
# URL Summarizer page: URL-based summarization interface
- Multi-line text area for URL input
- Summary type selection dropdown
- Custom instructions text area
- Validates API key presence before processing
- Iterates through URLs with progress indicators
- Displays results in collapsible expanders

# Keyword Summarizer page: Keyword-based summarization interface
- Keyword text input field
- Number input for result count (1-10)
- Summary type selection
- Custom instructions support
- Integrates with Google Custom Search
- Processes discovered URLs automatically

# extractor.py: Web content extraction functions
Functions:
- get_and_preprocess_text(url): Main extraction pipeline
* Fetches URL content via Trafilatura
* Extracts main text content
* Applies cleaning and preprocessing
* Returns processed text or None
- save_extracted_text(text, filename): Persists extracted content
- is_allowed_file(file_name): Validates file extensions
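
A minimal sketch of this pipeline, assuming the two-step Trafilatura flow described above; the cleaning helpers are imported from the project's preprocessor module:

```python
import trafilatura

from preprocessor import clean_text, preprocess_text  # project module, per the architecture

def get_and_preprocess_text(url: str):
    """Fetch a URL, extract the main article text, and normalize it (None on failure)."""
    downloaded = trafilatura.fetch_url(url)     # raw HTML, or None if the fetch fails
    if downloaded is None:
        return None
    raw_text = trafilatura.extract(downloaded)  # main text with boilerplate removed
    if not raw_text:
        return None
    return preprocess_text(clean_text(raw_text))
```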
# preprocessor.py: Text cleaning and normalization
Functions:
- clean_text(text): Basic cleaning operations
* Removes HTML entities (&nbsp;, &amp;, etc.)
* Normalizes whitespace and newlines
* Removes control/non-printable characters
- preprocess_text(text, lowercase, remove_numbers): Advanced processing
* Optional lowercase conversion
* Optional number removal
* Punctuation spacing normalization
- basic_preprocess_pipeline(text): Combined pipeline
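
An illustrative take on clean_text(), assuming the standard-library html and re modules cover the operations listed above:

```python
import html
import re

def clean_text(text: str) -> str:
    """Basic cleaning: decode entities, drop control characters, normalize whitespace."""
    text = html.unescape(text)                            # decode &amp;, &nbsp;, etc.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)  # strip control / non-printable chars
    text = re.sub(r"\s+", " ", text)                      # collapse whitespace and newlines
    return text.strip()
```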
# summarizer.py: Google Gemini AI integration
Function: summarize_text(text, gemini_api_key, summary_type, custom_instructions)
- Configures Gemini API with user's key
- Builds context-aware prompts based on summary type:
* Short Summary: 3-4 sentence concise summary
* Detailed Summary: Paragraph-wise detailed analysis
* Bullet Points: Structured key points format
- Applies custom instructions when provided
- Handles API errors gracefully
- Returns formatted summary text
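
A hedged sketch of the Gemini call, using the google-generativeai package pinned in requirements.txt; the prompt wording is simplified here (see the prompt-assembly sketch in the API Integration section below):

```python
import google.generativeai as genai

def summarize_text(text, gemini_api_key, summary_type="concise", custom_instructions=None):
    """Send text to Gemini and return the summary, degrading gracefully on API errors."""
    genai.configure(api_key=gemini_api_key)
    model = genai.GenerativeModel("gemini-2.5-flash")
    prompt = f"Provide a {summary_type} summary of the following article:\n\n{text}"
    if custom_instructions:
        prompt = f"{custom_instructions}\n\n{prompt}"
    try:
        response = model.generate_content(
            prompt,
            generation_config=genai.types.GenerationConfig(
                temperature=0.3,          # DEFAULT_TEMPERATURE
                max_output_tokens=20000,  # DEFAULT_MAX_TOKENS
            ),
        )
        return response.text
    except Exception as exc:
        return f"Summarization failed: {exc}"
```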
# keyword_search.py: Google Custom Search Engine integration
Function: get_top_urls_from_keyword(keyword, api_key, cse_id, num_results)
- Builds Google Custom Search service
- Executes search query with specified parameters
- Extracts URLs from search results
- Returns list of top N URLs
- Handles API errors and empty results
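
A sketch of the lookup with google-api-python-client; the "customsearch"/"v1" service and cse().list() call are the library's real API, while the error handling here is simplified:

```python
from googleapiclient.discovery import build

def get_top_urls_from_keyword(keyword, api_key, cse_id, num_results=5):
    """Return up to num_results URLs for a keyword via Google Custom Search."""
    service = build("customsearch", "v1", developerKey=api_key)
    response = (
        service.cse()
        .list(q=keyword, cx=cse_id, num=min(num_results, 10))  # the API caps num at 10
        .execute()
    )
    # "items" is absent when the query returns no results.
    return [item["link"] for item in response.get("items", [])]
```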
# utils.py: Support utilities
Functions:
- setup_logging(): Configures application-wide logging
- timestamp(): Generates formatted timestamps
- save_text_to_file(content, folder, prefix): Generic file saving
- read_text_file(file_path): Safe file reading
- save_extracted_text(text, filename): Specialized text saving
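
The file helpers likely reduce to something like this sketch (folder layout per config.py; function names mirror the list above):

```python
from datetime import datetime
from pathlib import Path

def timestamp() -> str:
    """Filesystem-safe timestamp, e.g. 20250101_142501."""
    return datetime.now().strftime("%Y%m%d_%H%M%S")

def save_text_to_file(content: str, folder: str, prefix: str) -> Path:
    """Write content to <folder>/<prefix>_<timestamp>.txt, creating the folder if needed."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{prefix}_{timestamp()}.txt"
    out_path.write_text(content, encoding="utf-8")
    return out_path
```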
# config.py: Centralized configuration
Settings:
- API Configuration: Keys, model names
- Default Values: Temperature, max tokens, summary type
- Directory Paths: Logs, documents, outputs
- Model Settings: Embedding model, generation config
- File Validation: Allowed extensions

Google Gemini AI
Purpose: Generates intelligent summaries using advanced language models
Configuration:
Model: gemini-2.5-flash
Temperature: 0.3 (for consistent, focused outputs)
Max Output Tokens: 20,000

Summary Types:
- Short Summary: 3-4 sentences, concise overview
- Detailed Summary: Paragraph-wise breakdown with comprehensive coverage
- Bullet Points: Structured key points in list format
Custom Instructions: Users can provide natural language guidance to control:
- Tone (formal, casual, technical)
- Focus areas (specific topics or sections)
- Detail level
- Output format preferences
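
To make the prompt construction concrete, here is a hypothetical build_prompt() that maps each summary type to an instruction template and appends any custom guidance; the template wording is illustrative, not the app's exact prompts:

```python
def build_prompt(text, summary_type="concise", custom_instructions=None):
    """Assemble a Gemini prompt from the summary type and optional user guidance."""
    templates = {
        "concise": "Summarize the article below in 3-4 sentences.",
        "detailed": "Write a detailed, paragraph-wise summary of the article below.",
        "key_points": "List the key points of the article below as structured bullets.",
    }
    prompt = templates.get(summary_type, templates["concise"])
    if custom_instructions:
        prompt += f"\nAdditional instructions: {custom_instructions}"
    return f"{prompt}\n\nARTICLE:\n{text}"
```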
Google Custom Search
Purpose: Discovers relevant URLs for keyword-based summarization
Configuration:
API: Custom Search JSON API v1
Results per query: 1-10 (user configurable)

Process:
- User provides search keyword
- API queries custom search engine
- Returns top N URLs with metadata
- URLs passed to extraction pipeline
Required Credentials:
- Google API Key
- Custom Search Engine ID (CSE ID)

Trafilatura
Purpose: Extracts main content from web pages while filtering noise
Features:
- Removes navigation, ads, and boilerplate
- Handles various HTML structures
- Supports multiple languages
- Fast and reliable extraction
- No configuration required
- Launch the application
- Navigate to Home page from sidebar
- Enter the required API credentials: Gemini AI API key, Google API key, and Google CSE ID
- Click "Save API Keys"
- Wait for confirmation message
- Select URL Summarizer from sidebar
- Enter URLs in text area (one URL per line)
- Choose summary type:
- concise: Brief 3-4 sentence overview
- detailed: Comprehensive paragraph-wise summary
- key_points: Structured bullet-point format
- (Optional) Add custom instructions for AI
- Click "Summarize URLs"
- Review results in collapsible sections
- Select Keyword Summarizer from sidebar
- Enter search keyword
- Set number of top websites (1-10)
- Choose summary type
- (Optional) Add custom instructions
- Click "Fetch & Summarize"
- Review discovered URLs and their summaries
- Start with 3-5 URLs for faster processing
- Use custom instructions to focus on specific aspects
- Collapsible sections keep interface clean and organized
- API keys persist for the duration of a session but not between sessions
- Python 3.8 or higher
- pip (Python package manager)
- Internet connection for API access
1. Clone the repository:
   git clone https://github.com/AkshayBasutkar/Web_Summary.git
   cd Web_Summary
2. Create a virtual environment:
   # Linux/macOS
   python -m venv venv
   source venv/bin/activate
   # Windows
   python -m venv venv
   venv\Scripts\activate
3. Install dependencies:
   pip install -r requirements.txt
4. Run the application:
   streamlit run streamlit_app.py
5. Access the application:
   - Open your browser
   - Navigate to http://localhost:8501
   - The application will load automatically
For development with auto-reload:

```bash
streamlit run streamlit_app.py --server.runOnSave true
```

Create a .env file in the project root for default settings:

```
# API Keys (if you want defaults)
GEMINI_API_KEY=your_gemini_key_here
GOOGLE_API_KEY=your_google_key_here
GOOGLE_CSE_ID=your_cse_id_here
# Model Configuration
GEMINI_MODEL_NAME=gemini-2.5-flash
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
# Generation Settings
DEFAULT_TEMPERATURE=0.3
DEFAULT_MAX_TOKENS=20000
DEFAULT_SUMMARY_TYPE=concise
```

Customize behavior in config.py:

```python
# Model Settings
DEFAULT_TEMPERATURE = 0.3 # AI response randomness (0.0-1.0)
DEFAULT_MAX_TOKENS = 20000 # Maximum output length
DEFAULT_SUMMARY_TYPE = 'concise' # Default summary format
# File Management
ALLOWED_EXTENSIONS = [".pdf", ".txt", ".docx", ".csv", ".md"]
# Directory Structure
LOGS_DIR = "logs" # Log file location
DOCUMENTS_DIR = "data/documents" # Extracted text storage
OUTPUTS_DIR = "data/outputs"  # Summary storage
```
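
A minimal sketch of how config.py could pick up these .env defaults, assuming python-dotenv (already in requirements.txt):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root, if present

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY", "")
GEMINI_MODEL_NAME = os.getenv("GEMINI_MODEL_NAME", "gemini-2.5-flash")
DEFAULT_TEMPERATURE = float(os.getenv("DEFAULT_TEMPERATURE", "0.3"))
DEFAULT_MAX_TOKENS = int(os.getenv("DEFAULT_MAX_TOKENS", "20000"))
DEFAULT_SUMMARY_TYPE = os.getenv("DEFAULT_SUMMARY_TYPE", "concise")
```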
Frontend Framework
- Streamlit 1.25.0+: Modern Python web framework for data applications
- Built-in session state management
- Reactive UI updates
- Multi-page support
- Collapsible components
Backend Language
- Python 3.8+: Application logic and data processing
Web Scraping & Extraction
- trafilatura 1.6.3+: High-quality web content extraction
- Removes boilerplate content
- Language detection
- HTML parsing
- Content cleaning
AI & Machine Learning
- google-generativeai 0.4.1+: Google Gemini AI integration
- Advanced language model access
- Prompt engineering support
- Response streaming
- Error handling
Search Integration
- google-api-python-client 2.100.0+: Google Custom Search API
- Programmatic search access
- Result filtering
- Quota management
Text Processing
- regex 2023.13.1+: Advanced pattern matching
- beautifulsoup4 4.12.2+: HTML parsing (auxiliary)
Utilities
- python-dotenv 1.0.0+: Environment variable management
- loguru 0.7.0+: Enhanced logging capabilities
- requests 2.31.0+: HTTP client for web requests
- Git for version control
- Virtual environments for dependency isolation
- pip for package management
- Session-based storage: Keys stored in Streamlit session state (temporary)
- Not persisted: Keys cleared when browser session ends
- Password input fields: Keys hidden during entry
- No file storage: Keys never written to disk in plain text
- Never commit API keys to version control
- Use .env files for local development (add to .gitignore)
- Rotate keys regularly for enhanced security
- Monitor API usage to detect unauthorized access
- Use environment variables in production deployments
- No data storage: Extracted text and summaries not permanently stored by default
- Optional file saving: User-controlled data persistence
- Session isolation: Each user session is independent
- HTTPS recommended: Use secure connections in production
- Google Custom Search: 100 queries/day (free tier)
- Gemini AI: Rate limits apply based on API plan
- Monitor usage: Implement error handling for quota exhaustion
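
One way to soften quota exhaustion is exponential backoff around the API call; this generic helper is an illustration, not part of the codebase:

```python
import time

def call_with_backoff(fn, retries=3, base_delay=2.0):
    """Retry fn() with exponential backoff, re-raising after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # waits 2s, 4s, 8s, ...

# Usage sketch: summary = call_with_backoff(lambda: summarize_text(text, key))
```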
Symptom: Warning message "Gemini API key not set"
Solution:
- Navigate to Home page
- Re-enter all API keys
- Click "Save API Keys"
- Return to desired page
Symptom: "Failed to extract text from URL"
Possible Causes:
- URL is inaccessible or blocked
- Website has anti-scraping measures
- Content is behind a paywall or login
Solutions:
- Verify URL is publicly accessible
- Try different URL from same topic
- Check internet connection
Symptom: "Summarization failed" message
Possible Causes:
- Invalid Gemini API key
- API quota exceeded
- Network connectivity issues
Solutions:
- Verify API key is correct and active
- Check API quota in Google Cloud Console
- Wait and retry after some time
Symptom: Empty results from keyword search
Possible Causes:
- Invalid Google API credentials
- Incorrect CSE ID
- API quota exceeded
- Keyword has no results
Solutions:
- Verify Google API Key and CSE ID
- Check Custom Search Engine configuration
- Try different, more common keywords
- Review API quota limits
Symptom: Streamlit fails to launch
Solutions:

```bash
# Verify Python version
python --version # Should be 3.8+
# Reinstall dependencies
pip install --upgrade -r requirements.txt
# Check port availability
netstat -an | grep 8501
# Try different port
streamlit run streamlit_app.py --server.port 8502
```

Symptom: URLs take a long time to process
Causes:
- Large articles
- Multiple URLs
- API latency
Solutions:
- Process fewer URLs at once (3-5 recommended)
- Choose "concise" summary type for faster results
- Ensure stable internet connection
Documentation: Refer to inline code comments and docstrings
Issues: Report bugs on the GitHub repository's issues page
API Documentation: See the official Google Gemini API and Custom Search JSON API documentation
Complete dependency list (see requirements.txt):

```text
# Web Framework
streamlit>=1.25.0
# Text Extraction
trafilatura>=1.6.3
# AI Summarization
google-generativeai>=0.4.1
# Google Custom Search
google-api-python-client>=2.100.0
# Environment Management
python-dotenv>=1.0.0
# Text Processing
regex>=2023.13.1
# Logging
loguru>=0.7.0
# HTML Parsing
beautifulsoup4>=4.12.2
# HTTP Requests
requests>=2.31.0
```

Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source and available under the MIT License.
- Google Gemini AI for powerful summarization capabilities
- Trafilatura for reliable content extraction
- Streamlit for intuitive UI framework
- Google Custom Search for keyword discovery
For questions or feedback, please open an issue on the GitHub repository.
Built with ❤️ using Streamlit and Google Gemini AI