PubMed Literature Retrieval and Interactive Reading List Generator

An automated tool for querying PubMed, extracting article metadata, enriching with journal impact factors, and generating interactive HTML reading lists with sidebar navigation and persistent state management.

✨ Features

Core Functionality

🔍 Advanced PubMed Search: Full support for E-utilities query syntax with field tags, boolean operators, and wildcards
📊 Impact Factor Integration: Automatic scraping of IF and Quartile information from ScienceDirect
📁 Structured Export: Saves metadata to Excel with 11 columns (PMID, Title, Journal, IF, Quartile, Abstract, DOI, etc.)
🌐 Interactive HTML: Beautiful night-mode reading list with full interactivity

HTML Reading List Features

🌙 Night Mode Design: Dark gradient background optimized for comfortable reading
🎨 Keyword Highlighting: Automatic highlighting of search terms in titles (yellow) and abstracts (orange)
📑 Collapsible Sidebar:
- Navigate between articles using bookmarks labeled Journal. YYYYMMDD (e.g. Nat Commun. 20251216)
- Real-time status indicators: ⭐ (starred), ✓ (read)
- Smooth show/hide transitions
⭐ Star System: Mark important papers with persistent state
✓ Read Tracking: Track reading progress across sessions
💾 Persistent State: All user interactions saved in browser localStorage
- Isolated Storage: Each query has independent localStorage space (v2.1+)
- No state interference between different HTML files

🚀 Quick Start

Prerequisites

pip install biopython pandas openpyxl requests beautifulsoup4 tqdm

Basic Usage

Clone the repository:

git clone https://github.com/yourusername/grab-pubmed-info.git
cd grab-pubmed-info

Open the notebook:

jupyter notebook pubmed_query.ipynb

Configure your search (Cell 2):

api_key = "your_ncbi_api_key"  # Get from https://www.ncbi.nlm.nih.gov/account/
search_key_words = "(wnt5a NOT cancer) AND fibro*"
release_date_cutoff = 365  # Papers from last year
paper_type = "Journal Article"
save_path = "./paper_donload/my_query.xlsx"

Run all cells to:
- Query PubMed
- Fetch metadata
- Scrape impact factors
- Generate interactive HTML
Open the HTML file in your browser to start reading!

📚 Documentation

PubMed Query Syntax

The tool supports full NCBI E-utilities advanced search syntax:

Boolean Operators:

search_key_words = "wnt5a AND cancer"        # Both terms
search_key_words = "wnt5a OR wnt7a"          # Either term
search_key_words = "wnt5a NOT cancer"        # Exclude term
search_key_words = "(wnt5a OR wnt7a) AND cancer"  # Combined

Field Tags:

search_key_words = "wnt5a[Title]"                        # Title only
search_key_words = "wnt5a[Title/Abstract]"               # Title or Abstract
search_key_words = "Smith J[Author]"                     # Specific author
search_key_words = "Nature[Journal]"                     # Specific journal
search_key_words = "breast cancer AND China[Affiliation]"  # Institution

Wildcards:

search_key_words = "fibro*"  # Matches: fibroblast, fibrosis, fibrotic, etc.
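
For readers curious how such a query string reaches NCBI, the sketch below builds an E-utilities ESearch URL by hand. It is an illustration only: the function name build_esearch_url is hypothetical, and the notebook's own code in pubmed_utils.py may construct the request differently (for example through Biopython). The parameters db, term, reldate, datetype, retmax, retmode, and api_key are standard ESearch parameters; mapping release_date_cutoff to reldate and paper_type to a [Publication Type] tag is an assumption.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, days_back, paper_type=None, api_key=None, retmax=100):
    """Build an ESearch URL for PubMed (hypothetical helper, not part of this repo)."""
    if paper_type:
        # Assumption: the paper type filter is expressed as a field tag.
        term = f'({term}) AND "{paper_type}"[Publication Type]'
    params = {
        "db": "pubmed",
        "term": term,            # boolean operators, field tags, wildcards
        "reldate": days_back,    # only records from the last N days
        "datetype": "pdat",      # apply reldate to the publication date
        "retmax": retmax,        # maximum number of PMIDs returned
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key
    # urlencode percent-encodes parentheses, brackets, quotes, and the wildcard
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("(wnt5a NOT cancer) AND fibro*", days_back=365)
print(url)
```

Passing the raw query through urlencode means search_key_words can be written exactly as in the PubMed search box, without manual escaping.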

Excel Column Schema

Generated Excel files have 11 columns:

Column	Description
PMID	PubMed unique identifier
Title	Article title
Journal	Journal abbreviation
IF	Impact Factor (from ScienceDirect)
JCR_Quartile	JCR Quartile (Q1/Q2/Q3/Q4)
CSA_Quartile	CSA Quartile
Top	Top journal indicator
Open Access	OA status
publish_date	Publication date (YYYYMMDD)
Abstract	Full abstract text
DOI	Digital Object Identifier
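
When a hand-edited Excel file is fed back into the HTML generator, a quick header check can catch missing columns early. The following is a minimal sketch based only on the column names in the table above; missing_columns is a hypothetical helper, not a function provided by this repository.

```python
# Column names taken from the schema table above.
EXPECTED_COLUMNS = [
    "PMID", "Title", "Journal", "IF", "JCR_Quartile", "CSA_Quartile",
    "Top", "Open Access", "publish_date", "Abstract", "DOI",
]


def missing_columns(header):
    """Return the expected columns that are absent from a header row."""
    present = {str(name).strip() for name in header}
    return [column for column in EXPECTED_COLUMNS if column not in present]


# Example: a file that lost its date and DOI columns
print(missing_columns(["PMID", "Title", "Journal", "IF", "JCR_Quartile",
                       "CSA_Quartile", "Top", "Open Access", "Abstract"]))
```

With pandas, the header row could be obtained as list(pd.read_excel(path).columns) before calling the helper.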

HTML Interface Guide

Sidebar Navigation:

Click ☰ button to toggle sidebar
Bookmarks format: Nat Commun. 20251216
⭐ = Starred articles
✓ = Read articles
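
The bookmark format can be sketched as a small function of the Journal and publish_date columns. This is an assumption about the behavior, not the code in html_generate.py: bookmark_label is a hypothetical name, and the fallback to "Unknown" for empty cells is inferred from the troubleshooting entry about sidebar bookmarks.

```python
import math


def _clean(value):
    """Return a stripped string, or '' for None / NaN (empty Excel cells)."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value).strip()


def bookmark_label(journal, publish_date):
    """Format a sidebar bookmark as 'Journal. YYYYMMDD' (hypothetical helper)."""
    journal_text = _clean(journal)
    date_text = _clean(publish_date)
    if not journal_text or not date_text:
        # Assumed fallback when either column is empty
        return "Unknown"
    return f"{journal_text}. {date_text}"


print(bookmark_label("Nat Commun", 20251216))
```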

Article Cards:

Click ⭐ to star important papers (gold left border appears)
Click ✓ to mark as read (card opacity reduces to 0.6)
All states persist across browser sessions

🛠️ Advanced Usage

Independent IF Update

Update impact factors for existing Excel files without re-querying PubMed:

from pubmed_utils import pubmed_utils

utils = pubmed_utils()
utils.embed_IF_into_excel('./paper_donload/existing_file.xlsx')

Batch Processing

Process multiple queries:

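from pubmed_utils import pubmed_utils
from html_generate import generate_reading_list

api_key = "your_ncbi_api_key"
utils = pubmed_utils()

# Each entry is (search keywords, output Excel path)
queries = [
    ("wnt5a AND fibrosis", "wnt5a_fibrosis.xlsx"),
    ("wnt7a AND regeneration", "wnt7a_regen.xlsx"),
]

for keywords, path in queries:
    # 365 is the look-back window in days (release_date_cutoff)
    utils.get_main_info_into_excel(api_key, keywords, 365, "Journal Article", None, path)
    utils.embed_IF_into_excel(path)
    generate_reading_list(path, path.replace('.xlsx', '_reading_list.html'))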

Custom HTML Styling

Modify html_generate.py to customize:

Colors (CSS variables in <style> section)
Layout (adjust .card, .sidebar styles)
Highlighting patterns (_build_pattern_from_query() function)
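
As a starting point for custom highlighting, here is one way a PubMed query could be turned into a regular expression. It is a simplified sketch and not the repository's _build_pattern_from_query(): the names build_highlight_pattern and highlight are hypothetical, and the rules (drop field tags, skip the single term after NOT, expand a trailing * to any word characters) are assumptions.

```python
import re

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def build_highlight_pattern(query):
    """Compile a case-insensitive pattern from a PubMed-style query."""
    cleaned = re.sub(r"\[[^\]]*\]", " ", query)   # drop field tags like [Title]
    cleaned = re.sub(r'[()"]', " ", cleaned)      # drop grouping and quotes
    terms = []
    skip_next = False
    for token in cleaned.split():
        if token in BOOLEAN_OPERATORS:
            # Only the first word after NOT is skipped in this sketch
            skip_next = token == "NOT"
            continue
        if skip_next:
            skip_next = False
            continue
        stem = token.rstrip("*")
        if not stem:
            continue
        if token.endswith("*"):
            terms.append(re.escape(stem) + r"\w*")  # fibro* -> fibrosis, fibrotic
        else:
            terms.append(re.escape(stem))
    if not terms:
        return re.compile(r"(?!)")                # pattern that never matches
    terms.sort(key=len, reverse=True)             # prefer longer alternatives
    return re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)


def highlight(text, pattern, color):
    """Wrap every match in a <mark> tag with the given background color."""
    return pattern.sub(
        lambda m: f'<mark style="background:{color}">{m.group(0)}</mark>', text
    )


pattern = build_highlight_pattern("(wnt5a NOT cancer) AND fibro*")
print(highlight("Wnt5a drives fibrosis in cancer", pattern, "yellow"))
```

Passing "yellow" for titles and "orange" for abstracts would mirror the color scheme described under HTML Reading List Features.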

📊 Project Structure

grab-pubmed-info-master/
├── pubmed_query.ipynb          # Main workflow notebook (⭐ Start here)
├── pubmed_utils.py             # PubMed API & IF scraping logic
├── html_generate.py            # HTML generation with interactivity
├── paper_donload/              # Output directory (auto-created)
│   ├── *.xlsx                  # Excel files with metadata
│   └── *_reading_list.html     # Interactive HTML reading lists
├── README.md                   # This file
├── LICENSE                     # MIT License
└── requirements.txt            # Python dependencies

🐛 Troubleshooting

Common Issues

Problem: NCBI API rate limit exceeded
Solution: Get a free API key from https://www.ncbi.nlm.nih.gov/account/ (raises the limit from 3 to 10 requests per second)
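
If custom code calls NCBI in a loop, spacing the requests keeps it under the limit. The sketch below is a generic client-side throttle, assuming the 3 and 10 requests per second limits quoted above; RateLimiter and limiter_for are hypothetical names, and nothing here describes how pubmed_utils.py paces its own requests.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock      # injectable for testing
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def limiter_for(api_key):
    """Pick the limit: 10 requests/sec with an API key, 3 without."""
    return RateLimiter(10 if api_key else 3)


limiter = limiter_for(api_key=None)
for pmid in ["1", "2", "3"]:     # placeholder IDs for illustration
    limiter.wait()               # call this right before each HTTP request
```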

Problem: Empty IF column in Excel
Solution: The journal name did not match any entry in the impact factor source. Use the refine_IF_matching() method to correct it manually.
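
Many mismatches come from punctuation or capitalization differences between PubMed abbreviations and the impact factor table. The sketch below shows one generic normalization approach. It is not the logic of refine_IF_matching(): the names normalize_journal and lookup_impact_factor are hypothetical, and the table contents are placeholder values rather than real impact factors.

```python
import re


def normalize_journal(name):
    """Lowercase, expand '&', and strip punctuation and extra spaces."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())


def lookup_impact_factor(journal, table):
    """Look up a journal in a {name: value} table using normalized names."""
    index = {normalize_journal(key): value for key, value in table.items()}
    return index.get(normalize_journal(journal))


# Placeholder data for illustration only
table = {"Journal of Examples & Tests": 1.0}
print(lookup_impact_factor("journal of examples and tests.", table))
```

Names that still fail to match after normalization are the ones that need manual correction.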

Problem: HTML buttons not clickable
Solution: Ensure you're using a modern browser (Chrome/Firefox/Edge). Check browser console for JavaScript errors.

Problem: Sidebar bookmarks show "Unknown"
Solution: Verify Excel has Journal and publish_date columns properly populated

Problem: HTML not updating after code changes
Solution: Use importlib.reload(html_generate) before calling generate_reading_list()

Problem: Star/read marks appear in wrong HTML file
Solution: Update to v2.1+. Each HTML now uses isolated localStorage. Regenerate HTML files to fix.

Error Reporting

Found a bug? Please open an issue with:

Error message
Python version
Browser (for HTML issues)
Minimal reproducible example

📋 Changelog

Version 2.1 (2025-12-23)

Bug Fix: localStorage State Isolation

🔧 Fixed localStorage state sharing between different query HTML files
✨ Each HTML file now uses unique storage keys based on filename
🎯 Prevents star/read marks from interfering across different queries
⚠️ Note: Existing HTML files will need regeneration (old states not preserved)

Technical Details:

Added storage_key_suffix extraction from output filename
Injected STORAGE_KEY_PREFIX constant in JavaScript
Updated all localStorage API calls to use dynamic keys
Example keys: starred_wnt5a_reading_list, read_breast_cancer_reading_list
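
The key derivation can be illustrated in a few lines of Python. This is a sketch inferred from the example keys above, not the code in html_generate.py: the function names and the rule for replacing unusual characters with underscores are assumptions.

```python
import re
from pathlib import Path


def storage_key_suffix(output_path):
    """Derive a localStorage-safe suffix from the HTML output filename."""
    stem = Path(output_path).stem                    # drop directory and .html
    return re.sub(r"[^A-Za-z0-9_]+", "_", stem).strip("_")


def storage_keys(output_path):
    """Return the per-file keys for starred and read state."""
    suffix = storage_key_suffix(output_path)
    return {"starred": f"starred_{suffix}", "read": f"read_{suffix}"}


print(storage_keys("./paper_donload/wnt5a_reading_list.html"))
```

Because the suffix comes from the filename, two HTML files only share state if they are generated under the same name.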

Version 2.0 (2025-12-22)

Major Features:

Added collapsible sidebar navigation with bookmark links
Implemented star and read marking with persistent state
Real-time status synchronization between article cards and sidebar
Optimized for GitHub with comprehensive documentation
Simplified Excel column names for better compatibility

Version 1.0 (Original)

PubMed query and metadata extraction
Impact factor scraping from ScienceDirect
Basic HTML generation with keyword highlighting

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

📝 Citation

If you use this tool in your research, please cite:

@software{pubmed_info_grabber2025,
  author = {Li, Xiang and GitHub Copilot},
  title = {PubMed Literature Retrieval and Interactive Reading List Generator},
  year = {2025},
  url = {https://github.com/GatewayPhd/GrabPubmed}
}

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors

李想 (Li Xiang) - Initial work and concept
GitHub Copilot - Interactive HTML features, code optimization

🙏 Acknowledgments

NCBI for providing the E-utilities API
ScienceDirect for impact factor data
The Python scientific computing community
