An automated tool for querying PubMed, extracting article metadata, enriching with journal impact factors, and generating interactive HTML reading lists with sidebar navigation and persistent state management.
- Advanced PubMed Search: Full support for E-utilities query syntax with field tags, boolean operators, and wildcards
- Impact Factor Integration: Automatic scraping of IF and quartile information from ScienceDirect
- Structured Export: Saves metadata to Excel with 11 columns (PMID, Title, Journal, IF, Quartile, Abstract, DOI, etc.)
- Interactive HTML: Night-mode reading list with sidebar navigation, starring, and read tracking
- Night Mode Design: Dark gradient background optimized for comfortable reading
- Keyword Highlighting: Automatic highlighting of search terms in titles (yellow) and abstracts (orange)
- Collapsible Sidebar:
  - Navigate between articles with `Journal. YYYYMMDD` bookmarks
  - Real-time status indicators for starred and read articles
  - Smooth show/hide transitions
- Star System: Mark important papers with persistent state
- Read Tracking: Track reading progress across sessions
- Persistent State: All user interactions are saved in browser localStorage
  - Isolated Storage: Each query has its own localStorage space (v2.1+)
  - No state interference between different HTML files
Install the dependencies:

    pip install biopython pandas openpyxl requests beautifulsoup4 tqdm

- Clone the repository:

      git clone https://github.com/yourusername/grab-pubmed-info.git
      cd grab-pubmed-info

- Open the notebook:

      jupyter notebook pubmed_query.ipynb

- Configure your search (Cell 2):

      api_key = "your_ncbi_api_key"  # Get from https://www.ncbi.nlm.nih.gov/account/
      search_key_words = "(wnt5a NOT cancer) AND fibro*"
      release_date_cutoff = 365  # Papers from the last 365 days
      paper_type = "Journal Article"
      save_path = "./paper_donload/my_query.xlsx"

- Run all cells to:
  - Query PubMed
  - Fetch metadata
  - Scrape impact factors
  - Generate interactive HTML

- Open the HTML file in your browser to start reading!
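For orientation, the settings in the configuration cell correspond roughly to parameters of the NCBI E-utilities `esearch` endpoint. The sketch below builds such a request URL with the standard library only. The function name `build_esearch_url`, the `retmax` default, and the way `paper_type` is appended as a `[Publication Type]` filter are assumptions for illustration; the notebook may assemble its query differently (for example through Biopython).

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(search_key_words, release_date_cutoff,
                      paper_type=None, api_key=None, retmax=100):
    """Return an esearch URL for PubMed (hypothetical helper, not part of the tool)."""
    term = search_key_words
    if paper_type:
        # Assumption: the paper type is applied as a Publication Type filter.
        term = f'({search_key_words}) AND "{paper_type}"[Publication Type]'
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": release_date_cutoff,  # only records from the last N days
        "datetype": "pdat",              # N days counted by publication date
        "retmax": retmax,
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("(wnt5a NOT cancer) AND fibro*", 365,
                        paper_type="Journal Article")
print(url)
```

Printing the URL before running a long query is a cheap way to check that parentheses, field tags, and wildcards survived URL encoding as intended.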
The tool supports full NCBI E-utilities advanced search syntax:
Boolean Operators:

    search_key_words = "wnt5a AND cancer"             # Both terms
    search_key_words = "wnt5a OR wnt7a"               # Either term
    search_key_words = "wnt5a NOT cancer"             # Exclude term
    search_key_words = "(wnt5a OR wnt7a) AND cancer"  # Combined

Field Tags:

    search_key_words = "wnt5a[Title]"                           # Title only
    search_key_words = "wnt5a[Title/Abstract]"                  # Title or Abstract
    search_key_words = "Smith J[Author]"                        # Specific author
    search_key_words = "Nature[Journal]"                        # Specific journal
    search_key_words = "breast cancer AND China[Affiliation]"   # Institution

Wildcards:

    search_key_words = "fibro*"  # Matches: fibroblast, fibrosis, fibrotic, etc.

Generated Excel files have 11 columns:

| Column | Description |
|---|---|
| PMID | PubMed unique identifier |
| Title | Article title |
| Journal | Journal abbreviation |
| IF | Impact Factor (from ScienceDirect) |
| JCR_Quartile | JCR Quartile (Q1/Q2/Q3/Q4) |
| CSA_Quartile | CSA Quartile |
| Top | Top journal indicator |
| Open Access | OA status |
| publish_date | Publication date (YYYYMMDD) |
| Abstract | Full abstract text |
| DOI | Digital Object Identifier |
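Once an export exists, the `IF` and `JCR_Quartile` columns can be used to shortlist papers. The sketch below works on a list of row dictionaries, which could be obtained with `pandas.read_excel(path).to_dict("records")`. The function names, the thresholds, and the sample rows are made up for illustration, and the code assumes `IF` may be empty or non-numeric for journals that were not matched.

```python
import math


def to_float(value):
    """Convert a cell to float, returning None for empty or non-numeric cells."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(number) else number


def shortlist(records, min_if=5.0, quartiles=("Q1",)):
    """Keep rows at or above min_if in the given quartiles, highest IF first."""
    kept = []
    for row in records:
        impact = to_float(row.get("IF"))
        if impact is None or impact < min_if:
            continue
        if row.get("JCR_Quartile") not in quartiles:
            continue
        kept.append(row)
    return sorted(kept, key=lambda r: to_float(r["IF"]), reverse=True)


# Made-up rows for illustration only; real rows carry all 11 columns.
rows = [
    {"PMID": "1", "Journal": "Journal A", "IF": 12.3, "JCR_Quartile": "Q1"},
    {"PMID": "2", "Journal": "Journal B", "IF": "", "JCR_Quartile": "Q1"},
    {"PMID": "3", "Journal": "Journal C", "IF": 3.1, "JCR_Quartile": "Q2"},
    {"PMID": "4", "Journal": "Journal D", "IF": "8.0", "JCR_Quartile": "Q1"},
]
print([row["PMID"] for row in shortlist(rows)])
```

Rows with a missing impact factor are dropped rather than treated as zero, so an unmatched journal never silently passes or fails the threshold on a guessed value.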
Sidebar Navigation:
- Click the `☰` button to toggle the sidebar
- Bookmarks format: `Nat Commun. 20251216`
- Status icons mark starred and read articles

Article Cards:
- Click the star button to mark an important paper (a gold left border appears)
- Click the read button to mark a paper as read (card opacity drops to 0.6)
- All states persist across browser sessions
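The bookmark format described above (`Journal. YYYYMMDD`) can be sketched as a small helper. This is not the tool's actual code: `bookmark_label` and its helpers are hypothetical names, and falling back to "Unknown" whenever either value is missing or malformed is an assumption.

```python
import math


def is_missing(value):
    """True for None, NaN (as pandas uses for empty cells), or blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def clean_date(value):
    """Return an 8-digit YYYYMMDD string, or None if the cell is unusable."""
    if is_missing(value):
        return None
    try:
        digits = str(int(float(value)))  # tolerates 20251216, 20251216.0, "20251216"
    except (TypeError, ValueError, OverflowError):
        return None
    return digits if digits.isdigit() and len(digits) == 8 else None


def bookmark_label(journal, publish_date):
    """Build a sidebar label such as 'Nat Commun. 20251216'."""
    date_text = clean_date(publish_date)
    if is_missing(journal) or date_text is None:
        return "Unknown"  # assumed fallback for incomplete rows
    return f"{str(journal).strip()}. {date_text}"


print(bookmark_label("Nat Commun", 20251216))
```

Normalizing the date through `float` first matters because spreadsheet libraries often hand numeric cells back as floats, which would otherwise render as `20251216.0`.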
Update impact factors for existing Excel files without re-querying PubMed:

    from pubmed_utils import pubmed_utils

    utils = pubmed_utils()
    utils.embed_IF_into_excel('./paper_donload/existing_file.xlsx')

Process multiple queries:

    # Assumes generate_reading_list is importable from html_generate.py,
    # and that utils and api_key are already defined as in the steps above.
    from html_generate import generate_reading_list

    queries = [
        ("wnt5a AND fibrosis", "wnt5a_fibrosis.xlsx"),
        ("wnt7a AND regeneration", "wnt7a_regen.xlsx"),
    ]
    for keywords, path in queries:
        # Query PubMed, add impact factors, then build the HTML reading list
        utils.get_main_info_into_excel(api_key, keywords, 365, "Journal Article", None, path)
        utils.embed_IF_into_excel(path)
        generate_reading_list(path, path.replace('.xlsx', '_reading_list.html'))

Modify html_generate.py to customize:
- Colors (CSS variables in the `<style>` section)
- Layout (adjust the `.card` and `.sidebar` styles)
- Highlighting patterns (the `_build_pattern_from_query()` function)
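When customizing highlighting, it may help to see how a search query can be turned into a regular expression. The sketch below is not the body of `_build_pattern_from_query()`, whose logic is not shown in this README; `build_highlight_pattern` and `highlight` are hypothetical names, and the rules used (drop field tags, drop uppercase boolean operators, expand a trailing `*` to `\w*`) are assumptions.

```python
import re

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def build_highlight_pattern(query):
    """Compile a case-insensitive pattern matching the search terms in a query."""
    cleaned = re.sub(r"\[[^\]]*\]", " ", query)  # remove tags like [Title]
    cleaned = re.sub(r'[()"]', " ", cleaned)      # remove grouping characters
    parts = []
    for term in cleaned.split():
        if term in BOOLEAN_OPERATORS:
            continue
        stem = term.rstrip("*")
        if not stem:
            continue
        if term.endswith("*"):
            parts.append(re.escape(stem) + r"\w*")  # fibro* -> fibroblast, ...
        else:
            parts.append(re.escape(stem))
    if not parts:
        return re.compile(r"(?!)")  # a pattern that never matches
    parts.sort(key=len, reverse=True)  # try longer alternatives first
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)


def highlight(text, pattern, color="yellow"):
    """Wrap each match in a <mark> tag; text is assumed to be HTML-escaped already."""
    return pattern.sub(
        lambda m: f'<mark style="background:{color}">{m.group(0)}</mark>', text
    )


pattern = build_highlight_pattern("(wnt5a NOT cancer) AND fibro*")
print(highlight("Wnt5a drives fibroblast activation", pattern))
```

In this sketch, terms that follow `NOT` are still highlighted; excluding them would require tracking operator context while scanning the query.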
Project structure:

    grab-pubmed-info-master/
    ├── pumbed_query.ipynb        # Main workflow notebook (start here)
    ├── pubmed_utils.py           # PubMed API & IF scraping logic
    ├── html_generate.py          # HTML generation with interactivity
    ├── paper_donload/            # Output directory (auto-created)
    │   ├── *.xlsx                # Excel files with metadata
    │   └── *_reading_list.html   # Interactive HTML reading lists
    ├── README.md                 # This file
    ├── LICENSE                   # MIT License
    └── requirements.txt          # Python dependencies

Problem: NCBI API rate limit exceeded
Solution: Get a free API key from https://www.ncbi.nlm.nih.gov/account/ (raises the limit from 3 to 10 requests per second)

Problem: Empty IF column in Excel
Solution: The journal name probably did not match. Use the `refine_IF_matching()` method to correct it manually.

Problem: HTML buttons not clickable
Solution: Make sure you are using a modern browser (Chrome/Firefox/Edge), and check the browser console for JavaScript errors.

Problem: Sidebar bookmarks show "Unknown"
Solution: Verify that the Excel file has the `Journal` and `publish_date` columns populated.

Problem: HTML not updating after code changes
Solution: Call `importlib.reload(html_generate)` before calling `generate_reading_list()`.

Problem: Star/read marks appear in wrong HTML file
Solution: Update to v2.1+. Each HTML file now uses isolated localStorage. Regenerate your HTML files to apply the fix.
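The rate limits mentioned in the first entry above (3 requests per second without a key, 10 with one) can also be respected on the client side. The following is a minimal throttling sketch, not code from `pubmed_utils.py`; the `RateLimiter` class and `limiter_for` helper are hypothetical, and the injectable `clock` and `sleep` arguments exist only to make the behavior easy to test.

```python
import time


class RateLimiter:
    """Space out calls so that at most max_per_second happen per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def limiter_for(api_key):
    """Pick the limit from the troubleshooting note: 10/s with a key, 3/s without."""
    return RateLimiter(10 if api_key else 3)


limiter = limiter_for(api_key=None)
for _ in range(3):
    limiter.wait()  # call this right before each request to NCBI
```

Calling `wait()` immediately before every request keeps the spacing honest even when individual requests take varying amounts of time.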
Found a bug? Please open an issue with:
- Error message
- Python version
- Browser (for HTML issues)
- Minimal reproducible example
Bug Fix: localStorage State Isolation
- 🔧 Fixed localStorage state sharing between different query HTML files
- ✨ Each HTML file now uses unique storage keys based on its filename
- 🎯 Prevents star/read marks from interfering across different queries

⚠️ Note: Existing HTML files need to be regenerated (old states are not preserved)
Technical Details:
- Added `storage_key_suffix` extraction from output filename
- Injected `STORAGE_KEY_PREFIX` constant in JavaScript
- Updated all `localStorage` API calls to use dynamic keys
- Example keys: `starred_wnt5a_reading_list`, `read_breast_cancer_reading_list`
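The key scheme listed above can be illustrated on the Python side. This sketch only reproduces the example keys from the output filename; the helper names `storage_key_suffix` (as a function) and `storage_keys`, the sanitizing rule, and the `"default"` fallback are assumptions, and the actual generator may derive its keys differently.

```python
import re
from pathlib import Path


def storage_key_suffix(output_path):
    """Derive a localStorage-safe suffix from the HTML output filename."""
    stem = Path(output_path).stem  # e.g. "wnt5a_reading_list"
    suffix = re.sub(r"[^0-9A-Za-z_]+", "_", stem).strip("_")
    return suffix or "default"     # assumed fallback for odd filenames


def storage_keys(output_path):
    """Return the per-file keys used for starred and read state."""
    suffix = storage_key_suffix(output_path)
    return {"starred": f"starred_{suffix}", "read": f"read_{suffix}"}


print(storage_keys("./paper_donload/wnt5a_reading_list.html"))
```

Because the suffix comes from the filename, two reading lists saved under the same name in different folders would still share state when opened from the same browser origin.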
Major Features:
- Added collapsible sidebar navigation with bookmark links
- Implemented star and read marking with persistent state
- Real-time status synchronization between article cards and sidebar
- Optimized for GitHub with comprehensive documentation
- Simplified Excel column names for better compatibility
- PubMed query and metadata extraction
- Impact factor scraping from ScienceDirect
- Basic HTML generation with keyword highlighting
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
If you use this tool in your research, please cite:

    @software{pubmed_info_grabber2025,
      author = {Li, Xiang and GitHub Copilot},
      title = {PubMed Literature Retrieval and Interactive Reading List Generator},
      year = {2025},
      url = {https://github.com/GatewayPhd/GrabPubmed}
    }

This project is licensed under the MIT License. See the LICENSE file for details.
- Li Xiang - Initial work and concept
- GitHub Copilot - Interactive HTML features, code optimization
- NCBI for providing the E-utilities API
- ScienceDirect for impact factor data
- The Python scientific computing community
