GatewayPhd/GrabPubmed

PubMed Literature Retrieval and Interactive Reading List Generator

An automated tool for querying PubMed, extracting article metadata, enriching with journal impact factors, and generating interactive HTML reading lists with sidebar navigation and persistent state management.

✨ Features

Core Functionality

  • 🔍 Advanced PubMed Search: Full support for E-utilities query syntax with field tags, Boolean operators, and wildcards
  • 📊 Impact Factor Integration: Automatic scraping of IF and quartile information from ScienceDirect
  • 📝 Structured Export: Saves metadata to Excel with 11 columns (PMID, Title, Journal, IF, Quartile, Abstract, DOI, etc.)
  • 🌐 Interactive HTML: Night-mode reading list with sidebar navigation, starring, and read tracking

HTML Reading List Features

  • 🌙 Night Mode Design: Dark gradient background optimized for comfortable reading
  • 🎨 Keyword Highlighting: Automatic highlighting of search terms in titles (yellow) and abstracts (orange)
  • 📑 Collapsible Sidebar:
    • Navigate between articles with Journal. YYYYMMDD bookmarks
    • Real-time status indicators: ⭐ (starred), ✓ (read)
    • Smooth show/hide transitions
  • ⭐ Star System: Mark important papers with persistent state
  • ✓ Read Tracking: Track reading progress across sessions
  • 💾 Persistent State: All user interactions saved in browser localStorage
    • Isolated Storage: Each query has independent localStorage space (v2.1+)
    • No state interference between different HTML files

🚀 Quick Start

Prerequisites

pip install biopython pandas openpyxl requests beautifulsoup4 tqdm

Basic Usage

  1. Clone the repository:
git clone https://github.com/yourusername/grab-pubmed-info.git
cd grab-pubmed-info
  2. Open the notebook:
jupyter notebook pubmed_query.ipynb
  3. Configure your search (Cell 2):
api_key = "your_ncbi_api_key"  # Get from https://www.ncbi.nlm.nih.gov/account/
search_key_words = "(wnt5a NOT cancer) AND fibro*"
release_date_cutoff = 365  # Papers from last year
paper_type = "Journal Article"
save_path = "./paper_donload/my_query.xlsx"
  4. Run all cells to:
    • Query PubMed
    • Fetch metadata
    • Scrape impact factors
    • Generate interactive HTML
  5. Open the HTML file in your browser to start reading!

📚 Documentation

PubMed Query Syntax

The tool supports full NCBI E-utilities advanced search syntax:

Boolean Operators:

search_key_words = "wnt5a AND cancer"        # Both terms
search_key_words = "wnt5a OR wnt7a"          # Either term
search_key_words = "wnt5a NOT cancer"        # Exclude term
search_key_words = "(wnt5a OR wnt7a) AND cancer"  # Combined

Field Tags:

search_key_words = "wnt5a[Title]"                        # Title only
search_key_words = "wnt5a[Title/Abstract]"               # Title or Abstract
search_key_words = "Smith J[Author]"                     # Specific author
search_key_words = "Nature[Journal]"                     # Specific journal
search_key_words = "breast cancer AND China[Affiliation]"  # Institution

Wildcards:

search_key_words = "fibro*"  # Matches: fibroblast, fibrosis, fibrotic, etc.
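
As a rough illustration of how such a query string could be turned into an E-utilities request, the sketch below assembles the parameters for an ESearch call. The helper name `build_esearch_params` and the way `paper_type` is folded into the term are assumptions made for this example, not the notebook's actual code; `db`, `term`, `reldate`, `datetype`, and `retmax` are standard ESearch parameters.

```python
def build_esearch_params(search_key_words, release_date_cutoff=None,
                         paper_type=None, retmax=200):
    """Assemble ESearch parameters for a PubMed query (hypothetical helper)."""
    term = search_key_words.strip()
    if paper_type:
        # Restrict by publication type using the [Publication Type] field tag.
        term = f'({term}) AND "{paper_type}"[Publication Type]'
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    if release_date_cutoff:
        # reldate keeps only records dated within the last N days,
        # measured against the date field named by datetype.
        params["reldate"] = int(release_date_cutoff)
        params["datetype"] = "pdat"  # publication date
    return params


params = build_esearch_params("(wnt5a NOT cancer) AND fibro*", 365, "Journal Article")
print(params["term"])
```

With Biopython installed, a dictionary like this could be passed along as `Entrez.esearch(**params)` once `Entrez.email` and `Entrez.api_key` are set.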

Excel Column Schema

Generated Excel files have 11 columns:

| Column | Description |
|---|---|
| PMID | PubMed unique identifier |
| Title | Article title |
| Journal | Journal abbreviation |
| IF | Impact Factor (from ScienceDirect) |
| JCR_Quartile | JCR Quartile (Q1/Q2/Q3/Q4) |
| CSA_Quartile | CSA Quartile |
| Top | Top journal indicator |
| Open Access | OA status |
| publish_date | Publication date (YYYYMMDD) |
| Abstract | Full abstract text |
| DOI | Digital Object Identifier |
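
A quick way to confirm that an existing file follows this schema is to compare its header row against the 11 column names listed above. The following is a minimal sketch; the function name `missing_columns` is hypothetical and not part of `pubmed_utils`.

```python
# Column names copied from the schema listed above.
EXPECTED_COLUMNS = [
    "PMID", "Title", "Journal", "IF", "JCR_Quartile", "CSA_Quartile",
    "Top", "Open Access", "publish_date", "Abstract", "DOI",
]


def missing_columns(header):
    """Return the expected columns absent from a header row, in schema order."""
    present = {str(name).strip() for name in header}
    return [col for col in EXPECTED_COLUMNS if col not in present]


header = ["PMID", "Title", "Journal", "publish_date", "Abstract", "DOI"]
print(missing_columns(header))
```

With pandas, the header could come from `pd.read_excel(path).columns`; an empty result means every expected column is present.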

HTML Interface Guide

Sidebar Navigation:

  • Click the ☰ button to toggle the sidebar
  • Bookmark format: Nat Commun. 20251216
  • ⭐ = Starred articles
  • ✓ = Read articles

Article Cards:

  • Click ⭐ to star important papers (gold left border appears)
  • Click ✓ to mark as read (card opacity reduces to 0.6)
  • All states persist across browser sessions
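
The bookmark format described under Sidebar Navigation can be sketched in a few lines. This is an illustration only: the function name `bookmark_label` is hypothetical, and falling back to "Unknown" when either field is missing or malformed is an assumption based on the troubleshooting notes, not a confirmed detail of `html_generate.py`.

```python
def bookmark_label(journal, publish_date):
    """Format a sidebar bookmark as 'Journal. YYYYMMDD' (illustrative only)."""
    journal = (journal or "").strip().rstrip(".")
    date = str(publish_date or "").strip()
    # Assumed fallback: anything without a journal or an 8-digit date is "Unknown".
    if not journal or not (len(date) == 8 and date.isdigit()):
        return "Unknown"
    return f"{journal}. {date}"


print(bookmark_label("Nat Commun", "20251216"))
print(bookmark_label("", None))
```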

🛠️ Advanced Usage

Independent IF Update

Update impact factors for existing Excel files without re-querying PubMed:

from pubmed_utils import pubmed_utils

utils = pubmed_utils()
utils.embed_IF_into_excel('./paper_donload/existing_file.xlsx')

Batch Processing

Process multiple queries:

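from pubmed_utils import pubmed_utils
from html_generate import generate_reading_list  # assumed import path

utils = pubmed_utils()
api_key = "your_ncbi_api_key"

# Each entry pairs a PubMed query with the Excel file that stores its results
queries = [
    ("wnt5a AND fibrosis", "wnt5a_fibrosis.xlsx"),
    ("wnt7a AND regeneration", "wnt7a_regen.xlsx"),
]

for keywords, path in queries:
    # 365 is the release date cutoff in days; "Journal Article" is the paper type
    utils.get_main_info_into_excel(api_key, keywords, 365, "Journal Article", None, path)
    utils.embed_IF_into_excel(path)
    generate_reading_list(path, path.replace('.xlsx', '_reading_list.html'))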

Custom HTML Styling

Modify html_generate.py to customize:

  • Colors (CSS variables in <style> section)
  • Layout (adjust .card, .sidebar styles)
  • Highlighting patterns (_build_pattern_from_query() function)
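
To give a sense of what a query-to-pattern step might involve, here is a minimal sketch that turns a PubMed-style query into a case-insensitive regular expression. It is not the repository's `_build_pattern_from_query()`; the name `build_highlight_pattern` is hypothetical, and the sketch makes simplifying assumptions: multi-word phrases are split into single words, and terms excluded with NOT are still treated as highlightable.

```python
import re

BOOLEAN_WORDS = {"AND", "OR", "NOT"}


def build_highlight_pattern(query):
    """Turn a PubMed-style query into a case-insensitive regex (a sketch)."""
    # Drop field tags such as [Title/Abstract], then grouping and quote characters.
    cleaned = re.sub(r"\[[^\]]*\]", " ", query)
    cleaned = re.sub(r'[()"]', " ", cleaned)
    terms = []
    for token in cleaned.split():
        if token in BOOLEAN_WORDS:
            continue
        # Escape the literal part, then let a trailing * match the rest of the word.
        escaped = re.escape(token.rstrip("*"))
        if not escaped:
            continue
        suffix = r"\w*" if token.endswith("*") else ""
        terms.append(escaped + suffix)
    if not terms:
        return None
    # Longest alternatives first so longer terms win over their prefixes.
    terms.sort(key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)


pattern = build_highlight_pattern("(wnt5a NOT cancer) AND fibro*")
highlighted = pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>",
                          "Wnt5a drives fibroblast activation")
print(highlighted)
```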

πŸ“Š Project Structure

grab-pubmed-info-master/
├── pumbed_query.ipynb          # Main workflow notebook (⭐ Start here)
├── pubmed_utils.py             # PubMed API & IF scraping logic
├── html_generate.py            # HTML generation with interactivity
├── paper_donload/              # Output directory (auto-created)
│   ├── *.xlsx                  # Excel files with metadata
│   └── *_reading_list.html     # Interactive HTML reading lists
├── README.md                   # This file
├── LICENSE                     # MIT License
└── requirements.txt            # Python dependencies

🐛 Troubleshooting

Common Issues

Problem: NCBI API rate limit exceeded
Solution: Get a free API key from https://www.ncbi.nlm.nih.gov/account/ (raises the limit from 3 to 10 requests per second)
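
If requests are issued in a loop, spacing them out on the client side also helps stay under these limits. The following is a minimal sketch of such a throttle; the `Throttle` class and `throttle_for` helper are hypothetical and are not part of `pubmed_utils`.

```python
import time


class Throttle:
    """Space out calls so at most `rate` happen per second (a simple sketch)."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def throttle_for(api_key=None):
    # NCBI allows 3 requests per second without a key and 10 with one.
    return Throttle(10 if api_key else 3)
```

Calling `throttle.wait()` immediately before each request keeps consecutive calls at least `min_interval` seconds apart.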

Problem: Empty IF column in Excel
Solution: The journal name did not match; use the refine_IF_matching() method to correct it manually

Problem: HTML buttons not clickable
Solution: Ensure you're using a modern browser (Chrome/Firefox/Edge). Check browser console for JavaScript errors.

Problem: Sidebar bookmarks show "Unknown"
Solution: Verify Excel has Journal and publish_date columns properly populated

Problem: HTML not updating after code changes
Solution: Use importlib.reload(html_generate) before calling generate_reading_list()

Problem: Star/read marks appear in wrong HTML file
Solution: Update to v2.1+. Each HTML now uses isolated localStorage. Regenerate HTML files to fix.

Error Reporting

Found a bug? Please open an issue with:

  • Error message
  • Python version
  • Browser (for HTML issues)
  • Minimal reproducible example

📋 Changelog

Version 2.1 (2025-12-23)

Bug Fix: localStorage State Isolation

  • 🔧 Fixed localStorage state sharing between different query HTML files
  • ✨ Each HTML file now uses unique storage keys based on filename
  • 🎯 Prevents star/read marks from interfering across different queries
  • ⚠️ Note: Existing HTML files will need regeneration (old states not preserved)

Technical Details:

  • Added storage_key_suffix extraction from output filename
  • Injected STORAGE_KEY_PREFIX constant in JavaScript
  • Updated all localStorage API calls to use dynamic keys
  • Example keys: starred_wnt5a_reading_list, read_breast_cancer_reading_list
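
The filename-based key scheme can be sketched on the Python side as follows. This is a hedged illustration consistent with the example keys above, not the code in `html_generate.py`; the function names and the sanitizing rule (non-word characters replaced with underscores) are assumptions.

```python
import re
from pathlib import Path


def storage_key_suffix(output_path):
    """Derive a localStorage-safe suffix from the HTML filename (a sketch)."""
    stem = Path(output_path).stem  # "wnt5a_reading_list.html" -> "wnt5a_reading_list"
    # Assumed rule: collapse anything outside letters, digits, and underscores.
    return re.sub(r"\W+", "_", stem).strip("_")


def storage_keys(output_path):
    """Build the per-file keys for starred and read state."""
    suffix = storage_key_suffix(output_path)
    return {"starred": f"starred_{suffix}", "read": f"read_{suffix}"}


keys = storage_keys("./paper_donload/wnt5a_reading_list.html")
print(keys)
```

Because the suffix comes from the output filename, two reading lists saved under different names never share a key, which is the isolation the 2.1 fix describes.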

Version 2.0 (2025-12-22)

Major Features:

  • Added collapsible sidebar navigation with bookmark links
  • Implemented star and read marking with persistent state
  • Real-time status synchronization between article cards and sidebar
  • Optimized for GitHub with comprehensive documentation
  • Simplified Excel column names for better compatibility

Version 1.0 (Original)

  • PubMed query and metadata extraction
  • Impact factor scraping from ScienceDirect
  • Basic HTML generation with keyword highlighting

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

📝 Citation

If you use this tool in your research, please cite:

@software{pubmed_info_grabber2025,
  author = {Li, Xiang and GitHub Copilot},
  title = {PubMed Literature Retrieval and Interactive Reading List Generator},
  year = {2025},
  url = {https://github.com/GatewayPhd/GrabPubmed}
}

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors

  • 李想 (Li Xiang) - Initial work and concept
  • GitHub Copilot - Interactive HTML features, code optimization

🙏 Acknowledgments

  • NCBI for providing the E-utilities API
  • ScienceDirect for impact factor data
  • The Python scientific computing community
