Python port of the PubMatrixR R package.
For every pair of search terms (A, B), it counts how many PubMed or PMC publications mention both. Good for mapping relationships between genes, diseases, and pathways across the literature.
Based on: Becker et al. (2003) PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics 4:61. https://doi.org/10.1186/1471-2105-4-61
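
The pairwise counting described above can be sketched with the NCBI E-utilities `esearch` endpoint. This is a minimal illustration, not necessarily how this package builds its queries: the helper names (`build_count_url`, `parse_count`, `cooccurrence_matrix`) are hypothetical, and the choice to AND the two terms together and to filter by publication date via `mindate`/`maxdate` is an assumption.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(term_a, term_b, database="pubmed", daterange=None, api_key=None):
    """Build an esearch URL asking only for the hit count of 'A AND B'."""
    params = {
        "db": database,
        "term": f"({term_a}) AND ({term_b})",  # assumption: simple AND of both terms
        "rettype": "count",
        "retmode": "json",
    }
    if daterange is not None:
        # Assumption: filter on publication date between the two years.
        params["datetype"] = "pdat"
        params["mindate"] = str(daterange[0])
        params["maxdate"] = str(daterange[1])
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(response_text):
    """Extract the integer hit count from an esearch JSON response body."""
    return int(json.loads(response_text)["esearchresult"]["count"])


def cooccurrence_matrix(A, B, count_fn):
    """Call count_fn(a, b) for every pair; outer keys are B (rows), inner keys are A (columns)."""
    return {b: {a: count_fn(a, b) for a in A} for b in B}
```

Passing the counting function in as `count_fn` keeps the pairing logic testable without network access; a real caller would fetch each URL and feed the response text to `parse_count`.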
Requires uv. Install it with:
curl -LsSf https://astral.sh/uv/install.sh | sh

Clone and install dependencies:
git clone <repo-url>
cd PubMatrixPython
uv sync --all-groups

All uv commands must be run from the project root (PubMatrixPython/), where pyproject.toml lives.
cd /path/to/PubMatrixPython
uv run jupyter lab

Then open any notebook from the notebooks/ folder in the browser.
| Notebook | What it covers |
|---|---|
| 01_pubmatrix.ipynb | Basic queries, date filtering, PMC database, file input, CSV export, heatmap visualisation |
| 02_example_wnt.ipynb | Full worked example: WNT genes × obesity genes |
uv run python

from pubmatrix import pubmatrix, plot_pubmatrix_heatmap
A = ["WNT1", "WNT2", "CTNNB1"]
B = ["obesity", "diabetes", "cancer"]
result = pubmatrix(A=A, B=B)
print(result)
plot_pubmatrix_heatmap(result, title="WNT × Disease")

Create a file my_analysis.py:
from pubmatrix import pubmatrix, plot_pubmatrix_heatmap
A = ["WNT1", "WNT2", "WNT3A", "WNT5A", "CTNNB1"]
B = ["obesity", "diabetes", "cancer", "inflammation"]
result = pubmatrix(
    A=A,
    B=B,
    database="pubmed",
    daterange=[2010, 2024],  # optional date filter
    outfile="results",
    export_format="csv",     # saves results.csv with PubMed hyperlinks
)
print(result)
plot_pubmatrix_heatmap(
    result,
    title="WNT Genes × Disease",
    filename="heatmap.png",  # saves to file instead of displaying
)

Run it with:

uv run python my_analysis.py

Create terms.txt:
WNT1
WNT2
CTNNB1
#
obesity
diabetes
cancer
from pubmatrix import pubmatrix_from_file
result = pubmatrix_from_file("terms.txt")
print(result)

uv run python my_analysis.py

Query PubMed and return a pandas.DataFrame (rows = B, cols = A).
pubmatrix(
    A,                   # list of str: column terms
    B,                   # list of str: row terms
    api_key=None,        # NCBI API key (10 req/s vs 3 req/s default)
    database="pubmed",   # "pubmed" or "pmc"
    daterange=None,      # e.g. [2015, 2024]
    outfile=None,        # base filename for export
    export_format=None,  # None | "csv" | "ods"
    n_tries=2,           # retries on network failure
)

Load terms from a plain-text file and run pubmatrix().
File format:
WNT1
WNT2
#
obesity
diabetes
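
The file format above could be parsed roughly as follows. This is only a sketch: `parse_terms` is a hypothetical helper, not part of the package, and it assumes that terms before the `#` line form list A, terms after it form list B, and blank lines are ignored.

```python
def parse_terms(text):
    """Split term-file text into (A, B) at the first line containing only '#'."""
    groups = ([], [])
    current = 0  # 0 = still reading A, 1 = reading B
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line == "#":
            current = 1  # separator reached; later terms go to B
            continue
        groups[current].append(line)
    return groups


A, B = parse_terms("WNT1\nWNT2\n#\nobesity\ndiabetes\n")
```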
result = pubmatrix_from_file("terms.txt", database="pubmed")

Heatmap of overlap percentages with optional hierarchical clustering.
plot_pubmatrix_heatmap(
    matrix,              # DataFrame from pubmatrix()
    title="PubMatrix Co-occurrence Heatmap",
    cluster_rows=True,
    cluster_cols=True,
    show_numbers=True,
    color_palette=None,  # list of hex colours
    filename=None,       # save to PNG if set
    width=10, height=8,
    scale_font=True,
)

Quick wrapper around plot_pubmatrix_heatmap() with all defaults.
Without a key: 3 requests/second. With a key: 10 requests/second. Get one at https://account.ncbi.nlm.nih.gov/
result = pubmatrix(A=A, B=B, api_key="YOUR_KEY_HERE")
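
For scripts that call NCBI directly, the limits above translate into a minimum spacing between requests. The following is a sketch of one way to enforce that spacing; the `Throttle` class is hypothetical and is not claimed to be how pubmatrix() paces its own requests.

```python
import time


class Throttle:
    """Space out calls so they stay within 3 req/s (no key) or 10 req/s (with key)."""

    def __init__(self, api_key=None, clock=time.monotonic, sleep=time.sleep):
        requests_per_second = 10 if api_key else 3
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None    # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request keeps a loop over many term pairs under the limit.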