Journal-aware daily discovery and ranking for AI-related papers from PubMed, bioRxiv, and arXiv.
PubMed Signal is built for people who want a serious daily reading list instead of a noisy firehose. It starts with trusted journals, widens to MEDLINE[sb] only when needed, then reaches into bioRxiv and arXiv to fill the pool. It scores the candidate set with OpenAI, reranks the shortlist with a stronger model, and produces both a full digest and a compact set of editor's picks.
A tasteful daily radar for research worth opening first.
- It is journal-first instead of preprint-first.
- It uses topic-aware ranking, so `bioinformatics`, `neuroscience`, `medical-ai`, or a custom query do not all get treated like generic LLM news.
- It keeps source lanes separate, so one noisy feed cannot crowd out the rest of the pool.
- It writes a digest for humans, JSON for automation, and editor's picks for fast scanning.
- It can post a polished briefing to Slack when you want delivery built in.
- A staged daily candidate ladder:
  - PubMed fills to 50
  - bioRxiv fills the pool to 80
  - arXiv fills the pool to 100
- Full-pool scoring with topic relevance, impact, interestingness, rigor, `awe_factor`, and `surprise_factor`
- Editor's picks for:
- theoretical research
- methods / techniques / algorithmic improvement
- impactful application
- fun / humor / easy read
- Per-paper scores in the main digest
- Per-pick scores in editor's picks
- Incremental local state, so recurring runs do not keep resurfacing the same papers
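
The incremental-state idea can be sketched with a tiny SQLite "seen" table. This is an illustrative schema only, not the project's actual one (which lives in `data/pubmed_digest.sqlite3`); the table and function names are assumptions.

```python
import sqlite3

# In-memory database for the sketch; the real pipeline persists to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS seen (pmid TEXT PRIMARY KEY)")

def filter_unseen(pmids):
    """Return only PMIDs that no previous run has recorded."""
    seen = {row[0] for row in conn.execute("SELECT pmid FROM seen")}
    return [p for p in pmids if p not in seen]

def mark_seen(pmids):
    """Persist PMIDs so later runs do not resurface the same papers."""
    conn.executemany("INSERT OR IGNORE INTO seen (pmid) VALUES (?)",
                     [(p,) for p in pmids])
    conn.commit()

fresh = filter_unseen(["111", "222"])   # both unseen on the first run
mark_seen(fresh)
print(filter_unseen(["111", "333"]))    # ['333'] -- "111" is now skipped
```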
Create a local `.env` file:

```bash
cat > .env <<'EOF'
OPENAI_API_KEY="your-openai-key"
NCBI_API_KEY="your-ncbi-key"
NCBI_EMAIL="you@example.com"
EOF
```

Run the daily workflow:

```bash
./run_daily.sh
```

The scripts automatically load `.env` from the repository root. That file is ignored by git.
Run the default daily workflow:

```bash
./run_daily.sh
```

Run a 10-day bioinformatics pass:

```bash
DAYS_BACK=10 TOPIC=bioinformatics ./run_daily.sh
```

Run without posting to Slack:

```bash
POST_TO_SLACK=0 ./run_daily.sh
```

Run the main digest only:

```bash
python pubmed_digest.py \
  --days-back 10 \
  --topic bioinformatics \
  --candidate-pool-size 100 \
  --retmax 10 \
  --journal-whitelist journal_whitelist_top40.txt
```

Run editor's picks only:

```bash
python editor_picks_from_pool.py
```

Post an existing day to Slack:

```bash
python post_to_slack.py --date 2026-04-20
```

Built-in presets: `llm`, `medical-ai`, `bioinformatics`, `neuroscience`, `nlp`
Examples:

```bash
python pubmed_digest.py --topic neuroscience
TOPIC=bioinformatics ./run_daily.sh
```

For a custom topic, put your PubMed query in a text file:

```bash
python pubmed_digest.py --topic-file topics/spatial_transcriptomics.txt
```

or:

```bash
TOPIC_FILE=topics/spatial_transcriptomics.txt ./run_daily.sh
```

You can also override the query directly:

```bash
python pubmed_digest.py --query '"spatial transcriptomics"[Title/Abstract] AND "foundation model"[Title/Abstract]'
```

- Search PubMed over the chosen window.
- Fill the PubMed lane from the journal whitelist first.
- If PubMed is still below 50, add `MEDLINE[sb]` results until PubMed reaches 50.
- If the combined pool is still below 80, add bioRxiv until the pool reaches 80.
- If the combined pool is still below 100, add arXiv `cs` until the pool reaches 100.
- Tighten any overfull source inside its own lane before selection.
- Fetch metadata, abstracts, and PMC full text when available.
- Score the full candidate pool with a lighter OpenAI model.
- Rerank the final shortlist with a stronger OpenAI model.
- Select editor's picks from the same topic-aware pool.
Example:
If PubMed yields 48, bioRxiv can add up to 32 to bring the pool to 80. If PubMed plus bioRxiv only reaches 32, arXiv can add up to 68 to bring the pool to 100.
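
The ladder above can be sketched in a few lines. The 50/80/100 targets come from the text; the function name and the simple truncation logic are illustrative, not the project's actual implementation (which also tightens overfull lanes before selection).

```python
# Lane targets from the staged candidate ladder.
PUBMED_TARGET, BIORXIV_TARGET, POOL_TARGET = 50, 80, 100

def build_pool(pubmed, biorxiv, arxiv):
    """Fill the pool lane by lane, each source topping up to its target."""
    pool = pubmed[:PUBMED_TARGET]                    # journal-first PubMed lane
    if len(pool) < BIORXIV_TARGET:                   # widen with bioRxiv
        pool += biorxiv[:BIORXIV_TARGET - len(pool)]
    if len(pool) < POOL_TARGET:                      # top up with arXiv cs
        pool += arxiv[:POOL_TARGET - len(pool)]
    return pool

# Worked example from the text: 48 PubMed papers leave room for
# up to 32 bioRxiv papers, then up to 20 arXiv papers.
pool = build_pool(["p"] * 48, ["b"] * 40, ["a"] * 40)
print(len(pool))  # 100
```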
Each paper is scored on:
- topic relevance
- impact
- interestingness
- rigor
- awe
- surprise
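
A per-paper score record might look like the sketch below. The field values are made up, and the equal-weight average is an assumption; the text does not specify how the real pipeline combines the six dimensions.

```python
from statistics import mean

# Hypothetical score record for one paper (values are illustrative).
paper_scores = {
    "topic_relevance": 8.5,
    "impact": 7.0,
    "interestingness": 9.0,
    "rigor": 6.5,
    "awe_factor": 8.0,
    "surprise_factor": 7.5,
}

# Naive equal-weight combination -- an assumption, not the project's formula.
overall = mean(paper_scores.values())
print(round(overall, 2))  # 7.75
```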
By default, the project queries the OpenAI Models API, prefers a lighter model for full-pool scoring, and prefers the flagship model for final editorial reranking if both are available.
Each normal run writes into:

- `output/YYYY-MM-DD/digest.md`
- `output/YYYY-MM-DD/digest.json`
- `output/YYYY-MM-DD/editor-picks.md`
- `output/YYYY-MM-DD/editor-picks.json`
For ad hoc experimental runs, you can point the scripts at a different output root; those runs are commonly kept outside the main daily folder structure.
`digest.md` is the full reading list.
`editor-picks.md` is the shorter human-friendly briefing.
The JSON files are useful for Slack, email, or downstream tooling.
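
A downstream consumer of `digest.json` could be as simple as the sketch below. The field names (`papers`, `title`) are assumptions; check a generated file for the real schema.

```python
import json

def titles_from_digest(raw_json: str) -> list[str]:
    """Extract paper titles from a digest JSON document (assumed schema)."""
    digest = json.loads(raw_json)
    return [p.get("title", "") for p in digest.get("papers", [])]

# Stand-in payload; a real run would read output/YYYY-MM-DD/digest.json.
sample = '{"papers": [{"title": "A"}, {"title": "B"}]}'
print(titles_from_digest(sample))  # ['A', 'B']
```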
If SLACK_WEBHOOK_URL is set in .env, run_daily.sh can post the finished digest to Slack automatically.
To skip Slack for a given run:
```bash
POST_TO_SLACK=0 ./run_daily.sh
```

The Slack post includes:
- a short run summary
- top ranked papers
- editor's picks
- one-line reasons for each pick
- scores for both the main digest and the picks
- `pubmed_digest.py`: retrieval, scoring, reranking, and digest writing
- `editor_picks_from_pool.py`: editor's-picks selection from the daily candidate pool
- `post_to_slack.py`: Slack delivery formatter and sender
- `run_daily.sh`: one-command terminal runner
- `journal_whitelist_top40.txt`: curated journal whitelist
- `topics/`: example custom topic files
- `assets/`: GitHub-facing visuals
- `output/`: daily outputs
- `data/pubmed_digest.sqlite3`: local incremental state
- `--query`: override the default PubMed query
- `--topic`: switch to a built-in topic preset
- `--topic-file`: load a custom query from a text file
- `--days-back`: search N recent days across PubMed, bioRxiv, and arXiv
- `--candidate-pool-size`: maximum pool size before ranking
- `--retmax`: number of final reranked papers in the digest
- `--model`: override the first-pass scoring model
- `--final-model`: override the final reranking model
- `--journal-whitelist`: newline-delimited whitelist file
- `--full-text-char-limit`: trim long PMC full text before scoring
- `--mark-seen-without-scoring`: persist PMIDs even when no OpenAI key is set
- `--mark-seen-on-error`: persist PMIDs even when OpenAI scoring fails
Example cron entry:
```bash
0 7 * * * cd /path/to/pubmed && ./run_daily.sh >> output/cron.log 2>&1
```

That runs every day at 7:00 AM in the machine's local time zone.
- PubMed does not guarantee full paper text for every record. The pipeline uses PMC full text when available and falls back to abstract-only ranking otherwise.
- Topic presets are meant to be easy to switch, but custom topic files are the best way to get very sharp domain-specific behavior.
- Preprint sources are opportunistic. If bioRxiv or arXiv are slow or flaky, the pipeline continues instead of failing the entire digest.
- Secrets stay local in `.env`; generated outputs and local state are ignored by git.