
lukeslp/citewright


CiteWright

Python 3.8+ | License: MIT | Status: Active

Anybody else have a huge folder full of files with names like 235680_download.PDF and smith_et_al_2008_full.pdf(2)?

... yeah.

I wrote this because I got tired of mass-downloading papers from Sci-Hub and then staring at a folder of cryptic filenames, wondering which one was the paper about transformer attention mechanisms and which one was about soil bacteria. Life's too short.

What It Does

  • Extracts text from documents and queries arXiv, Semantic Scholar, Crossref, PubMed, OpenLibrary, and Unpaywall to find the actual source
  • Renames files to Author_Year_Title.ext like a civilized person
  • Handles PDF, TXT, Markdown, DOC/DOCX, and Python files - throw it at it, let's find out
  • Maintains a BibTeX database so you don't have to
  • Logs everything, doesn't break anything, asks before doing anything destructive
  • Optionally uses a local LLM (Ollama) or cloud providers (OpenAI, Anthropic, Gemini) if the free APIs come up empty
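To make the Author_Year_Title.ext pattern concrete, here is a minimal sketch of how such a name could be assembled once the metadata is known. The helpers `slugify` and `build_filename`, the eight-word title cap, and the "Unknown" fallback are all assumptions for illustration, not CiteWright's actual code.

```python
import re
import unicodedata


def slugify(text: str, max_words: int = 8) -> str:
    """Reduce text to ASCII words joined by underscores (hypothetical helper)."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[A-Za-z0-9]+", ascii_text)
    return "_".join(words[:max_words])


def build_filename(author: str, year: int, title: str, ext: str) -> str:
    """Compose Author_Year_Title.ext from already-resolved metadata."""
    # Accept "Given Family" or "Family, Given"; keep only the family name.
    parts = author.split(",")[0].split()
    surname = slugify(parts[-1]) if parts else "Unknown"
    return f"{surname}_{year}_{slugify(title)}.{ext.lstrip('.').lower()}"


print(build_filename("Ashish Vaswani", 2017, "Attention Is All You Need", ".PDF"))
```

Capping the title length keeps names readable and comfortably within filesystem limits; the real tool may well pick a different cap or separator.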

The Philosophy

I built this with a "try the free stuff first" approach. Why pay for API calls when Crossref is right there?

Tier  What Happens
1     Check if the PDF already has metadata embedded. Usually garbage, but sometimes you get lucky.
2     Extract DOIs, arXiv IDs, ISBNs from the text and look them up. This is where the magic happens.
3     Search academic APIs using whatever title/author text it can scrape. Works more often than you'd think.
4     (Optional) Throw the text at an LLM and ask nicely. Costs money unless you're running Ollama locally.
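The tiers above boil down to "try each strategy in order and stop at the first hit". Here is a rough sketch of that idea, plus the identifier scraping from Tier 2. The regexes, `find_identifiers`, and `resolve` are illustrative assumptions rather than the tool's real internals, and ISBNs are left out to keep it short.

```python
import re
from typing import Callable, Dict, List, Optional

# Assumed patterns: modern DOIs ("10.<registrant>/<suffix>") and new-style arXiv IDs.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
ARXIV_RE = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)

Metadata = Dict[str, str]
Tier = Callable[[str], Optional[Metadata]]


def find_identifiers(text: str) -> Dict[str, List[str]]:
    """Pull DOIs and arXiv IDs out of raw document text."""
    # Sentence punctuation often sticks to the end of a DOI, so trim it.
    dois = [match.rstrip(".,;)") for match in DOI_RE.findall(text)]
    return {"doi": dois, "arxiv": ARXIV_RE.findall(text)}


def resolve(text: str, tiers: List[Tier]) -> Optional[Metadata]:
    """Run the tiers cheapest-first and return the first non-empty result."""
    for tier in tiers:
        result = tier(text)
        if result:
            return result
    return None


sample = "See doi:10.1000/xyz123. Also arXiv:1706.03762v5 is relevant."
print(find_identifiers(sample))
```

Ordering the list from free to paid is what keeps the LLM tier as a last resort: it only runs if every earlier callable returned nothing.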

Installation

git clone https://github.com/lukeslp/citewright.git
cd citewright
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install .

Want the LLM-powered features and media processing?

pip install ".[all]"

Usage

Preview what would happen (dry run, safe):

citewright pdf ~/papers

Actually rename things:

citewright pdf ~/papers --execute

Go recursive and spit out a BibTeX file:

citewright pdf ~/papers -r --execute --bibtex library.bib
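For a sense of what ends up in a file like library.bib, here is a small sketch that renders one entry from resolved metadata. The `to_bibtex` function, the `@article` entry type, and the field set are assumptions for illustration; the entries CiteWright writes may be laid out differently.

```python
def to_bibtex(key: str, author: str, title: str, year: int,
              journal: str = "", doi: str = "") -> str:
    """Render one BibTeX @article entry, skipping fields that are empty."""
    fields = {
        "author": author,
        "title": title,
        "year": str(year),
        "journal": journal,
        "doi": doi,
    }
    # Triple braces in the f-string produce literal { } around each value.
    lines = [f"  {name} = {{{value}}}," for name, value in fields.items() if value]
    return "@article{" + key + ",\n" + "\n".join(lines) + "\n}"


print(to_bibtex("smith2008", "Smith, John", "Soil Bacteria", 2008))
```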

Let the LLM analyze the stubborn ones:

citewright pdf ~/papers --ai --execute

Rename photos and videos too (uses EXIF data):

citewright media ~/photos --execute

Use vision models to describe images:

citewright media ~/photos --ai --execute

Oh no, go back:

citewright undo
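If you're curious how dry-run-by-default and undo can fit together, one common pattern is to append every rename to a log and replay it backwards. This is only a sketch under that assumption: `rename_with_log`, `undo_all`, and the JSON-lines log format are hypothetical, and CiteWright's own logging may work differently.

```python
import json
from pathlib import Path


def rename_with_log(src: Path, dst: Path, log_path: Path, execute: bool = False) -> bool:
    """Rename src to dst and record the move; only previews unless execute=True."""
    if dst.exists():
        return False  # never overwrite an existing file
    if not execute:
        print(f"[dry run] {src.name} -> {dst.name}")
        return False
    src.rename(dst)
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps({"from": str(src), "to": str(dst)}) + "\n")
    return True


def undo_all(log_path: Path) -> int:
    """Replay the log in reverse, moving files back; returns how many were restored."""
    if not log_path.exists():
        return 0
    raw_lines = log_path.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in raw_lines if line.strip()]
    restored = 0
    for entry in reversed(entries):
        current, original = Path(entry["to"]), Path(entry["from"])
        # Skip anything moved or deleted since the rename happened.
        if current.exists() and not original.exists():
            current.rename(original)
            restored += 1
    log_path.unlink()
    return restored
```

Reversing the log matters when a file was renamed more than once: the latest move has to be unwound first.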

Configuration

Config lives at ~/.config/citewright/config.json, or use the CLI:

citewright config --show
citewright config --ai-provider openai  # Select LLM provider
citewright config --ai-enabled
citewright config --unpaywall-email "you@example.com"

The Unpaywall email is optional but they appreciate it. Be cool.
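For context on where that email goes: as far as I know, Unpaywall's public REST API takes it as an `email` query parameter on a per-DOI lookup URL. The helper below is a hypothetical sketch of building such a URL; it makes no network call, and it is not necessarily how CiteWright issues the request.

```python
from urllib.parse import quote, urlencode


def unpaywall_url(doi: str, email: str) -> str:
    """Build an Unpaywall v2 lookup URL for a DOI (hypothetical helper)."""
    # Keep the slash inside the DOI; percent-encode everything else that needs it.
    path = quote(doi, safe="/")
    return f"https://api.unpaywall.org/v2/{path}?{urlencode({'email': email})}"


print(unpaywall_url("10.1038/nature12373", "you@example.com"))
```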

License

MIT. Do whatever.

Author

Luke Steuber https://github.com/lukeslp luke@dr.eamer.dev

About

Intelligent academic paper and media renaming tool with multi-source metadata extraction
