115 changes: 74 additions & 41 deletions CHANGELOG.md
@@ -2,50 +2,83 @@

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.0] - 2026-03-23

## [1.0.0] - 2026-04-09

### Added

- V0.1.0 release polish - critical and major fixes

- **screening:** Add global top-percent selection for deep analysis


### Documentation

- **readme:** Update config and screening mode guidance

- **Multi-source discovery**: Support for both Semantic Scholar and OpenAlex APIs
  - Configurable discovery sources via `discovery_sources` setting
  - OpenAlex adapter with field mapping
  - Global deduplication using DOI match and fuzzy title matching
  - Source tracking (`s2`, `openalex`, `both`, `citation_expansion`)
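
As a rough illustration of DOI-first deduplication with a fuzzy title fallback, here is a minimal sketch. It is not the project's implementation: the `Candidate` class, the `deduplicate` helper, and the 0.9 similarity threshold are assumptions made for this example, and it uses the standard-library `difflib` rather than `rapidfuzz`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical candidate record; field names are illustrative only."""
    title: str
    doi: Optional[str]
    source: str  # "s2" or "openalex"


def normalize_title(title: str) -> str:
    # Lowercase and replace punctuation with spaces so cosmetic differences do not matter.
    return " ".join("".join(c if c.isalnum() else " " for c in title.lower()).split())


def is_duplicate(a: Candidate, b: Candidate, threshold: float = 0.9) -> bool:
    # When both records carry a DOI, the DOI comparison is treated as decisive.
    if a.doi and b.doi:
        return a.doi.lower() == b.doi.lower()
    # Otherwise fall back to fuzzy title similarity (threshold is an assumption).
    ratio = SequenceMatcher(None, normalize_title(a.title), normalize_title(b.title)).ratio()
    return ratio >= threshold


def deduplicate(candidates: list[Candidate]) -> list[Candidate]:
    unique: list[Candidate] = []
    for cand in candidates:
        for kept in unique:
            if is_duplicate(cand, kept):
                # Record provenance when the same paper came from two sources.
                if kept.source != cand.source:
                    kept.source = "both"
                break
        else:
            unique.append(cand)
    return unique
```

The pairwise loop is quadratic in the number of candidates, which is usually acceptable for a few hundred results per run; a larger pipeline would likely index by DOI first.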

- **Citation graph expansion**: Optional post-ranking stage to discover frequently referenced works
  - Configurable via `expand_citations` and `min_cross_refs` settings
  - Adds cross-referenced papers as recommended reading

- **Zotero integration**: Export top papers directly to Zotero library
  - Support for user and group libraries
  - Automatic PDF attachment
  - Custom tagging and collection assignment
  - Configurable via `zotero_*` settings

- **Run-quality telemetry**: Comprehensive metrics collection
  - `RunMetrics` and `StageMetrics` models
  - Per-stage timing, input/output counts, error tracking
  - Aggregate statistics (candidates, screened, analyzed, exported)
  - Source breakdown and PDF status tracking
  - Written to `metrics.json` in output directory
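
To make the per-stage timing idea concrete, here is a small sketch using dataclasses and a context manager. Only the class names `RunMetrics` and `StageMetrics` come from the changelog; every field name, the `stage()` helper, and the JSON layout are assumptions, and the real models may well be defined differently.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class StageMetrics:
    # Field names are illustrative, not taken from the project.
    name: str
    duration_seconds: float = 0.0
    input_count: int = 0
    output_count: int = 0
    errors: list[str] = field(default_factory=list)


@dataclass
class RunMetrics:
    stages: list[StageMetrics] = field(default_factory=list)

    @contextmanager
    def stage(self, name: str, input_count: int = 0):
        """Time a pipeline stage and record it even if the stage raises."""
        metrics = StageMetrics(name=name, input_count=input_count)
        start = time.perf_counter()
        try:
            yield metrics
        except Exception as exc:
            metrics.errors.append(repr(exc))
            raise
        finally:
            metrics.duration_seconds = time.perf_counter() - start
            self.stages.append(metrics)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

A stage would then be wrapped as `with run.stage("screening", input_count=len(papers)) as m: ...`, setting `m.output_count` before the block exits.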

- **Manual PDF injection**: Support for providing your own PDFs
  - `--inject-pdfs` CLI flag
  - Configurable via `inject_pdf_dir` setting
  - Matching by `paper_id` or DOI-based filename
  - Useful for papers behind paywalls

- **Token-budgeted PDF extraction**: Intelligent text extraction
  - Replaces fixed first/last pages heuristic
  - Keyword-based page scoring
  - Configurable token budget
  - Falls back gracefully when extraction fails
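
The keyword scoring and budget selection described above could look roughly like the following. This is a sketch under stated assumptions: `select_pages` and `estimate_tokens` are hypothetical names, the four-characters-per-token estimate is a crude heuristic, and the greedy selection is one of several reasonable strategies.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about four characters per token.
    return max(1, len(text) // 4)


def select_pages(pages: list[str], keywords: list[str], token_budget: int) -> list[int]:
    """Greedily pick the highest-scoring pages that fit within the token budget."""

    def score(page: str) -> int:
        lowered = page.lower()
        return sum(lowered.count(keyword.lower()) for keyword in keywords)

    # Highest score first; ties are broken by page order.
    ranked = sorted(range(len(pages)), key=lambda i: (-score(pages[i]), i))
    chosen: list[int] = []
    used = 0
    for index in ranked:
        cost = estimate_tokens(pages[index])
        if used + cost <= token_budget:
            chosen.append(index)
            used += cost
    # Return indices in reading order so the extracted text stays coherent.
    return sorted(chosen)
```

Pages that do not fit are skipped rather than truncated, which keeps the sketch simple at the cost of leaving some budget unused.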

- **Abstract-fallback screening**: Multi-signal screening for papers without abstracts
  - Uses title, venue, citation count, year, and PDF excerpts
  - Conservative scoring bias toward inclusion
  - Dedicated `screening_fallback.md` prompt

- **Robust error handling**: Resilience against external failures
  - `parse_llm_json()` helper with comprehensive validation
  - `retry_with_backoff()` decorator for API calls
  - Configurable retry settings (`max_retries`, `retry_base_delay`)
  - Graceful degradation when LLM returns malformed JSON
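
For readers unfamiliar with the pattern, a retry decorator with exponential backoff can be sketched as below. The name `retry_with_backoff` and the `max_retries` setting appear in the changelog, but the signature here is an assumption: `base_delay` stands in for `retry_base_delay`, the doubling schedule is a guess, and the injectable `sleep` parameter exists only to make the example testable.

```python
import functools
import time


def retry_with_backoff(max_retries=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, doubling the delay after each failed attempt."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Out of retries: surface the original error to the caller.
                    if attempt == max_retries:
                        raise
                    sleep(base_delay * (2 ** attempt))

        return wrapper

    return decorator
```

With `max_retries=3` and `base_delay=1.0`, a persistently failing call waits 1 s, 2 s, and 4 s before the final attempt raises.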

- **Security improvements**:
  - Path sanitization via `safe_filename()` utility
  - Atomic state persistence using a temp file and `os.replace`
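
The temp-file-plus-`os.replace` pattern works because `os.replace` swaps the file in a single step when source and target are on the same filesystem, so a reader sees either the old state or the new one, never a partial write. A minimal sketch, with `save_state_atomic` as a hypothetical helper name:

```python
import json
import os
import tempfile
from pathlib import Path


def save_state_atomic(state: dict, path: Path) -> None:
    """Write JSON state to a temp file, then atomically swap it into place."""
    path = Path(path)
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed, including an interrupt.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process is interrupted mid-write, the previous `state.json` is left untouched and only the temp file is lost.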

### Changed
- **PDF tracking**: Replaced `pdf_downloaded: bool` with richer fields
  - `pdf_path: str | None` - relative path to PDF
  - `pdf_status: Literal["not_attempted", "downloaded", "unavailable", "user_provided"]`
  - `data_completeness: Literal["full", "abstract_only", "metadata_only"]`
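
The three fields above can be modeled as shown in this sketch, together with one possible way to map the old boolean onto them. The field names and literal values come from the changelog; the `PaperPdfInfo` class, the `from_legacy` helper, and the mapping rules (for example, treating `pdf_downloaded=False` as `"not_attempted"`) are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Literal, Optional

PdfStatus = Literal["not_attempted", "downloaded", "unavailable", "user_provided"]
DataCompleteness = Literal["full", "abstract_only", "metadata_only"]


@dataclass
class PaperPdfInfo:
    # Optional[str] is equivalent to the `str | None` shown in the changelog.
    pdf_path: Optional[str] = None
    pdf_status: PdfStatus = "not_attempted"
    data_completeness: DataCompleteness = "metadata_only"


def from_legacy(pdf_downloaded: bool, pdf_path: Optional[str], has_abstract: bool) -> PaperPdfInfo:
    """Hypothetical migration from the old `pdf_downloaded` flag."""
    if pdf_downloaded and pdf_path:
        return PaperPdfInfo(pdf_path, "downloaded", "full")
    # The old flag cannot distinguish "never tried" from "tried and failed",
    # so this sketch picks "not_attempted" to allow a later retry.
    completeness: DataCompleteness = "abstract_only" if has_abstract else "metadata_only"
    return PaperPdfInfo(None, "not_attempted", completeness)
```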

- **Version source**: Single-source version via `importlib.metadata`
  - Removed hardcoded version from `__init__.py`
  - Version now sourced from `pyproject.toml`
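
Single-sourcing a version through `importlib.metadata` typically looks like the sketch below. The distribution name `"litresearch"` is an assumption based on the CLI and config file names, and the fallback string is a placeholder for the case where the package is run from a source tree without being installed.

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(dist_name: str = "litresearch") -> str:
    """Read the installed distribution's version, as declared in pyproject.toml."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Not installed (e.g. running from a bare checkout): use a placeholder.
        return "0.0.0+unknown"
```

A package would usually expose this as `__version__ = get_version()` in its `__init__.py`.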

- **Configuration**: Added `litresearch.toml.example` with all new options
  - Renamed existing `litresearch.toml` to example file
  - Real config files now gitignored

### Fixed

- **s2:** Enforce 1 rps throttling across S2 stages


### Maintenance

- Migrate to opencode workflow


### ci

- **release:** Add environment for trusted publisher


## [0.1.0] - 2026-03-09


### Added

- **ci:** Add oss release workflows


### Maintenance

- Initial project setup

- Release v0.1.0

- **Resume bug**: Fixed crash when resuming from `current_stage="start"`
- **State persistence**: Atomic writes prevent state corruption on interrupt
- **JSON parsing**: Proper handling of missing keys and validation errors in LLM responses
- **Path traversal**: Sanitized `paper_id` usage in filenames
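
To illustrate the path-traversal fix, a filename sanitizer can replace every character outside a small allowlist, which removes path separators entirely. The name `safe_filename` comes from the changelog, but this body, the allowlist, and the `max_length` parameter are assumptions rather than the project's actual code.

```python
import re


def safe_filename(paper_id: str, max_length: int = 100) -> str:
    """Turn an arbitrary identifier into a single safe path component."""
    # Replace anything outside the allowlist, including "/" and "\\".
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", paper_id)
    # Drop leading dots so the result is never ".", "..", or a hidden file.
    cleaned = cleaned.lstrip(".")
    cleaned = cleaned[:max_length]
    return cleaned or "unnamed"
```

For example, a DOI such as `10.1000/xyz123` becomes `10.1000_xyz123`, which can be used directly as `{paper_id}.pdf` inside the `papers/` directory.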

### Dependencies
- Added `pyalex>=0.15` for OpenAlex integration
- Added `pyzotero>=1.6` for Zotero export
- Added `rapidfuzz` for fuzzy title matching (optional, falls back to difflib)
81 changes: 75 additions & 6 deletions README.md
@@ -6,10 +6,41 @@ ranked, and exported paper sets with structured reports.

## Overview
- Generates search facets and academic queries from one or more research questions
- Searches Semantic Scholar for candidate papers
- Discovers candidates from Semantic Scholar and OpenAlex
- Screens and analyzes papers with an LLM through LiteLLM
- Ranks papers and exports reports, references, JSON data, and PDFs
- Supports resume via a saved `state.json`
- Supports citation graph expansion for frequently referenced works
- Ranks papers and exports reports, references, JSON data, PDFs, and metrics
- Supports robust resume via a saved `state.json`

## What's New in v1.0.0

### Multi-source discovery (S2 + OpenAlex)
- Use `discovery_sources = ["s2", "openalex"]` for broader coverage.
- Candidates are deduplicated across sources and source provenance is tracked.

### Citation graph expansion
- Optional expansion stage adds highly cross-referenced papers after ranking.
- Configure with `expand_citations` and `min_cross_refs`.

### Zotero export
- Export top papers to Zotero user or group libraries.
- Supports collection assignment, tags, and PDF attachment when available.

### PDF injection
- Bring your own PDFs with `--inject-pdfs` or `inject_pdf_dir`.
- Match files by `{paper_id}.pdf` or DOI-based filenames.

### Run metrics and telemetry
- Every run writes `metrics.json` with stage timings and aggregate counts.
- Includes source breakdown plus PDF availability and usage metrics.

### Resume behavior improvements
- Improved resume reliability from `state.json` checkpoints.
- Safer state persistence with atomic writes.

### Token-budgeted PDF extraction
- Configurable extraction strategy supports token budgets for LLM context limits.
- Falls back gracefully when PDFs are unavailable or extraction is limited.

## Installation
```bash
@@ -59,6 +90,7 @@ output/
references.bib
references.ris
data.json
metrics.json
papers/
state.json
```
@@ -90,6 +122,12 @@ Resume an interrupted run:
litresearch resume output/state.json
```

Inject local PDFs for papers you already have:

```bash
litresearch run "Your research question" --inject-pdfs /path/to/pdfs
```

Inspect current configuration:

```bash
@@ -108,26 +146,44 @@ Supported environment variables:
- `ANTHROPIC_API_KEY`
- `OPENROUTER_API_KEY`
- `S2_API_KEY`
- `ZOTERO_API_KEY`
- `S2_TIMEOUT`
- `S2_REQUESTS_PER_SECOND`
- `SCREENING_SELECTION_MODE`
- `SCREENING_TOP_PERCENT`
- `SCREENING_TOP_K`
- `SCREENING_THRESHOLD`

Example `litresearch.toml`:
Start from the full example config:

```bash
cp litresearch.toml.example litresearch.toml
```

Key options include:

```toml
default_model = "openai/gpt-4o-mini"
llm_timeout = 120
max_retries = 3
retry_base_delay = 1.0
discovery_sources = ["s2"]
screening_selection_mode = "top_percent"
screening_top_percent = 0.3
screening_threshold = 60
top_n = 20
max_results_per_query = 20
expand_citations = false
min_cross_refs = 3
zotero_export = false
s2_timeout = 10
s2_requests_per_second = 1.0
pdf_extraction_mode = "budget"
pdf_token_budget = 4000
pdf_first_pages = 4
pdf_last_pages = 2
abstract_fallback = true
# inject_pdf_dir = "/path/to/pdfs"
output_dir = "output"
```

@@ -140,12 +196,25 @@ Semantic Scholar tuning:
- `s2_timeout`: request timeout in seconds
- `s2_requests_per_second`: global request rate cap across S2 endpoints

Discovery tuning:
- `discovery_sources`: choose `s2`, `openalex`, or both
- `openalex_email`: optional email for OpenAlex polite pool rate limits

Citation expansion tuning:
- `expand_citations`: enable or disable expansion stage
- `min_cross_refs`: minimum citation graph references to include

Zotero export tuning:
- `zotero_export`: enable export integration
- `zotero_library_id`, `zotero_library_type`, `zotero_collection_key`, `zotero_tag`

## Output Files
- `report.md`: main literature review report with research questions, search summary, top papers, and synthesis
- `paper_analyses.md`: detailed per-paper analysis for all analyzed papers
- `references.bib`: BibTeX for ranked papers when citation data is available
- `references.ris`: RIS export for citation managers
- `data.json`: machine-readable export of the pipeline state
- `metrics.json`: per-stage timings and aggregate run metrics
- `papers/`: downloaded open-access PDFs for ranked papers
- `state.json`: resumable pipeline checkpoint

@@ -156,5 +225,5 @@
```

## Status
This is an MVP-oriented proof of concept intended to answer one question clearly:
is the end-to-end literature research workflow useful enough to keep investing in?
`v1.0.0` delivers a production-ready core workflow for automated literature research,
including multi-source discovery, ranking, export, and operational telemetry.