Merged
40 changes: 39 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -7,6 +7,43 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [2.0.0] - 2026-03-12

### Added
- **Local RAG (AI Chat)** β€” ask questions about your documents using a fully local LLM. No data leaves your machine.
- Automatic model selection based on system RAM: Qwen2.5-7B (16 GB+), Qwen2.5-3B (8-16 GB), or Qwen2.5-1.5B (any)
- GPU acceleration: Apple Metal on Apple Silicon, CUDA on NVIDIA, CPU fallback
- Models downloaded once to `~/.cache/docfinder/models/` via Hugging Face Hub
- Chat button appears on search results when RAG is enabled in Settings
- Chat panel slides up from bottom-right with full conversation history per session
- **Page-aware context retrieval** β€” RAG uses document structure for smarter context:
- PDF: real page boundaries
- Markdown: heading-based sections
- Word (.docx): groups of 10 paragraphs
- Plain text: virtual pages of ~3000 characters
- Context expands symmetrically to adjacent pages/sections until the token budget is filled
- **RAG Settings UI** β€” new "AI Chat (RAG)" section in Settings:
- Toggle to enable/disable AI Chat
- Hardware detection shows available RAM
- Three model cards with size, RAM requirements, and "Recommended" badge
- Download progress bar with real-time bytes/percentage tracking
- Model status persisted across sessions
- **Multi-format document support** β€” DocFinder now indexes PDF, plain text (`.txt`), Markdown (`.md`), and Word (`.docx`) files
- **Spotlight-style quick-search panel** *(experimental)* β€” a floating `NSPanel` + `WKWebView` can be summoned via the global hotkey to search documents without switching to the main window
- New `[rag]` optional dependency group (`pip install docfinder[rag]`)
- New storage method `get_context_window()` for fixed-window chunk retrieval
- New storage method `get_context_by_page()` for page-aware chunk retrieval
- Search results now include `document_id` for downstream RAG integration
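The symmetric, budget-bounded expansion described under "Page-aware context retrieval" can be sketched as follows. This is an illustrative sketch, not DocFinder's actual implementation: the names `paginate`, `expand_context`, and `approx_tokens` are hypothetical, and the chars-per-token heuristic is an assumption.

```python
def paginate(text: str, page_chars: int = 3000) -> list[str]:
    """Split plain text into virtual pages of ~page_chars characters."""
    return [text[i : i + page_chars] for i in range(0, len(text), page_chars)]


def approx_tokens(s: str) -> int:
    """Rough chars-per-token heuristic (assumed ~4 chars/token)."""
    return len(s) // 4


def expand_context(pages: list[str], hit_page: int, token_budget: int) -> str:
    """Grow the context symmetrically around the hit page until the budget fills."""
    lo = hi = hit_page
    context = pages[hit_page]
    while approx_tokens(context) < token_budget:
        grew = False
        if lo > 0:  # expand one page to the left
            lo -= 1
            context = pages[lo] + context
            grew = True
        if hi < len(pages) - 1 and approx_tokens(context) < token_budget:
            hi += 1  # expand one page to the right
            context = context + pages[hi]
            grew = True
        if not grew:  # whole document already included
            break
    return context
```

The same expansion applies to real PDF pages, Markdown sections, or `.docx` paragraph groups; only the pagination step differs per format.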

### Changed
- **Redesigned UI theme**
- `EmbeddingModel.embed()` accepts an optional `batch_size` override for low-RAM scenarios
- `build_chunks()` now stores `page` number in chunk metadata for all document formats
- Chunking pipeline uses page-aware `chunk_text_stream_paged()` to preserve page provenance

### New dependencies
- `llama-cpp-python >= 0.3.0` (optional, in `[rag]` extra)

## [1.2.0] - 2026-03-10

### Fixed
@@ -180,7 +217,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Fixed linting issues for consistent code style
- Updated ruff configuration to use non-deprecated settings

[Unreleased]: https://github.com/filippostanghellini/DocFinder/compare/v1.2.0...HEAD
[Unreleased]: https://github.com/filippostanghellini/DocFinder/compare/v2.0.0...HEAD
[2.0.0]: https://github.com/filippostanghellini/DocFinder/compare/v1.2.0...v2.0.0
[1.2.0]: https://github.com/filippostanghellini/DocFinder/compare/v1.1.2...v1.2.0
[1.1.2]: https://github.com/filippostanghellini/DocFinder/compare/v1.1.1...v1.1.2
[1.1.1]: https://github.com/filippostanghellini/DocFinder/compare/v1.0.1...v1.1.1
56 changes: 56 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,56 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commands

```bash
make setup # Create .venv and install all extras [dev,web,gui]
make test # Run pytest with coverage (term + HTML + XML)
make lint # ruff check src/ tests/
make format # ruff format src/ tests/
make format-check # Check formatting without modifying files
make check-all # lint + format-check + test
make run # Launch native desktop GUI (pywebview)
make run-web # Launch web interface at http://127.0.0.1:8000
```

Single test:
```bash
pytest tests/test_web_app.py -v
pytest tests/test_indexer.py::TestIndexer::test_method -v
```

On Linux CI, PyTorch is installed CPU-only:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

## Architecture

**Entry points:** `docfinder` (CLI via `cli.py` / typer) and `docfinder-gui` (`gui.py` spawns uvicorn in a thread, wraps FastAPI in pywebview).

**Core pipeline:**
1. `ingestion/pdf_loader.py` β€” PyMuPDF extracts text, splits into overlapping chunks (default 1200 chars, 200 overlap)
2. `embedding/encoder.py` β€” `EmbeddingModel` wraps SentenceTransformer; auto-detects CUDA β†’ MPS β†’ ROCm β†’ CPU; optionally uses ONNX/CoreML backends
3. `index/indexer.py` β€” `Indexer` orchestrates PDF discovery, chunking, embedding, and storage; reports progress via callback `(processed, total, current_file)`
4. `index/storage.py` β€” `SQLiteVectorStore` persists chunks + embeddings; WAL mode; cosine similarity via numpy; batch inserts with `executemany()`
5. `index/search.py` β€” `Searcher` queries the store

**Web layer (`web/app.py`):**
- FastAPI app with lifespan-based startup preloading of `EmbeddingModel` singleton (thread-safe double-checked locking via `_get_embedder()`)
- `/index` returns a `job_id` immediately; background work runs via `asyncio.create_task`; poll `/index/status/{job_id}` for progress
- Default DB path: `~/Documents/DocFinder/docfinder.db` (frozen app) or `data/docfinder.db` (dev)
- Path validation uses `realpath` + home-dir prefix check

**Frontend (`web/templates/index.html`):** Vanilla JS single-page app; no framework. Polls indexing progress every 600 ms. Uses `escHtml()` for XSS prevention.

**Settings:** Hotkey config in `settings.py`; `AppConfig` in `config.py` handles paths and model name.

## Key Constraints

- Python 3.10+ required (no walrus operator in type hints; use `from __future__ import annotations`)
- `numpy<3` pinned for C-extension compatibility
- SQLite used with no extensions (no sqlite-vec, no FTS5 for search β€” pure numpy cosine similarity)
- Ruff line length: 100, double quotes, target py310
- Tests run with `--strict-markers`; coverage is always collected
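The "pure numpy cosine similarity" constraint amounts to a brute-force nearest-neighbour scan. A minimal sketch (the function name is illustrative, not DocFinder's API):

```python
import numpy as np


def cosine_top_k(query: np.ndarray, matrix: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k rows of `matrix` most similar to `query`.
    Assumes rows and query are L2-normalized, so the dot product
    equals cosine similarity (no sqlite-vec, no FTS5 needed)."""
    scores = matrix @ query
    return np.argsort(-scores)[:k]
```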
14 changes: 6 additions & 8 deletions README.md
@@ -12,20 +12,18 @@
</p>

<p align="center">
<strong>Local-first semantic search for your PDF documents.</strong><br>
<strong>Local-first semantic search for your documents.</strong><br>
Supports PDF, Word (.docx), Markdown, and plain text files.<br>
Everything runs on your machine β€” no cloud, no accounts, complete privacy.
</p>

<table width="100%">
<tr>
<td width="50%"><img src="images/search.png" alt="Search" width="100%"></td>
<td width="50%"><img src="images/index.png" alt="Index" width="100%"></td>
</tr>
</table>
<p align="center">
<img src="images/demo.gif" alt="DocFinder Demo" width="700">
</p>

## Features

- **Semantic search** β€” find documents by meaning, not just keywords
- **Semantic search** β€” find documents by meaning, not just keywords (PDF, DOCX, Markdown, TXT)
- **100% local** β€” your files never leave your machine
- **GPU accelerated** β€” auto-detects Apple Silicon (Metal), NVIDIA (CUDA), AMD (ROCm)
- **Cross-platform** β€” native apps for macOS, Windows, and Linux
Binary file added images/demo.gif
10 changes: 7 additions & 3 deletions pyproject.toml
@@ -4,9 +4,9 @@ build-backend = "setuptools.build_meta"

[project]
name = "docfinder"
version = "1.2.0"
version = "2.0.0"
license = "AGPL-3.0-or-later"
description = "Local-first semantic search CLI for PDF documents."
description = "Local-first semantic search CLI for your documents (PDF, DOCX, Markdown, TXT)."
authors = [
{ name = "DocFinder Team" }
]
@@ -24,7 +24,8 @@ dependencies = [
"typer[all]>=0.12.0",
"rich>=13.7.0",
"tqdm>=4.66.0",
"mpmath<1.4"
"mpmath<1.4",
"python-docx>=1.1.0"
]

[project.optional-dependencies]
@@ -49,6 +50,9 @@ gui = [
"pynput>=1.7.0",
"pyobjc-framework-Cocoa>=9.0; sys_platform == 'darwin'"
]
rag = [
"llama-cpp-python>=0.3.0",
]
gpu = [
"onnxruntime-gpu>=1.17.0"
]
2 changes: 1 addition & 1 deletion src/docfinder/__init__.py
@@ -2,4 +2,4 @@

__all__ = ["__version__"]

__version__ = "1.1.2"
__version__ = "2.0.0"
18 changes: 10 additions & 8 deletions src/docfinder/cli.py
@@ -15,10 +15,10 @@
from docfinder.index.indexer import Indexer
from docfinder.index.search import Searcher
from docfinder.index.storage import SQLiteVectorStore
from docfinder.utils.files import iter_pdf_paths
from docfinder.utils.files import iter_document_paths

console = Console()
app = typer.Typer(help="DocFinder - local semantic search for PDFs")
app = typer.Typer(help="DocFinder - local semantic search for your documents")


def _setup_logging(verbose: bool) -> None:
@@ -32,14 +32,16 @@ def _ensure_db_parent(db_path: Path) -> None:

@app.command()
def index(
inputs: List[Path] = typer.Argument(..., help="Paths with PDFs to index.", resolve_path=True),
inputs: List[Path] = typer.Argument(
..., help="Paths with documents to index.", resolve_path=True
),
db: Path = typer.Option(None, "--db", help="SQLite database path"),
model: str = typer.Option(AppConfig().model_name, help="Sentence-transformer model name"),
chunk_chars: int = typer.Option(AppConfig().chunk_chars, help="Chunk size in characters"),
overlap: int = typer.Option(AppConfig().overlap, help="Chunk overlap"),
verbose: bool = typer.Option(False, "--verbose", "-v", help="Verbose logging"),
) -> None:
"""Index one or more paths containing PDF files."""
"""Index one or more paths containing documents (PDF, DOCX, MD, TXT)."""
_setup_logging(verbose)
config = AppConfig(
db_path=db if db is not None else AppConfig().db_path,
@@ -56,13 +58,13 @@ def index(
indexer = Indexer(embedder, store, chunk_chars=config.chunk_chars, overlap=config.overlap)

console.print(f"Indexing into [bold]{resolved_db}[/bold]...")
pdf_paths = list(iter_pdf_paths(inputs))
if not pdf_paths:
console.print("[yellow]No PDFs found.[/yellow]")
doc_paths = list(iter_document_paths(inputs))
if not doc_paths:
console.print("[yellow]No supported documents found.[/yellow]")
store.close()
return

stats = indexer.index(pdf_paths)
stats = indexer.index(doc_paths)
console.print(
f"Inserted: {stats.inserted}, updated: {stats.updated}, "
f"skipped: {stats.skipped}, failed: {stats.failed}"
13 changes: 10 additions & 3 deletions src/docfinder/embedding/encoder.py
@@ -211,12 +211,19 @@ def _log_backend_info(self) -> None:

logger.info(" | ".join(info_parts))

def embed(self, texts: Sequence[str] | Iterable[str]) -> np.ndarray:
"""Return float32 embeddings for input texts."""
def embed(
self, texts: Sequence[str] | Iterable[str], *, batch_size: int | None = None
) -> np.ndarray:
"""Return float32 embeddings for input texts.

Args:
texts: Input strings to embed.
batch_size: Override the configured batch size (useful for low-RAM scenarios).
"""
sentences = list(texts)
embeddings = self._model.encode(
sentences,
batch_size=self.config.batch_size,
batch_size=batch_size if batch_size is not None else self.config.batch_size,
show_progress_bar=False,
convert_to_numpy=True,
normalize_embeddings=self.config.normalize,