Vectorless, reasoning-based RAG using hierarchical document indexing with Vertex AI
Why chunk and embed when you can reason and structure?
PageIndex builds semantic tree structures from documents without embeddings or vector databases. Instead of chunking and embedding, it uses LLM reasoning to extract hierarchical structure, making document navigation and retrieval more intuitive.
Note: This is an independent implementation inspired by the PageIndex framework by VectifyAI. While the original uses OpenAI, this implementation uses Google Vertex AI (Gemini) and adds features like batch processing, repository indexing, and CLI tooling.
- PDF Processing - Extracts table of contents, detects document structure, and builds hierarchical trees with page-level precision
- Markdown Processing - Parses header hierarchy into navigable tree structures
- Batch Processing - Process entire folders of documents concurrently
- Repository Indexing - Generate semantic summaries for codebases
- Format Conversion - Convert DOCX, PPTX, HTML, and images via docling
PageIndex excels at structured, hierarchical documents but isn't the right tool for every use case:
| Use Case | Why PageIndex May Not Be Ideal | Better Alternative |
|---|---|---|
| Short documents (< 10 pages) | Overhead of tree construction isn't worth it | Direct LLM context or simple chunking |
| Unstructured content (chat logs, social media) | No inherent hierarchy to extract | Vector search with semantic embeddings |
| High-volume real-time queries | LLM reasoning per query adds latency | Pre-computed vector indices |
| Keyword/exact match search | PageIndex focuses on semantic structure | Full-text search (Elasticsearch, etc.) |
| Frequently updated documents | Tree must be regenerated on each change | Incremental vector indexing |
| Multi-document corpus search | Designed for single-document navigation | Vector DB with cross-document retrieval |
| Cost-sensitive applications | Each indexing run uses LLM API calls | One-time embedding generation |
- Documents have clear hierarchical structure (reports, manuals, textbooks, legal docs)
- You need explainable, traceable retrieval with section/page references
- Accuracy matters more than speed (financial analysis, compliance, research)
- Documents are long (50+ pages) where vector chunking loses context
- You want human-like navigation through complex documents
Clone the repository and install in editable mode:
```bash
git clone https://github.com/NP-compete/pageindex.git
cd pageindex
pip install -e .
```

With document conversion support:

```bash
pip install -e ".[docling]"
```

For development:

```bash
pip install -e ".[dev]"
```

Process a PDF:

```bash
pageindex pdf document.pdf --project-id your-gcp-project
```

Process a Markdown file:

```bash
pageindex md document.md --project-id your-gcp-project
```

Process all documents in a folder:

```bash
pageindex folder ./docs --project-id your-gcp-project
```

Index a code repository:

```bash
pageindex repo ./my-project --project-id your-gcp-project
```

```python
from pageindex import page_index, md_to_tree, process_folder_sync, index_repository_sync

# Process a PDF
result = page_index(
    "document.pdf",
    project_id="your-gcp-project",
    model="gemini-1.5-flash",
)

# Process Markdown
import asyncio
from pageindex import md_to_tree, PageIndexConfig

config = PageIndexConfig(project_id="your-gcp-project")
result = asyncio.run(md_to_tree("document.md", config=config))

# Batch process a folder
result = process_folder_sync(
    "./docs",
    project_id="your-gcp-project",
    max_concurrent=5,
)

# Index a repository
result = index_repository_sync(
    "./my-project",
    project_id="your-gcp-project",
    add_summaries=True,
)
```

Set your Google Cloud project ID via environment variable:

```bash
export PAGEINDEX_PROJECT_ID=your-gcp-project
```

Or pass it directly to commands and functions.
| Option | Description | Default |
|---|---|---|
| `--project-id`, `-p` | Google Cloud project ID | `PAGEINDEX_PROJECT_ID` env |
| `--location`, `-l` | Vertex AI location | `us-central1` |
| `--model`, `-m` | Gemini model | `gemini-1.5-flash` |
| `--output`, `-o` | Output file/directory | `./results/` |
| `--add-summary`/`--no-summary` | Generate node summaries | varies by command |
| `--add-text`/`--no-text` | Include full text in nodes | `--no-text` |
| `--add-node-id`/`--no-node-id` | Add hierarchical node IDs | `--add-node-id` |
PDF-specific options:

| Option | Description | Default |
|---|---|---|
| `--toc-check-pages` | Pages to scan for TOC | `20` |
| `--max-pages-per-node` | Max pages before splitting | `10` |
| `--max-tokens-per-node` | Max tokens before splitting | `20000` |
Folder-processing options:

| Option | Description | Default |
|---|---|---|
| `--max-concurrent`, `-c` | Concurrent processing tasks | `5` |
| `--convert`/`--no-convert` | Convert unsupported formats | `--convert` |
| `--docling-serve-url` | Remote docling-serve API URL | `None` |
| `--docling-serve-timeout` | API timeout (seconds) | `300` |
Repository-indexing options:

| Option | Description | Default |
|---|---|---|
| `--summaries`/`--no-summaries` | Generate directory summaries | `--summaries` |
| `--include`, `-i` | File patterns to include | See defaults |
| `--exclude`, `-e` | Patterns to exclude | See defaults |
| `--max-depth` | Tree display depth | `4` |
PageIndex outputs JSON with a hierarchical structure:
```json
{
  "doc_name": "example",
  "doc_description": "A technical guide covering...",
  "structure": [
    {
      "title": "Introduction",
      "node_id": "0001",
      "summary": "Overview of the document...",
      "start_index": 1,
      "end_index": 5,
      "nodes": [
        {
          "title": "Background",
          "node_id": "0001.0001",
          "summary": "Historical context...",
          "start_index": 2,
          "end_index": 4
        }
      ]
    }
  ]
}
```

PageIndex supports converting various formats to Markdown using docling:

Supported formats: DOCX, PPTX, XLSX, HTML, PNG, JPG, TIFF, BMP

Using a remote docling-serve instance:

```bash
# Start docling-serve
docker run -p 5001:5001 quay.io/docling-project/docling-serve

# Process with remote conversion
pageindex folder ./docs --docling-serve-url http://localhost:5001
```

Using local conversion:

```bash
pip install "pageindex[docling]"
pageindex folder ./docs --convert
```

For PDFs, processing proceeds in stages:

- TOC Detection - Scans initial pages for table of contents
- Structure Extraction - Uses LLM to extract hierarchical structure from TOC or content
- Page Mapping - Maps logical sections to physical page numbers
- Verification - Validates extracted structure against actual content
- Large Node Splitting - Recursively splits oversized sections
- Summary Generation - Optionally generates summaries for each node
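The tree these steps produce (in the JSON output format shown above) can then be searched with plain recursion and no vector index. A minimal sketch, where `find_nodes` is an illustrative helper rather than a package API:

```python
# Hypothetical sketch: depth-first search over a PageIndex result tree.
# The node fields ("title", "node_id", "start_index", "nodes") follow the
# JSON output format documented above.

def find_nodes(nodes, predicate):
    """Yield every node in the tree for which predicate(node) is true."""
    for node in nodes:
        if predicate(node):
            yield node
        yield from find_nodes(node.get("nodes", []), predicate)

tree = [
    {
        "title": "Introduction",
        "node_id": "0001",
        "start_index": 1,
        "end_index": 5,
        "nodes": [
            {"title": "Background", "node_id": "0001.0001",
             "start_index": 2, "end_index": 4},
        ],
    }
]

hits = list(find_nodes(tree, lambda n: "Background" in n["title"]))
# hits[0]["node_id"] == "0001.0001"
```

Because every hit carries `start_index`/`end_index`, retrieval stays traceable back to physical pages.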
For Markdown documents:

- Header Extraction - Parses markdown headers (H1-H6)
- Tree Building - Constructs hierarchy based on header levels
- Tree Thinning - Optionally merges small nodes
- Summary Generation - Optionally summarizes each section
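The header-to-tree step can be sketched with a simple level stack: a deeper header becomes a child of the stack top, while a shallower or equal header pops back up first. This simplified illustration is not the package's actual implementation (for one thing, it does not skip headers inside fenced code blocks):

```python
import re

def build_tree(md_text):
    """Build a nested tree from ATX markdown headers using a level stack."""
    root = {"level": 0, "title": "root", "nodes": []}
    stack = [root]
    for line in md_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if not m:
            continue  # not a header line
        level = len(m.group(1))
        node = {"level": level, "title": m.group(2).strip(), "nodes": []}
        # Pop until the stack top is a strict ancestor (shallower level).
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["nodes"].append(node)
        stack.append(node)
    return root["nodes"]

tree = build_tree("# A\nintro text\n## B\n## C\n# D\n")
# tree has two top-level nodes ("A", "D"); "A" has children "B" and "C"
```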
For code repositories:

- Directory Scanning - Walks repository respecting include/exclude patterns
- Context Building - Reads README files and key entry points
- Summary Generation - Uses LLM to summarize each directory's purpose
- Tree Construction - Builds navigable directory tree with metadata
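The include/exclude filtering in the scanning step can be sketched with fnmatch-style globs, with exclusions taking priority. The `should_index` helper and its default patterns below are illustrative, not the package's actual defaults:

```python
import fnmatch
from pathlib import PurePosixPath

def should_index(path, include=("*.py", "*.md"),
                 exclude=("*/node_modules/*", "*.pyc")):
    """Return True if a file path passes the include/exclude filters.

    Exclude patterns win over include patterns; all patterns are
    fnmatch-style globs matched against the POSIX form of the path.
    """
    p = str(PurePosixPath(path))
    if any(fnmatch.fnmatch(p, pat) for pat in exclude):
        return False
    return any(fnmatch.fnmatch(p, pat) for pat in include)

# Example decisions with the illustrative defaults:
# should_index("src/app.py")                  -> True
# should_index("lib/node_modules/x/app.py")   -> False (excluded)
# should_index("notes.md")                    -> True
# should_index("cache/app.pyc")               -> False (excluded)
```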
- Python 3.10+
- Google Cloud project with Vertex AI API enabled
- Authentication via `gcloud auth application-default login` or a service account
- PageIndex by VectifyAI - The original PageIndex framework for vectorless, reasoning-based RAG using OpenAI
- PageIndex.ai - Commercial platform for human-like document AI by VectifyAI
MIT - see LICENSE for details.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
This project is inspired by the PageIndex framework developed by VectifyAI. Their research on vectorless, reasoning-based RAG demonstrates that similarity ≠ relevance — true document retrieval requires reasoning, not just embedding similarity.
Soham Dutta (@NP-compete)