PageIndex

Vectorless, reasoning-based RAG using hierarchical document indexing with Vertex AI

Why chunk and embed when you can reason and structure?

PageIndex builds semantic tree structures from documents without embeddings or vector databases. Instead of chunking and embedding, it uses LLM reasoning to extract hierarchical structure, making document navigation and retrieval more intuitive.

Note: This is an independent implementation inspired by the PageIndex framework by VectifyAI. While the original uses OpenAI, this implementation uses Google Vertex AI (Gemini) and adds features like batch processing, repository indexing, and CLI tooling.

Features

PDF Processing - Extracts table of contents, detects document structure, and builds hierarchical trees with page-level precision
Markdown Processing - Parses header hierarchy into navigable tree structures
Batch Processing - Process entire folders of documents concurrently
Repository Indexing - Generate semantic summaries for codebases
Format Conversion - Convert DOCX, PPTX, HTML, and images via docling

When NOT to Use PageIndex

PageIndex excels at structured, hierarchical documents but isn't the right tool for every use case:

Use Case	Why PageIndex May Not Be Ideal	Better Alternative
Short documents (< 10 pages)	Overhead of tree construction isn't worth it	Direct LLM context or simple chunking
Unstructured content (chat logs, social media)	No inherent hierarchy to extract	Vector search with semantic embeddings
High-volume real-time queries	LLM reasoning per query adds latency	Pre-computed vector indices
Keyword/exact match search	PageIndex focuses on semantic structure	Full-text search (Elasticsearch, etc.)
Frequently updated documents	Tree must be regenerated on each change	Incremental vector indexing
Multi-document corpus search	Designed for single-document navigation	Vector DB with cross-document retrieval
Cost-sensitive applications	Each indexing run uses LLM API calls	One-time embedding generation

PageIndex Shines When:

Documents have clear hierarchical structure (reports, manuals, textbooks, legal docs)
You need explainable, traceable retrieval with section/page references
Accuracy matters more than speed (financial analysis, compliance, research)
Documents are long (50+ pages), where fixed-size vector chunking tends to lose surrounding context
You want human-like navigation through complex documents

Installation

Clone the repository and install in editable mode:

git clone https://github.com/NP-compete/pageindex.git
cd pageindex
pip install -e .

With document conversion support:

pip install -e ".[docling]"

For development:

pip install -e ".[dev]"

Quick Start

CLI Usage

Process a PDF:

pageindex pdf document.pdf --project-id your-gcp-project

Process a Markdown file:

pageindex md document.md --project-id your-gcp-project

Process all documents in a folder:

pageindex folder ./docs --project-id your-gcp-project

Index a code repository:

pageindex repo ./my-project --project-id your-gcp-project

Python API

import asyncio

from pageindex import (
    PageIndexConfig,
    index_repository_sync,
    md_to_tree,
    page_index,
    process_folder_sync,
)

# Process a PDF
result = page_index(
    "document.pdf",
    project_id="your-gcp-project",
    model="gemini-1.5-flash",
)

# Process Markdown (md_to_tree is async, so run it in an event loop)
config = PageIndexConfig(project_id="your-gcp-project")
result = asyncio.run(md_to_tree("document.md", config=config))

# Batch process a folder
result = process_folder_sync(
    "./docs",
    project_id="your-gcp-project",
    max_concurrent=5,
)

# Index a repository
result = index_repository_sync(
    "./my-project",
    project_id="your-gcp-project",
    add_summaries=True,
)

Configuration

Set your Google Cloud project ID via environment variable:

export PAGEINDEX_PROJECT_ID=your-gcp-project

Or pass it directly to commands and functions.

CLI Options

Option	Description	Default
`--project-id`, `-p`	Google Cloud project ID	`PAGEINDEX_PROJECT_ID` env
`--location`, `-l`	Vertex AI location	`us-central1`
`--model`, `-m`	Gemini model	`gemini-1.5-flash`
`--output`, `-o`	Output file/directory	`./results/`
`--add-summary/--no-summary`	Generate node summaries	varies by command
`--add-text/--no-text`	Include full text in nodes	`--no-text`
`--add-node-id/--no-node-id`	Add hierarchical node IDs	`--add-node-id`

PDF-Specific Options

Option	Description	Default
`--toc-check-pages`	Pages to scan for TOC	`20`
`--max-pages-per-node`	Max pages before splitting	`10`
`--max-tokens-per-node`	Max tokens before splitting	`20000`

Folder Processing Options

Option	Description	Default
`--max-concurrent`, `-c`	Concurrent processing tasks	`5`
`--convert/--no-convert`	Convert unsupported formats	`--convert`
`--docling-serve-url`	Remote docling-serve API URL	None
`--docling-serve-timeout`	API timeout (seconds)	`300`
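
To illustrate what a concurrency cap such as `--max-concurrent` means in practice, here is a minimal sketch of bounded concurrency using `asyncio.Semaphore`. It is not PageIndex's internal code: `process_all` and `fake_worker` are hypothetical names, and the worker only sleeps instead of calling Vertex AI.

```python
import asyncio


async def process_all(paths, worker, max_concurrent=5):
    """Run worker(path) for every path, with at most max_concurrent running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(path):
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await worker(path)

    # gather() preserves input order; failures come back as exception objects
    # so that one bad document does not cancel the rest of the batch.
    return await asyncio.gather(*(guarded(p) for p in paths), return_exceptions=True)


stats = {"active": 0, "peak": 0}


async def fake_worker(path):
    """Hypothetical stand-in for indexing one document."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # simulate a slow LLM call
    stats["active"] -= 1
    return f"indexed:{path}"


results = asyncio.run(
    process_all([f"doc{i}.pdf" for i in range(12)], fake_worker, max_concurrent=5)
)
print(len(results), "documents processed; peak concurrency:", stats["peak"])
```

A lower cap reduces pressure on API quotas at the cost of a longer total run time.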

Repository Indexing Options

Option	Description	Default
`--summaries/--no-summaries`	Generate directory summaries	`--summaries`
`--include`, `-i`	File patterns to include	See defaults
`--exclude`, `-e`	Patterns to exclude	See defaults
`--max-depth`	Tree display depth	`4`

Output Format

PageIndex outputs JSON with a hierarchical structure:

{
  "doc_name": "example",
  "doc_description": "A technical guide covering...",
  "structure": [
    {
      "title": "Introduction",
      "node_id": "0001",
      "summary": "Overview of the document...",
      "start_index": 1,
      "end_index": 5,
      "nodes": [
        {
          "title": "Background",
          "node_id": "0001.0001",
          "summary": "Historical context...",
          "start_index": 2,
          "end_index": 4
        }
      ]
    }
  ]
}
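
As a minimal sketch of consuming this output, the snippet below walks the tree and looks up which sections cover a given page. It relies only on the fields shown above. The helper names `iter_nodes` and `find_sections_for_page` are illustrative rather than part of the package, and the code assumes `start_index` and `end_index` are inclusive page numbers.

```python
import json
from typing import Iterator


def iter_nodes(nodes: list[dict], depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (parents before children)."""
    for node in nodes:
        yield depth, node
        # Leaf nodes may omit the "nodes" key entirely.
        yield from iter_nodes(node.get("nodes", []), depth + 1)


def find_sections_for_page(tree: dict, page: int) -> list[str]:
    """Return titles of all nodes whose page range covers `page`, outermost first."""
    return [
        node["title"]
        for _, node in iter_nodes(tree["structure"])
        if node["start_index"] <= page <= node["end_index"]  # assumed inclusive
    ]


sample = json.loads("""
{
  "doc_name": "example",
  "structure": [
    {
      "title": "Introduction", "node_id": "0001",
      "start_index": 1, "end_index": 5,
      "nodes": [
        {"title": "Background", "node_id": "0001.0001",
         "start_index": 2, "end_index": 4}
      ]
    }
  ]
}
""")

# Print an indented outline with page ranges.
for depth, node in iter_nodes(sample["structure"]):
    pages = f'{node["start_index"]}-{node["end_index"]}'
    print("  " * depth + f'{node["node_id"]} {node["title"]} (pages {pages})')

print(find_sections_for_page(sample, 3))
```

In real use, replace `sample` with the result of `json.load` on a file written to the output directory.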

Document Conversion

PageIndex supports converting various formats to Markdown using docling:

Supported formats: DOCX, PPTX, XLSX, HTML, PNG, JPG, TIFF, BMP

Using docling-serve (recommended for production)

# Start docling-serve
docker run -p 5001:5001 quay.io/docling-project/docling-serve

# Process with remote conversion
pageindex folder ./docs --docling-serve-url http://localhost:5001

Using local docling

pip install -e ".[docling]"
pageindex folder ./docs --convert

How It Works

PDF Processing Pipeline

TOC Detection - Scans initial pages for table of contents
Structure Extraction - Uses LLM to extract hierarchical structure from TOC or content
Page Mapping - Maps logical sections to physical page numbers
Verification - Validates extracted structure against actual content
Large Node Splitting - Recursively splits oversized sections
Summary Generation - Optionally generates summaries for each node
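
The large-node splitting step can be pictured with the following sketch. It is an assumption-laden illustration, not the project's implementation: the `Node` class and `split_large_node` function are hypothetical, and oversized leaves are simply halved by page range, whereas the real pipeline presumably asks the LLM for meaningful sub-sections. The default threshold mirrors `--max-pages-per-node`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    start_index: int  # first page, inclusive (assumed)
    end_index: int    # last page, inclusive (assumed)
    nodes: list["Node"] = field(default_factory=list)


def split_large_node(node: Node, max_pages: int = 10) -> Node:
    """Recursively split leaf nodes that span more than max_pages pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    if node.nodes:
        # Already has sub-structure: only descend into the children.
        for child in node.nodes:
            split_large_node(child, max_pages)
        return node
    page_count = node.end_index - node.start_index + 1
    if page_count <= max_pages:
        return node
    # Halve the page range (a placeholder for LLM-derived sub-sections).
    mid = node.start_index + page_count // 2 - 1  # last page of the first half
    node.nodes = [
        Node(f"{node.title} (part 1)", node.start_index, mid),
        Node(f"{node.title} (part 2)", mid + 1, node.end_index),
    ]
    for child in node.nodes:
        split_large_node(child, max_pages)
    return node


def leaf_ranges(node: Node) -> list[tuple[int, int]]:
    """Collect the (start, end) page ranges of all leaves, in order."""
    if not node.nodes:
        return [(node.start_index, node.end_index)]
    return [r for child in node.nodes for r in leaf_ranges(child)]


chapter = split_large_node(Node("Chapter 3", 1, 25), max_pages=10)
print(leaf_ranges(chapter))
```

Whatever the splitting strategy, every leaf should end up within the page limit and the leaves together should still cover the original range without gaps.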

Markdown Processing

Header Extraction - Parses markdown headers (H1-H6)
Tree Building - Constructs hierarchy based on header levels
Tree Thinning - Optionally merges small nodes
Summary Generation - Optionally summarizes each section
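
The first two steps, header extraction and tree building, can be sketched with a regular expression and a stack. This is an illustrative reconstruction, not the package's code: `md_headers_to_tree` is a hypothetical name, the node ID scheme is inferred from the sample output (`0001`, `0001.0001`), and only ATX-style `#` headers are handled.

```python
import re

FENCE = "`" * 3  # code-fence marker, built this way to keep this example fence-safe
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def md_headers_to_tree(markdown: str) -> list[dict]:
    """Build a nested list of {title, node_id, nodes} dicts from H1-H6 headers."""
    roots: list[dict] = []
    stack: list[tuple[int, dict]] = []  # (header level, node) for each open ancestor
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip fenced code, where '#' starts a comment
            continue
        if in_fence:
            continue
        match = HEADER_RE.match(line)
        if match is None:
            continue
        level = len(match.group(1))
        # Close every open section at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1] if stack else None
        siblings = parent["nodes"] if parent is not None else roots
        ordinal = f"{len(siblings) + 1:04d}"
        node = {
            "title": match.group(2),
            "node_id": f"{parent['node_id']}.{ordinal}" if parent is not None else ordinal,
            "nodes": [],
        }
        siblings.append(node)
        stack.append((level, node))
    return roots


def outline(nodes: list[dict], depth: int = 0) -> None:
    """Print the tree as an indented outline."""
    for node in nodes:
        print("  " * depth + f'{node["node_id"]} {node["title"]}')
        outline(node["nodes"], depth + 1)


sample = "\n".join([
    "# Guide",
    "## Install",
    FENCE + "shell",
    "# a shell comment, not a header",
    FENCE,
    "### Linux",
    "## Usage",
    "# Appendix",
])
tree = md_headers_to_tree(sample)
outline(tree)
```

The stack always holds the chain of currently open ancestors, so a header attaches to the nearest preceding header of a shallower level even when levels are skipped.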

Repository Indexing

Directory Scanning - Walks repository respecting include/exclude patterns
Context Building - Reads README files and key entry points
Summary Generation - Uses LLM to summarize each directory's purpose
Tree Construction - Builds navigable directory tree with metadata
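
The directory-scanning and tree-construction steps can be approximated with the standard library alone. The sketch below is hypothetical: `scan_repository` is not the package's function, and the include/exclude patterns are placeholder values, since the actual defaults are not listed here. The LLM summary step is omitted.

```python
import fnmatch
import tempfile
from pathlib import Path


def scan_repository(
    root,
    include=("*.py", "*.md"),                          # placeholder patterns
    exclude=(".git", "__pycache__", "node_modules"),   # placeholder patterns
) -> dict:
    """Return a nested {name, files, dirs} tree of files matching `include`."""

    def build(directory: Path) -> dict:
        node = {"name": directory.name, "files": [], "dirs": []}
        for entry in sorted(directory.iterdir()):
            if entry.is_symlink():
                continue  # avoid following links out of the repo or into cycles
            if any(fnmatch.fnmatch(entry.name, pattern) for pattern in exclude):
                continue
            if entry.is_dir():
                node["dirs"].append(build(entry))
            elif any(fnmatch.fnmatch(entry.name, pattern) for pattern in include):
                node["files"].append(entry.name)
        return node

    return build(Path(root))


# Demo on a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "src").mkdir()
    (repo / "src" / "main.py").write_text("print('hi')\n")
    (repo / ".git").mkdir()
    (repo / ".git" / "config").write_text("")
    (repo / "README.md").write_text("# Demo\n")
    (repo / "notes.txt").write_text("scratch\n")
    tree = scan_repository(repo)

print(tree["files"], [d["name"] for d in tree["dirs"]])
```

A summary pass could then visit each directory node bottom-up, so that a parent's summary can draw on those of its children.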

Requirements

Python 3.10+
Google Cloud project with Vertex AI API enabled
Authentication via `gcloud auth application-default login` or a service account

Related Projects

PageIndex by VectifyAI - The original PageIndex framework for vectorless, reasoning-based RAG using OpenAI
PageIndex.ai - Commercial platform for human-like document AI by VectifyAI

License

MIT - see LICENSE for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Acknowledgments

This project is inspired by the PageIndex framework developed by VectifyAI. Their research on vectorless, reasoning-based RAG demonstrates that similarity ≠ relevance — true document retrieval requires reasoning, not just embedding similarity.

Author

Soham Dutta (@NP-compete)
