DeepDoc

DeepDoc is a full-stack, AI-powered PDF intelligence app that lets users upload documents and chat with them using retrieval-augmented generation (RAG).
It is built to demonstrate practical AI engineering skills for production use cases: document ingestion, semantic chunking, vector search, grounded LLM prompting, and scalable web delivery.

Why this project

Most "chat with PDF" demos stop at basic retrieval. DeepDoc focuses on production-minded improvements:

Better context quality with semantic chunking and chunk merging
Grounded answers constrained to retrieved evidence
Defensive reliability (timeouts, retries, and fallback behavior)
End-to-end application delivery (UI, APIs, DB, vector store, deployment)

This repository is meant to show readiness for AI roles on both the model side and the software engineering side.

Core Features

Upload and process PDF files
- Extracts raw text from PDF files
- Stores the original file in Vercel Blob
Intelligent document chunking pipeline
- Sentence-aware segmentation
- Embedding-based topic shift detection
- Similarity-based chunk merging
Embedding generation with Gemini embeddings (text-embedding-004)
- Batch embedding support
- Timeout and retry behavior
- In-memory cache to reduce duplicate embedding calls
Vector indexing and retrieval with Pinecone
- Namespace isolation per uploaded document
- Metadata sanitization before upsert
- Similarity threshold filtering and redundancy deduplication
Grounded QA with Gemini (gemini-2.5-flash)
- Strict prompt instructions to stay within retrieved context
- Graceful "insufficient context" responses
- API-side retries with exponential backoff for transient failures
Multi-chat experience
- Chat history persisted in PostgreSQL (Neon) via Drizzle ORM
- Sidebar navigation across uploaded documents
- Integrated PDF viewer + conversation panel
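
The chunking steps listed above (sentence-aware segmentation followed by embedding-based topic shift detection) can be illustrated with a minimal sketch. This is not the code in lib/chunking.ts: the function names, the regex-based sentence splitter, and the 0.5 shift threshold are all assumptions made for illustration, and the embeddings are passed in precomputed so the example runs without any API call.

```typescript
// Illustrative sketch only; names and threshold are hypothetical.

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Naive sentence splitter: text up to and including ., !, or ?.
// Abbreviations and decimals would need a more careful splitter.
function splitSentences(text: string): string[] {
  const parts = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Start a new chunk whenever the similarity between consecutive
// sentence embeddings drops below shiftThreshold.
function chunkByTopicShift(
  sentences: string[],
  embeddings: number[][],
  shiftThreshold = 0.5, // assumed value, tune per corpus
): string[] {
  if (sentences.length !== embeddings.length) {
    throw new Error("sentences and embeddings must have the same length");
  }
  const chunks: string[] = [];
  let current: string[] = [];
  for (let i = 0; i < sentences.length; i++) {
    const shifted =
      i > 0 && cosineSimilarity(embeddings[i - 1], embeddings[i]) < shiftThreshold;
    if (shifted) {
      chunks.push(current.join(" "));
      current = [];
    }
    current.push(sentences[i]);
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

A similarity-based merge pass, as mentioned in the feature list, could then compare the averaged embeddings of neighboring chunks and join those that remain close.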

AI/ML Design Highlights (for recruiters)

RAG pipeline design
- Query embedding -> vector retrieval -> dedupe/rank -> token-budget packing -> grounded generation
Retrieval quality controls
- Score thresholding + text redundancy suppression before context assembly
- Configurable topK, score thresholds, and token budget through environment variables
Robust inference behavior
- Timeout wrappers and retries for both embedding and generation calls
- Defensive request validation and upstream error handling in API routes
Prompt engineering for factuality
- Explicit grounding rules
- Clear constraints when context is missing or partial
Production-ready full-stack integration
- Next.js App Router APIs, server actions, persistent storage, cloud vector DB, and deployable infra
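
To make the retrieval quality controls concrete, here is a small sketch of score thresholding, redundancy suppression, and token-budget packing. It is an assumption-laden illustration rather than the contents of lib/context.ts: the Match shape, the function names, the exact-text dedupe (after normalization), and the four-characters-per-token estimate are all hypothetical simplifications.

```typescript
// Illustrative sketch only; names and heuristics are hypothetical.

interface Match {
  text: string;
  score: number; // similarity score returned by the vector store
}

// Rough token estimate (assumption: about 4 characters per token).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Normalize text so trivially different copies compare equal.
function normalizeText(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Filter by score, drop duplicates, and pack the best matches
// until the token budget is used up.
function buildContext(
  matches: Match[],
  scoreThreshold: number,
  maxTokens: number,
): string[] {
  const ranked = matches
    .filter((m) => m.score >= scoreThreshold)
    .sort((a, b) => b.score - a.score);

  const seen = new Set<string>();
  const packed: string[] = [];
  let usedTokens = 0;

  for (const match of ranked) {
    const key = normalizeText(match.text);
    if (seen.has(key)) continue; // redundant evidence
    const cost = estimateTokens(match.text);
    if (usedTokens + cost > maxTokens) continue; // a smaller later chunk may still fit
    seen.add(key);
    packed.push(match.text);
    usedTokens += cost;
  }
  return packed;
}
```

Skipping an oversized chunk instead of stopping at it is one possible design choice; stopping at the first chunk that does not fit would keep the context strictly in rank order.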

Tech Stack

Frontend: Next.js 15, React 19, Tailwind CSS, shadcn/ui patterns, TanStack Query
Backend: Next.js Route Handlers + Server Actions, TypeScript
AI: Google Gemini (gemini-2.5-flash, text-embedding-004)
Vector DB: Pinecone
Relational DB: Neon PostgreSQL + Drizzle ORM
File Storage: Vercel Blob
Auth (pages scaffolded): Clerk sign-in/sign-up routes

Architecture Overview

User uploads a PDF from the landing page.
Server extracts text (pdf-parse) and stores the file in Vercel Blob.
Text is chunked using sentence boundaries + semantic shift detection.
Chunk embeddings are generated and upserted to Pinecone (namespace = file key).
A chat record is created in PostgreSQL.
On each user question:
- Generate embedding for the question
- Retrieve top matches from Pinecone
- Filter and dedupe context
- Pack context within a token budget
- Ask Gemini with strict grounding instructions
Persist the user and assistant messages in PostgreSQL and render them in the chat UI.
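
The "strict grounding instructions" step can be pictured as a prompt builder along these lines. The wording of the instructions and the function name buildGroundedPrompt are assumptions for illustration; the prompt actually used in app/api/chat/route.ts is not reproduced here.

```typescript
// Illustrative sketch only; the prompt wording is hypothetical.

// Build a prompt that restricts the model to the retrieved context
// and tells it what to do when the context is insufficient.
function buildGroundedPrompt(question: string, contextChunks: string[]): string {
  const context =
    contextChunks.length > 0
      ? contextChunks.map((chunk, i) => `[${i + 1}] ${chunk}`).join("\n\n")
      : "(no relevant context retrieved)";

  return [
    "Answer the question using ONLY the context below.",
    "If the context does not contain the answer, say that the document",
    "does not provide enough information. Do not guess.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Numbering the chunks leaves room for the citation spans listed under the planned improvements, since the model can be asked to refer to chunks by index.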

Repository Structure

app/
  api/
    chat/route.ts           # grounded LLM response endpoint
    get-messages/route.ts   # chat history fetch endpoint
  chat/[id]/page.tsx        # chat workspace (sidebar + pdf + messages)
components/
  PDFUpload.tsx             # upload flow
  ChatComponent.tsx         # conversation UI and mutation flow
  ChatSideBar.tsx           # chat/document navigation
  PDFViewer.tsx             # document preview panel
lib/
  pdf-process.ts            # upload + parse + chunk + embed + index pipeline
  chunking.ts               # semantic chunking logic
  embedding.ts              # embedding generation, retries, cache
  context.ts                # retrieval, dedupe, token budget packing
  pineconedb.ts             # Pinecone upsert helpers
  db/                       # Drizzle schema + Neon client

Local Setup

1) Clone and install

git clone <your-repo-url>
cd DeepDoc
npm install

2) Configure environment variables

Create a .env file in the project root:

# Required
GEMINI_API_KEY=your_gemini_api_key
PINECONE_API_KEY=your_pinecone_api_key
DATABASE_URL=your_neon_database_url

# Optional (defaults are present in code)
PINECONE_INDEX=chatpdf
CONTEXT_TOP_K=20
CONTEXT_SCORE_THRESHOLD=0.7
CONTEXT_MAX_TOKENS=750

Also set a Vercel Blob token in your environment so that uploads work (the @vercel/blob SDK reads BLOB_READ_WRITE_TOKEN by default).
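
For reference, the optional variables could be read with fallbacks roughly as follows. This is a sketch, not the repository's code: loadRetrievalConfig and readNumber are hypothetical names, and the fallback values simply mirror the sample .env above, which may differ from the defaults in the code.

```typescript
// Illustrative sketch only; names and fallback values are assumptions.

interface RetrievalConfig {
  indexName: string;
  topK: number;
  scoreThreshold: number;
  maxTokens: number;
}

// Parse a numeric variable, falling back when it is unset or invalid.
function readNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Takes the environment as a parameter so it can be tested
// without touching the real process environment.
function loadRetrievalConfig(
  env: Record<string, string | undefined>,
): RetrievalConfig {
  return {
    indexName: env["PINECONE_INDEX"] ?? "chatpdf",
    topK: readNumber(env["CONTEXT_TOP_K"], 20),
    scoreThreshold: readNumber(env["CONTEXT_SCORE_THRESHOLD"], 0.7),
    maxTokens: readNumber(env["CONTEXT_MAX_TOKENS"], 750),
  };
}
```

In a Next.js server context this would be called as loadRetrievalConfig(process.env).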

3) Run database migration

npx drizzle-kit generate
npx drizzle-kit migrate

4) Start development server

npm run dev

The app runs at http://localhost:3000.

Key Engineering Decisions

Semantic chunking over naive fixed windows
- Improves retrieval precision by aligning chunks to meaning boundaries.
Two-stage context hygiene (threshold + dedupe)
- Reduces noise and repeated evidence before generation.
Token-budgeted context packing
- Keeps prompts efficient and predictable under model limits.
Resilience-first API handling
- Retries and timeouts keep transient provider or network failures from surfacing as user-facing errors.
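
The resilience-first handling can be sketched as a timeout wrapper combined with retries and exponential backoff. The helper names, the 250 ms base delay, the 4 s cap, and the attempt count are assumptions for illustration; the repository's own values and error classification may differ.

```typescript
// Illustrative sketch only; names and timing values are hypothetical.

// Delay before the next attempt: base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 250, maxMs = 4000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Reject if the promise does not settle within ms milliseconds.
// Note: this does not cancel the underlying work.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms} ms`)),
      ms,
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Run fn up to maxAttempts times, waiting with exponential backoff
// between attempts. This sketch retries on every error; a real
// implementation would likely retry only transient ones.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  timeoutMs = 10000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await withTimeout(fn(), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

An embedding or generation call would then be wrapped as withRetry(() => callProvider(input)), where callProvider stands in for whatever client function is in use.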

Known Limitations / Next Improvements

Add citation spans and source highlighting in final answers
Add automated RAG evaluation set (faithfulness + answer relevance)
Add background job queue for large PDF ingestion
Add multi-tenant auth isolation across chats
Add streaming response UX for long completions

Portfolio Value for AI Roles

DeepDoc demonstrates readiness for roles such as:

AI Engineer
LLM Engineer
Applied ML Engineer
GenAI Full-Stack Engineer

Evidence shown in this project:

Practical RAG architecture and implementation
Embedding and retrieval optimization mindset
Prompt grounding and hallucination control
End-to-end product thinking from model to deployment
Strong TypeScript/Next.js engineering execution around AI systems

License

MIT (or your preferred license).
