LLM Document Chat

Take-Home Project: LLM-Powered Data Extraction & Summarization

A Next.js service that lets users upload documents, automatically classifies document types (invoices, purchase orders, receipts, etc.), and extracts key structured data using OpenAI's Vision API. The extracted data is stored in a database and accessible via retrieval APIs.
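
For illustration, here is a minimal sketch of how a document image could be classified with the OpenAI Node SDK's vision-capable chat completions. It is not taken from this repo: the model name, prompt, and label set are assumptions based on the feature list below, and the actual pipeline may differ.

import OpenAI from "openai";

// Minimal classification sketch (illustrative; not the repo's actual implementation).
// Assumes OPENAI_API_KEY is set in the environment.
const openai = new OpenAI();

async function classifyDocument(base64Png: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o", // model name is an assumption; the README only says "GPT-4 Vision"
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Classify this document as one of: invoice, purchase_order, receipt, contract, report, form, letter, other. Reply with the label only.",
          },
          { type: "image_url", image_url: { url: `data:image/png;base64,${base64Png}` } },
        ],
      },
    ],
  });
  return completion.choices[0].message.content?.trim() ?? "other";
}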

Features

Core Requirements ✅

  • Document Upload: File upload mechanism supporting PDFs, images, text files, and CSV files
  • Document Classification: Automatically classifies documents as invoices, purchase orders, receipts, contracts, reports, forms, letters, or other
  • Structured Data Extraction: Extracts relevant fields based on document type (see the schema sketch after this list):
    • Invoices: Invoice number, date, vendor, total, tax, bill-to information, line items
    • Purchase Orders: PO number, date, vendor, buyer, total, items, delivery date, terms
    • Receipts: Store, date, total, tax, payment method, items
    • Contracts: Parties, date, title, value, terms, duration
  • Database Storage: All extracted data is saved to a SQLite database with Prisma ORM
  • Retrieval APIs: Complete set of endpoints for document management and search
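
As a hedged sketch, an extraction schema for invoices might look roughly like this with Zod; the field names here are illustrative, not the repo's actual schema.

import { z } from "zod";

// Illustrative invoice schema; the repo's real field names may differ.
const lineItemSchema = z.object({
  description: z.string(),
  quantity: z.number().optional(),
  unitPrice: z.number().optional(),
  amount: z.number().optional(),
});

const invoiceSchema = z.object({
  invoiceNumber: z.string(),
  date: z.string(),               // e.g. ISO 8601 date string
  vendor: z.string(),
  total: z.number(),
  tax: z.number().optional(),
  billTo: z.string().optional(),
  lineItems: z.array(lineItemSchema).default([]),
});

type Invoice = z.infer<typeof invoiceSchema>;

// Validate the model's JSON output before persisting it to the database.
export function parseInvoice(raw: unknown): Invoice {
  return invoiceSchema.parse(raw);
}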

Additional Features

  • Vector Search: Semantic search across document content using OpenAI vector stores
  • Real-time Progress: SSE-based progress tracking for document processing (see the subscription sketch after this list)
  • Structured Logging: Comprehensive logging with Pino for debugging and monitoring
  • Type Safety: Full TypeScript implementation with Zod validation
  • Testing: Jest test suite with API route coverage
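
As an example of the SSE-based progress tracking mentioned above, a client might subscribe roughly like this; the endpoint path and payload shape are assumptions, not taken from this repo.

// Client-side sketch (e.g. inside a React component); endpoint and payload are hypothetical.
const documentId = "example-document-id";
const source = new EventSource(`/api/upload/progress?documentId=${documentId}`);

source.onmessage = (event) => {
  const progress = JSON.parse(event.data) as { stage: string; percent: number };
  console.log(`${progress.stage}: ${progress.percent}%`);
  if (progress.percent >= 100) source.close(); // stop listening once processing finishes
};

source.onerror = () => source.close();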

Getting Started

Prerequisites

  • Node.js 18+ and npm
  • OpenAI API key
  • (Optional) Poppler tools for PDF processing (pdftocairo)
  • (Optional) Redis/Upstash for caching

Setup

  1. Clone the repository:
git clone https://github.com/wolyslager/llm-document-chat.git
cd llm-document-chat
  2. Install dependencies:
npm install
  3. Set up environment variables:
cp env.example .env.local
# Edit .env.local with your OpenAI API key and other settings
  4. Set up the database:
npx prisma migrate dev --name init
  5. Create a vector store:
node setup-vector-store.js
  6. Run the development server:
npm run dev

Open http://localhost:3000 in your browser to see the application.

Project Structure

  • src/app/api/ - Next.js API routes for document upload, search, and management
  • src/lib/ - Core utilities (OpenAI, database, logging, validation)
  • src/components/ - React components for the UI
  • scripts/ - Utility scripts for cleanup and setup
  • prisma/ - Database schema and migrations
  • src/__tests__/ - Jest test suite

API Routes

Document Processing

  • POST /api/upload - Upload documents for classification and data extraction
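
A hedged example of calling the upload endpoint from the browser; the form field name and response shape are assumptions.

// Illustrative client helper; the actual field name and response shape may differ.
async function uploadDocument(file: File) {
  const form = new FormData();
  form.append("file", file); // "file" is an assumed field name

  const res = await fetch("/api/upload", { method: "POST", body: form });
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
  return res.json(); // e.g. classification result and extracted fields
}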

Retrieval APIs (Core Requirement)

  • GET /api/documents - Fetch the list of uploaded documents with metadata and extracted data summaries
  • GET /api/documents/[id] - Fetch a single document with all relevant information, including the full extracted data
  • POST /api/search - Semantic vector search across documents; also answers natural-language questions about document content
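
A sketch of querying the search endpoint; the request body shape ({ query }) is an assumption.

// Illustrative search call; the real request/response contract may differ.
async function searchDocuments(query: string) {
  const res = await fetch("/api/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  if (!res.ok) throw new Error(`Search failed: ${res.status}`);
  return res.json(); // matching documents and/or a natural-language answer
}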

Vector Store Management

  • GET /api/vector-stores - List available vector stores
  • POST /api/vector-stores - Create a new vector store
  • DELETE /api/documents/[id] - Remove a document and clean up its associated data

Scripts

  • npm run dev - Start development server
  • npm run build - Build for production
  • npm run test - Run test suite
  • npm run lint - Run ESLint
  • npm run type-check - Run TypeScript compiler
  • ./scripts/cleanup-all.js - Clean database and vector stores
  • ./scripts/cleanup-database.js - Clean database only
  • ./scripts/cleanup-vector-store.js - Clean vector store only

Implementation Approach

This solution demonstrates clean, well-structured code following best practices as requested:

  1. Modular Architecture: Separation of concerns with dedicated modules for database, OpenAI integration, validation, and error handling
  2. Type Safety: Full TypeScript implementation with Zod schemas for runtime validation
  3. Error Handling: Unified error handling with structured JSON responses and proper HTTP status codes (see the sketch after this list)
  4. Testing: Comprehensive Jest test suite covering API routes and core functionality
  5. Logging: Structured logging with Pino for debugging and monitoring
  6. Code Quality: ESLint configuration and consistent coding patterns
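
For point 3, a unified error helper for App Router routes might look roughly like this; the names and response shapes are illustrative, not the repo's actual code.

import { NextResponse } from "next/server";
import { ZodError } from "zod";

// Illustrative helper: map thrown errors to structured JSON responses.
export function toErrorResponse(error: unknown) {
  if (error instanceof ZodError) {
    // Validation failures become 400s with the list of issues.
    return NextResponse.json(
      { error: "Validation failed", issues: error.issues },
      { status: 400 }
    );
  }
  const message = error instanceof Error ? error.message : "Internal server error";
  return NextResponse.json({ error: message }, { status: 500 });
}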

Technologies

  • Framework: Next.js 14 with App Router
  • Database: SQLite with Prisma ORM
  • AI/ML: OpenAI GPT-4 Vision API, Vector Stores, Assistants API
  • Validation: Zod schemas
  • Logging: Pino structured logging
  • Testing: Jest with API route testing
  • Styling: Tailwind CSS
  • TypeScript: Full type safety throughout

License

MIT
