Take-Home Project: LLM-Powered Data Extraction & Summarization
A Next.js service that lets users upload documents, automatically classifies document types (invoices, purchase orders, receipts, etc.), and extracts key structured data using OpenAI's Vision API. The extracted data is stored in a database and accessible via retrieval APIs.
- Document Upload: File upload mechanism supporting PDFs, images, text files, and CSV files
- Document Classification: Automatically classifies documents as invoices, purchase orders, receipts, contracts, reports, forms, letters, or other
- Structured Data Extraction: Extracts relevant fields based on document type:
- Invoices: Invoice number, date, vendor, total, tax, bill-to information, line items
- Purchase Orders: PO number, date, vendor, buyer, total, items, delivery date, terms
- Receipts: Store, date, total, tax, payment method, items
- Contracts: Parties, date, title, value, terms, duration
- Database Storage: All extracted data is saved to a SQLite database with Prisma ORM
- Retrieval APIs: Complete set of endpoints for document management and search
- Vector Search: Semantic search across document content using OpenAI vector stores
- Real-time Progress: SSE-based progress tracking for document processing
- Structured Logging: Comprehensive logging with Pino for debugging and monitoring
- Type Safety: Full TypeScript implementation with Zod validation
- Testing: Jest test suite with API route coverage
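To make the type-specific extraction concrete, here is a minimal sketch of how extracted fields could be modeled as a discriminated union and checked at runtime. The project itself uses Zod for validation; this sketch uses a hand-written validator only to stay dependency-free. Every field name (`invoiceNumber`, `paymentMethod`, and so on) and the `parseExtracted` helper are assumptions inferred from the feature list, not the repository's actual schema, and only two of the document types are shown.

```typescript
// Hypothetical shapes inferred from the feature list; the real Zod schemas may differ.
type LineItem = { description: string; quantity: number; unitPrice: number };

type Invoice = {
  type: "invoice";
  invoiceNumber: string;
  date: string;
  vendor: string;
  total: number;
  tax: number;
  lineItems: LineItem[];
};

type Receipt = {
  type: "receipt";
  store: string;
  date: string;
  total: number;
  tax: number;
  paymentMethod: string;
};

type Extracted = Invoice | Receipt;

const isStr = (v: unknown): v is string => typeof v === "string";
const isNum = (v: unknown): v is number => typeof v === "number" && isFinite(v);

function isLineItem(v: unknown): v is LineItem {
  if (typeof v !== "object" || v === null) return false;
  const { description, quantity, unitPrice } = v as Record<string, unknown>;
  return isStr(description) && isNum(quantity) && isNum(unitPrice);
}

// Narrow an untrusted model response to a known document shape, or return null.
function parseExtracted(raw: unknown): Extracted | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  const { date, total, tax } = r;
  // Fields shared by both document types in this sketch.
  if (!isStr(date) || !isNum(total) || !isNum(tax)) return null;

  if (r.type === "invoice") {
    const { invoiceNumber, vendor, lineItems } = r;
    if (!isStr(invoiceNumber) || !isStr(vendor)) return null;
    if (!Array.isArray(lineItems) || !lineItems.every(isLineItem)) return null;
    return { type: "invoice", invoiceNumber, date, vendor, total, tax, lineItems };
  }
  if (r.type === "receipt") {
    const { store, paymentMethod } = r;
    if (!isStr(store) || !isStr(paymentMethod)) return null;
    return { type: "receipt", store, date, total, tax, paymentMethod };
  }
  return null; // Unknown or unsupported document type.
}
```

Because `type` is the discriminant, a `switch` on `extracted.type` gives the compiler enough information to narrow to the matching field set, which is the same benefit a Zod discriminated union would provide.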
- Node.js 18+ and npm
- OpenAI API key
- (Optional) Poppler tools for PDF processing (pdftocairo)
- (Optional) Redis/Upstash for caching
- Clone the repository:
git clone https://github.com/wolyslager/llm-document-chat.git
cd llm-document-chat
- Install dependencies:
npm install
- Set up environment variables:
cp env.example .env.local
# Edit .env.local with your OpenAI API key and other settings
- Set up the database:
npx prisma migrate dev --name init
- Create a vector store:
node setup-vector-store.js
- Run the development server:
npm run dev
Open http://localhost:3000 with your browser to see the application.
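The environment step is the one most likely to fail silently, so a fail-fast configuration check can help. The sketch below is an illustration, not code from the repository: the `loadConfig` helper and `AppConfig` type are hypothetical, and the variable names are assumptions (`OPENAI_API_KEY` is the name the OpenAI SDK reads by default, `DATABASE_URL` is the usual Prisma convention, and `REDIS_URL` is a guess for the optional cache). Check `env.example` for the names the project actually uses.

```typescript
// Hypothetical config shape; the real project may read different variables.
type AppConfig = {
  openaiApiKey: string;
  databaseUrl: string;
  redisUrl?: string;
};

// Takes the environment as a plain record so it can be tested without
// touching the real process environment.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const openaiApiKey = (env.OPENAI_API_KEY ?? "").trim();
  if (openaiApiKey === "") {
    throw new Error("OPENAI_API_KEY is not set; add it to .env.local");
  }

  const config: AppConfig = {
    openaiApiKey,
    // Assumed default: a local SQLite file, in Prisma's connection-string format.
    databaseUrl: env.DATABASE_URL ?? "file:./dev.db",
  };

  // Caching is optional, so the Redis URL is only set when provided.
  if (env.REDIS_URL) {
    config.redisUrl = env.REDIS_URL;
  }
  return config;
}
```

In a server module this would be called once at startup as `loadConfig(process.env)`, so a missing key surfaces immediately instead of on the first upload.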
- src/app/api/ - Next.js API routes for document upload, search, and management
- src/lib/ - Core utilities (OpenAI, database, logging, validation)
- src/components/ - React components for the UI
- scripts/ - Utility scripts for cleanup and setup
- prisma/ - Database schema and migrations
- src/__tests__/ - Jest test suite
- POST /api/upload - Upload documents for classification and data extraction
- GET /api/documents - Fetch list of uploaded documents with metadata and extracted data summaries
- GET /api/documents/[id] - Fetch document with all relevant information including full extracted data
- POST /api/search - Basic search across documents using semantic vector search
- POST /api/search - Ask generic questions about documents via natural language queries
- GET /api/vector-stores - List available vector stores
- POST /api/vector-stores - Create new vector store
- DELETE /api/documents/[id] - Remove documents and clean up associated data
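As a rough illustration of how a caller might use the retrieval endpoints listed above, here is a small client sketch. The routes and HTTP methods come from the list; everything else is an assumption, including the `createClient` helper, the `FetchLike` type, and in particular the `{ query }` JSON body for the search route. The responses are left as `unknown` because their shape is not documented here.

```typescript
// Minimal structural type so the sketch does not depend on DOM or Node typings.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical client wrapper around the documented routes.
function createClient(baseUrl: string, fetchFn: FetchLike) {
  async function request(method: string, path: string, body?: unknown): Promise<unknown> {
    // Only attach a JSON body and content type when there is something to send.
    const init =
      body === undefined
        ? { method }
        : {
            method,
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify(body),
          };
    const res = await fetchFn(baseUrl + path, init);
    if (!res.ok) {
      throw new Error(method + " " + path + " failed with status " + res.status);
    }
    return res.json();
  }

  return {
    listDocuments: () => request("GET", "/api/documents"),
    getDocument: (id: string) =>
      request("GET", "/api/documents/" + encodeURIComponent(id)),
    deleteDocument: (id: string) =>
      request("DELETE", "/api/documents/" + encodeURIComponent(id)),
    // Assumption: the search route accepts a JSON body with a `query` field.
    search: (query: string) => request("POST", "/api/search", { query }),
    listVectorStores: () => request("GET", "/api/vector-stores"),
  };
}
```

Injecting the fetch function keeps the client testable with a stub. Against a running dev server, the global `fetch` available in Node.js 18+ should be usable directly, for example `createClient("http://localhost:3000", fetch)`.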
- npm run dev - Start development server
- npm run build - Build for production
- npm run test - Run test suite
- npm run lint - Run ESLint
- npm run type-check - Run TypeScript compiler
- ./scripts/cleanup-all.js - Clean database and vector stores
- ./scripts/cleanup-database.js - Clean database only
- ./scripts/cleanup-vector-store.js - Clean vector store only
This solution demonstrates clean, well-structured code following best practices as requested:
- Modular Architecture: Separation of concerns with dedicated modules for database, OpenAI integration, validation, and error handling
- Type Safety: Full TypeScript implementation with Zod schemas for runtime validation
- Error Handling: Unified error handling with structured JSON responses and proper HTTP status codes
- Testing: Comprehensive Jest test suite covering API routes and core functionality
- Logging: Structured logging with Pino for debugging and monitoring
- Code Quality: ESLint configuration and consistent coding patterns
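The unified error handling mentioned above can be pictured as a single function that turns any thrown value into a status code plus a structured JSON body. This is a sketch of the general pattern rather than the repository's code: the `ApiError` class, the `toErrorResponse` helper, the error codes, and the `{ error: { code, message } }` body shape are all hypothetical.

```typescript
// Hypothetical error type carrying the HTTP status and a machine-readable code.
class ApiError extends Error {
  readonly status: number;
  readonly code: string;

  constructor(status: number, code: string, message: string) {
    super(message);
    this.name = "ApiError";
    this.status = status;
    this.code = code;
  }
}

// Assumed response body shape; the project's actual JSON may differ.
type ErrorBody = { error: { code: string; message: string } };

// Map any thrown value to a status code and a structured body.
function toErrorResponse(err: unknown): { status: number; body: ErrorBody } {
  if (err instanceof ApiError) {
    return {
      status: err.status,
      body: { error: { code: err.code, message: err.message } },
    };
  }
  // Unknown failures: keep internal details out of the response.
  return {
    status: 500,
    body: { error: { code: "INTERNAL_ERROR", message: "Unexpected server error" } },
  };
}
```

A route handler would wrap its work in `try`/`catch` and pass the caught value through `toErrorResponse`, so that a missing document (for example `new ApiError(404, "NOT_FOUND", "Document not found")`) and an unexpected database failure both yield the same JSON structure with different status codes.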
- Framework: Next.js 14 with App Router
- Database: SQLite with Prisma ORM
- AI/ML: OpenAI GPT-4 Vision API, Vector Stores, Assistants API
- Validation: Zod schemas
- Logging: Pino structured logging
- Testing: Jest with API route testing
- Styling: Tailwind CSS
- TypeScript: Full type safety throughout
MIT