Deployed link: https://kq-frontend-one.vercel.app/graph

A full-stack knowledge graph search assistant for PDF and web URL content, combining:
- LLM-based knowledge graph extraction (Neo4j)
- Vector retrieval with Qdrant embeddings
- FastAPI backend (Python)
- Next.js frontend (React + ReactFlow)
- Authentication with Clerk JWT
- Storage with AWS S3 compatible API
`c:\knowledge_graph` is a mono-repo with:
- `kg_backend/` — FastAPI backend handling upload, vectorization, KG extraction, and querying.
- `kg_frontend/` — Next.js app containing the UI for upload/query/history/graph.
- Upload PDF files
- Process web URLs
- Extract text using PyMuPDF + OCR fallback (Tesseract)
- Chunk text and build dense embeddings via `sentence-transformers/all-MiniLM-L6-v2`
- Store vectors in Qdrant in a per-user/task collection
- Create and store knowledge graph (entities+relationships) in Neo4j via LangChain and custom prompts
- Query by natural language from the UI using dense retrieval + KG context
- Inspect and highlight graph nodes + explanations
- Python 3.11+ (implied)
- FastAPI (HTTP API)
- Uvicorn (ASGI server runtime)
- LangChain ecosystem: `langchain`, `langchain-groq`, `langchain-huggingface`, `langchain-qdrant`, `langchain-community`
- Qdrant vector database (`qdrant-client`)
- Neo4j graph database (`neo4j` driver)
- PyMuPDF (`fitz`) for PDF parsing
- OCR: `pytesseract`, `Pillow` (optional, for image PDFs)
- AWS S3 storage via `boto3` (S3/R2/local abstraction)
- JWT and Clerk integration via `PyJWT` + `requests`
- Caching with Redis (listed in requirements but not heavily used)
- Next.js 15 (App Router)
- React 19
- Tailwind CSS, PostCSS
- Clerk auth via `@clerk/nextjs`
- ReactFlow for graph display
- react-toastify for notifications
- axios for upload requests
- Environment-based configuration in `kg_backend/config.py`
- Dockerfile + Cloud Build scripts for CI/CD (GCP) in `kg_backend/`
- `kg_backend/auth.py` uses Clerk JWKS from `https://valid-termite-98.clerk.accounts.dev`.
- Every endpoint (upload, query, graph metadata) requires `Authorization: Bearer <token>`.
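The real verification in `kg_backend/auth.py` happens with PyJWT against Clerk's JWKS; the stdlib sketch below shows only the claim-decoding half so you can see where the user identity comes from. `decode_jwt_claims` is a hypothetical name, and in production the signature must be verified before any claim is trusted.

```python
import base64
import json

def decode_jwt_claims(token: str) -> dict:
    """Decode the claims segment of a JWT (illustration only).

    NOTE: this does NOT verify the signature. The backend checks the
    token against Clerk's JWKS (via PyJWT) before trusting claims such
    as `sub`, which becomes the user_id used in S3 paths and Qdrant
    collection names.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

The `sub` claim is what Clerk uses to identify the signed-in user across requests.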
- User hits `POST /upload_pdfs` with files (sidebar "Upload")
- Backend stores files in S3 path `pdf_uploads/{user_id}/{task_id}/...`
- Async background task `process_pdf_documents` begins:
  - downloads files from S3,
  - extracts text page by page (OCR fallback for scanned images),
  - splits text with `RecursiveCharacterTextSplitter` (chunk_size=3500, overlap=200),
  - builds KG via `services.kg_service.process_documents_for_kg`,
  - creates a Qdrant collection per user/task via `services.qdrant_service.create_vectorstore_for_task`.
- Task status is tracked in the in-memory dict `task_status` and polled via `/task_status/{task_id}` using a frontend hook.
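The splitting step above can be sketched as a fixed-window splitter with overlap. This is a simplified stand-in: the backend's actual `RecursiveCharacterTextSplitter` (LangChain) prefers paragraph/sentence boundaries before falling back to a hard cut at `chunk_size`.

```python
def split_text(text: str, chunk_size: int = 3500, overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters,
    with each window sharing `overlap` characters with the previous one
    so entities spanning a boundary are not lost."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap  # step forward, keeping the overlap
    return chunks
```

The overlap matters for KG extraction: a relationship mentioned across a chunk boundary still appears whole in at least one chunk.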
- User enters URLs in the sidebar and calls `POST /initialize_vector_index`.
- `services/url_service.process_url_documents` loads pages with `UnstructuredURLLoader`, chunks/splits the text, creates the KG, and stores vectors in Qdrant.
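The page-loading step strips HTML down to visible text. A minimal stdlib sketch of that idea (the backend actually uses `UnstructuredURLLoader`, which handles fetching, encodings, and richer element types; `html_to_text` is a hypothetical helper):

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script>/<style> contents."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    """Return the visible text of an HTML page as one string."""
    collector = _TextCollector()
    collector.feed(html)
    return " ".join(collector.parts)
```

The resulting text then goes through the same chunk → embed → KG-extract path as PDF text.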
- `POST /ask_pdf` (the primary query API; expects `question` + `task_ids[]`)
- Qdrant search across the selected collections uses high-level semantic search + a prompt template.
- Also pulls KG context via `services.kg_service.query_kg_for_question` to enrich answers.
- Returns the LLM response, source chunks, and a relevant node list.
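Conceptually, the answer prompt merges the retrieved chunks with KG triples before the LLM call. A hedged sketch of that assembly (`build_prompt` and the `(subject, relation, object)` triple shape are illustrative; the real template lives in the backend services):

```python
def build_prompt(
    question: str,
    chunks: list[str],
    kg_facts: list[tuple[str, str, str]],
) -> str:
    """Combine vector-retrieved chunks and KG triples into one LLM prompt."""
    context = "\n\n".join(chunks)
    facts = "\n".join(f"({s})-[{r}]->({o})" for s, r, o in kg_facts)
    return (
        f"Context:\n{context}\n\n"
        f"Knowledge graph facts:\n{facts}\n\n"
        f"Question: {question}"
    )
```

Keeping the KG facts in a compact triple notation gives the LLM structured relations alongside the raw text evidence.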
- `GET /knowledge_graph` returns the graph for the user + optional `task_id`.
- `GET /knowledge_graph/history` returns metadata + counts about KG snapshots.
- `GET /knowledge_graph/node/explain` returns an LLM explanation for a node.
- `GET /vector/history` returns the Qdrant task collection list.
- `POST /vector/query/{task_id}?question=` uses `ask_pdf_endpoint` as a quick query wrapper.
- `main.py`: FastAPI routes, startup (LLM init), CORS, request logging.
- `config.py`: env settings, validators, and defaults.
- `auth.py`: Clerk JWT verification and user identity extraction.
- `storage.py`: S3 storage abstraction & helper functions.
- `cache.py`: query caching interface (not fully instrumented).
- `services/pdf_service.py`: document ingestion, PDF parsing, vector retrieval path.
- `services/url_service.py`: URL ingestion and KG + vector store creation.
- `services/qdrant_service.py`: Qdrant collection CRUD, multi-task search.
- `services/kg_service.py`: knowledge graph extraction, storage in Neo4j, queries, node explanations.
- `/` (root) checks `useUser` and redirects signed-in users to `/graph`.
- `/graph` is the main app layout with sidebar + graph canvas.
- Upload (PDF + URLs)
- Query (ask question, voice support, list sources + nodes)
- History (task/graph history, select graphs to display)
- `KnowledgeGraph` renders nodes/edges in `reactflow` with auto-layout (dagre)
- Node click triggers `/knowledge_graph/node/explain`
- Filtering by node type, refresh, selected graph overlays.
- `useTaskPolling` (custom hook) polls `GET /task_status/{task_id}` until complete.
- An indicator panel appears while processing.
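The polling loop itself is simple and language-agnostic; a backend-side Python sketch of the same pattern (`poll_task_status` is a hypothetical helper, and the terminal status names `"completed"`/`"failed"` are assumptions — `fetch` is injected so it can be a real HTTP GET in production or a stub in tests):

```python
import time

def poll_task_status(fetch, interval: float = 2.0, max_polls: int = 150) -> dict:
    """Repeatedly call `fetch` (which should GET /task_status/{task_id})
    until the task reports a terminal state, then return that status."""
    for _ in range(max_polls):
        status = fetch()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)  # back off between polls
    raise TimeoutError("task did not reach a terminal state")
```

The frontend's `useTaskPolling` hook does the same thing with a React effect and a timer.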
- Node 18+
- Python 3.11+
- Docker (optional) for Qdrant + Neo4j + local backend
- Neo4j running at `bolt://localhost:7687`
- Qdrant running at `http://localhost:6333`
- S3 bucket or local path exposed via S3 (or modify `storage.py` for local files)
- Tesseract installed for OCR (optional but recommended for scanned PDFs)
cd c:\knowledge_graph\kg_backend
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
set GROQ_API_KEY=<your_groq_key>
set QDRANT_URL=http://localhost:6333
set NEO4J_URI=bolt://localhost:7687
set NEO4J_USER=neo4j
set NEO4J_PASSWORD=<your_password>
set AWS_S3_BUCKET_NAME=<bucket> # if using S3
set AWS_ACCESS_KEY_ID=<key>
set AWS_SECRET_ACCESS_KEY=<secret>
set CLERK_SECRET_KEY=unused
uvicorn main:app --reload --host 0.0.0.0 --port 8000

cd c:\knowledge_graph\kg_frontend
npm install
set NEXT_PUBLIC_API_URL=http://localhost:8000
set NEXT_PUBLIC_CLERK_FRONTEND_API=<clerk_frontend_api>
set NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=<clerk_publishable_key>
npm run dev

- See `kg_backend/Dockerfile`, `kg_backend/cloudbuild.yaml`, `kg_backend/DEPLOYMENT_GCP.md`, and `kg_backend/QUICK_DEPLOY_GCP.md` for cloud deployment details.
- User authenticates via Clerk in Next.js.
- User uploads PDF(s) or enters URLs.
- Sidebar calls backend endpoints: `/upload_pdfs` / `/initialize_vector_index`.
- Backend spawns background tasks: OCR + text extraction, embedding + Qdrant index, KG extraction + Neo4j persist.
- Frontend polls status, updates graph list.
- User selects graph(s) in history tab.
- Graph view calls `/knowledge_graph` and renders a local force-layout.
- User asks questions (via text or voice); the request goes to `/ask_pdf`.
- Backend queries Qdrant + KG and uses the LLM to answer.
- UI displays answer, sources, relevant nodes, and allows node explanation calls.
- `GET /health` (health check)
- `GET /task_status/{task_id}`
- `POST /upload_pdfs` (multipart PDF upload)
- `POST /initialize_vector_index` (URL ingestion)
- `POST /ask_pdf` (question + task_ids)
- `GET /knowledge_graph` (user graph data)
- `GET /knowledge_graph/tasks` (task list)
- `GET /knowledge_graph/history` (history metadata)
- `GET /knowledge_graph/node/explain` (node explanation)
- `GET /vector/history` (Qdrant history)
- `POST /vector/query/{task_id}` (quick query fallback)
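A minimal client-side call against the main query endpoint, assuming the local backend URL from the setup steps. The request is built with the stdlib; `ask_pdf_request` is a hypothetical helper, and the `{question, task_ids}` body shape follows what `/ask_pdf` expects per this README:

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # matches NEXT_PUBLIC_API_URL in local setup

def ask_pdf_request(token: str, question: str, task_ids: list[str]) -> urllib.request.Request:
    """Build an authorized POST /ask_pdf request; send it with
    urllib.request.urlopen(...) to get the LLM answer + sources."""
    body = json.dumps({"question": question, "task_ids": task_ids}).encode()
    return urllib.request.Request(
        f"{API_URL}/ask_pdf",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",  # Clerk JWT, required on every endpoint
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The same Bearer-token header applies to every other endpoint in the list above.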
- If queries return 401/403: check the Clerk token setup and the backend `auth` module.
- If `/ask_pdf` returns no results: verify collections appear in `GET /vector/history` and documents exist in Qdrant.
- If the KG is empty: ensure Neo4j is reachable and the credentials are correct.
- If OCR fails: install Tesseract and confirm the command is available in PATH.
- If the LLM fails at startup: ensure `GROQ_API_KEY` is present and valid.
- KG extraction runs sequentially per chunk
- LLM calls are slow and blocking
- Not scalable for large PDFs or multiple concurrent users
Convert the KG extraction pipeline into a distributed asynchronous system using a pub-sub/message broker like Kafka.
- After chunking, push each chunk as a message to a Kafka topic `kg_chunks`
- Chunk messages use the format:

```json
{
  "task_id": "...",
  "user_id": "...",
  "chunk_text": "...",
  "chunk_id": "..."
}
```

- Multiple worker services consume chunks in parallel
Each worker:
- Calls LLM (LangChain + Groq)
- Extracts entities and relationships
- Writes results to Neo4j
- Optionally track completion using Redis or a database
- Mark task as complete when all chunks are processed
- Kafka / Redpanda (lighter alternative)
- Celery (optional fallback)
- Redis (task tracking)
- Worker containers (horizontal scaling)
- 5–10x faster ingestion via parallel LLM calls
- Horizontal scalability by adding more workers
- Fault tolerance with retry mechanisms
- Decoupled architecture (clean system design)
- Batch multiple chunks per worker to reduce LLM cost
- Deduplicate entities before inserting into Neo4j
- Add priority queues (process smaller documents first for better UX)
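The deduplication idea above can be sketched as a pre-insert pass. `dedupe_entities` is a hypothetical helper, and the keying scheme (case-insensitive name + type) is one reasonable choice; fuzzier matching or Neo4j `MERGE` semantics could be layered on top.

```python
def dedupe_entities(entities: list[dict]) -> list[dict]:
    """Collapse duplicate entities by normalized (name, type) key before
    writing to Neo4j, so repeated mentions across chunks yield one node."""
    seen: dict = {}
    for ent in entities:
        key = (ent["name"].strip().lower(), ent["type"])
        seen.setdefault(key, ent)  # keep the first mention of each entity
    return list(seen.values())
```

Running this per batch also cuts Neo4j write volume, which compounds with the batching optimization above.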