Transform your PDF textbooks into intelligent study materials with AI-generated flashcards, quizzes, and semantic search.
Dojang is an AI-powered learning platform for large PDF textbooks. Upload a PDF, and the system:
- Extracts semantic content using Marker (preserves structure: headings, lists, tables)
- Generates embeddings with OpenAI (768-dimensional vectors)
- Stores in PostgreSQL with pgvector for fast similarity search
- Enables learning features like flashcards, quizzes, and spaced repetition (in development)
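To make the similarity-search step concrete, here is a minimal pure-Python sketch of ranking stored chunks by cosine distance, the quantity pgvector's `<=>` operator computes. The function names and the toy 3-dimensional vectors are illustrative assumptions, not code from this repository; in the real system the ranking would run inside PostgreSQL.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], rows: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """rows holds (content_id, embedding) pairs; return the k closest ids."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [content_id for content_id, _ in ranked[:k]]


# Toy vectors stand in for real embeddings (hypothetical data).
rows = [
    ("intro", [1.0, 0.0, 0.0]),
    ("limits", [0.0, 1.0, 0.0]),
    ("derivatives", [0.9, 0.1, 0.0]),
]
nearest = top_k([1.0, 0.0, 0.0], rows, k=2)
print(nearest)
```

Sorting every row is fine for a sketch; a database index avoids the full scan on large tables.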
The goal is to create a comprehensive CDN (Content Delivery Network) that integrates with vector databases, so that AI-powered educational features are easy to implement and open source for everyone.
- PDF Processing Pipeline
  - Semantic extraction with Marker
  - Hierarchical content storage
  - Memory-optimized for large textbooks
- Vector Embeddings
  - OpenAI text-embedding-ada-002
  - Batch processing (100 chunks at a time)
  - pgvector storage for similarity search
- Beautiful UI
  - Modern Next.js frontend
  - Drag-and-drop PDF upload
  - Real-time progress tracking
- Robust Backend
  - FastAPI with async support
  - PostgreSQL with pgvector
  - Comprehensive test suite
In development:
- Vector similarity search (RAG)
- AI flashcard generation
- Quiz creation
- Spaced repetition system
- Chat with documents
- User authentication
- Study progress tracking
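The "100 chunks at a time" batching mentioned under Vector Embeddings can be sketched as follows. This is a minimal illustration, not the project's implementation: `embed_batch` is a hypothetical callable standing in for whatever wraps the OpenAI embeddings call, and `fake_embed_batch` exists only so the sketch runs without an API key.

```python
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 100  # matches the batch size stated in the feature list


def batched(chunks: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` chunks."""
    for start in range(0, len(chunks), size):
        yield chunks[start:start + size]


def embed_all(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
) -> list[list[float]]:
    """Embed every chunk, one request per batch, preserving order."""
    vectors: list[list[float]] = []
    for batch in batched(chunks):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(batch: Sequence[str]) -> list[list[float]]:
    # Hypothetical stand-in: one 1-dimensional "embedding" per chunk.
    return [[float(len(text))] for text in batch]


chunks = [f"chunk {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batched(chunks)]
vectors = embed_all(chunks, fake_embed_batch)
print(batch_sizes, len(vectors))
```

Batching keeps the number of API requests proportional to the chunk count divided by 100 rather than to the chunk count itself.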
Get up and running in 5 minutes:
- Prerequisites: Docker Desktop + OpenAI API key
- Clone and setup:
git clone <repository-url>
cd dojang
echo "OPENAI_API_KEY=your_key_here" > .env
- Start everything:
docker-compose up --build
- Access the app: http://localhost:3000
For detailed instructions, see QUICKSTART.md
| Document | Description |
|---|---|
| QUICKSTART.md | Get started in 5 minutes with Docker |
| SETUP.md | Detailed development setup (Docker & manual) |
| ARCHITECTURE.md | System design, data flow, and technical details |
| FEATURES.md | Implementation guides for new features |
Frontend:
- Framework: Next.js 15 (React 18)
- Language: TypeScript
- Styling: Tailwind CSS + Shadcn UI
- Testing: Playwright
Backend:
- Framework: FastAPI
- Language: Python 3.11+
- ORM: SQLAlchemy (async)
- PDF Processing: PyMuPDF + Marker
- AI: OpenAI API
- Testing: Pytest
Database:
- DBMS: PostgreSQL 15+
- Vector Search: pgvector extension
- Schema: Hierarchical content storage
Infrastructure:
- Containerization: Docker + Docker Compose
- Development: Hot reload for frontend & backend
┌─────────────────────────────────────────────────────────────┐
│                     Frontend (Next.js)                      │
│  • PDF Upload UI                                            │
│  • Flashcards (TODO)                                        │
│  • Quizzes (TODO)                                           │
└─────────────────────────────────────────────────────────────┘
                               │ HTTP/REST
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Backend (FastAPI)                      │
│  ┌─────────────────────────────────────────────────────┐    │
│  │            Document Processing Pipeline             │    │
│  │   PDF → Marker → JSON → DB → OpenAI → Embeddings    │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    PostgreSQL + pgvector                    │
│  • Document metadata                                        │
│  • Hierarchical content                                     │
│  • 768-dim embeddings                                       │
└─────────────────────────────────────────────────────────────┘
For detailed architecture, see ARCHITECTURE.md
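The "JSON → DB" step in the diagram implies turning a nested document tree into flat rows linked by parent pointers. The sketch below shows one way to do that. It assumes a hypothetical node shape (`title`, `content`, `content_type`, `children`) that is not Marker's actual output format, and it generates ids locally, whereas a real pipeline would take them from the database.

```python
from itertools import count
from typing import Iterator, Optional


def flatten(
    node: dict,
    source_id: int,
    ids: Optional[Iterator[int]] = None,
    parent_id: Optional[int] = None,
) -> list[dict]:
    """Depth-first walk that emits one row per node, parents before children."""
    ids = count(1) if ids is None else ids
    content_id = next(ids)  # assumption: ids assigned locally for illustration
    row = {
        "content_id": content_id,
        "source_id": source_id,
        "parent_content_id": parent_id,
        "title": node.get("title"),
        "content": node.get("content", ""),
        "content_type": node["content_type"],
    }
    rows = [row]
    for child in node.get("children", []):
        rows.extend(flatten(child, source_id, ids, content_id))
    return rows


# Hypothetical extracted document.
doc = {
    "title": "Chapter 1",
    "content_type": "heading",
    "children": [
        {
            "title": "1.1 Limits",
            "content_type": "heading",
            "children": [{"content_type": "paragraph", "content": "A limit is..."}],
        },
        {"title": "1.2 Continuity", "content_type": "heading"},
    ],
}
rows = flatten(doc, source_id=42)
links = [(r["content_id"], r["parent_content_id"]) for r in rows]
print(links)
```

Emitting parents before children means each row's `parent_content_id` already exists when the row is inserted.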
Backend:
# Inside Docker
docker exec -it dojang-backend-1 pytest
# Or use the script
cd backend
./run_tests.sh # Unix/Mac
.\run_tests.ps1 # Windows
Frontend:
docker exec -it dojang-frontend-1 npx playwright test
- Backend: Edit files in backend/app/ - auto-reloads
- Frontend: Edit files in frontend/ - Fast Refresh
- Database: Modify backend/app/models.py + create migration
docker-compose logs -f # All services
docker-compose logs -f backend # Backend only
docker-compose logs -f frontend # Frontend only
knowledge_base_sources - Document metadata
- source_id, name, author, publisher, etc.
knowledge_base_content - Hierarchical content with embeddings
- content_id, source_id, parent_content_id
- title, content, content_type
- embedding (VECTOR(768))
users - User management (for future auth)
tags - Content tagging system
user_activity_log - Track learning progress
See ARCHITECTURE.md for full schema details.
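Because `knowledge_base_content` rows point at their parent through `parent_content_id`, a chunk's position in the book can be recovered by walking up the chain. The sketch below does this in memory over hypothetical rows; the `breadcrumb` helper is an illustrative name, and in PostgreSQL the same walk would typically be a recursive query.

```python
def breadcrumb(rows: list[dict], content_id: int) -> list[str]:
    """Return titles from the root down to the given content row."""
    by_id = {row["content_id"]: row for row in rows}
    path: list[str] = []
    current = by_id.get(content_id)
    while current is not None:
        path.append(current["title"])
        # Root rows have parent_content_id None, which ends the loop.
        current = by_id.get(current["parent_content_id"])
    return list(reversed(path))


# Hypothetical rows using the column names from the schema above.
rows = [
    {"content_id": 1, "parent_content_id": None, "title": "Calculus"},
    {"content_id": 2, "parent_content_id": 1, "title": "Limits"},
    {"content_id": 3, "parent_content_id": 2, "title": "Epsilon-delta definition"},
]
trail = breadcrumb(rows, 3)
print(" > ".join(trail))
```

A trail like this is useful context to show next to a search hit or to include in a flashcard prompt.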
We welcome contributions! Here's how to get started:
- Pick a feature from FEATURES.md
- Read the architecture in ARCHITECTURE.md
- Set up your environment with SETUP.md
- Create a feature branch:
git checkout -b feat/your-feature
- Implement & test
- Submit a pull request
- 🔴 Vector Similarity Search
- 🔴 Flashcard Generation
- 🔴 Quiz Generation
- 🟡 Spaced Repetition System
- 🟡 Chat with Documents
See FEATURES.md for detailed implementation guides.
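For contributors considering the spaced repetition item, one widely used starting point is the SM-2 algorithm. The sketch below is a simplified SM-2 variant offered as an assumption. This README does not say which scheduler the project intends to use, and the `CardState` and `review` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardState:
    repetitions: int = 0    # consecutive successful reviews
    interval_days: int = 0  # days until the next review
    ease: float = 2.5       # SM-2 "easiness factor", never below 1.3


def review(state: CardState, quality: int) -> CardState:
    """Apply one review with a self-graded quality from 0 (blackout) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Failed recall: restart the sequence, leave the ease unchanged.
        return CardState(repetitions=0, interval_days=1, ease=state.ease)
    repetitions = state.repetitions + 1
    if repetitions == 1:
        interval = 1
    elif repetitions == 2:
        interval = 6
    else:
        interval = round(state.interval_days * state.ease)
    delta = 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)
    return CardState(repetitions, interval, max(1.3, state.ease + delta))


state = CardState()
intervals = []
for _ in range(3):
    state = review(state, quality=5)
    intervals.append(state.interval_days)
print(intervals)
```

Keeping the state immutable makes each review easy to log, which fits a table such as `user_activity_log`.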
dojang/
├── backend/
│   ├── app/
│   │   ├── main.py                    # FastAPI app
│   │   ├── models.py                  # Database models
│   │   ├── database.py                # DB connection
│   │   ├── routers/                   # API endpoints
│   │   └── services/
│   │       ├── document_processor.py  # Core pipeline
│   │       └── source_intake.py       # Content ingestion
│   ├── tests/                         # Pytest tests
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/
│   ├── app/                           # Next.js pages
│   ├── components/
│   │   ├── FileUpload.tsx             # Upload UI
│   │   └── ui/                        # Shadcn components
│   ├── tests/                         # Playwright tests
│   ├── Dockerfile
│   └── package.json
├── docker-compose.yml                 # Multi-container setup
├── init-db.sh                         # Database initialization
├── QUICKSTART.md                      # Quick start guide
├── SETUP.md                           # Detailed setup
├── ARCHITECTURE.md                    # System design
└── FEATURES.md                        # Feature guides
OPENAI_API_KEY=your_openai_api_key_here
DATABASE_URL=postgresql+asyncpg://postgres:zany12@localhost:5433/studyai
UPLOAD_DIR=./uploads
All services run in Docker containers locally.
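The three environment variables listed above could be loaded and validated at startup roughly as follows. This is a sketch under assumptions: the `Settings` class and `load_settings` function are hypothetical names, and defaulting `UPLOAD_DIR` to `./uploads` simply mirrors the example value rather than documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    database_url: str
    upload_dir: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment, failing fast on missing values."""
    env = os.environ if env is None else env
    missing = [name for name in ("OPENAI_API_KEY", "DATABASE_URL") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        database_url=env["DATABASE_URL"],
        upload_dir=env.get("UPLOAD_DIR", "./uploads"),  # assumed default
    )


# Example with an explicit mapping so the sketch runs without real secrets.
settings = load_settings({
    "OPENAI_API_KEY": "your_openai_api_key_here",
    "DATABASE_URL": "postgresql+asyncpg://user:password@localhost:5433/studyai",
})
print(settings.upload_dir)
```

Failing at startup with a named variable is usually easier to debug than a failed API call partway through processing a PDF.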
- Frontend: Vercel/Netlify
- Backend: AWS ECS / GCP Cloud Run
- Database: AWS RDS / GCP Cloud SQL
- Background Jobs: Celery + Redis
- Monitoring: Sentry, Prometheus, Grafana
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
See LICENSE for full details.
- Marker - Excellent PDF semantic extraction
- pgvector - PostgreSQL vector similarity search
- OpenAI - Embeddings and GPT models
- FastAPI - Modern Python web framework
- Next.js - React framework for production
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See docs folder
- PDF upload and processing
- Semantic content extraction
- Embedding generation
- Vector storage (pgvector)
- Beautiful UI
- Vector similarity search
- Flashcard generation
- Quiz creation
- Spaced repetition
- User authentication
- Progress tracking
- Multi-modal support (images, videos)
- Mobile app
- Open source CDN for education
Ready to get started? See QUICKSTART.md to begin!