Transform PDF presentations into cinematic, narrated videos — AI-generated scripts (local WebLLM or cloud APIs), high-quality in-browser Kokoro TTS, scene-aware video analysis and bug reporting, Chrome-extension-assisted recording, and in-browser FFmpeg rendering.
- What It Does
- Why Origami?
- Key Features
- Getting Started
- Requirements
- How It Works
- Configuration
- Keyboard Shortcuts
- Development
- Troubleshooting
- Contributing
- Support
- Credits
## What It Does

Origami AI converts static PDF presentations into polished, cinematic videos and richer interactive outputs — primarily locally, with optional cloud-assisted analysis when needed. Key capabilities include:
- 🎬 AI-generated narration scripts — local WebLLM or remote Gemini/OpenAI APIs
- 🎙️ High-quality in-browser TTS — Kokoro.js with multiple voices
- 🔍 Scene-aware video analysis & issue reporting — auto-generate breakdowns and bug reports from MP4s
- 💬 AI Assistant Chat & WebLLM — conversational assistance with cloud fallbacks
- 🔒 Secure server-side API proxying — `LLM_API_KEY` kept secret from client bundles
- ⚡ WebGPU-accelerated inference — faster local model execution
- 📹 In-browser FFmpeg rendering — professional MP4 exports (720p/1080p)
- 🎵 Background music & audio mixing — auto-ducking and smooth transitions
- 🎯 Smart screen recording — auto-zoom with Chrome-extension DOM telemetry
Simply upload a PDF, generate narration or analyze a recorded clip, customize timing and transitions, and export a broadcast-quality MP4 — all from your browser with optional secure cloud features.
## Why Origami?

Origami is the art of folding paper into new shapes. Similarly, Origami AI transforms flat slides into cinematic videos by adding AI narration, music, and professional camera effects.
| Feature | Traditional Video Editors | Cloud AI Services | Origami AI |
|---|---|---|---|
| Learning Curve | Steep (complex UI) | Easy (simplified) | Minimal (automated) |
| Privacy | Local ✓ | Cloud-based ✗ | Local-first ✓ |
| Cost | One-time or free | Monthly credits | Free & open source |
| Voice | Your own / hire talent | Pay per minute | Unlimited local TTS |
| Time to Video | Hours | Minutes | ~10-30 min |
- 🔒 Privacy: Your presentation data never leaves your computer
- 💰 Cost-effective: No subscription fees or per-minute charges
- ⚡ Fast: After models load, rendering is typically 5-20 min depending on video length
- 🎯 Complete control: Edit scripts, timing, and effects at every step
- 📡 Works offline: After initial model downloads, all processing is local
## Key Features

- PDF Processing - Drag-and-drop upload with automatic text extraction and high-resolution image conversion (PDF.js)
- AI-Powered Scripts - Generate narrative scripts locally with WebLLM or use remote OpenAI-compatible APIs
- Text-to-Speech - Kokoro.js for high-quality local TTS with multiple voices (af_heart, af_bella, am_adam, etc.)
- Video Editing - Drag-and-drop slide reordering, per-slide script editing, transitions, and background music
- Smart Rendering - FFmpeg.wasm video composition with 720p/1080p export and real-time progress tracking
- Screen Recording - Record browser tabs or desktop with auto-zoom during idle periods (2+ sec inactivity)
- Chrome Extension - Capture cursor position and DOM interactions for precise follow effects
- Scene Analysis - Upload MP4 videos and auto-generate timestamped scene breakdowns with Gemini API
- AI Assistant Chat - Local chatbot with 9+ WebLLM models and image/video analysis support
- Issue Reporter - Capture bugs, get AI-powered analysis with debugging suggestions
- Project Backup - Export and import `.origami` archives to move projects between devices
## Getting Started

```bash
# Clone the repository
git clone https://github.com/IslandApps/Origami-AI.git
cd Origami-AI

# Install dependencies (requires Node.js >= 20.19.0)
npm install

# Start development server
npm run dev
```

Open http://localhost:3000 in your browser. The development server sets the headers required for SharedArrayBuffer and FFmpeg.wasm to work.

Important: Do not open `index.html` directly. The dev server with proper CORS/COOP/COEP headers is required.
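FFmpeg.wasm depends on SharedArrayBuffer, which browsers only enable in cross-origin-isolated pages. As a framework-agnostic sketch of what the dev server must send (the project's actual Express middleware may differ), these are the two required headers:

```typescript
// Sketch: the two response headers that enable cross-origin isolation,
// which unlocks SharedArrayBuffer and therefore FFmpeg.wasm.
function crossOriginIsolationHeaders(): Record<string, string> {
  return {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
  };
}

// In an Express app this might be applied as middleware (illustrative):
// app.use((_req, res, next) => {
//   for (const [k, v] of Object.entries(crossOriginIsolationHeaders())) {
//     res.setHeader(k, v);
//   }
//   next();
// });
```

Opening `index.html` from disk (or serving it without these headers) leaves `crossOriginIsolated` false, which is why the dev server is mandatory.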
| Command | Purpose |
|---|---|
| `npm run dev` | Start Express + Vite dev server with HMR |
| `npm run build` | Build production assets for deployment |
| `npm run preview` | Preview production build locally |
| `npm run lint` | Check code for linting issues |
For containerized deployment with proper header configuration:

```bash
docker compose up --build
```

The app will be available at http://localhost:3000.
For enhanced browser tab interaction tracking during screen recording:

1. Open `chrome://extensions` in Chrome/Edge
2. Enable Developer mode (top-right toggle)
3. Click Load unpacked and select the `chrome-extension/` folder
4. See chrome-extension/README.md for full setup instructions

The extension is optional; Origami AI falls back to built-in local interaction tracking if the extension is unavailable.
## Requirements

- Node.js >= 20.19.0
- WebGPU-compatible browser (see browser support below)
- Stable internet for initial model downloads (~1-5GB depending on models used)
- 50GB+ free storage for browser cache and model artifacts
| Browser | Min. Version | Notes |
|---|---|---|
| Chrome/Chromium | 113+ | Chrome Extension available for enhanced recording |
| Edge | 113+ | Chrome Extension available for enhanced recording |
| Firefox | Nightly | Enable dom.webgpu.enabled in about:config |
| Safari | 18+ (macOS Sonoma) | Desktop recording supported |
WebGPU is required for: Local AI narration generation, AI Assistant Chat, and screen recording with auto-zoom effects. If unavailable, you can use remote OpenAI-compatible APIs instead.
Minimum
- 4-core CPU, 8GB RAM, integrated GPU
- ~1-2 hours for initial model downloads and video rendering
Recommended
- 8-core CPU, 16GB RAM, dedicated GPU with F16 support
- NVMe SSD for faster model operations
- Hardware video encoding for screen recording workflows
AI Assistant Chat Models (optional)
- Gemma 2 2B: 1.4GB download, ~2GB VRAM
- Llama 3.2 1B: 800MB download, ~1.5GB VRAM
- Llama 3.2 3B: 1.7GB download, ~2.5GB VRAM
- Phi 3.5 Vision: 3.9GB download, ~4GB VRAM (includes image/video analysis)
## How It Works

1. Upload a PDF presentation
2. Extract slide images and text automatically
3. Generate AI narration scripts (local WebLLM or remote API)
4. Synthesize speech audio from scripts using Kokoro.js TTS
5. Edit scripts, timing, transitions, and background music in the visual editor
6. Render the final MP4 video with FFmpeg.wasm (720p or 1080p)
7. Download the finished video
Typical timeline: 10-30 minutes depending on slide count and GPU performance.
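The steps above form a linear async pipeline: each stage enriches the project state and hands it to the next. A minimal sketch of that shape (stage and field names are illustrative, not the app's real API):

```typescript
// Illustrative pipeline runner: feeds each stage's output into the next.
type Stage<T> = (ctx: T) => T | Promise<T>;

async function runPipeline<T>(stages: Stage<T>[], initial: T): Promise<T> {
  let ctx = initial;
  for (const stage of stages) {
    ctx = await stage(ctx); // stages run strictly in order
  }
  return ctx;
}

// Hypothetical stages mirroring steps 2-4 of the README's workflow.
interface Project { slides: string[]; scripts: string[]; audio: string[] }

const extractSlides: Stage<Project> = (p) =>
  ({ ...p, slides: ["slide-1", "slide-2"] });
const generateScripts: Stage<Project> = (p) =>
  ({ ...p, scripts: p.slides.map((s) => `narration for ${s}`) });
const synthesizeAudio: Stage<Project> = (p) =>
  ({ ...p, audio: p.scripts.map((s) => `${s}.wav`) });
```

Each stage only depends on the accumulated context, which is what allows the editor to re-run later stages (e.g. re-synthesize audio) after a script edit.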
- Record browser tabs or desktop with precise interaction tracking
- Auto-zoom applied during idle periods (>2 seconds) for cinematic effect
- Captured interactions feed into video editor for synchronization
- Combine with PDF slides or use as standalone video content
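The auto-zoom rule above (zoom in once no interaction has happened for more than 2 seconds) can be pictured as a pure function over interaction timestamps. The segment shape below is my own illustration, not the recorder's real data model:

```typescript
// Given sorted interaction timestamps (ms) and a recording duration,
// return the time ranges with no interaction for longer than the idle
// threshold -- candidates for the cinematic auto-zoom effect.
interface IdleSegment { start: number; end: number }

function findIdleSegments(
  events: number[],          // sorted interaction timestamps, ms
  durationMs: number,
  idleThresholdMs = 2000,    // README: zoom after 2+ seconds of inactivity
): IdleSegment[] {
  const points = [0, ...events, durationMs];
  const segments: IdleSegment[] = [];
  for (let i = 1; i < points.length; i++) {
    if (points[i] - points[i - 1] > idleThresholdMs) {
      segments.push({ start: points[i - 1], end: points[i] });
    }
  }
  return segments;
}
```

With extension telemetry the `events` array would come from DOM-level cursor/click data; the fallback tracker would supply coarser in-page events instead.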
- Scene Analysis - Upload MP4 videos to auto-generate timestamped scene breakdowns
- AI Assistant Chat - Ask questions, attach images/videos for analysis (local or cloud models)
- Issue Reporter - Record and analyze bugs with AI-powered debugging suggestions
## Configuration

Access settings via the ⚙️ Settings button in the app header. Key configuration options:
| Setting | Purpose |
|---|---|
| General | Intro fade timing, post-audio delay, default transition, recording options |
| TTS Model | Select Kokoro.js quantization (q8 for higher quality, q4 for speed) |
| WebLLM | Enable/disable local AI, select model, filter by precision (f16/f32) |
| API | Configure remote OpenAI-compatible providers (Gemini, OpenRouter, Ollama) |
| AI Prompt | Customize narration script generation behavior |
Origami AI supports both local inference (WebLLM, no API key needed) and cloud-based APIs (Gemini, OpenAI-compatible providers) for AI narration generation, video analysis, and bug reporting.
For local development, you can use browser-exposed API keys:

1. Get a Gemini API Key (free tier available):
   - Go to Google AI Studio
   - Click "Get API key"
   - Copy the generated key

2. Configure for Development:
   - Create a `.env` file in the project root (copy from `.env.example`)
   - Set `VITE_LLM_API_KEY` to your Gemini API key:

     ```
     VITE_LLM_API_KEY=your_api_key_here
     VITE_LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
     VITE_LLM_MODEL=gemini-flash-latest
     ```

   - Start the dev server: `npm run dev`
   - The `VITE_` prefix tells Vite to expose the key to the client bundle (safe only in dev)

3. In Settings (app ⚙️ button):
   - Go to the API tab
   - Verify Base URL and Model match your configuration
   - The app will use your API key for AI operations (narration, analysis, chat)
To prevent exposing API keys in production builds, use server-side proxy endpoints:

1. Configure Server-Only Key:
   - Set the `LLM_API_KEY` environment variable on your production server/host:

     ```bash
     export LLM_API_KEY=your_api_key_here
     ```

   - Do NOT set `VITE_LLM_API_KEY` in production (it would be baked into the client bundle)

2. Client Configuration:
   - In production builds, the client has no API key
   - The app automatically detects this and proxies requests to the server
   - Server endpoints handle the API calls securely:
     - `POST /api/llm/chat` - Chat/completion requests
     - `POST /api/llm/analyze-video` - Video analysis with file upload
     - `POST /api/llm/analyze-issue` - Issue recording analysis

3. Deploy:
   - Build: `npm run build`
   - Deploy the `dist/` folder to your host
   - Ensure `LLM_API_KEY` is set in your host's environment (not in code)
   - Run: `npm run preview` or use your host's production runner
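The detect-and-proxy behaviour can be illustrated with a small helper that picks the request target based on whether a dev-only client key exists. Names, the completion path, and the routing shape are illustrative; the app's real logic may differ:

```typescript
// Decide where an LLM chat request should go: directly to the provider
// when a dev-only client key is present (VITE_LLM_API_KEY), otherwise
// to the server proxy, which attaches the secret LLM_API_KEY itself.
interface LlmTarget { url: string; headers: Record<string, string> }

function resolveLlmTarget(
  clientApiKey: string | undefined, // VITE_LLM_API_KEY (dev only)
  baseUrl: string,                  // VITE_LLM_BASE_URL
): LlmTarget {
  if (clientApiKey) {
    // Dev mode: call the OpenAI-compatible provider directly.
    return {
      url: `${baseUrl}chat/completions`, // illustrative path
      headers: { Authorization: `Bearer ${clientApiKey}` },
    };
  }
  // Production: no key in the bundle; the server proxy adds it.
  return { url: "/api/llm/chat", headers: {} };
}
```

The key property is that the production branch never constructs an `Authorization` header in the browser, so no secret can leak into the shipped bundle.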
| Variable | Context | Purpose |
|---|---|---|
| `VITE_LLM_API_KEY` | Client (dev only) | Exposes API key to browser for development; NEVER set in production |
| `LLM_API_KEY` | Server (prod) | Server-side API key for proxy endpoints; kept secret from client |
| `VITE_LLM_BASE_URL` | Client | Endpoint URL (e.g., https://generativelanguage.googleapis.com/v1beta/openai/) |
| `VITE_LLM_MODEL` | Client | Model identifier (e.g., gemini-2.5-flash-lite) |
| `CLIENT_URL` | Server CORS | Comma-separated list of allowed client origins (e.g., http://localhost:3000) |
| `PORT` | Server | Port to run the server on (default: 3000) |
| `NODE_ENV` | Runtime | Set to `production` for production builds |
- ✅ Use local WebLLM when possible (no API key needed)
- ✅ Server-side keys only in production (use `LLM_API_KEY` without the `VITE_` prefix)
- ✅ Rotate keys if accidentally exposed in source control
- ✅ Use environment variables for secrets (never hardcode in source)
- ❌ Never commit `.env` to source control (use `.env.example` as a template)
- ❌ Don't use `VITE_LLM_API_KEY` in production builds
- ❌ Don't expose `LLM_API_KEY` through client-side code
Once a PDF is loaded, the Slide Editor provides five tabs:
- Overview - Script editing, AI fix, copy/revert, reorder slides
- Voice Settings - Per-slide voice selection, TTS generation, voice recording
- Audio Mixing - Background music with per-slide control and visualizer
- Batch Tools - Generate all audio, fix scripts, find & replace across slides
- Slide Media - Replace slide images/media, upload MP4 for analysis
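The Batch Tools tab's find & replace can be pictured as a pure transform over all slide scripts. This is a sketch; the editor's actual state shape is not documented here:

```typescript
// Replace every occurrence of `find` with `replace` across all slide
// scripts, returning new objects so editor state stays immutable
// (important for undo/redo support).
interface Slide { id: number; script: string }

function findAndReplaceAll(slides: Slide[], find: string, replace: string): Slide[] {
  return slides.map((slide) => ({
    ...slide,
    script: slide.script.split(find).join(replace), // literal match, not regex
  }));
}
```

Returning fresh objects rather than mutating in place is the pattern React state updates expect, and it keeps each batch operation trivially undoable.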
For additional setup details:
- Scene Analysis & Alignment - Upload MP4 videos to auto-generate timestamped scene breakdowns with Gemini API
- Chrome Extension Setup - See chrome-extension/README.md for enhanced interaction tracking
- AI Assistant Chat - Configure local WebLLM models or remote API providers in Settings
- Issue Reporter - Requires Gemini API key configured in Settings
Export and import .origami archives to move projects between devices:
- Export saves slides, media, audio, and all settings
- Import validates archive format before restoring
Global settings are not affected by project import/export.
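The import-time validation mentioned above might look like the following structural check. The real `.origami` manifest format is not specified in this README, so every field name below is a hypothetical placeholder:

```typescript
// Minimal structural check before restoring an imported archive.
// Field names are illustrative, not the actual .origami schema.
interface OrigamiManifest { version: number; slides: unknown[]; settings: object }

function isValidManifest(data: unknown): data is OrigamiManifest {
  if (typeof data !== "object" || data === null) return false;
  const m = data as Record<string, unknown>;
  return (
    typeof m.version === "number" &&
    Array.isArray(m.slides) &&
    typeof m.settings === "object" &&
    m.settings !== null
  );
}
```

Rejecting malformed archives before touching existing state is what lets a failed import leave the current project untouched.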
## Keyboard Shortcuts

| Shortcut | Action |
|---|---|
| `Ctrl / Cmd + S` | Save project |
| `Ctrl / Cmd + Z` | Undo |
| `Ctrl / Cmd + Shift + Z` | Redo |
| `Ctrl / Cmd + E` | Export project |
| `Ctrl / Cmd + I` | Import project |
| `Space` | Play/pause preview |
| `Left Arrow` | Previous slide |
| `Right Arrow` | Next slide |
## Development

```
src/
├── components/   # React UI components
├── pages/        # Route components (AssistantPage, etc.)
├── services/     # Core business logic
│   ├── aiService.ts
│   ├── webLlmService.ts
│   ├── ttsService.ts
│   ├── pdfService.ts
│   └── BrowserVideoRenderer.ts
├── hooks/        # Custom React hooks
├── context/      # React context providers
└── utils/        # Utilities and helpers
```
- Service Layer Pattern - Modular services for AI, TTS, PDF, and video rendering
- React Hooks - State management with useState/useEffect
- IndexedDB - Persistent slide/app state with automatic blob URL conversion
- Event-Driven - Custom events for async operations (ttsEvents, videoEvents, webLlmEvents)
- Vite - Fast module bundling with manual chunks for large libraries
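The event-driven pattern above (custom channels like ttsEvents) can be sketched with the standard EventTarget API. The event name and payload below are illustrative, not the app's actual event contract:

```typescript
// A tiny typed event channel over the standard EventTarget API,
// similar in spirit to the ttsEvents/videoEvents/webLlmEvents channels.
class DetailEvent<T> extends Event {
  readonly detail: T;
  constructor(type: string, detail: T) {
    super(type);
    this.detail = detail;
  }
}

const ttsEvents = new EventTarget(); // illustrative channel instance

// A UI component subscribes to progress updates...
const received: number[] = [];
ttsEvents.addEventListener("tts:progress", (e) => {
  received.push((e as DetailEvent<number>).detail);
});

// ...and an async TTS job dispatches as it works.
ttsEvents.dispatchEvent(new DetailEvent("tts:progress", 0.5));
```

Decoupling long-running services from UI components this way means a slow TTS or render job never needs a direct reference to the React tree to report progress.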
- WebGPU memory buffer patch is disabled to prevent performance degradation
- Vision features were removed in v1.0 - do not restore
- Always test production builds before submitting changes: `npm run build && npm run preview`
See CONTRIBUTING.md for development guidelines and contribution process.
## Troubleshooting

For comprehensive troubleshooting guidance including WebGPU issues, performance optimization, and error recovery, see TROUBLESHOOTING.md.
Quick Reference:
- WebGPU not detected - Enable hardware acceleration, update GPU drivers, use supported browser
- Dev server/FFmpeg errors - Run via `npm run dev`; do not open `index.html` directly
- Model download failures - Verify internet stability, clear browser cache, check storage permissions
- Out of memory - Use smaller models, close background apps, or reduce video resolution
- COOP/COEP warnings - Ensure dev server is running with proper headers
Frontend
- React 19.2.0 with TypeScript
- Vite 7.2.4
- Tailwind CSS 4.1.18
- React Router DOM 7.13.0
Core Libraries
- `@mlc-ai/web-llm` for local LLM inference (AI narration scripts, AI Assistant Chat)
- `@ffmpeg/ffmpeg` and `@ffmpeg/util` for video rendering and screen recording composition
- `pdfjs-dist` for PDF rendering and extraction
- `kokoro-js` for text-to-speech
- `@dnd-kit` for drag-and-drop UI
Browser Extensions
- Chrome Extension (JavaScript) - MessagePort communication for DOM-level interaction telemetry
- Background service worker for recording state management
- Content script injection for cursor and event capture
Backend (Dev Server)
- Express.js 5.2.1
- TypeScript
AI & Analysis
- WebGPU for GPU acceleration of all local models
- Google Gemini API for video analysis and bug report generation (optional, requires API key)
You can download a packaged ZIP of the Chrome/Edge extension directly from the app UI:

- Header Actions menu → "Download Chrome Extension"
- Slide Editor → Tools → Slide Media → "Download Extension" card → "Download ZIP"

The ZIP included in source is located at `src/assets/extension/chrome-extension.zip`; the production build emits it to `dist/assets/` with a hashed filename (e.g. `chrome-extension-<hash>.zip`).

Install tip: download and unzip the archive, then open `chrome://extensions`, enable Developer mode, click Load unpacked, and select the unzipped folder.

Developer note: the repo imports the ZIP as a Vite asset URL. Use the `?url` suffix when importing so the file is treated as a static asset, e.g.:

```ts
import chromeExtensionZipUrl from '../assets/extension/chrome-extension.zip?url';
```

This avoids Rollup errors from attempting to parse binary content during production builds.
- AI workflows can run locally in-browser; model downloads are cached after first use.
- First-time setup can take several minutes based on network speed and model size.
- Rendering and analysis performance depend on available CPU/GPU/memory.
- Screen recording with auto zoom works on all major browsers; Chrome extension provides enhanced DOM telemetry for browser tabs.
- AI Assistant Chat requires WebGPU; if it is unavailable, use the Gemini API as a fallback for AI narration generation.
- Issue Reporter and Video Analysis workflows require configured Gemini API key for AI processing.
- All user data stays local in the browser unless explicitly using cloud APIs (Gemini, OpenAI-compatible providers).
## Support

- Issues & Bugs: Report at GitHub Issues
- Troubleshooting: See TROUBLESHOOTING.md for common issues and error recovery
Include:
- Browser name and version
- Operating system
- Node.js version (
node -v) - Steps to reproduce the issue
- Relevant console logs or error messages
## Contributing

We welcome contributions from the community! See CONTRIBUTING.md for:
- Code contribution guidelines
- Development setup
- Pull request process
- Code style standards
## Credits

- TechMitten LLC - Project creator and primary maintainer
Core Technologies
- WebLLM - Local LLM inference in browsers
- Kokoro.js - High-quality browser TTS
- FFmpeg.wasm - Video processing in browsers
- PDF.js - PDF rendering and text extraction
UI/UX Libraries
- React - UI framework
- Tailwind CSS - Styling
- Lucide React - Icons
- dnd-kit - Drag-and-drop functionality
License
This project is licensed under the MIT License.
Made with ❤️ by the Origami AI community
