caption-craft.ai (AI-Powered Social Media Caption Generator) is a full-stack web application that leverages state-of-the-art artificial intelligence to transform ordinary images into compelling social media content. Built with modern web technologies and advanced AI models, it provides intelligent, platform-specific caption generation for Instagram, Facebook, and LinkedIn.
In today's digital landscape, creating engaging social media content is crucial for personal branding, business growth, and community building. However, crafting the perfect caption that resonates with your audience while maintaining platform-specific best practices can be time-consuming and challenging. This application democratizes high-quality content creation by making AI-powered caption generation accessible to everyone.
- LLAVA 7B Vision Model: State-of-the-art image analysis with detailed scene understanding
- GPT-OSS 120B: Large language model for sophisticated caption generation
- DeepSeek-R1: Alternative reasoning model for diverse content creation
- Transparent AI Reasoning: View the AI's decision-making process with `<think></think>` tags
- Multi-Model Support: Switch between different AI models for varied results
- Instagram: Trendy, aesthetic-focused captions with strategic emojis and hashtags
- Facebook: Community-oriented, conversational tone with engagement-focused content
- LinkedIn: Professional, thought-leadership content with industry-relevant insights
- Smart Adaptation: AI automatically adjusts tone, style, and hashtags for each platform
- SHORT: Concise, punchy captions perfect for quick engagement
- STORY: Narrative, storytelling approach that draws readers in
- PHILOSOPHY: Deep, thought-provoking content that sparks reflection
- LIFESTYLE: Aspirational, lifestyle-focused content that inspires
- QUOTE: Inspirational, quote-style captions that motivate and uplift
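A minimal sketch of how these style and platform presets might map to prompt templates. The template wording, dictionary names, and `build_prompt` helper are illustrative assumptions, not the application's actual prompts:

```python
# Hypothetical prompt assembly for the five caption styles and three
# platforms; the exact wording used by the app may differ.
STYLE_TEMPLATES = {
    "SHORT": "Write a concise, punchy caption",
    "STORY": "Write a narrative caption that draws readers in",
    "PHILOSOPHY": "Write a thought-provoking caption that sparks reflection",
    "LIFESTYLE": "Write an aspirational, lifestyle-focused caption",
    "QUOTE": "Write an inspirational, quote-style caption",
}

PLATFORM_HINTS = {
    "instagram": "Use a trendy, aesthetic tone with strategic emojis and hashtags.",
    "facebook": "Use a community-oriented, conversational tone.",
    "linkedin": "Use a professional, thought-leadership tone.",
}

def build_prompt(description: str, platform: str, style: str) -> str:
    """Combine the image description, platform hint, and style template."""
    return (
        f"{STYLE_TEMPLATES[style]} for {platform.capitalize()}. "
        f"{PLATFORM_HINTS[platform]} Image: {description}"
    )
```

One prompt per (platform, style) pair would yield the fifteen distinct caption variants the feature lists describe.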
- Drag & Drop Interface: Intuitive file upload with visual feedback
- Real-time Processing: Live progress indicators during AI analysis
- Dark/Light Mode: Beautiful, responsive UI with theme switching
- History Management: Save, organize, and revisit your generated captions
- Smart Sharing: Direct integration with social media platforms
- Duplicate Detection: Intelligent image deduplication using SHA-256 hashing
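The duplicate-detection step can be sketched in a few lines. This assumes the deduplication key is the SHA-256 digest of the uploaded file's bytes, as the feature list describes; the function names are illustrative:

```python
import hashlib

def image_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the deduplication key."""
    return hashlib.sha256(data).hexdigest()

def is_duplicate(data: bytes, seen: set) -> bool:
    """Check whether these exact bytes were uploaded before; record if new."""
    h = image_hash(data)
    if h in seen:
        return True
    seen.add(h)
    return False
```

Because the hash covers the raw bytes, a re-upload of the same file is caught regardless of filename, while any pixel-level change produces a different digest.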
```mermaid
graph TB
    A[User Interface] --> B[React Frontend]
    B --> C[FastAPI Backend]
    C --> D[LLAVA Vision Model]
    C --> E[GPT-OSS/DeepSeek]
    C --> F[History Storage]
    D --> G[Image Analysis]
    E --> H[Caption Generation]
    G --> H
    H --> I[Platform Optimization]
    I --> J[Social Media Sharing]
```
```
AI-Caption-Generator/
├── backend/                  # FastAPI Backend Server
│   ├── main.py               # Core API endpoints and logic
│   ├── requirements.txt      # Python dependencies
│   ├── Dockerfile            # Backend container configuration
│   ├── setup.py              # Environment setup script
│   ├── uploads/              # Temporary file storage
│   └── history.json          # Persistent data storage
├── frontend/                 # React + TypeScript Frontend
│   ├── src/                  # Source code directory
│   │   ├── App.tsx           # Main application component
│   │   ├── main.tsx          # Application entry point
│   │   ├── pages/            # Page components
│   │   │   ├── Home.tsx      # Main application page
│   │   │   └── Result.tsx    # Results display page
│   │   └── types.d.ts        # TypeScript definitions
│   ├── package.json          # Node.js dependencies
│   ├── Dockerfile            # Frontend container configuration
│   ├── tailwind.config.js    # Tailwind CSS configuration
│   └── vite.config.mjs       # Vite build configuration
├── docker-compose.yml        # Multi-container orchestration
├── setup.py                  # Project setup automation
└── README.md                 # Project documentation
```
- Image Upload → Frontend receives file via drag & drop
- File Validation → Backend validates file type and size
- Hash Calculation → SHA-256 hash for duplicate detection
- AI Analysis → LLAVA model analyzes image content
- Caption Generation → GPT-OSS/DeepSeek creates platform-specific captions
- Result Processing → Backend formats and optimizes output
- User Display → Frontend presents organized results
- History Storage → JSON-based persistent storage
- Social Sharing → Direct platform integration
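The flow above can be condensed into a runnable sketch with the two model calls stubbed out. The function names, size limit, and allowed MIME types here are illustrative assumptions, not the actual values in `main.py`:

```python
# Simplified request pipeline: validate -> analyze -> caption -> store.
ALLOWED_TYPES = {"image/jpeg", "image/png"}   # assumed accepted types
MAX_BYTES = 10 * 1024 * 1024                  # assumed 10 MB upload cap

def validate(content_type: str, data: bytes) -> None:
    """Reject uploads that are the wrong type or too large."""
    if content_type not in ALLOWED_TYPES:
        raise ValueError("unsupported file type")
    if len(data) > MAX_BYTES:
        raise ValueError("file too large")

def analyze_image(data: bytes) -> str:
    """Stand-in for the LLAVA vision-model call."""
    return "a person hiking at sunrise"

def generate_caption(description: str, platform: str) -> str:
    """Stand-in for the GPT-OSS/DeepSeek caption call."""
    return f"[{platform}] caption about {description}"

def handle_upload(content_type: str, data: bytes, platform: str, history: list) -> str:
    """Run the full pipeline and record the result in history."""
    validate(content_type, data)
    description = analyze_image(data)
    caption = generate_caption(description, platform)
    history.append({"platform": platform, "caption": caption})
    return caption
```

In the real application, `analyze_image` would call the local Ollama LLAVA endpoint and `generate_caption` the Hugging Face router, but the control flow stays the same.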
| Requirement | Version | Purpose |
|---|---|---|
| Python | 3.11+ | Backend API development |
| Node.js | 18+ | Frontend development |
| Docker | Latest | Containerized deployment |
| Hugging Face API | Required | AI model access |
| Ollama | Latest | Local LLAVA model serving |
- RAM: Minimum 8GB (16GB recommended for optimal performance)
- Storage: 10GB free space for models and dependencies
- Network: Stable internet connection for AI model access
- OS: Windows 10+, macOS 10.15+, or Linux (Ubuntu 20.04+)
```bash
# Clone the repository
git clone https://github.com/Harry-jain/caption-craft.ai.git
cd caption-craft.ai

# Verify the project structure
ls -la
```

```bash
# Navigate to backend directory
cd backend

# Create Python virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Upgrade pip to latest version
python -m pip install --upgrade pip

# Install Python dependencies
pip install -r requirements.txt

# Configure environment variables
# Option A: Use automated setup script (Recommended)
python ../setup.py

# Option B: Manual configuration
echo "HF_TOKEN=your_huggingface_token_here" > .env
echo "HF_GPT_OSS_MODEL=openai/gpt-oss-120b:together" >> .env

# Start the FastAPI server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

```bash
# Install Ollama from https://ollama.ai
# Download the appropriate installer for your OS

# Pull the LLAVA 7B model for image analysis
ollama pull llava:7b

# Start Ollama service in the background
ollama serve &

# Verify the installation and model
ollama list

# Test the model (optional)
ollama run llava:7b "Describe this image" --image /path/to/test/image.jpg
```

```bash
# Navigate to frontend directory
cd frontend

# Install Node.js dependencies
npm install

# Start the development server
npm run dev

# Alternative: Build for production
npm run build
npm run preview
```

| Service | URL | Description |
|---|---|---|
| Frontend | http://localhost:5173 | Main application interface |
| Backend API | http://localhost:8000 | REST API endpoints |
| API Documentation | http://localhost:8000/docs | Interactive API docs |
| Health Check | http://localhost:8000/health | Service status |
- Backend Health Check: Visit http://localhost:8000/docs
- Frontend Loading: Visit http://localhost:5173
- Ollama Status: Run `ollama list` in a terminal
- API Connectivity: Check browser developer tools for API calls
```bash
# Build and run with Docker Compose (Recommended)
docker-compose up --build

# Run in detached mode
docker-compose up -d --build

# View logs
docker-compose logs -f

# Stop services
docker-compose down
```

```bash
# Build backend container
docker build -t caption-backend ./backend

# Build frontend container
docker build -t caption-frontend ./frontend

# Run backend container
docker run -p 8000:8000 -e HF_TOKEN=your_token_here caption-backend

# Run frontend container
docker run -p 5173:5173 caption-frontend
```

The `docker-compose.yml` file orchestrates the entire application stack:

```yaml
version: '3.8'
services:
  backend:
    build: ./backend
    ports:
      - "8000:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - HF_GPT_OSS_MODEL=${HF_GPT_OSS_MODEL}
    volumes:
      - ./backend:/app
    command: uvicorn main:app --host 0.0.0.0 --port 8000
  frontend:
    build: ./frontend
    ports:
      - "5173:5173"
    volumes:
      - ./frontend:/app
    command: npm run dev
    depends_on:
      - backend
```

| Variable | Description | Required | Default |
|---|---|---|---|
| `HF_TOKEN` | Hugging Face API token for AI model access | Yes | - |
| `HF_GPT_OSS_MODEL` | Primary model for caption generation | No | `openai/gpt-oss-120b:together` |
| `OLLAMA_URL` | Ollama service endpoint | No | `http://localhost:11434` |
| `UPLOAD_DIR` | Directory for temporary file storage | No | `uploads` |
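Reading this configuration in the backend might look like the following sketch. The variable names and defaults mirror the table above, while `load_config` itself is an assumed helper, not necessarily what `main.py` defines:

```python
import os

def load_config() -> dict:
    """Load settings from the environment, applying the documented defaults."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        # HF_TOKEN is the only required variable.
        raise RuntimeError("HF_TOKEN is required for AI model access")
    return {
        "hf_token": token,
        "model": os.environ.get("HF_GPT_OSS_MODEL", "openai/gpt-oss-120b:together"),
        "ollama_url": os.environ.get("OLLAMA_URL", "http://localhost:11434"),
        "upload_dir": os.environ.get("UPLOAD_DIR", "uploads"),
    }
```

Failing fast on a missing `HF_TOKEN` surfaces misconfiguration at startup rather than on the first caption request.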
- API Token Security: Never commit `.env` files to version control
- Token Rotation: Regularly rotate Hugging Face API tokens
- File Permissions: Ensure proper file permissions for upload directories
- Network Security: Use HTTPS in production environments
- Input Validation: All file uploads are validated for type and size
| Model | Provider | Size | Use Case | Performance |
|---|---|---|---|---|
| GPT-OSS 120B | Together AI | 120B parameters | Primary caption generation | High quality, slower |
| GPT-OSS 20B | Together AI | 20B parameters | Faster caption generation | Good quality, faster |
| DeepSeek-R1 | Fireworks AI | 7B parameters | Alternative reasoning | Fast, diverse output |
| LLAVA 7B | Ollama | 7B parameters | Image analysis | Local processing |
- Upload Image: Drag & drop or select an image file
- Choose Platform: Select Instagram, Facebook, or LinkedIn
- Select Model: Choose between GPT-OSS or DeepSeek-R1
- AI Analysis: LLAVA analyzes the image content
- Caption Generation: AI generates 5 distinct caption types
- Review & Share: Copy captions or share directly to platforms
Each platform generates 5 unique caption styles:
- SHORT: Concise, punchy captions with hashtags
- STORY: Narrative, storytelling approach
- PHILOSOPHY: Deep, thought-provoking content
- LIFESTYLE: Aspirational, lifestyle-focused
- QUOTE: Inspirational, quote-style captions
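Two small helpers illustrate how a client might post-process a result: stripping the optional `<think></think>` reasoning block mentioned under Transparent AI Reasoning, and checking that all five styles came back. The payload shape (a style-to-caption mapping) is an assumption for illustration:

```python
import re

EXPECTED_STYLES = {"SHORT", "STORY", "PHILOSOPHY", "LIFESTYLE", "QUOTE"}

def split_reasoning(raw: str) -> tuple:
    """Separate an optional <think>...</think> block from the caption text."""
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not m:
        return "", raw.strip()
    reasoning = m.group(1).strip()
    caption = (raw[: m.start()] + raw[m.end():]).strip()
    return reasoning, caption

def missing_styles(result: dict) -> set:
    """Return any of the five expected styles absent from a result payload."""
    return EXPECTED_STYLES - set(result)
```

Keeping the reasoning and the caption separate lets the UI show the caption by default and reveal the model's thought process on demand.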
```http
POST /api/describe
Content-Type: multipart/form-data

file: [image file]
```

```http
POST /api/caption
Content-Type: application/json

{
  "description": "image description",
  "image_name": "filename.jpg",
  "image_hash": "optional_hash",
  "tone": "instagram|facebook|linkedin",
  "model_id": "optional_model_override"
}
```

```http
GET /api/history           # Get all captions
GET /api/history/{id}      # Get specific caption
DELETE /api/history/{id}   # Delete caption
DELETE /api/history        # Clear all history
```

- FastAPI: Modern, fast web framework
- Python 3.11+: Latest Python features
- Hugging Face: AI model inference
- LLAVA: Vision-language model for image analysis
- OpenAI Client: Router API integration
- React 18: Modern React with hooks
- TypeScript: Type-safe development
- Tailwind CSS: Utility-first styling
- Vite: Fast build tool
- React Router: Client-side routing
- LLAVA 7B: Image recognition and description
- GPT-OSS 120B: Advanced caption generation
- DeepSeek-R1: Alternative caption model
- Instagram: Trendy, aesthetic-focused with strategic emojis
- Facebook: Community-oriented, conversational tone
- LinkedIn: Professional, thought-leadership content
- View the AI's thought process with `<think></think>` tags
- Understand how captions are adapted for each platform
- Transparent AI decision-making
- Direct platform integration
- Optimized content for each social network
- Copy-to-clipboard functionality
- Image Analysis: ~5-10 seconds with LLAVA
- Caption Generation: ~10-20 seconds with GPT-OSS
- Response Time: Optimized for real-time interaction
- Scalability: Docker-ready for production deployment
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is developed as a comprehensive demonstration of modern AI integration with full-stack web development. The application showcases advanced capabilities in computer vision, natural language processing, and user experience design.
- Hugging Face for providing the AI models
- Together AI for GPT-OSS model hosting
- Fireworks AI for DeepSeek-R1 hosting
- Open Source Community for the amazing tools
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: harry.jain@example.com
- Multi-language support
- Video caption generation
- Advanced hashtag optimization
- Social media scheduling integration
- Team collaboration features
- Analytics and insights
```bash
# Install Ollama
# Download from: https://ollama.ai

# Pull LLAVA model
ollama pull llava:7b

# Start service
ollama serve

# Verify in another terminal
ollama list
```

```bash
# Create .env file in backend/ directory
echo "HF_TOKEN=your_actual_token_here" > backend/.env
echo "HF_GPT_OSS_MODEL=openai/gpt-oss-120b:together" >> backend/.env
```

```bash
# Clear node_modules and reinstall
cd frontend
rm -rf node_modules package-lock.json
npm install
```

```bash
# Check what's using the port
netstat -ano | findstr :8000   # Windows
lsof -i :8000                  # macOS/Linux

# Kill the process or use a different port
uvicorn main:app --host 0.0.0.0 --port 8001 --reload
```

Made with ❤️ by Harsh Jain
⭐ Star this repo if you found it helpful!