A real-time Indian Sign Language (ISL) recognition system powered by a fine-tuned Swin3D-S deep learning model, served via a FastAPI backend and paired with a Flutter mobile app that provides live translation and sentence building.
app_working_demo.mp4
| Metric | Value |
|---|---|
| Top-1 Accuracy | 66.84% |
| Macro F1 | 0.638 |
| Weighted F1 | 0.648 |
| Classes | 76 ISL words |
| Test Samples | 187 |
| Random Baseline | 1.3% |
The model uses a Swin3D-S backbone pretrained on Kinetics-400 for spatiotemporal feature extraction from video clips.
```text
Input Video (3 × 16 × 224 × 224)
        ↓
Patch Embedding (Conv3D, 96 channels)
        ↓
Swin Transformer Blocks (4 stages)
  Stage 1:  2 blocks, dim=96,  resolution=8×56×56
  Stage 2:  2 blocks, dim=192, resolution=8×28×28
  Stage 3: 18 blocks, dim=384, resolution=8×14×14
  Stage 4:  2 blocks, dim=768, resolution=8×7×7
        ↓
Adaptive Average Pooling → 768-dim feature vector
        ↓
Linear Classification Head (768 → 76 classes)
```
Total Parameters: 33,112,492
Trainable Parameters: 9,510,988 (Stage 3 + Stage 4 + Norm + Head unfrozen)
Model Size: ~126 MB
- Frozen: Stages 1, 2 and patch embedding
- Unfrozen: `features[6]` (full Stage 3), norm layer, classification head
- Loss: CrossEntropyLoss
- Optimizer: AdamW (lr=1e-4)
- Scheduler: ReduceLROnPlateau (factor=0.5, patience=5)
- Mixed Precision: fp16 via `torch.cuda.amp.GradScaler`
- Early Stopping: patience=5
Source: Indian Sign Language Words with Landmarks (Kaggle)
| Split | Samples |
|---|---|
| Train | 745 |
| Validation | 234 |
| Test | 187 |
| Total | 1,166 |
76 ISL word classes:
afternoon, animal, bad, beautiful, big, bird, blind, cat, cheap, clothing, cold, cow, curved, deaf, dog, dress, dry, evening, expensive, famous, fast, female, fish, flat, friday, good, happy, hat, healthy, horse, hot, hour, light, long, loose, loud, minute, monday, month, morning, mouse, narrow, new, night, old, pant, pocket, quiet, sad, saturday, second, shirt, shoes, short, sick, skirt, slow, small, suit, sunday, t_shirt, tall, thursday, time, today, tomorrow, tuesday, ugly, warm, wednesday, week, wet, wide, year, yesterday, young
- Video format: `.MOV`, variable length, processed to 16 frames at 224×224
- Preprocessing: center crop, rescale (1/255), normalize (mean=0.5, std=0.5)
- Augmentation (train only): RandomPerspective, ColorJitter
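The "variable length → 16 frames" step and the rescale/normalize math can be sketched in plain Python (both helpers are hypothetical illustrations, not the repo's preprocessing code):

```python
def sample_frame_indices(num_frames: int, clip_length: int = 16) -> list[int]:
    """Uniformly sample clip_length frame indices from a variable-length video
    by taking the middle frame of each of clip_length equal segments."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    step = num_frames / clip_length
    return [min(int(step * i + step / 2), num_frames - 1)
            for i in range(clip_length)]

def normalize_pixel(value: int) -> float:
    """Rescale a 0–255 pixel to [-1, 1]: (v/255 - 0.5) / 0.5."""
    return (value / 255.0 - 0.5) / 0.5
```

With mean=0.5 and std=0.5, the usable pixel range maps symmetrically onto [-1, 1], which matches what the Kinetics-pretrained backbone expects when trained with the same normalization.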
Platform: Kaggle Notebooks
Hardware: NVIDIA Tesla T4 (15GB VRAM)
Framework: PyTorch 2.10.0+cu128
Training config:
```python
BATCH_SIZE = 32
CLIP_LENGTH = 16   # frames per video
CLIP_SIZE = 224    # spatial resolution
EPOCHS = 1000      # with early stopping
LR = 0.0001
PATIENCE = 5       # early stopping
SEED = 42
```

Training time: ~3.5 minutes/epoch × ~15 epochs ≈ ~1 hour total
Pretrained weights: Swin3D_S_Weights.KINETICS400_V1 (torchvision)
Model hosted at: huggingface.co/Creator-090/isl-swin3d-model
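The `PATIENCE = 5` early-stopping behavior from the config above can be captured in a small helper. A minimal sketch (class name `EarlyStopping` is illustrative):

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for `patience` epochs."""
    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

This is why `EPOCHS = 1000` is safe to set: the run halts after ~15 epochs in practice once validation loss plateaus.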
```text
.
├── api/                          # FastAPI inference server (HF Space)
│   ├── app.py                    # REST endpoints (/predict, /health, /health/deep)
│   ├── model.py                  # Swin3D model + preprocessing + inference
│   ├── requirements.txt          # Dependencies for deployment
│   └── Dockerfile                # Container config for HF Spaces
│
├── backend/                      # Flask + WebSocket server (real-time system layer)
│   ├── saved_captures/           # Temporary video clips generated from frame buffers
│   ├── authentication.py         # Firebase auth (register/login users)
│   ├── history.py                # Store & retrieve translation history (DB layer)
│   ├── main.py                   # Core server (REST + WebSocket, frame processing, pipeline orchestration)
│   ├── model.py                  # Client wrapper for FastAPI (sends video → gets prediction)
│   ├── requirements.txt          # Backend dependencies (Flask, mediapipe, etc.)
│   ├── hand_landmarker.task      # MediaPipe hand landmark model
│   └── pose_landmarker_full.task # MediaPipe pose landmark model
│
├── frontend/                     # Flutter mobile app (client)
│   ├── lib/
│   │   ├── providers/            # State management (app state, predictions, UI sync)
│   │   ├── screens/              # UI screens (e.g., LiveTranslationScreen)
│   │   └── services/             # API + WebSocket communication, sentence builder logic
│   ├── assets/                   # Static assets (icons, images)
│   ├── android/                  # Android-specific config
│   └── ios/                      # iOS-specific config
│
└── model/                        # Training + experimentation
    ├── is76words.ipynb           # Kaggle notebook (training pipeline)
    ├── checkpoints/              # Saved trained weights (.pt files)
    ├── swin_small_ISL_gpu.py     # Model architecture & training script
    └── dataset/                  # ISL gesture dataset
```
The FastAPI backend exposes:
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check |
| `/health` | GET | Model load status |
| `/health/deep` | GET | Verifies inference works |
| `/predict` | POST | Single video clip → predicted sign |
Live prediction response:
```json
{
  "prediction": "happy",
  "confidence": 84.21,
  "top_k": [
    {"class": "happy", "confidence": 84.21},
    {"class": "good", "confidence": 9.43}
  ],
  "inference_time_ms": 312.5
}
```

- Python 3.10+
- Flutter SDK ≥ 3.x
- Android Studio or Xcode
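A minimal Python client for the `/predict` endpoint and its JSON response shown above could look like the following. The multipart field name `"file"` is an assumption; check `api/app.py` for the actual parameter name.

```python
import json

def parse_prediction(payload: str) -> tuple[str, float]:
    """Extract the top word and confidence from a /predict JSON response."""
    data = json.loads(payload)
    return data["prediction"], data["confidence"]

def predict_clip(api_url: str, clip_path: str) -> dict:
    """POST a video clip to the FastAPI /predict endpoint and return the JSON reply.
    The field name "file" is assumed, not confirmed from the repo."""
    import requests  # third-party: pip install requests
    with open(clip_path, "rb") as f:
        resp = requests.post(f"{api_url}/predict", files={"file": f}, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    result = predict_clip("https://creator-090-isl-api.hf.space", "clip.mp4")
    print(result["prediction"], result["confidence"])
```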
```shell
git clone https://github.com/Uni-Creator/signSight.git
cd signSight
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r backend/requirements.txt
cd backend
python main.py
# API available at http://localhost:5000
```

```shell
cd frontend
flutter pub get
flutter run
```

Update the API URL in frontend/lib/services/:
```dart
static const String API_URL = "https://creator-090-isl-api.hf.space";
// or for local: "http://YOUR_LOCAL_IP:7860"
```

| Layer | Technology |
|---|---|
| Model backbone | Swin3D-S (torchvision) |
| Pretraining | Kinetics-400 |
| Training framework | PyTorch 2.10 + CUDA 12.8 |
| Mixed precision | torch.cuda.amp (fp16) |
| Video preprocessing | Decord + VivitImageProcessor |
| Backend API | FastAPI + Uvicorn |
| Model hosting | Hugging Face Hub |
| Mobile frontend | Flutter (Dart) |
| Auth & sync | Firebase / Pyrebase4 |
| Text-to-speech | flutter_tts |
| Data augmentation | torchvision v2 transforms |
```text
Phone Camera
      ↓ (2-sec clips)
Flutter App (LiveISLTranslator)
      ↓ (multipart/form-data POST)
FastAPI /predict
      ↓
Swin3D-S Inference (fp16, CPU on HF Spaces)
      ↓
Smoothing (majority vote over last 3 predictions)
      ↓
SentenceBuilder (confirm word after 2x detection)
      ↓
Display + TTS
```
Latency (HF Spaces free tier): ~4–6 sec/clip (CPU inference)
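The two stabilization stages in the pipeline above — majority voting over the last three predictions and confirming a word after two consecutive detections — can be sketched in a few lines of Python (class names are illustrative, not the app's actual Dart implementation):

```python
from collections import Counter, deque

class PredictionSmoother:
    """Majority vote over the last `window` raw predictions."""
    def __init__(self, window: int = 3):
        self.history = deque(maxlen=window)

    def update(self, word: str) -> str:
        self.history.append(word)
        # most_common breaks ties by insertion order, favoring the older word
        return Counter(self.history).most_common(1)[0][0]

class SentenceBuilder:
    """Append a word to the sentence once it is seen `confirmations` times in a row."""
    def __init__(self, confirmations: int = 2):
        self.confirmations = confirmations
        self.last = None
        self.count = 0
        self.words = []

    def update(self, word: str) -> str:
        if word == self.last:
            self.count += 1
        else:
            self.last, self.count = word, 1
        # Append exactly once, on the Nth consecutive sighting.
        if self.count == self.confirmations:
            self.words.append(word)
        return " ".join(self.words)
```

Smoothing absorbs single-clip misclassifications, while the confirmation threshold keeps a briefly flickering prediction out of the final sentence.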
- Fork the repository
- Create a feature branch
- Commit your changes
- Push and open a pull request
MIT License. See LICENSE for details.
For questions or inquiries, reach out at abhayr24564@gmail.com.