"Turning Vision Into Independence"
VGS Companion is a complete visual assistance system with 7 phases of functionality:
| Phase | Feature | Status |
|---|---|---|
| Phase 1 | Camera + YOLO + Basic TTS | ✅ |
| Phase 2 | Position Awareness (Left/Center/Right) | ✅ |
| Phase 3 | Distance Estimation | ✅ |
| Phase 4 | Indoor Navigation (Doors/Stairs) | ✅ |
| Phase 5 | Custom YOLO Training Setup | ✅ |
| Phase 6 | Multi-language TTS (EN/HI/TE) | ✅ |
| Phase 7 | Adaptive Learning | ✅ |
```bash
# Install dependencies
pip install -r requirements.txt

# Install Ollama for AI features (optional)
winget install Ollama.Ollama
ollama pull llava

# Run the assistant
python src/main.py
```

| Command | Action |
|---|---|
| "Hey" / "Hey Assistant" | Activate assistant |
| "What do you see?" | Describe surroundings with position |
| "Where is the [object]?" | Locate specific object |
| "How far is it?" | Get distance to nearest object |
| "Help" | List capabilities |
| "Language Hindi/English/Telugu" | Change language |
| "Stop" / "Bye" | Return to standby |
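After the wake word fires, the recognized utterance has to be matched to one of the actions above. A minimal sketch of keyword-based routing; the action labels and the `route_command` name are illustrative, not the actual functions in `src/main.py`:

```python
def route_command(text: str) -> str:
    """Map a recognized utterance to an action label (keyword matching)."""
    t = text.lower()
    if "what do you see" in t:
        return "describe_scene"
    if "where is" in t:
        return "locate_object"
    if "how far" in t:
        return "estimate_distance"
    if "help" in t:
        return "list_capabilities"
    if "language" in t:
        return "switch_language"
    if t.rstrip("!.") in ("stop", "bye"):
        return "standby"
    return "unknown"
```

Substring matching keeps the router tolerant of speech-recognition noise around the key phrase (e.g. "um, what do you see there").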
```
vgs/
├── src/
│   ├── main.py               # Main companion loop
│   ├── camera.py             # Camera capture
│   ├── detector.py           # Basic YOLO detection
│   ├── enhanced_detector.py  # Phase 2-7 detection
│   ├── speaker.py            # Multi-language TTS
│   ├── wake_word.py          # Wake word detection
│   ├── listener.py           # Speech recognition
│   ├── companion.py          # LLaVA integration
│   └── train_model.py        # Phase 5 training setup
├── models/
│   ├── yolov8n.pt            # YOLOv8 model
│   └── vosk/                 # Offline speech model
├── data/
│   └── user_feedback.json    # Phase 7 learning data
├── requirements.txt
└── README.md
```
- Camera capture with OpenCV
- YOLOv8 object detection
- Text-to-speech output
- Fully offline operation
Objects are described by their position:
- Left: object center x < 35% of frame width
- Center: 35% - 65% of frame width
- Right: object center x > 65% of frame width
Example: "I see a person on your left, and a car ahead."
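The banding above reduces to a small pure function. A sketch, with the function name chosen for illustration:

```python
def position_label(x_center: float, frame_width: int) -> str:
    """Classify an object's horizontal position using the 35%/65% bands."""
    ratio = x_center / frame_width
    if ratio < 0.35:
        return "on your left"
    if ratio > 0.65:
        return "on your right"
    return "ahead"
```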
Uses known object sizes to estimate distance:

```
distance = (real_height × focal_length) / pixel_height
```
Known object sizes (in meters):
| Object | Height |
|---|---|
| Person | 1.7m |
| Car | 1.5m |
| Chair | 0.9m |
| Bottle | 0.25m |
Detects:
- Doors (vertical lines)
- Stairs (horizontal line patterns)
- Walls (color detection)
Provides navigation hints like "Door on your left" or "Watch your step!"
1. Collect images of Indian road environments:
   - Autorickshaws, handcarts, potholes, cattle
   - Street vendors, speed breakers, manholes
2. Run setup:
   ```bash
   python src/train_model.py setup
   ```
3. Annotate images with LabelImg or CVAT
4. Train model:
   ```bash
   python src/train_model.py train
   ```
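Ultralytics training expects a dataset YAML describing image paths and class names. A hypothetical example for the custom classes listed above; the paths and class order are assumptions, not what `train_model.py setup` actually generates:

```yaml
# Hypothetical dataset config (Ultralytics YOLO format)
path: datasets/indian_roads
train: images/train
val: images/val
names:
  0: autorickshaw
  1: handcart
  2: pothole
  3: cattle
  4: street_vendor
  5: speed_breaker
  6: manhole
```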
Supported languages:
- English (en) - Default
- Hindi (hi) - "नमस्ते! मैं आपका दृष्टि सहायक हूं"
- Telugu (te) - "నమస్కారం! నేను మీ విజువల్ అసిస్టెంట్"
Switch with: "Hey Assistant, change language to Hindi"
Tracks user behavior to prioritize important objects:
- Records which objects user asks about
- Tracks positive/negative reactions
- Adjusts detection priority based on user preferences
Data is stored in `data/user_feedback.json`.
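The feedback loop amounts to a persistent tally. A minimal sketch; the function names and JSON layout (a flat label-to-count map) are illustrative, not necessarily the schema the project writes:

```python
import json
from collections import Counter
from pathlib import Path

FEEDBACK_FILE = Path("data/user_feedback.json")  # path from this README

def record_query(label: str, path: Path = FEEDBACK_FILE) -> None:
    """Increment the count for an object the user asked about."""
    counts = Counter()
    if path.exists():
        counts.update(json.loads(path.read_text()))
    counts[label] += 1
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(counts))

def priority(label: str, path: Path = FEEDBACK_FILE) -> int:
    """Higher tally -> announced earlier when describing a scene."""
    if not path.exists():
        return 0
    return json.loads(path.read_text()).get(label, 0)
```

Detections can then be sorted by `priority` before being spoken, so frequently requested objects come first.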
Edit `src/config.py`:

```python
CAMERA_INDEX = 0                  # Camera device
MODEL_PATH = 'models/yolov8n.pt'  # YOLO weights
CONFIDENCE = 0.4                  # Detection threshold
WAKE_WORD = "hey assistant"       # Wake phrase
SPEECH_RATE = 160                 # Words per minute
OLLAMA_MODEL = "llava"            # AI model
```

Dependencies (`requirements.txt`):

```
ultralytics>=8.0.0      # YOLO
opencv-python>=4.7.0    # Camera
pyttsx3>=2.90           # Offline TTS
SpeechRecognition>=3.8  # STT
vosk>=0.3.0             # Offline STT
ollama>=0.1.0           # LLM
elevenlabs>=1.0.0       # Premium TTS (optional)
python-dotenv>=1.0.0    # Config
numpy>=1.23.0           # Math
```
| Component | Purpose |
|---|---|
| Raspberry Pi 4 (4GB) | Main processor |
| USB/Webcam | Camera |
| Microphone | Voice input |
| Earbuds/Speaker | Audio output |
| Issue | Solution |
|---|---|
| Wake word not detected | Speak louder, reduce background noise |
| Ollama slow | Use smaller model or disable |
| TTS not working | Run `espeak "test"` to verify |
| Camera not found | Check the `cv2.VideoCapture(0)` index |
VGS Companion - All 7 Phases Complete