A modular, multi-camera AI visual interface designed for real-time object recognition, AI interaction, and future robotic control.
WALDO is a Python-based AI assistant interface with live video input from up to 4 USB cameras. Each feed is processed independently by an associated AI instance, while a central "interface AI" receives typed input from the user and intelligently routes context-aware prompts.
This system is designed for rapid prototyping of AI perception systems, with the end goal of integrating with physical robotics such as Le Robot (open-source robotic arm).
- Modular Project Structure with separate files for UI, camera, AI logic, and API integration
- 4-Camera Feed Grid with on/off toggles and a camera index overlay
- Interface GUI built using Tkinter:
  - Live 2x2 camera feed display (20-25 FPS)
  - Scrollable AI output log
  - Resizable, expandable interface layout (via PanedWindow)
  - Multiline input box with Shift+Enter for newline, Enter to send
- API Integration using the NRP (Nautilus Research Platform) API with the llava-onevision model
- Query Routing via ai_manager.py to link camera-specific prompts to the appropriate AI instance
- .bat Launcher for easy startup without opening a terminal
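The Enter-vs-Shift+Enter behavior of the input box can be sketched as below. This is a minimal illustration, not the project's actual gui_interface.py; the helper names are hypothetical, and the Shift detection relies on Tk's standard modifier bit in `event.state`.

```python
import tkinter as tk

SHIFT_MASK = 0x0001  # Tk's modifier bit for the Shift key in event.state

def should_send(event_state: int) -> bool:
    """Plain Enter submits the prompt; Shift+Enter inserts a newline."""
    return not (event_state & SHIFT_MASK)

def attach_input_box(parent, on_send):
    """Bind a multiline Text widget so Enter sends and Shift+Enter
    adds a newline (hypothetical helper, not the project's API)."""
    box = tk.Text(parent, height=3, wrap="word")

    def _on_return(event):
        if should_send(event.state):
            on_send(box.get("1.0", "end-1c"))
            box.delete("1.0", "end")
            return "break"   # stop Tk from inserting the newline
        return None          # Shift held: let Tk insert '\n'

    box.bind("<Return>", _on_return)
    return box
```

In the full interface, the returned widget would be added to the bottom pane of a `tk.PanedWindow`, with the scrollable AI log above it, giving the resizable layout described in the features list.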
- Python 3.10+
- Virtual environment activated (venv)
- Dependencies:
# Create and activate a virtual environment (if not done yet)
python -m venv venv
.\venv\Scripts\activate  # Windows

# Upgrade pip
pip install --upgrade pip

# Core dependencies
pip install numpy scipy matplotlib pandas scikit-learn
pip install opencv-python opencv-contrib-python
pip install pillow
pip install pyserial
pip install requests
pip install python-dotenv      # Optional: store API keys in a .env file
pip install openai==1.30.1     # Pinned -- match your API version & structure
pip install torch torchvision torchaudio transformers
pip install ultralytics
project/
├── camera_viewer.py # Handles video grid and feed logic
├── gui_interface.py # User interface (Tkinter)
├── ai_manager.py # Prompt routing & AI instance control
├── vision_ai.py # NRP API integration
├── .env # Contains your NRP_API_KEY
└── run_interface.bat # Windows launcher
NRP_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxx
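The key in .env is read at startup and passed to the OpenAI SDK with a custom base_url pointing at the NRP endpoint. The sketch below shows only the payload shape for pairing a text prompt with one camera frame; the endpoint URL is a placeholder, and the commented wiring is an assumption about how vision_ai.py might use it.

```python
import base64

def build_vision_messages(prompt: str, jpeg_bytes: bytes) -> list:
    """Build an OpenAI-style chat payload that pairs a text prompt
    with one camera frame encoded as a base64 data URL."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]

# Hypothetical wiring for vision_ai.py -- the endpoint URL is a placeholder:
#
#   import os
#   from dotenv import load_dotenv
#   from openai import OpenAI
#   load_dotenv()  # reads NRP_API_KEY from .env
#   client = OpenAI(api_key=os.environ["NRP_API_KEY"],
#                   base_url="https://<your-nrp-endpoint>/v1")
#   reply = client.chat.completions.create(
#       model="llava-onevision",
#       messages=build_vision_messages("What do you see?", frame_jpeg))
```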
Run the interface by double-clicking run_interface.bat
- No longer requires phrases like "What do you see on camera..."
- Interprets context (e.g., "Where is the red box?" → uses AI judgment to query and correlate feeds)
- Maintains ongoing conversation context (multimodal grounding)
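A simplified version of the routing decision in ai_manager.py might look like this. It is only a sketch under assumed conventions (cameras indexed 0-3, a regex fallback for explicit "camera N" phrases); the context-aware correlation described above is left to the interface AI.

```python
import re

NUM_CAMERAS = 4

def route_query(user_text: str) -> list[int]:
    """Pick which camera AI instances should receive this prompt.

    Explicit mentions like 'camera 2' go only to that feed; anything
    else fans out to every feed so the interface AI can correlate the
    answers (simplified sketch, not the project's actual logic).
    """
    mentioned = [int(n) for n in
                 re.findall(r"\bcamera\s*(\d+)\b", user_text, re.IGNORECASE)]
    valid = [n for n in mentioned if 0 <= n < NUM_CAMERAS]
    return valid if valid else list(range(NUM_CAMERAS))
```

A query like "Where is the red box?" fans out to all four feeds, while "What is on camera 2?" is routed to a single instance.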
- Exclude toggle buttons and number overlays from AI input images
- Display bounding boxes and labels on objects identified in live feeds
- Include (x, y) pixel coordinates for object location
- Estimate 3D position (x, y, z) using triangulation across camera angles
- Requires calibrated camera positioning configuration file
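The triangulation step above can be illustrated with linear (DLT) triangulation: given each camera's calibrated 3x4 projection matrix, a 3D point is recovered from its pixel observations in two feeds by solving a homogeneous least-squares system. The toy projection matrices below stand in for the real calibration file.

```python
import numpy as np

def triangulate(P1, P2, xy1, xy2):
    """Linear (DLT) triangulation: recover a 3D point from its pixel
    observations in two cameras with known 3x4 projection matrices."""
    x1, y1 = xy1
    x2, y2 = xy2
    # Each observation contributes two rows of the homogeneous system A X = 0.
    A = np.stack([
        x1 * P1[2] - P1[0],
        y1 * P1[2] - P1[1],
        x2 * P2[2] - P2[0],
        y2 * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]               # null vector = homogeneous 3D point
    return X[:3] / X[3]      # dehomogenize to (x, y, z)

# Two toy cameras: identity intrinsics, the second shifted one unit along
# x (a stereo baseline) -- stand-ins for a real calibration file.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

point = np.array([1.0, 2.0, 5.0])   # ground-truth 3D point
def project(P, X):
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]
est = triangulate(P1, P2, project(P1, point), project(P2, point))
```

With noisy real detections, more than two views and a nonlinear refinement would improve the estimate, but the linear solve is the core of the idea.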
- Connect AI output to control Le Robot (open-source)
- Allow AI to select and manipulate objects based on camera data and interface prompts
- Step toward full perception-actuation loop
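Since pyserial is already among the dependencies, the actuation side could start as simply as encoding AI-chosen target coordinates into line-based serial commands. The 'MOVE x y z' protocol below is entirely hypothetical -- Le Robot's actual interface will define its own message format.

```python
def format_move_command(x: float, y: float, z: float) -> bytes:
    """Encode a Cartesian move target as a line-based serial command.
    The 'MOVE x y z' protocol is hypothetical, for illustration only."""
    return f"MOVE {x:.3f} {y:.3f} {z:.3f}\n".encode("ascii")

# Hypothetical wiring (requires pyserial and a connected arm):
#
#   import serial
#   arm = serial.Serial("COM3", 115200, timeout=1)
#   arm.write(format_move_command(0.10, -0.05, 0.20))
```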
- NRP Nautilus API
- OpenAI Python SDK (custom base_url support)
- [Le Robot GitHub (TBD: insert link)]
- Tkinter Docs
- LLaVA (vision-language model)
This system is modular and open to extension:
- Add persistent memory per AI
- Integrate voice recognition and TTS
- Enable external plugin system for additional sensors or actuators
- Allow in-GUI camera toggling, layout presets, and dark mode themes
Built by Peter Shryock for a future where AI can see, speak, and act.