A FastAPI-based copilot system that guides users through procedures step by step using computer vision analysis. The project is designed to be compatible with the Auki Real World Web ecosystem. This repository includes a Mentra Smart Glasses client application.
Oneshot Copilot guides users through multi-step procedures by analyzing video frames in real-time. It uses a VLM (Vision Language Model) service to verify each step's completion before allowing progression to the next step.
> [!IMPORTANT]
> Major Refactor Complete! Oneshot Copilot has been significantly refactored to be more modular. The system now supports multiple procedure sources and VLM providers. One new configuration uses an external Memory Service (private repository) for storing and retrieving procedure instructions. This is just one optional procedure configuration; you can still use local JSON procedures exactly as before.
- Step-by-step procedure guidance with configurable steps
- Real-time frame analysis using VLM integration
- Debounce logic requiring multiple consecutive YES responses
- Frame buffering to handle high-frequency updates
- Single in-flight request per user to prevent overload
- Timeout handling with automatic reset
- Client callbacks for progress updates
- Modular VLM providers — switch between local and cloud VLM services
- Flexible procedure sources — load procedures from local JSON or external Memory Service
Oneshot Copilot supports multiple ways to load procedure definitions:
Procedures are loaded from JSON files in app/data/procedures/. This is the standard approach and requires no external dependencies.
```
# Start a procedure by ID
POST /api/start_procedure?username=alice&procedure_id=pizza_custom@v1

# Or by file path
POST /api/start_procedure?username=alice&procedure_file=app/data/procedures/pizza_custom.json
```

For advanced use cases, procedures can be fetched from an external Memory Service. This is a private repository that provides:
- Centralized procedure storage and versioning
- Dynamic procedure updates without redeployment
- Integration with the Mentra ecosystem
To use Memory Service procedures, prefix the procedure ID with `starting:` in the MentraApp:

```html
<!-- In mainwebview.ejs -->
<div class="procedure-item" data-procedure="starting:detect_office_items">
```

This signals the backend to fetch the procedure from the Memory Service instead of local files.
> [!NOTE]
> The Memory Service integration is optional. All existing local JSON procedures continue to work without any changes. You can mix and match both approaches.
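As a rough illustration of how the `starting:` prefix might be routed to a procedure source, consider the sketch below. The function name and return shape are assumptions for illustration, not the project's actual API:

```python
# Illustrative sketch: route a raw procedure identifier to its source.
# The "starting:" prefix selects the Memory Service; anything else is
# treated as a local JSON procedure.

def resolve_procedure_source(procedure_id: str) -> tuple[str, str]:
    """Return (source, bare_id) for a raw procedure identifier."""
    prefix = "starting:"
    if procedure_id.startswith(prefix):
        # "starting:detect_office_items" -> fetch from the Memory Service
        return ("memory_service", procedure_id[len(prefix):])
    # Plain IDs such as "pizza_custom@v1" -> local JSON under app/data/procedures/
    return ("local_json", procedure_id)
```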
Oneshot Copilot supports multiple Vision Language Model (VLM) providers. Configure the provider in your .env file:
```
# VLM Provider Configuration
# Options: "local", "moondream", "auki_local"
VLM_PROVIDER=auki_local
```

| Provider | Description | Required Config |
|---|---|---|
| `local` | Generic local VLM service | `VLM_URL` |
| `auki_local` | Auki Labs VLM Node (recommended for development) | `VLM_URL` |
| `moondream` | Moondream AI cloud service | `MOONDREAM_API_KEY` |
See CLOUD_VLM.md for detailed Moondream configuration.
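A hedged sketch of how the `VLM_PROVIDER` setting could select a provider at startup. The class and function names here are illustrative assumptions; the project's real provider classes may differ:

```python
# Sketch of provider selection from VLM_PROVIDER; names are illustrative.

class LocalVLMProvider:
    """Generic local VLM service (also covers the Auki VLM Node)."""
    def __init__(self, url: str):
        self.url = url

class MoondreamProvider:
    """Moondream AI cloud service."""
    def __init__(self, api_key: str):
        self.api_key = api_key

def make_vlm_provider(env: dict):
    """Pick a provider based on the VLM_PROVIDER setting."""
    provider = env.get("VLM_PROVIDER", "auki_local")
    if provider in ("local", "auki_local"):
        return LocalVLMProvider(env["VLM_URL"])             # needs VLM_URL
    if provider == "moondream":
        return MoondreamProvider(env["MOONDREAM_API_KEY"])  # needs MOONDREAM_API_KEY
    raise ValueError(f"Unknown VLM_PROVIDER: {provider!r}")
```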
```
Client → /ingest (frames) → State Machine → VLM Service
                                  ↓
Client ← Progress Updates ← State Machine ← /vlm/callback
```
- Python 3.8+ with pip
- FastAPI and dependencies (see `requirements.txt`)
- VLM Service (Vision Language Model): choose one option:
  - Local Auki VLM Node (recommended for development):
    - Repository: Auki Labs VLM Node
    - Requires Docker with GPU support
    - Must run on port 8080
  - Cloud VLM: Moondream AI account with API key
- Mentra Account with:
  - API key
  - App domain (`app.com.domain`)
  - Setup tutorial: MentraOS Extended Example
For automatic frame ingestion from video streams:
- RTSP/RTMP Server (local or remote)
- For local testing, you can use MediaMTX:

  ```bash
  docker pull bluenviron/mediamtx
  docker run --rm -it -p 8554:8554 -p 1935:1935 bluenviron/mediamtx
  ```

- Configure `RTSP_STREAM_URL` in `.env` to enable
- Clone the repository:

  ```bash
  cd oneshot_copilot
  ```

- Create a Python virtual environment in the root folder and activate it:

  ```bash
  python -m venv venv
  venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment:

  ```bash
  cp .env.example .env
  # Edit .env with your actual URLs
  ```

This section provides instructions for building and running Oneshot Copilot using Docker. The Docker container includes both the FastAPI backend and the MentraApp frontend.
- Docker installed on your system
- Configured environment files:
  - `.env` in the root directory (copy from `.env.example`)
  - `MentraApp/.env` in the MentraApp directory (copy from `MentraApp/.env.example`)
- Optional GPU support: If using GPU-accelerated features (e.g., for VLM processing), ensure Docker has GPU access configured
1. Clone the repository and navigate to the project directory:

   ```bash
   git clone <repository-url>
   cd oneshot-copilot
   ```

2. Build the Docker image:

   ```bash
   docker build -t oneshot-copilot .
   ```

   This will create a Docker image that includes:

   - Python 3.11 environment with all dependencies
   - Bun runtime for the MentraApp
   - Both Oneshot Copilot and MentraApp services

3. Run the container with port mappings:

   ```bash
   docker run -p 8000:8000 -p 3000:3000 oneshot-copilot
   ```

   Port mappings:

   - `8000`: Oneshot Copilot FastAPI server
   - `3000`: MentraApp frontend server

4. Optional: mount environment files (if you want to modify them without rebuilding):

   ```bash
   docker run -p 8000:8000 -p 3000:3000 \
     -v $(pwd)/.env:/app/.env \
     -v $(pwd)/MentraApp/.env:/app/MentraApp/.env \
     oneshot-copilot
   ```

5. Optional: run in detached mode:

   ```bash
   docker run -d -p 8000:8000 -p 3000:3000 --name oneshot-container oneshot-copilot
   ```
Once the container is running:
- **Oneshot Copilot API**: available at `http://localhost:8000`
  - API documentation: `http://localhost:8000/docs`
  - Health check: `http://localhost:8000/health`
- **MentraApp Frontend**: available at `http://localhost:3000`
  - Main application interface
  - User authentication and procedure management
The container expects the following environment variables to be configured in the respective `.env` files:

Root `.env`:

- `SELF_URL`: Your server's public URL
- `MENTRA_URL`: Mentra webhook URL
- `VLM_URL`: VLM service URL
- `USE_CLOUD_VLM`: Whether to use cloud VLM (true/false)
- `MOONDREAM_API_KEY`: API key for Moondream AI (if using cloud VLM)
- Other configuration options as documented in the Configuration section

`MentraApp/.env`:

- `MENTRAOS_API_KEY`: Mentra OS API key
- `PORT`: Port for the Mentra app (default: 3000)
- `PACKAGE_NAME`: Application package identifier
- `RTMP_URL`: RTMP stream URL (optional)
- Container fails to start: ensure `.env` and `MentraApp/.env` files exist and are properly configured
- Port conflicts: if ports 8000 or 3000 are already in use, modify the port mappings (e.g., `-p 8001:8000`)
- Permission issues: ensure Docker has access to the project directory
- VLM connection issues: verify `VLM_URL` is accessible from within the container
- Logs: view container logs with `docker logs <container-name>` for debugging
- GPU support: if using GPU features, ensure the NVIDIA Docker runtime is configured (the `--gpus all` flag may be needed)
- The container runs both services simultaneously using a startup script
- For development, consider mounting source code volumes for live reloading
- Ensure your `.env` files contain all required configuration before building
- The container uses a Python virtual environment and Bun for optimal performance
Edit the `.env` file with your settings:

```
SELF_URL=https://your-server.ngrok-free.app
MENTRA_URL=https://client-webhook.example.com
VLM_URL=https://vlm-service.ngrok.app
MAX_FRAMES_PER_USER=10

# Optional: RTSP/RTMP Stream Auto-Ingestion
RTSP_STREAM_URL=rtsp://192.168.1.100:8554/live/stream
STREAM_USERNAME=stream_user
```

Oneshot Copilot can automatically ingest frames from an RTSP or RTMP stream. When configured, it will:
- Connect to the stream on application startup
- Filter frames by quality (blur and brightness)
- Automatically POST quality frames to the `/ingest` endpoint
- Send approximately 1 frame per second

To enable:

- Set `RTSP_STREAM_URL` to your camera/stream URL
- Set `STREAM_USERNAME` to identify frames from this stream
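The blur/brightness gate can be pictured with a toy filter like the one below. The thresholds and the Laplacian-variance blur metric are assumptions for illustration; the actual `stream_quality_filter.py` may use different metrics and values:

```python
# Toy quality gate: reject frames that are too dark or too blurry before
# POSTing them to /ingest. Frames are plain lists of pixel rows (0-255).
# Thresholds and metrics here are illustrative assumptions.

def mean_brightness(gray):
    """Mean pixel value of a grayscale frame (list of rows)."""
    return sum(sum(row) for row in gray) / (len(gray) * len(gray[0]))

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; low values suggest blur."""
    vals = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            vals.append(-4 * gray[y][x] + gray[y - 1][x] + gray[y + 1][x]
                        + gray[y][x - 1] + gray[y][x + 1])
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def frame_passes_quality(gray, min_brightness=40.0, min_sharpness=100.0):
    if mean_brightness(gray) < min_brightness:
        return False  # too dark
    return laplacian_variance(gray) >= min_sharpness  # low variance = blurry
```

In production such a filter would typically operate on decoded frames (e.g., OpenCV arrays) rather than nested lists, but the decision logic is the same.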
For local development, you can use the Auki Labs VLM Node which provides vision language model capabilities:
- Docker with GPU support (NVIDIA GPU recommended)
- NVIDIA Container Toolkit installed
- At least 8GB GPU VRAM
1. Clone the Auki VLM Node repository:

   ```bash
   git clone https://github.com/aukilabs/vlm-node.git
   cd vlm-node
   ```

2. Build and run with GPU support:

   ```bash
   make docker-gpu
   ```

3. Verify the service is running: the VLM Node starts on port 8080 by default. You can check it with:

   ```bash
   curl http://localhost:8080/health
   ```

4. Configure Oneshot Copilot to use the local VLM. In your `.env` file, set:

   ```
   VLM_URL=http://localhost:8080
   USE_CLOUD_VLM=false
   ```
Note: The Auki VLM Node must be running before starting the Oneshot Copilot server.
Oneshot Copilot supports both local and cloud-based VLM services. By default, it uses a local VLM service, but you can switch to Moondream AI cloud VLM for production deployments.
To use Moondream AI cloud VLM:
- Sign up at moondream.ai and get your API key
- Set `USE_CLOUD_VLM=true` in `.env`
- Set `MOONDREAM_API_KEY=your_key` in `.env`
- See CLOUD_VLM.md for detailed documentation
Configuration example:
```
USE_CLOUD_VLM=true
MOONDREAM_API_KEY=your_moondream_api_key_here
```

**Important note:** Moondream AI cloud does not support negative questions. When using cloud VLM, only positive questions from procedures are sent; negative questions are ignored (see CLOUD_VLM.md for details). Moondream AI cloud is also rate limited.
The Mentra app is a separate application that provides the frontend interface and handles user authentication for Oneshot Copilot. It's built using Express.js and the Mentra SDK.
```
MentraApp/
├── src/
│   ├── index.ts                  # Main application entry point
│   ├── tools.ts                  # Mentra SDK tools integration
│   ├── webview.ts                # Web view management
│   └── services/
│       ├── AudioFeedback.ts      # Audio feedback service
│       └── UserMetadataService.ts  # User metadata handling
├── views/
│   ├── login_view.ejs            # Login page template
│   ├── mainwebview.ejs           # Main interface template
│   └── taskview.ejs              # Task view template
├── public/
│   └── css/
│       └── style.css             # Application styles
├── .env                          # Environment configuration (create from .env.example)
├── .env.example                  # Example environment configuration
├── package.json                  # Node.js dependencies
└── tsconfig.json                 # TypeScript configuration
```
Create a `.env` file in the MentraApp directory based on `.env.example`:
```
# Mentra OS API Key (required)
# Get this from your Mentra developer account at https://mentra.com
MENTRAOS_API_KEY=your_mentra_api_key_here

# Port for the Mentra app (default: 3000)
PORT=3000

# Package name for your Mentra application
# This should match your app registration in Mentra OS
PACKAGE_NAME=com.yourcompany.oneshotcopilot

# RTMP stream URL (optional)
# URL to the RTMP server for streaming video frames
# Must match the RTSP_STREAM_URL configuration in the main app
RTMP_URL=rtmp://192.168.1.100:1935/live/oneshot
```

- `MENTRAOS_API_KEY` (Required): Your Mentra OS API key obtained from the Mentra developer portal
- `PORT` (Optional): The port on which the Mentra app will run (default: 3000)
- `PACKAGE_NAME` (Required): Your application's package identifier in Mentra OS (e.g., `com.yourcompany.oneshotcopilot`)
- `RTMP_URL` (Optional): The RTMP stream URL for video ingestion; should correspond to the RTSP stream configured in the main application
The Mentra app requires:
- Node.js 18 or higher (up to Node.js 22)
- Bun runtime (recommended) or npm
- Mentra SDK (`@mentra/sdk`)
- Express.js for the web server
- EJS for templating
If using the local Auki VLM Node (recommended for development):
1. Navigate to the VLM Node directory:

   ```bash
   cd vlm-node
   ```

2. Start the VLM service with GPU support:

   ```bash
   make docker-gpu
   ```

3. Verify it's running on port 8080:

   ```bash
   curl http://localhost:8080/health
   ```
Note: If using Moondream AI cloud VLM instead, skip this step and ensure USE_CLOUD_VLM=true in your .env file.
The Mentra app handles user authentication and provides the frontend interface. It must be started before the main server.
1. Setup Mentra following the MentraOS Extended Example tutorial

2. Configure environment: get API keys and configure `.env` in the `MentraApp` directory:

   ```bash
   cd MentraApp
   cp .env.example .env
   # Edit .env with your Mentra credentials
   ```

3. Install dependencies and start:

   ```bash
   bun update
   bun run dev
   ```

4. Expose with Ngrok: follow the tutorial to expose your local port with Ngrok
After Mentra is running, start the main FastAPI server:
```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

The server will start at http://localhost:8000.

Make sure to expose the server with Ngrok.

Note: For development with auto-reload, you can use:

```bash
uvicorn app.main:app --reload
```

To run a local MediaMTX server for stream testing:

```bash
docker pull bluenviron/mediamtx
docker run --rm -p 1935:1935 -p 8554:8554 -p 8888:8888 bluenviron/mediamtx
```
Link to the Repo: https://github.com/bluenviron/mediamtx
- **POST /start_procedure** - Start a new procedure for a user
  - Parameters: `username`, `procedure_file` (optional)
  - Returns: `{ok: true, procedure_id: "..."}`
- **GET /status** - Get user's current status
  - Parameters: `username`
  - Returns: State object with current step info
- **POST /pause** - Pause active procedure
  - Parameters: `username`
  - Returns: `{ok: true}`
- **POST /resume** - Resume paused procedure
  - Parameters: `username`
  - Returns: `{ok: true}`
- **POST /abort** - Abort active procedure
  - Parameters: `username`
  - Returns: `{ok: true}`
- **POST /trigger_stream_reconnect** - Trigger immediate RTSP stream reconnection
  - No parameters required
  - Returns: `{ok: true, message: "Stream reconnection triggered"}`
- **POST /ingest** - Upload a frame for analysis
  - Form data: `username`, `frame_id`, `file` (image)
  - Returns: `{ok: true, queued: true}`
- **POST /vlm/callback** - Receive VLM analysis results (called by the VLM service)
  - Query params: `user`, `procedure_id`, `step_id`, `frame_id`, `idem`
  - Body: `{decision: "YES"|"NO"|"UNCERTAIN"|"NOT_APPLICABLE"}`
  - Returns: `{ok: true}`
Disclaimer: multi-user operation has not been tested yet; so far, usernames have been hardcoded.
```bash
curl -X POST "http://localhost:8000/start_procedure?username=john&procedure_file=app/data/procedures/pizza_custom.json"
```

```bash
curl -X POST "http://localhost:8000/ingest" \
  -F "username=john" \
  -F "frame_id=frame_001" \
  -F "file=@image.jpg"
```

```bash
curl "http://localhost:8000/status?username=john"
```

Procedures are defined in JSON files in `app/data/procedures/`:
```json
{
  "id": "pizza_custom@v1",
  "name": "Make a Custom Pizza",
  "version": 1,
  "steps": [
    {
      "id": 1,
      "name": "Show pizza dough",
      "positives": ["A plain pizza base or dough is clearly visible..."],
      "negatives": ["No pizza base or dough is visible..."],
      "timeout_s": 20,
      "debounce": {
        "consecutive_yes": 2
      }
    }
  ]
}
```

- Frame buffering: Up to 10 frames per user (configurable)
- Single in-flight: Only one VLM request active per user at a time
- YES-only progression: Only YES decisions increment the counter
- Debounce: Requires 2 consecutive YES responses by default
- Timeout reset: Timeout resets the YES counter but keeps the same step
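The progression rules above can be condensed into a small sketch. This is illustrative only; the real logic lives in `app/core/statemachine.py` and also handles buffering, pause/resume, and callbacks:

```python
# Minimal sketch of the YES-debounce progression described above.
# Class and method names are illustrative, not the project's actual API.

class StepTracker:
    def __init__(self, total_steps: int, consecutive_yes: int = 2):
        self.total_steps = total_steps
        self.required = consecutive_yes
        self.step = 1          # current step (1-based)
        self.yes_count = 0     # consecutive YES responses so far

    def on_decision(self, decision: str) -> bool:
        """Feed one VLM decision; return True when the procedure finishes."""
        if decision == "YES":
            self.yes_count += 1
            if self.yes_count >= self.required:
                self.yes_count = 0
                if self.step >= self.total_steps:
                    return True        # final step verified
                self.step += 1         # advance to the next step
        else:
            # NO / UNCERTAIN / NOT_APPLICABLE break the consecutive run
            self.yes_count = 0
        return False

    def on_timeout(self) -> None:
        # Timeout resets the YES counter but keeps the same step
        self.yes_count = 0
```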
```
oneshot_copilot/
├── app/
│   ├── main.py                      # FastAPI application
│   ├── config.py                    # Configuration settings
│   ├── models/
│   │   ├── state.py                 # UserState, Decision enums
│   │   └── procedure.py             # ProcedureDef, StepDef
│   ├── core/
│   │   ├── statemachine.py          # State machine logic
│   │   ├── vlm_client.py            # VLM HTTP client
│   │   ├── callbacks.py             # Event callbacks
│   │   └── frame_store.py           # Frame storage
│   ├── services/
│   │   ├── stream_quality_filter.py # RTSP stream ingestion
│   │   └── status_service.py        # Status tracking
│   └── api/
│       ├── procedure.py             # Procedure endpoints
│       ├── ingest.py                # Frame ingestion
│       └── vlm_callback.py          # VLM callback handler
└── app/data/
    └── procedures/
        └── pizza_custom.json        # Example procedure
```
- Mika Haak "@augmentedcamel"
MIT