NVIDIA NIM vllm Benchmark with Webui (staggeredsix/nvidia_nim_webui)
# NIM Benchmark Tool Programming Guide

## Project Structure

```
├── app/
│   ├── api/             # FastAPI routes and endpoints
│   ├── services/        # Business logic
│   └── utils/           # Utility functions
├── frontend/
│   └── src/
│       ├── components/  # React components
│       ├── hooks/       # Custom React hooks
│       ├── routes/      # Page components
│       ├── services/    # API clients
│       └── types/       # TypeScript type definitions
└── requirements.txt     # Python dependencies
```
## Key Components
### Backend (Python/FastAPI)
1. `benchmark_service.py`: Manages benchmark execution
- Handles NIM container lifecycle
- Executes benchmarks
- Records metrics
2. `metrics.py`: GPU metrics collection
- Real-time GPU monitoring
- System metrics tracking
- Historical data collection
3. `container.py`: NIM container management
- Container lifecycle (start/stop)
- Health checks
- Port management
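The port-management piece can be sketched as a small helper that asks the OS for an unused port before launching a container; the function name is illustrative, not part of the repository's API.

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port to map the NIM container onto."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 tells the kernel to pick any free port
        return s.getsockname()[1]
```

The returned port can then be passed to the container runtime's port mapping and recorded in `container_info` for later health checks.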
## Multi-provider DGX Spark workflow
The benchmark harness now supports testing external inference providers in addition to NIM containers so that you can validate early weights on DGX Spark without depending on NIM availability.
Supported provider targets:
- **llama.cpp** (server or local HTTP binding)
- **Ollama**
- **sgLang**
- **vLLM**
### How to run benchmarks
1. Launch your provider on DGX Spark with an HTTP endpoint (for example `http://localhost:8000`).
2. POST to `/benchmark` with the provider fields:
- `provider`: one of `llama.cpp`, `ollama`, `sglang`, `vllm`, or `nim`.
- `endpoint`: full base URL for the provider (defaults to `http://localhost:8000`).
- `model_name`: provider-visible model identifier.
- `quantization`: `default` or `nvfp4` (add others if needed).
- `stream`: `true` to capture first-token and inter-token latency.
3. To exercise the requested workflow, run two passes: the provider’s default precision and a quantized `nvfp4` pass. Capture any issues in both runs; other quantizations can be added by repeating with a different `quantization` value.
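The two-pass flow above can be sketched in Python. The field names follow the `/benchmark` schema described here; `build_benchmark_request` itself is an illustrative helper, not part of the tool.

```python
from typing import Dict

def build_benchmark_request(provider: str, model_name: str,
                            quantization: str = "default",
                            endpoint: str = "http://localhost:8000",
                            stream: bool = True) -> Dict:
    """Assemble the JSON body for a POST to /benchmark."""
    assert provider in {"llama.cpp", "ollama", "sglang", "vllm", "nim"}
    return {
        "provider": provider,
        "endpoint": endpoint,
        "model_name": model_name,
        "quantization": quantization,
        "stream": stream,  # True captures first-token / inter-token latency
    }

# Two passes: the provider's default precision, then a quantized nvfp4 run.
passes = [build_benchmark_request("vllm", "my-model", q)
          for q in ("default", "nvfp4")]
```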
### Using the WebUI for provider coverage
- The **Run a Benchmark** page now lets you mix NIM containers and external providers (llama.cpp, Ollama, sgLang, and vLLM) in the same run list.
- For each target you can set the endpoint, model name, quantization (`default` or `nvfp4`), optional expected output for quick accuracy checks, and per-target prompts.
- Streaming can be toggled per target to capture time-to-first-token and inter-token latency.
### Preparing NGC CLI models for non-NIM backends
- In the Benchmark page you can enter an NGC model spec (for example `nvidia/llama2_70b:1.0`) plus a friendly model name to auto-download via `ngc`.
- The tool creates a dedicated directory under `models/<model_name>` with subfolders for llama.cpp, Ollama, sgLang, and vLLM, wiring symlinks to the downloaded payload and emitting example launch commands per backend.
- This path is meant for non-NIM runners only (NIM will not host these NVIDIA CLI downloads).
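The per-backend directory wiring can be sketched as below; the `weights` symlink name and the function itself are assumptions for illustration, not the tool's actual layout code.

```python
import os

BACKENDS = ("llama.cpp", "ollama", "sglang", "vllm")

def prepare_model_dirs(models_root: str, model_name: str, payload_dir: str) -> str:
    """Create models/<model_name>/<backend>/ subfolders, each with a
    symlink pointing at the NGC-downloaded payload."""
    model_dir = os.path.join(models_root, model_name)
    for backend in BACKENDS:
        backend_dir = os.path.join(model_dir, backend)
        os.makedirs(backend_dir, exist_ok=True)
        link = os.path.join(backend_dir, "weights")  # illustrative link name
        if not os.path.islink(link):
            os.symlink(os.path.abspath(payload_dir), link)
    return model_dir
```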
### Metrics captured per run
- Time to first token (prefill latency)
- Inter-token latency
- Total completion time
- Average latency and p95 latency
- Tokens/sec and peak TPS
- Tool-call latency and accuracy (when `tool_calls` are present)
- Accuracy against an expected output when `expected_output` is provided
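Given per-token arrival timestamps from a streaming run, the latency metrics above can be derived as in this sketch (function names are illustrative):

```python
from typing import Dict, List

def streaming_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Derive TTFT, mean inter-token latency, total completion time,
    and tokens/sec from per-token arrival timestamps (seconds)."""
    ttft = token_times[0] - request_start      # time to first token (prefill)
    total = token_times[-1] - request_start    # total completion time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft_s": ttft,
        "inter_token_latency_s": itl,
        "total_s": total,
        "tokens_per_sec": len(token_times) / total if total else 0.0,
    }

def p95(latencies: List[float]) -> float:
    """p95 latency by nearest-rank on a sorted copy."""
    ordered = sorted(latencies)
    return ordered[max(0, int(round(0.95 * len(ordered))) - 1)]
```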
### Frontend (React/TypeScript)
1. `TelemetryComponents.tsx`: Real-time metrics display
- GPU utilization
- Memory usage
- Power consumption
2. `BenchmarkHistory.tsx`: Historical data viewing
- Past benchmark results
- Performance metrics
- Export functionality
## Common Tasks
### Adding New Metrics
1. Update `metrics.py`:
   ```python
   def get_gpu_metrics(self, gpu_index: int) -> Dict:
       # Add new metric to collection
       ...
   ```
2. Update the TypeScript types in `types/metrics.ts`:
   ```typescript
   interface MetricsData {
     // Add new metric type
   }
   ```
3. Update the display component in `TelemetryComponents.tsx`.
### Adding New Features
1. Backend route:
   - Add an endpoint in the appropriate file under `app/api/`
   - Implement the service logic in `app/services/`
2. Frontend integration:
   - Add an API client method in `services/api.ts`
   - Create or update the React component
   - Add it to the relevant route
## Known Issues & Fixes
1. WebSocket connection:
   - Use a dynamic hostname
   - Implement reconnection logic
   - Add error handling
2. Metrics collection:
   - Ensure consistent units (MB vs GB)
   - Add proper error handling
   - Implement retry logic
3. Container management:
   - Add timeout handling
   - Implement health checks
   - Add cleanup logic
## Testing
Backend:
```bash
pytest tests/
```
Frontend:
```bash
npm test
```
## Common Errors
1. "model_name not found":
   - Check the `container_info` structure
   - Verify the NIM container status
2. WebSocket disconnections:
   - Check network connectivity
   - Verify port accessibility
   - Check server logs
3. Missing metrics:
   - Verify `nvidia-smi` access
   - Check GPU permissions
   - Verify the CUDA installation
## Bug Fixes
### Critical Fixes
1. WebSocket Connection:
```typescript
const WS_BASE = `ws://${window.location.hostname}:7000`;
const ws = new WebSocket(`${WS_BASE}/ws/metrics`);
```
2. Metrics Collection:
```python
# Fix nvidia-smi command
result = subprocess.run([
'nvidia-smi',
f'--id={gpu_index}',
'--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw',
'--format=csv,nounits'
], capture_output=True, text=True, timeout=5)
```
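Parsing the `csv,nounits` output from the command above can be sketched like this; the field order matches the `--query-gpu` list in the fix, and the function name is illustrative (skip the CSV header row before calling it):

```python
from typing import Dict

def parse_gpu_csv(line: str) -> Dict[str, float]:
    """Parse one data row of `nvidia-smi --format=csv,nounits` output.
    Field order matches the --query-gpu flags:
    utilization.gpu, memory.used, memory.total, temperature.gpu, power.draw."""
    util, mem_used, mem_total, temp, power = (v.strip() for v in line.split(","))
    return {
        "gpu_utilization": float(util),    # percent
        "memory_used": float(mem_used),    # MiB (nounits drops the suffix)
        "memory_total": float(mem_total),  # MiB
        "temperature": float(temp),        # degrees C
        "power_draw": float(power),        # watts
    }
```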
3. Container Port:
```python
# Use container_info port instead of hardcoded
port = container_info['port']
endpoint = f"http://localhost:{port}/v1/completions"
```
### Performance Improvements
1. Frontend State Management:
```typescript
useEffect(() => {
if (metrics?.gpu_metrics) {
setGpuData(metrics.gpu_metrics.map((metric, index) => ({
name: `GPU ${index}`,
utilization: metric.gpu_utilization,
power: metric.power_draw
})));
}
}, [metrics?.gpu_metrics]); // Only update when GPU metrics change
```
2. Backend Memory Usage:
```python
# Limit historical metrics storage
if len(self.historical_metrics) > 100:
self.historical_metrics = self.historical_metrics[-100:]
```
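An equivalent cap that avoids re-slicing the list on every append is a bounded `collections.deque`; this is a sketch of the alternative, assuming the same 100-sample limit, not the code the repository uses.

```python
from collections import deque

# maxlen evicts the oldest sample automatically once 100 entries exist
historical_metrics = deque(maxlen=100)

for sample in range(150):
    historical_metrics.append(sample)
```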
### Error Handling
1. Container State:
```python
if not container_info:
raise Exception("Failed to start NIM container")
```
2. WebSocket Reconnection:
```typescript
const reconnectTimeoutRef = useRef<number>();
// Exponential backoff retry
if (retryCount < MAX_RETRIES) {
reconnectTimeoutRef.current = window.setTimeout(() => {
connect();
}, RETRY_DELAY * Math.pow(2, retryCount));
}
```