NVIDIA NIM vllm Benchmark with Webui (staggeredsix/nvidia_nim_webui)
# NIM Benchmark Tool Programming Guide

## Project Structure

```
├── app/
│   ├── api/             # FastAPI routes and endpoints
│   ├── services/        # Business logic
│   └── utils/           # Utility functions
├── frontend/
│   └── src/
│       ├── components/  # React components
│       ├── hooks/       # Custom React hooks
│       ├── routes/      # Page components
│       ├── services/    # API clients
│       └── types/       # TypeScript type definitions
└── requirements.txt     # Python dependencies
```
## Key Components
### Backend (Python/FastAPI)
1. `benchmark_service.py`: Manages benchmark execution
- Handles NIM container lifecycle
- Executes benchmarks
- Records metrics
2. `metrics.py`: GPU metrics collection
- Real-time GPU monitoring
- System metrics tracking
- Historical data collection
3. `container.py`: NIM container management
- Container lifecycle (start/stop)
- Health checks
- Port management
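The port-management piece can be sketched as a small helper that asks the OS for an unused port before launching a container; the function name is illustrative, not part of the repository's API.

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port to map the NIM container onto."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 tells the kernel to pick any free port
        return s.getsockname()[1]
```

The returned port can then be passed to the container runtime's port mapping and recorded in `container_info` for later health checks.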
## Multi-provider DGX Spark workflow
The benchmark harness now supports testing external inference providers in addition to NIM containers so that you can validate early weights on DGX Spark without depending on NIM availability.
Supported provider targets:
- **llama.cpp** (server or local HTTP binding)
- **Ollama**
- **sgLang**
- **vLLM**
### How to run benchmarks
1. Launch your provider on DGX Spark with an HTTP endpoint (for example `http://localhost:8000`).
2. POST to `/benchmark` with the provider fields:
- `provider`: one of `llama.cpp`, `ollama`, `sglang`, `vllm`, or `nim`.
- `endpoint`: full base URL for the provider (defaults to `http://localhost:8000`).
- `model_name`: provider-visible model identifier.
- `quantization`: `default` or `nvfp4` (add others if needed).
- `stream`: `true` to capture first-token and inter-token latency.
3. To exercise the requested workflow, run two passes: the provider’s default precision and a quantized `nvfp4` pass. Capture any issues in both runs; other quantizations can be added by repeating with a different `quantization` value.
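The two-pass flow above can be sketched in Python. The field names follow the `/benchmark` schema described here; `build_benchmark_request` itself is an illustrative helper, not part of the tool.

```python
from typing import Dict

def build_benchmark_request(provider: str, model_name: str,
                            quantization: str = "default",
                            endpoint: str = "http://localhost:8000",
                            stream: bool = True) -> Dict:
    """Assemble the JSON body for a POST to /benchmark."""
    assert provider in {"llama.cpp", "ollama", "sglang", "vllm", "nim"}
    return {
        "provider": provider,
        "endpoint": endpoint,
        "model_name": model_name,
        "quantization": quantization,
        "stream": stream,  # True captures first-token / inter-token latency
    }

# Two passes: the provider's default precision, then a quantized nvfp4 run.
passes = [build_benchmark_request("vllm", "my-model", q)
          for q in ("default", "nvfp4")]
```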
### Using the WebUI for provider coverage
- The **Run a Benchmark** page now lets you mix NIM containers and external providers (llama.cpp, Ollama, sgLang, and vLLM) in the same run list.
- For each target you can set the endpoint, model name, quantization (`default` or `nvfp4`), optional expected output for quick accuracy checks, and per-target prompts.
- Streaming can be toggled per target to capture time-to-first-token and inter-token latency.
### Preparing NGC CLI models for non-NIM backends
- In the Benchmark page you can enter an NGC model spec (for example `nvidia/llama2_70b:1.0`) plus a friendly model name to auto-download via `ngc`.
- The tool creates a dedicated directory under `models/<model_name>` with subfolders for llama.cpp, Ollama, sgLang, and vLLM, wiring symlinks to the downloaded payload and emitting example launch commands per backend.
- This path is meant for non-NIM runners only (NIM will not host these NVIDIA CLI downloads).
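The per-backend directory wiring can be sketched as below; the `weights` symlink name and the function itself are assumptions for illustration, not the tool's actual layout code.

```python
import os

BACKENDS = ("llama.cpp", "ollama", "sglang", "vllm")

def prepare_model_dirs(models_root: str, model_name: str, payload_dir: str) -> str:
    """Create models/<model_name>/<backend>/ subfolders, each with a
    symlink pointing at the NGC-downloaded payload."""
    model_dir = os.path.join(models_root, model_name)
    for backend in BACKENDS:
        backend_dir = os.path.join(model_dir, backend)
        os.makedirs(backend_dir, exist_ok=True)
        link = os.path.join(backend_dir, "weights")  # illustrative link name
        if not os.path.islink(link):
            os.symlink(os.path.abspath(payload_dir), link)
    return model_dir
```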
### Metrics captured per run
- Time to first token (prefill latency)
- Inter-token latency
- Total completion time
- Average latency and p95 latency
- Tokens/sec and peak TPS
- Tool-call latency and accuracy (when `tool_calls` are present)
- Accuracy against an expected output when `expected_output` is provided
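Given per-token arrival timestamps from a streaming run, the latency metrics above can be derived as in this sketch (function names are illustrative):

```python
from typing import Dict, List

def streaming_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Derive TTFT, mean inter-token latency, total completion time,
    and tokens/sec from per-token arrival timestamps (seconds)."""
    ttft = token_times[0] - request_start      # time to first token (prefill)
    total = token_times[-1] - request_start    # total completion time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft_s": ttft,
        "inter_token_latency_s": itl,
        "total_s": total,
        "tokens_per_sec": len(token_times) / total if total else 0.0,
    }

def p95(latencies: List[float]) -> float:
    """p95 latency by nearest-rank on a sorted copy."""
    ordered = sorted(latencies)
    return ordered[max(0, int(round(0.95 * len(ordered))) - 1)]
```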
### Frontend (React/TypeScript)
1. `TelemetryComponents.tsx`: Real-time metrics display
- GPU utilization
- Memory usage
- Power consumption
2. `BenchmarkHistory.tsx`: Historical data viewing
- Past benchmark results
- Performance metrics
- Export functionality
## Common Tasks
### Adding New Metrics
1. Update `metrics.py`:
   ```python
   def get_gpu_metrics(self, gpu_index: int) -> Dict:
       # Add new metric to collection
       ...
   ```
2. Update the TypeScript types in `types/metrics.ts`:
   ```typescript
   interface MetricsData {
     // Add new metric type
   }
   ```
3. Update the display component in `TelemetryComponents.tsx`.
### Adding New Features
1. Backend route:
   - Add an endpoint in the appropriate file under `app/api/`
   - Implement the service logic in `app/services/`
2. Frontend integration:
   - Add an API client method in `services/api.ts`
   - Create or update the React component
   - Add it to the relevant route
## Known Issues & Fixes
1. WebSocket connection:
   - Use a dynamic hostname
   - Implement reconnection logic
   - Add error handling
2. Metrics collection:
   - Ensure consistent units (MB vs GB)
   - Add proper error handling
   - Implement retry logic
3. Container management:
   - Add timeout handling
   - Implement health checks
   - Add cleanup logic
## Testing
Backend:
```bash
pytest tests/
```
Frontend:
```bash
npm test
```
## Common Errors
1. "model_name not found":
   - Check the `container_info` structure
   - Verify the NIM container status
2. WebSocket disconnections:
   - Check network connectivity
   - Verify port accessibility
   - Check server logs
3. Missing metrics:
   - Verify `nvidia-smi` access
   - Check GPU permissions
   - Verify the CUDA installation
## Bug Fixes
### Critical Fixes
1. WebSocket Connection:
```typescript
const WS_BASE = `ws://${window.location.hostname}:7000`;
const ws = new WebSocket(`${WS_BASE}/ws/metrics`);
```
2. Metrics Collection:
```python
# Fix nvidia-smi command
result = subprocess.run([
'nvidia-smi',
f'--id={gpu_index}',
'--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw',
'--format=csv,nounits'
], capture_output=True, text=True, timeout=5)
```
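Parsing the `csv,nounits` output from the command above can be sketched like this; the field order matches the `--query-gpu` list in the fix, and the function name is illustrative (skip the CSV header row before calling it):

```python
from typing import Dict

def parse_gpu_csv(line: str) -> Dict[str, float]:
    """Parse one data row of `nvidia-smi --format=csv,nounits` output.
    Field order matches the --query-gpu flags:
    utilization.gpu, memory.used, memory.total, temperature.gpu, power.draw."""
    util, mem_used, mem_total, temp, power = (v.strip() for v in line.split(","))
    return {
        "gpu_utilization": float(util),    # percent
        "memory_used": float(mem_used),    # MiB (nounits drops the suffix)
        "memory_total": float(mem_total),  # MiB
        "temperature": float(temp),        # degrees C
        "power_draw": float(power),        # watts
    }
```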
3. Container Port:
```python
# Use container_info port instead of hardcoded
port = container_info['port']
endpoint = f"http://localhost:{port}/v1/completions"
```
### Performance Improvements
1. Frontend State Management:
```typescript
useEffect(() => {
if (metrics?.gpu_metrics) {
setGpuData(metrics.gpu_metrics.map((metric, index) => ({
name: `GPU ${index}`,
utilization: metric.gpu_utilization,
power: metric.power_draw
})));
}
}, [metrics?.gpu_metrics]); // Only update when GPU metrics change
```
2. Backend Memory Usage:
```python
# Limit historical metrics storage
if len(self.historical_metrics) > 100:
self.historical_metrics = self.historical_metrics[-100:]
```
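An equivalent cap that avoids re-slicing the list on every append is a bounded `collections.deque`; this is a sketch of the alternative, assuming the same 100-sample limit, not the code the repository uses.

```python
from collections import deque

# maxlen evicts the oldest sample automatically once 100 entries exist
historical_metrics = deque(maxlen=100)

for sample in range(150):
    historical_metrics.append(sample)
```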
### Error Handling
1. Container State:
```python
if not container_info:
raise Exception("Failed to start NIM container")
```
2. WebSocket Reconnection:
```typescript
const reconnectTimeoutRef = useRef<number>();
// Exponential backoff retry
if (retryCount < MAX_RETRIES) {
reconnectTimeoutRef.current = window.setTimeout(() => {
connect();
}, RETRY_DELAY * Math.pow(2, retryCount));
}
```