233 changes: 233 additions & 0 deletions docs/tutorials/ollama-endpoint.md
@@ -0,0 +1,233 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Ollama Generate Endpoint

The `ollama_generate` endpoint type benchmarks [Ollama](https://ollama.com/) models through Ollama's `/api/generate` API. It supports both streaming and non-streaming text generation and passes Ollama-specific request options through via `--extra-inputs`.

## When to Use

Use the `ollama_generate` endpoint when:
- You are benchmarking models running on Ollama
- You need access to Ollama-specific features like system prompts, JSON formatting, or raw mode
- You want to test Ollama's streaming capabilities

## Basic Example

Benchmark an Ollama model with default settings:

```bash
aiperf profile \
--model llama2 \
--url http://localhost:11434 \
--endpoint-type ollama_generate \
--synthetic-input-tokens-mean 100 \
--output-tokens-mean 50 \
--concurrency 4 \
--request-count 20
```

## Configuration

Configure the endpoint using `--extra-inputs` for Ollama-specific options:

### Top-Level Parameters

- **`system`**: System prompt to guide model behavior
- **`format`**: Output format (`"json"` or a JSON schema)
- **`raw`**: Skip prompt templating (boolean)
- **`keep_alive`**: Model persistence duration (e.g., `"5m"`, `"1h"`)
- **`images`**: List of base64-encoded images for vision models

### Model Options

Pass model parameters using the `options` object (see the payload sketch after this list):

- **`temperature`**: Sampling temperature (0.0-2.0)
- **`top_p`**: Nucleus sampling threshold
- **`top_k`**: Top-k sampling limit
- **`seed`**: Random seed for reproducibility
- **`num_ctx`**: Context window size
- **`stop`**: Stop sequences
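
Both the top-level parameters and the `options` object are merged into the JSON body sent to `/api/generate`. The sketch below shows roughly what that body looks like; the values are illustrative, and `num_predict` is only included when a maximum output token count is configured for the request.

```python
# Illustrative /api/generate request body. Top-level extra inputs (system,
# format, keep_alive, ...) are merged into the payload directly, while the
# "options" object is merged into payload["options"]. Values are examples.
payload = {
    "model": "llama2",
    "prompt": "<synthetic prompt text>",
    "stream": False,                # controlled by --streaming
    "system": "You are a helpful AI assistant",
    "format": "json",
    "keep_alive": "5m",
    "options": {
        "temperature": 0.7,
        "top_p": 0.9,
        "top_k": 40,
        "seed": 42,
        "num_predict": 50,          # set from the request's max output tokens
    },
}
```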

## Examples

### Basic Text Generation

```bash
aiperf profile \
--model llama2 \
--url http://localhost:11434 \
--endpoint-type ollama_generate \
--synthetic-input-tokens-mean 200 \
--output-tokens-mean 100 \
--concurrency 8 \
--request-count 50
```

### With System Prompt

```bash
aiperf profile \
--model mistral \
--url http://localhost:11434 \
--endpoint-type ollama_generate \
--extra-inputs system:"You are a helpful AI assistant" \
--synthetic-input-tokens-mean 150 \
--output-tokens-mean 75 \
--concurrency 4 \
--request-count 25
```

### With Model Options

```bash
aiperf profile \
--model llama2 \
--url http://localhost:11434 \
--endpoint-type ollama_generate \
--extra-inputs options:'{
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"seed": 42
}' \
--synthetic-input-tokens-mean 100 \
--output-tokens-mean 50 \
--concurrency 6 \
--request-count 30
```

### JSON Mode

Force structured JSON output:

```bash
aiperf profile \
--model llama2 \
--url http://localhost:11434 \
--endpoint-type ollama_generate \
--extra-inputs format:json \
--extra-inputs system:"Return responses as valid JSON" \
--synthetic-input-tokens-mean 100 \
--output-tokens-mean 50 \
--concurrency 4 \
--request-count 20
```

### Streaming Mode

Enable streaming for token-by-token generation:

```bash
aiperf profile \
--model llama2 \
--url http://localhost:11434 \
--endpoint-type ollama_generate \
--streaming \
--synthetic-input-tokens-mean 200 \
--output-tokens-mean 150 \
--concurrency 2 \
--request-count 10
```

### With Custom Keep-Alive

Control how long the model stays in memory:

```bash
aiperf profile \
--model codellama \
--url http://localhost:11434 \
--endpoint-type ollama_generate \
--extra-inputs keep_alive:10m \
--synthetic-input-tokens-mean 500 \
--output-tokens-mean 200 \
--concurrency 4 \
--request-count 15
```

### Vision Model (with Images)

Benchmark vision-capable models:

```bash
aiperf profile \
--model llava \
--url http://localhost:11434 \
--endpoint-type ollama_generate \
--extra-inputs images:'["base64_encoded_image_data"]' \
--synthetic-input-tokens-mean 100 \
--output-tokens-mean 50 \
--concurrency 2 \
--request-count 10
```

### Complete Configuration

Combine multiple options:

```bash
aiperf profile \
--model mistral \
--url http://localhost:11434 \
--endpoint-type ollama_generate \
--streaming \
--extra-inputs system:"You are a technical documentation writer" \
--extra-inputs format:json \
--extra-inputs keep_alive:5m \
--extra-inputs options:'{
"temperature": 0.3,
"top_p": 0.95,
"seed": 123,
"num_ctx": 4096
}' \
--synthetic-input-tokens-mean 300 \
--output-tokens-mean 200 \
--concurrency 4 \
--request-count 50
```

## Response Handling

The endpoint automatically (see the annotated example after this list):
- Extracts generated text from the `response` field
- Parses token counts when `done: true`:
  - `prompt_eval_count` → `prompt_tokens`
  - `eval_count` → `completion_tokens`
- Calculates `total_tokens`
- Handles streaming chunks progressively
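
For reference, here is a sketch of a final Ollama response (`done: true`) and how its fields map onto the reported usage. The values are made up; the field names match the ones listed above.

```python
# Illustrative final /api/generate response (values are examples only).
final_chunk = {
    "model": "llama2",
    "response": "<generated text (empty string in the final streaming chunk)>",
    "done": True,
    "prompt_eval_count": 102,  # mapped to prompt_tokens
    "eval_count": 48,          # mapped to completion_tokens
}

# Usage as derived by the endpoint from the final chunk.
usage = {
    "prompt_tokens": final_chunk["prompt_eval_count"],
    "completion_tokens": final_chunk["eval_count"],
    "total_tokens": final_chunk["prompt_eval_count"] + final_chunk["eval_count"],
}
```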

## Tips

- **Use `--streaming`** to benchmark Ollama's streaming performance
- **Set `keep_alive`** to avoid model reload overhead between requests
- **Use `format:json`** with a system prompt for structured output
- **Set `raw:true`** to skip Ollama's automatic prompt templating
- **Use `-v` or `-vv`** to see detailed request/response logs
- **Check `artifacts/<run-name>/`** for detailed metrics

## Troubleshooting

**Model not responding**
- Verify Ollama is running: `ollama list`
- Check the base URL is correct (default: `http://localhost:11434`)

**Slow performance**
- Increase `keep_alive` to keep the model in memory
- Reduce concurrency if you're hitting resource limits

**Invalid JSON responses**
- Add a system prompt when using `format:json`
- Not all models support JSON mode equally well

**Token counts missing**
- Token counts only appear in the final response when `done: true`
- Check the model supports token counting

## API Reference

For complete Ollama API documentation, see:
- [Ollama Generate API](https://docs.ollama.com/api/generate)
1 change: 1 addition & 0 deletions src/aiperf/common/enums/plugin_enums.py
@@ -30,6 +30,7 @@ class EndpointType(CaseInsensitiveStrEnum):
HF_TEI_RANKINGS = "hf_tei_rankings"
HUGGINGFACE_GENERATE = "huggingface_generate"
NIM_RANKINGS = "nim_rankings"
OLLAMA_GENERATE = "ollama_generate"
SOLIDO_RAG = "solido_rag"
TEMPLATE = "template"

4 changes: 4 additions & 0 deletions src/aiperf/endpoints/__init__.py
@@ -19,6 +19,9 @@
from aiperf.endpoints.nim_rankings import (
NIMRankingsEndpoint,
)
from aiperf.endpoints.ollama_generate import (
OllamaGenerateEndpoint,
)
from aiperf.endpoints.openai_chat import (
ChatEndpoint,
)
@@ -45,6 +48,7 @@
"HFTeiRankingsEndpoint",
"HuggingFaceGenerateEndpoint",
"NIMRankingsEndpoint",
"OllamaGenerateEndpoint",
"SolidoEndpoint",
"TemplateEndpoint",
]
118 changes: 118 additions & 0 deletions src/aiperf/endpoints/ollama_generate.py
@@ -0,0 +1,118 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

from __future__ import annotations

from typing import Any

from aiperf.common.decorators import implements_protocol
from aiperf.common.enums import EndpointType
from aiperf.common.factories import EndpointFactory
from aiperf.common.models import ParsedResponse
from aiperf.common.models.metadata import EndpointMetadata
from aiperf.common.models.record_models import RequestInfo
from aiperf.common.protocols import EndpointProtocol, InferenceServerResponse
from aiperf.endpoints.base_endpoint import BaseEndpoint


@implements_protocol(EndpointProtocol)
@EndpointFactory.register(EndpointType.OLLAMA_GENERATE)
class OllamaGenerateEndpoint(BaseEndpoint):
"""Ollama Generate endpoint.

Supports both streaming and non-streaming text generation using Ollama's
/api/generate endpoint. This endpoint is designed for single-turn text
generation with optional system prompts and advanced parameters.
"""

@classmethod
def metadata(cls) -> EndpointMetadata:
"""Return Ollama Generate endpoint metadata."""
return EndpointMetadata(
endpoint_path="/api/generate",
supports_streaming=True,
produces_tokens=True,
tokenizes_input=True,
metrics_title="LLM Metrics",
)

def format_payload(self, request_info: RequestInfo) -> dict[str, Any]:
"""Format payload for Ollama Generate request.

Args:
request_info: Request context including model endpoint, metadata, and turns

Returns:
Ollama Generate API payload
"""
if not request_info.turns:
raise ValueError("Ollama Generate endpoint requires at least one turn.")

turn = request_info.turns[0]
model_endpoint = request_info.model_endpoint

prompt = " ".join(
[content for text in turn.texts for content in text.contents if content]
)

payload: dict[str, Any] = {
"model": turn.model or model_endpoint.primary_model_name,
"prompt": prompt,
"stream": model_endpoint.endpoint.streaming,
}

if turn.max_tokens is not None:
payload.setdefault("options", {})["num_predict"] = turn.max_tokens

if model_endpoint.endpoint.extra:
extra = dict(model_endpoint.endpoint.extra)
extra_options = extra.pop("options", {})

payload.update(extra)

if extra_options:
payload.setdefault("options", {}).update(extra_options)

self.debug(lambda: f"Formatted Ollama Generate payload: {payload}")
return payload

def parse_response(
self, response: InferenceServerResponse
) -> ParsedResponse | None:
"""Parse Ollama Generate response.

Handles both streaming and non-streaming modes. In streaming mode,
each chunk contains incremental response text. In non-streaming mode,
the complete response is returned at once.

Args:
response: Raw response from inference server

Returns:
Parsed response with extracted text and usage data
"""
json_obj = response.get_json()
if not json_obj:
return None

text = json_obj.get("response")
if not text:
self.debug(lambda: f"No 'response' field in Ollama response: {json_obj}")
return None
Comment on lines +98 to +101

⚠️ Potential issue | 🔴 Critical

**Don't drop streaming completion chunks**

For streaming `/api/generate`, the final chunk arrives with `"response": ""` but still carries `prompt_eval_count`, `eval_count`, and other usage metrics (ollama.readthedocs.io). Because `if not text:` treats the empty string as missing, the final chunk is discarded and usage accounting is lost for every streaming run. Please treat only a missing/`None` field as absent so that empty strings still flow through.

```diff
-        text = json_obj.get("response")
-        if not text:
+        text = json_obj.get("response")
+        if text is None:
             self.debug(lambda: f"No 'response' field in Ollama response: {json_obj}")
             return None
```

`make_text_response_data` already returns `None` for empty strings, so this keeps usage while preserving the logging for truly absent fields. Adding a regression test that feeds a streaming final chunk (`response=""`, `done=True`) would help ensure this path stays covered.



data = self.make_text_response_data(text)

usage = None
if json_obj.get("done"):
prompt_eval_count = json_obj.get("prompt_eval_count")
eval_count = json_obj.get("eval_count")

if prompt_eval_count is not None or eval_count is not None:
usage = {
"prompt_tokens": prompt_eval_count,
"completion_tokens": eval_count,
}
if prompt_eval_count is not None and eval_count is not None:
usage["total_tokens"] = prompt_eval_count + eval_count

return ParsedResponse(perf_ns=response.perf_ns, data=data, usage=usage)
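
A minimal regression test for the streaming final chunk, as suggested in the review comment above, could look like the sketch below. It assumes only what this diff shows (`parse_response` reads `get_json()` and `perf_ns` from the response object), so a duck-typed fake response is used. The `endpoint` fixture that constructs `OllamaGenerateEndpoint` is hypothetical, since the constructor is not part of this diff, and the usage assertions assume `ParsedResponse` keeps the usage dict as passed.

```python
# Sketch of the regression test suggested in the review. FakeResponse and the
# `endpoint` fixture are hypothetical test helpers, not part of this diff.
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeResponse:
    """Duck-typed stand-in exposing only what parse_response uses."""

    json_obj: dict[str, Any]
    perf_ns: int = 0

    def get_json(self) -> dict[str, Any]:
        return self.json_obj


def test_streaming_final_chunk_preserves_usage(endpoint):
    # Final streaming chunk: empty response text, done=True, usage fields set.
    chunk = FakeResponse(
        {"response": "", "done": True, "prompt_eval_count": 10, "eval_count": 5}
    )

    parsed = endpoint.parse_response(chunk)

    # With the reviewer's proposed fix (`if text is None:`), the final chunk
    # is not dropped and the usage counts survive.
    assert parsed is not None
    assert parsed.usage["prompt_tokens"] == 10
    assert parsed.usage["completion_tokens"] == 5
    assert parsed.usage["total_tokens"] == 15
```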