feat: Add native ollama_generate endpoint type support #448
Draft · ajcasagrande wants to merge 1 commit into main from ajc/ollama
Changes from all commits
@@ -0,0 +1,233 @@

<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Ollama Generate Endpoint

The Ollama generate endpoint enables benchmarking of [Ollama](https://ollama.com/) models using the `/api/generate` endpoint. It supports both streaming and non-streaming text generation with full access to Ollama's configuration options.

## When to Use

Use the `ollama_generate` endpoint when:
- You are benchmarking models running on Ollama
- You need access to Ollama-specific features like system prompts, JSON formatting, or raw mode
- You want to test Ollama's streaming capabilities

## Basic Example

Benchmark an Ollama model with default settings:

```bash
aiperf profile \
  --model llama2 \
  --url http://localhost:11434 \
  --endpoint-type ollama_generate \
  --synthetic-input-tokens-mean 100 \
  --output-tokens-mean 50 \
  --concurrency 4 \
  --request-count 20
```

## Configuration

Configure the endpoint using `--extra-inputs` for Ollama-specific options:

### Top-Level Parameters

- **`system`**: System prompt to guide model behavior
- **`format`**: Output format (`"json"` or a JSON schema)
- **`raw`**: Skip prompt templating (boolean)
- **`keep_alive`**: Model persistence duration (e.g., `"5m"`, `"1h"`)
- **`images`**: List of base64-encoded images for vision models

### Model Options

Pass model parameters using the `options` object:

- **`temperature`**: Sampling temperature (0.0-2.0)
- **`top_p`**: Nucleus sampling threshold
- **`top_k`**: Top-k sampling limit
- **`seed`**: Random seed for reproducibility
- **`num_ctx`**: Context window size
- **`stop`**: Stop sequences
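For orientation (not part of the diff above): a sketch of the request body these settings produce, with values borrowed from the examples in this document. The merge behavior follows the `format_payload` implementation later in this PR; the `num_predict` mapping is an assumption about how the requested output length reaches the turn.

```python
# Hypothetical request body sent to POST /api/generate; values are illustrative.
payload = {
    "model": "llama2",                    # from --model
    "prompt": "<synthetic prompt text>",  # generated synthetic input
    "stream": False,                      # True when --streaming is set
    "system": "You are a helpful AI assistant",  # top-level --extra-inputs key
    "keep_alive": "5m",                          # top-level --extra-inputs key
    "options": {                          # nested model options
        "num_predict": 50,    # assumed to come from the requested output length
        "temperature": 0.7,   # merged from the options extra input
        "top_p": 0.9,
        "seed": 42,
    },
}
```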

## Examples

### Basic Text Generation

```bash
aiperf profile \
  --model llama2 \
  --url http://localhost:11434 \
  --endpoint-type ollama_generate \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 100 \
  --concurrency 8 \
  --request-count 50
```

### With System Prompt

```bash
aiperf profile \
  --model mistral \
  --url http://localhost:11434 \
  --endpoint-type ollama_generate \
  --extra-inputs system:"You are a helpful AI assistant" \
  --synthetic-input-tokens-mean 150 \
  --output-tokens-mean 75 \
  --concurrency 4 \
  --request-count 25
```

### With Model Options

```bash
aiperf profile \
  --model llama2 \
  --url http://localhost:11434 \
  --endpoint-type ollama_generate \
  --extra-inputs options:'{
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "seed": 42
  }' \
  --synthetic-input-tokens-mean 100 \
  --output-tokens-mean 50 \
  --concurrency 6 \
  --request-count 30
```

### JSON Mode

Force structured JSON output:

```bash
aiperf profile \
  --model llama2 \
  --url http://localhost:11434 \
  --endpoint-type ollama_generate \
  --extra-inputs format:json \
  --extra-inputs system:"Return responses as valid JSON" \
  --synthetic-input-tokens-mean 100 \
  --output-tokens-mean 50 \
  --concurrency 4 \
  --request-count 20
```

### Streaming Mode

Enable streaming for token-by-token generation:

```bash
aiperf profile \
  --model llama2 \
  --url http://localhost:11434 \
  --endpoint-type ollama_generate \
  --streaming \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 150 \
  --concurrency 2 \
  --request-count 10
```

### With Custom Keep-Alive

Control how long the model stays in memory:

```bash
aiperf profile \
  --model codellama \
  --url http://localhost:11434 \
  --endpoint-type ollama_generate \
  --extra-inputs keep_alive:10m \
  --synthetic-input-tokens-mean 500 \
  --output-tokens-mean 200 \
  --concurrency 4 \
  --request-count 15
```

### Vision Model (with Images)

Benchmark vision-capable models:

```bash
aiperf profile \
  --model llava \
  --url http://localhost:11434 \
  --endpoint-type ollama_generate \
  --extra-inputs images:'["base64_encoded_image_data"]' \
  --synthetic-input-tokens-mean 100 \
  --output-tokens-mean 50 \
  --concurrency 2 \
  --request-count 10
```

### Complete Configuration

Combine multiple options:

```bash
aiperf profile \
  --model mistral \
  --url http://localhost:11434 \
  --endpoint-type ollama_generate \
  --streaming \
  --extra-inputs system:"You are a technical documentation writer" \
  --extra-inputs format:json \
  --extra-inputs keep_alive:5m \
  --extra-inputs options:'{
    "temperature": 0.3,
    "top_p": 0.95,
    "seed": 123,
    "num_ctx": 4096
  }' \
  --synthetic-input-tokens-mean 300 \
  --output-tokens-mean 200 \
  --concurrency 4 \
  --request-count 50
```

## Response Handling

The endpoint automatically:
- Extracts generated text from the `response` field
- Parses token counts when `done: true`:
  - `prompt_eval_count` → `prompt_tokens`
  - `eval_count` → `completion_tokens`
- Calculates `total_tokens`
- Handles streaming chunks progressively
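For illustration only (not part of the diff): how a final Ollama response maps onto the usage fields listed above, mirroring the parse logic added in this PR. The numeric values are made up.

```python
# Final /api/generate response (done == true); field names follow the Ollama API.
final_chunk = {
    "model": "llama2",
    "response": "",            # in streaming mode the last chunk may carry no text
    "done": True,
    "prompt_eval_count": 26,   # -> prompt_tokens
    "eval_count": 298,         # -> completion_tokens
}

usage = {
    "prompt_tokens": final_chunk["prompt_eval_count"],
    "completion_tokens": final_chunk["eval_count"],
    "total_tokens": final_chunk["prompt_eval_count"] + final_chunk["eval_count"],
}
print(usage)  # {'prompt_tokens': 26, 'completion_tokens': 298, 'total_tokens': 324}
```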

## Tips

- **Use `--streaming`** to benchmark Ollama's streaming performance
- **Set `keep_alive`** to avoid model reload overhead between requests
- **Use `format:json`** with a system prompt for structured output
- **Set `raw:true`** to skip Ollama's automatic prompt templating
- **Use `-v` or `-vv`** to see detailed request/response logs
- **Check `artifacts/<run-name>/`** for detailed metrics

## Troubleshooting

**Model not responding**
- Verify Ollama is running: `ollama list`
- Check the base URL is correct (default: `http://localhost:11434`)
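If the service still seems unreachable, a direct request against `/api/generate` outside aiperf can confirm the server and model are healthy. This sketch assumes the `requests` package is installed and a `llama2` model has been pulled; adjust the URL and model for your setup.

```python
# Quick sanity check against a local Ollama server; adjust URL/model as needed.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Say hello.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
body = resp.json()
print(body["response"])                                        # generated text
print(body.get("prompt_eval_count"), body.get("eval_count"))   # token counts
```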

**Slow performance**
- Increase `keep_alive` to keep the model in memory
- Reduce concurrency if you're hitting resource limits

**Invalid JSON responses**
- Add a system prompt when using `format:json`
- Not all models support JSON mode equally well

**Token counts missing**
- Token counts only appear in the final response when `done: true`
- Check that the model supports token counting

## API Reference

For complete Ollama API documentation, see:
- [Ollama Generate API](https://docs.ollama.com/api/generate)
@@ -0,0 +1,118 @@

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

from __future__ import annotations

from typing import Any

from aiperf.common.decorators import implements_protocol
from aiperf.common.enums import EndpointType
from aiperf.common.factories import EndpointFactory
from aiperf.common.models import ParsedResponse
from aiperf.common.models.metadata import EndpointMetadata
from aiperf.common.models.record_models import RequestInfo
from aiperf.common.protocols import EndpointProtocol, InferenceServerResponse
from aiperf.endpoints.base_endpoint import BaseEndpoint


@implements_protocol(EndpointProtocol)
@EndpointFactory.register(EndpointType.OLLAMA_GENERATE)
class OllamaGenerateEndpoint(BaseEndpoint):
    """Ollama Generate endpoint.

    Supports both streaming and non-streaming text generation using Ollama's
    /api/generate endpoint. This endpoint is designed for single-turn text
    generation with optional system prompts and advanced parameters.
    """

    @classmethod
    def metadata(cls) -> EndpointMetadata:
        """Return Ollama Generate endpoint metadata."""
        return EndpointMetadata(
            endpoint_path="/api/generate",
            supports_streaming=True,
            produces_tokens=True,
            tokenizes_input=True,
            metrics_title="LLM Metrics",
        )

    def format_payload(self, request_info: RequestInfo) -> dict[str, Any]:
        """Format payload for Ollama Generate request.

        Args:
            request_info: Request context including model endpoint, metadata, and turns

        Returns:
            Ollama Generate API payload
        """
        if not request_info.turns:
            raise ValueError("Ollama Generate endpoint requires at least one turn.")

        turn = request_info.turns[0]
        model_endpoint = request_info.model_endpoint

        prompt = " ".join(
            [content for text in turn.texts for content in text.contents if content]
        )

        payload: dict[str, Any] = {
            "model": turn.model or model_endpoint.primary_model_name,
            "prompt": prompt,
            "stream": model_endpoint.endpoint.streaming,
        }

        if turn.max_tokens is not None:
            payload.setdefault("options", {})["num_predict"] = turn.max_tokens

        if model_endpoint.endpoint.extra:
            extra = dict(model_endpoint.endpoint.extra)
            extra_options = extra.pop("options", {})

            payload.update(extra)

            if extra_options:
                payload.setdefault("options", {}).update(extra_options)

        self.debug(lambda: f"Formatted Ollama Generate payload: {payload}")
        return payload

    def parse_response(
        self, response: InferenceServerResponse
    ) -> ParsedResponse | None:
        """Parse Ollama Generate response.

        Handles both streaming and non-streaming modes. In streaming mode,
        each chunk contains incremental response text. In non-streaming mode,
        the complete response is returned at once.

        Args:
            response: Raw response from inference server

        Returns:
            Parsed response with extracted text and usage data
        """
        json_obj = response.get_json()
        if not json_obj:
            return None

        text = json_obj.get("response")
        if not text:
            self.debug(lambda: f"No 'response' field in Ollama response: {json_obj}")
            return None

        data = self.make_text_response_data(text)

        usage = None
        if json_obj.get("done"):
            prompt_eval_count = json_obj.get("prompt_eval_count")
            eval_count = json_obj.get("eval_count")

            if prompt_eval_count is not None or eval_count is not None:
                usage = {
                    "prompt_tokens": prompt_eval_count,
                    "completion_tokens": eval_count,
                }
                if prompt_eval_count is not None and eval_count is not None:
                    usage["total_tokens"] = prompt_eval_count + eval_count

        return ParsedResponse(perf_ns=response.perf_ns, data=data, usage=usage)
```
**Don't drop streaming completion chunks**

For streaming `/api/generate`, the final chunk arrives with `"response": ""` but still carries `prompt_eval_count`, `eval_count`, and other usage metrics (ollama.readthedocs.io). Because `if not text:` treats the empty string as missing, the final chunk is discarded and we lose usage accounting for every streaming run. Please treat only a missing/`None` field as absent so that empty strings still flow through. `make_text_response_data` already returns `None` for empty strings, so this keeps usage while preserving the logging for truly absent fields. Adding a regression test that feeds a streaming final chunk (`response=""`, `done=True`) would help ensure this path stays covered.
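A minimal, self-contained sketch of the behavior the comment asks for (the committable suggestion itself is elided above): treat only a missing `response` field as absent so the empty final streaming chunk still yields usage. `parse_usage` is a stand-in helper for illustration, not an aiperf API.

```python
# Illustrative stand-in for the usage-parsing path; mirrors the field handling in
# the diff, but checks `is None` instead of falsiness on the response text.
def parse_usage(json_obj):
    text = json_obj.get("response")
    if text is None:          # only a genuinely absent field is treated as missing
        return None
    if not json_obj.get("done"):
        return None           # usage metrics only arrive on the final chunk
    prompt_eval = json_obj.get("prompt_eval_count")
    eval_count = json_obj.get("eval_count")
    if prompt_eval is None and eval_count is None:
        return None
    usage = {"prompt_tokens": prompt_eval, "completion_tokens": eval_count}
    if prompt_eval is not None and eval_count is not None:
        usage["total_tokens"] = prompt_eval + eval_count
    return usage


# Regression-style check for the streaming final chunk described in the comment:
# empty response text, done=True, token counts present.
final_chunk = {"response": "", "done": True, "prompt_eval_count": 26, "eval_count": 298}
assert parse_usage(final_chunk) == {
    "prompt_tokens": 26,
    "completion_tokens": 298,
    "total_tokens": 324,
}
```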