117 changes: 117 additions & 0 deletions docs/site/endpoints.md
@@ -0,0 +1,117 @@
# Bring-Your-Own-Endpoint (BYOE)

Specsmith ships first-class support for self-hosted OpenAI-v1-compatible
LLM servers (vLLM, llama.cpp `server`, LM Studio, TGI,
text-generation-webui, …). Every endpoint you register can be selected
per session via `--endpoint <id>` on `specsmith run`, `chat`, and
`serve` (PR-2).

## Quick start

Register a vLLM server running on your LAN:

```sh
specsmith endpoints add \
  --id home-vllm \
  --name "Home vLLM" \
  --base-url http://10.0.0.4:8000/v1 \
  --default-model Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int8 \
  --auth none \
  --set-default

specsmith endpoints test home-vllm
```

Once the test reports `ok`, run an agent against it:

```sh
specsmith run --endpoint home-vllm "summarise the last commit"
```

## Storage layout

All endpoints live in `~/.specsmith/endpoints.json` (override with
`SPECSMITH_HOME`). The on-disk schema is versioned:

```json
{
  "schema_version": 1,
  "default_endpoint_id": "home-vllm",
  "endpoints": [
    {
      "id": "home-vllm",
      "name": "Home vLLM",
      "base_url": "http://10.0.0.4:8000/v1",
      "auth": {
        "kind": "bearer-keyring",
        "keyring_service": "specsmith",
        "keyring_user": "endpoint:home-vllm"
      },
      "default_model": "Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int8",
      "verify_tls": true,
      "tags": ["local", "coder"],
      "created_at": "2026-05-01T11:30:17Z"
    }
  ]
}
```

The file is written with mode `600` (`chmod 600`) on POSIX. Token bytes
for the inline strategy are the only secret material that ever lands in
this file; the keyring and env-var strategies keep it secret-free.

## Auth strategies

| Kind | Where the token lives | When to use |
|------------------|----------------------------------------------------|-------------|
| `none` | nowhere; requests are unauthenticated | trusted LAN, open vLLM dev box |
| `bearer-inline` | `endpoints.json` (plaintext, `chmod 600`) | quick scratch setups where a keyring is unavailable |
| `bearer-env` | the env var you name (`--token-env FOO`) | CI / containers / 12-factor deploys |
| `bearer-keyring` | OS keyring, indexed by `(service, user)` | desktop / laptop installs (the default) |

The `list --json` output redacts inline tokens to `"***"`. The CLI
never logs token bytes to terminal output.
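
As a rough sketch of how these strategies resolve at request time: the
keyring fields follow the on-disk schema above, but `token` and
`token_env` are assumed field names, and the helper is illustrative
rather than the shipped `resolve_token` implementation.

```python
import os

import keyring  # third-party package; assumed present when bearer-keyring is used


def resolve_token_sketch(auth: dict) -> str | None:
    """Hypothetical per-kind token resolution (illustrative only)."""
    kind = auth.get("kind", "none")
    if kind == "none":
        return None  # request goes out unauthenticated
    if kind == "bearer-inline":
        return auth["token"]  # plaintext in endpoints.json ("token" is an assumed field)
    if kind == "bearer-env":
        return os.environ[auth["token_env"]]  # env var named via --token-env ("token_env" assumed)
    if kind == "bearer-keyring":
        # Indexed by (service, user), e.g. ("specsmith", "endpoint:home-vllm").
        return keyring.get_password(auth["keyring_service"], auth["keyring_user"])
    raise ValueError(f"unknown auth kind: {kind}")
```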

## Health checks

```sh
specsmith endpoints test home-vllm --json
specsmith endpoints models home-vllm --json
```

`test` calls `<base_url>/models` with the resolved bearer token, prints
the latency in milliseconds, and reports up to five model IDs. `models`
returns the full list.

If the endpoint does not expose `/v1/models`, `test` still returns a
clear error message; set `default_model` manually and rely on the
session-level model dropdown instead.
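
A minimal sketch of what the probe effectively does, assuming only the
behaviour described above (stdlib HTTP, no retries, error handling
trimmed):

```python
import json
import time
from urllib.request import Request, urlopen


def probe_models_sketch(
    base_url: str, token: str | None = None, timeout: float = 5.0
) -> tuple[float, list[str]]:
    """Hypothetical health probe: returns (latency in ms, advertised model IDs)."""
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = Request(f"{base_url.rstrip('/')}/models", headers=headers)
    start = time.monotonic()
    with urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    latency_ms = (time.monotonic() - start) * 1000.0
    # OpenAI-v1 servers list models as {"object": "list", "data": [{"id": ...}, ...]}.
    return latency_ms, [m["id"] for m in payload.get("data", [])]
```

`test` would then report the latency plus the first five IDs, and
`models` the full list.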

## CLI reference

| Command | Notes |
|---------|-------|
| `specsmith endpoints add` | Register a new endpoint. `--auth bearer-keyring` (default) prompts for the secret without echo. |
| `specsmith endpoints list [--json]` | Tabular by default, JSON for IDE consumers. Tokens are redacted. |
| `specsmith endpoints remove <id> [--purge-keyring]` | Remove the entry; pass `--purge-keyring` to also delete the saved token. |
| `specsmith endpoints default <id>` | Promote an existing endpoint to the default. |
| `specsmith endpoints test [<id>] [--timeout 5]` | Probe `/v1/models`. Exits 1 on failure. |
| `specsmith endpoints models [<id>]` | List every model the endpoint advertises. |

## Security notes

* The store file is written with mode `600` on POSIX where supported.
* `verify_tls: false` is opt-in (`--no-verify-tls`); otherwise the CLI
  verifies the certificate chain. Disabling verification for an https
  endpoint is recorded per endpoint in the on-disk JSON, so a drift
  audit can spot insecure configurations (see the sketch after this list).
* `auth.kind == bearer-inline` is functional but not recommended.
Prefer `bearer-keyring` when the OS keyring is available; otherwise
use `bearer-env` and inject the secret through your shell or
container environment.
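
As a concrete illustration of the drift-audit point, a scan over the
store can be as small as the sketch below; the path and field names
follow the schema above, and the script is illustrative, not a shipped
command:

```python
import json
from pathlib import Path

# SPECSMITH_HOME may relocate this directory; the default is shown.
store = Path.home() / ".specsmith" / "endpoints.json"
data = json.loads(store.read_text())

for ep in data.get("endpoints", []):
    # Flag https endpoints that opted out of certificate verification.
    if ep["base_url"].startswith("https://") and not ep.get("verify_tls", True):
        print(f"INSECURE: {ep['id']} disables TLS verification on an https endpoint")
    # Flag tokens stored in plaintext rather than keyring/env.
    if ep.get("auth", {}).get("kind") == "bearer-inline":
        print(f"WARN: {ep['id']} stores its bearer token inline in endpoints.json")
```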

## Roadmap

* **PR-2 (this milestone):** wires `--endpoint <id>` into `run`,
`chat`, and `serve`, plus a new `_run_openai_compat` provider driver.
* **PR-3:** Endpoints tab and a per-session dropdown in the
`specsmith-vscode` extension.
* **PR-4:** 0.8.0 release notes + tag.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "specsmith"
version = "0.7.0"
version = "0.8.0"
description = "Applied Epistemic Engineering toolkit — AEE agent sessions, execution profiles, FPGA/HDL governance, tool installer, 50+ CLI commands."
readme = "README.md"
license = "MIT"
99 changes: 98 additions & 1 deletion src/specsmith/agent/chat_runner.py
@@ -80,11 +80,35 @@ def run_chat(
    history: list[dict[str, Any]] | None = None,
    confidence_target: float = 0.7,
    rules_prefix: str = "",
    endpoint_id: str | None = None,
) -> ChatRunResult | None:
    """Drive a real LLM turn. Return ``None`` if no provider is reachable.

    When ``endpoint_id`` is set, the BYOE store (REQ-142) is consulted and
    the resolved :class:`Endpoint` short-circuits the provider chain via
    the new :func:`_run_openai_compat` driver. Any error during endpoint
    resolution falls back to the legacy auto-detect chain, so an offline or
    misconfigured endpoint never breaks `specsmith chat`.
    """
    history = history or []
    messages = _build_messages(utterance, history, rules_prefix)

    # REQ-142: explicit endpoint override.
    if endpoint_id:
        try:
            from specsmith.agent.endpoints import EndpointStore

            endpoint = EndpointStore.load().resolve(endpoint_id)
        except Exception:  # noqa: BLE001 - any failure → fall back to auto-detect
            endpoint = None
        if endpoint is not None:
            try:
                full_text = _run_openai_compat(messages, emitter, msg_block, endpoint=endpoint)
            except Exception:  # noqa: BLE001 - degrade to auto-detect
                full_text = None
            if full_text is not None:
                return _finalize(full_text, "openai_compat", project_dir, confidence_target)

    # Order matters: Ollama first because it's local-first and free.
    for provider in (_run_ollama, _run_anthropic, _run_openai, _run_gemini):
        try:
@@ -228,6 +252,79 @@ def _run_openai(
    return "".join(pieces) if pieces else None


def _run_openai_compat(
    messages: list[dict[str, str]],
    emitter: EventEmitter,
    block_id: str,
    *,
    endpoint: Any,
) -> str | None:
    """Stream from a user-registered OpenAI-v1-compatible endpoint (REQ-142).

    Uses raw stdlib HTTP so the openai SDK is not a hard dependency for
    BYOE. Sends a streaming ``/chat/completions`` request, decodes the
    Server-Sent-Events ``data:`` lines, and forwards each ``content``
    delta as a ``token`` event on ``block_id``.
    """
    base_url = endpoint.base_url.rstrip("/")
    url = f"{base_url}/chat/completions"
    model = endpoint.default_model or os.environ.get("SPECSMITH_OPENAI_COMPAT_MODEL", "")
    if not model:
        # The endpoint did not pin a default model and the env override is
        # absent. We cannot fabricate one; fall back to the auto-detect chain.
        return None

    headers: dict[str, str] = {
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
    }
    try:
        token = endpoint.resolve_token()
    except Exception:  # noqa: BLE001 - fall back to auto-detect chain
        return None
    if token:
        headers["Authorization"] = f"Bearer {token}"

    body = json.dumps({"model": model, "messages": messages, "stream": True}).encode("utf-8")
    req = Request(url, data=body, headers=headers, method="POST")  # noqa: S310 - user-supplied

    ctx = None
    if not endpoint.verify_tls and url.startswith("https://"):
        import ssl

        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE

    pieces: list[str] = []
    try:
        with urlopen(req, timeout=120, context=ctx) as resp:  # noqa: S310 - user-supplied
            for raw_line in resp:
                line = raw_line.decode("utf-8", errors="replace").rstrip("\n\r")
                if not line.startswith("data:"):
                    continue
                payload = line[len("data:") :].strip()
                if payload == "[DONE]":
                    # SSE terminator frame used by OpenAI-style streaming servers.
                    break
                if not payload:
                    continue
                try:
                    obj = json.loads(payload)
                except ValueError:
                    continue
                choices = obj.get("choices") or []
                if not choices:
                    continue
                delta = (choices[0] or {}).get("delta") or {}
                chunk = str(delta.get("content") or "")
                if chunk:
                    emitter.token(block_id, chunk)
                    pieces.append(chunk)
    except (URLError, TimeoutError, OSError):
        return None
    return "".join(pieces) if pieces else None


def _run_gemini(
    messages: list[dict[str, str]],
    emitter: EventEmitter,