117 changes: 117 additions & 0 deletions docs/site/endpoints.md
@@ -0,0 +1,117 @@
# Bring-Your-Own-Endpoint (BYOE)

Specsmith ships first-class support for self-hosted OpenAI-v1-compatible
LLM servers (vLLM, llama.cpp `server`, LM Studio, TGI,
text-generation-webui, …). Every endpoint you register can be selected
per session via `--endpoint <id>` on `specsmith run`, `chat`, and
`serve` (PR-2).

## Quick start

Register a vLLM server running on your LAN:

```sh
specsmith endpoints add \
  --id home-vllm \
  --name "Home vLLM" \
  --base-url http://10.0.0.4:8000/v1 \
  --default-model Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int8 \
  --auth none \
  --set-default

specsmith endpoints test home-vllm
```

Once the test reports `ok`, run an agent against it:

```sh
specsmith run --endpoint home-vllm "summarise the last commit"
```

## Storage layout

All endpoints live in `~/.specsmith/endpoints.json` (override with
`SPECSMITH_HOME`). The on-disk schema is versioned:

```json
{
  "schema_version": 1,
  "default_endpoint_id": "home-vllm",
  "endpoints": [
    {
      "id": "home-vllm",
      "name": "Home vLLM",
      "base_url": "http://10.0.0.4:8000/v1",
      "auth": {
        "kind": "bearer-keyring",
        "keyring_service": "specsmith",
        "keyring_user": "endpoint:home-vllm"
      },
      "default_model": "Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int8",
      "verify_tls": true,
      "tags": ["local", "coder"],
      "created_at": "2026-05-01T11:30:17Z"
    }
  ]
}
```

The file is written with mode `600` (`chmod 600`) on POSIX. Token bytes
for the inline strategy are the only secret material that ever lands in
this file; the keyring and env-var strategies keep it secret-free.

## Auth strategies

| Kind | Where the token lives | When to use |
|------------------|----------------------------------------------------|-------------|
| `none` | nowhere; requests are unauthenticated | trusted LAN, open vLLM dev box |
| `bearer-inline` | `endpoints.json` (plaintext, `chmod 600`) | quick scratch setups where a keyring is unavailable |
| `bearer-env` | the env var you name (`--token-env FOO`) | CI / containers / 12-factor deploys |
| `bearer-keyring` | OS keyring, indexed by `(service, user)` | desktop / laptop installs (the default) |

The `list --json` output redacts inline tokens to `"***"`. The CLI
never logs token bytes to terminal output.
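
As a rough sketch of how these strategies resolve at request time: the
keyring fields follow the on-disk schema above, but `token` and
`token_env` are assumed field names, and the helper is illustrative
rather than the shipped `resolve_token` implementation.

```python
import os

import keyring  # third-party package; assumed present when bearer-keyring is used


def resolve_token_sketch(auth: dict) -> str | None:
    """Hypothetical per-kind token resolution (illustrative only)."""
    kind = auth.get("kind", "none")
    if kind == "none":
        return None  # request goes out unauthenticated
    if kind == "bearer-inline":
        return auth["token"]  # plaintext in endpoints.json ("token" is an assumed field)
    if kind == "bearer-env":
        return os.environ[auth["token_env"]]  # env var named via --token-env ("token_env" assumed)
    if kind == "bearer-keyring":
        # Indexed by (service, user), e.g. ("specsmith", "endpoint:home-vllm").
        return keyring.get_password(auth["keyring_service"], auth["keyring_user"])
    raise ValueError(f"unknown auth kind: {kind}")
```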

## Health checks

```sh
specsmith endpoints test home-vllm --json
specsmith endpoints models home-vllm --json
```

`test` calls `<base_url>/models` with the resolved bearer token, prints
the latency in milliseconds, and reports up to five model IDs. `models`
returns the full list.

If the endpoint does not expose `/v1/models`, `test` still returns a
clear error message; set `default_model` manually and rely on the
session-level model dropdown instead.
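
A minimal sketch of what the probe effectively does, assuming only the
behaviour described above (stdlib HTTP, no retries, error handling
trimmed):

```python
import json
import time
from urllib.request import Request, urlopen


def probe_models_sketch(
    base_url: str, token: str | None = None, timeout: float = 5.0
) -> tuple[float, list[str]]:
    """Hypothetical health probe: returns (latency in ms, advertised model IDs)."""
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = Request(f"{base_url.rstrip('/')}/models", headers=headers)
    start = time.monotonic()
    with urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    latency_ms = (time.monotonic() - start) * 1000.0
    # OpenAI-v1 servers list models as {"object": "list", "data": [{"id": ...}, ...]}.
    return latency_ms, [m["id"] for m in payload.get("data", [])]
```

`test` would then report the latency plus the first five IDs, and
`models` the full list.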

## CLI reference

| Command | Notes |
|---------|-------|
| `specsmith endpoints add` | Register a new endpoint. `--auth bearer-keyring` (default) prompts for the secret without echo. |
| `specsmith endpoints list [--json]` | Tabular by default, JSON for IDE consumers. Tokens are redacted. |
| `specsmith endpoints remove <id> [--purge-keyring]` | Remove the entry; pass `--purge-keyring` to also delete the saved token. |
| `specsmith endpoints default <id>` | Promote an existing endpoint to the default. |
| `specsmith endpoints test [<id>] [--timeout 5]` | Probe `/v1/models`. Exits 1 on failure. |
| `specsmith endpoints models [<id>]` | List every model the endpoint advertises. |

## Security notes

* The store file is written with mode `600` on POSIX where supported.
* `verify_tls: false` is opt-in (`--no-verify-tls`); otherwise the CLI
  verifies the certificate chain. Disabling verification for an https
  endpoint is recorded per endpoint in the on-disk JSON, so a drift
  audit can spot insecure configurations (see the sketch after this list).
* `auth.kind == bearer-inline` is functional but not recommended.
Prefer `bearer-keyring` when the OS keyring is available; otherwise
use `bearer-env` and inject the secret through your shell or
container environment.
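
As a concrete illustration of the drift-audit point, a scan over the
store can be as small as the sketch below; the path and field names
follow the schema above, and the script is illustrative, not a shipped
command:

```python
import json
from pathlib import Path

# SPECSMITH_HOME may relocate this directory; the default is shown.
store = Path.home() / ".specsmith" / "endpoints.json"
data = json.loads(store.read_text())

for ep in data.get("endpoints", []):
    # Flag https endpoints that opted out of certificate verification.
    if ep["base_url"].startswith("https://") and not ep.get("verify_tls", True):
        print(f"INSECURE: {ep['id']} disables TLS verification on an https endpoint")
    # Flag tokens stored in plaintext rather than keyring/env.
    if ep.get("auth", {}).get("kind") == "bearer-inline":
        print(f"WARN: {ep['id']} stores its bearer token inline in endpoints.json")
```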

## Roadmap

* **PR-2 (this milestone):** wires `--endpoint <id>` into `run`,
`chat`, and `serve`, plus a new `_run_openai_compat` provider driver.
* **PR-3:** Endpoints tab and a per-session dropdown in the
`specsmith-vscode` extension.
* **PR-4:** 0.8.0 release notes + tag.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "specsmith"
version = "0.7.0"
version = "0.8.0"
description = "Applied Epistemic Engineering toolkit — AEE agent sessions, execution profiles, FPGA/HDL governance, tool installer, 50+ CLI commands."
readme = "README.md"
license = "MIT"
99 changes: 98 additions & 1 deletion src/specsmith/agent/chat_runner.py
@@ -80,11 +80,35 @@ def run_chat(
    history: list[dict[str, Any]] | None = None,
    confidence_target: float = 0.7,
    rules_prefix: str = "",
    endpoint_id: str | None = None,
) -> ChatRunResult | None:
    """Drive a real LLM turn. Return ``None`` if no provider is reachable.

    When ``endpoint_id`` is set, the BYOE store (REQ-142) is consulted and
    the resolved :class:`Endpoint` short-circuits the provider chain via
    the new :func:`_run_openai_compat` driver. Any error during endpoint
    resolution falls back to the legacy auto-detect chain, so an offline or
    misconfigured endpoint never breaks `specsmith chat`.
    """
    history = history or []
    messages = _build_messages(utterance, history, rules_prefix)

    # REQ-142: explicit endpoint override.
    if endpoint_id:
        try:
            from specsmith.agent.endpoints import EndpointStore

            endpoint = EndpointStore.load().resolve(endpoint_id)
        except Exception:  # noqa: BLE001 - any failure → fall back to auto-detect
            endpoint = None
        if endpoint is not None:
            try:
                full_text = _run_openai_compat(messages, emitter, msg_block, endpoint=endpoint)
            except Exception:  # noqa: BLE001 - degrade to auto-detect
                full_text = None
            if full_text is not None:
                return _finalize(full_text, "openai_compat", project_dir, confidence_target)

    # Order matters: Ollama first because it's local-first and free.
    for provider in (_run_ollama, _run_anthropic, _run_openai, _run_gemini):
        try:
@@ -228,6 +252,79 @@ def _run_openai(
    return "".join(pieces) if pieces else None


def _run_openai_compat(
    messages: list[dict[str, str]],
    emitter: EventEmitter,
    block_id: str,
    *,
    endpoint: Any,
) -> str | None:
    """Stream from a user-registered OpenAI-v1-compatible endpoint (REQ-142).

    Uses raw stdlib HTTP so the openai SDK is not a hard dependency for
    BYOE. Sends a streaming ``/chat/completions`` request, decodes the
    Server-Sent-Events ``data:`` lines, and forwards each ``content``
    delta as a ``token`` event on ``block_id``.
    """
    base_url = endpoint.base_url.rstrip("/")
    url = f"{base_url}/chat/completions"
    model = endpoint.default_model or os.environ.get("SPECSMITH_OPENAI_COMPAT_MODEL", "")
    if not model:
        # The endpoint did not pin a default model and the env override is
        # absent. We cannot fabricate one; fall back to the auto-detect chain.
        return None

    headers: dict[str, str] = {
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
    }
    try:
        token = endpoint.resolve_token()
    except Exception:  # noqa: BLE001 - fall back to auto-detect chain
        return None
    if token:
        headers["Authorization"] = f"Bearer {token}"

    body = json.dumps({"model": model, "messages": messages, "stream": True}).encode("utf-8")
    req = Request(url, data=body, headers=headers, method="POST")  # noqa: S310 - user-supplied

    ctx = None
    if not endpoint.verify_tls and url.startswith("https://"):
        import ssl

        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE

    pieces: list[str] = []
    try:
        with urlopen(req, timeout=120, context=ctx) as resp:  # noqa: S310 - user-supplied
            for raw_line in resp:
                line = raw_line.decode("utf-8", errors="replace").rstrip("\n\r")
                if not line.startswith("data:"):
                    continue
                payload = line[len("data:") :].strip()
                if payload == "[DONE]":
                    # SSE terminator frame used by OpenAI-style streaming servers.
                    break
                if not payload:
                    continue
                try:
                    obj = json.loads(payload)
                except ValueError:
                    continue
                choices = obj.get("choices") or []
                if not choices:
                    continue
                delta = (choices[0] or {}).get("delta") or {}
                chunk = str(delta.get("content") or "")
                if chunk:
                    emitter.token(block_id, chunk)
                    pieces.append(chunk)
    except (URLError, TimeoutError, OSError):
        return None
    return "".join(pieces) if pieces else None


def _run_gemini(
    messages: list[dict[str, str]],
    emitter: EventEmitter,