Merged
13 changes: 13 additions & 0 deletions README.md
@@ -206,6 +206,18 @@ curl -X POST localhost:8000/api/v1/timepoints/generate/stream \

---

## Model Control

Downstream apps can control model selection and generation behavior per-request:

- **`model_policy: "permissive"`** — Routes all generation through open-weight models (DeepSeek, Llama, Qwen, Mistral) via OpenRouter, uses Pollinations for images, and skips Google grounding. Fully Google-free.
- **`text_model` / `image_model`** — Override preset models with any OpenRouter-compatible model ID (e.g. `qwen/qwen3-235b-a22b`) or Google native model.
- **`llm_params`** — Fine-grained control over temperature, max_tokens, top_p, top_k, penalties, stop sequences, and system prompt injection. Applied to all 14 pipeline agents.

All three are composable: `model_policy` + explicit models + `llm_params` work together with clear priority ordering. See [docs/API.md](docs/API.md) for the full parameter reference.
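
As a sketch, a single request body combining all three controls might look like the following (field names follow the docs; the specific values are illustrative, not recommendations):

```python
import json

# Illustrative request body: permissive policy + explicit text model + llm_params.
# The explicit text_model takes priority over the policy's auto-selection.
payload = {
    "query": "Apollo 11 Moon Landing, 1969",
    "model_policy": "permissive",            # Google-free routing for everything else
    "text_model": "qwen/qwen3-235b-a22b",    # explicit override wins over the policy
    "llm_params": {"temperature": 0.5, "max_tokens": 4096},
    "generate_image": True,
}
body = json.dumps(payload)
```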

---

## API

| Endpoint | Description |
@@ -279,6 +291,7 @@ python3.10 -m pytest tests/ -v # 522 fast + integration, 11 skipped
- [iOS Integration](docs/IOS_INTEGRATION.md) — Auth flow, credit system, endpoint map for iOS client
- [Agent Architecture](docs/AGENTS.md) — Pipeline breakdown with example output
- [Temporal Navigation](docs/TEMPORAL.md) — Time travel mechanics
- [Downstream Model Control](docs/DOWNSTREAM_MODEL_CONTROL.md) — Model policy, LLM params, and per-request control for downstream apps
- [Eval Roadmap](docs/EVAL_ROADMAP.md) — Quality scoring and benchmark plans
- [Deployment](docs/DEPLOY.md) — Local, Replit, and production deployment

85 changes: 82 additions & 3 deletions docs/API.md
@@ -62,6 +62,77 @@ Override preset models for custom configurations:
}
```

**Permissive Mode (Google-Free):**

Use only open-weight, distillable models — zero Google API calls:
```json
{
"query": "The signing of the Magna Carta, 1215",
"generate_image": true,
"model_policy": "permissive"
}
```
Text routes to DeepSeek/Llama/Qwen via OpenRouter, images route to Pollinations, and Google grounding is skipped. Response metadata reflects the actual models used:
```json
{
"text_model_used": "deepseek/deepseek-r1-0528",
"image_model_used": "pollinations",
"model_provider": "openrouter",
"model_permissiveness": "permissive"
}
```

**Composing model_policy with explicit models:**

`model_policy` and explicit model names are composable — explicit models take priority:
```json
{
"query": "Apollo 11 Moon Landing, 1969",
"model_policy": "permissive",
"text_model": "qwen/qwen3-235b-a22b",
"generate_image": true
}
```
This uses the specified Qwen model for text, Pollinations for images (from permissive policy), and skips Google grounding.

---

## LLM Parameters

The `llm_params` object gives downstream callers fine-grained control over generation hyperparameters. All fields are optional — unset fields use agent/preset defaults. These parameters are applied to every agent in the 14-step pipeline.

```json
{
"query": "Turing breaks Enigma, 1941",
"text_model": "deepseek/deepseek-r1-0528",
"llm_params": {
"temperature": 0.5,
"max_tokens": 4096,
"top_p": 0.9,
"system_prompt_suffix": "Keep all descriptions under 200 words. Use British English."
}
}
```

| Parameter | Type | Range | Providers | Description |
|-----------|------|-------|-----------|-------------|
| `temperature` | float | 0.0–2.0 | All | Sampling temperature. Overrides per-agent defaults (which range from 0.2 for factual agents to 0.85 for creative agents). |
| `max_tokens` | int | 1–32768 | All | Maximum output tokens per agent call. Preset defaults: hyper=1024, balanced=2048, hd=8192. |
| `top_p` | float | 0.0–1.0 | All | Nucleus sampling — only consider tokens whose cumulative probability is <= top_p. |
| `top_k` | int | >= 1 | All | Top-k sampling — only consider the k most likely tokens at each step. |
| `frequency_penalty` | float | -2.0–2.0 | OpenRouter | Penalize tokens proportionally to how often they've appeared in the output. |
| `presence_penalty` | float | -2.0–2.0 | OpenRouter | Penalize tokens that have appeared at all in the output so far. |
| `repetition_penalty` | float | 0.0–2.0 | OpenRouter | Multiplicative penalty for repeated tokens. |
| `stop` | string[] | max 4 | All | Stop sequences — generation halts when any of these strings is produced. |
| `thinking_level` | string | — | Google | Reasoning depth for thinking models: `"none"`, `"low"`, `"medium"`, `"high"`. |
| `system_prompt_prefix` | string | max 2000 | All | Text prepended to every agent's system prompt. Use for tone, persona, or style injection. |
| `system_prompt_suffix` | string | max 2000 | All | Text appended to every agent's system prompt. Use for constraints, formatting rules, or output instructions. |

**Notes:**
- Parameters marked "OpenRouter" are silently ignored when the request routes to Google; likewise, `thinking_level` is silently ignored when the request routes to OpenRouter.
- `system_prompt_prefix` and `system_prompt_suffix` affect all 14 pipeline agents. Use these to inject cross-cutting concerns (e.g., language, tone, verbosity constraints).
- Request-level `llm_params` override per-agent defaults. For example, if `llm_params.temperature` is set, it overrides the judge agent's default of 0.3, the scene agent's default of 0.7, etc.

---

## Endpoints Overview
@@ -120,12 +191,20 @@ Generate a scene with real-time progress updates via Server-Sent Events.
| query | string | Yes | Historical moment (3-500 chars) |
| generate_image | boolean | No | Generate AI image (default: false) |
| preset | string | No | Quality preset: `hd`, `hyper`, `balanced` (default), `gemini3` |
| text_model | string | No | Override text model (ignores preset) |
| image_model | string | No | Override image model (ignores preset) |
| text_model | string | No | Text model ID — OpenRouter format (`org/model`) or Google native (`gemini-*`). Overrides preset. |
| image_model | string | No | Image model ID — `pollinations` for free open-source, or Google native. Overrides preset. |
| model_policy | string | No | `"permissive"` — selects only open-weight models (Llama, DeepSeek, Qwen) and skips Google-dependent steps. Fully Google-free. Works alongside explicit model overrides. |
| llm_params | object | No | Fine-grained LLM parameters applied to all pipeline agents. See **LLM Parameters** below. |
| visibility | string | No | `public` (default) or `private` — controls who can see full data |
| callback_url | string | No | URL to POST results to when generation completes (async endpoint only) |
| request_context | object | No | Opaque context passed through to response (e.g. `{"source": "clockchain", "job_id": "..."}`) |

**Model selection priority** (highest first):
1. Explicit `text_model` / `image_model` — use exactly these models
2. `model_policy: "permissive"` — auto-select open-weight models, skip Google grounding
3. `preset` — use preset's default models
4. Server defaults
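
A minimal sketch of that resolution order, assuming hypothetical preset and default tables (the model IDs are taken from examples elsewhere in this doc, not from the server's actual configuration):

```python
# Hypothetical lookup tables for illustration only.
PRESET_TEXT_MODELS = {"balanced": "gemini-2.5-flash", "hd": "gemini-2.5-pro"}
PERMISSIVE_TEXT_MODEL = "deepseek/deepseek-r1-0528"
SERVER_DEFAULT_TEXT_MODEL = "gemini-2.5-flash"

def resolve_text_model(text_model=None, model_policy=None, preset=None):
    # 1. Explicit model wins outright.
    if text_model:
        return text_model
    # 2. Permissive policy auto-selects an open-weight model.
    if model_policy == "permissive":
        return PERMISSIVE_TEXT_MODEL
    # 3. Preset default, if a known preset was named.
    if preset in PRESET_TEXT_MODELS:
        return PRESET_TEXT_MODELS[preset]
    # 4. Server default.
    return SERVER_DEFAULT_TEXT_MODEL
```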

**Response:** SSE stream with events:

```
@@ -1036,4 +1115,4 @@ Rate limit: 60 requests/minute per IP.

---

*Last updated: 2026-02-23*
*Last updated: 2026-03-11*
82 changes: 82 additions & 0 deletions docs/DOWNSTREAM_MODEL_CONTROL.md
@@ -0,0 +1,82 @@
# Downstream Model Control — TIMEPOINT Flash

**For teams building on TIMEPOINT Flash (Web App, iPhone App, Clockchain, Billing, Enterprise integrations)**

TIMEPOINT Flash now supports full downstream control of model selection and generation hyperparameters on every generation request.

Downstream apps can set `model_policy: "permissive"` to route all 14 pipeline agents through open-weight models (DeepSeek R1, Llama, Qwen, Mistral) via OpenRouter, with Pollinations for images. This makes the entire pipeline fully Google-free, with zero Google API calls, including grounding.

Apps can also specify exact models by name using `text_model` and `image_model` (any OpenRouter-compatible model ID like `qwen/qwen3-235b-a22b`, or a Google native model like `gemini-2.5-flash`). These explicit overrides take priority over `model_policy`, which in turn takes priority over `preset`.

In addition, the new `llm_params` object provides fine-grained control over generation hyperparameters (temperature, max_tokens, top_p, top_k, frequency/presence/repetition penalties, stop sequences, thinking level, and system prompt prefix/suffix injection), applied uniformly across every agent in the pipeline. Request-level `llm_params` override each agent's built-in defaults, so setting `temperature: 0.3` overrides the scene agent's default of 0.7, the dialog agent's default of 0.85, and so on.

All of these controls are composable: `model_policy`, explicit models, `preset`, and `llm_params` can be combined in the same request.

## Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `query` | string | Yes | Historical moment description (3-500 chars) |
| `generate_image` | boolean | No | Generate AI image (default: false) |
| `preset` | string | No | Quality preset: `hyper`, `balanced` (default), `hd`, `gemini3` |
| `text_model` | string | No | Text model ID — OpenRouter format (`org/model`) or Google native (`gemini-*`). Overrides preset. |
| `image_model` | string | No | Image model ID — `pollinations` for free, or Google native. Overrides preset. |
| `model_policy` | string | No | `"permissive"` for open-weight only, Google-free generation. |
| `llm_params` | object | No | Fine-grained LLM hyperparameters (see table below). |
| `visibility` | string | No | `public` (default) or `private` |
| `callback_url` | string | No | URL to POST results when generation completes (async only) |
| `request_context` | object | No | Opaque context passed through to response |

## LLM Parameters (`llm_params`)

| Parameter | Type | Range | Providers | Description |
|-----------|------|-------|-----------|-------------|
| `temperature` | float | 0.0–2.0 | All | Sampling temperature. Overrides per-agent defaults (0.2 for factual, 0.85 for creative). |
| `max_tokens` | int | 1–32768 | All | Max output tokens per agent call. Preset defaults: hyper=1024, balanced=2048, hd=8192. |
| `top_p` | float | 0.0–1.0 | All | Nucleus sampling threshold. |
| `top_k` | int | >= 1 | All | Top-k sampling — consider only the k most likely tokens. |
| `frequency_penalty` | float | -2.0–2.0 | OpenRouter | Penalize tokens proportionally to frequency in output. |
| `presence_penalty` | float | -2.0–2.0 | OpenRouter | Penalize tokens that have appeared at all in output. |
| `repetition_penalty` | float | 0.0–2.0 | OpenRouter | Multiplicative penalty for repeated tokens. |
| `stop` | string[] | max 4 | All | Stop sequences — generation halts when produced. |
| `thinking_level` | string | — | Google | Reasoning depth: `"none"`, `"low"`, `"medium"`, `"high"`. |
| `system_prompt_prefix` | string | max 2000 | All | Text prepended to every agent's system prompt. |
| `system_prompt_suffix` | string | max 2000 | All | Text appended to every agent's system prompt. |
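
The ranges in the table above can be checked client-side before sending a request. A minimal validation sketch (the function name and error strings are ours, not part of the API):

```python
def validate_llm_params(p: dict) -> list[str]:
    """Return a list of violations against the documented llm_params ranges."""
    errors = []
    if "temperature" in p and not 0.0 <= p["temperature"] <= 2.0:
        errors.append("temperature must be in 0.0-2.0")
    if "top_p" in p and not 0.0 <= p["top_p"] <= 1.0:
        errors.append("top_p must be in 0.0-1.0")
    if "max_tokens" in p and not 1 <= p["max_tokens"] <= 32768:
        errors.append("max_tokens must be in 1-32768")
    if "stop" in p and len(p["stop"]) > 4:
        errors.append("stop allows at most 4 sequences")
    for key in ("system_prompt_prefix", "system_prompt_suffix"):
        if key in p and len(p[key]) > 2000:
            errors.append(key + " must be at most 2000 chars")
    if "thinking_level" in p and p["thinking_level"] not in ("none", "low", "medium", "high"):
        errors.append("thinking_level must be one of none/low/medium/high")
    return errors
```

Catching out-of-range values before the request avoids a round trip to the server for a validation error.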

## Model Selection Priority (highest first)

1. Explicit `text_model` / `image_model`
2. `model_policy: "permissive"` (auto-selects open-weight models, skips Google grounding)
3. `preset` (uses preset's default models)
4. Server defaults

## Examples

**Google-free generation:**
```json
{
"query": "The signing of the Magna Carta, 1215",
"generate_image": true,
"model_policy": "permissive"
}
```

**Specific model with custom params:**
```json
{
"query": "Turing breaks Enigma, 1941",
"text_model": "deepseek/deepseek-r1-0528",
"llm_params": {
"temperature": 0.5,
"max_tokens": 4096,
"top_p": 0.9,
"system_prompt_suffix": "Keep all descriptions under 200 words. Use British English."
}
}
```

**Permissive mode with explicit model override:**
```json
{
"query": "Apollo 11 Moon Landing, 1969",
"model_policy": "permissive",
"text_model": "qwen/qwen3-235b-a22b",
"generate_image": true
}
```

---

*Last updated: 2026-03-11*