Unify workspace path handling in tools#10
meanaverage wants to merge 201 commits into gobbleyourdong:main
Conversation
Default: 9B wave + 2B eddies (auto-scale to leftover mem) + SD-Turbo.
Lite: 2B only, no image gen (for 4GB systems).
SD-Turbo (2GB) now included in full mode memory budget. README simplified to two modes. 13 scaling tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Smoke test validates the full one-click experience: models respond → wave reasons → eddies run in parallel → image gen → agent e2e. 7/8 pass (image gen needs diffusers in system python — now fixed in installer). setup.sh now installs diffusers+torch+accelerate alongside core deps. One command, everything works.
The 9B wave built this page with zero human intervention:
- Pure HTML + Tailwind CSS CDN (no build step)
- Hero with wave background + typewriter CLI animation
- Stats bar, architecture diagram, features grid
- Install section with curl command
- 258 lines, 12.8KB, oceanic theme
4 LLM calls, 34K tokens, $0.00 (local model). This page was built by the tool it describes.
…ints
- tool_choice: required (Ark: MUST respond with function calling)
- Format: NEVER use bullets for deliverables, paragraphs mandatory
- Search: description now says use 3 query variants, visit sources
- Code: stronger "save to file first" rule, no inline complex code
- Skills/waveforms removed — dead code, AGI doesn't need plugins
- Disclosure protection already existed (verified)
- 607 tests passing, live 9B verified with tool_choice: required
Solves the monolith problem — instead of the wave writing one huge file, it specs components, dispatches eddies to write each one, then assembles. Eddies return code via the done tool (no filesystem write needed). Registered in bootstrap tools. 607 tests passing. Also: Snake game built autonomously by the 9B wave (209 lines).
tool_choice: required overflows the 9B server when combined with 20 tool schemas (2,515 tokens) and the system prompt (4,000 tokens). The prompt rule "MUST respond with exactly one tool call" enforces the same behavior without crashing the server.
Swell: more elegant, more violent. When agents spawn, the swell rises. tide.py → swell.py, tide_analyze → swell_analyze, tide_build → swell_build. README rewritten from scratch — no jargon, no walls of tables. One-command install, what it does, how it works, what you need. Written for people who want to use it, not read about it. 607 tests passing.
20 tools (2,515 tokens) plus the system prompt overflowed the 9B context on complex prompts, causing 500 Internal Server errors. Moved swell_analyze, swell_build, shell_view, plan_advance, and file_append to the loadable toolbox. 15 tools (1,829 tokens) fit comfortably.
Break: where the wave meets the shore. Results converge at the break. Full oceanic naming: wave → swell → eddies → break → output. 9B now runs with 32K context (was 16K) — fixes 500 errors on large file generation. Alphabet tracer building.
CRITICAL FIX: the verify loop is now a hard rule in the prompt: "NEVER call message_result on code you haven't verified. Write → verify → fix → deliver." Snake game: 3 iterations (broken) → 13 iterations (working) after adding verify. This was the #1 reason all 4 apps shipped broken. Also: the agent loop nudges "save to files" every 5 iterations if no file writes are detected, preventing context overflow on long tasks.
The watcher caught 6 issues in the alphabet tracer build that would have shipped broken. Config updated: 9B wave, 8192 max tokens, watcher on by default with a 2B eddy at interval 3.
Wave builds → drag tests → wave fixes → drag retests → ship. Static analysis (bracket matching, canvas checks, Three.js checks) plus headless Playwright (console errors, page errors, canvas dims). Found the pinball bug immediately: "Cannot access scoreDisplay before initialization." Snake, alphabet tracer, and node editor all pass. Named "drag" — the undertow that pulls back what's not ready.
…k, undertow
Three layers enforce fact-checking:
1. Prompt: mandatory 4-step triangulation (hypothesis → search → cross-ref → deduce)
2. Watcher: 2B checks if message_result has unverified claims
3. Agent loop: triangulation gate blocks delivery of factual claims that were never search-verified. Injects a warning, forces verification.
From Manus's methodology: parametric memory is unreliable for specifics. External sources win over training data. Always.
Current: measures tension (0.0 grounded → 1.0 hallucinating)
- Heuristic: red flags, quality anchors, text patterns
- Model probe: 2B eddy evaluates factual reliability
- Correction: iterative re-asking until tension drops
Circulation: routes based on tension reading
- Low → deliver directly
- Capability gap → force search
- Truth gap → explain contradiction
- Critical → refuse ("I don't know")
- Post-tool validation: reject results that increase tension
Pressure: tracks tension over time
- Escalates: calm → moderate → heavy → crushing
- Forces search after 2 consecutive high readings
- Forces refusal after 4 consecutive high readings
Validated against THEGAP.md: correctly flags unverified claims,
delivers well-sourced content, forces search on hallucinated stats.
27 new tests, 634 total, all green.
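The circulation routing above can be sketched as follows. A minimal sketch under stated assumptions: the 0.3/0.7 thresholds and the function shape are ours; the source specifies only the low/elevated/critical bands and the deliver/search/refuse routes.

```python
def circulate(tension: float, searched: bool) -> str:
    """Route a candidate response by its tension reading
    (0.0 grounded → 1.0 hallucinating). Thresholds are illustrative."""
    if tension < 0.3:
        return "deliver"        # low tension: grounded, deliver directly
    if tension < 0.7:
        # elevated: force a search if none has happened yet
        return "deliver" if searched else "force_search"
    return "refuse"             # critical tension: say "I don't know"
```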
Replaces the hacky keyword triangulation gate with real tension measurement. Before delivery, measure_heuristic scores the response (0.0-1.0). Circulation routes based on score:
- deliver (grounded)
- force search (elevated tension, no prior search)
- refuse (critical tension — say "I don't know")
Pressure tracks tension across the session and escalates over time. The wave can't hallucinate past the current anymore. 634 tests passing.
Current/circulation/pressure now checks BOTH:
1. Tool choice: "is this the right tool?" (before execution)
2. Response: "is this grounded?" (before delivery)
Pressure tracks consecutive high-tension decisions:
- 2+ → force search to ground the agent
- 4+ → force message_ask for user guidance
The watcher (2B text reviewer) has been removed — replaced by the tension system, which is more rigorous and doesn't need an extra LLM call. 634 tests passing.
duckduckgo_search renamed to ddgs — the old package was returning 0 results. Added arXiv API search for research queries (HTTPS, follows redirects). Both tested and working.
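The arXiv side uses arXiv's public Atom API. A minimal query-builder sketch (the endpoint and parameters are the documented arXiv API; the helper name is ours):

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"  # HTTPS, per the fix above

def arxiv_query_url(query: str, max_results: int = 5) -> str:
    """Build an arXiv API query URL; the response is an Atom XML feed."""
    params = {
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

Fetching the URL needs redirect-following, and a later commit notes that a distinct User-Agent header is required to avoid 429s.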
- tide→swell throughout (tagline, footer, architecture)
- 607→634 tests
- Architecture cards: wave/eddies/swell/break/undertow (was flow/tide/whirlpool)
- Install URL: GitHub raw (was hallucinated tsunami.ark.sh)
- Scaling table: real auto-scaling tiers (was hallucinated linear speedup)
- Eddies: up to 32 (was 4)
The undertow tests code. This tests logic. After the wave writes a synthesis with reasoning claims, an eddy receives it as a hostile peer reviewer: "assume this is wrong, find every logical error, scope error, unstated assumption." If the reviewer finds flaws → objections injected back to the wave, delivery blocked until addressed. If the reviewer finds no flaws → deliver.
Would have caught the tsunami_gap.md errors:
- "gap is narrower" (wrong — rescaling moves T³ to R³)
- "KNSS proven on R³" (wrong — only axisymmetric cases)
- Sobolev chain applied in the wrong context (T³ vs R³ after rescaling)
Three quality gates now:
1. Current/tension — catches hallucinated facts
2. Undertow — catches broken code
3. Adversarial — catches broken reasoning
634 tests passing.
The undertow completely rewritten as a lever-puller. The wave provides what to test; the undertow just does it and reports facts. No diagnosis.
- Lever types: screenshot, press, click, read_text, console, wait
- Auto-generates levers from HTML (every ID, key binding, button)
- Eddy vision: screenshot description vs user intent comparison
- DOM + pixel diff for interaction levers (catches subtle changes)
- Visibility check before clicking (no more 30s timeouts)
- code_tension metric (lever fail ratio) feeds into pressure
- Research-before-building mandatory in prompt
- 500s retryable with backoff in the model layer
- Delivery gate capped at 2 blocks (was an infinite loop)
- Info loop detector tightened to 3/6 (was 5/10)
- arXiv User-Agent header (fixes 429s)
- Current measures prose tension only (undertow measures code)
Results: pinball went from a black screen (0.62 tension) to a rendered 3D table (0.21 tension) in fewer iterations (13 vs 25).
The root now has only run.py (main entry) and serve_diffusion.py (SD-Turbo). All test harnesses, stress tests, and verification scripts now live in tests/.
Added sections for current/circulation/pressure (the tension system), the undertow lever-puller architecture, and research-before-building. Includes before/after results: black screen → rendered 3D pinball.
New search_type="code" hits the GitHub API for repos and source code. No auth needed (public API). Returns repos sorted by stars with direct links to browse code. The wave reads real implementations instead of hallucinating API calls. Tested: "three.js pinball physics" → found pinball-xr (cannon-es) and Three.js forum discussions with working CCD physics examples.
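A sketch of the unauthenticated GitHub repository search. The endpoint and query parameters are GitHub's documented REST API; the helper names are ours, not the tool's actual code.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

GITHUB_SEARCH = "https://api.github.com/search/repositories"

def github_search_url(query: str, limit: int = 5) -> str:
    """Build a repo-search URL, results sorted by stars, no auth token."""
    return GITHUB_SEARCH + "?" + urlencode(
        {"q": query, "sort": "stars", "order": "desc", "per_page": limit}
    )

def search_repos(query: str, limit: int = 5) -> list[dict]:
    """Fetch and trim results (network call; unauthenticated search is
    rate-limited to roughly 10 requests/min)."""
    req = Request(github_search_url(query, limit),
                  headers={"Accept": "application/vnd.github+json"})
    with urlopen(req) as resp:
        items = json.load(resp)["items"]
    return [{"repo": i["full_name"], "stars": i["stargazers_count"],
             "url": i["html_url"]} for i in items]
```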
Double-clicking the .ps1 opens a new window that closes when the script finishes. Added a ReadKey pause at the end and before early returns so users can read the output.
generate_image now runs SD-Turbo directly in the agent process. Auto-downloads the 2GB model on first use via HuggingFace diffusers. No separate server needed. 1 inference step, <1s on GPU.
- generate.py: _try_sd_turbo_local as the primary backend, placeholder fallback
- serve_diffusion.py: rewritten for SD-Turbo (was Qwen-Image-2512, 13GB)
- setup.sh: installs diffusers+torch+transformers+accelerate
The 2B isn't "gimped" — it's the same model acting as the wave on lite boxes. All eddy endpoint references now read from config.eddy_endpoint (propagated via the TSUNAMI_EDDY_ENDPOINT env var). In lite mode, eddy_endpoint points at the wave's port (:8090) — one server, one model, both roles.
- config.py: added eddy_endpoint field
- agent.py: propagates config.eddy_endpoint to the env var at init
- Replaced all hardcoded :8092 / TSUNAMI_BEE_ENDPOINT across 10 files
- launcher.py: lite mode starts ONE server, sets TSUNAMI_EDDY_ENDPOINT=:8090
- docker-entrypoint.sh: same lite mode fix
When file_write sees useState/useEffect/useRef in a .tsx file without a React import, it auto-prepends the import. The 2B (lite-mode wave) consistently forgets this — builds pass but the runtime crashes.
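A sketch of that guard (the regexes and helper name are illustrative, not the actual file_write code):

```python
import re

HOOK_RE = re.compile(r"\buse(State|Effect|Ref)\b")
REACT_IMPORT_RE = re.compile(r"^\s*import\s+.*react", re.IGNORECASE | re.MULTILINE)

def ensure_react_import(path: str, source: str) -> str:
    """If a .tsx file uses React hooks but never imports React,
    prepend an import for exactly the hooks it uses."""
    if not path.endswith(".tsx"):
        return source
    hooks = sorted({"use" + m.group(1) for m in HOOK_RE.finditer(source)})
    if hooks and not REACT_IMPORT_RE.search(source):
        return "import { %s } from 'react';\n" % ", ".join(hooks) + source
    return source
```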
Battle-tested on actual Windows hardware. Key fixes over the previous version:
- tsu.ps1: model priority chain (27B → MoE → 8B/2B), vision mmproj auto-detect, Windows JSON escaping for --chat-template-kwargs, FastAPI backend on :3000, Node CLI with Python REPL fallback
- setup.ps1: encoding and escaping fixes from real Windows debugging
SD-Turbo falls back to CPU on tight VRAM (~1 min per image instead of <1s). Worth it for 9B wave quality over the 2B. 8GB cards get the full stack.
- tsu.ps1: hide llama-server internals from the user (no model names, no localhost URLs, no "waiting 120s"). Just "Loading model...", "Starting up...", "Ready".
- requirements.txt: added fastapi, uvicorn, websockets, ddgs, pillow as required (were optional/missing; the tsu.ps1 backend needs them)
- setup.sh: DEPS includes all core packages
- vision_ground removed from bootstrap (17 tools). The 2B calls it repeatedly when no VL model exists. Auto-ground in agent.py still works — it imports the tool directly when generate_image fires.
- message_info + message_result strip non-ASCII before printing. The Windows cp1252 console crashes on emoji from the 2B.
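The non-ASCII strip is essentially one line; a sketch (the helper name is ours):

```python
def ascii_safe(text: str) -> str:
    """Drop non-ASCII characters (emoji, accents) so printed output
    survives a Windows cp1252 console instead of raising UnicodeEncodeError."""
    return text.encode("ascii", "ignore").decode("ascii")
```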
Both tsu (bash) and tsu.ps1 (Windows) silently check for updates on every launch: fetch, compare, pull if behind. No user action needed. Being offline is gracefully ignored.
build_exe.py bundles the complete tsunami package with all tool modules, file_watcher, serve_daemon, fastapi, uvicorn, rich, psutil, and ddgs. The GitHub Actions workflow uses build_exe.py instead of inline PyInstaller. Trigger: push to desktop/ or tsunami/, or manual dispatch.
Suren's battle-tested version with:
- PATH refresh after winget installs Python/Git (fixes "not found" after install)
- Lite mode starts one server only (matches eddy-is-a-role)
- Added fastapi, uvicorn, rich, psutil to pip install
- 8GB VRAM threshold (was 10)
…ered away from files
The 2B on Windows was using python_exec with hardcoded C:\ paths to do file operations instead of using file_read/file_write/match_glob.
- shell_exec: default cwd is now the workspace dir, not wherever Python started
- python_exec: description explicitly says NOT for file ops; use the file tools
setup.bat installs to llama-server\, setup.ps1 installed to llama.cpp\, and tsu.ps1 only looked in llama.cpp\. Now all aligned:
- setup.ps1: installs to llama-server\ (matches setup.bat)
- tsu.ps1: checks llama-server\ first, then llama.cpp\ as a fallback
Generated from docs/favicon.svg via Playwright rendering. 256/128/64/48/32/16px in one ICO. build_exe.py already uses --icon=icon.ico.
The exe was spawning visible console windows for every subprocess (llama-server, ws_bridge, diffusion). In Windows PyInstaller exes, child processes re-execute main() without freeze_support().
- All Popen calls: CREATE_NO_WINDOW + STARTF_USESHOWWINDOW on Windows
- Added multiprocessing.freeze_support() to prevent fork bombs
- ws_bridge Popen also hidden
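A sketch of the hidden-window Popen kwargs (these are the standard subprocess flags; the helper function is ours). Note that multiprocessing.freeze_support() must be the first call in main() for a frozen exe, or each child re-runs the launcher.

```python
import subprocess
import sys

def hidden_popen_kwargs() -> dict:
    """Extra Popen kwargs that stop child processes from flashing console
    windows on Windows; a no-op on other platforms."""
    if sys.platform != "win32":
        return {}
    si = subprocess.STARTUPINFO()
    si.dwFlags |= subprocess.STARTF_USESHOWWINDOW  # honor wShowWindow=hidden
    return {
        "creationflags": subprocess.CREATE_NO_WINDOW,  # no console at all
        "startupinfo": si,
    }

# usage: subprocess.Popen([...], **hidden_popen_kwargs())
```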
Standard Next > Next > Install wizard. Bundles the repo (~5MB), then runs setup.ps1 post-install to download models (~7GB) with progress.
- Start Menu + Desktop shortcut (wave icon)
- Add/Remove Programs uninstaller
- Cleans up models/workspace on uninstall
- GitHub Actions builds it on release or manual dispatch
The PyInstaller exe had: console flash, AV flags, no auto-update, fork bombs. The Inno Setup installer has: progress bar, Start Menu, Add/Remove Programs entry; runs setup.ps1 for model downloads; auto-updates via git pull in tsu.ps1.
Replace #13#10 (newlines) with Chr(13)+Chr(10) in the Pascal code block. The Inno Setup preprocessor runs before Pascal compilation and chokes on #13.
tsu.ps1 only looked in its own directory for models and llama-server. If installed via setup.bat/ps1, the files are in %USERPROFILE%\tsunami\. Now searches both locations.
The installer runs setup.ps1 post-install but setup.ps1 was downloading
models to %USERPROFILE%\tsunami while the shortcut runs from Program Files.
- Installer passes TSUNAMI_DIR={app} so setup.ps1 uses the install path
- setup.ps1 detects installer layout (files exist, no .git) and inits
git for future auto-updates instead of failing on git clone
Laptops with Intel + NVIDIA GPUs often have nvidia-smi at C:\Windows\System32\nvidia-smi.exe but not on PowerShell's PATH. Now checks the known location as a fallback.
…-fixes-clean
# Conflicts:
#   tsunami/tools/shell.py
TL;DR: Introduced a uniform workspace path resolver. Each tool had a different shape for path resolution; many didn't align with each other.
NOT TOUCHED IN THIS PR: browser.py / vision_ground.py:
- tsunami/tools/swell.py (line 69)
- tsunami/tools/swell_build.py (line 90) — assuming this may be intentional for the scaffolding view? If not, it should be on the uniform resolver
- tsunami/tools/python_exec.py (line 74)
- tsunami/tools/match.py (line 34)
- tsunami/tools/match.py (line 77)
- tsunami/tools/shell.py (line 115)
- tsunami/tools/swell_analyze.py (line 50): Path(self.config.workspace_dir).parent / stripped
- tsunami/tools/project_init.py (line 223)
- tsunami/tools/filesystem.py (line 56)
- tsunami/tools/webdev.py (line 80)
- tsunami/tools/browser.py (line 475) and tsunami/tools/vision_ground.py (line 67)
Further unification will allow reliable 'transient workspaces' via a configurable option.
This PR unifies workspace path handling across the main tool surface so the agent behaves more consistently between host and Docker-style environments.
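A resolver of the kind this PR introduces typically normalizes every tool-supplied path against the workspace root and rejects paths that escape it. A minimal sketch (illustrative, not the PR's actual code):

```python
from pathlib import Path

def resolve_workspace_path(workspace_dir: str, raw: str) -> Path:
    """Resolve a tool-supplied path against the workspace root.
    Relative paths are joined to the root; absolute paths must already
    be inside it; anything escaping the workspace is rejected."""
    root = Path(workspace_dir).resolve()
    candidate = Path(raw)
    resolved = (candidate if candidate.is_absolute() else root / candidate).resolve()
    if resolved != root and root not in resolved.parents:
        raise ValueError(f"path escapes workspace: {raw}")
    return resolved
```

Routing every tool through one function like this is what makes a configurable 'transient workspace' feasible: swap the root, and every tool follows.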
What was fixed:
Why this matters:
Validation:
Important caveat: