
Unify workspace path handling in tools#10

Open
meanaverage wants to merge 201 commits into gobbleyourdong:main from meanaverage:workspace-path-tool-fixes-clean

Conversation

@meanaverage

This PR unifies workspace path handling across the main tool surface so the agent behaves more consistently between host and Docker-style environments.

What was fixed:

  • introduced a shared workspace-path normalizer in tsunami/tools/filesystem.py
  • normalized common path variants seen in traces, including /app/workspace/..., /workspace/tsunami/workspace/..., /workspace/deliverables/..., and repo-root deliverables/...
  • updated match_glob and match_grep to resolve directories through the shared resolver instead of raw Path(...).resolve()
  • updated shell_exec so bad workspace roots inside command strings are normalized before execution and workdir is resolved through the same shared resolver
  • updated python_exec so ARK_DIR, WORKSPACE, and DELIVERABLES are derived from the configured workspace_dir instead of hardcoded ark_dir/workspace
  • updated swell_analyze to remove ad hoc Docker-specific fallback rewriting and use the shared resolver for both input directories and output paths
  • updated generate_image so save_path uses the same resolver instead of its own partial workspace/app/workspace stripping logic
  • updated project_init guidance so the model is told to use shell_exec with workdir='deliverables/' rather than absolute-path cd commands
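
The bullets above all funnel into one shared normalizer. A minimal sketch of what such a helper can look like, with illustrative names and a prefix list reconstructed from the variants seen in traces (the real helper in tsunami/tools/filesystem.py may differ):

```python
from pathlib import Path

# Hypothetical sketch of the shared workspace-path resolver; prefix list
# and function name are assumptions based on the variants listed above.
KNOWN_PREFIXES = (
    "/app/workspace/",
    "/workspace/tsunami/workspace/",
    "/workspace/",
    "workspace/",
)

def resolve_workspace_path(raw: str, workspace_dir: str) -> Path:
    """Map container-style and relative path variants onto workspace_dir."""
    ws = Path(workspace_dir).expanduser().resolve()
    p = str(raw)
    for prefix in KNOWN_PREFIXES:
        if p.startswith(prefix):
            # Strip the stale root and re-anchor on the configured workspace.
            return ws / p[len(prefix):]
    candidate = Path(p)
    if candidate.is_absolute():
        return candidate.resolve()
    # Plain relative forms (e.g. deliverables/...) resolve inside the workspace.
    return ws / candidate
```

With a single function like this, every tool that accepts a path routes through the same mapping instead of reinventing its own stripping logic.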

Why this matters:

  • before this change, different tools were interpreting the same project path in different ways
  • that led to runs where one tool correctly found workspace/deliverables/ and the next tool looked in /workspace/tsunami/workspace/..., repo-root deliverables/..., or another non-existent variant
  • the result was repeated false directory-not-found failures even when the project actually existed

Validation:

  • python3 -m py_compile passed for all modified tool modules

Important caveat:

  • this PR intentionally does not modify tsunami/tools/swell.py or tsunami/tools/swell_build.py
  • those tools currently use workspace_dir.parent / project-root-style working directories, and that may be intentional to preserve broader repo visibility for eddy workers
  • I left them unchanged and would like explicit review on whether they should remain repo-root scoped or also move to stricter workspace-relative semantics later

gobbleyourdong and others added 30 commits March 31, 2026 18:18
Default: 9B wave + 2B eddies (auto-scale to leftover mem) + SD-Turbo
Lite: 2B only, no image gen (for 4GB systems)
SD-Turbo (2GB) now included in full mode memory budget.
README simplified to two modes. 13 scaling tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Smoke test validates full one-click experience:
  models respond → wave reasons → eddies parallel → image gen → agent e2e
7/8 pass (image gen needs diffusers in system python — now fixed in installer).
Setup.sh now installs diffusers+torch+accelerate alongside core deps.
One command, everything works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 9B wave built this page with zero human intervention:
- Pure HTML + Tailwind CSS CDN (no build step)
- Hero with wave background + typewriter CLI animation
- Stats bar, architecture diagram, features grid
- Install section with curl command
- 258 lines, 12.8KB, oceanic theme

4 LLM calls, 34K tokens, $0.00 (local model).
This page was built by the tool it describes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ints

- tool_choice: required (Ark: MUST respond with function calling)
- Format: NEVER use bullets for deliverables, paragraphs mandatory
- Search: description now says use 3 query variants, visit sources
- Code: stronger "save to file first" rule, no inline complex code
- Skills/waveforms removed — dead code, AGI doesn't need plugins
- Disclosure protection already existed (verified)
- 607 tests passing, live 9B verified with tool_choice: required

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Solves the monolith problem — instead of the wave writing one huge file,
it specs components, dispatches eddies to write each one, then assembles.
Eddies return code via done tool (no filesystem write needed).
Registered in bootstrap tools. 607 tests passing.

Also: Snake game built autonomously by the 9B wave (209 lines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tool_choice: required overflows the 9B server when combined with
20 tool schemas (2515 tokens) + system prompt (4000 tokens).
The prompt rule "MUST respond with exactly one tool call" enforces
the same behavior without crashing the server.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Swell: more elegant, more violent. When agents spawn, the swell rises.
tide.py → swell.py, tide_analyze → swell_analyze, tide_build → swell_build.

README rewritten from scratch — no jargon, no walls of tables.
One command install, what it does, how it works, what you need.
Written for people who want to use it, not read about it.

607 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
20 tools (2515 tokens) + system prompt overflowed 9B context on
complex prompts causing 500 Internal Server Errors. Moved
swell_analyze, swell_build, shell_view, plan_advance, file_append
to loadable toolbox. 15 tools (1829 tokens) fits comfortably.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Break: where the wave meets the shore. Results converge at the break.
Full oceanic naming: wave → swell → eddies → break → output.

9B now runs with 32K context (was 16K) — fixes 500 errors on
large file generation. Alphabet tracer building.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CRITICAL FIX: Verify loop now in prompt as hard rule:
"NEVER call message_result on code you haven't verified.
Write → verify → fix → deliver."

Snake game: 3 iterations (broken) → 13 iterations (working) after
adding verify. This was the #1 reason all 4 apps were broken.

Also: agent loop nudges "save to files" every 5 iterations if
no file writes detected. Prevents context overflow on long tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Watcher caught 6 issues in alphabet tracer build that would have
shipped broken. Config updated: 9B wave, 8192 max tokens, watcher
on by default with 2B eddy at interval 3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wave builds → drag tests → wave fixes → drag retests → ship.
Static analysis (bracket matching, canvas checks, Three.js checks)
+ headless Playwright (console errors, page errors, canvas dims).

Found pinball bug immediately: "Cannot access scoreDisplay before
initialization." Snake, alphabet tracer, node editor all pass.

Named "drag" — the undertow that pulls back what's not ready.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k, undertow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three layers enforce fact-checking:
1. Prompt: mandatory 4-step triangulation (hypothesis → search → cross-ref → deduce)
2. Watcher: 2B checks if message_result has unverified claims
3. Agent loop: triangulation gate blocks delivery of factual claims
   that were never search-verified. Injects warning, forces verification.

From Manus's methodology: parametric memory is unreliable for specifics.
External sources win over training data. Always.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Current: measures tension (0.0 grounded → 1.0 hallucinating)
  - Heuristic: red flags, quality anchors, text patterns
  - Model probe: 2B eddy evaluates factual reliability
  - Correction: iterative re-asking until tension drops

Circulation: routes based on tension reading
  - Low → deliver directly
  - Capability gap → force search
  - Truth gap → explain contradiction
  - Critical → refuse ("I don't know")
  - Post-tool validation: reject results that increase tension

Pressure: tracks tension over time
  - Escalates: calm → moderate → heavy → crushing
  - Forces search after 2 consecutive high readings
  - Forces refusal after 4 consecutive high readings

Validated against THEGAP.md: correctly flags unverified claims,
delivers well-sourced content, forces search on hallucinated stats.
27 new tests, 634 total, all green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces hacky keyword triangulation gate with real tension measurement.
Before delivery: measure_heuristic scores the response (0.0-1.0).
Circulation routes based on score:
  - deliver (grounded)
  - force search (elevated tension, no prior search)
  - refuse (critical tension — say "I don't know")
Pressure tracks tension across session, escalates over time.

The wave can't hallucinate past the current anymore.
634 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Current/circulation/pressure now checks BOTH:
1. Tool choice: "is this the right tool?" (before execution)
2. Response: "is this grounded?" (before delivery)

Pressure tracks consecutive high-tension decisions:
  2+ → force search to ground the agent
  4+ → force message_ask for user guidance

Watcher (2B text reviewer) removed — replaced by the tension
system which is more rigorous and doesn't need an extra LLM call.

634 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
duckduckgo_search renamed to ddgs — was returning 0 results.
Added arXiv API search for research queries (https, follow redirects).
Both tested and working.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tide→swell throughout (tagline, footer, architecture)
- 607→634 tests
- Architecture cards: wave/eddies/swell/break/undertow (was flow/tide/whirlpool)
- Install URL: github raw (was hallucinated tsunami.ark.sh)
- Scaling table: real auto-scaling tiers (was hallucinated linear speedup)
- Eddies: up to 32 (was 4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The undertow tests code. This tests logic.

After the wave writes a synthesis with reasoning claims, an eddy
receives it as a hostile peer reviewer: "assume this is wrong,
find every logical error, scope error, unstated assumption."

If the reviewer finds flaws → objections injected back to wave,
delivery blocked until addressed.
If the reviewer finds no flaws → deliver.

Would have caught the tsunami_gap.md errors:
- "gap is narrower" (wrong — rescaling moves T³ to R³)
- "KNSS proven on R³" (wrong — only axisymmetric cases)
- Sobolev chain applied in wrong context (T³ vs R³ after rescaling)

Three quality gates now:
1. Current/tension — catches hallucinated facts
2. Undertow — catches broken code
3. Adversarial — catches broken reasoning

634 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Undertow completely rewritten as a lever-puller. The wave provides
what to test, the undertow just does it and reports facts. No diagnosis.

- Lever types: screenshot, press, click, read_text, console, wait
- Auto-generates levers from HTML (every ID, key binding, button)
- Eddy vision: screenshot description vs user intent comparison
- DOM + pixel diff for interaction levers (catches subtle changes)
- Visibility check before clicking (no more 30s timeouts)
- code_tension metric (lever fail ratio) feeds into pressure
- Research-before-building mandatory in prompt
- 500 retryable with backoff in model layer
- Delivery gate capped at 2 blocks (was infinite loop)
- Info loop detector tightened to 3/6 (was 5/10)
- arXiv User-Agent header (fixes 429s)
- Current measures prose tension only (undertow measures code)

Results: pinball went from black screen (0.62 tension) to
rendered 3D table (0.21 tension) in fewer iterations (13 vs 25).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root now has only run.py (main entry) and serve_diffusion.py (SD-Turbo).
All test harnesses, stress tests, and verification scripts in tests/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added sections for current/circulation/pressure (the tension system),
the undertow lever-puller architecture, and research-before-building.
Includes before/after results: black screen → rendered 3D pinball.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New search_type="code" hits GitHub API for repos and source code.
No auth needed (public API). Returns repos sorted by stars with
direct links to browse code. Wave reads real implementations
instead of hallucinating API calls.

Tested: "three.js pinball physics" → found pinball-xr (cannon-es),
Three.js forum discussions with working CCD physics examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gobbleyourdong and others added 24 commits April 2, 2026 14:54
Double-clicking the .ps1 opens a new window that closes when the
script finishes. Added ReadKey pause at end and before early returns
so users can read the output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
generate_image now runs SD-Turbo directly in the agent process.
Auto-downloads the 2GB model on first use via HuggingFace diffusers.
No separate server needed. 1 inference step, <1s on GPU.

- generate.py: _try_sd_turbo_local as primary backend, placeholder fallback
- serve_diffusion.py: rewritten for SD-Turbo (was Qwen-Image-2512/13GB)
- setup.sh: installs diffusers+torch+transformers+accelerate

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 2B isn't "gimped" — it's the same model acting as wave on lite boxes.
All eddy endpoint references now read from config.eddy_endpoint (propagated
via TSUNAMI_EDDY_ENDPOINT env var). On lite mode, eddy_endpoint points at
the wave's port (:8090) — one server, one model, both roles.

- config.py: added eddy_endpoint field
- agent.py: propagates config.eddy_endpoint to env var at init
- Replaced all hardcoded :8092 / TSUNAMI_BEE_ENDPOINT across 10 files
- launcher.py: lite mode starts ONE server, sets TSUNAMI_EDDY_ENDPOINT=:8090
- docker-entrypoint.sh: same lite mode fix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When file_write sees useState/useEffect/useRef in a .tsx file without
a React import, auto-prepend the import. The 2B (lite mode wave)
consistently forgets this — builds pass but runtime crashes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Battle-tested on actual Windows hardware. Key fixes over previous version:
- tsu.ps1: model priority chain (27B → MoE → 8B/2B), vision mmproj
  auto-detect, Windows JSON escaping for --chat-template-kwargs,
  FastAPI backend on :3000, Node CLI with Python REPL fallback
- setup.ps1: encoding and escaping fixes from real Windows debugging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SD-Turbo goes to CPU on tight VRAM (~1min instead of <1s). Worth it
for the 9B wave quality over 2B. 8GB cards get the full stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tsu.ps1: hide llama-server internals from user (no model names,
  no localhost URLs, no "waiting 120s"). Just "Loading model...",
  "Starting up...", "Ready".
- requirements.txt: added fastapi, uvicorn, websockets, ddgs, pillow
  as required (were optional/missing, tsu.ps1 backend needs them)
- setup.sh: DEPS includes all core packages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- vision_ground removed from bootstrap (17 tools). The 2B calls it
  repeatedly when no VL model exists. Auto-ground in agent.py still
  works — it imports the tool directly when generate_image fires.
- message_info + message_result strip non-ASCII before printing.
  Windows cp1252 console crashes on emoji from the 2B.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both tsu (bash) and tsu.ps1 (Windows) silently check for updates
on every launch. Fetch, compare, pull if behind. No user action
needed. Offline gracefully ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
build_exe.py: bundles complete tsunami package with all tool modules,
file_watcher, serve_daemon, fastapi, uvicorn, rich, psutil, ddgs.
GitHub Actions workflow uses build_exe.py instead of inline pyinstaller.

Trigger: push to desktop/ or tsunami/, or manual dispatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Suren's battle-tested version with:
- PATH refresh after winget installs Python/Git (fixes "not found" after install)
- Lite mode starts one server only (matches eddy-is-a-role)
- Added fastapi, uvicorn, rich, psutil to pip install
- 8GB VRAM threshold (was 10)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ered away from files

The 2B on Windows was using python_exec with hardcoded C:\ paths to
do file operations instead of using file_read/file_write/match_glob.

- shell_exec: default cwd is now workspace dir, not wherever Python started
- python_exec: description explicitly says NOT for file ops, use file tools

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
setup.bat installs to llama-server\, setup.ps1 installed to llama.cpp\,
tsu.ps1 only looked in llama.cpp\. Now all aligned:
- setup.ps1: installs to llama-server\ (matches setup.bat)
- tsu.ps1: checks llama-server\ first, then llama.cpp\ as fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generated from docs/favicon.svg via Playwright rendering.
256/128/64/48/32/16px in one ICO. build_exe.py already uses --icon=icon.ico.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The exe was spawning visible console windows for every subprocess
(llama-server, ws_bridge, diffusion). On Windows PyInstaller exes,
child processes re-execute main() without freeze_support().

- All Popen calls: CREATE_NO_WINDOW + STARTF_USESHOWWINDOW on Windows
- Added multiprocessing.freeze_support() to prevent fork bombs
- ws_bridge Popen also hidden

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standard Next>Next>Install wizard. Bundles the repo (~5MB), then
runs setup.ps1 post-install to download models (~7GB) with progress.

- Start Menu + Desktop shortcut (wave icon)
- Add/Remove Programs uninstaller
- Cleans up models/workspace on uninstall
- GitHub Actions builds it on release or manual dispatch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PyInstaller exe had: console flash, AV flags, no auto-update, fork bombs.
Inno Setup installer has: progress bar, Start Menu, Add/Remove Programs,
runs setup.ps1 for model downloads, auto-updates via git pull in tsu.ps1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace #13#10 (newlines) with Chr(13)+Chr(10) in Pascal code block.
Inno Setup preprocessor runs before Pascal compilation and chokes on #13.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tsu.ps1 only looked in its own directory for models and llama-server.
If installed via setup.bat/ps1, files are in %USERPROFILE%\tsunami\.
Now searches both locations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The installer runs setup.ps1 post-install but setup.ps1 was downloading
models to %USERPROFILE%\tsunami while the shortcut runs from Program Files.

- Installer passes TSUNAMI_DIR={app} so setup.ps1 uses the install path
- setup.ps1 detects installer layout (files exist, no .git) and inits
  git for future auto-updates instead of failing on git clone

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gobbleyourdong and others added 2 commits April 2, 2026 19:06
Laptops with Intel + NVIDIA often have nvidia-smi at
C:\Windows\System32\nvidia-smi.exe but not in PowerShell's PATH.
Now checks the known location as fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-fixes-clean

# Conflicts:
#	tsunami/tools/shell.py
@meanaverage
Author

TL;DR

Introduced a uniform workspace path resolver. Each tool had its own shape of path resolution, and many did not align with each other.

NOT TOUCHED IN THIS PR:

browser.py / vision_ground.py:
mostly accept user-provided absolute paths, which is a different category from the workspace-relative tool confusion. Possibly intentional? Not touched in this PR.

tsunami/tools/swell.py (line 69)
sets project_root = os.path.dirname(os.path.abspath(self.config.workspace_dir)), meaning it runs relative to the parent of the workspace, not the workspace itself. That is different again from the other tools.

tsunami/tools/swell_build.py (line 90)
uses Path(self.config.workspace_dir).parent as workdir, which shifts execution one directory above the workspace.

This may be intentional, to preserve a scaffolding-level view of the repo; if not, these should move onto the uniform resolver.


tsunami/tools/python_exec.py (line 74)
It hardcodes the repo workspace as ark_dir/workspace and ark_dir/workspace/deliverables and does os.chdir(ark_dir). It ignores self.config.workspace_dir, so host and Docker can drift immediately.
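
A hedged sketch of the fix applied in this PR: derive the well-known directories from the configured workspace instead of a hardcoded root (attribute and variable names here are illustrative, not the actual python_exec code):

```python
from pathlib import Path

def derive_exec_dirs(workspace_dir: str) -> dict:
    """Derive ARK_DIR/WORKSPACE/DELIVERABLES from the configured
    workspace_dir rather than hardcoding ark_dir/workspace."""
    ws = Path(workspace_dir).expanduser().resolve()
    return {
        "ARK_DIR": str(ws.parent),                # repo root = workspace parent
        "WORKSPACE": str(ws),
        "DELIVERABLES": str(ws / "deliverables"),
    }
```

Because everything flows from one config value, host and Docker layouts can no longer drift apart the moment the interpreter starts.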

tsunami/tools/match.py (line 34)
match_glob resolves directory with Path(directory).expanduser().resolve() directly. It does not use the shared resolver from filesystem.py, so deliverables/..., workspace/..., and /workspace/... forms behave differently here.

tsunami/tools/match.py (line 77)
match_grep has the same issue as match_glob: direct Path(...).resolve(), no normalization via the canonical workspace path logic.
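
The difference for both match tools can be sketched as follows; resolve_dir stands in for the shared resolver in filesystem.py (hypothetical names, not the real signatures):

```python
from pathlib import Path

def resolve_dir(directory: str, workspace_dir: str) -> Path:
    """Workspace-aware stand-in for the shared resolver."""
    p = Path(directory).expanduser()
    if p.is_absolute():
        return p.resolve()
    # Relative forms like "deliverables/" anchor to the workspace,
    # not to whatever the process cwd happens to be.
    return (Path(workspace_dir) / p).resolve()

def match_glob(pattern: str, directory: str, workspace_dir: str) -> list:
    # Before the fix: Path(directory).expanduser().resolve() resolved
    # relative directories against the process cwd.
    base = resolve_dir(directory, workspace_dir)
    return sorted(str(p) for p in base.glob(pattern))
```

The same one-line substitution applies to match_grep, so both tools agree on what "deliverables/" means.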

tsunami/tools/shell.py (line 115)
shell_exec resolves workdir relative to the repo root if it is not already absolute. That is inconsistent with the rest of the tool stack, which is supposed to operate relative to workspace_dir or the active project.

tsunami/tools/swell_analyze.py (line 50)
This tool contains custom fallback rewrites:

  • Path(self.config.workspace_dir).parent / stripped
  • directory.replace("/workspace/", "workspace/")

That is a separate normalization system, clearly Docker-shaped, and inconsistent with filesystem._resolve_path.

tsunami/tools/generate.py (line 48)
It strips only "workspace/" and "app/workspace/" before joining onto workspace_dir. That is another one-off normalization scheme, different from both filesystem.py and the broken /workspace/tsunami/... paths showing up in traces.
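
The failure mode of that kind of partial stripping can be sketched like this (an illustrative reconstruction, not the actual generate.py code):

```python
from pathlib import Path

def partial_strip(save_path: str, workspace_dir: str) -> str:
    """One-off scheme: strips only 'workspace/' and 'app/workspace/'."""
    for prefix in ("app/workspace/", "workspace/"):
        if save_path.startswith(prefix):
            save_path = save_path[len(prefix):]
            break
    # pathlib discards the left operand when the right side is absolute,
    # so any unhandled absolute variant bypasses the workspace entirely.
    return str(Path(workspace_dir) / save_path)
```

The relative forms it knows about land in the right place, but an absolute variant such as /app/workspace/... sails straight through untouched, which is exactly the class of divergence the shared resolver is meant to close.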

tsunami/tools/project_init.py (line 223)
The tool result teaches the model to run:
shell_exec 'cd {project_dir} && npx vite build'
That encourages absolute-path cd usage, which fights the idea of project-relative semantics and amplifies shell_exec’s inconsistent cwd handling.

tsunami/tools/filesystem.py (line 56)
This is the closest thing to a canonical resolver, but only the filesystem tools use it consistently. The inconsistency is less in this file and more that the rest of the tool stack bypasses it.

tsunami/tools/webdev.py (line 80)
webdev mostly uses Path(self.config.workspace_dir) correctly, but it still mixes in repo-root assumptions in places like asset download handling around tsunami/tools/webdev.py (line 582), where it computes ark_dir directly.

tsunami/tools/browser.py (line 475) and tsunami/tools/vision_ground.py (line 67)
These resolve absolute paths directly with Path(...).resolve(). That may be fine for user-provided files, but it is another path convention that does not flow through workspace-aware normalization.

@meanaverage
Author

Further unification will allow reliable 'transient workspaces' via a configurable option.


3 participants