feat: Onboarding Phase 2 — chat-first setup + forced password change#733

Closed
MiltonSilvaJr wants to merge 59 commits into nextlevelbuilder:main from vellus-ai:claude/feat/onboarding-phase2-impl

Conversation

@MiltonSilvaJr

Summary

  • OnboardingStore PostgreSQL — 4 methods (UpdateTenantSettings, UpdateTenantBranding, GetOnboardingStatus, CompleteOnboarding) with UPSERT on-demand, tenant isolation
  • Migration 000031: setup_progress table + must_change_password column on users
  • Gateway wiring: wireOnboardingTools() registers 8 tools + group:onboarding after stores init
  • Forced password change: POST /v1/auth/change-password endpoint (PCI DSS, history check, audit), JWT mcp claim, Radix Dialog blocking modal, i18n in 8 languages
  • E2E tests — full onboarding flow, tenant isolation, change password

Test plan

  • go vet ./internal/... — zero warnings
  • go test ./internal/auth/... — all passing
  • go test ./internal/http/... — 29 tests passing (6 new for change-password)
  • go test ./internal/tools/... — 38 onboarding tool tests passing (pre-existing path failures on Windows)
  • go build — cross-compile for Linux succeeds
  • TypeScript compiles without errors (tsc --noEmit)
  • Integration tests with real DB (go test -tags integration ./internal/store/pg/... ./tests/onboarding_e2e/...)
  • Manual: login with temporary password → modal appears → change password → modal closes → normal access

🤖 Generated with Claude Code

Milton Silva and others added 30 commits March 21, 2026 23:10
Independent fork of GoClaw, maintained by Vellus for the ARGO product.

Full rename across 444 files:
- Go module: github.com/nextlevelbuilder/goclaw → github.com/vellus-ai/argoclaw
- Env vars: GOCLAW_* → ARGOCLAW_*
- Headers: X-GoClaw-User-Id → X-ArgoClaw-User-Id
- Frontend UI strings: GoClaw → ARGO (public brand)
- Docker, scripts, configs: goclaw → argoclaw
- OpenAPI spec updated

Naming rule:
- Backend/code: ArgoClaw
- Frontend/public: ARGO
- Company: Vellus
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sprint 0 — Security hardening before feature development.

HIGH fixes:
- #1: Whitelist table names in execMapUpdate() — prevents SQL injection
  via dynamic table name (store/pg/helpers.go)
- #2: Log invalid groupBy values in snapshot queries (store/pg/snapshot.go)
- #3: Validated shellEscape() — single-quote wrapping is correct;
  added PBT tests for shell injection (tools/dynamic_tool_security_test.go)

MEDIUM fixes:
- #4-5: Log security warnings for no-token and viewer-fallback auth
  (gateway/router.go)
- #6: Restrict CORS on OpenAPI endpoint — removed wildcard, allow only
  localhost origins (http/openapi.go)
- #7: Add CheckSSRFWithPinning() for DNS rebinding TOCTOU prevention
  (tools/web_shared.go)
- #8: Log warning when TLS verification is disabled
  (tracing/otelexport/exporter.go)
- #9: Pin all Python package versions in Dockerfile — prevents
  supply chain attacks via unpinned dependencies
- #10: Change HOME fallback from /tmp to /app — prevents temp dir
  abuse (tools/credentialed_exec.go)

Also fixes arargoclaw double-rename bug in 356 Go import paths.

Tests: PBT tests for table whitelist and shell escaping (testing/quick).
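
The two HIGH fixes above (#1 and #3) can be sketched as follows. This is a minimal illustration, not the code in store/pg/helpers.go or the tools package: the allowlist contents and the helper name checkTable are assumptions, and shellEscape is written here from the description "single-quote wrapping" only.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedTables is a hypothetical allowlist; the real list lives in the fork.
var allowedTables = map[string]bool{
	"agents":        true,
	"llm_providers": true,
}

// checkTable rejects any table name that is not explicitly allowlisted,
// so a dynamic name can never carry SQL into the statement text.
func checkTable(name string) error {
	if !allowedTables[name] {
		return fmt.Errorf("table %q is not allowed", name)
	}
	return nil
}

// shellEscape wraps s in single quotes and rewrites each embedded single
// quote as '\'' (close the quote, emit an escaped quote, reopen the quote).
func shellEscape(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}
```

Inside single quotes a POSIX shell performs no expansion, which is why wrapping plus the quote rewrite is sufficient for values such as $(id).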

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
security: fix 10 AppSec audit findings (3 HIGH, 7 MEDIUM)
TDD + PBT implementation of email+password authentication:

Migration 026:
- users table (per-tenant, email unique, Argon2id hash, lockout)
- password_history (last 4 passwords, PCI DSS reuse prevention)
- user_sessions (JWT refresh tokens, SHA-256 hashed)
- login_audit (success/failure/lockout logging)

Password validation (PCI DSS):
- Minimum 12 characters
- Requires: uppercase, lowercase, digit, special character
- Rejects passwords containing email local part
- History check: prevents reuse of last 4 passwords
- Argon2id hashing (OWASP params: 64MB, 3 iterations, 4 threads)
- Constant-time hash comparison (crypto/subtle)
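
As a rough sketch of the character-class and email rules listed above (history check and Argon2id hashing omitted), assuming a hypothetical helper named validatePassword and treating punctuation and symbols as "special" characters:

```go
package main

import (
	"errors"
	"strings"
	"unicode"
)

// validatePassword is a hypothetical helper that mirrors the listed rules:
// minimum length, four character classes, and no email local part.
func validatePassword(password, email string) error {
	if len([]rune(password)) < 12 {
		return errors.New("password must be at least 12 characters")
	}
	var upper, lower, digit, special bool
	for _, r := range password {
		switch {
		case unicode.IsUpper(r):
			upper = true
		case unicode.IsLower(r):
			lower = true
		case unicode.IsDigit(r):
			digit = true
		case unicode.IsPunct(r) || unicode.IsSymbol(r):
			special = true
		}
	}
	if !(upper && lower && digit && special) {
		return errors.New("password needs uppercase, lowercase, digit and special character")
	}
	// Compare case-insensitively so "Milton" in the password matches "milton@...".
	local, _, _ := strings.Cut(email, "@")
	if local != "" && strings.Contains(strings.ToLower(password), strings.ToLower(local)) {
		return errors.New("password must not contain the email local part")
	}
	return nil
}
```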

JWT tokens:
- Access token: HS256, 15min expiry, contains uid/email/tid/role
- Refresh token: 32 random bytes, SHA-256 hash stored in DB
- Round-trip validation with PBT (1000 iterations)
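
The refresh-token scheme above might look roughly like this. The function names and the hex encoding are assumptions for illustration; only "32 random bytes" and "SHA-256 hash stored in DB" come from the commit message.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
)

// hashRefreshToken returns the hex-encoded SHA-256 digest of the raw token.
// Only this digest would be stored; the raw value goes to the client.
func hashRefreshToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// generateRefreshToken draws 32 bytes from the OS CSPRNG and returns the
// raw token together with the hash that would be persisted.
func generateRefreshToken() (raw, hash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	raw = hex.EncodeToString(buf) // hex encoding is an assumption
	return raw, hashRefreshToken(raw), nil
}
```

Storing only the digest means a leaked sessions table does not directly yield usable refresh tokens.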

Tests (TDD + PBT):
- 13 unit tests for password validation
- PBT: strong passwords always accepted (5k iterations)
- PBT: alpha-only passwords always rejected (5k iterations)
- PBT: hash always verifiable (200 iterations)
- PBT: JWT round-trip (1k iterations)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete auth flow implementation:

Store layer (internal/store):
- UserStore interface: CRUD users, password history, sessions, audit
- PGUserStore: PostgreSQL implementation with parameterized queries

HTTP handlers (POST /v1/auth/*):
- /register: email+password, PCI DSS validation, Argon2id hash
- /login: constant-time user enumeration prevention, lockout (5 attempts,
  30min), audit logging, JWT issuance
- /refresh: token rotation (revoke old, issue new)
- /logout: session revocation

JWT middleware:
- Extracts Bearer JWT from Authorization header
- Validates and injects claims into context
- Sets X-ArgoClaw-User-Id header for backward compatibility
- Pass-through for gateway tokens (no dots = not JWT)
- RequireUserAuth() wrapper for JWT-only endpoints
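
The pass-through rule above ("no dots = not JWT") can be illustrated with two small helpers. Both names are hypothetical, and the real middleware may parse the header differently.

```go
package main

import "strings"

// bearerToken extracts the token from an Authorization header value.
// It reports false when the header is not a non-empty Bearer credential.
func bearerToken(header string) (string, bool) {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return "", false
	}
	tok := strings.TrimSpace(header[len(prefix):])
	return tok, tok != ""
}

// looksLikeJWT applies the heuristic from the commit message: a token
// without any dot is treated as a gateway token and passed through.
func looksLikeJWT(token string) bool {
	return strings.Contains(token, ".")
}
```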

Security:
- Constant-time password check (Argon2id + subtle.ConstantTimeCompare)
- User enumeration prevention (burn time on non-existent email)
- Account lockout with audit trail
- Refresh token rotation (old token revoked on use)
- IP + User-Agent logged on all auth events
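
A minimal sketch of the lockout rule (5 failed attempts, 30 minutes), assuming an in-memory state struct. In the PR this state lives on the users table; the type and method names below are illustrative only.

```go
package main

import "time"

const (
	maxFailedAttempts = 5
	lockoutDuration   = 30 * time.Minute
)

// loginState is a hypothetical stand-in for the per-user lockout columns.
type loginState struct {
	failedAttempts int
	lockedUntil    time.Time
}

// isLocked reports whether the account is still inside its lockout window.
func (s *loginState) isLocked(now time.Time) bool {
	return now.Before(s.lockedUntil)
}

// recordFailure counts a failed login and starts the lockout window once
// the threshold is reached.
func (s *loginState) recordFailure(now time.Time) {
	s.failedAttempts++
	if s.failedAttempts >= maxFailedAttempts {
		s.lockedUntil = now.Add(lockoutDuration)
	}
}

// lockedAfter reports whether an account is locked after n consecutive failures.
func lockedAfter(n int) bool {
	now := time.Now()
	var s loginState
	for i := 0; i < n; i++ {
		s.recordFailure(now)
	}
	return s.isLocked(now)
}
```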

Tests (TDD):
- 4 middleware tests (valid JWT, invalid JWT, no token, gateway token)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(auth): PCI DSS email+password authentication
…-label

Tenant = company/client (Vellus, Axis, Pitflow), NOT individual user.

Migration 027:
- tenants table (slug, name, plan, status, Stripe customer ID)
- tenant_users (N:N link with role: owner/admin/member)
- tenant_branding (logo, favicon, primary color, WCAG AA palette,
  custom domain, sender email, product name)
- Added tenant_id column to: agents, llm_providers, sessions,
  channel_instances, agent_teams, cron_jobs, custom_tools,
  mcp_servers, skills
- Indexes on all tenant_id columns

Store layer:
- TenantStore interface: CRUD tenants, membership, branding
- PGTenantStore: PostgreSQL with parameterized queries
- Updated allowedTables + tablesWithUpdatedAt whitelists

Tenant middleware:
- Extracts tenant_id from JWT claims
- Injects into request context for downstream isolation
- RequireTenant() wrapper for tenant-only endpoints
- Pass-through for gateway token mode (backward compat)

Tests (TDD):
- 5 tests: tenant injection, no-JWT pass-through, require tenant
  rejects/allows, nil when empty

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(tenancy): enterprise multi-tenancy + white-label branding
Three features in one commit:

1. White-label branding (HTTP endpoints):
   - GET /v1/branding — get tenant branding (logo, colors, domain)
   - PUT /v1/branding — update branding config
   - GET /v1/branding/domain/{domain} — resolve branding by custom domain
   - WCAG AA palette support (JSON field for AI-generated colors)

2. i18n: 5 new backend locales (124 keys each):
   - pt (Brazilian Portuguese) — primary market
   - es (Spanish)
   - fr (French)
   - it (Italian)
   - de (German)
   Total: 8 locales (en, vi, zh + 5 new)

3. ARGO personality presets (replacing GoClaw defaults):
   - 🚀 Captain (Capitão) — strategic advisor, executive
   - ⚡ Helmsman (Timoneiro) — operations, project management
   - 🔍 Lookout (Vigia) — research, analysis
   - 🎯 Gunner (Artilheiro) — data, finance, KPIs
   - 🧭 Navigator (Navegador) — legal, compliance, governance
   - 🛠️ Smith (Ferreiro) — technical, engineering, DevOps
   Full pt-BR translations included.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: white-label branding + i18n 8 locales + ARGO presets
* merge(upstream): PR nextlevelbuilder#226 from GoClaw community

* merge(upstream): PR nextlevelbuilder#314 from GoClaw community

* merge(upstream): PR nextlevelbuilder#356 from GoClaw community

* merge(upstream): PR nextlevelbuilder#352 from GoClaw community

* merge(upstream): GoClaw PR nextlevelbuilder#339 — add curl to Docker runtime image

* docs: CHANGELOG ArgoClaw — upstream merges + internal history

Track all modifications: 5 upstream GoClaw PRs merged,
3 pending conflict resolution, 6 under review,
2 rejected/skipped. Plus internal Sprint 0 features.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Daily automated check (05:00 BRT) that:
- Fetches all open PRs from nextlevelbuilder/goclaw
- Classifies by type (security/bug, feature, build/docs)
- Tests patch applicability against our ArgoClaw fork
- Creates/updates tracking issue with report
- Optional Telegram notification (commented, enable later)

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Send daily report via Resend to milton@vellus.tech.
This workflow ONLY monitors and reports — no auto-merge.
All merges require manual Code Review + AppSec approval.

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… + TDD/PBT (#8)

Upstream GoClaw PR nextlevelbuilder#316: project-scoped MCP isolation with env overrides.

Security hardening (ArgoClaw):
- Env var blocklist: blocks 50+ dangerous vars (LD_PRELOAD, PATH, HOME,
  SHELL, NODE_OPTIONS, PYTHONPATH, GOCLAW_*, POSTGRES_*, etc.)
- Prefix blocklist: LD_*, DYLD_*, GOCLAW_*, ARGOCLAW_*, POSTGRES_*
- Case-insensitive validation
- Immutable field protection: id, created_by, created_at, tenant_id
  cannot be modified via UpdateProject
- tenant_id added to projects table (multi-tenancy)
- UNIQUE constraint scoped by tenant_id
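
The env var blocklist described above could be approximated like this. The exact-name set is a small illustrative subset of the "50+" variables, and isBlockedEnvVar is a hypothetical name; the prefixes are the ones listed in the commit message.

```go
package main

import "strings"

// blockedNames is an illustrative subset of the exact-match blocklist.
var blockedNames = map[string]bool{
	"LD_PRELOAD":   true,
	"PATH":         true,
	"HOME":         true,
	"SHELL":        true,
	"NODE_OPTIONS": true,
	"PYTHONPATH":   true,
}

// blockedPrefixes are the prefixes named in the commit message.
var blockedPrefixes = []string{"LD_", "DYLD_", "GOCLAW_", "ARGOCLAW_", "POSTGRES_"}

// isBlockedEnvVar normalizes the name to upper case so that validation is
// case-insensitive, then checks exact names and prefixes.
func isBlockedEnvVar(name string) bool {
	upper := strings.ToUpper(strings.TrimSpace(name))
	if blockedNames[upper] {
		return true
	}
	for _, p := range blockedPrefixes {
		if strings.HasPrefix(upper, p) {
			return true
		}
	}
	return false
}
```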

Tests (TDD + PBT):
- 47 unit tests covering all security controls
- Property-Based Testing: 2500+ random prefix tests, 1000+ random
  safe var tests using testing/quick
- All tests PASS (verified on VM goclaw-pilot)

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…le system prompt + PBT (#9)

Upstream GoClaw PR nextlevelbuilder#343: Anthropic OAuth setup token support.

ArgoClaw enhancements:
- OAuth system prompt now configurable via oauthSystemPrompt field
  (default: Claude Code identifier, overridable per-provider)
- Prevents forced persona degradation in ARGO agents

Tests (TDD + PBT):
- 6 tests: 2500+ random inputs via testing/quick
- PBT: valid setup tokens always accepted (500 random)
- PBT: valid API keys always accepted (500 random)
- PBT: random strings always rejected (1000 random)
- PBT: short tokens always rejected (500 random)
- All tests PASS (verified on VM goclaw-pilot)

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ting + PBT (#10)

Upstream GoClaw PR nextlevelbuilder#202: preserve @mentions with underscores
in Telegram markdown conversion + bot-to-bot mention routing.

ArgoClaw security note: bot routing needs tenant-scoped auth
(documented for future sprint).

Tests (TDD + PBT):
- PBT: single mention preserved (500 random usernames)
- PBT: multiple mentions preserved (300 random combinations)
- PBT: no false @ injection (500 random texts)
- Original: mention preservation test
- All tests PASS (verified on VM goclaw-pilot)

Conflict resolved: cmd/gateway_consumer.go (handoff + reset)

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…E fix + Zalo QR restart + TDD/PBT (#11)

PR nextlevelbuilder#182 (cherry-pick core fixes, without Party Mode):
- Sort non-contiguous SSE tool_call indices (prevents nil pointer panic)
- Log truncated tool call arguments instead of silently discarding
- extractDefaultModel from provider settings JSONB

PR nextlevelbuilder#346:
- Zalo QR session restart: cancel previous session instead of blocking

Tests (TDD + PBT):
- Non-contiguous indices: 1000+ PBT random inputs
- Truncated JSON arguments: 6 edge cases
- All tests PASS (verified on VM goclaw-pilot)

PR nextlevelbuilder#350: SKIPPED — core fix (generateId) already in PR nextlevelbuilder#352.
Provider listing UX improvements deferred.

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d missing symbols

- Rename 000027_projects → 000028_projects (conflict with 000027_multi_tenancy)
- Bump RequiredSchemaVersion to 28
- Replace all github.com/nextlevelbuilder/goclaw imports with github.com/vellus-ai/argoclaw
- Fix missing sessions/providers imports in gateway_consumer.go
- Fix LaneDelegate → LaneTeam (renamed in refactor commit 49441f7)
- Run go mod tidy to clean upstream dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Critical AppSec fix: adds multi-tenant isolation to the PostgreSQL store
layer. Previously, the middleware injected tenant_id into context but
store queries did not filter by it, allowing cross-tenant data access.

Changes:
- Add WithTenantID/TenantIDFromContext to store context helpers
- Refactor tenant_middleware to use store.WithTenantID (single source)
- Add tenantIDFromCtx() and execMapUpdateTenant() helpers in pg package
- Fix 7 store files (60+ methods) to filter by tenant_id:
  - agents.go: all CRUD + shares + access checks (12 methods)
  - providers.go: all CRUD (6 methods) — API key isolation
  - channel_instances.go: all CRUD + credentials (8 methods)
  - mcp_servers.go: all CRUD (5 methods) — server credential isolation
  - custom_tools.go: all CRUD + list variants (9 methods)
  - teams.go: CRUD methods (5 methods)
  - helpers.go: new execMapUpdateTenant with tenant WHERE clause

Backwards-compatible: when tenant_id is not in context (uuid.Nil),
filters are skipped (single-tenant / gateway token mode).
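
The context helpers and the backwards-compatible filter might be sketched as below. TenantID is a local stand-in for uuid.UUID so the example needs only the standard library, and tenantFilter and demoQueries are hypothetical helpers rather than the functions in the pg package.

```go
package main

import (
	"context"
	"fmt"
)

// TenantID stands in for uuid.UUID; its zero value plays the role of uuid.Nil.
type TenantID [16]byte

type tenantKey struct{}

// WithTenantID stores the tenant in the request context.
func WithTenantID(ctx context.Context, id TenantID) context.Context {
	return context.WithValue(ctx, tenantKey{}, id)
}

// TenantIDFromContext returns the zero TenantID when none was set.
func TenantIDFromContext(ctx context.Context) TenantID {
	id, _ := ctx.Value(tenantKey{}).(TenantID)
	return id
}

// tenantFilter appends "AND tenant_id = $N" only when a tenant is present,
// so single-tenant / gateway token mode keeps the original query.
func tenantFilter(ctx context.Context, query string, args []any) (string, []any) {
	id := TenantIDFromContext(ctx)
	if id == (TenantID{}) {
		return query, args
	}
	args = append(args, id[:])
	return fmt.Sprintf("%s AND tenant_id = $%d", query, len(args)), args
}

// demoQueries builds the same lookup without and with a tenant in context.
func demoQueries() (single, multi string) {
	base := "SELECT name FROM agents WHERE id = $1"
	single, _ = tenantFilter(context.Background(), base, []any{"agent-1"})
	multi, _ = tenantFilter(WithTenantID(context.Background(), TenantID{1}), base, []any{"agent-1"})
	return single, multi
}
```

The placeholder index is derived from the argument count, so the filter composes with queries that already bind parameters.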

Stores NOT yet fixed (lower priority, no credentials):
- cron_crud.go (methods lack ctx parameter — interface change needed)
- sessions*.go (session key encodes context, lower risk)
- skills*.go (deferred to next sprint)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents injection and malformed data in tenant branding configuration:
- primary_color: must match ^#[0-9A-Fa-f]{6}$ (hex color)
- logo_url / favicon_url: must use https:// scheme (prevents javascript: XSS)
- sender_email: validated with net/mail.ParseAddress
- product_name: max 100 characters
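
A sketch of these four checks using only the standard library. The Branding struct and validateBranding are assumed names, and treating empty fields as "not set" is an assumption as well.

```go
package main

import (
	"errors"
	"net/mail"
	"net/url"
	"regexp"
	"unicode/utf8"
)

var hexColor = regexp.MustCompile(`^#[0-9A-Fa-f]{6}$`)

// Branding is a hypothetical subset of the tenant branding payload.
type Branding struct {
	PrimaryColor, LogoURL, FaviconURL, SenderEmail, ProductName string
}

// validateBranding applies the listed rules; empty fields are skipped.
func validateBranding(b Branding) error {
	if b.PrimaryColor != "" && !hexColor.MatchString(b.PrimaryColor) {
		return errors.New("primary_color must be a 6-digit hex color")
	}
	for _, raw := range []string{b.LogoURL, b.FaviconURL} {
		if raw == "" {
			continue
		}
		// Requiring the https scheme rejects javascript: and data: URLs.
		u, err := url.Parse(raw)
		if err != nil || u.Scheme != "https" || u.Host == "" {
			return errors.New("logo_url and favicon_url must use https://")
		}
	}
	if b.SenderEmail != "" {
		if _, err := mail.ParseAddress(b.SenderEmail); err != nil {
			return errors.New("sender_email is not a valid address")
		}
	}
	if utf8.RuneCountInString(b.ProductName) > 100 {
		return errors.New("product_name must be at most 100 characters")
	}
	return nil
}
```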

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: resolve duplicate migration 000027 + fix upstream imports
fix(security): enforce tenant_id filtering in all store queries
fix(security): add input validation to branding handler
- Triggers on push to main and manual dispatch
- Builds with ENABLE_PYTHON=true for skill support
- Pushes to ghcr.io/vellus-ai/argoclaw:latest + SHA tag
- Uses Docker layer caching via GitHub Actions cache
- Fix Dockerfile ldflags: nextlevelbuilder → vellus-ai

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: Docker build + push to GHCR on main
- jwt_test.go: GenerateRefreshToken returns 3 values (raw, hash, err),
  not 2 — fix destructuring in TestGenerateRefreshToken_Unique
- provider-form-dialog.tsx: add missing isEdit constant (create-only
  dialog, always false) to fix TS2304 compilation error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tenant_middleware_test.go referenced undefined ctxKeyTenantID — replaced
with store.WithTenantID() which is the actual API used by the middleware.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: resolve CI build errors (Go test + TypeScript)
Milton Silva and others added 22 commits March 25, 2026 12:05
…kflow

- Remove DOCKERHUB_IMAGE env var and all Docker Hub login steps
  (we use GHCR exclusively, Docker Hub secrets were never configured)
- Remove notify-discord job (DISCORD_WEBHOOK_URL secret not configured)
- Remove Docker Hub image refs from metadata extraction
- Fix ldflags import path: nextlevelbuilder → vellus-ai

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: remove Docker Hub login and Discord notification
* test: add E2E tenant isolation test suite

Comprehensive E2E tests for multi-tenancy data isolation covering:
- Store-level CRUD cross-tenant isolation (agents, branding, membership)
- JWT auth boundary tests (tampering, algorithm confusion, wrong secret)
- HTTP API header injection prevention (X-ArgoClaw-User-Id, X-Tenant-Id)
- Privilege escalation (admin cross-tenant, self-add, immutable tenant_id)
- WebSocket connection isolation (connect, event leak, forged tenant param)
- SQL injection payloads against tenant-filtered queries
- Property-based testing (PBT) for isolation invariants
- Suspended/expired tenant data access policies

Includes CI workflow (ci-tenant-isolation.yml) and docker-compose for
local test execution with pgvector/pg18.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address code review findings in tenant isolation tests

- Remove dead code mustGenerateExpiredToken (misleading: generated valid tokens)
- Fix PBT TestPBT_AgentKeyNeverLeaksCrossTenant logic: verify against
  known Tenant A agent IDs set instead of reverse-querying Tenant A context
- Rename TestHTTP_NoAuth_Returns401 to TestHTTP_NoAuth_NoTenantDataLeaked
  to match actual behavior (gateway token mode may return 200)
- Remove local min() function (redundant with Go 1.21+ builtin)
- Fix comment httpClientWithToken → httpReqWithToken

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address all remaining code review findings

- Add multi-store isolation tests: LLM providers (list, get-by-name),
  agent teams (list), custom tools (list-all), agents (List method)
- Add t.Parallel() to all independent tests for faster CI execution
- Fix defer-in-loop in TestWS_MultipleConnections (use t.Cleanup)
- Improve TestJWT_InvalidUUID_TenantID assertions: verify ALL invalid
  payloads fail uuid.Parse, not just "not-a-uuid"
- Update migrate image version in docker-compose (v4.17.0 → v4.18.2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…#21)

* fix(store): add tenant isolation to SessionStore/CronStore/SkillStore

Add context.Context parameter to SessionStore and CronStore interface
methods that perform DB operations, enabling tenant_id filtering via
tenantIDFromCtx(ctx). This prevents cross-tenant data leakage in
multi-tenant deployments.

SessionStore changes:
- GetOrCreate, Delete, List, ListPaged, ListPagedRich, Save,
  LastUsedChannel now accept ctx as first parameter
- All DB queries add AND tenant_id = $N when tid != uuid.Nil
- INSERT includes tenant_id column
- buildSessionFilter accepts tid for consistent filtering

CronStore changes:
- AddJob, GetJob, ListJobs, RemoveJob, UpdateJob, EnableJob,
  GetRunLog, RunJob now accept ctx as first parameter
- AddJob INSERT includes tenant_id column
- scanJobTenant adds tenant_id filter to single-row lookups
- GetRunLog JOINs with cron_jobs for tenant verification
- UpdateJob uses execMapUpdateTenant when tenant is present

SkillStore changes:
- CreateSkillManaged INSERT includes tenant_id column
- Added CreateSkillWithCtx, UpdateSkillWithCtx, DeleteSkillWithCtx,
  ToggleSkillWithCtx for tenant-aware operations
- DeleteSkillWithCtx adds tenant_id to SELECT and UPDATE queries

BackfillAgentEmbeddings:
- Added tenant_id filter to SELECT query when tenant is in context

All callers updated to propagate ctx: agent loop, tools, gateway
methods, heartbeat ticker, consumer handlers.

Backward compatible: when tid == uuid.Nil, no tenant filter is applied.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings in tenant store isolation

- scanJobTenant now accepts context.Context and uses QueryRowContext
  instead of QueryRow, ensuring query cancellation propagation
- EnableJob now checks RowsAffected when tenant_id is set, consistent
  with RemoveJob and session Delete (prevents silent cross-tenant no-op)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stores (#22)

* fix(security): resolve all 6 tenant isolation blockers from code review

Fixes all blockers flagged in PR #22 review (score 100/100 each):

1. CronStore.RemoveJob/EnableJob: add ctx + AND tenant_id filter + RowsAffected check
2. CronStore.UpdateJob: add ctx + use execMapUpdateTenant when tenant present
3. SessionStore.Delete: use tenantIDFromCtx(ctx) instead of cache lookup — prevents
   cross-tenant deletion when session is not in local cache (restart, different node)
4. SessionStore.List: add ctx + filter by tenant via buildSessionFilter
5. sessions.loadFromDB: add tenantID param + AND tenant_id=$2 — prevents cross-tenant
   session reads via GetOrCreate with a known session key
6. DeleteSkill: add tenant filter to is_system SELECT + RowsAffected check on UPDATE

Also fixes 2 warnings (score 75):
- testutil: 10*000*1000*1000 = 0ns timeout → 10*time.Second
- TestIsolation_Sessions_Delete_CrossTenant: ctxB was unused; now tests adversarial
  cross-tenant delete (must not delete) followed by same-tenant delete (must delete)
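
Why the testutil expression evaluated to zero: in Go, 000 is an octal literal with value 0, so the whole product collapses to 0. The constant names below are illustrative; only the two expressions come from the commit message.

```go
package main

import "time"

// brokenTimeout reproduces the reported expression. 000 is an octal literal
// for zero, so the product is 0 nanoseconds.
const brokenTimeout = time.Duration(10 * 000 * 1000 * 1000)

// fixedTimeout states the intent directly with a typed unit.
const fixedTimeout = 10 * time.Second
```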

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(security): resolve 2 AppSec blockers from code review

- sessions.go: getOrInit no longer INSERTs without tenant_id.
  Sessions without ctx are kept in-memory only; GetOrCreate(ctx)
  must be called first to persist with the correct tenant_id.
  Inserting without tenant_id would create orphaned rows that bypass
  multi-tenant isolation.

- sessions_ops.go: Delete now executes the DB DELETE and verifies
  RowsAffected before evicting cache and calling OnDelete.
  Previously, cache eviction and media cleanup ran before the DB
  tenant check, leaving inconsistent state on cross-tenant attempts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…#23)

* feat(argoclaw): --non-interactive onboard + OpenTelemetry GenAI instrumentation

- Add --non-interactive flag to 'argoclaw onboard' (CI/automated deploy mode)
  - Reads all inputs from env vars: ARGOCLAW_POSTGRES_DSN (required),
    ARGOCLAW_GATEWAY_TOKEN and ARGOCLAW_ENCRYPTION_KEY (auto-generated if absent)
  - Skips all interactive prompts; safe to run with stdin closed
  - Idempotent: migrations use 'no change' guard, seed is upsert-safe
- Add Gemini (gemini_native) to default provider seed list
- Add internal/telemetry package:
  - Setup() — OTLP gRPC exporter, tracer + meter provider
  - GenAI semantic conventions (AttrGenAI* constants, RecordLLMCall helper)
  - Graceful noop when OTEL_EXPORTER_OTLP_ENDPOINT not configured
- Initialize OTel in gateway startup with deferred graceful shutdown
- TDD: 5 unit tests for non-interactive mode + 4 OTel/GenAI tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(otel,onboard): resolve code review blockers for PR #23

- Add nil guard on otelShutdown to prevent panic on Setup failure
- Handle error from telemetry.InitMetrics() instead of silently discarding
- Read ARGOCLAW_ENVIRONMENT env var for OTel deployment.environment attribute
- Document non-overlap between internal/telemetry and internal/tracing/otelexport
- Make OTel TLS configurable via OTEL_EXPORTER_OTLP_INSECURE (standard env var)
- Use errors.Is(err, migrate.ErrNoChange) instead of string comparison
- Protect OTel metric globals with sync.Once; remove metricsInitialized bool
- Change RecordLLMCall attrs parameter to pointer so callers can update tokens post-call
- Handle errors in onboardWriteEnvFile (return error instead of silently ignoring)
- Single-quote all env var values in .env.local to prevent bash special-char expansion
- Return (string, error) from onboardGenerateToken; update all callers
- Add PBT tests (pgregory.net/rapid) and metric coverage tests for 90% coverage target
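
The switch from string comparison to errors.Is can be illustrated with a local sentinel. errNoChange stands in for migrate.ErrNoChange, and runMigrations is a fake that wraps the sentinel to show why matching on the message text is fragile.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoChange is a stand-in for migrate.ErrNoChange.
var errNoChange = errors.New("no change")

// runMigrations is a fake: when the schema is up to date it returns the
// sentinel wrapped with extra context, as intermediate layers often do.
func runMigrations(upToDate bool) error {
	if upToDate {
		return fmt.Errorf("migrate up: %w", errNoChange)
	}
	return nil
}

// migrateIdempotent treats "no change" as success. errors.Is walks the
// wrap chain, whereas err.Error() == "no change" would miss the wrapped error.
func migrateIdempotent(upToDate bool) error {
	err := runMigrations(upToDate)
	if errors.Is(err, errNoChange) {
		return nil
	}
	return err
}
```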

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* ci: add Google Artifact Registry to docker-publish workflow

- Add GAR as third registry alongside GHCR and Docker Hub
- Authenticate via Workload Identity Federation (google-github-actions/auth)
- Add id-token: write permission for OIDC
- Both build-and-push and build-and-push-web jobs publish to GAR
- Requires GCP_WORKLOAD_IDENTITY_PROVIDER and GCP_SERVICE_ACCOUNT secrets

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* sec: fix 3 CVEs — gRPC, jsonparser, imaging

- CVE-2026-33186 (CRITICAL): google.golang.org/grpc v1.78.0 -> v1.79.3
  Authorization bypass via missing leading slash in :path
- GHSA-6g7g-w4f8-9c9x (HIGH): github.com/buger/jsonparser v1.1.1 -> v1.1.2
  Denial of service vulnerability
- CVE-2023-36308 (LOW, panic on malformed images): Replace github.com/disintegration/imaging v1.6.2
  with golang.org/x/image/draw (a golang.org/x module, not part of the standard library).
  Rewrote SanitizeImage using image.Decode + draw.CatmullRom.

Trivy scan: 0 vulnerabilities across all severities.
Closes vellus-ai/vellus-ai-agents-platform#21
Closes vellus-ai/vellus-ai-agents-platform#22

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DevSecOps pipeline completed:
- Go build (linux/amd64, CGO=0): ✅ 5.1MB
- go vet: ✅ no issues
- WebUI tests (7/7): ✅ ServesIndexHTML, SecurityHeaders, FallbackToIndexHTML...
- Image: us-central1-docker.pkg.dev/vellus-ai-agent-platform/argoclaw/argoclaw:v1.79.0-webui
- K8s deploy: ✅ rollout completed
- Health check: ✅ https://argo-vellus.consilium.tec.br/ → HTTP 200

CI checks OK: go ✅, web ✅
CI checks ignored (pre-existing, unrelated to this PR):
- Tenant Isolation E2E: column 'external_id' does not exist in CI (schema gap)
- claude-review: Claude Code GitHub App not installed on the repo
Resolves vellus-ai/vellus-ai-agents-platform#33: rebuild of the Docker image
combining security patches (v0.1.1-sec) + embedded React SPA (v1.79.0-webui).

Analysis:
- main HEAD (402d322) already contains EVERYTHING: appsec patches (#14, #21, #22) + embed-web-ui
- v0.1.1-sec was built before the embed-web-ui merge (the SPA is missing)
- v1.79.0-webui was built from the branch (pre-squash), without the changes in the final commit
- The correct image requires a build from main HEAD with ENABLE_WEB_UI=true

Changes:
- .github/workflows/docker-publish.yaml: adds a "webui" variant (-webui suffix)
  with ENABLE_WEB_UI=true; adds an enable_web_ui field to all existing variants
- .github/workflows/rebuild-webui-hardened.yml: dedicated workflow for an immediate rebuild
  (trigger: push to this branch or workflow_dispatch); produces tag v1.79.1-webui in GAR;
  documents the included security patches in the job summary

Next step: after merge, run the workflow and update the K8s deployment to v1.79.1-webui.

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
google-github-actions/auth@v2 does not generate access_token by default
when using Workload Identity Federation. The docker/login-action step
requires an access_token to authenticate against GAR.

Also configures the GCP WIF pool (github-actions) and SA (sa-github-ci)
which were missing from the repository secrets.

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…nboard race safety (#28)

* fix(store): add SeedOnboardProvider with ON CONFLICT DO NOTHING for onboard race safety

Resolves vellus-ai/vellus-ai-agents-platform#32.

With replicas >= 2, two initContainers can race to seed placeholder providers.
The previous code called CreateProvider (plain INSERT) and silently swallowed
duplicate-key errors via slog.Debug — fragile and misleading.

Changes:
- Add store.ProviderStore.SeedOnboardProvider interface method with doc comment
  explaining the intentional ON CONFLICT (name, tenant_id) DO NOTHING semantics
- Implement SeedOnboardProvider in PGProviderStore with the idempotent INSERT;
  no DO UPDATE clause ensures user-configured values are never overwritten
- Extract seedPlaceholdersWithStore(ctx, store.ProviderStore) from
  seedOnboardPlaceholders for dependency injection and unit testing
- Update both mockProviderStore stubs (internal/http, internal/oauth) to
  satisfy the updated interface
- Add onboard_managed_test.go with 5 tests covering: full seeding, idempotency,
  api_base skip, error resilience, and PBT never-panics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(migration): add UNIQUE(name, tenant_id) constraint on llm_providers

Migration 027 added tenant_id to llm_providers but did not update the
UNIQUE constraint from single-column (name) to composite (name, tenant_id).
This caused ON CONFLICT (name, tenant_id) DO NOTHING in SeedOnboardProvider
to fail at runtime (Issue nextlevelbuilder#43).

- Drop old llm_providers_name_key constraint
- Create regular UNIQUE index on (name, tenant_id) for arbiter inference
- Create partial UNIQUE index on (name) WHERE tenant_id IS NULL for legacy rows
- Bump RequiredSchemaVersion to 29

Resolves vellus-ai/vellus-ai-agents-platform#43

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(auth): wire user auth endpoints to gateway server

Connect existing auth infrastructure (UserAuthHandler, UserStore, JWT)
to the running gateway server. No new logic — pure wiring:

- Add JWTSecret to GatewayConfig (env ARGOCLAW_JWT_SECRET, never persisted)
- Add Users UserStore to Stores struct + PGUserStore in factory
- Add SetUserAuthHandler + route registration in BuildMux
- Wire handler creation in cmd/gateway.go (conditional on JWT secret)
- Add unit tests for config loading, JWT roundtrip, password validation,
  password history detection (Gap G2)

Endpoints activated: POST /v1/auth/{register,login,refresh,logout}

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(gateway): wire JWT middleware globally + add security headers

- Apply JWTMiddleware globally in Start() — falls through when no JWT
  is present, preserving gateway token backward compat
- Add securityHeadersMiddleware (Gap G4/RNF-16): HSTS, X-Content-Type-Options,
  X-Frame-Options, Referrer-Policy, X-XSS-Protection (disabled per OWASP)
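
A possible shape for such a middleware, using net/http only. The header names come from the commit message; the concrete values (max-age, DENY, the referrer policy) are assumptions, except that X-XSS-Protection is set to 0 to disable the legacy filter. The function names are illustrative.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// securityHeaders sets the response headers before delegating to next.
// Values other than X-XSS-Protection: 0 are assumed defaults.
func securityHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		h.Set("Strict-Transport-Security", "max-age=31536000; includeSubDomains")
		h.Set("X-Content-Type-Options", "nosniff")
		h.Set("X-Frame-Options", "DENY")
		h.Set("Referrer-Policy", "strict-origin-when-cross-origin")
		h.Set("X-XSS-Protection", "0") // legacy XSS auditor explicitly disabled
		next.ServeHTTP(w, r)
	})
}

// probe sends one request through the middleware and returns the headers
// recorded on the response.
func probe() http.Header {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	securityHeaders(inner).ServeHTTP(rec, req)
	return rec.Header()
}
```

Applying the wrapper once around the top-level mux is what makes per-handler copies of these headers redundant.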

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(auth): add handler tests for register/login/logout endpoints

8 test cases covering UserAuthHandler:
- Register: success (201 + tokens), duplicate email (409), weak password (400)
- Login: success (200 + JWT), wrong password (401 + counter), non-existent (401),
  lockout (429)
- Logout: session revoked (200)

All tests use in-memory stubUserStore — no database dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(auth): add coverage tests for RefreshToken, HashRefreshToken, VerifyPassword

Boost internal/auth coverage from 86.3% to 90.4%:
- TestHashRefreshToken: deterministic SHA-256, different inputs
- TestGenerateRefreshToken: unique tokens, hash matches
- TestVerifyPassword_MalformedHash: malformed and empty hash
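
The properties these tests check (deterministic SHA-256, unique tokens, hash matches token) can be sketched as follows. The function names mirror the test names above, but the bodies, the 32-byte token size, and the hex encoding are assumptions, not the code under test.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashRefreshToken returns the hex-encoded SHA-256 digest of the token.
// The same input always yields the same output.
func hashRefreshToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// generateRefreshToken returns a random token and the hash that would be
// stored server-side. The 32-byte size is an assumed choice.
func generateRefreshToken() (token, hash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	token = hex.EncodeToString(buf)
	return token, hashRefreshToken(token), nil
}

func main() {
	token, hash, err := generateRefreshToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(token), hash == hashRefreshToken(token))
}
```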

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* simplify: remove redundant code from auth-wiring changes

- Remove duplicate security headers from WebUIHandler (already set by global securityHeadersMiddleware)
- Remove task-tracking comment from securityHeadersMiddleware
- Remove redundant pgStores.Users != nil guard (factory always initializes Users)
- Move user_auth_test to package http_test for proper black-box isolation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(auth): resolve code review blockers — PBT, refresh tests, edge cases, t.Parallel

- Fix os.Unsetenv → t.Setenv for proper test isolation (Blocker #1)
- Add 3 tests for handleRefresh: success, invalid token, revoked token reuse (Blocker #2)
- Add 3 PBT tests: ValidatePassword properties, JWT roundtrip, HashRefreshToken determinism (Blocker #3)
- Add 7 edge case tests: malformed JSON (register/login/refresh/logout), empty body, missing email, email normalization (Blocker #4)
- Add t.Parallel() to all independent tests in both files (Blocker #5)
- Fix stubUserStore.GetSessionByToken to filter revoked sessions (matches production SQL behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(plugins): Fase 0 — Plugin Host infrastructure (WP-1 through WP-9)

Implements the complete plugin host infrastructure for ArgoClaw:

## Schema (WP-1)
- Migration 000030: 5 tables (plugin_catalog, tenant_plugins, agent_plugins,
  plugin_data, plugin_audit_log) with tenant isolation, FK CASCADE, indexes
- Bumps RequiredSchemaVersion 29 → 30

## Store Layer (WP-2, WP-3)
- store.PluginStore interface (~20 methods): catalog, lifecycle, agent overrides,
  KV data, audit log
- PGPluginStore implementation with parameterized SQL only (no ORM)
- G2 blocker enforced: tenantID always from context, never from caller params
- G1 enforced: UninstallPlugin cascades to delete all plugin data atomically
- Atomic transactions for Install/Enable/Disable/Uninstall with inline audit
- Compile-time interface check: var _ store.PluginStore = (*pg.PGPluginStore)(nil)

## Manifest + Permissions (WP-4)
- ParseManifest: validates name (kebab-case), version (semver), transport whitelist
- G4 blocker: ValidatePermissions rejects any core:* write scope
- PBT via testing/quick: random core:* writes always rejected

## In-Memory Registry (WP-5)
- Thread-safe Registry (sync.RWMutex) for runtime plugin state
- Names(), ActiveNames(), Count(), List(), Register(), Unregister()
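
As a rough illustration of the registry described above, the sketch below guards a map with sync.RWMutex so that reads can proceed in parallel while writes are exclusive. The field layout and the boolean "active" flag are assumptions; only a subset of the listed methods is shown.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Registry holds runtime plugin state. The map value is an assumed
// "active" flag; the real registry likely stores richer state.
type Registry struct {
	mu      sync.RWMutex
	plugins map[string]bool
}

func NewRegistry() *Registry {
	return &Registry{plugins: make(map[string]bool)}
}

// Register adds or replaces a plugin entry under the write lock.
func (r *Registry) Register(name string, active bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.plugins[name] = active
}

// Unregister removes a plugin entry under the write lock.
func (r *Registry) Unregister(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.plugins, name)
}

// Count returns the number of registered plugins under the read lock.
func (r *Registry) Count() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return len(r.plugins)
}

// ActiveNames returns the sorted names of plugins flagged active.
func (r *Registry) ActiveNames() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	names := make([]string, 0, len(r.plugins))
	for name, active := range r.plugins {
		if active {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	reg := NewRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			reg.Register(fmt.Sprintf("plugin-%d", i), i%2 == 0)
		}(i)
	}
	wg.Wait()
	fmt.Println(reg.Count(), reg.ActiveNames())
}
```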

## Data Proxy (WP-6)
- DataProxy validates tenant context, collection (max 100), key (max 500)
- Enforces plugin-installed check before any store operation
- G2: context tenant always wins; never trusts caller-supplied values

## REST API (WP-7, WP-8)
- PluginHandler: catalog CRUD, install/uninstall/enable/disable, agent grants
- PluginDataHandler: KV data CRUD (list/get/put/delete)
- G4 at HTTP boundary: POST /v1/plugins/catalog validates manifest permissions
- Auth required on all endpoints (requireAuth pattern)
- Conflict 409 on duplicate install, 404 on not found

## Gateway Integration (WP-9)
- Lifecycle controller: LoadAll (startup), RegisterPlugin, UnregisterPlugin, Stop
- Tool groups registered via tools.RegisterToolGroup("plugin:{name}", ...)
- gateway/server.go: SetPluginHandler, SetPluginDataHandler, route registration
- cmd/gateway.go: wires plugin handlers when store.Plugins != nil
- allowedTables whitelist updated with 5 plugin tables

Test summary:
- internal/plugins: 95.5% coverage, 50+ tests
- internal/http (plugin files): all handlers covered, auth enforced
- internal/store/pg: compile-time interface + integration test suite (build tag)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(plugins): resolve code review blockers and AppSec advisories

- Replace err.Error() with generic messages in HTTP 500 responses (BLOCKER-1)
- Use errors.Is() for sql.ErrNoRows comparison at 6 sites (BLOCKER-2)
- Rewrite isUniqueViolation() with errors.As + pgconn.PgError (BLOCKER-2)
- Fix UNIQUE constraint on agent_plugins to include tenant_id (ADVISORY-A)
- Escape LIKE metacharacters in ListDataKeys prefix (ADVISORY-B)
- Add plugin name validation on all HTTP handlers (ADVISORY-C)
- Export IsValidPluginName() from plugins package for handler reuse

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(plugins): resolve Round 2 code review blockers

- Apply TenantMiddleware to all plugin routes for JWT multi-tenant isolation (BLOCKER-1)
- Wire DataProxy into PluginDataHandler, replacing direct store access (BLOCKER-2)
- Escape backslash in LIKE replacer for ListDataKeys (BLOCKER-3)
- Require admin role for POST /v1/plugins/catalog (BLOCKER-4)
- Validate plugin_name with IsValidPluginName in handleGrantAgent/handleInstallPlugin (BLOCKER-5)
- Fix checkPluginInstalled to distinguish ErrPluginNotFound from transient errors (BLOCKER-6)
- Verify plugin state is "enabled" in checkPluginInstalled (BLOCKER-7)
- Add collection/key length validation in HTTP data handlers (BLOCKER-8)
- Update tests: inject tenant context, use DataProxy-aware stubs
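
The LIKE escaping mentioned for ListDataKeys (metacharacters first, backslash added in Round 2) might look like the sketch below. The helper name `likePrefix` is hypothetical, and it assumes the database uses backslash as the LIKE escape character, which is the PostgreSQL default.

```go
package main

import (
	"fmt"
	"strings"
)

// likeEscaper escapes the escape character itself as well as the two
// LIKE wildcards. strings.Replacer works in a single pass, so the
// backslashes it emits are never escaped a second time.
var likeEscaper = strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)

// likePrefix turns a user-supplied key prefix into a LIKE pattern that
// matches only keys starting with that literal prefix.
func likePrefix(prefix string) string {
	return likeEscaper.Replace(prefix) + "%"
}

func main() {
	// The result is meant to be passed as a query parameter, for example
	// to "... WHERE key LIKE $1", never concatenated into the SQL text.
	fmt.Println(likePrefix("user_settings"))
}
```

Without the backslash rule, a prefix ending in `\` would escape the trailing `%` and silently turn the prefix search into an exact match.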

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…dentity

* feat(providers): add Vertex AI as LLM provider with OAuth2/Workload Identity

Add Google Vertex AI as a native provider using the OpenAI-compatible
endpoint. Authentication via Application Default Credentials (ADC)
enables zero-secret auth on GKE via Workload Identity.

Changes:
- OpenAIProvider: add TokenSource support for dynamic OAuth2 tokens
- New vertex_ai.go: factory + gcpTokenSource (ADC auto-refresh)
- Config: VertexAIConfig struct (project_id, region, default_model)
- Store: ProviderVertexAI type for DB-based provider registration
- gateway_providers.go: register from config and DB
- Thought signature detection for Vertex AI endpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): resolve 3 code review blockers for Vertex AI provider

1. OAuth2 scope: cloud-platform → aiplatform (least privilege)
2. Tenant isolation: block DB registration — Vertex AI uses host SA,
   only host operator can configure via config.json
3. Input validation: regex validation on projectID and region to
   prevent SSRF via URL injection
4. Tests: rewrite to exercise NewVertexAIProvider end-to-end, add
   PBT for URL construction invariants, add SSRF validation tests
5. Mutex: release lock before Token() call — oauth2.ReuseTokenSource
   is already thread-safe, avoids serialization bottleneck
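
Item 3 can be illustrated with a small validation sketch. The two regular expressions and the URL template are assumptions chosen for the example (the commit does not quote the actual patterns); the point is that both values are checked against a strict allowlist before they are interpolated into a host name and path.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Assumed patterns: lowercase letters, digits and hyphens only.
	projectIDRe = regexp.MustCompile(`^[a-z][a-z0-9-]{4,28}[a-z0-9]$`)
	regionRe    = regexp.MustCompile(`^[a-z]+-[a-z]+[0-9]+$`)
)

// vertexBaseURL (hypothetical helper) validates its inputs and only then
// builds the endpoint URL. The URL shape is illustrative.
func vertexBaseURL(projectID, region string) (string, error) {
	if !projectIDRe.MatchString(projectID) {
		return "", fmt.Errorf("invalid project ID %q", projectID)
	}
	if !regionRe.MatchString(region) {
		return "", fmt.Errorf("invalid region %q", region)
	}
	return fmt.Sprintf(
		"https://%s-aiplatform.googleapis.com/v1/projects/%s/locations/%s/endpoints/openapi",
		region, projectID, region,
	), nil
}

func main() {
	fmt.Println(vertexBaseURL("my-project-123", "us-central1"))
	// A region that tries to smuggle in another host is rejected.
	fmt.Println(vertexBaseURL("my-project-123", "us-central1.evil.example/"))
}
```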

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(providers): simplify Vertex AI — sync.Once, constants, remove false log

- Replace sync.Mutex with sync.Once for one-time ADC init (idiomatic, no log-under-lock)
- Extract VertexAIDefaultRegion and VertexAIProviderType constants (DRY)
- Remove false-positive slog.Info("registered provider") after security block
- Use exported constants instead of string literals

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(providers): add Vertex AI as LLM provider with OAuth2/Workload Identity

Add Google Vertex AI as a native provider using the OpenAI-compatible
endpoint. Authentication via Application Default Credentials (ADC)
enables zero-secret auth on GKE via Workload Identity.

Changes:
- OpenAIProvider: add TokenSource support for dynamic OAuth2 tokens
- New vertex_ai.go: factory + gcpTokenSource (ADC auto-refresh)
- Config: VertexAIConfig struct (project_id, region, default_model)
- Store: ProviderVertexAI type for DB-based provider registration
- gateway_providers.go: register from config and DB
- Thought signature detection for Vertex AI endpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): harden auth endpoints — rate limit, tenant isolation, status check, JWT aud

4 AppSec vulnerabilities fixed before exposing /v1/auth/* in production:

1. Rate limiting per-IP on auth endpoints (login: 10/min, register: 5/min,
   refresh: 20/min) with 429 + Retry-After header. Prevents brute-force.

2. WithTenantID now correctly calls store.WithTenantID(ctx, uuid) instead
   of store.WithUserID. Fixes critical tenant isolation bypass.

3. Login handler now verifies user.Status == "active" before issuing tokens.
   Disabled/suspended/pending accounts return 403. Also checked on refresh.

4. JWT tokens now include aud:"argoclaw" claim, validated on parse.
   Prevents token reuse across unintended services.

All fixes include TDD tests (11 new test cases).
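
To make fix 1 concrete, here is a sketch of a per-IP limiter that answers 429 with a Retry-After header. It uses a simple fixed-window counter, which is a swapped-in technique for brevity; the commit does not say which algorithm the real AuthRateLimiter uses, and all type and function names below are hypothetical. Stale entries are never evicted in this sketch.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strconv"
	"sync"
	"time"
)

// window counts requests from one IP inside the current fixed window.
type window struct {
	start time.Time
	count int
}

// ipLimiter is a fixed-window limiter keyed by client IP.
type ipLimiter struct {
	mu     sync.Mutex
	limit  int
	period time.Duration
	seen   map[string]*window
}

// newPerMinuteLimiter allows at most limit requests per IP per minute.
func newPerMinuteLimiter(limit int) *ipLimiter {
	return &ipLimiter{limit: limit, period: time.Minute, seen: make(map[string]*window)}
}

// allow reports whether ip may proceed and, if not, how long to wait.
func (l *ipLimiter) allow(ip string) (bool, time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	w, ok := l.seen[ip]
	if !ok || now.Sub(w.start) >= l.period {
		l.seen[ip] = &window{start: now, count: 1}
		return true, 0
	}
	if w.count < l.limit {
		w.count++
		return true, 0
	}
	return false, l.period - now.Sub(w.start)
}

// wrap rejects over-limit requests with 429 and a Retry-After header.
func (l *ipLimiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			ip = r.RemoteAddr
		}
		if ok, wait := l.allow(ip); !ok {
			w.Header().Set("Retry-After", strconv.Itoa(int(wait/time.Second)+1))
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	login := newPerMinuteLimiter(10) // login: 10/min, as in the commit message
	for i := 1; i <= 11; i++ {
		ok, _ := login.allow("203.0.113.7")
		fmt.Println(i, ok)
	}
}
```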

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): resolve 3 code review blockers for Vertex AI provider

1. OAuth2 scope: cloud-platform → aiplatform (least privilege)
2. Tenant isolation: block DB registration — Vertex AI uses host SA,
   only host operator can configure via config.json
3. Input validation: regex validation on projectID and region to
   prevent SSRF via URL injection
4. Tests: rewrite to exercise NewVertexAIProvider end-to-end, add
   PBT for URL construction invariants, add SSRF validation tests
5. Mutex: release lock before Token() call — oauth2.ReuseTokenSource
   is already thread-safe, avoids serialization bottleneck

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(providers): simplify Vertex AI — sync.Once, constants, remove false log

- Replace sync.Mutex with sync.Once for one-time ADC init (idiomatic, no log-under-lock)
- Extract VertexAIDefaultRegion and VertexAIProviderType constants (DRY)
- Remove false-positive slog.Info("registered provider") after security block
- Use exported constants instead of string literals

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. ipLimiter.allow(): replace time.Time with atomic.Int64 for lastSeen
   to prevent data race between allow() and cleanupLoop() goroutine
2. Tenant isolation test: use migration 026 users schema (email,
   password_hash) instead of legacy external_id column
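
The data-race fix in item 1 can be sketched like this: the timestamp is stored as Unix nanoseconds in an atomic.Int64, so the request path and a cleanup goroutine can read and write it without a lock. The `visitor` type and its methods are hypothetical stand-ins for the limiter's per-IP entry.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// visitor tracks when an IP was last seen, as Unix nanoseconds.
// A plain time.Time field here would race under concurrent access.
type visitor struct {
	lastSeen atomic.Int64
}

// touch records the current time; safe to call from any goroutine.
func (v *visitor) touch(nowNanos int64) {
	v.lastSeen.Store(nowNanos)
}

// idleNanos returns how long ago the visitor was last seen.
func (v *visitor) idleNanos(nowNanos int64) int64 {
	return nowNanos - v.lastSeen.Load()
}

func main() {
	v := &visitor{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				v.touch(time.Now().UnixNano())         // request path
				_ = v.idleNanos(time.Now().UnixNano()) // cleanup path
			}
		}()
	}
	wg.Wait()
	fmt.Println("idle for", time.Duration(v.idleNanos(time.Now().UnixNano())))
}
```

Running a program like this with `go run -race` is one way to confirm that the concurrent accesses are clean.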

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nagement (#36)

/v1/plugins/{name}/data/{collection} conflicted with
/v1/plugins/installed/{name}/audit in Go 1.22+ ServeMux.

Renamed data proxy routes to /v1/plugin-data/{name}/{collection}[/{key}].

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gRPC expects host:port format without URL schema. When the endpoint
is configured as http://host:4317, gRPC appends :443 resulting in
"too many colons in address" errors.

- Add stripEndpointSchema() to telemetry.Setup() and otelexport.New()
- Fix K8s ConfigMap to use host:port without http:// prefix
- Fix stale doc comments in plugins_data.go (old /v1/plugins/ paths)
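
A helper like stripEndpointSchema could be as small as the sketch below. The function name comes from the commit message; the body is an assumption that simply drops a leading `http://` or `https://` and leaves everything else alone.

```go
package main

import (
	"fmt"
	"strings"
)

// stripEndpointSchema removes a leading URL scheme so the value can be
// handed to a gRPC dialer, which expects plain host:port.
func stripEndpointSchema(endpoint string) string {
	endpoint = strings.TrimPrefix(endpoint, "http://")
	endpoint = strings.TrimPrefix(endpoint, "https://")
	return endpoint
}

func main() {
	fmt.Println(stripEndpointSchema("http://otel-collector:4317"))
	fmt.Println(stripEndpointSchema("otel-collector:4317"))
}
```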

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(auth): add email/password login UI + AppSec hardening

Onboarding Phase 2: WebUI email+password authentication.

Backend auth was already complete (PR #30). This PR adds the React
frontend and hardens the HTTP layer.

WebUI:
- Email login/signup form with PCI DSS password requirements checklist
- JWT auth store (access + refresh tokens in localStorage)
- Auth API client (login, register, refresh, logout)
- HTTP client auto token refresh on 401 with dedup
- Login page defaults to Email tab (Token + Pairing kept as fallback)
- i18n: all keys for en, vi, zh locales
- Vitest testing infrastructure (23 tests passing)

AppSec:
- Health endpoint no longer leaks protocol version
- General rate limiter (60 rpm per IP) on all HTTP routes
- JWT audience comment documenting cross-service binding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(auth): simplify review findings

- Extract shared INPUT_CLASS/BUTTON_CLASS to form-styles.ts (DRY)
- Add email substring check to password requirements (parity with backend)
- Inline JWT audience comment to preserve field alignment
- Add reqNoEmail i18n key to all 3 locales

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

* feat(auth): wire JWT auto-refresh + i18n login for pt/es/fr/it/de

Phase 2 completion: JWT session management wiring and i18n expansion.

- Wire HttpClient.setRefreshFn in WsProvider for silent 401 → refresh → retry
- Wire onTokenRefreshed to persist new tokens to auth store
- Add useJwtRefresh hook: proactive token renewal 2 min before expiry
- Add login.json translations for pt-BR, es-ES, fr-FR, it-IT, de-DE
- Register 5 new ARGO product languages in i18n config with EN fallback
- Update getInitialLanguage to detect all 8 supported browser languages

Build: pnpm build OK | Tests: 23/23 pass | TypeScript strict: OK

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(auth): resolve 5 code review blockers for PR #39

1. Race condition: centralize JWT refresh in singleton (token-refresh.ts)
   — both proactive timer and reactive 401 handler share same Promise,
   preventing duplicate refresh calls with rotate-on-refresh tokens.

2. Base64 padding: add padding before atob() in getJwtExp to prevent
   Firefox failures on JWT payloads with non-multiple-of-4 length.

3. setTimeout overflow: cap timer delay at MAX_SAFE_TIMEOUT (2^31-1 ms)
   to prevent immediate firing for long-lived tokens.

4. German locale: restore all umlauts (ä, ö, ü, ß) in de/login.json.

5. French locale: restore all accents (é, è, ê, à, ç) in fr/login.json.

Tests: 17 new tests (12 getJwtExp + 5 refreshTokenSingleton).
Build: pnpm build OK | Tests: 40/40 pass | TypeScript strict: OK

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(i18n): restore diacritics in es/pt/it locales + fix exp falsy check

Code review round 2 — resolve 4 blockers:

1. es/login.json: add ñ, ó, é, á, ú, ¿, ¡ (sesión, contraseña, etc.)
2. pt/login.json: add ã, ç, é, á, õ (não, exibição, possível, etc.)
3. it/login.json: add è, à, ù (è già, più, Verrà, etc.)
4. use-jwt-refresh.ts: change `if (!exp)` to `if (exp === null)` to
   avoid treating exp=0 as falsy (gateway token misdetection)

Tests: 40/40 pass | Build: OK

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(tools): add 8 onboarding tools for Imediato chat-first setup

Add tools that the Imediato (Chief of Staff) agent uses during
conversational onboarding to configure the workspace in real-time:

1. configure_workspace — set account type (personal/business), name, industry
2. set_branding — primary color, product name
3. configure_llm_provider — provider + API key + model selection
4. test_llm_connection — validate API key format
5. create_agent — create agent with preset (captain, helmsman, etc.)
6. configure_channel — webchat, telegram, whatsapp, discord, slack
7. complete_onboarding — mark setup done, transition Imediato to CoS mode
8. get_onboarding_status — check what has been configured

Architecture:
- All tools implement the Tool interface (Name, Description, Parameters, Execute)
- OnboardingStore interface for tenant settings persistence
- OnboardingStoreAware setter for dependency injection
- 15 unit tests covering all tools (TDD)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(tools): resolve 9 code review blockers in onboarding tools

1. Fix tenantIDFromCtx: use store.TenantIDFromContext (was AgentIDFromContext)
   — all 6 store-dependent tools now correctly isolate by tenant
2. Phantom successes eliminated: configure_llm_provider, create_agent, and
   configure_channel now clearly state they collect info only — user must
   complete setup via dashboard. No false "encrypted at rest" claims.
3. API key masking: keys are masked to first 4 chars + "***" in all tool
   results. Full keys never appear in LLM context.
4. Channel tool no longer accepts bot_token as parameter — directs user
   to dashboard Settings > Channels for secure credential entry.
5. SetBrandingTool: hex color validated via regex (^#[0-9A-Fa-f]{3,6}$),
   prevents CSS injection. Partial updates preserve unset fields.
6. ConfigureWorkspaceTool: validates account_type against enum, enforces
   max 255 chars on account_name, trims whitespace.
7. GetOnboardingStatusTool: json.MarshalIndent error now handled explicitly.
8. Tests rewritten with mock OnboardingStore: 38 tests including happy paths,
   store errors, tenant isolation (2 tenants independent, empty context rejected),
   PBT for hex color validation (5000 iterations) and API key masking (5000).
9. Removed unused agentStore field from ConfigureLLMProviderTool and CreateAgentTool.
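
Items 3 and 5 can be sketched together. The regular expression is the one quoted above; the function names and the handling of keys of four characters or fewer are assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// hexColor is the pattern quoted in the commit message.
var hexColor = regexp.MustCompile(`^#[0-9A-Fa-f]{3,6}$`)

// isHexColor reports whether s is a bare hex color. Anything containing
// CSS syntax such as ';' or '(' is rejected.
func isHexColor(s string) bool {
	return hexColor.MatchString(s)
}

// maskAPIKey keeps the first 4 characters and replaces the rest with
// "***". Very short keys are fully masked (assumed behavior). Keys are
// assumed to be ASCII, so byte slicing is safe here.
func maskAPIKey(key string) string {
	if len(key) <= 4 {
		return "***"
	}
	return key[:4] + "***"
}

func main() {
	fmt.Println(maskAPIKey("sk-abcdef123456"))
	fmt.Println(isHexColor("#1A2b3C"), isHexColor("#fff;background:url(x)"))
}
```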

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add favicon.svg to ui/web/public/ so it's included in the Vite build
and served at /favicon.svg by the embedded SPA handler.

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MiltonSilvaJr
Author

Code Review

Found 4 issue(s):

[BLOCKER] 1. Missing rate limiting on POST /v1/auth/change-password endpoint — Security

The AuthRateLimiter wraps login, register, and refresh but NOT change-password. An attacker with a stolen JWT could brute-force current_password at unlimited speed.

Fix: Add WrapChangePassword to AuthRateLimiter and wrap the handler in RegisterRoutes, matching the pattern of the other auth endpoints.

https://github.com/vellus-ai/argoclaw/blob/5f6a970c822cde6d1aed1f26f920aa62566fde5a/internal/http/user_auth.go#L56-L58


[BLOCKER] 2. Frontend catches HTTP 422 but backend returns 400 for weak passwords — Bug

ChangePasswordModal.tsx maps case 422 to weak password error, but handleChangePassword returns http.StatusBadRequest (400) for all validation errors. Users with weak passwords will see a generic "Server error" instead of the specific validation message.

Fix: Either change the frontend to catch 400 (and distinguish by error message body), or change the backend to return 422 Unprocessable Entity for validation errors to match REST conventions.

https://github.com/vellus-ai/argoclaw/blob/5f6a970c822cde6d1aed1f26f920aa62566fde5a/ui/web/src/components/shared/ChangePasswordModal.tsx#L85-L90


[BLOCKER] 3. PII (email) logged in plaintext on password change — Security/LGPD

slog.Info("security.password_changed", "user_id", user.ID, "email", user.Email) logs email in cleartext. Per CLAUDE.md and LGPD: "Never log passwords, tokens, or PII." The user_id alone is sufficient for correlation.

Fix: Remove "email", user.Email from the slog call, or mask it (user.Email[:3]+"***").

https://github.com/vellus-ai/argoclaw/blob/5f6a970c822cde6d1aed1f26f920aa62566fde5a/internal/http/user_auth.go#L427-L429


[BLOCKER] 4. Silent failure on RevokeAllSessions and AddPasswordHistory — Security

_ = h.users.RevokeAllSessions(...) and _ = h.users.AddPasswordHistory(...) discard errors silently. If RevokeAllSessions fails, old sessions remain valid after password change. If AddPasswordHistory fails, password reuse prevention is weakened (PCI DSS).

Fix: At minimum, log errors with slog.Error. For AddPasswordHistory, consider aborting the password change if history storage fails. For RevokeAllSessions, log at slog.Warn since the password is already changed.

https://github.com/vellus-ai/argoclaw/blob/5f6a970c822cde6d1aed1f26f920aa62566fde5a/internal/http/user_auth.go#L394-L409
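
One possible shape for the fix, as a sketch only: a history failure aborts the change, while a session-revocation failure is logged as a warning because the password has already been updated. The `userStore` interface, `UpdatePassword`, the call order, and the log key are all assumptions; only the method names AddPasswordHistory and RevokeAllSessions come from the review.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// userStore is a hypothetical subset of the real store interface.
type userStore interface {
	AddPasswordHistory(userID, hash string) error
	UpdatePassword(userID, hash string) error
	RevokeAllSessions(userID string) error
}

// changePassword never discards an error: hard failures are returned,
// the post-update revocation failure is logged without PII.
func changePassword(s userStore, userID, newHash string) error {
	if err := s.AddPasswordHistory(userID, newHash); err != nil {
		return fmt.Errorf("record password history: %w", err)
	}
	if err := s.UpdatePassword(userID, newHash); err != nil {
		return fmt.Errorf("update password: %w", err)
	}
	if err := s.RevokeAllSessions(userID); err != nil {
		slog.Warn("security.revoke_sessions_failed", "user_id", userID, "error", err)
	}
	return nil
}

// fakeStore is an in-memory stub used to exercise the failure paths.
type fakeStore struct {
	failHistory, failRevoke bool
	updated                 bool
}

func (f *fakeStore) AddPasswordHistory(string, string) error {
	if f.failHistory {
		return errors.New("history unavailable")
	}
	return nil
}

func (f *fakeStore) UpdatePassword(string, string) error {
	f.updated = true
	return nil
}

func (f *fakeStore) RevokeAllSessions(string) error {
	if f.failRevoke {
		return errors.New("session store unavailable")
	}
	return nil
}

func main() {
	s := &fakeStore{failRevoke: true}
	fmt.Println(changePassword(s, "user-1", "new-hash"), s.updated)
}
```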


Verdict: REQUEST CHANGES
Merge blocked until the 4 items above are resolved. Issues #1 and #3 are security-related (AppSec).


Generated with Claude Code

@MiltonSilvaJr
Author

Code Review — Re-check after fixes

Verified commit d86ad8df62a90e0c5f145b6ac1877410ea5f0e44 resolving all 4 blockers:

  1. Rate limiting: WrapChangePassword added to AuthRateLimiter, wired in RegisterRoutes. Same rate as login (10 RPM, burst 3).
  2. Status code mismatch: Frontend now catches 400 (not 422) and displays the server error message directly.
  3. PII in log: Email removed from slog.Info("security.password_changed"). Only user_id is logged.
  4. Silent error handling: AddPasswordHistory failure now aborts the operation. ClearMustChangePassword and RevokeAllSessions failures are logged at WARN level.

All 6 change-password tests pass. All 23 existing auth tests pass. Build succeeds.

No issues found. Checked: CLAUDE.md compliance, architecture, code quality, tests, security (AppSec), and historical context.

Verdict: APPROVED
Merging into dev.


Generated with Claude Code

@MiltonSilvaJr MiltonSilvaJr changed the base branch from dev to main April 6, 2026 21:08
Milton Silva and others added 2 commits April 6, 2026 18:08
…rd change

Onboarding Phase 2 connects the 8 existing onboarding tools to PostgreSQL
and enables the Imediato agent to guide new customers through workspace
setup via chat.

Backend:
- Migration 000031: setup_progress table + must_change_password on users
- PGOnboardingStore: 4 methods (UpdateTenantSettings, UpdateTenantBranding,
  GetOnboardingStatus, CompleteOnboarding) with UPSERT on-demand
- wireOnboardingTools() in gateway bootstrap — registers 8 tools + group
- POST /v1/auth/change-password endpoint (PCI DSS, history check, audit)
- JWT TokenClaims.MustChangePassword (claim "mcp") for frontend detection
- User struct + PG queries updated for must_change_password field

Frontend:
- ChangePasswordModal (Radix Dialog, blocking, no escape/close)
- Auth store decodes JWT mcp claim for mustChangePassword state
- auth-client.ts with changePassword() API call
- i18n translations in 8 languages (en, pt, es, fr, de, it, vi, zh)

Tests:
- 12 integration tests for OnboardingStore (TDD, PBT, tenant isolation)
- 6 unit tests for change-password endpoint
- 4 E2E tests (full flow, tenant isolation, change password)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Add rate limiting to POST /v1/auth/change-password endpoint
   (WrapChangePassword on AuthRateLimiter, same rate as login)

2. Fix frontend status code mismatch: catch 400 (not 422) for
   validation errors, display server error message directly

3. Remove PII (email) from password change log line (LGPD)

4. Handle errors from AddPasswordHistory (abort if fails to
   preserve PCI DSS reuse prevention), ClearMustChangePassword
   and RevokeAllSessions (log warnings instead of silent ignore)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MiltonSilvaJr MiltonSilvaJr force-pushed the claude/feat/onboarding-phase2-impl branch from d86ad8d to a71badd Compare April 6, 2026 21:08
@MiltonSilvaJr
Author

Closing — PR should be in vellus-ai/argoclaw, not the upstream fork.
