Releases: HomenShum/nodebench-ai
v2.32.0 — Mobile UX + 4-Layer Grounding + Search Quality 92.5%
Closing the orphan-tag gap: v2.32.0 was tagged but no release was published.
Highlights
Search quality + grounding (anti-hallucination)
- 4-layer grounding pipeline: claim verification + citation chain + grounded judge
- Search quality 92.5% (49/53) on the 100-query eval corpus
- Multi-step chain eval: 8/8 chains, 32/32 steps, 100% pass rate
- Hybrid code+LLM judge with majority vote — eliminates flake from single-judge variance
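As a rough illustration of the majority-vote idea above, the sketch below combines one deterministic code check with several LLM judge runs. The `Verdict` type, the function names, and the tie-breaking rule are assumptions for illustration; the release notes do not describe the judge's actual interface.

```typescript
// Hypothetical verdict type; the real judge output shape is not specified in these notes.
type Verdict = "pass" | "fail";

// Return the verdict that a strict majority of votes agree on.
// Ties (possible with an even number of votes) fall back to "fail" as a conservative default.
function majorityVote(votes: Verdict[]): Verdict {
  if (votes.length === 0) {
    throw new Error("majorityVote needs at least one vote");
  }
  const passCount = votes.filter((v) => v === "pass").length;
  return passCount * 2 > votes.length ? "pass" : "fail";
}

// Assumed design: the code check counts as one vote alongside the LLM judge runs.
function hybridJudge(codeCheck: Verdict, llmVotes: Verdict[]): Verdict {
  const allVotes: Verdict[] = [codeCheck].concat(llmVotes);
  return majorityVote(allVotes);
}

const hybridResult = hybridJudge("pass", ["pass", "fail"]);
console.log(hybridResult); // "pass": two of the three votes agree
```

With an odd number of votes, a single dissenting judge run cannot flip the outcome, which is the variance reduction the bullet describes.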
Mobile UX
- Dedicated MobileTabBar component for mobile navigation
- Mobile UX + live data flywheel
- NodeBench AI as default entity
- README updated with MobileTabBar section
Eval flywheel
- Multi-turn session eval + classification alignment — 14/18 categories at 100%
- Majority-vote judge + multi-turn sessions + ambient feedback — 13/18 at 100%
- Real LLM content generation, lens-aware content synthesis
- Eval climbed from 82% → 84% → 88% → 90% → 92.3% over the cycle
Tool graph + telemetry
- Real tool graph data: 365 nodes, 56 domains, 935 edges from live registry
- Tool coverage proof + contextual graph visualization
- Telemetry components published in nodebench-mcp@2.65.0 + 2.68.0
Founder surface stability
- Solid dark backgrounds for all founder components
- Force dark mode on founder surfaces (light-mode compatibility fix)
- Restore linter-deleted founder_local_synthesize + weekly_reset tools
Other fixes
- Updated contradiction fallback to NodeBench-specific text
- Use Gemini 3.1 Flash (not 2.5) for hard scenario judge
- Linkup search + ResultPacket mapping + eval flywheel 85.4%
Note
This release reflects state at the v2.32.0 git tag. Today's stabilization sprint (Apr 27 2026) shipped substantial follow-on work — see CHANGELOG.md or recent merged PRs for details. A future release will cover that work.
v2.31.0 — Scrapling Web Scraping + Research Optimizer + LinkedIn API Fix
What's New
Scrapling Web Scraping Integration (7 new tools)
- scrapling_fetch: Adaptive URL fetching with 3 tiers (http/stealth/dynamic)
- scrapling_extract: CSS/XPath structured data extraction (zero LLM tokens)
- scrapling_batch_fetch: Parallel multi-URL fetching (up to 20 URLs)
- scrapling_track_element: Track elements across page changes (survives DOM restructuring)
- scrapling_crawl / scrapling_crawl_status / scrapling_crawl_stop: Multi-page spider with session management
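Because scrapling_batch_fetch accepts up to 20 URLs per call, a longer URL list has to be split before it is submitted. A minimal sketch of that splitting follows; `chunkUrls` and `MAX_BATCH_SIZE` are hypothetical helper names, not part of the package.

```typescript
// Limit taken from the release notes: up to 20 URLs per batch fetch.
const MAX_BATCH_SIZE = 20;

// Split a URL list into consecutive groups of at most `size` entries.
function chunkUrls(urls: string[], size: number = MAX_BATCH_SIZE): string[][] {
  if (size < 1) {
    throw new Error("size must be at least 1");
  }
  const groups: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    groups.push(urls.slice(i, i + size));
  }
  return groups;
}

// Example: 45 placeholder URLs become three batches.
const sampleUrls: string[] = [];
for (let i = 0; i < 45; i++) {
  sampleUrls.push("https://example.com/page/" + i);
}
const urlBatches = chunkUrls(sampleUrls);
console.log(urlBatches.map((b) => b.length)); // [20, 20, 5]
```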
Research Optimizer (new domain)
- merge_research_sources: Join datasets from multiple agent sources
- score_research_quality: Score and rank merged research by configurable criteria
LinkedIn API Fix
- Critical: Fixed silent truncation of LinkedIn posts from the first ( parenthesis character
- cleanLinkedInText() now replaces parentheses with brackets before API calls
- Added comprehensive character sanitization (pipe, box-drawing, zero-width chars)
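The sanitization described above might look roughly like the sketch below. This is not the actual cleanLinkedInText() implementation: the function name `sanitizeForLinkedIn` is hypothetical, and the choice to replace pipes with a hyphen and to delete box-drawing and zero-width characters outright is an assumption.

```typescript
// Hypothetical sketch of LinkedIn-safe text cleaning; not the project's real function.
function sanitizeForLinkedIn(text: string): string {
  return text
    .replace(/\(/g, "[") // parentheses become brackets (per the release notes)
    .replace(/\)/g, "]")
    .replace(/\|/g, "-") // assumption: pipes are swapped for a hyphen
    .replace(/[\u2500-\u257F]/g, "") // assumption: box-drawing characters are dropped
    .replace(/[\u200B-\u200D\uFEFF]/g, ""); // assumption: zero-width characters are dropped
}

const cleanedPost = sanitizeForLinkedIn("Launch (beta) | notes\u200B");
console.log(cleanedPost); // "Launch [beta] - notes"
```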
Infrastructure
- Python bridge server for Scrapling (scrapling_bridge on port 8008)
- Docker Compose service + Dockerfile for containerized deployment
- Added web_scraping domain to research, data, multi_agent, web_dev presets
- 2 new workflow chains: competitive_intel, price_monitor
Install / Upgrade
npm install -g nodebench-mcp@2.31.0
# or
npx nodebench-mcp@2.31.0
Stats
- 260 tools across 49 domains
- 10 presets: default (54), web_dev (106+), research (71+), data (78+), devops (68), mobile (95), academic (86), multi_agent (102+), content (77), full (260)
v2.30.0 — Headless API-First Agentic Engine
New: Engine API (--engine flag)
- HTTP server on port 6276 with 12 REST endpoints
- Execute any tool via POST /api/tools/:name
- Run any of 32 workflow chains via POST /api/workflows/:name
- SSE streaming for real-time step-by-step progress
- Session management with preset-scoped tool gating
- Conformance Reports: deterministic A-F grades on workflow quality
- Bearer token auth via --engine-secret or ENGINE_SECRET env var
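For callers working from TypeScript, a tool call against the endpoints above could be assembled as in this sketch. `buildToolRequest` and `ToolRequest` are hypothetical names, the body shape mirrors the curl examples in these notes, and the `Authorization: Bearer <secret>` header is the conventional bearer-token format, assumed here rather than confirmed by the notes.

```typescript
// Hypothetical request description; pass its fields to fetch() or any HTTP client.
interface ToolRequest {
  url: string;
  method: "POST";
  headers: { [key: string]: string };
  body: string;
}

function buildToolRequest(
  toolName: string,
  args: { [key: string]: unknown },
  preset: string,
  secret?: string,
  baseUrl: string = "http://127.0.0.1:6276" // default engine port from the notes
): ToolRequest {
  const headers: { [key: string]: string } = { "Content-Type": "application/json" };
  if (secret) {
    // Assumption: the engine reads the standard bearer-token header.
    headers["Authorization"] = "Bearer " + secret;
  }
  return {
    url: baseUrl + "/api/tools/" + encodeURIComponent(toolName),
    method: "POST",
    headers: headers,
    body: JSON.stringify({ args: args, preset: preset }),
  };
}

const toolReq = buildToolRequest("discover_tools", { query: "security audit" }, "full", "my-secret");
console.log(toolReq.url); // http://127.0.0.1:6276/api/tools/discover_tools
```

The result can be sent with `fetch(toolReq.url, { method: toolReq.method, headers: toolReq.headers, body: toolReq.body })` in any runtime that provides `fetch`.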
New: Engine Demo View (/engine-demo)
- Ultra minimal one-page CRUD interface
- 5 panels: Spec (CRUD), Trace (live events), Scoreboard (grades), Timeline (execution bars), Publish (curl + reports)
- Pure JetBrains Mono typography, thin dividers, maximum whitespace
Quick Start
# Install and start engine
npx nodebench-mcp --engine
# Execute a tool
curl -X POST http://127.0.0.1:6276/api/tools/discover_tools \
-H "Content-Type: application/json" \
-d '{"args": {"query": "security audit"}, "preset": "full"}'
# Run a workflow with streaming
curl -N -X POST http://127.0.0.1:6276/api/workflows/fix_bug \
-H "Content-Type: application/json" \
-d '{"preset": "web_dev", "streaming": true}'
# Get conformance report
curl http://127.0.0.1:6276/api/sessions/{id}/report
Conformance Report Grades
Every workflow execution scores 8 checks: step completeness, quality gate, test layers, flywheel, learnings, recon, verification, and error-free execution.
A (90+) / B (75+) / C (60+) / D (40+) / F (<40)
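The grade bands above translate directly into a lookup. In the sketch below, `toGrade` follows the published thresholds, while `scoreChecks` assumes that each of the 8 checks carries equal weight; the notes do not state how the checks are weighted, so treat that part as illustrative only.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// Thresholds from the release notes: A (90+) / B (75+) / C (60+) / D (40+) / F (<40).
function toGrade(score: number): Grade {
  if (score >= 90) return "A";
  if (score >= 75) return "B";
  if (score >= 60) return "C";
  if (score >= 40) return "D";
  return "F";
}

// Assumption: every check is weighted equally; the real weighting is not documented here.
function scoreChecks(passed: boolean[]): number {
  if (passed.length === 0) return 0;
  const passCount = passed.filter((p) => p).length;
  return Math.round((passCount / passed.length) * 100);
}

// Example: 7 of 8 checks pass.
const exampleChecks = [true, true, true, true, true, true, true, false];
const exampleScore = scoreChecks(exampleChecks);
console.log(exampleScore, toGrade(exampleScore)); // 88 "B"
```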
Files Added
- packages/mcp-local/src/engine/session.ts
- packages/mcp-local/src/engine/conformance.ts
- packages/mcp-local/src/engine/server.ts
- src/features/engine/views/EngineDemoView.tsx
npm: nodebench-mcp@2.30.0
v2.28.0 — LLM-as-a-Judge QA + 85-Round Flywheel
What's New
LLM-as-a-Judge QA Scoring System
Replaced 120+ regex false-positive patterns with Gemini 2.0 Flash semantic classification. The judge classifies QA issues into 4 categories:
- genuine_bug: real functional/visual defects
- design_opinion: subjective design preferences, not bugs
- screenshot_artifact: compression artifacts misread as issues
- mock_data: placeholder content flagged as incomplete
Expanded Deterministic Metrics (Layer 1)
7 → 12 boolean Playwright checks:
- no_layout_shift (CLS post-hydration, Google 0.25 threshold)
- no_404_resources (static asset 404s)
- no_slow_resources (>5s load time)
- no_mixed_content (HTTP on HTTPS)
- viewport_meta_ok (viewport meta validation)
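Three of the new checks can be expressed as pure functions over a list of loaded resources, as sketched below. The `ResourceRecord` shape and `runResourceChecks` are hypothetical; in practice the records would be collected by the browser automation layer, and the real checks in runDogfoodGeminiQa.mjs may differ.

```typescript
// Hypothetical record of one loaded resource; field names are assumptions.
interface ResourceRecord {
  url: string;
  status: number;
  loadTimeMs: number;
}

// Evaluate boolean checks using the thresholds named in the release notes.
function runResourceChecks(
  pageUrl: string,
  resources: ResourceRecord[]
): { [check: string]: boolean } {
  const pageIsHttps = pageUrl.indexOf("https://") === 0;
  return {
    // Fails if any resource returned 404.
    no_404_resources: resources.every((r) => r.status !== 404),
    // Fails if any resource took longer than 5 seconds.
    no_slow_resources: resources.every((r) => r.loadTimeMs <= 5000),
    // Fails if an HTTPS page loaded any plain-HTTP resource.
    no_mixed_content: !pageIsHttps || resources.every((r) => r.url.indexOf("http://") !== 0),
  };
}

const checkResults = runResourceChecks("https://example.com/", [
  { url: "https://example.com/app.js", status: 200, loadTimeMs: 320 },
  { url: "http://example.com/legacy.png", status: 200, loadTimeMs: 6100 },
]);
console.log(checkResults);
// { no_404_resources: true, no_slow_resources: false, no_mixed_content: false }
```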
3-Layer Scoring Rebalance
- Layer 1 (60%): 12 deterministic boolean metrics — zero variance
- Layer 2 (30%): Severity rubric on LLM-judged genuine issues only
- Layer 3 (10%): Legacy P-level taste deductions
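The 60/30/10 split above can be read as a weighted sum. The sketch below assumes each layer is first normalized to a fraction between 0 and 1; the `LayerScores` shape and `combinedScore` are hypothetical, and the actual scoring engine may apply deductions differently.

```typescript
// Hypothetical per-layer results, each normalized to the range 0..1.
interface LayerScores {
  deterministic: number; // Layer 1: fraction of the 12 boolean checks that pass
  severity: number; // Layer 2: severity-rubric score on LLM-judged genuine issues
  taste: number; // Layer 3: legacy P-level taste score
}

// Weights from the release notes: 60% / 30% / 10%, expressed as points out of 100.
function combinedScore(s: LayerScores): number {
  return Math.round(60 * s.deterministic + 30 * s.severity + 10 * s.taste);
}

// Example: 11 of 12 deterministic checks pass, other layers are clean.
const exampleTotal = combinedScore({ deterministic: 11 / 12, severity: 1, taste: 1 });
console.log(exampleTotal); // 95
```

Under this reading, one failing Layer 1 check costs 5 points, which is why fixing permanent Layer 1 failures matters more than tuning the taste layer.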
QA Score Journey: R50→R85
R50: 56/100 (regex-only era, 120+ patterns)
R78: 85/100 (first LLM judge run)
R82: 82/100 (Layer 1 permanent failures fixed)
R84: 97/100 (judge tuning converging)
R85: 100/100 ✅ (perfect score — all 3 layers maxed)
148-Screenshot Gallery
Full responsive matrix: dark desktop, light desktop, dark mobile, light mobile — covering all 37 routes.
UI Polish (200+ components)
- Motion-safe animations (111 files)
- Dark mode gap fixes (research, analytics, narrative, onboarding)
- Jargon purge (ISO dates, label consistency)
- Accessibility (focus rings, ARIA labels)
Packages
Key Files
- scripts/ui/runDogfoodGeminiQa.mjs: Complete scoring engine rewrite
- .claude/rules/gemini_qa_loop.md: QA loop integration rule
- public/dogfood/qa-results.json: 85-round score history
- public/dogfood/screenshots/: 148 screenshots
Full Changelog: v2.27.0...v2.28.0