Releases: HomenShum/nodebench-ai

v2.32.0 — Mobile UX + 4-Layer Grounding + Search Quality 92.5%

28 Apr 03:25

Closing the orphan-tag gap: v2.32.0 was tagged but no release was published.

Highlights

Search quality + grounding (anti-hallucination)

  • 4-layer grounding pipeline: claim verification + citation chain + grounded judge
  • Search quality 92.5% (49/53) on the 100-query eval corpus
  • Multi-step chain eval: 8/8 chains, 32/32 steps, 100% pass rate
  • Hybrid code+LLM judge with majority vote — eliminates flake from single-judge variance
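
The majority-vote idea can be sketched as follows. This is a minimal illustration, not the project's judge: the `Verdict` labels, the function names, and the rule that a failed deterministic check short-circuits the LLM votes are all assumptions made for the example.

```typescript
// Hypothetical verdict labels; the release notes do not name them.
type Verdict = "pass" | "fail";

// Returns the verdict held by a strict majority of judges,
// or null on a tie (including an empty vote list).
function majorityVote(votes: Verdict[]): Verdict | null {
  let pass = 0;
  for (const v of votes) {
    if (v === "pass") pass += 1;
  }
  const fail = votes.length - pass;
  if (pass > fail) return "pass";
  if (fail > pass) return "fail";
  return null;
}

// Assumed hybrid arrangement: a deterministic code check gates the result,
// and the LLM judges are only consulted when that check passes.
// Ties are treated as "fail" to stay conservative (also an assumption).
function hybridJudge(codeCheckPassed: boolean, llmVotes: Verdict[]): Verdict {
  if (!codeCheckPassed) return "fail";
  const voted = majorityVote(llmVotes);
  return voted === null ? "fail" : voted;
}
```

Using an odd number of judges avoids ties entirely, which is why three or five votes is a common choice for this pattern.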

Mobile UX

  • Dedicated MobileTabBar component for mobile navigation
  • Mobile UX + live data flywheel
  • NodeBench AI as default entity
  • README updated with MobileTabBar section

Eval flywheel

  • Multi-turn session eval + classification alignment — 14/18 categories at 100%
  • Majority-vote judge + multi-turn sessions + ambient feedback — 13/18 at 100%
  • Real LLM content generation, lens-aware content synthesis
  • Eval climbed from 82% → 84% → 88% → 90% → 92.3% over the cycle

Tool graph + telemetry

  • Real tool graph data: 365 nodes, 56 domains, 935 edges from live registry
  • Tool coverage proof + contextual graph visualization
  • Telemetry components published in nodebench-mcp@2.65.0 + 2.68.0

Founder surface stability

  • Solid dark backgrounds for all founder components
  • Force dark mode on founder surfaces (light-mode compatibility fix)
  • Restore linter-deleted founder_local_synthesize + weekly_reset tools

Other fixes

  • Updated contradiction fallback to NodeBench-specific text
  • Use Gemini 3.1 Flash (not 2.5) for hard scenario judge
  • Linkup search + ResultPacket mapping + eval flywheel 85.4%

Note

This release reflects state at the v2.32.0 git tag. Today's stabilization sprint (Apr 27 2026) shipped substantial follow-on work — see CHANGELOG.md or recent merged PRs for details. A future release will cover that work.

v2.31.0 — Scrapling Web Scraping + Research Optimizer + LinkedIn API Fix

06 Mar 02:41

What's New

Scrapling Web Scraping Integration (7 new tools)

  • scrapling_fetch — Adaptive URL fetching with 3 tiers (http/stealth/dynamic)
  • scrapling_extract — CSS/XPath structured data extraction (zero LLM tokens)
  • scrapling_batch_fetch — Parallel multi-URL fetching (up to 20 URLs)
  • scrapling_track_element — Track elements across page changes (survives DOM restructuring)
  • scrapling_crawl / scrapling_crawl_status / scrapling_crawl_stop — Multi-page spider with session management

Research Optimizer (new domain)

  • merge_research_sources — Join datasets from multiple agent sources
  • score_research_quality — Score and rank merged research by configurable criteria

LinkedIn API Fix

  • Critical: Fixed silent truncation of LinkedIn posts at the first "(" character
  • cleanLinkedInText() now replaces parentheses with brackets before API calls
  • Added comprehensive character sanitization (pipe, box-drawing, zero-width chars)
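
A sanitizer along these lines might look like the sketch below. It is not the repository's `cleanLinkedInText()`; the name `sanitizeLinkedInText`, the choice to replace pipes with hyphens, and the exact zero-width code points are assumptions for illustration.

```typescript
// Hypothetical sketch of the sanitization described above.
function sanitizeLinkedInText(text: string): string {
  return text
    .replace(/\(/g, "[") // parentheses become brackets
    .replace(/\)/g, "]")
    .replace(/\|/g, "-") // pipe replaced with a hyphen (assumed replacement)
    .replace(/[\u2500-\u257F]/g, "") // strip the Unicode box-drawing block
    .replace(/[\u200B-\u200D\u2060\uFEFF]/g, ""); // strip zero-width characters
}
```

For example, `sanitizeLinkedInText("Launch (beta) | v2")` yields `"Launch [beta] - v2"`.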

Infrastructure

  • Python bridge server for Scrapling (scrapling_bridge on port 8008)
  • Docker Compose service + Dockerfile for containerized deployment
  • Added web_scraping domain to research, data, multi_agent, web_dev presets
  • 2 new workflow chains: competitive_intel, price_monitor

Install / Upgrade

npm install -g nodebench-mcp@2.31.0
# or
npx nodebench-mcp@2.31.0

Stats

  • 260 tools across 49 domains
  • 10 presets: default (54), web_dev (106+), research (71+), data (78+), devops (68), mobile (95), academic (86), multi_agent (102+), content (77), full (260)

v2.30.0 — Headless API-First Agentic Engine

05 Mar 02:24

New: Engine API (--engine flag)

  • HTTP server on port 6276 with 12 REST endpoints
  • Execute any tool via POST /api/tools/:name
  • Run any of 32 workflow chains via POST /api/workflows/:name
  • SSE streaming for real-time step-by-step progress
  • Session management with preset-scoped tool gating
  • Conformance Reports: deterministic A-F grades on workflow quality
  • Bearer token auth via --engine-secret or ENGINE_SECRET env var

New: Engine Demo View (/engine-demo)

  • Ultra minimal one-page CRUD interface
  • 5 panels: Spec (CRUD), Trace (live events), Scoreboard (grades), Timeline (execution bars), Publish (curl + reports)
  • Pure JetBrains Mono typography, thin dividers, maximum whitespace

Quick Start

# Install and start engine
npx nodebench-mcp --engine

# Execute a tool
curl -X POST http://127.0.0.1:6276/api/tools/discover_tools \
  -H "Content-Type: application/json" \
  -d '{"args": {"query": "security audit"}, "preset": "full"}'

# Run a workflow with streaming
curl -N -X POST http://127.0.0.1:6276/api/workflows/fix_bug \
  -H "Content-Type: application/json" \
  -d '{"preset": "web_dev", "streaming": true}'

# Get conformance report
curl http://127.0.0.1:6276/api/sessions/{id}/report
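
The curl calls above can be mirrored from TypeScript. The sketch below only builds the request so it stays runnable without a live engine. `EngineRequest` and `buildToolRequest` are hypothetical names, the body shape is copied from the Quick Start example, and the `Authorization: Bearer <token>` header is the conventional form, assumed here because the notes only say "Bearer token auth".

```typescript
// Hypothetical request description; not a type exported by nodebench-mcp.
interface EngineRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds a POST /api/tools/:name request matching the Quick Start curl call.
function buildToolRequest(
  toolName: string,
  args: Record<string, unknown>,
  preset: string,
  secret?: string,
  baseUrl: string = "http://127.0.0.1:6276"
): EngineRequest {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
  };
  // Assumed header format for the --engine-secret / ENGINE_SECRET token.
  if (secret) headers["Authorization"] = "Bearer " + secret;
  return {
    url: baseUrl + "/api/tools/" + encodeURIComponent(toolName),
    method: "POST",
    headers: headers,
    body: JSON.stringify({ args: args, preset: preset }),
  };
}
```

The result can be handed to any HTTP client, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.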

Conformance Report Grades

Every workflow execution scores 8 checks: step completeness, quality gate, test layers, flywheel, learnings, recon, verification, and error-free execution.

A (90+) / B (75+) / C (60+) / D (40+) / F (<40)
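
The grade bands translate directly into a lookup. In the sketch below, `gradeFor` follows the thresholds stated above; `equalWeightScore` is purely an assumption for illustration, since the notes do not say how the 8 checks are weighted.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// Maps a 0-100 conformance score to the letter bands listed above.
function gradeFor(score: number): Grade {
  if (score >= 90) return "A";
  if (score >= 75) return "B";
  if (score >= 60) return "C";
  if (score >= 40) return "D";
  return "F";
}

// Assumption: every check carries equal weight. The real weighting may differ.
function equalWeightScore(checks: boolean[]): number {
  if (checks.length === 0) return 0;
  const passed = checks.filter((ok) => ok).length;
  return (passed / checks.length) * 100;
}
```

Under the equal-weight assumption, passing 7 of 8 checks scores 87.5 and lands in the B band.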

Files Added

  • packages/mcp-local/src/engine/session.ts
  • packages/mcp-local/src/engine/conformance.ts
  • packages/mcp-local/src/engine/server.ts
  • src/features/engine/views/EngineDemoView.tsx

npm: nodebench-mcp@2.30.0

v2.28.0 — LLM-as-a-Judge QA + 85-Round Flywheel

20 Feb 11:15

What's New

LLM-as-a-Judge QA Scoring System

Replaced 120+ regex false-positive patterns with Gemini 2.0 Flash semantic classification. The judge classifies QA issues into 4 categories:

  • genuine_bug — real functional/visual defects
  • design_opinion — subjective design preferences, not bugs
  • screenshot_artifact — compression artifacts misread as issues
  • mock_data — placeholder content flagged as incomplete

Expanded Deterministic Metrics (Layer 1)

7 → 12 boolean Playwright checks. The five new checks:

  • no_layout_shift (CLS post-hydration, Google 0.25 threshold)
  • no_404_resources (static asset 404s)
  • no_slow_resources (>5s load time)
  • no_mixed_content (HTTP on HTTPS)
  • viewport_meta_ok (viewport meta validation)

3-Layer Scoring Rebalance

  • Layer 1 (60%): 12 deterministic boolean metrics — zero variance
  • Layer 2 (30%): Severity rubric on LLM-judged genuine issues only
  • Layer 3 (10%): Legacy P-level taste deductions
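
The 60/30/10 split amounts to a weighted sum. The sketch below assumes each layer is first normalized to a 0-100 score and that Layer 1 is the fraction of boolean checks that pass; the names `LayerScores`, `deterministicScore`, and `combinedQaScore` are hypothetical and the actual scoring engine may combine layers differently.

```typescript
// Each layer is assumed to be normalized to the range 0-100.
interface LayerScores {
  deterministic: number; // Layer 1: boolean Playwright checks
  severity: number; // Layer 2: severity rubric on LLM-judged genuine issues
  taste: number; // Layer 3: legacy P-level taste deductions
}

// Assumption: Layer 1 is the share of checks that passed, scaled to 100.
function deterministicScore(checks: boolean[]): number {
  if (checks.length === 0) return 0;
  const passed = checks.filter((ok) => ok).length;
  return (passed / checks.length) * 100;
}

// Weighted sum using the 60/30/10 split. Integer weights are applied before
// dividing to limit floating-point drift.
function combinedQaScore(layers: LayerScores): number {
  return (
    (60 * layers.deterministic + 30 * layers.severity + 10 * layers.taste) / 100
  );
}
```

With this weighting, failing 2 of 12 deterministic checks costs about 10 points even when the other two layers are perfect, which reflects how heavily the zero-variance layer dominates.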

QA Score Journey: R50→R85

R50: 56/100 (regex-only era, 120+ patterns)
R78: 85/100 (first LLM judge run)
R82: 82/100 (Layer 1 permanent failures fixed)
R84: 97/100 (judge tuning converging)
R85: 100/100 ✅ (perfect score — all 3 layers maxed)

148-Screenshot Gallery

Full responsive matrix: dark desktop, light desktop, dark mobile, light mobile — covering all 37 routes.

UI Polish (200+ components)

  • Motion-safe animations (111 files)
  • Dark mode gap fixes (research, analytics, narrative, onboarding)
  • Jargon purge (ISO dates, label consistency)
  • Accessibility (focus rings, ARIA labels)

Packages

  • nodebench-mcp@2.28.0 (npm)
  • @homenshum/convex-mcp-nodebench@0.10.1 (npm)

Key Files

  • scripts/ui/runDogfoodGeminiQa.mjs — Complete scoring engine rewrite
  • .claude/rules/gemini_qa_loop.md — QA loop integration rule
  • public/dogfood/qa-results.json — 85-round score history
  • public/dogfood/screenshots/ — 148 screenshots

Full Changelog: v2.27.0...v2.28.0