Releases · HomenShum/nodebench-ai

28 Apr 03:25

HomenShum

v2.32.0

76a0665

v2.32.0 — Mobile UX + 4-Layer Grounding + Search Quality 92.5%

Latest

Closing the orphan-tag gap: v2.32.0 was tagged but no release was published.

Highlights

Search quality + grounding (anti-hallucination)

4-layer grounding pipeline: claim verification + citation chain + grounded judge
Search quality 92.5% (49/53) on the 100-query eval corpus
Multi-step chain eval: 8/8 chains, 32/32 steps, 100% pass rate
Hybrid code+LLM judge with majority vote — eliminates flake from single-judge variance
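
As a rough illustration of the majority-vote idea, the sketch below combines a deterministic code check with several LLM verdicts. The names `Verdict`, `majorityVote`, and `hybridJudge` are hypothetical, and gating the LLM votes on the code check is an assumption; the release notes do not describe how the two signals are actually combined.

```typescript
// Hypothetical types and names; not taken from the NodeBench codebase.
type Verdict = "pass" | "fail";

// Return "pass" only when a strict majority of judges voted "pass".
// Ties (possible with an even number of judges) resolve to "fail" so an
// inconclusive vote never counts as a pass.
function majorityVote(verdicts: Verdict[]): Verdict {
  if (verdicts.length === 0) {
    throw new Error("majorityVote needs at least one verdict");
  }
  const passes = verdicts.filter((v) => v === "pass").length;
  return passes * 2 > verdicts.length ? "pass" : "fail";
}

// Assumed combination rule: the deterministic code check gates the LLM votes.
function hybridJudge(codeCheckPassed: boolean, llmVerdicts: Verdict[]): Verdict {
  if (!codeCheckPassed) return "fail";
  return majorityVote(llmVerdicts);
}
```

With an odd number of judges, one dissenting verdict can no longer flip the outcome, which is the source of the variance reduction.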

Mobile UX

Dedicated MobileTabBar component for mobile navigation
Mobile UX + live data flywheel
NodeBench AI as default entity
README updated with MobileTabBar section

Eval flywheel

Multi-turn session eval + classification alignment — 14/18 categories at 100%
Majority-vote judge + multi-turn sessions + ambient feedback — 13/18 at 100%
Real LLM content generation, lens-aware content synthesis
Eval climbed from 82% → 84% → 88% → 90% → 92.3% over the cycle

Tool graph + telemetry

Real tool graph data: 365 nodes, 56 domains, 935 edges from live registry
Tool coverage proof + contextual graph visualization
Telemetry components published in nodebench-mcp@2.65.0 + 2.68.0

Founder surface stability

Solid dark backgrounds for all founder components
Force dark mode on founder surfaces (light-mode compatibility fix)
Restore linter-deleted founder_local_synthesize + weekly_reset tools

Other fixes

Updated contradiction fallback to NodeBench-specific text
Use Gemini 3.1 Flash (not 2.5) for hard scenario judge
Linkup search + ResultPacket mapping + eval flywheel 85.4%

Note

This release reflects state at the v2.32.0 git tag. Today's stabilization sprint (Apr 27 2026) shipped substantial follow-on work — see CHANGELOG.md or recent merged PRs for details. A future release will cover that work.

06 Mar 02:41

HomenShum

v2.31.0

e6d820a

v2.31.0 — Scrapling Web Scraping + Research Optimizer + LinkedIn API Fix

What's New

Scrapling Web Scraping Integration (7 new tools)

scrapling_fetch — Adaptive URL fetching with 3 tiers (http/stealth/dynamic)
scrapling_extract — CSS/XPath structured data extraction (zero LLM tokens)
scrapling_batch_fetch — Parallel multi-URL fetching (up to 20 URLs)
scrapling_track_element — Track elements across page changes (survives DOM restructuring)
scrapling_crawl / scrapling_crawl_status / scrapling_crawl_stop — Multi-page spider with session management

Research Optimizer (new domain)

merge_research_sources — Join datasets from multiple agent sources
score_research_quality — Score and rank merged research by configurable criteria

LinkedIn API Fix

Critical: Fixed silent truncation of LinkedIn posts at the first "(" character
cleanLinkedInText() now replaces parentheses with brackets before API calls
Added comprehensive character sanitization (pipe, box-drawing, zero-width chars)
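
The kind of sanitization described above might look like the following sketch. It is not the actual `cleanLinkedInText()` implementation: the function name `sanitizeLinkedInText` is hypothetical, "brackets" is assumed to mean square brackets, and the handling of pipes (replaced with hyphens) and of box-drawing and zero-width characters (removed) is an assumption.

```typescript
// Hypothetical sketch; the real cleanLinkedInText() may behave differently.
function sanitizeLinkedInText(text: string): string {
  return text
    .replace(/\(/g, "[") // parentheses become square brackets (assumed meaning of "brackets")
    .replace(/\)/g, "]")
    .replace(/\|/g, "-") // assumption: pipes become hyphens
    .replace(/[\u2500-\u257F]/g, "") // remove the Unicode box-drawing block
    .replace(/[\u200B-\u200D\u2060\uFEFF]/g, ""); // remove zero-width characters
}
```

For example, `sanitizeLinkedInText("Launch (beta) | v2")` yields `Launch [beta] - v2`.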

Infrastructure

Python bridge server for Scrapling (scrapling_bridge on port 8008)
Docker Compose service + Dockerfile for containerized deployment
Added web_scraping domain to research, data, multi_agent, web_dev presets
2 new workflow chains: competitive_intel, price_monitor

Install / Upgrade

npm install -g nodebench-mcp@2.31.0
# or
npx nodebench-mcp@2.31.0

Stats

260 tools across 49 domains
10 presets: default (54), web_dev (106+), research (71+), data (78+), devops (68), mobile (95), academic (86), multi_agent (102+), content (77), full (260)

05 Mar 02:24

HomenShum

v2.30.0

e6d820a

v2.30.0 — Headless API-First Agentic Engine

Headless API-First Agentic Engine

New: Engine API (--engine flag)

HTTP server on port 6276 with 12 REST endpoints
Execute any tool via POST /api/tools/:name
Run any of 32 workflow chains via POST /api/workflows/:name
SSE streaming for real-time step-by-step progress
Session management with preset-scoped tool gating
Conformance Reports: deterministic A-F grades on workflow quality
Bearer token auth via --engine-secret or ENGINE_SECRET env var
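
For callers working from TypeScript rather than curl, a tool-execution request could be assembled roughly as below. The body shape (`args`, `preset`) mirrors the Quick Start curl example; the helper name `buildToolRequest`, the `EngineRequest` type, and the use of a standard `Authorization: Bearer` header are assumptions, since the notes only say "Bearer token auth".

```typescript
// Hypothetical helper; only the endpoint path, port, and body shape come from the release notes.
interface EngineRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildToolRequest(
  toolName: string,
  args: Record<string, unknown>,
  preset: string,
  secret?: string, // value passed via --engine-secret or ENGINE_SECRET, if auth is enabled
  baseUrl: string = "http://127.0.0.1:6276",
): EngineRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (secret) {
    // Assumed header format for bearer token auth.
    headers["Authorization"] = `Bearer ${secret}`;
  }
  return {
    url: `${baseUrl}/api/tools/${encodeURIComponent(toolName)}`,
    init: { method: "POST", headers, body: JSON.stringify({ args, preset }) },
  };
}
```

The result can be passed to `fetch(req.url, req.init)` in any runtime that provides `fetch`.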

New: Engine Demo View (/engine-demo)

Ultra-minimal one-page CRUD interface
5 panels: Spec (CRUD), Trace (live events), Scoreboard (grades), Timeline (execution bars), Publish (curl + reports)
Pure JetBrains Mono typography, thin dividers, maximum whitespace

Quick Start

# Install and start engine
npx nodebench-mcp --engine

# Execute a tool
curl -X POST http://127.0.0.1:6276/api/tools/discover_tools \
  -H "Content-Type: application/json" \
  -d '{"args": {"query": "security audit"}, "preset": "full"}'

# Run a workflow with streaming
curl -N -X POST http://127.0.0.1:6276/api/workflows/fix_bug \
  -H "Content-Type: application/json" \
  -d '{"preset": "web_dev", "streaming": true}'

# Get conformance report (replace {id} with the session ID)
curl http://127.0.0.1:6276/api/sessions/{id}/report

Conformance Report Grades

Every workflow execution scores 8 checks: step completeness, quality gate, test layers, flywheel, learnings, recon, verification, and error-free execution.

A (90+) / B (75+) / C (60+) / D (40+) / F (<40)
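
The grade bands above translate directly into a threshold lookup. This is a minimal sketch assuming the score is on a 0 to 100 scale; the function name `gradeFor` is hypothetical, and how the 8 checks are weighted into that score is not stated in these notes, so it is left out.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// Map a conformance score (assumed 0-100) to the letter bands listed above.
function gradeFor(score: number): Grade {
  if (score >= 90) return "A";
  if (score >= 75) return "B";
  if (score >= 60) return "C";
  if (score >= 40) return "D";
  return "F";
}
```

Because the mapping is a pure function of the score, the same execution always receives the same grade, consistent with the "deterministic" description.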

Files Added

packages/mcp-local/src/engine/session.ts
packages/mcp-local/src/engine/conformance.ts
packages/mcp-local/src/engine/server.ts
src/features/engine/views/EngineDemoView.tsx

npm: nodebench-mcp@2.30.0

20 Feb 11:15

HomenShum

v2.28.0

713b2c2

v2.28.0 — LLM-as-a-Judge QA + 85-Round Flywheel

What's New

LLM-as-a-Judge QA Scoring System

Replaced 120+ regex false-positive patterns with Gemini 2.0 Flash semantic classification. The judge classifies QA issues into 4 categories:

genuine_bug — real functional/visual defects
design_opinion — subjective design preferences, not bugs
screenshot_artifact — compression artifacts misread as issues
mock_data — placeholder content flagged as incomplete

Expanded Deterministic Metrics (Layer 1)

Boolean Playwright checks expanded from 7 to 12. The five new checks:

no_layout_shift (CLS post-hydration, measured against Google's 0.25 "poor" threshold)
no_404_resources (static asset 404s)
no_slow_resources (>5s load time)
no_mixed_content (HTTP on HTTPS)
viewport_meta_ok (viewport meta validation)

3-Layer Scoring Rebalance

Layer 1 (60%): 12 deterministic boolean metrics — zero variance
Layer 2 (30%): Severity rubric on LLM-judged genuine issues only
Layer 3 (10%): Legacy P-level taste deductions
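
To make the 60/30/10 weighting concrete, here is a sketch of how the three layers could combine into a single score out of 100. The `LayerInputs` shape and the function name `qaScore` are hypothetical, and normalizing Layers 2 and 3 to a 0 to 1 range (1 meaning no deductions) is an assumption; the real engine in scripts/ui/runDogfoodGeminiQa.mjs may compute these differently.

```typescript
// Hypothetical input shape; only the 60/30/10 weights come from the release notes.
interface LayerInputs {
  checks: boolean[]; // Layer 1: deterministic boolean metrics
  severityScore: number; // Layer 2: 0..1, where 1 means no genuine issues (assumed normalization)
  tasteScore: number; // Layer 3: 0..1, where 1 means no taste deductions (assumed normalization)
}

function qaScore({ checks, severityScore, tasteScore }: LayerInputs): number {
  const clamp = (x: number): number => Math.min(1, Math.max(0, x));
  // Layer 1 is the fraction of boolean checks that passed.
  const layer1 = checks.length === 0 ? 0 : checks.filter(Boolean).length / checks.length;
  const total = 60 * layer1 + 30 * clamp(severityScore) + 10 * clamp(tasteScore);
  return Math.round(total);
}
```

Under these assumptions, a single failed check out of 12 costs 5 points, so Layer 1 dominates the result and contributes no run-to-run variance.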

QA Score Journey: R50→R85

R50: 56/100 (regex-only era, 120+ patterns)
R78: 85/100 (first LLM judge run)
R82: 82/100 (Layer 1 permanent failures fixed)
R84: 97/100 (judge tuning converging)
R85: 100/100 ✅ (perfect score — all 3 layers maxed)

148-Screenshot Gallery

Full responsive matrix: dark desktop, light desktop, dark mobile, light mobile — covering all 37 routes.

UI Polish (200+ components)

Motion-safe animations (111 files)
Dark mode gap fixes (research, analytics, narrative, onboarding)
Jargon purge (ISO dates, label consistency)
Accessibility (focus rings, ARIA labels)

Packages

nodebench-mcp@2.28.0 — npm
@homenshum/convex-mcp-nodebench@0.10.1 — npm

Key Files

scripts/ui/runDogfoodGeminiQa.mjs — Complete scoring engine rewrite
.claude/rules/gemini_qa_loop.md — QA loop integration rule
public/dogfood/qa-results.json — 85-round score history
public/dogfood/screenshots/ — 148 screenshots

Full Changelog: v2.27.0...v2.28.0
