Nova Sonic Test Harness

An end-to-end test framework for evaluating Amazon Nova Sonic (speech-to-speech model) via AWS Bedrock's bidirectional streaming API. It simulates multi-turn conversations between a configurable user simulator (Bedrock LLMs or scripted inputs) and Nova Sonic, logs all interactions, and provides LLM-as-judge evaluation.

No microphone or speaker required — uses silent audio streaming or Amazon Polly TTS for realistic audio input.

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure AWS credentials
cp .env.example .env
# Edit .env with your AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, etc.

# 3. Run a basic conversation
python main.py --config configs/example_basic.json

# 4. Run with tools
python main.py --config configs/example_with_tools.json

# 5. Batch test all configs in a directory
python main.py --scenarios-dir configs/order_status --parallel 2

Setup

Prerequisites

Python 3.12 (managed via mise, see .mise.toml)
AWS credentials with access to the services below

Required AWS IAM Permissions

Service	Permissions	When Needed
Amazon Bedrock	`bedrock:InvokeModel`, `bedrock:InvokeModelWithResponseStream`, `bedrock:InvokeModelWithBidirectionalStream`	Always (Nova Sonic streaming, user simulator, judge)
Amazon Polly	`polly:SynthesizeSpeech`	Only when `input_mode: "polly"`
Amazon S3	`s3:PutObject`, `s3:GetObject`, `s3:DeleteObject`, `s3:CreateBucket`	Only for audio-text consistency evaluation (`evaluate_audio_text.py`)
Amazon Transcribe	`transcribe:StartTranscriptionJob`, `transcribe:GetTranscriptionJob`, `transcribe:DeleteTranscriptionJob`	Only for audio-text consistency evaluation

Your IAM user/role also needs access to the specific Bedrock models configured in configs/models.yaml (Nova Sonic, Claude, etc.). Model access must be enabled in the Bedrock console for your region.

Install

pip install -r requirements.txt

Key dependencies:

Package	Purpose
`boto3`	AWS SDK for Bedrock Converse API (user sim, judge)
`aws_sdk_bedrock_runtime`	Bidirectional streaming with Nova Sonic
`rx` (RxPY)	Reactive streams for event-driven audio/text
`smithy-aws-core`	Low-level AWS signing for streaming
`pyyaml`	Config file loading
`python-dotenv`	`.env` file support

AWS Credentials

Copy .env.example to .env and fill in:

AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
AWS_DEFAULT_REGION=us-east-1

SONIC_MODEL_ID=amazon.nova-2-sonic-v1:0
SONIC_REGION=us-east-1

Environment variables override config file values for sonic_model_id, sonic_region, and sonic_endpoint_uri.

Running Tests

Single Conversation

python main.py --config configs/example_basic.json

With Tools Enabled

python main.py --config configs/example_with_tools.json

Scripted Mode (Pre-Recorded User Messages)

python main.py --config configs/example_scripted.json

Batch Testing (All Configs in a Directory)

python main.py --scenarios-dir configs/order_status --parallel 2

Repeat a Config Multiple Times

python main.py --config configs/claude_haiku_basic.json --repeat 5 --parallel 3

Override Config from CLI

python main.py --config configs/example_basic.json --max-turns 5 --no-evaluate

Evaluate an Existing Log

# Evaluate an existing conversation log
python evaluate_log.py logs/<session_dir>/ \
  --user-goal "Check order status" \
  --assistant-objective "Provide accurate order information" \
  --aspects "Goal Achievement,Tool Usage,Response Quality"

# Binary (YES/NO) evaluation
python evaluate_log.py logs/<session_dir>/ \
  --judge-type binary \
  --aspects "Goal Achievement,Tool Usage,Response Quality" \
  --rubrics '{"Goal Achievement": ["Did the agent find the order?", "Did the agent provide the status?"]}'

Audio-Text Consistency Check (Hallucination Detection)

python evaluate_audio_text.py \
  --log logs/<session_dir> \
  --s3-bucket my-transcription-bucket

Requires log_audio: true in the config. Transcribes each turn's audio via Amazon Transcribe and compares against the text output to detect hallucinations. Audio files are uploaded to S3 keyed by the turn's Nova Sonic session ID.

Polly TTS Mode (Real Audio Input)

# Basic Polly TTS conversation
python main.py --config configs/example_polly.json

# Polly TTS with tools
python main.py --config configs/example_with_tools_polly.json

# Enable Polly via CLI overrides on any config
python main.py --config configs/example_basic.json \
  --input-mode polly --polly-voice Matthew --record-conversation --sonic-voice tiffany

Instead of sending text to Nova Sonic, Polly TTS synthesizes user simulator text into 16kHz audio and streams it as real speech input. Set record_conversation: true to save input/output audio as LPCM files and a merged stereo WAV.

Dataset-Driven Testing

Run test scenarios from external datasets (JSONL files or HuggingFace datasets). Each entry becomes a separate scripted session with optional ground truth for evaluation.

# Run a local JSONL dataset
python main.py --dataset datasets/example_customer_support.jsonl

# With parallelism
python main.py --dataset datasets/example_customer_support.jsonl --parallel 4

# With a base config (provides system prompt, tools, voice, etc.)
python main.py --dataset datasets/my_tests.jsonl --base-config configs/example_dataset.json

# HuggingFace dataset
python main.py --dataset hf://ScaleAI/audiomc

# Via config file (set dataset_path in the config)
python main.py --config configs/example_dataset.json

Dataset format (JSONL) — two styles auto-detected:

Style A — Turns array (standard):

{"id": "order_001", "category": "tool_usage", "turns": [{"role": "user", "content": "Check order ORD-999"}, {"role": "assistant", "content": "Your order is in transit.", "expected_tool_call": "getOrderStatus"}], "rubric": "Must use getOrderStatus tool", "config": {"sonic_system_prompt": "You are a support agent."}}

Style B — Wide format (AudioMC-compatible):

{"id": "music_001", "user_turn_1_transcript": "Play Bowie", "assistant_turn_1_transcript": "Here's Let's Dance", "user_turn_2_transcript": "Something mellow?", "assistant_turn_2_transcript": "Here's Space Oddity", "rubric": "Should remember artist across turns"}

Each entry can include:

id — unique identifier
category — test category/axis
rubric — evaluation criteria text (injected into judge prompt)
config — per-entry config overrides (e.g. system prompt, tools)
Expected responses and tool calls serve as ground truth for the LLM judge

Quick Run via Shell Script

./run_example.sh claude_haiku_basic
./run_example.sh gpt_oss_tools
./run_example.sh scripted

CLI Arguments

Argument	Description
`--config PATH`	Single configuration file
`--scenarios-dir DIR`	Directory of configs (mutually exclusive with `--config`)
`--dataset PATH`	JSONL dataset or `hf://org/dataset` (mutually exclusive with above)
`--base-config PATH`	Base config for dataset runs (used with `--dataset`)
`--pattern GLOB`	Config file pattern (default: `*.json`)
`--test-name NAME`	Override test name
`--max-turns N`	Override max turns
`--mode bedrock\|scripted`	Override interaction mode
`--parallel N`	Parallel session count (default: 1)
`--repeat N`	Repeat each config N times (default: 1)
`--no-evaluate`	Disable auto-evaluation
`--log-judge-prompts`	Log judge prompts and raw responses in evaluation output
`--sonic-voice ID`	Nova Sonic output voice (e.g. `matthew`, `tiffany`, `amy`)
`--input-mode text\|polly`	Audio input mode: `text` (default) or `polly` (TTS)
`--polly-voice ID`	Polly voice ID (e.g. `Matthew`, `Joanna`)
`--record-conversation`	Record full conversation audio (LPCM + stereo WAV)
`--output-dir DIR`	Output directory for results (default: `results`)
`--log-events`	Log all streaming events to JSONL (debug mode)

Configuration

Configurations are JSON or YAML files in configs/. All settings are defined in the TestConfig dataclass (config_manager.py).

Configuration Reference

{
  "test_name": "order_status_delayed",
  "description": "Test handling of delayed order inquiries",

  "sonic_model_id": "nova-sonic",
  "sonic_region": "us-east-1",
  "sonic_endpoint_uri": null,
  "sonic_system_prompt": "You are a customer service agent...",
  "sonic_inference_config": {
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.9
  },
  "sonic_tool_config": {
    "tools": [ { "toolSpec": { "..." } } ],
    "tool_choice": { "auto": {} }
  },

  "sonic_voice_id": "matthew",

  "user_model_id": "claude-haiku",
  "user_system_prompt": "You are an angry customer...",
  "user_max_tokens": 1024,
  "user_temperature": 1.0,

  "input_mode": "text",
  "polly_voice_id": "Matthew",
  "polly_engine": "neural",
  "polly_region": null,
  "record_conversation": false,

  "interaction_mode": "bedrock",
  "scripted_messages": [],
  "max_turns": 10,
  "turn_delay": 1.0,

  "log_audio": false,
  "log_text": true,
  "log_tools": true,
  "output_directory": "results",

  "auto_evaluate": true,
  "evaluation_criteria": {
    "user_goal": "Get answers about delayed order",
    "assistant_objective": "Provide accurate order info and de-escalate",
    "evaluation_aspects": [
      "Goal Achievement",
      "Conversation Flow",
      "Tool Usage Appropriateness",
      "Response Accuracy",
      "User Experience"
    ],
    "rubrics": {
      "Goal Achievement": [
        "Did the agent provide the order status?",
        "Did the agent offer a resolution?"
      ]
    }
  },

  "tool_registry_module": "tools.order_status_tools",

  "enable_session_continuation": true,
  "transition_threshold_seconds": 360.0,
  "audio_buffer_duration_seconds": 3.0
}

Configuration Fields

Field	Type	Default	Description
`test_name`	string	required	Name for this test scenario
`description`	string	`""`	Human-readable description
Nova Sonic
`sonic_model_id`	string	`"nova-sonic"`	Model alias or full Bedrock ID
`sonic_region`	string	`"us-east-1"`	AWS region for Nova Sonic
`sonic_endpoint_uri`	string	`null`	Custom endpoint URI (optional)
`sonic_system_prompt`	string	`"You are a helpful assistant."`	System prompt for Nova Sonic
`sonic_inference_config`	object	`{max_tokens:1024, temperature:0.7, top_p:0.9}`	Inference parameters
`sonic_tool_config`	object	`null`	Tool definitions and tool_choice
`sonic_voice_id`	string	`"matthew"`	Nova Sonic output voice (e.g. `matthew`, `tiffany`, `amy`)
User Simulator
`user_model_id`	string	`"claude-haiku"`	Model alias for user simulation
`user_system_prompt`	string	`null`	System prompt for the simulated user
`user_max_tokens`	int	`1024`	Max tokens for user messages
`user_temperature`	float	`1.0`	Temperature for user generation
Audio Input
`input_mode`	string	`"text"`	`"text"` (interactive text) or `"polly"` (Polly TTS audio)
`polly_voice_id`	string	`"Matthew"`	Polly voice ID for TTS synthesis
`polly_engine`	string	`"neural"`	Polly engine (`neural` or `standard`)
`polly_region`	string	`null`	AWS region for Polly (defaults to `sonic_region`)
`record_conversation`	bool	`false`	Record input/output audio (LPCM + stereo WAV)
Interaction
`interaction_mode`	string	`"bedrock"`	`"bedrock"` (LLM user) or `"scripted"` (predefined messages)
`scripted_messages`	list	`[]`	Messages for scripted mode
`max_turns`	int	`10`	Maximum conversation turns
`turn_delay`	float	`1.0`	Seconds between turns
Logging
`log_audio`	bool	`true`	Record audio to WAV files
`log_text`	bool	`true`	Log text transcripts
`log_tools`	bool	`true`	Log tool calls and results
`output_directory`	string	`"results"`	Base output directory
`log_events`	bool	`true`	Log all streaming events to JSONL
Evaluation
`auto_evaluate`	bool	`false`	Run LLM judge after conversation
`evaluation_criteria`	object	`null`	User goal, assistant objective, aspects, rubrics
`log_judge_prompts`	bool	`false`	Record judge prompt and raw response in evaluation output
Tools
`tool_registry_module`	string	`null`	Python module path for custom tools
Dataset
`dataset_path`	string	`null`	Path to JSONL dataset or `hf://org/dataset`
Session Continuation
`enable_session_continuation`	bool	`true`	Auto-reconnect on timeout
`transition_threshold_seconds`	float	`360.0`	When to start new session
`audio_buffer_duration_seconds`	float	`3.0`	Audio buffer for transitions

Included Configs

Config	Mode	Tools	Turns	User Model
`example_basic.json`	bedrock	No	5	claude-haiku
`example_with_tools.json`	bedrock	Yes (3 travel tools)	10	claude-haiku
`example_scripted.json`	scripted	Yes	1	(scripted)
`claude_haiku_basic.json`	bedrock	No	5	claude-haiku
`gpt_oss_tools.json`	bedrock	Yes	10	gpt-oss
`qwen_complex.json`	bedrock	No	10	qwen
`qwen_with_tools.json`	bedrock	Yes	10	qwen
`example_binary_eval.json`	bedrock	No	5	claude-haiku (binary judge)
`example_polly.json`	bedrock	No	5	claude-haiku (Polly TTS)
`example_with_tools_polly.json`	bedrock	Yes (3 travel tools)	1	claude-haiku (Polly TTS)
`example_dataset.json`	dataset	No	(per entry)	(scripted)
`order_status/*.json`	bedrock	Yes (lookupOrder)	4-5	claude-haiku

The order_status/ directory contains 5 scenario variants: aggressive, delayed, normal, processing, wrong_id — each with different user personas and expected order states.

Legacy Key Migration

Old claude_* keys are auto-migrated to user_*:

Old Key	New Key
`claude_model`	`user_model_id`
`claude_system_prompt`	`user_system_prompt`
`claude_max_tokens`	`user_max_tokens`
`claude_temperature`	`user_temperature`

Architecture

Core Pipeline Flow

                          ┌──────────────────┐
                          │   main.py        │
                          │ LiveInteraction  │
                          │    Session       │
                          └──────┬───────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                   │
              ▼                  ▼                   ▼
   ┌──────────────────┐  ┌─────────────┐  ┌──────────────────┐
   │ BedrockModelClient│  │SonicStream  │  │InteractionLogger │
   │ (user simulator)  │  │  Manager    │  │ (JSON + WAV)     │
   │                   │  │ (RxPY)      │  │                  │
   └──────────────────┘  └──────┬──────┘  └──────────────────┘
                                │
                         ┌──────┴──────┐
                         │ Session     │
                         │ Continuation│
                         │ Manager     │
                         └─────────────┘

main.py (LiveInteractionSession) orchestrates the turn loop
Each turn: user simulator generates text -> sent to Nova Sonic via streaming (as text, or synthesized to speech via Polly TTS) -> Nova Sonic responds with text/audio/tool-use events -> turn logged
After all turns, optionally runs LLM judge evaluation
Results organized into results/sessions/ by ResultsManager

Event-Driven Streaming (RxPY)

sonic_stream_manager.py uses reactive streams for bidirectional communication with Nova Sonic. The initialization sequence:

sessionStart (inference config)
promptStart (audio/text/tool output config, tool definitions)
System prompt (contentStart -> textInput -> contentEnd)
Audio contentStart (USER role, AUDIO type)
Silent audio streaming begins (512-frame chunks at 30ms intervals)

Events emitted by Nova Sonic and handled by the callback:

Event	Description
`completion_start`	Fires once per Nova Sonic session; captures `sonic_session_id`
`content_start`	New content block begins (tracks `generationStage` and `role`)
`text_output`	Text chunk (SPECULATIVE or FINAL stage)
`audio_output`	Audio chunk (base64 PCM)
`tool_use`	Tool call with name, ID, and parameters
`content_end`	Content block ends (may contain `stopReason`)

All event callbacks include sonic_session_id — the Nova Sonic server-side session UUID. This changes when session continuation creates a new connection, allowing each turn's audio to be mapped to the exact Nova Sonic session that produced it.

Turn Completion Detection

Critical logic in main.py:

Primary: Speculative text count == final text count (most reliable). Each text block goes through SPECULATIVE then FINAL stages. When all speculative blocks have finalized, the turn is complete.
Safety net: 60-second timeout for initial response, 60-second stall timeout between FINAL texts.

Hangup Detection

The user simulator can emit [HANGUP] in its message to signal the conversation should end early (e.g., frustrated customer hangs up). The turn loop detects this and stops cleanly.

When hangup_prompt_enabled is true (the default), the harness automatically appends a [HANGUP] instruction to the user simulator's system prompt. This means you do not need to include [HANGUP] instructions in your config's user_system_prompt. If the prompt already contains [HANGUP], the postamble is skipped to avoid duplication. Set "hangup_prompt_enabled": false in your config to disable this behavior.

Tool System

Tool Definitions in Config

Tools are defined in sonic_tool_config.tools using the Bedrock toolSpec format:

{
  "toolSpec": {
    "name": "lookupOrder",
    "description": "Look up order status by order ID",
    "inputSchema": {
      "json": {
        "type": "object",
        "properties": {
          "order_id": {
            "type": "string",
            "description": "The order ID to look up"
          }
        },
        "required": ["order_id"]
      }
    }
  }
}

Tool Registry (Custom Implementations)

Register async handlers via tool_registry.py:

from tool_registry import ToolRegistry

registry = ToolRegistry()

@registry.tool("lookupOrder")
async def lookup_order(tool_input: dict) -> dict:
    order_id = tool_input.get("order_id")
    # Call your actual API...
    return {"status": "success", "order": {"id": order_id, "state": "shipped"}}

To use a registry from a config file, set tool_registry_module:

{
  "tool_registry_module": "tools.order_status_tools"
}

The module must export a registry attribute (a ToolRegistry instance). It's loaded via importlib at startup.

Mock Fallback

If no handler is registered for a tool call, main.py falls back to a mock response with a 0.5-second simulated delay. This lets you run tool-enabled configs without implementing every tool.

Tool Call Flow

Nova Sonic emits a toolUse event with toolName, toolUseId, and content
SonicStreamManager._handle_tool_request() calls the registered handler
Result is sent back via contentStart(TOOL) -> toolResult -> contentEnd
Nova Sonic continues generating based on the tool result

Evaluation System

Binary LLM judge with strict YES/NO verdicts per metric. Active when auto_evaluate: true in config.

Binary Judge (`llm_judge_binary.py`)

Each metric returns a strict YES/NO verdict. Useful for automated CI/CD pass/fail gating and aggregated metric pass rates.

Four built-in metrics: Goal Achievement, Tool Usage, Response Quality, Conversation Flow
Tool Usage auto-skipped when no tools are configured
Supports multiple rubric questions per metric (each independently YES/NO). A metric passes only if ALL its rubric questions pass
Custom rubrics via evaluation_criteria.rubrics config field
Extensible: add new metrics to BUILTIN_METRICS or pass custom_metrics at runtime

from llm_judge_binary import BinaryLLMJudge
from evaluation_types import BinaryEvaluationCriteria

judge = BinaryLLMJudge(judge_model="claude-opus")
criteria = BinaryEvaluationCriteria(
    user_goal="Book a flight to Paris",
    assistant_objective="Provide travel assistance",
    evaluation_aspects=["Goal Achievement", "Tool Usage", "Response Quality", "Conversation Flow"],
    rubrics={
        "Goal Achievement": [
            "Did the agent help find flights to Paris?",
            "Did the agent check the rewards balance?",
        ]
    },
)
result = judge.evaluate_conversation(conversation_log, criteria, config=eval_config)
# result.pass_fail -> True/False (all metrics must pass)
# result.pass_rate -> 0.75 (3 of 4 metrics passed)
# result.metric_verdicts -> per-metric YES/NO with reasoning

Config example:

{
  "auto_evaluate": true,
  "evaluation_criteria": {
    "user_goal": "Book a flight and check rewards",
    "assistant_objective": "Provide travel assistance",
    "evaluation_aspects": ["Goal Achievement", "Tool Usage", "Response Quality", "Conversation Flow"],
    "rubrics": {
      "Goal Achievement": [
        "Did the agent help find flights to Paris?",
        "Did the agent check the rewards balance?"
      ]
    }
  }
}

Output format:

{
  "rating_type": "binary",
  "results": {
    "overall_rating": "PASS",
    "pass_fail": true,
    "pass_rate": 1.0,
    "metric_verdicts": {
      "Goal Achievement": {
        "verdict": "PASS",
        "reasoning": "...",
        "rubric_verdicts": [
          {"question": "Did the agent help find flights?", "verdict": "YES", "reasoning": "..."}
        ]
      }
    }
  }
}

Standalone Log Evaluation

python evaluate_log.py logs/<session_dir>/ \
  --user-goal "Custom user goal" \
  --assistant-objective "Custom assistant objective" \
  --aspects "Goal Achievement,Tool Usage,Response Quality" \
  --rubrics '{"Goal Achievement": ["Did the agent help?"]}'

Path D: Audio-Text Consistency Evaluation (`evaluate_audio_text.py`)

Detects hallucinations by comparing Nova Sonic's text output against a transcription of its audio output via Amazon Transcribe.

python evaluate_audio_text.py \
  --log logs/<session_dir> \
  --s3-bucket my-transcription-bucket \
  --region us-east-1 \
  --judge-model claude-opus

How it works:

For each turn with recorded audio, uploads the WAV file to S3 (keyed by the turn's Nova Sonic sonic_session_id, falling back to the log session ID)
Runs an Amazon Transcribe job and fetches the transcript
An LLM judge compares the text output vs. audio transcription, classifying differences as factual discrepancies, phrasing differences, or filler words
Saves audio_text_consistency.json in the log directory and cleans up S3 artifacts

Per-turn verdicts:

Verdict	Meaning
`CONSISTENT`	Only filler words or no differences
`MINOR_DIFFERENCES`	Phrasing differences only (synonyms, reworded sentences)
`HALLUCINATION`	Factual discrepancies (different numbers, names, facts)

Requirements:

log_audio: true in the config (so WAV files are recorded per turn)
An S3 bucket for temporary audio uploads (created automatically if it doesn't exist)
Amazon Transcribe access in the specified region

Output:

{
  "session_id": "order_status_20260225_170309_0b85b6",
  "judge_model": "claude-opus",
  "total_turns": 5,
  "turns_evaluated": 4,
  "turns_skipped": 1,
  "turns_hallucinated": 0,
  "hallucination_rate": 0.0,
  "turn_results": [
    {
      "turn_number": 1,
      "verdict": "CONSISTENT",
      "sonic_response_text": "...",
      "audio_transcript": "...",
      "factual_discrepancies": [],
      "phrasing_differences": [],
      "filler_words": ["um"]
    }
  ]
}

Results & Output

Session Output

All session output goes to {output_directory}/sessions/{session_id}/ (default: results/sessions/):

results/sessions/{session_id}/
├── audio/                          # Audio files
│   ├── turn_1.wav                  # Per-turn audio (if log_audio: true)
│   ├── input.lpcm                  # Raw input audio (if record_conversation: true)
│   ├── output.lpcm                 # Raw output audio (if record_conversation: true)
│   └── conversation.wav            # Stereo WAV L=input R=output (if record_conversation: true)
├── chat/
│   └── conversation.txt            # Human-readable transcript
├── config/
│   └── test_config.json            # Test configuration used
├── logs/
│   ├── interaction_log.json        # Complete structured data
│   └── events.jsonl                # Streaming events debug log (if log_events: true)
├── evaluation/
│   └── llm_judge_evaluation.json   # LLM judge evaluation results
├── session_metadata.json
├── README.json                     # Quick reference with statistics
└── README.md

Indexes

results/master_index.json — Array of all sessions with metadata
results/sessions_index.csv — CSV version for spreadsheets

Batch Results

When running --scenarios-dir or --parallel, batch summaries are generated:

results/batches/{batch_id}/batch_summary.json — Aggregated stats, pass rates, rating distributions

Interaction Log Format

{
  "session_id": "order_status_normal_20260225_144223_a1b2c3",
  "test_name": "order_status_normal",
  "start_time": "2026-02-25T14:42:23.261709",
  "end_time": "2026-02-25T14:47:17.065399",
  "configuration": { "...full TestConfig..." },
  "turns": [
    {
      "turn_number": 1,
      "timestamp": "2026-02-25T14:42:25.000000",
      "user_message": "Hi, I'd like to check on my order ORD-2024-33100.",
      "sonic_response": "Of course! Let me look that up for you.",
      "tool_calls": [
        {
          "tool_name": "lookupOrder",
          "tool_use_id": "uuid",
          "tool_input": { "order_id": "ORD-2024-33100" },
          "tool_result": { "status": "shipped", "tracking": "..." },
          "timestamp": "2026-02-25T14:42:30.000000"
        }
      ],
      "audio_recorded": true,
      "audio_file": "turn_1.wav",
      "sonic_session_id": "04934a4f-c91b-4756-bd11-8c0b41890489",
      "metadata": {}
    }
  ],
  "total_turns": 4,
  "total_tool_calls": 1,
  "errors": [],
  "session_events": [],
  "summary": {
    "total_turns": 4,
    "total_tool_calls": 1,
    "tool_usage_rate": "25.00%",
    "avg_user_message_length": 236.5,
    "avg_sonic_response_length": 286.0,
    "session_continuations": 0
  }
}

Evaluation Output Format

{
  "evaluated_at": "2026-02-25T14:50:00.000000",
  "judge_model": "global.anthropic.claude-opus-4-5-20251101-v1:0",
  "rating_type": "binary",
  "criteria": {
    "user_goal": "Check order status and get delivery estimate",
    "assistant_objective": "Provide accurate order information",
    "evaluation_aspects": ["Goal Achievement", "Tool Usage", "Response Quality"]
  },
  "results": {
    "overall_rating": "PASS",
    "pass_fail": true,
    "pass_rate": 1.0,
    "metric_verdicts": {
      "Goal Achievement": {
        "verdict": "PASS",
        "reasoning": "The assistant correctly looked up the order and provided status"
      },
      "Tool Usage": {
        "verdict": "PASS",
        "reasoning": "Correctly used lookupOrder tool with proper parameters"
      }
    }
  },
  "conversation_summary": {
    "total_turns": 4,
    "total_tool_calls": 1,
    "session_id": "order_status_normal_20260225_144223_a1b2c3"
  },
  "judge_log": {
    "prompt": "You are an expert evaluator... (full judge prompt)",
    "raw_response": "{\"overall_rating\": \"PASS\", ...}"
  }
}

The judge_log field is only present when log_judge_prompts: true is set in config or --log-judge-prompts is passed on the CLI.

Model Registry

configs/models.yaml maps short aliases to full Bedrock model IDs:

user_models:
  claude-haiku:
    model_id: "us.anthropic.claude-haiku-4-5-20251001-v1:0"
    region: "us-west-2"
  claude-sonnet:
    model_id: "us.anthropic.claude-sonnet-4-20250514-v1:0"
    region: "us-west-2"
  gpt-oss:
    model_id: "openai.gpt-oss-20b-1:0"
    region: "us-west-2"
  qwen:
    model_id: "qwen.qwen3-32b-v1:0"
    region: "us-west-2"

sonic_models:
  nova-sonic:
    model_id: "amazon.nova-2-sonic-v1:0"
    region: "us-east-1"

judge_models:
  claude-opus:
    model_id: "global.anthropic.claude-opus-4-5-20251101-v1:0"
    region: "us-east-1"
  claude-sonnet:
    model_id: "us.anthropic.claude-sonnet-4-20250514-v1:0"
    region: "us-east-1"

Three categories:

Category	Used By	API
`user_models`	`BedrockModelClient` (user simulator)	Bedrock Converse
`sonic_models`	`SonicStreamManager` (Nova Sonic)	Bidirectional Streaming
`judge_models`	`BinaryLLMJudge` (evaluation)	Bedrock Converse

To add a new model, edit configs/models.yaml — no code changes needed. Unmatched aliases fall back to being treated as literal Bedrock model IDs.

Session Continuation

Nova Sonic has an approximately 8-minute connection timeout. SessionContinuationManager (session_continuation.py) wraps SonicStreamManager to transparently handle this:

Monitors connection age against transition_threshold_seconds (default: 360s / 6 min)
When approaching timeout, creates a new SonicStreamManager session
Replays conversation history into the new session using ConversationHistory (conversation_history.py)
History replay has budget limits: 1KB per message, 40KB total
The turn loop in main.py is unaware of session transitions

Enable/disable via config:

{
  "enable_session_continuation": true,
  "transition_threshold_seconds": 360.0,
  "audio_buffer_duration_seconds": 3.0
}

Session events (transitions, recoveries) are logged in interaction_log.json under session_events.

Module Reference

Core Pipeline

Module	Description
`main.py`	`LiveInteractionSession` — orchestrates the turn loop, handles events and tool calls
`sonic_stream_manager.py`	`SonicStreamManager` — RxPY bidirectional streaming with Nova Sonic
`session_continuation.py`	`SessionContinuationManager` — transparent session restart on timeout
`conversation_history.py`	`ConversationHistory` — tracks messages for replay (1KB/msg, 40KB budget)
`bedrock_model_client.py`	`BedrockModelClient` (LLM user simulator) + `ScriptedUser` (predefined messages)
`config_manager.py`	`ConfigManager` + `TestConfig` / `InferenceConfig` / `ToolConfig` dataclasses
`interaction_logger.py`	`InteractionLogger` — records turns, audio, tool calls, session events
`tool_registry.py`	`ToolRegistry` — decorator-based tool registration and async execution
`model_registry.py`	`ModelRegistry` — resolves aliases via `configs/models.yaml`
`results_manager.py`	`ResultsManager` — organizes output, generates indexes and summaries
`multi_session_runner.py`	`MultiSessionRunner` — batch mode with parallel execution and dataset runs
`dataset_loader.py`	`DatasetLoader` + `DatasetEntry` — loads JSONL and HuggingFace datasets
`polly_tts_client.py`	`PollyTTSClient` — Amazon Polly TTS wrapper (16kHz 16-bit mono PCM)
`conversation_audio_recorder.py`	`ConversationAudioRecorder` — records input/output LPCM + stereo WAV

Evaluation

Module	Description
`evaluation_types.py`	`BinaryEvaluationCriteria` + `BinaryEvaluationResult` + `AudioTextConsistencyReport` dataclasses
`llm_judge_binary.py`	`BinaryLLMJudge` — binary (YES/NO) evaluation with per-metric rubrics
`evaluate_log.py`	Standalone script to evaluate existing logs
`evaluate_audio_text.py`	`AudioTextConsistencyJudge` — audio-vs-text hallucination detection
`audio_transcription_client.py`	`TranscriptionClient` — Amazon Transcribe wrapper (S3 upload, polling, cleanup)

Module Dependency Graph

main.py
├── sonic_stream_manager.py (RxPY, aws_sdk_bedrock_runtime)
├── session_continuation.py -> conversation_history.py
├── bedrock_model_client.py -> model_registry.py -> configs/models.yaml
├── config_manager.py
├── interaction_logger.py
├── tool_registry.py
├── results_manager.py
├── polly_tts_client.py (Amazon Polly, optional)
├── conversation_audio_recorder.py (LPCM + WAV recording, optional)
└── llm_judge_binary.py -> model_registry.py

main.py -> multi_session_runner.py (for batch/parallel runs)
main.py -> dataset_loader.py (for --dataset runs)
multi_session_runner.py -> main.py (LiveInteractionSession), config_manager,
                           results_manager, dataset_loader

evaluate_audio_text.py -> audio_transcription_client.py (Amazon Transcribe via S3)
                       -> model_registry.py (judge model resolution)

Directory Structure

Nova2SonicTestHarness/
├── main.py                          # Entry point, turn loop orchestration
├── sonic_stream_manager.py          # Bidirectional streaming (RxPY)
├── session_continuation.py          # Auto-reconnect on timeout
├── conversation_history.py          # Message replay for new sessions
├── bedrock_model_client.py          # User simulator (Bedrock + scripted)
├── config_manager.py                # Config loading and dataclasses
├── interaction_logger.py            # JSON + WAV logging
├── tool_registry.py                 # Decorator-based tool registry
├── model_registry.py                # Alias -> Bedrock model ID
├── results_manager.py               # Output organization and indexes
├── multi_session_runner.py          # Batch/parallel execution
├── polly_tts_client.py              # Amazon Polly TTS wrapper
├── conversation_audio_recorder.py   # Input/output LPCM + stereo WAV recording
├── llm_judge_binary.py              # LLM-as-judge evaluation (binary YES/NO)
├── evaluation_types.py              # Shared eval dataclasses
├── evaluate_log.py                  # Standalone log evaluator
├── evaluate_audio_text.py           # Audio-text consistency (hallucination detection)
├── audio_transcription_client.py    # Amazon Transcribe wrapper
├── requirements.txt                 # Python dependencies
├── .env.example                     # AWS credentials template
├── .mise.toml                       # Python 3.12 version manager
├── run_example.sh                   # Quick-run shell script
├── quick_start.sh                   # Quick start script
│
├── configs/                         # Test configurations
│   ├── models.yaml                  # Model alias registry
│   ├── example_basic.json           # Basic conversation
│   ├── example_with_tools.json      # Tools enabled
│   ├── example_polly.json           # Polly TTS audio input
│   ├── example_with_tools_polly.json # Polly TTS + tools
│   ├── example_scripted.json        # Scripted messages
│   ├── example_binary_eval.json    # Binary (YES/NO) evaluation with rubrics
│   ├── claude_haiku_basic.json      # Claude Haiku user sim
│   ├── gpt_oss_tools.json           # GPT-OSS user sim with tools
│   ├── qwen_complex.json            # Qwen user sim
│   ├── qwen_with_tools.json         # Qwen with tools
│   └── order_status/                # Order status scenario variants
│       ├── order_status_aggressive.json
│       ├── order_status_delayed.json
│       ├── order_status_normal.json
│       ├── order_status_processing.json
│       └── order_status_wrong_id.json
│
├── examples/                        # Example implementations
│   ├── example_with_tool_registry.py        # Tool registration patterns
│   ├── example_with_tool_registry_scripted.py
│   ├── example_tool_implementation.py
│   └── order_status_tools.py                # Order status tool handlers
│
├── results/                         # Organized results
│   ├── master_index.json
│   ├── sessions_index.csv
│   ├── sessions/{session_id}/       # Per-session organized output
│   └── batches/{batch_id}/          # Batch summaries

Examples

Basic Conversation (No Tools)

python main.py --config configs/example_basic.json

Claude Haiku simulates a user having a 5-turn conversation with Nova Sonic.

Tool-Enabled Conversation

python main.py --config configs/example_with_tools.json

Three travel tools (getBookingDetails, searchDestinations, getRewardsBalance) with custom implementations from examples/example_with_tool_registry.py.

Scripted Mode

python main.py --config configs/example_scripted.json

Predefined user messages for repeatable testing. No LLM-generated user input.

Order Status Scenarios

python main.py --scenarios-dir configs/order_status --parallel 2

Five customer service scenarios with different user personas (aggressive, patient, confused) and order states (delayed, processing, wrong ID). Each uses the lookupOrder tool.

Custom Tool Implementation

# my_tools.py
from tool_registry import ToolRegistry

registry = ToolRegistry()

@registry.tool("getWeather")
async def get_weather(tool_input: dict) -> dict:
    city = tool_input.get("city", "unknown")
    return {"temperature": 72, "condition": "sunny", "city": city}

Reference it from your config:

{
  "tool_registry_module": "my_tools",
  "sonic_tool_config": {
    "tools": [{
      "toolSpec": {
        "name": "getWeather",
        "description": "Get current weather for a city",
        "inputSchema": {
          "json": {
            "type": "object",
            "properties": {
              "city": { "type": "string", "description": "City name" }
            },
            "required": ["city"]
          }
        }
      }
    }]
  }
}

Audio Parameters

Parameter	Value
Input sample rate	16 kHz
Output sample rate (text mode)	24 kHz (default, set via `audioOutputConfiguration.sampleRateHertz` in `start_prompt()`)
Output sample rate (Polly mode)	16 kHz (matches input for recording alignment)
Polly TTS sample rate	16 kHz
Channels	Mono
Bit depth	16-bit PCM
Text chunking	1,000 chars per `textInput` event
Audio chunking	512-frame chunks (~32ms at 16kHz)
Silent audio interval	30ms between chunks

No microphone or speaker is needed. In text mode, the harness streams silent audio to satisfy Nova Sonic's requirement for an active audio channel while sending text input. In Polly mode, user simulator text is synthesized to speech and streamed as real audio input.

Note: The output sample rate is controlled by audioOutputConfiguration.sampleRateHertz in sonic_stream_manager.py:start_prompt(). Nova Sonic respects this setting. When Polly mode is enabled, both input and output use 16kHz so conversation recordings stay aligned.

Troubleshooting

"Failed to initialize stream"

Check AWS credentials in .env are valid
Verify the model ID exists in your region (configs/models.yaml)
Ensure aws_sdk_bedrock_runtime is the correct version (>=0.1.0,<0.2.0)

"Response timeout - no response from Nova Sonic"

Nova Sonic didn't produce any output within 60 seconds. Possible causes:

Invalid system prompt or tool configuration
Region mismatch (Nova Sonic is only available in certain regions)
Network connectivity issues

Tool calls return mock responses

If you see "Using mock implementation for {tool_name}", register a real handler. Set tool_registry_module in your config or pass a ToolRegistry to LiveInteractionSession.

Session continuation events in logs

These are normal. Nova Sonic connections time out after ~8 minutes. The SessionContinuationManager automatically reconnects and replays history. Check session_events in the interaction log for details.

Evaluation fails with "No JSON found in response"

The judge model didn't return valid JSON. This occasionally happens with weaker models. Use claude-opus for evaluation, or re-run with python evaluate_log.py.

"Module has no 'registry' attribute"

Your tool_registry_module Python file must export a variable named registry that is a ToolRegistry instance. See examples/example_with_tool_registry.py.

Polly TTS "AccessDeniedException"

Your IAM user/role needs the polly:SynthesizeSpeech permission. Add an IAM policy granting Polly access. If Polly fails at runtime, the harness falls back to sending text input.

Low-quality user messages

Adjust the user_system_prompt to be more specific about the persona and behavior. Increase user_temperature for more variety, or decrease it for more focused responses. Consider using scripted mode for reproducible tests.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
configs		configs
custom_tools		custom_tools
datasets		datasets
examples		examples
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
audio_transcription_client.py		audio_transcription_client.py
bedrock_model_client.py		bedrock_model_client.py
config_manager.py		config_manager.py
conversation_audio_recorder.py		conversation_audio_recorder.py
conversation_history.py		conversation_history.py
dataset_loader.py		dataset_loader.py
evaluate_audio_text.py		evaluate_audio_text.py
evaluate_log.py		evaluate_log.py
evaluation_dashboard.py		evaluation_dashboard.py
evaluation_types.py		evaluation_types.py
event_logger.py		event_logger.py
generate_config.py		generate_config.py
install_dependencies.sh		install_dependencies.sh
interaction_logger.py		interaction_logger.py
llm_judge_binary.py		llm_judge_binary.py
main.py		main.py
model_registry.py		model_registry.py
multi_session_runner.py		multi_session_runner.py
polly_tts_client.py		polly_tts_client.py
quick_start.sh		quick_start.sh
requirements.txt		requirements.txt
results_manager.py		results_manager.py
session_continuation.py		session_continuation.py
sonic_stream_manager.py		sonic_stream_manager.py
tool_registry.py		tool_registry.py

Folders and files

Latest commit

History

Repository files navigation