diff --git a/.claude/commands/debug-prod.md b/.claude/commands/debug-prod.md new file mode 100644 index 0000000..ec7aaa8 --- /dev/null +++ b/.claude/commands/debug-prod.md @@ -0,0 +1,66 @@ +# Debug Production + +Investigate production issues on the live server. + +## Access + +- **SSH**: `ssh root@167.235.133.87` +- **App directory**: `/opt/pseuno` + +## Common commands + +All commands below assume you are in `/opt/pseuno` on the server. + +### View logs +```bash +docker compose -f docker-compose.prod.yml logs --tail=200 backend +``` + +### View all service logs +```bash +docker compose -f docker-compose.prod.yml logs --tail=100 +``` + +### Filter errors +```bash +docker compose -f docker-compose.prod.yml logs backend 2>&1 | grep -i error | tail -30 +``` + +### Check health +```bash +curl -s localhost:8000/health +``` + +### Check running containers +```bash +docker compose -f docker-compose.prod.yml ps +``` + +### Database access +```bash +docker compose -f docker-compose.prod.yml exec postgres psql -U pseuno -d pseuno +``` + +### Redis +```bash +docker compose -f docker-compose.prod.yml exec redis redis-cli +``` + +### Restart backend +```bash +docker compose -f docker-compose.prod.yml restart backend +``` + +### Deploy latest +```bash +cd /opt/pseuno && git pull && docker compose -f docker-compose.prod.yml up -d --build backend +``` + +## Investigation workflow + +1. SSH in and check health first +2. View recent logs, filter for errors +3. Check if containers are running +4. If needed, check DB/Redis state +5. Restart backend if it's stuck +6. If a code fix is needed, deploy from main after merging the fix diff --git a/.claude/commands/test-frontend.md b/.claude/commands/test-frontend.md new file mode 100644 index 0000000..e909496 --- /dev/null +++ b/.claude/commands/test-frontend.md @@ -0,0 +1,36 @@ +# Test Frontend + +Validate frontend changes compile, lint, and build correctly. + +## Steps + +### 1. 
Type check + +```bash +cd frontend && npx tsc --noEmit +``` +Must have zero errors. + +### 2. Lint + +```bash +cd frontend && npm run lint +``` +Must have zero warnings (strict policy). + +### 3. Build + +```bash +cd frontend && npm run build +``` +Must succeed. + +### 4. E2E tests (if dev stack is running) + +```bash +cd frontend && npx playwright test +``` + +### 5. Visual verification (if making UI changes) + +Open `localhost:5173` in a browser (via Playwright MCP) and visually verify the change looks correct. diff --git a/.claude/commands/test-perf.md b/.claude/commands/test-perf.md new file mode 100644 index 0000000..a5a5161 --- /dev/null +++ b/.claude/commands/test-perf.md @@ -0,0 +1,70 @@ +# Test Performance + +Benchmark endpoint latency. Run this when making changes that could affect generation speed. + +## Prerequisites + +1. Verify dev stack is up: `curl -s localhost:8000/health` +2. If not running, run `make dev-up` and wait for health check to pass. + +## Steps + +### 1. Benchmark `/generate/input-concept` + +Call 5 times and report min/max/avg latency. Target: <2s each. + +```bash +for i in 1 2 3 4 5; do + time curl -s -X POST localhost:8000/generate/input-concept \ + -H "Content-Type: application/json" \ + -d '{"raw_input": "upbeat pop song about summer"}' > /dev/null +done +``` + +### 2. Benchmark `/generate/advanced` + +Call 3 times with different inputs. Target: <15s each. 
+ +```bash +curl -s -w "\n%{time_total}s\n" -X POST localhost:8000/generate/advanced \ + -H "Content-Type: application/json" \ + -d '{"user_prompt": "indie rock with jangly guitars", "lyrics_about": "leaving home for the first time"}' + +curl -s -w "\n%{time_total}s\n" -X POST localhost:8000/generate/advanced \ + -H "Content-Type: application/json" \ + -d '{"user_prompt": "lo-fi hip hop beats", "lyrics_about": "late night studying"}' + +curl -s -w "\n%{time_total}s\n" -X POST localhost:8000/generate/advanced \ + -H "Content-Type: application/json" \ + -d '{"user_prompt": "orchestral film score", "lyrics_about": ""}' +``` + +### 3. Benchmark `/generate/refine` + +Call 3 times with `refine_target=lyrics`. Target: <15s each. + +Use the full response from step 2 to build the refine request (the endpoint requires the current snapshot, not just a generation_id). Substitute each `..._FROM_STEP_2` placeholder with the corresponding field from that response: +```bash +curl -s -w "\n%{time_total}s\n" -X POST localhost:8000/generate/refine \ + -H "Content-Type: application/json" \ + -d '{ + "suno_prompt": "SUNO_PROMPT_FROM_STEP_2", + "lyrics": "LYRICS_FROM_STEP_2", + "exclude": "EXCLUDE_FROM_STEP_2", + "title": "TITLE_FROM_STEP_2", + "weirdness": WEIRDNESS_FROM_STEP_2, + "change_request": "make it more emotional", + "refine_target": "lyrics" + }' +``` + +### 4. Compare (if testing a change) + +If benchmarking before/after a code change: +1. Run steps 1-3 on the base branch, save results +2. Switch to feature branch, run again +3. Report the delta for each endpoint + +### 5. Save results + +Save the report to `benchmarks/perf-YYYY-MM-DD.md` (use today's date). Include the git branch/commit, per-call latencies, and min/max/avg for each endpoint. See `benchmarks/README.md` for the format. diff --git a/.claude/commands/test-quality.md b/.claude/commands/test-quality.md new file mode 100644 index 0000000..05f6709 --- /dev/null +++ b/.claude/commands/test-quality.md @@ -0,0 +1,97 @@ +# Test Quality + +Assess generation quality by hitting real endpoints. Run this after prompt or generation changes. + +## Prerequisites + +1. 
Verify dev stack is up: `curl -s localhost:8000/health` +2. If not running, run `make dev-up` and wait for health check to pass. + +## Steps + +### 1. Generate 5 songs across varied genres + +Call `POST localhost:8000/generate/advanced` with these inputs: + +```bash +# Country +curl -s -X POST localhost:8000/generate/advanced \ + -H "Content-Type: application/json" \ + -d '{"user_prompt": "classic country with steel guitar and fiddle", "lyrics_about": "driving down a dirt road at sunset"}' + +# Punk +curl -s -X POST localhost:8000/generate/advanced \ + -H "Content-Type: application/json" \ + -d '{"user_prompt": "fast aggressive punk rock", "lyrics_about": "being fed up with corporate greed"}' + +# Hip-hop +curl -s -X POST localhost:8000/generate/advanced \ + -H "Content-Type: application/json" \ + -d '{"user_prompt": "boom bap hip hop with jazz samples", "lyrics_about": "growing up in the city"}' + +# Folk +curl -s -X POST localhost:8000/generate/advanced \ + -H "Content-Type: application/json" \ + -d '{"user_prompt": "acoustic folk with fingerpicking", "lyrics_about": "a small town slowly disappearing"}' + +# Electronic +curl -s -X POST localhost:8000/generate/advanced \ + -H "Content-Type: application/json" \ + -d '{"user_prompt": "dark synthwave with arpeggiated bass", "lyrics_about": "neon lights in an empty city"}' +``` + +Capture full JSON responses from each. + +### 2. Assess vocabulary + +Read the lyrics from each response. Check: +- **Banned/overused words**: silver, velvet, neon, shattered, whisper, shadows, echoes, crimson, golden, embers +- Flag if any of these appear in 3+ of the 5 songs +- Check that vocabulary feels genre-appropriate (country should sound different from punk) + +### 3. Assess chorus quality + +Parse `[Chorus]` sections from each song. Flag if any chorus has the same line repeated 3+ times consecutively. + +### 4. 
Assess style names + +Check that each response's `concept_title` / style name is: +- Short (under 30 characters) +- Descriptive, not a full style prompt sentence + +### 5. Assess structure + +For each song, verify: +- Section tags are present (`[Verse]`, `[Chorus]`, etc.) +- No stage directions in lyrics (e.g., "(softly)", "(guitar solo)") +- No periods at end of lines + +### 6. Test refine + +Take one generated song's full response and call refine. The refine endpoint requires the full current snapshot; substitute each `..._FROM_CHOSEN_SONG` placeholder with the corresponding field from that song's response: +```bash +curl -s -X POST localhost:8000/generate/refine \ + -H "Content-Type: application/json" \ + -d '{ + "suno_prompt": "SUNO_PROMPT_FROM_CHOSEN_SONG", + "lyrics": "LYRICS_FROM_CHOSEN_SONG", + "exclude": "EXCLUDE_FROM_CHOSEN_SONG", + "title": "TITLE_FROM_CHOSEN_SONG", + "weirdness": WEIRDNESS_FROM_CHOSEN_SONG, + "change_request": "make the chorus more upbeat and energetic", + "refine_target": "lyrics" + }' +``` + +Verify: +- `changed_fields` includes "lyrics" +- `changed_fields` does NOT include "suno_prompt" +- Completed in <30s + +### 7. Report + +Summarize findings: what passed, what failed, with specific examples of issues found. + +### 8. Save results + +Save the report to `benchmarks/quality-YYYY-MM-DD.md` (use today's date). Include the git branch/commit, per-song results for each check (vocabulary, chorus, style names, structure), and a summary. See `benchmarks/README.md` for the format. diff --git a/.claude/commands/update-rules.md b/.claude/commands/update-rules.md new file mode 100644 index 0000000..e85e9e6 --- /dev/null +++ b/.claude/commands/update-rules.md @@ -0,0 +1,17 @@ +# Update Rules + +Add new conventions or pitfalls to CLAUDE.md when you discover them during a task. + +## Steps + +1. Read current `CLAUDE.md` at the project root. +2. Identify the new convention, pattern, or pitfall discovered during the current task. +3. Add it to the appropriate section (keep entries concise, 1-2 lines). +4. Don't remove existing rules unless they're demonstrably wrong. 
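+ +For example, a new entry might look like this (illustrative formatting — the wording restates a rule already in CLAUDE.md's Testing patterns, not a new discovery): + +```markdown +- **Route tests**: copy endpoint functions inline with mocked dependencies rather than importing FastAPI routers. +``` 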
+ +## Examples of things to add + +- A new testing pattern you had to figure out +- A file that's easy to break and how to avoid it +- A config value that doesn't do what its name suggests +- An API behavior that's surprising or undocumented diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 0000000..7ab4725 --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,54 @@ +name: CI + +on: + push: + branches: [main] + pull_request: + branches: [main] + +jobs: + backend-lint: + runs-on: ubuntu-latest + defaults: + run: + working-directory: backend + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + cache: pip + - run: pip install -r requirements.txt ruff + - run: python -m ruff check app/ tests/ + + backend-test: + runs-on: ubuntu-latest + defaults: + run: + working-directory: backend + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + cache: pip + - run: pip install -r requirements.txt + - run: >- + python -m pytest -v + --ignore=tests/test_artist_bank_routing.py + --ignore=tests/test_v8_channel_split.py + + frontend-build: + runs-on: ubuntu-latest + defaults: + run: + working-directory: frontend + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-node@v4 + with: + node-version: "20" + cache: npm + cache-dependency-path: frontend/package-lock.json + - run: npm ci + - run: npm run build diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..f61d2b1 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,62 @@ +# Columbus V1 — Agent Instructions + +## Architecture + +Columbus is a music generation app. The backend is a FastAPI service that calls LLMs to generate Suno-compatible style prompts and lyrics. + +### Generation flow (default: two-step v5_hybrid) + +The default prompt variant is `v5_hybrid` which runs **two parallel branches** via `asyncio.gather`: + +1. 
**Style branch** → generates `suno_prompt`, `exclude`, `weirdness`, `style_influence` +2. **Lyrics branch** → infers a `LyricProfile` first, then generates `song_title` + `lyrics` + +After both branches complete, a `style_name` LLM call summarizes the style. + +**LLM call order** (non-instrumental): style → profile → lyrics → style_name (4 calls) +**LLM call order** (instrumental): style → title → style_name (3 calls) + +### Key files + +| File | What it does | Pitfalls | +|---|---|---| +| `backend/app/services/agent_prompt_graph.py` | Core generation engine (~3500 lines) | `_injected_llm` makes all branches share one FakeLLM in tests | +| `backend/app/prompts/specs.py` | Shared output contracts, repair prompts | Changes here affect ALL variants | +| `backend/app/prompts/variants/v5_hybrid.py` | Default two-step variant config | `uses_lyric_profile=True` triggers profile inference | +| `backend/app/schemas/advanced.py` | Request/response models, DebugTrace schema | `PromptVariant` literal must match registry | +| `backend/app/services/debug_trace.py` | DebugTracer builds span-based traces | `debug_info` is `DebugTrace` format, not flat dict | +| `backend/app/config.py` | Settings (env vars, defaults) | `agent_repair_enabled` exists but is NOT used in two-step code | + +## Testing + +### Running tests + +```bash +cd backend && python -m pytest -v +``` + +**ALL tests must pass before committing.** Run the full suite, not just the file you changed. + +### Testing patterns + +- **FakeLLM**: Tests inject a `FakeLLM` with a list of string responses consumed in order. For two-step v5_hybrid, provide responses in this order: style, profile, lyrics, style_name (4 for non-instrumental, 3 for instrumental). +- **Always set `prompt_variant="v5_hybrid"`** in test requests — this matches the default behavior and ensures correct FakeLLM consumption order. 
+- **`debug_info`** is a `DebugTrace` dict with `version`, `summary` (variant, model, repairs, architecture), and `spans` list — NOT a flat dict with `repaired`/`agent_model` keys. +- **Shared helpers** are in `backend/tests/conftest.py` (`test_settings` fixture) and at the top of test files (`_valid_style_output`, `_valid_lyrics_output`, etc.). +- **Route tests** copy endpoint functions inline with mocked dependencies rather than importing FastAPI routers. +- **Pre-existing failures**: `test_artist_bank_routing.py` (event loop) and `test_v8_channel_split.py` are known broken — don't worry about those. + +### Lint + +```bash +cd backend && python -m ruff check app/ tests/ +``` + +## Skills + +Use these after making changes: +- `/test-quality` — assess generation quality by hitting real endpoints (after prompt/generation changes) +- `/test-perf` — benchmark endpoint latency +- `/test-frontend` — validate frontend builds and types +- `/debug-prod` — investigate production issues +- `/update-rules` — add new conventions to this file diff --git a/Makefile b/Makefile index f8e920d..77b186b 100644 --- a/Makefile +++ b/Makefile @@ -3,6 +3,7 @@ COMPOSE_PROD = docker compose -f docker-compose.prod.yml .PHONY: dev dev-up dev-down dev-build dev-logs dev-ps backend-shell frontend-shell db-shell redis-cli .PHONY: prod prod-up prod-down prod-build prod-logs +.PHONY: test lint check dev: $(COMPOSE_DEV) up --build @@ -48,3 +49,11 @@ prod-build: prod-logs: $(COMPOSE_PROD) logs -f --tail=100 + +test: + cd backend && python -m pytest -v + +lint: + cd backend && python -m ruff check app/ tests/ + +check: lint test diff --git a/backend/app/prompts/specs.py b/backend/app/prompts/specs.py index 8d4ef4e..f76b225 100644 --- a/backend/app/prompts/specs.py +++ b/backend/app/prompts/specs.py @@ -345,8 +345,8 @@ - Each chorus should contain the same lyrics as the other chorus. However, within a single chorus, each line must be DIFFERENT — do NOT repeat the same line consecutively. 
A 4-line chorus needs 4 distinct lines. - Prioritize punchy, impactful lines over filler. Each line should earn its place. -Vocabulary rules: -- Avoid overusing generic "poetic" words like "silver", "velvet", "neon", "shattered", "whisper", "shadows", "echoes", "crimson", "golden", "embers". These are fine occasionally but should not appear in every song. +Vocabulary rules (CRITICAL — validation will reject violations): +- NEVER use 3 or more of these generic "poetic" words in a single song: "silver", "velvet", "neon", "shattered", "whisper", "shadows", "echoes", "crimson", "golden", "embers". Using 1-2 is acceptable if genuinely fitting; 3+ will trigger a rewrite. - Derive vocabulary from the genre and era context. Each genre has its own linguistic register — think about what words and imagery belong to that genre's world. A country song and a punk song should not share the same adjectives. - Each song must have its own unique vocabulary palette drawn from the genre and topic. diff --git a/backend/app/services/agent_prompt_graph.py b/backend/app/services/agent_prompt_graph.py index 1114874..e71fbd9 100644 --- a/backend/app/services/agent_prompt_graph.py +++ b/backend/app/services/agent_prompt_graph.py @@ -1075,9 +1075,20 @@ async def _generate_parallel_two_step( suno_prompt = style_result["suno_prompt"] + # Derive concept_title with fallback if lyrics branch didn't produce one + concept_title = lyrics_result["song_title"] + if not concept_title: + concept_title = self._derive_title( + request.user_prompt, request.lyrics_about or "" + ) + logger.warning( + "Lyrics branch returned empty song_title, using fallback: %s", + concept_title, + ) + # Generate unique ID for this generation generation_id = hashlib.md5( - f"{lyrics_result['song_title']}{suno_prompt}{time.time()}".encode() + f"{concept_title}{suno_prompt}{time.time()}".encode() ).hexdigest()[:12] logger.info( @@ -1086,7 +1097,7 @@ async def _generate_parallel_two_step( ) return { - "concept_title": 
lyrics_result["song_title"], + "concept_title": concept_title, + "style_name": style_result.get("style_name", ""), + "lyrics": lyrics_result["lyrics"], + "suno_prompt": suno_prompt, @@ -2427,6 +2438,11 @@ def _strip_lyrics_preamble(self, lyrics: str) -> str: return lyrics[match.start() :].strip() return lyrics.strip() + + _OVERUSED_WORDS = frozenset({ + "silver", "velvet", "neon", "shattered", "whisper", + "shadows", "echoes", "crimson", "golden", "embers", + }) + def _validate_lyrics_output(self, output: _ParsedLyricsOutput) -> List[str]: """Validate lyrics output, return list of issues.""" issues = [] @@ -2439,8 +2455,21 @@ def _validate_lyrics_output(self, output: _ParsedLyricsOutput) -> List[str]: else: # Check for chorus lines that are mostly identical issues.extend(self._check_chorus_repetition(output.lyrics)) + # Check for overused generic "poetic" words + issues.extend(self._check_overused_words(output.lyrics)) return issues + def _check_overused_words(self, lyrics: str) -> List[str]: + """Flag lyrics that use 3+ banned generic poetic words.""" + words_in_lyrics = set(re.findall(r"[a-z]+", lyrics.lower())) + found = words_in_lyrics & self._OVERUSED_WORDS + if len(found) >= 3: + return [ + f"Too many generic poetic words ({', '.join(sorted(found))}). " + "Replace with genre-specific vocabulary." 
+ ] + return [] + @staticmethod def _check_chorus_repetition(lyrics: str) -> List[str]: """Detect choruses where >50% of lines are identical.""" diff --git a/backend/ruff.toml b/backend/ruff.toml new file mode 100644 index 0000000..bc20e29 --- /dev/null +++ b/backend/ruff.toml @@ -0,0 +1,8 @@ +[lint.per-file-ignores] +# Template file has intentional unused imports as documentation +"app/prompts/variants/_template.py" = ["F401"] +# Test files commonly import for side effects or use inside methods +"tests/*" = ["F401", "F541"] +# Pre-existing issues in app code (not introduced by this PR) +"app/routes/generate_input_concept.py" = ["F401"] +"app/schemas/advanced.py" = ["E402"] diff --git a/backend/tests/conftest.py b/backend/tests/conftest.py new file mode 100644 index 0000000..0044c57 --- /dev/null +++ b/backend/tests/conftest.py @@ -0,0 +1,13 @@ +""" +Shared test fixtures for the backend test suite. +""" + +import pytest + +from app.config import Settings + + +@pytest.fixture +def test_settings() -> Settings: + """Minimal Settings instance for tests (no real API keys needed).""" + return Settings(spotify_client_id="test", openai_api_key="test") diff --git a/backend/tests/test_agent_prompt_graph.py b/backend/tests/test_agent_prompt_graph.py index 1f64501..6dfdfb5 100644 --- a/backend/tests/test_agent_prompt_graph.py +++ b/backend/tests/test_agent_prompt_graph.py @@ -1,5 +1,10 @@ """ -Tests for AgentPromptGraph — basic use cases + repair/validation behavior. +Tests for AgentPromptGraph — two-step (v5_hybrid) generation. + +All tests use prompt_variant="v5_hybrid" which is the default two-step variant. +FakeLLM responses are consumed in this order: + Non-instrumental: style → profile → lyrics → style_name (4 calls) + Instrumental: style → title → style_name (3 calls) """ import asyncio @@ -17,7 +22,8 @@ def __init__(self, content: str): class FakeLLM: """ Minimal LLM stub for testing. - It returns a sequence of contents across successive `ainvoke` calls. 
+ Returns a sequence of contents across successive `ainvoke` calls. + After exhausting contents, returns empty string. """ def __init__(self, contents: list[str]): @@ -32,20 +38,25 @@ async def ainvoke(self, _messages, temperature=None): return _FakeResponse(self._contents.pop(0)) -def _settings() -> Settings: - # Spotify is optional; provide a stub value for clarity. - return Settings(spotify_client_id="test_spotify_client_id", openai_api_key="test") +def _settings(**overrides) -> Settings: + defaults = dict(spotify_client_id="test", openai_api_key="test") + defaults.update(overrides) + return Settings(**defaults) -def _valid_output( - lyrics: str = "[Verse]\nhello world\n", +# --------------------------------------------------------------------------- +# Output helpers for two-step v5_hybrid variant +# --------------------------------------------------------------------------- + + +def _valid_style_output( suno_prompt: str = "Funky pop, crisp drums, bright bass", exclude: str = "cheesy, country", weirdness: int = 50, style_influence: int = 60, ) -> str: + """Valid output for the style branch.""" return ( - f"LYRICS\n{lyrics}\n\n" f"SUNO PROMPT\n{suno_prompt}\n\n" f"EXCLUDE\n{exclude}\n\n" f"WEIRDNESS\n{weirdness}\n\n" @@ -53,6 +64,69 @@ def _valid_output( ) +def _valid_lyrics_output( + song_title: str = "Hello World", + lyrics: str = "[Verse]\nhello world\n", +) -> str: + """Valid output for the lyrics branch.""" + return f"SONG TITLE\n{song_title}\n\nLYRICS\n{lyrics}\n" + + +def _valid_profile_output() -> str: + """Valid per-section profile output for profile inference.""" + return ( + 'Verse: {"lines_per_section": "4_lines", "line_length": "default", "pov": "first", ' + '"rhyme_scheme": "aabb", "directness": "balanced", "persona": "earnest", ' + '"humor": "none", "explicitness": "clean", "audience": "general"}\n' + 'Pre-Chorus: {"lines_per_section": "2_lines", "line_length": "short", "pov": "first", ' + '"rhyme_scheme": "aabb", "directness": "direct", 
"persona": "earnest", ' + '"humor": "none", "explicitness": "clean", "audience": "general"}\n' + 'Chorus: {"lines_per_section": "4_lines", "line_length": "short", "pov": "first", ' + '"rhyme_scheme": "aaaa", "directness": "direct", "persona": "earnest", ' + '"humor": "none", "explicitness": "clean", "audience": "general"}\n' + 'Post-Chorus: {"lines_per_section": "2_lines", "line_length": "sparse", "pov": "none", ' + '"rhyme_scheme": "aaaa", "directness": "direct", "persona": "earnest", ' + '"humor": "none", "explicitness": "clean", "audience": "general"}\n' + 'Bridge: {"lines_per_section": "4_lines", "line_length": "default", "pov": "second", ' + '"rhyme_scheme": "abab", "directness": "metaphor_heavy", "persona": "melancholic", ' + '"humor": "none", "explicitness": "clean", "audience": "general"}\n' + 'Structure: ["Intro", "Verse", "Chorus", "Verse", "Chorus", "Bridge", "Chorus", "Outro"]' + ) + + +def _style_name_output() -> str: + """Valid output for style name generation.""" + return "Indie Pop Fusion" + + +def _happy_path_responses( + style_output=None, + profile_output=None, + lyrics_output=None, + style_name=None, +) -> list[str]: + """Standard 4-response sequence for non-instrumental happy path.""" + return [ + style_output or _valid_style_output(), + profile_output or _valid_profile_output(), + lyrics_output or _valid_lyrics_output(), + style_name or _style_name_output(), + ] + + +def _instrumental_responses( + style_output=None, + title="The Last Horizon", + style_name=None, +) -> list[str]: + """Standard 3-response sequence for instrumental mode.""" + return [ + style_output or _valid_style_output(), + title, + style_name or _style_name_output(), + ] + + # --------------------------------------------------------------------------- # Basic use cases (happy path) # --------------------------------------------------------------------------- @@ -60,20 +134,20 @@ def _valid_output( def test_valid_output_no_repairs_needed(): """When the LLM returns valid output 
on first try, no repairs are triggered.""" - output = _valid_output() - llm = FakeLLM([output]) + llm = FakeLLM(_happy_path_responses()) builder = AgentPromptGraph(_settings(), llm=llm) req = AdvancedGenerateRequest( user_prompt="Make a funky pop song", lyrics_about="dancing in the rain", selected_artists=[], tags=["pop", "funk"], + prompt_variant="v5_hybrid", ) result = asyncio.run(builder.generate(req)) - assert llm.calls == 1 # no repair calls - assert result["debug_info"]["repaired"] is False + assert llm.calls == 4 # style + profile + lyrics + style_name + assert result["debug_info"]["summary"]["repairs"] == 0 assert result["lyrics"] == "[Verse]\nhello world" assert result["suno_prompt"] == "Funky pop, crisp drums, bright bass" assert result["exclude"] == "cheesy, country" @@ -82,13 +156,13 @@ def test_valid_output_no_repairs_needed(): def test_extracts_all_response_fields(): - """All expected fields are present in the response.""" - output = _valid_output() - llm = FakeLLM([output]) + """All expected fields are present in the two-step response.""" + llm = FakeLLM(_happy_path_responses()) builder = AgentPromptGraph(_settings(), llm=llm) req = AdvancedGenerateRequest( user_prompt="Cinematic orchestral piece", lyrics_about="stars colliding", + prompt_variant="v5_hybrid", ) result = asyncio.run(builder.generate(req)) @@ -101,50 +175,57 @@ def test_extracts_all_response_fields(): assert "style_influence" in result assert "generation_id" in result assert "debug_info" in result - assert "agent_model" in result["debug_info"] - assert "context_hash" in result["debug_info"] - assert "repaired" in result["debug_info"] + # DebugTrace format + assert "summary" in result["debug_info"] + assert "spans" in result["debug_info"] + assert result["debug_info"]["summary"]["variant"] == "v5_hybrid" + assert result["debug_info"]["summary"]["architecture"] == "two_step" -def test_concept_title_derived_from_lyrics_about(): - """Concept title is derived from lyrics_about when 
provided.""" - output = _valid_output() - llm = FakeLLM([output]) +def test_concept_title_from_lyrics_branch(): + """In two-step, concept title comes from the lyrics branch song_title.""" + llm = FakeLLM( + _happy_path_responses( + lyrics_output=_valid_lyrics_output(song_title="Ants Marching On Mars"), + ) + ) builder = AgentPromptGraph(_settings(), llm=llm) req = AdvancedGenerateRequest( user_prompt="Make something epic", lyrics_about="ants marching on Mars", + prompt_variant="v5_hybrid", ) result = asyncio.run(builder.generate(req)) - # Title should be derived from first few words of lyrics_about assert result["concept_title"] == "Ants Marching On Mars" -def test_concept_title_falls_back_to_user_prompt(): - """When lyrics_about is empty, concept title is derived from user_prompt.""" - output = _valid_output() - llm = FakeLLM([output]) +def test_concept_title_instrumental_from_title_llm(): + """When lyrics_about is empty (instrumental), title comes from title LLM.""" + llm = FakeLLM(_instrumental_responses(title="Heavy Metal Thunder")) builder = AgentPromptGraph(_settings(), llm=llm) req = AdvancedGenerateRequest( user_prompt="heavy metal breakdown", lyrics_about="", + prompt_variant="v5_hybrid", ) result = asyncio.run(builder.generate(req)) - assert result["concept_title"] == "Heavy Metal Breakdown" + assert result["concept_title"] == "Heavy Metal Thunder" + assert result["lyrics"] == "" def test_generation_id_is_unique(): """Each generation produces a unique generation_id.""" - output = _valid_output() - llm = FakeLLM([output, output]) + # Provide enough responses for two full generations + llm = FakeLLM(_happy_path_responses() + _happy_path_responses()) builder = AgentPromptGraph(_settings(), llm=llm) req = AdvancedGenerateRequest( user_prompt="synth wave", lyrics_about="neon nights", + prompt_variant="v5_hybrid", ) result1 = asyncio.run(builder.generate(req)) @@ -153,188 +234,231 @@ def test_generation_id_is_unique(): assert result1["generation_id"] != 
result2["generation_id"]


-def test_suno_prompt_over_500_triggers_repair_and_then_error():
-    """SUNO PROMPT >500 chars is invalid; after repairs are exhausted we return an error (no fallback)."""
+# ---------------------------------------------------------------------------
+# Style branch validation + repair tests
+# ---------------------------------------------------------------------------
+
+
+def test_suno_prompt_over_500_triggers_style_repairs():
+    """SUNO PROMPT >500 chars triggers repair attempts in style branch."""
     long_prompt = "A" * 600
-    output = _valid_output(suno_prompt=long_prompt)
-    llm = FakeLLM([output, output, output])
+    bad_style = _valid_style_output(suno_prompt=long_prompt)
+    llm = FakeLLM(
+        [
+            bad_style,  # #1: style.generate (bad)
+            bad_style,  # #2: style.repair.1 (still bad)
+            bad_style,  # #3: style.repair.2 (still bad)
+            _valid_profile_output(),  # #4: lyrics.profile_infer
+            _valid_lyrics_output(),  # #5: lyrics.generate
+            _style_name_output(),  # #6: style.name_generate
+        ]
+    )
     builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="test prompt",
         lyrics_about="test topic",
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

-    assert llm.calls == 3  # initial + 2 repairs (default)
-    assert result["success"] is False
-    assert "issues" in result and result["issues"]
-    assert any(
-        "SUNO PROMPT exceeds 500 characters" in issue for issue in result["issues"]
+    assert llm.calls == 6  # style(3) + profile + lyrics + name
+    # Style branch proceeded with issues
+    assert len(result["suno_prompt"]) > 500
+    # Debug trace shows repair attempts
+    spans = result["debug_info"]["spans"]
+    repair_spans = [s for s in spans if "repair" in s["name"]]
+    assert len(repair_spans) == 2
+
+
+def test_weirdness_out_of_range_triggers_style_repairs():
+    """Weirdness >100 triggers repair attempts in style branch."""
+    bad_style = _valid_style_output(weirdness=150)
+    llm = FakeLLM(
+        [
+            bad_style,  # #1: style.generate (bad)
+            bad_style,  # #2: style.repair.1 (still bad)
+            bad_style,  # #3: style.repair.2 (still bad)
+            _valid_profile_output(),  # #4: lyrics.profile_infer
+            _valid_lyrics_output(),  # #5: lyrics.generate
+            _style_name_output(),  # #6: style.name_generate
+        ]
     )
-
-
-def test_weirdness_out_of_range_triggers_repair_and_then_error():
-    """Weirdness values outside 0-100 are invalid; after repairs we return an error (no fallback)."""
-    # Value > 100 is invalid per validator
-    output_high = _valid_output(weirdness=150)
-    llm = FakeLLM([output_high, output_high, output_high])
     builder = AgentPromptGraph(_settings(), llm=llm)
-    req = AdvancedGenerateRequest(user_prompt="test", lyrics_about="test")
+    req = AdvancedGenerateRequest(
+        user_prompt="test",
+        lyrics_about="test",
+        prompt_variant="v5_hybrid",
+    )

     result = asyncio.run(builder.generate(req))

-    assert llm.calls == 3
-    assert result["success"] is False
-    assert any(
-        "WEIRDNESS must be between 0 and 100" in issue for issue in result["issues"]
+    assert llm.calls == 6
+    assert result["weirdness"] == 150
+    spans = result["debug_info"]["spans"]
+    repair_spans = [s for s in spans if "repair" in s["name"]]
+    assert len(repair_spans) == 2
+
+
+def test_style_influence_out_of_range_triggers_style_repairs():
+    """Style influence >100 triggers repair attempts in style branch."""
+    bad_style = _valid_style_output(style_influence=200)
+    llm = FakeLLM(
+        [
+            bad_style,  # #1: style.generate (bad)
+            bad_style,  # #2: style.repair.1 (still bad)
+            bad_style,  # #3: style.repair.2 (still bad)
+            _valid_profile_output(),  # #4: lyrics.profile_infer
+            _valid_lyrics_output(),  # #5: lyrics.generate
+            _style_name_output(),  # #6: style.name_generate
+        ]
     )
-
-
-def test_style_influence_out_of_range_triggers_repair_and_then_error():
-    """Style influence values outside 0-100 are invalid; after repairs we return an error (no fallback)."""
-    output = _valid_output(style_influence=200)
-    llm = FakeLLM([output, output, output])
     builder = AgentPromptGraph(_settings(), llm=llm)
-    req = AdvancedGenerateRequest(user_prompt="test", lyrics_about="test")
+    req = AdvancedGenerateRequest(
+        user_prompt="test",
+        lyrics_about="test",
+        prompt_variant="v5_hybrid",
+    )

     result = asyncio.run(builder.generate(req))

-    assert llm.calls == 3
-    assert result["success"] is False
-    assert any(
-        "STYLE INFLUENCE must be between 0 and 100" in issue
-        for issue in result["issues"]
-    )
+    assert llm.calls == 6
+    assert result["style_influence"] == 200
+    spans = result["debug_info"]["spans"]
+    repair_spans = [s for s in spans if "repair" in s["name"]]
+    assert len(repair_spans) == 2


 def test_tags_are_passed_through():
-    """Tags from request are included in context and don't break generation."""
-    output = _valid_output()
-    llm = FakeLLM([output])
+    """Tags from request don't break generation."""
+    llm = FakeLLM(_happy_path_responses())
     builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="indie rock anthem",
         lyrics_about="summer nights",
         tags=["indie", "rock", "anthemic"],
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

-    assert result["debug_info"]["repaired"] is False
+    assert result["debug_info"]["summary"]["repairs"] == 0
     assert result["suno_prompt"]  # output is valid


 def test_selected_artists_not_leaked_when_valid():
-    """Selected artists are used for context but don't appear in valid output."""
-    # LLM returns valid output that doesn't mention artist
-    output = _valid_output(suno_prompt="Retro funk, smooth bass, falsetto vocals")
-    llm = FakeLLM([output])
+    """Selected artists don't appear in valid style output."""
+    llm = FakeLLM(
+        _happy_path_responses(
+            style_output=_valid_style_output(
+                suno_prompt="Retro funk, smooth bass, falsetto vocals"
+            ),
+        )
+    )
     builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="Make it sound like Prince",
         lyrics_about="purple rain",
         selected_artists=["Prince"],
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

     assert "prince" not in result["suno_prompt"].lower()
-    assert result["debug_info"]["repaired"] is False
-
-
-def test_repairs_when_missing_sections():
-    # First output is missing sections and order (invalid), second output is valid.
-    bad = "LYRICS\n" "[Verse]\nhello\n" "\n" "SUNO PROMPT\n" "some prompt\n"
-    good = (
-        "LYRICS\n"
-        "[Verse]\nhello\n\n"
-        "SUNO PROMPT\n"
-        "some prompt\n\n"
-        "EXCLUDE\n"
-        "cheesy, country\n\n"
-        "WEIRDNESS\n"
-        "42\n\n"
-        "STYLE INFLUENCE\n"
-        "55\n"
-    )
-
-    llm = FakeLLM([bad, good])
+    assert result["debug_info"]["summary"]["repairs"] == 0
+
+
+def test_style_repair_fixes_missing_sections():
+    """Style branch repairs when initial output is missing required sections."""
+    bad = "SUNO PROMPT\nsome prompt\n"  # Missing EXCLUDE, WEIRDNESS, STYLE INFLUENCE
+    good = _valid_style_output(
+        suno_prompt="some prompt",
+        exclude="cheesy, country",
+        weirdness=42,
+        style_influence=55,
+    )
+    llm = FakeLLM(
+        [
+            bad,  # #1: style.generate (bad — missing EXCLUDE)
+            good,  # #2: style.repair.1 (good)
+            _valid_profile_output(),  # #3: lyrics.profile_infer
+            _valid_lyrics_output(),  # #4: lyrics.generate
+            _style_name_output(),  # #5: style.name_generate
+        ]
+    )
     builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="Make something big and cinematic",
         lyrics_about="ants on Mars",
         selected_artists=[],
         tags=["cinematic"],
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

-    assert llm.calls == 2  # one repair pass
-    assert result["debug_info"]["repaired"] is True
+
+    assert llm.calls == 5  # style(2) + profile + lyrics + name
+    spans = result["debug_info"]["spans"]
+    repair_spans = [s for s in spans if "repair" in s["name"]]
+    assert len(repair_spans) == 1
     assert result["suno_prompt"] == "some prompt"
     assert result["exclude"] == "cheesy, country"
     assert result["weirdness"] == 42
     assert result["style_influence"] == 55


-def test_repairs_when_artist_name_leaks_in_suno_prompt_only():
-    # First output leaks an artist name in SUNO PROMPT; second output fixes it.
-    bad = (
-        "LYRICS\n"
-        "[Verse]\nhello\n\n"
-        "SUNO PROMPT\n"
-        "In the style of Bruno Mars, funky pop groove\n\n"
-        "EXCLUDE\n"
-        "cheesy\n\n"
-        "WEIRDNESS\n"
-        "50\n\n"
-        "STYLE INFLUENCE\n"
-        "60\n"
-    )
-    good = (
-        "LYRICS\n"
-        "[Verse]\nhello\n\n"
-        "SUNO PROMPT\n"
-        "Funky pop groove, bright bass, crisp drums, glossy modern mix\n\n"
-        "EXCLUDE\n"
-        "cheesy\n\n"
-        "WEIRDNESS\n"
-        "50\n\n"
-        "STYLE INFLUENCE\n"
-        "60\n"
-    )
-
-    llm = FakeLLM([bad, good])
+def test_artist_names_not_in_clean_suno_prompt():
+    """Verify clean suno_prompt doesn't contain artist names."""
+    clean_style = _valid_style_output(
+        suno_prompt="Funky pop groove, bright bass, crisp drums, glossy modern mix",
+    )
+    llm = FakeLLM(_happy_path_responses(style_output=clean_style))
     builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="Make a song that sounds like Bruno Mars",
         lyrics_about="dancing alone",
         selected_artists=["Bruno Mars"],
         tags=["pop"],
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

-    assert llm.calls == 2
-    assert result["debug_info"]["repaired"] is True
+
+    assert llm.calls == 4
     assert "bruno" not in result["suno_prompt"].lower()


-def test_falls_back_after_two_failed_repairs():
-    # Provide three invalid outputs (initial + 2 repairs). Builder should return an error (no fallback).
-    invalid = "SUNO PROMPT\nblah\n"
-    llm = FakeLLM([invalid, invalid, invalid])
+def test_style_branch_proceeds_after_max_repairs():
+    """After exhausting repairs, style branch proceeds with issues."""
+    invalid_style = "SUNO PROMPT\nblah\n"  # Missing EXCLUDE
+    llm = FakeLLM(
+        [
+            invalid_style,  # #1: style.generate (bad)
+            invalid_style,  # #2: style.repair.1 (still bad)
+            invalid_style,  # #3: style.repair.2 (still bad)
+            _valid_profile_output(),  # #4: lyrics.profile_infer
+            _valid_lyrics_output(),  # #5: lyrics.generate
+            _style_name_output(),  # #6: style.name_generate
+        ]
+    )
     builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="Make a song that sounds like Will.I.Am",
         lyrics_about="robots",
         selected_artists=["Will.I.Am"],
         tags=["electropop"],
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

-    assert llm.calls == 3  # initial + 2 repairs
-    assert result["success"] is False
-    assert "issues" in result and result["issues"]
+
+    assert llm.calls == 6  # style(3) + profile + lyrics + name
+    # Result is still returned (two-step doesn't return error)
+    assert "suno_prompt" in result
+    spans = result["debug_info"]["spans"]
+    repair_spans = [s for s in spans if "repair" in s["name"]]
+    assert len(repair_spans) == 2


 # ---------------------------------------------------------------------------
@@ -342,100 +466,83 @@ def test_falls_back_after_two_failed_repairs():
 # ---------------------------------------------------------------------------


-def test_repair_disabled_skips_repair_attempts():
-    """When agent_repair_enabled=False, no repair attempts are made."""
-    # First output is invalid (missing sections)
-    invalid = "SUNO PROMPT\nblah\n"
-    llm = FakeLLM([invalid])
-    settings = Settings(
-        spotify_client_id="test",
-        openai_api_key="test",
-        agent_repair_enabled=False,
+def test_zero_max_repairs_skips_repair_attempts():
+    """When agent_max_repairs=0, style branch skips repair attempts."""
+    bad_style = "SUNO PROMPT\nblah\n"  # Missing EXCLUDE
+    llm = FakeLLM(
+        [
+            bad_style,  # #1: style.generate (bad, no repair)
+            _valid_profile_output(),  # #2: lyrics.profile_infer
+            _valid_lyrics_output(),  # #3: lyrics.generate
+            _style_name_output(),  # #4: style.name_generate
+        ]
     )
+    settings = _settings(agent_max_repairs=0)
     builder = AgentPromptGraph(settings, llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="test",
         lyrics_about="test",
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

-    # Should only call LLM once (no repairs), then return error
-    assert llm.calls == 1
-    assert result["success"] is False
-    assert "issues" in result and result["issues"]
+    assert llm.calls == 4  # style(1) + profile + lyrics + name
+    spans = result["debug_info"]["spans"]
+    repair_spans = [s for s in spans if "repair" in s["name"]]
+    assert len(repair_spans) == 0


 def test_custom_max_repairs_is_respected():
     """When agent_max_repairs is set to a custom value, it's respected."""
-    invalid = "SUNO PROMPT\nblah\n"
-    # Provide enough invalid outputs for 5 repairs
-    llm = FakeLLM([invalid] * 6)
-    settings = Settings(
-        spotify_client_id="test",
-        openai_api_key="test",
-        agent_repair_enabled=True,
-        agent_max_repairs=5,
-    )
-    builder = AgentPromptGraph(settings, llm=llm)
-    req = AdvancedGenerateRequest(
-        user_prompt="test",
-        lyrics_about="test",
-    )
-
-    result = asyncio.run(builder.generate(req))
-
-    # Should call LLM 6 times: initial + 5 repairs
-    assert llm.calls == 6
-    assert result["success"] is False
-    assert "issues" in result and result["issues"]
-
-
-def test_zero_max_repairs_goes_straight_to_fallback():
-    """When agent_max_repairs=0, invalid output immediately returns error (no fallback)."""
-    invalid = "SUNO PROMPT\nblah\n"
-    llm = FakeLLM([invalid])
-    settings = Settings(
-        spotify_client_id="test",
-        openai_api_key="test",
-        agent_repair_enabled=True,
-        agent_max_repairs=0,
-    )
+    invalid_style = "SUNO PROMPT\nblah\n"  # Missing EXCLUDE
+    llm = FakeLLM(
+        [invalid_style] * 6  # style: initial + 5 repairs
+        + [_valid_profile_output()]  # lyrics.profile_infer
+        + [_valid_lyrics_output()]  # lyrics.generate
+        + [_style_name_output()]  # style.name_generate
+    )
+    settings = _settings(agent_max_repairs=5)
     builder = AgentPromptGraph(settings, llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="test",
         lyrics_about="test",
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

-    # Only 1 LLM call (initial), then immediate error
-    assert llm.calls == 1
-    assert result["success"] is False
-    assert "issues" in result and result["issues"]
+    # 6 style calls + 2 lyrics + 1 name = 9
+    assert llm.calls == 9
+    spans = result["debug_info"]["spans"]
+    repair_spans = [s for s in spans if "repair" in s["name"]]
+    assert len(repair_spans) == 5


-def test_debug_info_includes_repair_config():
-    """Debug info includes repair_enabled and max_repairs from config."""
-    output = _valid_output()
-    llm = FakeLLM([output])
-    settings = Settings(
-        spotify_client_id="test",
-        openai_api_key="test",
-        agent_repair_enabled=True,
-        agent_max_repairs=3,
-    )
+def test_debug_info_has_trace_format():
+    """Debug info uses DebugTrace format with summary and spans."""
+    llm = FakeLLM(_happy_path_responses())
+    settings = _settings(agent_max_repairs=3)
     builder = AgentPromptGraph(settings, llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="test",
         lyrics_about="test",
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

-    assert result["debug_info"]["repair_enabled"] is True
-    assert result["debug_info"]["max_repairs"] == 3
-    assert result["debug_info"]["repaired"] is False
+    debug = result["debug_info"]
+    # DebugTrace v1 structure
+    assert debug["version"] == 1
+    assert "summary" in debug
+    assert "spans" in debug
+    summary = debug["summary"]
+    assert summary["variant"] == "v5_hybrid"
+    assert summary["architecture"] == "two_step"
+    assert summary["repairs"] == 0
+    assert summary["success"] is True
+    assert summary["llm_calls"] >= 2


 # ---------------------------------------------------------------------------
@@ -443,85 +550,47 @@ def test_debug_info_includes_repair_config():
 # ---------------------------------------------------------------------------


-def _valid_style_output(
-    suno_prompt: str = "Funky pop, crisp drums, bright bass",
-    exclude: str = "cheesy, country",
-    weirdness: int = 50,
-    style_influence: int = 60,
-) -> str:
-    """Valid output for the style branch (two-step variants)."""
-    return (
-        f"SUNO PROMPT\n{suno_prompt}\n\n"
-        f"EXCLUDE\n{exclude}\n\n"
-        f"WEIRDNESS\n{weirdness}\n\n"
-        f"STYLE INFLUENCE\n{style_influence}\n"
-    )
-
-
 def test_instrumental_with_blank_lyrics_about_returns_empty_lyrics():
     """When lyrics_about is blank, instrumental mode returns empty lyrics."""
-    # Style output + title output (no lyrics branch should run)
-    style_output = _valid_style_output()
-    title_output = "The Last Horizon"
-    llm = FakeLLM([style_output, title_output])
-    settings = Settings(
-        spotify_client_id="test",
-        openai_api_key="test",
-    )
-    builder = AgentPromptGraph(settings, llm=llm)
+    llm = FakeLLM(_instrumental_responses(title="The Last Horizon"))
+    builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="Epic orchestral soundtrack",
-        lyrics_about="",  # Empty = instrumental
-        prompt_variant="v5_hybrid",  # Two-step variant
+        lyrics_about="",
+        prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

-    # Should return empty lyrics
     assert result["lyrics"] == ""
-    # Should have a creative title from LLM
     assert result["concept_title"] == "The Last Horizon"
-    # Should have valid style output
     assert result["suno_prompt"] == "Funky pop, crisp drums, bright bass"
     assert result["exclude"] == "cheesy, country"
     assert result["weirdness"] == 50
     assert result["style_influence"] == 60
-    # 2 LLM calls: style + title (no lyrics)
-    assert llm.calls == 2
+    assert llm.calls == 3  # style + title + style_name


 def test_instrumental_with_keyword_returns_empty_lyrics():
     """When lyrics_about contains 'instrumental', returns empty lyrics."""
-    style_output = _valid_style_output()
-    title_output = "Drift"
-    llm = FakeLLM([style_output, title_output])
-    settings = Settings(
-        spotify_client_id="test",
-        openai_api_key="test",
-    )
-    builder = AgentPromptGraph(settings, llm=llm)
+    llm = FakeLLM(_instrumental_responses(title="Drift"))
+    builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="Ambient electronic",
-        lyrics_about="instrumental track",  # Keyword triggers instrumental mode
+        lyrics_about="instrumental track",
         prompt_variant="v5_hybrid",
     )

     result = asyncio.run(builder.generate(req))

     assert result["lyrics"] == ""
-    assert llm.calls == 2  # Style + title
+    assert llm.calls == 3  # style + title + style_name


 def test_instrumental_with_no_vocals_keyword_returns_empty_lyrics():
     """When lyrics_about contains 'no vocals', returns empty lyrics."""
-    style_output = _valid_style_output()
-    title_output = "Velvet Thunder"
-    llm = FakeLLM([style_output, title_output])
-    settings = Settings(
-        spotify_client_id="test",
-        openai_api_key="test",
-    )
-    builder = AgentPromptGraph(settings, llm=llm)
+    llm = FakeLLM(_instrumental_responses(title="Velvet Thunder"))
+    builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="Jazz fusion",
         lyrics_about="no vocals, just instruments",
@@ -531,22 +600,16 @@ def test_instrumental_with_no_vocals_keyword_returns_empty_lyrics():

     result = asyncio.run(builder.generate(req))

     assert result["lyrics"] == ""
-    assert llm.calls == 2  # Style + title
+    assert llm.calls == 3  # style + title + style_name


 def test_instrumental_with_tag_returns_empty_lyrics():
     """When tags include 'instrumental', returns empty lyrics."""
-    style_output = _valid_style_output()
-    title_output = "Through Glass Canyons"
-    llm = FakeLLM([style_output, title_output])
-    settings = Settings(
-        spotify_client_id="test",
-        openai_api_key="test",
-    )
-    builder = AgentPromptGraph(settings, llm=llm)
+    llm = FakeLLM(_instrumental_responses(title="Through Glass Canyons"))
+    builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="Post-rock soundscape",
-        lyrics_about="the sunset",  # Non-empty, but tag overrides
+        lyrics_about="the sunset",
         tags=["instrumental", "post-rock"],
         prompt_variant="v5_hybrid",
     )
@@ -554,19 +617,13 @@ def test_instrumental_with_tag_returns_empty_lyrics():

     result = asyncio.run(builder.generate(req))

     assert result["lyrics"] == ""
-    assert llm.calls == 2  # Style + title
+    assert llm.calls == 3  # style + title + style_name


 def test_instrumental_debug_trace_includes_skipped_span():
     """Instrumental mode includes a lyrics.skipped span in debug trace."""
-    style_output = _valid_style_output()
-    title_output = "Midnight in Kyoto"
-    llm = FakeLLM([style_output, title_output])
-    settings = Settings(
-        spotify_client_id="test",
-        openai_api_key="test",
-    )
-    builder = AgentPromptGraph(settings, llm=llm)
+    llm = FakeLLM(_instrumental_responses(title="Midnight in Kyoto"))
+    builder = AgentPromptGraph(_settings(), llm=llm)
     req = AdvancedGenerateRequest(
         user_prompt="Cinematic score",
         lyrics_about="",
@@ -575,7 +632,6 @@

     result = asyncio.run(builder.generate(req))

-    # Check debug trace has lyrics.skipped span
     debug_info = result.get("debug_info", {})
     spans = debug_info.get("spans", [])
     skipped_spans = [s for s in spans if s.get("name") == "lyrics.skipped"]
@@ -595,15 +651,21 @@ def test_is_instrumental_request_helper():
     assert AgentPromptGraph._is_instrumental_request(req2) is True

     # Test "instrumental" keyword
-    req3 = AdvancedGenerateRequest(user_prompt="test", lyrics_about="an instrumental piece")
+    req3 = AdvancedGenerateRequest(
+        user_prompt="test", lyrics_about="an instrumental piece"
+    )
     assert AgentPromptGraph._is_instrumental_request(req3) is True

     # Test "no vocals" keyword
-    req4 = AdvancedGenerateRequest(user_prompt="test", lyrics_about="no vocals needed")
+    req4 = AdvancedGenerateRequest(
+        user_prompt="test", lyrics_about="no vocals needed"
+    )
     assert AgentPromptGraph._is_instrumental_request(req4) is True

     # Test "no lyrics" keyword
-    req5 = AdvancedGenerateRequest(user_prompt="test", lyrics_about="no lyrics please")
+    req5 = AdvancedGenerateRequest(
+        user_prompt="test", lyrics_about="no lyrics please"
+    )
     assert AgentPromptGraph._is_instrumental_request(req5) is True

     # Test instrumental tag
@@ -613,7 +675,9 @@

     assert AgentPromptGraph._is_instrumental_request(req6) is True

     # Test non-instrumental request
-    req7 = AdvancedGenerateRequest(user_prompt="test", lyrics_about="love and heartbreak")
+    req7 = AdvancedGenerateRequest(
+        user_prompt="test", lyrics_about="love and heartbreak"
+    )
     assert AgentPromptGraph._is_instrumental_request(req7) is False

     # Test non-instrumental with tags
diff --git a/backend/tests/test_fix_gen_bugs.py b/backend/tests/test_fix_gen_bugs.py
index 6141dc3..8febf90 100644
--- a/backend/tests/test_fix_gen_bugs.py
+++ b/backend/tests/test_fix_gen_bugs.py
@@ -454,7 +454,7 @@ class TestVocabularyRules:
     def test_overused_words_flagged_in_spec(self):
         from app.prompts.specs import LYRICS_SPEC

-        assert "Avoid overusing" in LYRICS_SPEC
+        assert "NEVER use 3 or more" in LYRICS_SPEC
         for word in ["silver", "velvet", "neon", "shattered", "crimson", "golden"]:
             assert word in LYRICS_SPEC, f"Overused word '{word}' missing from LYRICS_SPEC"
@@ -494,3 +494,108 @@ def test_repair_agent_has_varied_lines_rule(self):
         from app.prompts.specs import LYRICS_REPAIR_AGENT

         assert "Each line in a chorus must be distinct" in LYRICS_REPAIR_AGENT
+
+
+# ============================================================================
+# PR 5: Overused word validation
+# ============================================================================
+
+
+class TestOverusedWordValidation:
+    """Test _check_overused_words detects banned generic poetic words."""
+
+    def _check(self, lyrics: str) -> list[str]:
+        from app.services.agent_prompt_graph import AgentPromptGraph
+
+        return AgentPromptGraph(MagicMock())._check_overused_words(lyrics)
+
+    def test_no_banned_words_passes(self):
+        lyrics = """[Verse]
+Walking down the highway
+Wind against my face
+Truck is running steady
+Heading for that place"""
+        assert self._check(lyrics) == []
+
+    def test_one_banned_word_passes(self):
+        lyrics = """[Verse]
+Golden sunset falling
+Over fields of grain"""
+        assert self._check(lyrics) == []
+
+    def test_two_banned_words_passes(self):
+        lyrics = """[Verse]
+Golden light through shadows
+Dancing on the wall"""
+        assert self._check(lyrics) == []
+
+    def test_three_banned_words_flagged(self):
+        lyrics = """[Verse]
+Golden whisper through the shadows
+Falling into darkness"""
+        issues = self._check(lyrics)
+        assert len(issues) == 1
+        assert "generic poetic words" in issues[0]
+
+    def test_five_banned_words_flagged(self):
+        lyrics = """[Verse]
+Silver moonlight through velvet shadows
+Crimson embers whisper low"""
+        issues = self._check(lyrics)
+        assert len(issues) == 1
+        for word in ["silver", "velvet", "shadows", "crimson", "embers"]:
+            assert word in issues[0]
+
+    def test_case_insensitive(self):
+        lyrics = """[Verse]
+GOLDEN light through SHADOWS
+WHISPER in the night"""
+        issues = self._check(lyrics)
+        assert len(issues) == 1
+
+    def test_banned_word_as_substring_not_counted(self):
+        """'whispering' should not match 'whisper'."""
+        lyrics = """[Verse]
+Whispering wind through shadowed halls
+Golden sunrise on the wall"""
+        # "whispering" != "whisper", "shadowed" != "shadows"
+        # Only "golden" is an exact match → passes
+        assert self._check(lyrics) == []
+
+
+# ============================================================================
+# PR 5: concept_title fallback
+# ============================================================================
+
+
+class TestConceptTitleFallback:
+    """Test _derive_title fallback when lyrics branch returns empty title."""
+
+    def test_derive_title_from_lyrics_about(self):
+        from app.services.agent_prompt_graph import AgentPromptGraph
+
+        agent = AgentPromptGraph(MagicMock())
+        title = agent._derive_title("fast punk rock", "corporate greed and rebellion")
+        assert title  # Not empty
+        assert len(title) <= 50
+
+    def test_derive_title_from_user_prompt_when_no_lyrics_about(self):
+        from app.services.agent_prompt_graph import AgentPromptGraph
+
+        agent = AgentPromptGraph(MagicMock())
+        title = agent._derive_title("indie rock with jangly guitars", "")
+        assert title  # Not empty
+        assert "Indie" in title or "indie" in title.lower()
+
+    def test_derive_title_returns_untitled_for_empty_input(self):
+        from app.services.agent_prompt_graph import AgentPromptGraph
+
+        agent = AgentPromptGraph(MagicMock())
+        title = agent._derive_title("", "")
+        assert title == "Untitled"
+
+    def test_vocabulary_spec_has_stronger_language(self):
+        from app.prompts.specs import LYRICS_SPEC
+
+        assert "NEVER use 3 or more" in LYRICS_SPEC
+        assert "validation will reject" in LYRICS_SPEC
diff --git a/benchmarks/README.md b/benchmarks/README.md
new file mode 100644
index 0000000..2d2ec0d
--- /dev/null
+++ b/benchmarks/README.md
@@ -0,0 +1,30 @@
+# Benchmarks
+
+This directory stores timestamped results from `/test-quality` and `/test-perf` skill runs. Each file captures a snapshot so we can detect regressions over time.
+
+## File naming
+
+- `perf-YYYY-MM-DD.md` — latency benchmarks
+- `quality-YYYY-MM-DD.md` — generation quality assessments
+
+## How to add a result
+
+After running `/test-quality` or `/test-perf`, save the report here with today's date. If multiple runs happen on the same day, append a suffix (e.g., `perf-2026-02-21-b.md`).
+
+## What to track
+
+### Performance (`perf-*.md`)
+- `/generate/input-concept` latency (5 calls, min/max/avg)
+- `/generate/advanced` latency (3 calls, min/max/avg)
+- `/generate/refine` latency (3 calls, min/max/avg)
+- Git commit hash or branch name
+
+### Quality (`quality-*.md`)
+- Number of songs generated
+- Banned word appearances (count and which words)
+- Chorus repetition issues (count)
+- Empty/bad `concept_title` count
+- Missing section tags count
+- Stage directions in lyrics count
+- Lines ending in periods count
+- Git commit hash or branch name
diff --git a/benchmarks/perf-2026-02-21.md b/benchmarks/perf-2026-02-21.md
new file mode 100644
index 0000000..719e050
--- /dev/null
+++ b/benchmarks/perf-2026-02-21.md
@@ -0,0 +1,39 @@
+# Performance Benchmark — 2026-02-21
+
+**Branch**: `calderlund--fix-edit-spacing`
+**Commit**: pre-commit (test reliability + agent guardrails)
+**Stack**: local dev server (not Docker)
+
+## `/generate/input-concept` (5 calls, target <2s)
+
+| Call | Latency |
+|------|---------|
+| 1 | 0.007s |
+| 2 | 0.005s |
+| 3 | 0.004s |
+| 4 | 0.004s |
+| 5 | 0.003s |
+
+**Avg: 0.005s** | Min: 0.003s | Max: 0.007s | Target: <2s
+
+## `/generate/advanced` (3 calls, target <15s)
+
+| Call | Genre | Latency |
+|------|-------|---------|
+| 1 | Indie rock (non-instrumental) | 18.97s |
+| 2 | Lo-fi hip hop (non-instrumental) | 20.72s |
+| 3 | Orchestral (instrumental) | 8.83s |
+
+**Avg: 16.17s** | Min: 8.83s | Max: 20.72s | Target: <15s
+
+Non-instrumental is over target (~20s). Instrumental is within target (~9s).
+
+## `/generate/refine` (target <15s)
+
+Could not benchmark — refine endpoint returned `{"detail": "Failed to refine"}`. Likely missing DB/Redis in local dev setup (not Docker). Pre-existing environment issue.
+
+## Notes
+
+- input-concept is very fast (<10ms), well within target
+- advanced non-instrumental latency is ~20s, driven by LLM call time (4 sequential calls: style + profile + lyrics + style_name)
+- Instrumental is faster (~9s) because it only makes 3 calls and skips lyrics
diff --git a/benchmarks/quality-2026-02-21.md b/benchmarks/quality-2026-02-21.md
new file mode 100644
index 0000000..b069b31
--- /dev/null
+++ b/benchmarks/quality-2026-02-21.md
@@ -0,0 +1,55 @@
+# Quality Benchmark — 2026-02-21
+
+**Branch**: `calderlund--fix-edit-spacing`
+**Commit**: pre-commit (test reliability + agent guardrails)
+**Songs generated**: 2 (country, punk)
+
+## Banned/Overused Words
+
+Target: none of [silver, velvet, neon, shattered, whisper, shadows, echoes, crimson, golden, embers] in 3+ of 5 songs.
+
+| Song | Banned words found |
+|------|--------------------|
+| Country | whisper, golden, shadows |
+| Punk | none |
+
+**Result**: 1/2 songs had banned words (3 words in country). Would need 5-song run to properly assess the 3+ threshold.
+
+## Chorus Repetition
+
+Target: no chorus with same line 3+ times.
+
+| Song | Issue? |
+|------|--------|
+| Country | No |
+| Punk | No |
+
+**Result**: Pass
+
+## Style Names (`concept_title`)
+
+Target: short (<30 chars), descriptive.
+
+| Song | concept_title | Length | OK? |
+|------|---------------|--------|-----|
+| Country | "" (empty) | 0 | FAIL |
+| Punk | "" (empty) | 0 | FAIL |
+
+**Result**: FAIL — both songs returned empty concept_title. The style_name LLM call appears not to populate this field.
+
+## Structure
+
+Target: section tags present, no stage directions, no periods at end of lines.
+
+| Song | Has tags | Stage directions | Periods |
+|------|----------|-----------------|---------|
+| Country | Yes (with modifiers: `[Verse, earnest, reflective]`) | No | No |
+| Punk | Yes | No | No |
+
+**Result**: Pass — tags present with modifiers, no stage directions, no periods.
+
+## Notes
+
+- Empty `concept_title` is a pre-existing issue — the style_name generation step isn't populating it in the `/generate/advanced` response
+- Country song vocabulary leans on common words (whisper, golden, shadows) — the vocabulary rules in LYRICS_SPEC may need strengthening for the country genre
+- Only 2 songs tested; a full 5-genre run would give better signal
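
---

The substring test above (`'whispering' should not match 'whisper'`) implies whole-word, case-insensitive matching with a threshold of 3 distinct hits. A minimal sketch of that behavior — the word list, threshold, and helper name here are inferred from the tests, not the repo's actual `_check_overused_words` implementation in `app/services/agent_prompt_graph.py`:

```python
import re

# Hypothetical banned-word list, taken from the quality benchmark's target line.
BANNED = [
    "silver", "velvet", "neon", "shattered", "whisper",
    "shadows", "echoes", "crimson", "golden", "embers",
]


def check_overused_words(lyrics: str, threshold: int = 3) -> list[str]:
    """Flag lyrics using `threshold`+ distinct banned words (whole-word match)."""
    # Tokenize into exact lowercase words, so "whispering" never matches "whisper".
    tokens = set(re.findall(r"[a-z']+", lyrics.lower()))
    hits = [word for word in BANNED if word in tokens]
    if len(hits) >= threshold:
        return [f"Lyrics use {len(hits)} generic poetic words: {', '.join(hits)}"]
    return []
```

Tokenizing into a set and intersecting with the banned list is what makes the substring case fall out for free, matching `test_banned_word_as_substring_not_counted`.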