## Context
From a PR review patterns audit across 7 MongoDB Agent Skills, missing decision trees / diagnostic flows was the third most frequent pattern (12 instances across 5 PRs). Skills tell agents what to analyze but not how to make decisions, what to do with findings, or how to handle non-happy-path scenarios.
This is the hardest quality dimension to evaluate manually and would benefit most from LLM-assisted scoring.
## Proposed LLM scoring dimension
Add a **Decision Tree Completeness** dimension to the LLM-as-judge scoring for SKILL.md.
### Scoring rubric (1-5)
| Score | Criteria |
|-------|----------|
| 5 | Every conditional instruction has explicit true/false paths. Diagnostic sections pair each "check" with a specific action. Fallback behavior is defined for all branch points. |
| 4 | Most conditionals have both paths defined. Minor gaps in edge-case handling. |
| 3 | Happy path is well-defined but alternative paths are vague or missing. Some "if X" without "else". |
| 2 | Multiple decision points with undefined behavior. Diagnostic sections list what to check but not what to do about findings. |
| 1 | Instructions are linear with no branching logic despite the skill covering multiple scenarios. |
## What the judge should evaluate
- For every conditional instruction ("if X", "when Y", "check whether Z"), is the alternative path defined?
- For diagnostic/troubleshooting sections: does each "what to check" have a corresponding "what to do about it"?
- Are there abstract action verbs ("analyze", "identify", "determine") without concrete steps for how to perform the analysis?
- For multi-step workflows: what happens if a step fails? Is that defined?
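Some of these checks can be approximated mechanically before the judge runs. A minimal sketch (the phrase lists and function name are hypothetical, not part of any existing pipeline) that flags conditional lines in a SKILL.md with no alternative path nearby:

```python
import re

# Heuristic phrase lists — illustrative, not exhaustive.
CONDITIONAL = re.compile(r"\b(if|when|check whether)\b", re.IGNORECASE)
ALTERNATIVE = re.compile(r"\b(otherwise|else|if not|fall back|fails?)\b", re.IGNORECASE)

def flag_unpaired_conditionals(skill_text: str, window: int = 2) -> list[int]:
    """Return 1-based line numbers of conditionals with no alternative path nearby.

    A conditional counts as 'paired' if an alternative-path phrase appears
    on the same line or within `window` following lines.
    """
    lines = skill_text.splitlines()
    flagged = []
    for i, line in enumerate(lines):
        if CONDITIONAL.search(line):
            nearby = " ".join(lines[i : i + window + 1])
            if not ALTERNATIVE.search(nearby):
                flagged.append(i + 1)
    return flagged
```

Such a pre-pass only narrows the search; the judge still has to decide whether a nearby "otherwise" actually defines the branch.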
## Examples from PR reviews
| PR | Issue |
|----|-------|
| 5 | "Analyze whether client config or infrastructure" — no decision tree for how to determine which |
| 3 | Bash denied at Step 1 skips Steps 2-4 — user never selects auth option |
| 3 | Partial config (one of two variables set) — undefined behavior |
| 3 | Missing guidance when user can't decide between auth options |
| 9 | "Refining" workflow entirely abstract: "identify issues", "propose improvements" with no routing |
| 2 | "Handle edge cases" — no guidance on which edge cases or what actions |
## Implementation notes
This could be implemented as:
- A new dimension in the existing SKILL.md scoring prompt
- Or a separate, focused evaluation pass (this may produce higher-quality results, since decision-tree analysis requires careful reading)
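As a sketch of the second option, the separate pass could be a single prompt that embeds the rubric and the SKILL.md under review. The function and condensed rubric strings below are illustrative assumptions, not the existing scoring prompt:

```python
# Condensed rubric text — paraphrased from the table above for prompt use.
RUBRIC = {
    5: "Every conditional has explicit true/false paths; each check pairs with an action; fallbacks defined.",
    4: "Most conditionals have both paths defined; minor edge-case gaps.",
    3: "Happy path well-defined; alternative paths vague or missing.",
    2: "Multiple decision points with undefined behavior; checks lack follow-up actions.",
    1: "Linear instructions with no branching despite multiple scenarios.",
}

def build_judge_prompt(skill_md: str) -> str:
    """Assemble a focused decision-tree-completeness prompt for an LLM judge."""
    rubric_text = "\n".join(f"{s}: {RUBRIC[s]}" for s in sorted(RUBRIC, reverse=True))
    return (
        "Score this SKILL.md for decision tree completeness (1 to 5).\n"
        "For every conditional, verify the alternative path is defined; "
        "for every diagnostic check, verify a follow-up action exists.\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"SKILL.md under review:\n{skill_md}\n\n"
        'Respond with JSON: {"score": <1-5>, "gaps": [<quoted offending lines>]}.'
    )
```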
## Related
Part of a series of LLM-judge enhancements derived from PR review pattern analysis. This is the highest-impact single addition for reducing manual review burden.