## Context
From a PR review patterns audit across 7 MongoDB Agent Skills, missing decision trees / diagnostic flows was the third most frequent pattern (12 instances across 5 PRs). Skills tell agents what to analyze but not how to make decisions, what to do with findings, or how to handle non-happy-path scenarios.
This is the hardest quality dimension to evaluate manually and would benefit most from LLM-assisted scoring.
## Proposed LLM scoring dimension
Add a **Decision Tree Completeness** dimension to the LLM-as-judge scoring for SKILL.md.
### Scoring rubric (1-5)
| Score | Criteria |
|-------|----------|
| 5 | Every conditional instruction has explicit true/false paths. Diagnostic sections pair each "check" with a specific action. Fallback behavior is defined for all branch points. |
| 4 | Most conditionals have both paths defined. Minor gaps in edge-case handling. |
| 3 | Happy path is well-defined but alternative paths are vague or missing. Some "if X" without "else". |
| 2 | Multiple decision points with undefined behavior. Diagnostic sections list what to check but not what to do about findings. |
| 1 | Instructions are linear with no branching logic despite the skill covering multiple scenarios. |
## What the judge should evaluate
- For every conditional instruction ("if X", "when Y", "check whether Z"), is the alternative path defined?
- For diagnostic/troubleshooting sections: does each "what to check" have a corresponding "what to do about it"?
- Are there abstract action verbs ("analyze", "identify", "determine") without concrete steps for how to perform the analysis?
- For multi-step workflows: what happens if a step fails? Is that defined?
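Some of these checks can be approximated mechanically before the judge runs. A minimal sketch (the phrase lists and function name are hypothetical, not part of any existing pipeline) that flags conditional lines in a SKILL.md with no alternative path nearby:

```python
import re

# Heuristic phrase lists — illustrative, not exhaustive.
CONDITIONAL = re.compile(r"\b(if|when|check whether)\b", re.IGNORECASE)
ALTERNATIVE = re.compile(r"\b(otherwise|else|if not|fall back|fails?)\b", re.IGNORECASE)

def flag_unpaired_conditionals(skill_text: str, window: int = 2) -> list[int]:
    """Return 1-based line numbers of conditionals with no alternative path nearby.

    A conditional counts as 'paired' if an alternative-path phrase appears
    on the same line or within `window` following lines.
    """
    lines = skill_text.splitlines()
    flagged = []
    for i, line in enumerate(lines):
        if CONDITIONAL.search(line):
            nearby = " ".join(lines[i : i + window + 1])
            if not ALTERNATIVE.search(nearby):
                flagged.append(i + 1)
    return flagged
```

Such a pre-pass only narrows the search; the judge still has to decide whether a nearby "otherwise" actually defines the branch.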
## Examples from PR reviews
| PR | Issue |
|----|-------|
| 5 | "Analyze whether client config or infrastructure" — no decision tree for how to determine which |
| 3 | Bash denied at Step 1 skips Steps 2-4 — user never selects auth option |
| 3 | Partial config (one of two variables set) — undefined behavior |
| 3 | Missing guidance when user can't decide between auth options |
| 9 | "Refining" workflow entirely abstract: "identify issues", "propose improvements" with no routing |
| 2 | "Handle edge cases" — no guidance on which edge cases or what actions |
## Implementation notes
This could be implemented as:
- A new dimension in the existing SKILL.md scoring prompt
- Or a separate, focused evaluation pass (this may produce higher-quality results, since decision-tree analysis requires careful reading)
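As a sketch of the second option, the separate pass could be a single prompt that embeds the rubric and the SKILL.md under review. The function and condensed rubric strings below are illustrative assumptions, not the existing scoring prompt:

```python
# Condensed rubric text — paraphrased from the table above for prompt use.
RUBRIC = {
    5: "Every conditional has explicit true/false paths; each check pairs with an action; fallbacks defined.",
    4: "Most conditionals have both paths defined; minor edge-case gaps.",
    3: "Happy path well-defined; alternative paths vague or missing.",
    2: "Multiple decision points with undefined behavior; checks lack follow-up actions.",
    1: "Linear instructions with no branching despite multiple scenarios.",
}

def build_judge_prompt(skill_md: str) -> str:
    """Assemble a focused decision-tree-completeness prompt for an LLM judge."""
    rubric_text = "\n".join(f"{s}: {RUBRIC[s]}" for s in sorted(RUBRIC, reverse=True))
    return (
        "Score this SKILL.md for decision tree completeness (1 to 5).\n"
        "For every conditional, verify the alternative path is defined; "
        "for every diagnostic check, verify a follow-up action exists.\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"SKILL.md under review:\n{skill_md}\n\n"
        'Respond with JSON: {"score": <1-5>, "gaps": [<quoted offending lines>]}.'
    )
```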
## Related
Part of a series of LLM-judge enhancements derived from PR review pattern analysis. This is the highest-impact single addition for reducing manual review burden.