There is currently no way to evaluate whether the AI-optimized output from CLI commands (using the `--ai` flag) is actually effective for AI agents. This is distinct from skill testing - it's about whether the format and content of the CLI output give agents the information they need in the most effective way possible.
The question: "When an AI agent calls `tdn list --ai`, does the output convey the information effectively?" This is essentially testing "information density" - can the AI extract the information it needs, understand it in context, and use it effectively to reason?
Approach
LLM-as-Judge for information extraction. We might write test cases like this:
```yaml
vault: busy-freelancer
command: tdn list --project "Website Redesign" --ai
questions:
  - "How many tasks are in this project?"
  - "Which task is due soonest?"
  - "Are there any blocked tasks?"
ground_truth:
  - 7
  - "Update homepage copy (due 2025-01-15)"
  - "No"
```
We can then:
- Run the command and get the output
- Ask LLM One the questions based on the output
- Have LLM Two compare its answers to our ground truth (see the sketch after this list)
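A minimal sketch of that loop, assuming PyYAML for the test case file; `ask_llm`, the prompts, and the file path are placeholders for whatever client and layout we settle on:

```python
import subprocess
import yaml  # PyYAML


def ask_llm(prompt: str) -> str:
    """Placeholder: wire up whichever model client we choose."""
    raise NotImplementedError


def run_case(case: dict) -> list[str]:
    """Run one test case and return a PASS/FAIL verdict per question."""
    # 1. Run the command and capture its --ai output.
    output = subprocess.run(
        case["command"], shell=True, capture_output=True, text=True
    ).stdout

    verdicts = []
    for question, truth in zip(case["questions"], case["ground_truth"]):
        # 2. Ask LLM One the question, grounded only in the CLI output.
        answer = ask_llm(
            "Using only the CLI output below, answer the question.\n\n"
            f"Output:\n{output}\n\nQuestion: {question}"
        )
        # 3. Ask LLM Two to judge the answer against our ground truth.
        verdict = ask_llm(
            f"Question: {question}\nExpected answer: {truth}\n"
            f"Candidate answer: {answer}\nReply with exactly PASS or FAIL."
        )
        verdicts.append(verdict.strip())
    return verdicts


if __name__ == "__main__":
    with open("cases/busy-freelancer.yaml") as f:  # placeholder path
        print(run_case(yaml.safe_load(f)))
```

Keeping the answerer and the judge as separate calls means the judge never grades its own reasoning, so the verdicts stay closer to a straight comparison against ground truth.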
Rough thoughts on things to try
- Sparse information in the vault (e.g. many empty projects and areas, few tasks) vs an overflowing vault
- Totally un-primed LLM vs minimally-primed vs one with our Skill available
- Effectiveness when piped to `head -n` with decreasing `n` (see the sketch after this list)
- Effectiveness of error responses - does the LLM know what to try next?
- Thinking needed to get to the next tool call - when the question is an instruction to do something, how quickly does the LLM return "make X tool call"?
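For the `head -n` bullet, a rough standalone sketch of the degradation sweep; the command string and line counts are arbitrary, and each truncated output would be fed through the same LLM-as-Judge loop as above:

```python
import subprocess

# Placeholder command; in practice this comes from a test case file.
COMMAND = 'tdn list --project "Website Redesign" --ai'


def truncated_output(command: str, n: int) -> str:
    """Run the command and keep only the first n lines, like `head -n`."""
    full = subprocess.run(command, shell=True, capture_output=True, text=True).stdout
    return "\n".join(full.splitlines()[:n])


# Sweep decreasing n; each truncated output would then be scored with the
# LLM-as-Judge pipeline to find where answer quality starts to drop.
for n in (50, 25, 10, 5):
    out = truncated_output(COMMAND, n)
    print(f"n={n}: {len(out.splitlines())} lines kept")
```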