
Mechanism to test CLI --ai Output Effectiveness #27


Description

@dannysmith

There is currently no way to evaluate whether the AI-optimized output from CLI commands (using the --ai flag) is actually effective for AI agents. This is distinct from skill testing: it's about whether the format and content of CLI output give agents the information they need in the most effective way possible.

The question: when an AI agent calls tdn list --ai, does the output convey the information effectively? This is essentially a test of "information density": can the AI extract the information it needs, understand it in context, and use it effectively to reason?

Approach

LLM-as-Judge for information extraction. We might write test cases like this:

vault: busy-freelancer
command: tdn list --project "Website Redesign" --ai
questions:
  - "How many tasks are in this project?"
  - "Which task is due soonest?"
  - "Are there any blocked tasks?"
ground_truth:
  - 7
  - "Update homepage copy (due 2025-01-15)"
  - "No"

We can then:

  1. Run the command and get the output
  2. Ask LLM One the questions based on the output
  3. Have LLM Two compare its answers to our ground truth.
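
As a starting point, a minimal harness for this could look something like the sketch below. It assumes the test case is stored as YAML (in the shape above), that the command can be run via a shell, and that the LLM calls sit behind a placeholder call_llm function - the actual client, models, and exact prompts are all still to be decided.

```python
import subprocess
import yaml  # requires PyYAML

# Hypothetical test case, matching the YAML shape above.
TEST_CASE = """\
vault: busy-freelancer
command: tdn list --project "Website Redesign" --ai
questions:
  - "How many tasks are in this project?"
  - "Which task is due soonest?"
  - "Are there any blocked tasks?"
ground_truth:
  - 7
  - "Update homepage copy (due 2025-01-15)"
  - "No"
"""


def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM client we end up using."""
    raise NotImplementedError


def run_case(case: dict) -> list[bool]:
    # 1. Run the command and capture its --ai output.
    output = subprocess.run(
        case["command"], shell=True, capture_output=True, text=True
    ).stdout

    verdicts = []
    for question, truth in zip(case["questions"], case["ground_truth"]):
        # 2. LLM One answers the question using only the CLI output.
        answer = call_llm(
            "Using only the following CLI output, answer the question.\n\n"
            f"Output:\n{output}\n\nQuestion: {question}"
        )
        # 3. LLM Two judges the answer against our ground truth.
        verdict = call_llm(
            f"Question: {question}\nExpected answer: {truth}\n"
            f"Given answer: {answer}\n"
            "Does the given answer match the expected answer? Reply YES or NO."
        )
        verdicts.append(verdict.strip().upper().startswith("YES"))
    return verdicts


if __name__ == "__main__":
    case = yaml.safe_load(TEST_CASE)
    print(run_case(case))  # one bool per question
```

The judge step could just as well return a structured verdict (score plus rationale) instead of a bare YES/NO; the important part is keeping the answering LLM and the judging LLM separate.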

Rough thoughts on things to try

  • Sparse information in the vault (e.g. many empty projects and areas, few tasks) vs an overflowing vault
  • Totally un-primed LLM vs minimally-primed LLM vs an LLM with our Skill available
  • Effectiveness when the output is piped to head -n with decreasing n (see the sketch after this list)
  • Effectiveness of error responses - does the agent know what to try next?
  • Thinking needed to get to the next tool call - when the question is an instruction to do something, how quickly does the LLM return "make X tool call"?
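
For the head -n sweep specifically, one cheap way to run it is to capture the full --ai output once, slice it to simulate decreasing values of n, and push each truncated variant through the same question/judge loop. A rough sketch, reusing the call pattern above (the step and range values are arbitrary):

```python
import subprocess


def truncation_sweep(command: str, max_lines: int = 50, step: int = 10) -> dict[int, str]:
    """Capture the full --ai output once, then produce progressively
    shorter versions equivalent to piping through `head -n <n>`."""
    full_output = subprocess.run(
        command, shell=True, capture_output=True, text=True
    ).stdout
    lines = full_output.splitlines()

    return {n: "\n".join(lines[:n]) for n in range(max_lines, 0, -step)}


# Each truncated variant would then be fed through the same
# question/judge loop to see at which point answers start degrading.
```

Plotting judge accuracy against n would show how gracefully the format degrades when an agent only reads the first few lines of output.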
