
█████╗  ██████╗ ███████╗███╗   ██╗████████╗    ███████╗██╗████████╗
██╔══██╗██╔════╝ ██╔════╝████╗  ██║╚══██╔══╝    ██╔════╝██║╚══██╔══╝
███████║██║  ███╗█████╗  ██╔██╗ ██║   ██║       █████╗  ██║   ██║
██╔══██║██║   ██║██╔══╝  ██║╚██╗██║   ██║       ██╔══╝  ██║   ██║
██║  ██║╚██████╔╝███████╗██║ ╚████║   ██║       ██║     ██║   ██║
╚═╝  ╚═╝ ╚═════╝ ╚══════╝╚═╝  ╚═══╝   ╚═╝       ╚═╝     ╚═╝   ╚═╝
                

AgentFit

Does your codebase speak LLM?

AgentFit audits how well AI models can actually work with your Python code — not just read it, but complete functions, fix bugs, navigate a live repo with tools, and explain architecture. It scores five static AI-readiness metrics, then verifies them by benchmarking real LLMs against auto-generated challenges.


What it does

agentfit benchmark ./src

Stage 1 — Static Analysis scores your codebase on five dimensions that predict how well LLMs will perform on it:

Metric                      What it measures
Schema Density              How many data-passing functions use Pydantic / TypedDict / dataclasses
DRYness                     Absence of duplicated function bodies
Docstring Richness          Presence of >>> usage examples in public docstrings
Test Coverage Structural    Ratio of test files to source files
Import Clarity              Absence of circular imports and dependency tangles
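As a rough illustration of how a metric like Docstring Richness could be computed (a simplified sketch using the stdlib `ast` module, not AgentFit's actual implementation):

```python
import ast

def public_functions_with_examples(source: str) -> float:
    """Fraction of public functions whose docstring contains a >>> example.

    Simplified sketch of a Docstring Richness style check; the real
    metric may weigh things differently.
    """
    funcs = [n for n in ast.walk(ast.parse(source))
             if isinstance(n, ast.FunctionDef) and not n.name.startswith("_")]
    if not funcs:
        return 1.0  # nothing public to document
    with_examples = sum(1 for f in funcs if ">>>" in (ast.get_docstring(f) or ""))
    return with_examples / len(funcs)
```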

Stage 2 — LLM Benchmarking auto-generates coding challenges (completion, debugging, explanation, refactoring) from your source tree, sends them to every configured provider, and judges responses with a second LLM.

Stage 3 — Agentic Benchmarking introduces a real mutation into a copy of your codebase and lets the model use filesystem + test tools over multiple turns to find and fix the bug — just like a developer would.

Stage 4 — Correlated Reporting finds which static metrics actually correlate with LLM performance on your specific codebase and surfaces prioritised, actionable recommendations.


Install

pip install agentfit

Requires Python 3.11+. Optional provider SDKs:

pip install agentfit[anthropic]   # Anthropic Claude
pip install agentfit[openai]      # OpenAI / any OpenAI-compatible endpoint
pip install agentfit[all]         # everything

Quick start

# 1. Scaffold a config file
agentfit init

# 2. Full audit — static analysis + LLM benchmarking
agentfit benchmark ./src

# 3. Static analysis only (no API keys needed)
agentfit benchmark ./src --no-llm

# 4. Gate CI — exit code 1 if score drops below 60
agentfit benchmark ./src --fail-below 60

# 5. Save full results to JSON
agentfit benchmark ./src --save-results

Try it without any API key

--no-llm runs the full static analysis pipeline and gives you a scored report — no LLM provider, no API key, no cost:

pip install agentfit
agentfit init
agentfit benchmark ./src --no-llm

You get scores for all five metrics (Schema Density, DRYness, Docstring Richness, Test Coverage Structural, Import Clarity) plus ranked recommendations. The LLM benchmarking and correlation stages are skipped — those need a provider configured in ai-bench.yml.


Sample output

╭───────────────────────── AgentFit Report ──────────────────────────╮
│ Source: ./src                                                      │
│ Generated: 2026-03-26T14:00:00+00:00                               │
│ Overall Score: 71.4   Threshold: 60.0  ✓ PASS                      │
╰────────────────────────────────────────────────────────────────────╯

Static Analysis
 Metric                    Score   Bar          Correlation
 Schema Density             82.0   ████████░░   strong ↑ (r=0.81)
 DRYness                    71.0   ███████░░░   —
 Docstring Richness         43.0   ████░░░░░░   strong ↑ (r=0.74)
 Test Coverage Structural   55.0   █████░░░░░   —
 Import Clarity             89.0   ████████░░   —

LLM Benchmark Results
 Provider    Model              Attempted  Passed  Mean Score  P50ms  P95ms
 anthropic   claude-sonnet-4-6       15      12        74.2    1203   2847
 qwen        local                   15       9        61.3    3100   5200

Recommendations
  1. [HIGH] Add usage examples to public functions
     Docstring Richness is 43.0/100. Strong positive correlation with LLM
     scores (r=0.74). Adding >>> examples significantly improves LLM performance.
  2. [MEDIUM] Increase structural test coverage
     ...

Use --verbose to also print per-metric warnings.
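What recommendation 1 asks for in practice: a typed signature plus a >>> example in the docstring. An illustrative (hypothetical) function, not taken from AgentFit itself:

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    """Typed payload; structures like this count toward Schema Density."""
    customer_id: str
    total_cents: int

def apply_discount(invoice: Invoice, percent: float) -> Invoice:
    """Return a new Invoice with a percentage discount applied.

    >>> apply_discount(Invoice("c1", 1000), 10.0)
    Invoice(customer_id='c1', total_cents=900)
    """
    discounted = round(invoice.total_cents * (1 - percent / 100))
    return Invoice(invoice.customer_id, discounted)
```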


Configuration

agentfit init writes an ai-bench.yml to the current directory:

version: "1"

analysis:
  source_path: "."
  languages:
    - python
  metric_weights:
    schema_density: 1.0    # set to 0 to exclude from overall score
    dryness: 1.0
    docstring_richness: 1.0
    test_coverage: 1.0
    import_clarity: 1.0

providers:
  anthropic:
    enabled: true
    model: "claude-sonnet-4-6"
  openai:
    enabled: false
    model: "gpt-4o"
    # base_url: "https://your-local-endpoint/v1"   # any OpenAI-compatible API
    # name: "my-provider"                          # display name in reports

benchmarking:
  challenges_per_module: 3
  max_concurrent_requests: 5
  max_tool_rounds: 10        # agentic mode: max turns per challenge

scoring:
  judge_model: "claude-sonnet-4-6"
  judge_provider: "anthropic"

reporting:
  output_format: "text"
  fail_below: null
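One plausible way the metric_weights could combine into an overall score is a weighted mean, where a weight of 0 drops a metric entirely. A hypothetical sketch (AgentFit's actual aggregation may differ; the scores below are the sample-report numbers):

```python
# Illustrative scores and weights; a weight of 0 excludes a metric.
scores  = {"schema_density": 82.0, "dryness": 71.0, "docstring_richness": 43.0,
           "test_coverage": 55.0, "import_clarity": 89.0}
weights = {"schema_density": 1.0, "dryness": 1.0, "docstring_richness": 1.0,
           "test_coverage": 1.0, "import_clarity": 1.0}

# Weighted mean over the metrics with positive weight.
total_weight = sum(w for w in weights.values() if w > 0)
overall = sum(scores[m] * w for m, w in weights.items()) / total_weight
print(f"{overall:.1f}")
```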

Agentic benchmarking

AgentFit automatically generates agentic_debugging challenges for any source file that has a matching test file. Each challenge:

  1. Introduces one mutation into a copy of your source tree (e.g. flips == to !=)
  2. Gives the model access to five tools: read_file, list_files, search_code, write_file, run_tests
  3. Runs a multi-turn loop until the model fixes the bug or max_tool_rounds is reached
  4. Scores the result on correctness, fix quality, test verification, and round efficiency
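The steps above can be sketched as a tool-dispatch loop. Everything here is a stand-in (call_model, the tool dicts, the message shapes); it illustrates the control flow, not AgentFit's internal API:

```python
def agentic_debug_loop(call_model, tools, max_tool_rounds=10):
    """Hypothetical skeleton of the multi-turn agentic loop.

    call_model(messages, tools) returns either a tool call or a final
    text answer; tools maps tool names to plain Python callables.
    """
    messages = [{"role": "user", "content": "A test is failing. Find and fix the bug."}]
    for round_no in range(1, max_tool_rounds + 1):
        reply = call_model(messages, tools)
        if reply["type"] == "tool_call":
            # Execute the requested tool and feed the result back.
            result = tools[reply["name"]](**reply["args"])
            messages.append({"role": "tool", "name": reply["name"], "content": result})
            # Stop as soon as a test run confirms the fix.
            if reply["name"] == "run_tests" and result == "all tests passed":
                return {"fixed": True, "rounds": round_no}
        else:
            messages.append({"role": "assistant", "content": reply["content"]})
    return {"fixed": False, "rounds": max_tool_rounds}
```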

Supported providers: Anthropic and any OpenAI-compatible endpoint. Ollama is not supported (no tool-use API).

You can also force any manual challenge through the agentic loop with agentic: true:

# challenges.yml
- id: "explain-scoring"
  source_module: "agentfit.scoring"
  challenge_type: "explanation"
  agentic: true          # model reads the real codebase before answering
  prompt: |
    Read the source files and explain how challenge scoring works end-to-end.
  context_code: ""
  expected_behavior: |
    A detailed explanation covering ChallengeGenerator, JudgeLLM, and Scorer.

Run with manual challenges:

agentfit benchmark ./src --challenges challenges.yml --save-results

Local / self-hosted LLM endpoints

Any OpenAI-compatible API works — Ollama, LM Studio, ngrok tunnels, Qwen, Mistral, etc.:

providers:
  openai:
    enabled: true
    model: "qwen2.5-coder:14b"
    base_url: "https://xxxx.ngrok-free.app/v1"
    name: "qwen"    # shows as "qwen" in the report table

No API key required when base_url is set.


CLI reference

agentfit init [--output PATH]
    Scaffold ai-bench.yml in the current directory.

agentfit benchmark SOURCE_PATH
    [--config PATH]          Override config file location
    [--fail-below SCORE]     Exit 1 if overall score < SCORE
    [--no-llm]               Static analysis only (no API keys needed)
    [--verbose, -v]          Show per-metric warnings
    [--challenges PATH]      YAML file of manually authored challenges
    [--max-challenges N]     Cap auto-generated challenges (manual always included)
    [--save-results]         Write full report to agentfit-results.json
    [--manual-eval]          Export challenge/response/verdict triples to JSONL
    [--load-evals PATH]      Merge a manual eval JSONL into auto scores
    [--output-format FORMAT] 'text' (default) or 'html'
    [--output-file PATH]     Write HTML report to file
    [--badge]                Write agentfit-badge.json

Multi-language support

AgentFit analyses Python natively and has regex-based analysers for:

Language           Schema Density   DRYness   Docstring Richness   Import Clarity   Test Coverage
Python             ✓                ✓         ✓                    ✓                ✓
TypeScript / JS    ✓                ✓         ✓                    ✓                needs tests¹
Rust               ✓                ✓         ✓                    ✓                needs tests¹
Go                 ✓                ✓         ✓                    ✓                needs tests¹
Java               ✓                ✓         ✓                    ✓                needs tests¹
¹ Test Coverage Structural works by pairing source files with test files (e.g. engine.go → engine_test.go). For non-Python languages the metric will score 0 if your project has no test files alongside the source — add tests to your project to get a meaningful score.

analysis:
  languages:
    - python
    - typescript
    - rust
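The pairing idea behind the footnote can be sketched in a few lines. This is a simplification (real naming conventions vary; pytest, for instance, uses a test_ prefix rather than a suffix), not AgentFit's implementation:

```python
from pathlib import Path

def structural_test_ratio(paths: list[str], suffix: str = "_test") -> float:
    """Fraction of source files that have a sibling test file,
    matched by filename stem (engine.go pairs with engine_test.go)."""
    stems = {Path(p).stem for p in paths}
    sources = [s for s in stems if not s.endswith(suffix)]
    if not sources:
        return 0.0
    paired = sum(1 for s in sources if f"{s}{suffix}" in stems)
    return paired / len(sources)
```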

Roadmap

Version   Theme                                                      Status
v0.1      Python static analysis + Anthropic/OpenAI/Ollama runners   ✓ Done
v0.2      TypeScript/JavaScript AST analysis                         ✓ Done
v0.3      Rust/C/C++ + Go + Java support                             Partial
v0.4      Agentic tool harness + multi-turn debugging challenges     ✓ Done
v0.5      HTML report export + CI badge generation                   ✓ Done
v1.0      Real pytest-cov integration + VS Code extension            Planned

What's left (v1.0)

  • Real pytest-cov integration — blend runtime branch coverage with the structural score
  • VS Code extension — inline metric decorations, status bar score, WebView report panel
  • CI/GitHub Actions — self-audit job (agentfit benchmark ./agentfit --fail-below 90) on every PR
  • mypy --strict — full type-checking across all modules
  • PyPI publish — pip install agentfit from the public registry

Development

git clone https://github.com/voicutomut/AgentFit
cd AgentFit
pip install -e ".[dev]"
pytest                     # 739 tests
ruff check agentfit/       # lint

See ROADMAP.md for the full phased implementation plan.


Community suggestions — help shape AgentFit

AgentFit is early and the five metrics are our first take at what makes a codebase LLM-friendly. We want your input.

Open an issue or start a discussion if you have thoughts on any of these:

Are the current metrics the right ones?

The five we picked:

Metric                      Our hypothesis
Schema Density              Typed data structures give LLMs clear contracts to reason about
DRYness                     Duplicated logic confuses context windows and wastes tokens
Docstring Richness          >>> examples are the most information-dense context you can give a model
Test Coverage Structural    Tests tell the model what "correct" looks like
Import Clarity              Circular deps and star imports obscure the dependency graph

Do these match your experience? Have you noticed other code properties that seem to help or hurt LLM performance on your projects?

What metrics are we missing?

Some candidates we're considering — tell us which matter most to you:

  • Naming consistency — do identifiers follow a single convention? Does the model have to context-switch between styles?
  • Function length / cyclomatic complexity — do shorter, focused functions produce better LLM completions?
  • Comment density — inline comments vs. docstrings, which helps more?
  • Dependency freshness — does using up-to-date libraries (in the model's training data) improve results?
  • Magic number / constant density — does replacing raw literals with named constants help?
  • Error handling coverage — does consistent exception handling improve LLM-generated patches?
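To make the first candidate concrete, here is one possible shape for a naming-consistency metric: the share of identifiers following the dominant convention. Purely hypothetical, offered as a discussion starter rather than a proposed implementation:

```python
import re
from collections import Counter

def naming_consistency(identifiers: list[str]) -> float:
    """Share of identifiers that follow the dominant naming convention."""
    def style(name: str) -> str:
        if re.fullmatch(r"[a-z][a-z0-9_]*", name):
            return "snake_case"
        if re.fullmatch(r"[a-z][a-zA-Z0-9]*", name):
            return "camelCase"
        return "other"
    if not identifiers:
        return 1.0  # nothing to be inconsistent about
    counts = Counter(style(n) for n in identifiers)
    return counts.most_common(1)[0][1] / len(identifiers)
```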

Other ways to contribute

  • Share a benchmark result — run agentfit benchmark ./your-repo --save-results and share the JSON output. Real data helps us validate which metrics actually correlate with LLM performance.
  • Propose a new challenge type — beyond completion, debugging, refactoring, and explanation, what coding tasks should we be measuring?
  • Report false positives — if a metric scores your codebase unfairly, open an issue with a minimal example.

Open an issue · Start a discussion
