█████╗ ██████╗ ███████╗███╗ ██╗████████╗ ███████╗██╗████████╗
██╔══██╗██╔════╝ ██╔════╝████╗ ██║╚══██╔══╝ ██╔════╝██║╚══██╔══╝
███████║██║ ███╗█████╗ ██╔██╗ ██║ ██║ █████╗ ██║ ██║
██╔══██║██║ ██║██╔══╝ ██║╚██╗██║ ██║ ██╔══╝ ██║ ██║
██║ ██║╚██████╔╝███████╗██║ ╚████║ ██║ ██║ ██║ ██║
╚═╝ ╚═╝ ╚═════╝ ╚══════╝╚═╝ ╚═══╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝
Does your codebase speak LLM?
AgentFit audits how well AI models can actually work with your Python code — not just read it, but complete functions, fix bugs, navigate a live repo with tools, and explain architecture. It scores five static AI-readiness metrics, then verifies them by benchmarking real LLMs against auto-generated challenges.
```
agentfit benchmark ./src
```
Stage 1 — Static Analysis scores your codebase on five dimensions that predict how well LLMs will perform on it:
| Metric | What it measures |
|---|---|
| Schema Density | How many data-passing functions use Pydantic / TypedDict / dataclasses |
| DRYness | Absence of duplicated function bodies |
| Docstring Richness | Presence of >>> usage examples in public docstrings |
| Test Coverage Structural | Ratio of test files to source files |
| Import Clarity | Absence of circular imports and dependency tangles |
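To make Schema Density concrete, here is the kind of typed contract the metric rewards — an illustrative sketch (the names are invented, not AgentFit internals); a `TypedDict` is shown, but Pydantic models and dataclasses count the same way:

```python
from typing import TypedDict


class ChallengeResult(TypedDict):
    """A typed schema tells an LLM exactly which fields each item carries."""
    challenge_id: str
    passed: bool
    score: float


def mean_score(results: list[ChallengeResult]) -> float:
    # The annotation makes the shape of `results` explicit, so a model
    # completing or patching this function doesn't have to guess field names.
    return sum(r["score"] for r in results) / len(results)
```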
Stage 2 — LLM Benchmarking auto-generates coding challenges (completion, debugging, explanation, refactoring) from your source tree, sends them to every configured provider, and judges responses with a second LLM.
Stage 3 — Agentic Benchmarking introduces a real mutation into a copy of your codebase and lets the model use filesystem + test tools over multiple turns to find and fix the bug — just like a developer would.
Stage 4 — Correlated Reporting finds which static metrics actually correlate with LLM performance on your specific codebase and surfaces prioritised, actionable recommendations.
```
pip install agentfit
```

Requires Python 3.11+. Optional provider SDKs:

```
pip install agentfit[anthropic]   # Anthropic Claude
pip install agentfit[openai]      # OpenAI / any OpenAI-compatible endpoint
pip install agentfit[all]         # everything
```

```
# 1. Scaffold a config file
agentfit init

# 2. Full audit — static analysis + LLM benchmarking
agentfit benchmark ./src

# 3. Static analysis only (no API keys needed)
agentfit benchmark ./src --no-llm

# 4. Gate CI — exit code 1 if score drops below 60
agentfit benchmark ./src --fail-below 60

# 5. Save full results to JSON
agentfit benchmark ./src --save-results
```

`--no-llm` runs the full static analysis pipeline and gives you a scored report — no LLM provider, no API key, no cost:
```
pip install agentfit
agentfit init
agentfit benchmark ./src --no-llm
```

You get scores for all five metrics (Schema Density, DRYness, Docstring Richness, Test Coverage Structural, Import Clarity) plus ranked recommendations. The LLM benchmarking and correlation stages are skipped — those need a provider configured in `ai-bench.yml`.
```
╭──────────────────────── AgentFit Report ─────────────────────────╮
│ Source: ./src                                                    │
│ Generated: 2026-03-26T14:00:00+00:00                             │
│ Overall Score: 71.4   Threshold: 60.0   ✓ PASS                   │
╰──────────────────────────────────────────────────────────────────╯

Static Analysis

Metric                     Score  Bar         Correlation
Schema Density             82.0   ████████░░  strong ↑ (r=0.81)
DRYness                    71.0   ███████░░░  —
Docstring Richness         43.0   ████░░░░░░  strong ↑ (r=0.74)
Test Coverage Structural   55.0   █████░░░░░  —
Import Clarity             89.0   █████████░  —

LLM Benchmark Results

Provider   Model              Attempted  Passed  Mean Score  P50ms  P95ms
anthropic  claude-sonnet-4-6  15         12      74.2        1203   2847
qwen       local              15         9       61.3        3100   5200

Recommendations

1. [HIGH] Add usage examples to public functions
   Docstring Richness is 43.0/100. Strong positive correlation with LLM
   scores (r=0.74). Adding >>> examples significantly improves LLM performance.
2. [MEDIUM] Increase structural test coverage
   ...
```
Use --verbose to also print per-metric warnings.
`agentfit init` writes an `ai-bench.yml` to the current directory:
```yaml
version: "1"

analysis:
  source_path: "."
  languages:
    - python
  metric_weights:
    schema_density: 1.0      # set to 0 to exclude from overall score
    dryness: 1.0
    docstring_richness: 1.0
    test_coverage: 1.0
    import_clarity: 1.0

providers:
  anthropic:
    enabled: true
    model: "claude-sonnet-4-6"
  openai:
    enabled: false
    model: "gpt-4o"
    # base_url: "https://your-local-endpoint/v1"  # any OpenAI-compatible API
    # name: "my-provider"                         # display name in reports

benchmarking:
  challenges_per_module: 3
  max_concurrent_requests: 5
  max_tool_rounds: 10    # agentic mode: max turns per challenge

scoring:
  judge_model: "claude-sonnet-4-6"
  judge_provider: "anthropic"

reporting:
  output_format: "text"
  fail_below: null
```

AgentFit automatically generates `agentic_debugging` challenges for any source file that has a matching test file. Each challenge:
- Introduces one mutation into a copy of your source tree (e.g. flips `==` → `!=`)
- Gives the model access to five tools: `read_file`, `list_files`, `search_code`, `write_file`, `run_tests`
- Runs a multi-turn loop until the model fixes the bug or `max_tool_rounds` is reached
- Scores the result on correctness, fix quality, test verification, and round efficiency
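A mutation like the `==` → `!=` flip described above can be pictured as a small AST rewrite. This is only an illustration of the idea, not AgentFit's actual mutation engine:

```python
import ast


class FlipEq(ast.NodeTransformer):
    """Flip every == inside comparison chains to != (a sketch, not AgentFit code)."""

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        node.ops = [ast.NotEq() if isinstance(op, ast.Eq) else op
                    for op in node.ops]
        return node


source = "def same(a, b):\n    return a == b\n"
tree = FlipEq().visit(ast.parse(source))
mutated = ast.unparse(tree)  # Python 3.9+; now returns `a != b`
```

A paired test file then exposes the planted bug, which is what the agentic loop asks the model to find and fix.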
Supported providers: Anthropic and any OpenAI-compatible endpoint. Ollama is not supported (no tool-use API).
You can also force any manual challenge through the agentic loop with agentic: true:
```yaml
# challenges.yml
- id: "explain-scoring"
  source_module: "agentfit.scoring"
  challenge_type: "explanation"
  agentic: true    # model reads the real codebase before answering
  prompt: |
    Read the source files and explain how challenge scoring works end-to-end.
  context_code: ""
  expected_behavior: |
    A detailed explanation covering ChallengeGenerator, JudgeLLM, and Scorer.
```

Run with manual challenges:

```
agentfit benchmark ./src --challenges challenges.yml --save-results
```

Any OpenAI-compatible API works — Ollama, LM Studio, ngrok tunnels, Qwen, Mistral, etc.:
```yaml
providers:
  openai:
    enabled: true
    model: "qwen2.5-coder:14b"
    base_url: "https://xxxx.ngrok-free.app/v1"
    name: "qwen"    # shows as "qwen" in the report table
```

No API key is required when `base_url` is set.
```
agentfit init [--output PATH]
    Scaffold ai-bench.yml in the current directory.

agentfit benchmark SOURCE_PATH
    [--config PATH]            Override config file location
    [--fail-below SCORE]       Exit 1 if overall score < SCORE
    [--no-llm]                 Static analysis only (no API keys needed)
    [--verbose, -v]            Show per-metric warnings
    [--challenges PATH]        YAML file of manually authored challenges
    [--max-challenges N]       Cap auto-generated challenges (manual always included)
    [--save-results]           Write full report to agentfit-results.json
    [--manual-eval]            Export challenge/response/verdict triples to JSONL
    [--load-evals PATH]        Merge a manual eval JSONL into auto scores
    [--output-format FORMAT]   'text' (default) or 'html'
    [--output-file PATH]       Write HTML report to file
    [--badge]                  Write agentfit-badge.json
```
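The `--fail-below` flag makes a natural CI gate. A hypothetical GitHub Actions job — the workflow name, source path, and threshold here are illustrative, not shipped with AgentFit:

```yaml
# .github/workflows/agentfit.yml (sketch)
name: agentfit-audit
on: [pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentfit
      # --no-llm keeps CI free of API keys; the step fails (exit 1)
      # if the overall score drops below 60
      - run: agentfit benchmark ./src --no-llm --fail-below 60
```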
AgentFit analyses Python natively and has regex-based analysers for:
| Language | Schema Density | DRYness | Docstring Richness | Import Clarity | Test Coverage |
|---|---|---|---|---|---|
| Python | ✓ | ✓ | ✓ | ✓ | ✓ |
| TypeScript / JS | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
| Rust | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
| Go | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
| Java | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
¹ Test Coverage Structural works by pairing source files with test files (e.g. `engine.go` → `engine_test.go`). For non-Python languages the metric will score 0 if your project has no test files alongside the source — add tests to your project to get a meaningful score.
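The pairing convention can be sketched like this — an illustration of the idea, not AgentFit's real matcher, and the naming rules shown are assumptions based on common conventions:

```python
from pathlib import Path


def candidate_test_names(source: Path) -> set[str]:
    """Conventional test-file names that would pair with `source` (sketch)."""
    if source.suffix == ".py":
        # pytest-style conventions
        return {f"test_{source.stem}.py", f"{source.stem}_test.py"}
    # sibling-suffix convention, e.g. engine.go -> engine_test.go
    return {f"{source.stem}_test{source.suffix}"}
```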
```yaml
analysis:
  languages:
    - python
    - typescript
    - rust
```

| Version | Theme | Status |
|---|---|---|
| v0.1 | Python static analysis + Anthropic/OpenAI/Ollama runners | ✓ Done |
| v0.2 | TypeScript/JavaScript AST analysis | ✓ Done |
| v0.3 | Rust/C/C++ + Go + Java support | Partial |
| v0.4 | Agentic tool harness + multi-turn debugging challenges | ✓ Done |
| v0.5 | HTML report export + CI badge generation | ✓ Done |
| v1.0 | Real pytest-cov integration + VS Code extension | Planned |
- Real `pytest-cov` integration — blend runtime branch coverage with the structural score
- VS Code extension — inline metric decorations, status bar score, WebView report panel
- CI/GitHub Actions — self-audit job (`agentfit benchmark ./agentfit --fail-below 90`) on every PR
- `mypy --strict` — full type-checking across all modules
- PyPI publish — `pip install agentfit` from the public registry
```
git clone https://github.com/voicutomut/AgentFit
cd AgentFit
pip install -e ".[dev]"
pytest                  # 739 tests
ruff check agentfit/    # lint
```

See ROADMAP.md for the full phased implementation plan.
AgentFit is early and the five metrics are our first take at what makes a codebase LLM-friendly. We want your input.
Open an issue or start a discussion if you have thoughts on any of these:
The five we picked:
| Metric | Our hypothesis |
|---|---|
| Schema Density | Typed data structures give LLMs clear contracts to reason about |
| DRYness | Duplicated logic confuses context windows and wastes tokens |
| Docstring Richness | >>> examples are the most information-dense context you can give a model |
| Test Coverage Structural | Tests tell the model what "correct" looks like |
| Import Clarity | Circular deps and star imports obscure the dependency graph |
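For the Docstring Richness hypothesis, this is the shape of docstring the metric looks for — an illustrative function, not taken from AgentFit:

```python
def normalize(scores: list[float]) -> list[float]:
    """Scale a list of scores so the maximum becomes 1.0.

    >>> normalize([0.0, 5.0, 10.0])
    [0.0, 0.5, 1.0]
    """
    top = max(scores)
    return [s / top for s in scores]
```

The `>>>` example doubles as a doctest, so it stays verifiably correct — which is also why it makes such dense context for a model.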
Do these match your experience? Have you noticed other code properties that seem to help or hurt LLM performance on your projects?
Some candidates we're considering — tell us which matter most to you:
- Naming consistency — do identifiers follow a single convention? Does the model have to context-switch between styles?
- Function length / cyclomatic complexity — do shorter, focused functions produce better LLM completions?
- Comment density — inline comments vs. docstrings, which helps more?
- Dependency freshness — does using up-to-date libraries (in the model's training data) improve results?
- Magic number / constant density — does replacing raw literals with named constants help?
- Error handling coverage — does consistent exception handling improve LLM-generated patches?
- Share a benchmark result — run `agentfit benchmark ./your-repo --save-results` and share the JSON output. Real data helps us validate which metrics actually correlate with LLM performance.
- Propose a new challenge type — beyond completion, debugging, refactoring, and explanation, what coding tasks should we be measuring?
- Report false positives — if a metric scores your codebase unfairly, open an issue with a minimal example.