Merged
29 commits
54d43df
Add timezone normalization, rename tracking, and churn-risk labels
lex0c Apr 17, 2026
2b20ef0
Add deterministic tiebreakers to every ranked stat
lex0c Apr 17, 2026
50ea2ea
Document reproducibility guarantee and tiebreaker table
lex0c Apr 17, 2026
444ef7b
Document classification thresholds as named constants
lex0c Apr 17, 2026
b1861f0
Add churn-based Pareto lens for developer concentration
lex0c Apr 17, 2026
3270090
Fix zero-aggregate edge cases in Pareto devs computation
lex0c Apr 17, 2026
9b8b59a
Document dual-lens Pareto and add pareto to README stats table
lex0c Apr 17, 2026
f94b0af
Filter mechanical refactors from coupling pair accumulation
lex0c Apr 17, 2026
f92d570
Cover coupling refactor-filter boundaries and document mean-cheating
lex0c Apr 17, 2026
053c182
Polish coupling refactor filter: helper, test, and caveats
lex0c Apr 17, 2026
557f6c7
Close remaining sort determinism gaps and short-span trend signal
lex0c Apr 18, 2026
876341f
Align DevProfile.Collaborators with DevNetwork SharedLines signal
lex0c Apr 18, 2026
e1f2e9e
Document Collaborators shared_lines semantics and extended tiebreakers
lex0c Apr 18, 2026
2db4d8d
Propagate zero-aggregate guards to Files, Dirs, and CLI Pareto output
lex0c Apr 18, 2026
b8e5785
Refuse rename migration for reused oldPaths
lex0c Apr 18, 2026
90af977
Distinguish rename-back chains from true path reuse
lex0c Apr 18, 2026
86966a4
Harden rename reconciliation: order-independence, naming, tests
lex0c Apr 18, 2026
a48f240
Compute churn trend for single-month histories
lex0c Apr 18, 2026
a3f0a91
Pin churnTrend→Label wiring and document single-touch sensitivity
lex0c Apr 18, 2026
60a8f23
Detect single-edge path reuse via post-rename activity
lex0c Apr 18, 2026
317a141
Refine single-edge reuse detection via maxEdgeDate
lex0c Apr 18, 2026
aa689e7
Document four rename-reuse patterns with detection rules
lex0c Apr 18, 2026
6600539
Refresh Performance table with current validation runs
lex0c Apr 18, 2026
a207d2f
Speed up DevProfiles by pre-computing collaborator pairs
lex0c Apr 18, 2026
c3cec96
Add DevProfiles benchmarks to guard perf regressions
lex0c Apr 18, 2026
dc8722f
Update Performance table with post-optimization report timings
lex0c Apr 18, 2026
285ae4d
Remove Chromium from Performance table, add fresh Linux extract time
lex0c Apr 18, 2026
127b9c0
Document vendor/generated-code distortion in README
lex0c Apr 18, 2026
7e13adb
Fix incorrect --ignore desync caveat (extract does recalculate totals)
lex0c Apr 18, 2026
82 changes: 68 additions & 14 deletions README.md
@@ -4,15 +4,49 @@ Extracts commit metadata, file changes, blob sizes, and developer info into JSON

## Performance

Benchmarked on open-source repositories (bare clones):
Benchmarked on open-source repositories. `extract` reads bare clones; `stats` and `report` read the resulting JSONL. Measurements taken with a pre-built binary on a single machine (not a controlled lab benchmark; directional, not absolute).

| Repository | Commits | Devs | Extract time | Throughput | JSONL size |
|------------|---------|------|-------------|------------|------------|
| [Pi-hole](https://github.com/pi-hole/pi-hole) | 7,077 | 286 | 0.9s | 7,800/s | 23K lines |
| [Praat](https://github.com/praat/praat) | 10,221 | 24 | 26s | 393/s | 95K lines |
| [WordPress](https://github.com/WordPress/WordPress) | 52,466 | 131 | 46s | 1,140/s | 298K lines |
| [Kubernetes](https://github.com/kubernetes/kubernetes) | 137,016 | 5,480 | 2m 00s | 1,140/s | 943K lines |
| [Linux kernel](https://github.com/torvalds/linux) | 1,438,634 | 38,281 | 13m 12s | 1,816/s | 6M lines |
| Repository | Commits | Devs | Extract | Stats (JSON) | Report (HTML) | JSONL size |
|------------|---------|------|---------|-------------|--------------|------------|
| [Pi-hole](https://github.com/pi-hole/pi-hole) | 7,077 | 281 | 1.5s | 0.18s | 0.24s | 23K lines / 6.5 MB |
| [Praat](https://github.com/praat/praat) | 10,221 | 19 | 25s | 0.96s | 0.95s | 95K lines / 30 MB |
| [WordPress](https://github.com/WordPress/WordPress) | 52,466 | 131 | 47s | 2.9s | 2.8s | 298K lines / 96 MB |
| [Kubernetes](https://github.com/kubernetes/kubernetes) | 137,016 | 5,295 | 2m 4s | 11.7s | 14s | 943K lines / 314 MB |
| [Linux kernel](https://github.com/torvalds/linux) | 1,438,634 | 38,832 | 12m 57s | 1m 15s | 1m 53s | 6M lines / 1.9 GB |

`extract`, `stats`, and `report` scale roughly linearly with dataset size. The per-dev collaborator map in `report` is pre-computed in a single pass over files (O(F × D_per_file²)); on the kubernetes snapshot that adds ~2 seconds over `stats`, on linux ~40 seconds. A previous implementation computed this nested inside the per-dev loop (O(D × F × D_per_file)) and was 6× slower on kubernetes and 11× slower on linux. If you only need the aggregate data, `stats --format json` is always the fastest path; reach for `report` when you actually want the HTML dashboard.
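
The single-pass shape described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `collaboratorCounts` and its input map are hypothetical names, and it counts shared files as a simplification, whereas the PR's commits mention a shared-lines signal. It assumes each developer appears at most once per file.

```go
package main

import "fmt"

// collaboratorCounts walks every file once and, for each pair of developers
// on that file, bumps a shared-file counter in both directions.
// Cost is O(F × D_per_file²), independent of the total developer count.
// Hypothetical helper: the real tool tracks a different signal.
func collaboratorCounts(fileDevs map[string][]string) map[string]map[string]int {
	out := make(map[string]map[string]int)
	for _, devs := range fileDevs {
		for i, a := range devs {
			for j, b := range devs {
				if i == j {
					continue
				}
				if out[a] == nil {
					out[a] = make(map[string]int)
				}
				out[a][b]++
			}
		}
	}
	return out
}

func main() {
	pairs := collaboratorCounts(map[string][]string{
		"a.go": {"ann", "bob"},
		"b.go": {"ann", "bob", "cy"},
	})
	fmt.Println(pairs["ann"]) // map[bob:2 cy:1]
}
```

A per-developer report can then read `pairs[dev]` directly instead of rescanning every file for every developer, which is the nested loop the paragraph describes as the slower approach.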

## Vendor and generated code

**This is the biggest practical distortion in every stat.** Line-count metrics treat a 50k-line `generated.pb.go` the same as a 50k-line hand-written module. Lock files like `package-lock.json` regenerate with every dependency bump. Vendored dependencies inflate churn whenever they're updated. OpenAPI specs, minified JS, `bindata.go`-style embeds — all common, all inflate churn and bus factor without reflecting real human contribution.

Run gitcortex on kubernetes without filtering and the top legacy-hotspots are `vendor/golang.org/x/tools/…/manifest.go`, `api/openapi-spec/v3/…v1alpha3_openapi.json`, and `staging/…/generated.pb.go` — technically correct per the data, practically useless for decision-making.

Mitigate with `--ignore` glob patterns at extract time. Files matched are dropped from the JSONL entirely, so **every downstream stat** (hotspots, churn-risk, bus factor, coupling, dev-network, profiles) reflects only hand-authored code:

```bash
# Typical starter set
gitcortex extract --repo . \
--ignore "vendor/*" \
--ignore "node_modules/*" \
--ignore "dist/*" \
--ignore "build/*" \
--ignore "*.min.js" \
--ignore "*.min.css" \
--ignore "package-lock.json" \
--ignore "yarn.lock" \
--ignore "Cargo.lock" \
--ignore "go.sum" \
--ignore "poetry.lock" \
--ignore "*.pb.go" \
--ignore "*_generated.go"
```

Patterns match against the file path as emitted by `git log --raw` (forward-slash, repo-relative). Directory patterns like `vendor/*` exclude anything under that prefix. File-name patterns like `*.pb.go` match at any depth.
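
One way to read the matching rules above is sketched below. The function `ignored` is hypothetical and only mirrors the two behaviours stated in the paragraph (a trailing `/*` acts as a prefix match, a pattern without a slash matches the file name at any depth). gitcortex's actual matcher may differ, for example in how it treats nested `vendor/` directories.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// ignored reports whether a repo-relative, forward-slash path matches pattern.
// Assumed semantics (not taken from the gitcortex source):
//   - "dir/*" excludes everything under the "dir/" prefix
//   - a pattern without "/" is matched against the base name only
//   - anything else is matched against the whole path
func ignored(pattern, filePath string) bool {
	if strings.HasSuffix(pattern, "/*") {
		return strings.HasPrefix(filePath, strings.TrimSuffix(pattern, "*"))
	}
	name := filePath
	if !strings.Contains(pattern, "/") {
		name = path.Base(filePath)
	}
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	fmt.Println(ignored("vendor/*", "vendor/golang.org/x/tools/manifest.go")) // true
	fmt.Println(ignored("*.pb.go", "staging/api/generated.pb.go"))            // true
	fmt.Println(ignored("*.pb.go", "cmd/gitcortex/main.go"))                  // false
}
```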

Start permissive, run `gitcortex stats --stat hotspots --top 20` and `--stat churn-risk --top 20`, and add `--ignore` entries for whatever generated file type dominates the output. Re-extract until the top list represents real changes worth understanding.

> Both commit-level (`Summary.TotalAdditions/Deletions`) and file-level aggregations recompute from the filtered set, so all totals stay consistent after `--ignore` — the extract step recalculates commit additions/deletions as the sum of non-ignored file records before writing them to JSONL.

## Privacy and reliability

@@ -159,14 +193,17 @@ Available stats:
| `activity` | Commits and line changes bucketed by day, week, month, or year |
| `busfactor` | Files with lowest bus factor (fewest developers owning 80%+ of changes) |
| `coupling` | Files that frequently change together, revealing hidden architectural dependencies |
| `churn-risk` | Files ranked by recency-weighted churn combined with bus factor |
| `churn-risk` | Files ranked by recent churn, classified into `cold` / `active` / `active-core` / `silo` / `legacy-hotspot` |
| `working-patterns` | Commit heatmap by hour and day of week |
| `dev-network` | Developer collaboration graph based on shared file ownership |
| `profile` | Per-developer report: scope, contribution type, pace, collaboration, top files |
| `top-commits` | Largest commits ranked by lines changed (includes message if extracted with `--include-commit-messages`) |
| `pareto` | Concentration (80% threshold) across files, devs (two lenses: commits and churn), and directories |

Output formats: `table` (default, human-readable), `csv` (single clean table per `--stat`), `json` (unified object with all sections).

See [`docs/METRICS.md`](docs/METRICS.md) for how each metric is calculated, including timezone handling (UTC for aggregation buckets, author-local for working patterns) and rename tracking (history merged across git-detected renames).

### Developer profile

Manager-facing report per developer showing scope, contribution type, pace, collaboration, and top files.
@@ -211,19 +248,34 @@ IWorkspaceRepository.cs WorkspaceRepository.cs 19

### Churn risk

Ranks files by a risk score combining recency-weighted churn with bus factor. Recent changes weigh more (exponential decay), and files with fewer owners score higher.
Ranks files by recency-weighted churn and classifies each into an actionable label, so you can tell a healthy core module apart from a legacy bottleneck without eyeballing five columns.

```bash
gitcortex stats --input data.jsonl --stat churn-risk --top 15
gitcortex stats --input data.jsonl --stat churn-risk --churn-half-life 60 # faster decay
```

Real output from the Pi-hole repository (one sample per label):

```
PATH RISK RECENT CHURN BUS FACTOR TOTAL CHANGES LAST CHANGE
src/Api/Controllers/Auth.cs 142.5 285.0 2 47 2024-03-28
src/Domain/Entities/User.cs 98.3 98.3 1 12 2024-03-25
PATH LABEL CHURN BF AGE TREND
automated install/basic-install.sh active 115.3 15 4121d 0.00
.github/workflows/codeql-analysis.yml legacy-hotspot 66.2 2 1640d 0.26
advanced/bash-completion/pihole-ftl.bash silo 16.5 1 240d 1.00
test/_alpine_3_23.Dockerfile active-core 7.1 1 120d 1.00
advanced/Templates/gravity.db.schema cold 0.0 1 2616d 1.00
```

| Label | Meaning |
|-------|---------|
| `cold` | Low recent churn — ignore. |
| `active` | Shared ownership (bus factor ≥ 3). Healthy. |
| `active-core` | New code (< 180d), single author. Usually fine. |
| `silo` | Old + concentrated + stable/growing. Knowledge bottleneck — plan transfer. |
| `legacy-hotspot` | **Urgent.** Old + concentrated + declining. Deprecated paths still being touched. |

Sort key is `recent_churn`; the label answers "is this activity a problem?". The composite `risk_score` field (`recent_churn / bus_factor`) is still emitted for CI gate back-compat.
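
The label table and the legacy `risk_score` can be compared with a small sketch. The 180-day and bus-factor-3 cut-offs come from the table above. The cold-churn ceiling, the declining-trend cut-off, the order of the checks, and the zero guard in `riskScore` are assumptions made for illustration; the real named constants are documented in `docs/METRICS.md`. With these assumed values the sketch reproduces the five sample rows shown above.

```go
package main

import "fmt"

const (
	youngAgeDays     = 180 // from the label table: "New code (< 180d)"
	sharedBusFactor  = 3   // from the label table: "bus factor ≥ 3"
	coldChurnCeiling = 1.0 // assumed value, not from the source
	decliningTrend   = 0.5 // assumed value, not from the source
)

type fileStat struct {
	RecentChurn float64
	BusFactor   int
	AgeDays     int
	Trend       float64
}

// classify applies the table top to bottom. The ordering is an assumption.
func classify(f fileStat) string {
	switch {
	case f.RecentChurn < coldChurnCeiling:
		return "cold"
	case f.BusFactor >= sharedBusFactor:
		return "active"
	case f.AgeDays < youngAgeDays:
		return "active-core"
	case f.Trend < decliningTrend:
		return "legacy-hotspot"
	default:
		return "silo"
	}
}

// riskScore is the legacy composite: recent_churn / bus_factor.
// The zero guard is an assumption added to keep the sketch total.
func riskScore(f fileStat) float64 {
	if f.BusFactor == 0 {
		return 0
	}
	return f.RecentChurn / float64(f.BusFactor)
}

func main() {
	// Values taken from the codeql-analysis.yml row of the sample output.
	f := fileStat{RecentChurn: 66.2, BusFactor: 2, AgeDays: 1640, Trend: 0.26}
	fmt.Printf("%s %.1f\n", classify(f), riskScore(f)) // legacy-hotspot 33.1
}
```

The sample row scores 33.1 on the composite, which is low in absolute terms, yet it carries the most urgent label. That is why the label, not the score, is the triage signal.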

`--churn-half-life` controls how fast old changes lose weight (default 90 days = changes lose half their weight every 90 days).

### Working patterns
@@ -330,7 +382,7 @@ Run automated checks and fail the build when thresholds are exceeded.
# Fail if any file has bus factor of 1
gitcortex ci --input data.jsonl --fail-on-busfactor 1

# Fail if any file has churn risk >= 500
# Fail if any file has churn risk >= 500 (legacy composite: recent_churn / bus_factor)
gitcortex ci --input data.jsonl --fail-on-churn-risk 500

# Both rules, GitHub Actions format
@@ -344,6 +396,8 @@ Output formats: `text` (default), `github-actions` (annotations), `gitlab` (Code

Exit code 1 when violations are found, 0 when clean.

> `--fail-on-churn-risk` evaluates the legacy `risk_score = recent_churn / bus_factor` field, not the new label classification surfaced by `stats --stat churn-risk`. The two can disagree — a file might have `risk_score` below the threshold yet still classify as `legacy-hotspot`. Use the stat command for triage; use the CI gate as a coarse threshold alarm.

## Architecture

```
16 changes: 11 additions & 5 deletions cmd/gitcortex/main.go
@@ -280,13 +280,19 @@ func renderStats(ds *stats.Dataset, sf *statsFlags) error {
 			}
 			return "well distributed"
 		}
-		devsLabel := judge(p.DevsPct80Commits, p.TotalDevs)
-		if p.TotalDevs > 0 && p.DevsPct80Commits <= 10 {
+		devsLabel := judge(p.DevsPct80Commits, p.TopCommitDevs)
+		if p.TopCommitDevs > 0 && p.DevsPct80Commits <= 10 {
 			devsLabel += ", key-person dependence"
 		}
-		fmt.Fprintf(os.Stdout, "Files: %d of %d files concentrate 80%% of churn — %s\n", p.TopChurnFiles, p.TotalFiles, judge(p.FilesPct80Churn, p.TotalFiles))
-		fmt.Fprintf(os.Stdout, "Devs: %d of %d devs produce 80%% of commits — %s\n", p.TopCommitDevs, p.TotalDevs, devsLabel)
-		fmt.Fprintf(os.Stdout, "Dirs: %d of %d dirs concentrate 80%% of churn — %s\n", p.TopChurnDirs, p.TotalDirs, judge(p.DirsPct80Churn, p.TotalDirs))
+		// Gate each line on the corresponding TopChurn/TopCommit count.
+		// judge() only uses its second arg as a zero-guard, so passing the
+		// Top* count (which is 0 when the signal is absent, e.g. zero-churn
+		// dataset or empty contributors) maps directly to "no data" without
+		// falsely hitting the "extremely concentrated" branch.
+		fmt.Fprintf(os.Stdout, "Files: %d of %d files concentrate 80%% of churn — %s\n", p.TopChurnFiles, p.TotalFiles, judge(p.FilesPct80Churn, p.TopChurnFiles))
+		fmt.Fprintf(os.Stdout, "Devs (commits): %d of %d devs produce 80%% of commits — %s\n", p.TopCommitDevs, p.TotalDevs, devsLabel)
+		fmt.Fprintf(os.Stdout, "Devs (churn): %d of %d devs produce 80%% of line churn — %s\n", p.TopChurnDevs, p.TotalDevs, judge(p.DevsPct80Churn, p.TopChurnDevs))
+		fmt.Fprintf(os.Stdout, "Dirs: %d of %d dirs concentrate 80%% of churn — %s\n", p.TopChurnDirs, p.TotalDirs, judge(p.DirsPct80Churn, p.TopChurnDirs))
 	}
 	if showAll || sf.stat == "top-commits" {
 		fmt.Fprintf(os.Stderr, "\n=== Top %d Commits ===\n", sf.topN)
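
The zero-guard in this hunk can be illustrated in isolation. The sketch below is not the code from `main.go`: `topForShare` is a hypothetical helper, and the single 10% cut-off in `judge` is an assumed stand-in for whatever bands the real closure uses. Only the three label strings and the "second argument is a zero-guard" behaviour come from the diff.

```go
package main

import (
	"fmt"
	"sort"
)

// topForShare returns how many of the largest values are needed to reach
// share (e.g. 0.8) of the total. It returns 0 when the total is zero,
// which is the "signal absent" case the diff comment describes.
func topForShare(values []float64, share float64) int {
	var total float64
	for _, v := range values {
		total += v
	}
	if total <= 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Sort(sort.Reverse(sort.Float64Slice(sorted)))
	var acc float64
	for i, v := range sorted {
		acc += v
		if acc >= share*total {
			return i + 1
		}
	}
	return len(sorted)
}

// judge uses count only as a zero-guard, then bands the percentage.
// The 10% cut-off is an assumption for illustration.
func judge(pct float64, count int) string {
	if count == 0 {
		return "no data"
	}
	if pct <= 10 {
		return "extremely concentrated"
	}
	return "well distributed"
}

func main() {
	churn := []float64{60, 30, 5, 5}
	top := topForShare(churn, 0.8)
	pct := float64(top) / float64(len(churn)) * 100
	fmt.Printf("%d of %d: %s\n", top, len(churn), judge(pct, top)) // 2 of 4: well distributed

	// Zero-churn dataset: the files exist, but the Top* count is 0.
	zero := []float64{0, 0, 0}
	fmt.Println(judge(0, topForShare(zero, 0.8))) // no data
	// Passing the total file count instead would misreport concentration.
	fmt.Println(judge(0, len(zero))) // extremely concentrated
}
```

The last two lines show the bug the hunk fixes: guarding on the total count lets a zero percentage fall through to the most alarming label, while guarding on the Top* count reports the absence of data.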