diff --git a/README.md b/README.md index 0d5a3f5..8c2fc70 100644 --- a/README.md +++ b/README.md @@ -4,15 +4,49 @@ Extracts commit metadata, file changes, blob sizes, and developer info into JSON ## Performance -Benchmarked on open-source repositories (bare clones): +Benchmarked on open-source repositories. `extract` reads bare clones; `stats` and `report` read the resulting JSONL. Measurements taken with a pre-built binary on a single machine (not a controlled lab benchmark; directional, not absolute). -| Repository | Commits | Devs | Extract time | Throughput | JSONL size | -|------------|---------|------|-------------|------------|------------| -| [Pi-hole](https://github.com/pi-hole/pi-hole) | 7,077 | 286 | 0.9s | 7,800/s | 23K lines | -| [Praat](https://github.com/praat/praat) | 10,221 | 24 | 26s | 393/s | 95K lines | -| [WordPress](https://github.com/WordPress/WordPress) | 52,466 | 131 | 46s | 1,140/s | 298K lines | -| [Kubernetes](https://github.com/kubernetes/kubernetes) | 137,016 | 5,480 | 2m 00s | 1,140/s | 943K lines | -| [Linux kernel](https://github.com/torvalds/linux) | 1,438,634 | 38,281 | 13m 12s | 1,816/s | 6M lines | +| Repository | Commits | Devs | Extract | Stats (JSON) | Report (HTML) | JSONL size | +|------------|---------|------|---------|-------------|--------------|------------| +| [Pi-hole](https://github.com/pi-hole/pi-hole) | 7,077 | 281 | 1.5s | 0.18s | 0.24s | 23K lines / 6.5 MB | +| [Praat](https://github.com/praat/praat) | 10,221 | 19 | 25s | 0.96s | 0.95s | 95K lines / 30 MB | +| [WordPress](https://github.com/WordPress/WordPress) | 52,466 | 131 | 47s | 2.9s | 2.8s | 298K lines / 96 MB | +| [Kubernetes](https://github.com/kubernetes/kubernetes) | 137,016 | 5,295 | 2m 4s | 11.7s | 14s | 943K lines / 314 MB | +| [Linux kernel](https://github.com/torvalds/linux) | 1,438,634 | 38,832 | 12m 57s | 1m 15s | 1m 53s | 6M lines / 1.9 GB | + +`extract`, `stats`, and `report` scale roughly linearly with dataset size. The per-dev collaborator map in `report` is pre-computed in a single pass over files (O(F × D_per_file²)); on the kubernetes snapshot that adds ~2 seconds over `stats`, on linux ~40 seconds. A previous implementation computed this nested inside the per-dev loop (O(D × F × D_per_file)) and was 6× slower on kubernetes and 11× slower on linux. If you only need the aggregate data, `stats --format json` is always the fastest path; reach for `report` when you actually want the HTML dashboard. + +## Vendor and generated code + +**This is the biggest practical distortion in every stat.** Line-count metrics treat a 50k-line `generated.pb.go` the same as a 50k-line hand-written module. Lock files like `package-lock.json` regenerate with every dependency bump. Vendored dependencies inflate churn whenever they're updated. OpenAPI specs, minified JS, `bindata.go`-style embeds — all common, all inflate churn and bus factor without reflecting real human contribution. + +Run gitcortex on kubernetes without filtering and the top legacy-hotspots are `vendor/golang.org/x/tools/…/manifest.go`, `api/openapi-spec/v3/…v1alpha3_openapi.json`, and `staging/…/generated.pb.go` — technically correct per the data, practically useless for decision-making. + +Mitigate with `--ignore` glob patterns at extract time. Files matched are dropped from the JSONL entirely, so **every downstream stat** (hotspots, churn-risk, bus factor, coupling, dev-network, profiles) reflects only hand-authored code: + +```bash +# Typical starter set +gitcortex extract --repo . 
\ + --ignore "vendor/*" \ + --ignore "node_modules/*" \ + --ignore "dist/*" \ + --ignore "build/*" \ + --ignore "*.min.js" \ + --ignore "*.min.css" \ + --ignore "package-lock.json" \ + --ignore "yarn.lock" \ + --ignore "Cargo.lock" \ + --ignore "go.sum" \ + --ignore "poetry.lock" \ + --ignore "*.pb.go" \ + --ignore "*_generated.go" +``` + +Patterns match against the file path as emitted by `git log --raw` (forward-slash, repo-relative). Directory patterns like `vendor/*` exclude anything under that prefix. File-name patterns like `*.pb.go` match at any depth. + +Start permissive, run `gitcortex stats --stat hotspots --top 20` and `--stat churn-risk --top 20`, and add `--ignore` entries for whatever generated file type dominates the output. Re-extract until the top list represents real changes worth understanding. + +> Both commit-level (`Summary.TotalAdditions/Deletions`) and file-level aggregations recompute from the filtered set, so all totals stay consistent after `--ignore` — the extract step recalculates commit additions/deletions as the sum of non-ignored file records before writing them to JSONL. ## Privacy and reliability @@ -159,14 +193,17 @@ Available stats: | `activity` | Commits and line changes bucketed by day, week, month, or year | | `busfactor` | Files with lowest bus factor (fewest developers owning 80%+ of changes) | | `coupling` | Files that frequently change together, revealing hidden architectural dependencies | -| `churn-risk` | Files ranked by recency-weighted churn combined with bus factor | +| `churn-risk` | Files ranked by recent churn, classified into `cold` / `active` / `active-core` / `silo` / `legacy-hotspot` | | `working-patterns` | Commit heatmap by hour and day of week | | `dev-network` | Developer collaboration graph based on shared file ownership | | `profile` | Per-developer report: scope, contribution type, pace, collaboration, top files | | `top-commits` | Largest commits ranked by lines changed (includes message if extracted with `--include-commit-messages`) | +| `pareto` | Concentration (80% threshold) across files, devs (two lenses: commits and churn), and directories | Output formats: `table` (default, human-readable), `csv` (single clean table per `--stat`), `json` (unified object with all sections). +See [`docs/METRICS.md`](docs/METRICS.md) for how each metric is calculated, including timezone handling (UTC for aggregation buckets, author-local for working patterns) and rename tracking (history merged across git-detected renames). + ### Developer profile Manager-facing report per developer showing scope, contribution type, pace, collaboration, and top files. @@ -211,19 +248,34 @@ IWorkspaceRepository.cs WorkspaceRepository.cs 19 ### Churn risk -Ranks files by a risk score combining recency-weighted churn with bus factor. Recent changes weigh more (exponential decay), and files with fewer owners score higher. +Ranks files by recency-weighted churn and classifies each into an actionable label, so you can tell a healthy core module apart from a legacy bottleneck without eyeballing five columns. 
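+
+The decision logic is small; a minimal sketch in Go (names here are illustrative, not the actual internal API — the real thresholds are named constants in `internal/stats/stats.go`, documented in `docs/METRICS.md`):
+
+```go
+// classify mirrors the label rules: evaluated top to bottom, first match wins.
+func classify(recentChurn, medianChurn float64, busFactor, ageDays int, trend float64) string {
+	switch {
+	case recentChurn <= 0.5*medianChurn:
+		return "cold" // too quiet to matter
+	case busFactor >= 3:
+		return "active" // shared ownership, healthy
+	case ageDays < 180:
+		return "active-core" // young code, few owners expected
+	case trend < 0.5:
+		return "legacy-hotspot" // old, concentrated, declining
+	default:
+		return "silo" // old, concentrated, still moving
+	}
+}
+```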
```bash gitcortex stats --input data.jsonl --stat churn-risk --top 15 gitcortex stats --input data.jsonl --stat churn-risk --churn-half-life 60 # faster decay ``` +Real output from the Pi-hole repository (one sample per label): + ``` -PATH RISK RECENT CHURN BUS FACTOR TOTAL CHANGES LAST CHANGE -src/Api/Controllers/Auth.cs 142.5 285.0 2 47 2024-03-28 -src/Domain/Entities/User.cs 98.3 98.3 1 12 2024-03-25 +PATH LABEL CHURN BF AGE TREND +automated install/basic-install.sh active 115.3 15 4121d 0.00 +.github/workflows/codeql-analysis.yml legacy-hotspot 66.2 2 1640d 0.26 +advanced/bash-completion/pihole-ftl.bash silo 16.5 1 240d 1.00 +test/_alpine_3_23.Dockerfile active-core 7.1 1 120d 1.00 +advanced/Templates/gravity.db.schema cold 0.0 1 2616d 1.00 ``` +| Label | Meaning | +|-------|---------| +| `cold` | Low recent churn — ignore. | +| `active` | Shared ownership (bus factor ≥ 3). Healthy. | +| `active-core` | New code (< 180d), single author. Usually fine. | +| `silo` | Old + concentrated + stable/growing. Knowledge bottleneck — plan transfer. | +| `legacy-hotspot` | **Urgent.** Old + concentrated + declining. Deprecated paths still being touched. | + +Sort key is `recent_churn`; the label answers "is this activity a problem?". The composite `risk_score` field (`recent_churn / bus_factor`) is still emitted for CI gate back-compat. + `--churn-half-life` controls how fast old changes lose weight (default 90 days = changes lose half their weight every 90 days). ### Working patterns @@ -330,7 +382,7 @@ Run automated checks and fail the build when thresholds are exceeded. # Fail if any file has bus factor of 1 gitcortex ci --input data.jsonl --fail-on-busfactor 1 -# Fail if any file has churn risk >= 500 +# Fail if any file has churn risk >= 500 (legacy composite: recent_churn / bus_factor) gitcortex ci --input data.jsonl --fail-on-churn-risk 500 # Both rules, GitHub Actions format @@ -344,6 +396,8 @@ Output formats: `text` (default), `github-actions` (annotations), `gitlab` (Code Exit code 1 when violations are found, 0 when clean. +> `--fail-on-churn-risk` evaluates the legacy `risk_score = recent_churn / bus_factor` field, not the new label classification surfaced by `stats --stat churn-risk`. The two can disagree — a file might have `risk_score` below the threshold yet still classify as `legacy-hotspot`. Use the stat command for triage; use the CI gate as a coarse threshold alarm. + ## Architecture ``` diff --git a/cmd/gitcortex/main.go b/cmd/gitcortex/main.go index 1fd018a..83e7aae 100644 --- a/cmd/gitcortex/main.go +++ b/cmd/gitcortex/main.go @@ -280,13 +280,19 @@ func renderStats(ds *stats.Dataset, sf *statsFlags) error { } return "well distributed" } - devsLabel := judge(p.DevsPct80Commits, p.TotalDevs) - if p.TotalDevs > 0 && p.DevsPct80Commits <= 10 { + devsLabel := judge(p.DevsPct80Commits, p.TopCommitDevs) + if p.TopCommitDevs > 0 && p.DevsPct80Commits <= 10 { devsLabel += ", key-person dependence" } - fmt.Fprintf(os.Stdout, "Files: %d of %d files concentrate 80%% of churn — %s\n", p.TopChurnFiles, p.TotalFiles, judge(p.FilesPct80Churn, p.TotalFiles)) - fmt.Fprintf(os.Stdout, "Devs: %d of %d devs produce 80%% of commits — %s\n", p.TopCommitDevs, p.TotalDevs, devsLabel) - fmt.Fprintf(os.Stdout, "Dirs: %d of %d dirs concentrate 80%% of churn — %s\n", p.TopChurnDirs, p.TotalDirs, judge(p.DirsPct80Churn, p.TotalDirs)) + // Gate each line on the corresponding TopChurn/TopCommit count. 
+ // judge() only uses its second arg as a zero-guard, so passing the
+ // Top* count (which is 0 when the signal is absent, e.g. zero-churn
+ // dataset or empty contributors) maps directly to "no data" without
+ // falsely hitting the "extremely concentrated" branch.
+ fmt.Fprintf(os.Stdout, "Files: %d of %d files concentrate 80%% of churn — %s\n", p.TopChurnFiles, p.TotalFiles, judge(p.FilesPct80Churn, p.TopChurnFiles))
+ fmt.Fprintf(os.Stdout, "Devs (commits): %d of %d devs produce 80%% of commits — %s\n", p.TopCommitDevs, p.TotalDevs, devsLabel)
+ fmt.Fprintf(os.Stdout, "Devs (churn): %d of %d devs produce 80%% of line churn — %s\n", p.TopChurnDevs, p.TotalDevs, judge(p.DevsPct80Churn, p.TopChurnDevs))
+ fmt.Fprintf(os.Stdout, "Dirs: %d of %d dirs concentrate 80%% of churn — %s\n", p.TopChurnDirs, p.TotalDirs, judge(p.DirsPct80Churn, p.TopChurnDirs))
 }
 if showAll || sf.stat == "top-commits" {
 fmt.Fprintf(os.Stderr, "\n=== Top %d Commits ===\n", sf.topN)
diff --git a/docs/METRICS.md b/docs/METRICS.md
index e049b23..843ceeb 100644
--- a/docs/METRICS.md
+++ b/docs/METRICS.md
@@ -39,13 +39,13 @@ Same as hotspots but aggregated by directory path. Files in the repository root

 | Column | Meaning |
 |--------|---------|
-| Commits | Sum of commits across all files in the directory |
+| File Touches | Sum of per-file commit counts. **A single commit touching N files in this directory contributes N to this number — this is NOT distinct commits.** |
 | Churn | Sum of additions + deletions for all files |
 | Files | Number of unique files in the directory |
 | Devs | Unique authors who touched any file in the directory |
 | Bus Factor | Minimum devs covering 80% of lines changed (see Bus Factor) |

-**How to interpret**: Gives a module-level health view for leadership meetings.
+**How to interpret**: Gives a module-level health view. Watch the `File Touches` name: because one commit touching N files counts N times, the number runs higher than what people intuitively call "commits" whenever commits span many files.

 ## Activity

@@ -71,14 +71,17 @@ For each file/directory: the minimum number of developers whose contributions co

 **How to interpret**: Files with bus factor 1 and high recent churn are the highest risk. Cross-reference with churn-risk.

+> **Caveat — weighted by lines, not commits.** A dev with one big 10k-line commit outweighs 100 small commits. If familiarity matters more than volume to you, treat bus factor as an upper bound on knowledge loss risk, not a direct measurement.
+
 ## Coupling

 File pairs that frequently change in the same commit.

 **Calculation**:
 1. For each commit with 2-N files (skipping single-file commits and commits with >50 files by default), generate all file pairs
-2. Count co-occurrences per pair
-3. Coupling % = co-changes / min(changes_A, changes_B) × 100
+2. Exclude pairs from **mechanical-refactor commits**: a commit touching ≥ `refactorMinFiles` (10) with mean per-file churn < `refactorMaxChurnPerFile` (5 lines) is treated as a global rename / format / lint fix, and its pairs are skipped. The files' individual change counts (`changes_A`, `changes_B`) still include these commits — only the pair accumulation is suppressed.
+3. Count co-occurrences per pair
+4. Coupling % = co-changes / min(changes_A, changes_B) × 100

 | Column | Meaning |
 |--------|---------|
@@ -93,16 +96,23 @@ File pairs that frequently change in the same commit.

 Based on Adam Tornhill's ["Your Code as a Crime Scene"](https://pragprog.com/titles/atcrime/your-code-as-a-crime-scene/) methodology.
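+
+A minimal sketch of the pair-accumulation step with this filter, in Go.
+Function and parameter names are assumed for illustration; the real logic is
+the streaming coupling state in `internal/stats/reader.go`:
+
+```go
+// flushPairs accumulates co-change pairs for one commit's file list.
+// Per-file change counts are tracked elsewhere and are not suppressed
+// by the refactor filter; only pair generation is.
+func flushPairs(files []string, churn int64, maxFiles int, pairs map[[2]string]int) {
+	if len(files) < 2 || len(files) > maxFiles { // maxFiles defaults to 50
+		return
+	}
+	mean := float64(churn) / float64(len(files))
+	if len(files) >= 10 && mean < 5.0 { // refactorMinFiles, refactorMaxChurnPerFile
+		return // mechanical refactor: suppress pairs
+	}
+	for i := 0; i < len(files); i++ {
+		for j := i + 1; j < len(files); j++ {
+			a, b := files[i], files[j]
+			if b < a {
+				a, b = b, a
+			}
+			pairs[[2]string{a, b}]++
+		}
+	}
+}
+```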
+> **Caveat — co-change is not causation.** Two files changing in the same commit proves they were touched by the same unit of work, not that one depends on the other. The refactor filter catches the most blatant false positives (global renames, format passes) but not all — a genuinely large feature touching many related files can still leak pair counts. Treat high coupling % as a hypothesis worth investigating, not a proof of architectural dependency.
+>
+> **Caveat — the refactor filter uses mean churn.** A commit mixing many mechanical renames (low churn each) with a few substantive edits (high churn each) can have mean > 5 and escape the filter. Example: 12 renames of 1 line + 3 real edits of 100 lines → mean ≈ 21, the filter does not fire, and the 12 rename-participating files generate ~66 spurious pairs. Per-file weighting would fix this but requires restructuring pair generation; it is an acknowledged limitation rather than a planned change.
+>
+> **Caveat — rename commits below `refactorMinFiles` leak.** A commit renaming 8 files (say, a small module rename) has zero churn per file but only 8 files, so the filter does not fire. Pairs are generated between old paths and then re-keyed onto the canonical (new) paths by the rename tracker. Each such pair has `CoChanges = 1`, so any sensible `--coupling-min-changes` threshold (≥ 2) filters them out of display. Readers inspecting raw data may still see the residual pairs.
+>
+> **Caveat — `--coupling-max-files` below the refactor threshold disables the filter.** The refactor filter only runs for commits with `≥ refactorMinFiles` (10) files, but the existing `--coupling-max-files` gate runs first. Setting `--coupling-max-files` below 10 discards those commits outright (no pairs at all), so the refactor filter becomes dead code. This is correct layering but worth knowing if you tune the flag aggressively.
+
 ## Churn Risk

-Files ranked by a risk score combining recent churn with bus factor.
+Files ranked by recency-weighted churn, **classified into actionable labels** so the reader can judge whether the activity is a problem or just ongoing work.

-**Calculation**:
-```
-risk_score = recent_churn / bus_factor
-```
+### Ranking
+
+Sort key: `recent_churn` descending (tiebreak: lower `bus_factor` first).

-Where `recent_churn` uses exponential decay:
+`recent_churn` uses exponential decay:
 ```
 weight = e^(-λ × days_since_change)
 recent_churn = Σ (additions + deletions) × weight
@@ -111,28 +121,74 @@ recent_churn = Σ (additions + deletions) × weight
 ```

 Default half-life: 90 days (changes lose half their weight every 90 days).
+λ follows from the half-life: `λ = ln(2) / half_life_days`, so the default 90-day half-life gives λ ≈ 0.0077 per day.

-**How to interpret**: High risk = lots of recent changes owned by few people. These files need knowledge transfer or pair programming.
+### Classification labels
+
+Ranking alone conflates "core being actively developed by one author" (expected)
+with "legacy module everyone forgot" (urgent). The label separates them by
+adding age and trend dimensions.
+
+**Rules are evaluated in order; the first match wins.** Conditions on later
+rows implicitly assume the earlier rows didn't match.
+
+| # | Label | Rule | Action |
+|---|-------|------|--------|
+| 1 | **cold** | `recent_churn ≤ 0.5 × median(recent_churn)` | Ignore. |
+| 2 | **active** | `bus_factor ≥ 3` | Healthy, shared. |
+| 3 | **active-core** | `bus_factor ≤ 2` and `age < 180 days` | New code, one or two owners is expected. |
+| 4 | **legacy-hotspot** | `bus_factor ≤ 2`, `age ≥ 180 days`, and `trend < 0.5` | **Urgent.** Old + concentrated + declining. 
| +| 5 | **silo** | default (everything the rules above didn't catch) | Knowledge bottleneck — plan transfer. | + +Where: +- `age = days between firstChange and latest commit in dataset` +- `trend = churn_last_3_months / churn_earlier`. Edge cases: empty history returns 1 (no signal); recent-only history returns 2 (grew from nothing); earlier-only history returns 0 (declined to nothing — the strongest `legacy-hotspot` signal); short-span datasets whose entire window fits inside the trend window return 1 to avoid false "growing" reports + +> **Sensitivity note.** Files touched a single time long ago and never again correctly route to `legacy-hotspot` via the earlier-only trend=0 path. On large mature repos this pattern is the common case, not the exception — e.g. validation on a kubernetes snapshot classified ~29k files this way. If the label distribution looks heavy on `legacy-hotspot` for a long-lived codebase, that is usually diagnosing real dormant code, not a bug. + +### Additional columns + +| Column | Meaning | +|--------|---------| +| `risk_score` | `recent_churn / bus_factor` — legacy composite. Still consumed by `gitcortex ci --fail-on-churn-risk N`. Not used for ranking. May diverge from the label (a file can have low `risk_score` but be classified `legacy-hotspot`, and vice versa). | +| `first_change`, `last_change` | Bounds of the file's activity in the dataset (UTC) | +| `age_days` | `latest - first_change` in days | +| `trend` | Ratio described above | + +### How to interpret + +- **legacy-hotspot** is the alarm — investigate first. +- **silo** suggests pairing / documentation work, not panic. +- **active-core** is usually fine, but watch for `bus_factor=1` + growing. +- **active** with growing trend may indicate a healthy shared module or a collision of too many cooks. ## Working Patterns Commit distribution by day of week and hour. - Uses author date (not committer date) -- Hours are in the author's local timezone (as recorded by git) +- Hours are in the author's **local** timezone as recorded by git — this metric describes human work rhythm, not UTC instants - Displayed as a 7×24 heatmap **How to interpret**: Reveals team timezones, work-life patterns, and off-hours work. Consistent late-night commits may indicate overwork. +> The per-developer working grid in `profile` uses the same local-timezone semantics. + ## Developer Network Collaboration graph based on shared file ownership. **Calculation**: -1. For each file, collect the set of authors -2. For each pair of authors who share files, count shared files -3. Weight = shared_files / max(files_A, files_B) × 100 +1. For each file, collect each author's line contribution +2. For each pair of authors who share files: + - `shared_files` = count of files both touched (any amount) + - `shared_lines` = `Σ min(lines_A, lines_B)` across shared files + - `weight` = `shared_files / max(files_A, files_B) × 100` (legacy) + +Sort is by `shared_lines` descending (tiebreak by `shared_files`). -**How to interpret**: Strong connections indicate collaboration. Isolated nodes with no connections may signal silos. High weight between two devs means they work on the same parts of the codebase. +**How to interpret**: +- `shared_lines` is the honest signal. Alice editing 1 line and Bob editing 200 on the same file share only 1 line of real collaboration — `shared_lines` reflects this, `shared_files` doesn't. +- Strong `shared_lines` = deep co-ownership. 
Low `shared_lines` with high `shared_files` = trivial touches (one-line fixes, format commits). +- Isolated nodes may signal silos. ## Profile @@ -144,11 +200,11 @@ Per-developer report combining multiple metrics. | Lines changed | additions + deletions across all commits | | Files touched | Unique files modified (from `contribFiles` accumulator) | | Active days | Unique dates with at least one commit | -| Pace | commits / active_days | +| Pace | commits / active_days (smooths bursts — a dev with 100 commits on 2 days and silence for 28 shows pace=50, which reads as a steady rate but isn't) | | Weekend % | commits on Saturday+Sunday / total commits × 100 | | Scope | Top 5 directories by unique file count, as % of total files touched | | Contribution type | Based on del/add ratio: growth (<0.4), balanced (0.4-0.8), refactor (>0.8) | -| Collaborators | Top 5 devs sharing the most files with this dev | +| Collaborators | Top 5 devs sharing code with this dev. Ranked by `shared_lines` (Σ min(linesA, linesB) across shared files), tiebreak `shared_files`, then email. Same `shared_lines` semantics as the Developer Network metric — discounts trivial one-line touches so "collaborator" reflects real overlap. | ## Top Commits @@ -161,19 +217,27 @@ Commits ranked by total lines changed (additions + deletions). How asymmetric is the work distribution. -**Calculation**: For each dimension (files, devs, directories), sort by the metric descending and count how many items are needed to reach 80% of the total. +**Calculation**: For each dimension, sort by the metric descending and count how many items are needed to reach 80% of the total. -| Metric | Dimension sorted by | -|--------|-------------------| -| Files | Churn (additions + deletions) | -| Devs | Commit count | -| Directories | Churn (additions + deletions) | +| Dimension | Sort key | Why | +|-----------|----------|-----| +| Files | Churn (additions + deletions) | — | +| Devs (commits) | Commit count | Rewards frequent committers. Inflated by bots and squash-off workflows. | +| Devs (churn) | additions + deletions | Rewards volume of lines written/removed. Inflated by generated-file authors and verbose coders. | +| Directories | Churn (additions + deletions) | — | + +Two dev lenses are surfaced because commit count alone is a flawed proxy for contribution: a squash-merge counts as one commit while a rebase-and-merge counts as many; bots routinely dominate commit leaderboards despite writing little code. Rather than replace one bias with another, gitcortex shows both and lets the divergence be the signal. + +**Reading the divergence**: +- Aligned counts (e.g. 17 ≈ 17) → consistent contributor base; both lenses agree. +- Commits ≫ churn (e.g. 267 vs 132 on kubernetes) → bots or squash workflows inflate commit counts. The smaller list is closer to "who actually wrote code". +- Churn ≫ commits → single heavy-feature authors who commit rarely but write volumes. **Judgment thresholds**: -- ≤10%: extremely concentrated (plus "key-person dependence" for the Devs dimension) +- ≤10%: extremely concentrated (plus "key-person dependence" on the Devs-by-commits card) - ≤25%: moderately concentrated - \>25%: well distributed -- total == 0: no data (neutral marker) +- total == 0 (no commits, or no churn for the churn lens): no data (neutral marker) **How to interpret**: "20 files concentrate 80% of all churn" describes where change lands — it can indicate a healthy core module under active development, or a bottleneck if combined with low bus factor. 
Cross-reference with the Churn Risk section before drawing conclusions. @@ -188,3 +252,101 @@ git cat-file --batch-check ─── sizes ─↗ - Commit messages excluded by default - Stats computed from pre-aggregated maps (not raw records) - All processing is 100% local + +## Behavior and caveats + +### Thresholds + +Every classification boundary is a named constant in `internal/stats/stats.go`. Values below are the defaults; there is no runtime flag to override them yet — changing a threshold requires editing the source and rebuilding. + +| Constant | Default | Controls | +|----------|---------|----------| +| `classifyColdChurnRatio` | `0.5` | A file is `cold` when `recent_churn ≤ ratio × median(recent_churn)`. | +| `classifyActiveBusFactor` | `3` | A file is `active` (shared, healthy) when `bus_factor ≥ this`. | +| `classifyOldAgeDays` | `180` | Age cutoff for `active-core` vs `silo`/`legacy-hotspot`. | +| `classifyDecliningTrend` | `0.5` | Trend ratio below this marks `legacy-hotspot` (old + declining). | +| `classifyTrendWindowMonths` | `3` | Window (months, relative to latest commit) for the recent vs earlier split in `trend`. | +| `contribRefactorRatio` | `0.8` | `del/add ≥ this` → dev profile `contribType = refactor`. | +| `contribBalancedRatio` | `0.4` | `0.4 ≤ del/add < 0.8` → `balanced`; below 0.4 → `growth`. | +| `refactorMinFiles` | `10` | Minimum files for a commit to be a mechanical-refactor candidate (coupling filter). | +| `refactorMaxChurnPerFile` | `5.0` | Mean churn per file below this in a candidate commit → treated as refactor; its pairs are excluded from coupling. | + +### Reproducibility + +Every ranking function has an explicit tiebreaker so the same input produces the same output across runs and between the CLI (`stats --format json`) and the HTML report. Without this, ties on integer keys (ubiquitous — e.g. many files with `bus_factor = 1`) would let Go's randomized map iteration produce a different top-N each time. + +| Stat | Primary key (desc unless noted) | Tiebreaker | +|------|---------------------------------|------------| +| `summary` | — | N/A (scalar) | +| `contributors` | commits | email asc | +| `hotspots` | commits | path asc | +| `directories` | file_touches | dir asc | +| `busfactor` | bus_factor (asc) | path asc | +| `coupling` | co_changes | coupling_pct | +| `churn-risk` | recent_churn | bus_factor asc | +| `top-commits` | lines_changed | sha asc | +| `dev-network` | shared_lines | shared_files | +| `profile` | commits | email asc | + +A third-level tiebreaker on path/sha/email asc is applied where primary and secondary can both tie (`churn-risk`, `coupling`, `dev-network`) so ordering is stable even with exact equality on the first two keys. Inside each profile, the `TopFiles`, `Scope`, and `Collaborators` sub-lists are also sorted with explicit tiebreakers (path / dir / email asc) so their internal ordering is deterministic too. + +### `--mailmap` off by default + +`extract` does not apply `.mailmap` unless you pass `--mailmap`. Without it, the same person with two emails (e.g. `alice@work.com` and `alice@personal.com`) splits into two contributors. Affected metrics: `contributors`, `bus factor`, `dev network`, `profiles`, churn-risk label (via bus factor). + +Extract emits a warning when the repo has a `.mailmap` file but the flag was omitted. Enable it for any repo where identity matters: +```bash +gitcortex extract --repo . 
--mailmap +``` + +### Timezone handling + +Two classes of metrics, different rules: + +- **Cross-commit aggregation** (monthly/weekly/yearly buckets, active-day counts, trend calculation, display dates) uses **UTC**. Two commits at the same UTC instant land in the same bucket regardless of each author's local offset. +- **Human-rhythm metrics** (working patterns heatmap, per-dev work grid) use the **author's local timezone** as recorded by git. These describe *when the person was typing*, not instant-of-time. + +Side effects worth knowing: +- A commit at `2024-03-31T23:00-05:00` (local) equals `2024-04-01T04:00Z` (UTC). It belongs to **April** in monthly activity but to **March 31** in the author's working grid. +- `active_days` is counted by UTC date. A developer who always commits near midnight local time may show slightly different day counts than a pure local-TZ count. +- `first_commit_date` / `last_commit_date` in the summary are UTC. +- **Working patterns trust the author's committed timezone.** If a dev has their laptop clock set wrong or a CI agent impersonates the author in UTC, the heatmap silently reflects that. There is no sanity check. + +### Rename tracking + +gitcortex uses `git log -M` so renames and moves (including cross-directory moves) are detected by git's similarity matcher. When a rename is detected, the historical entries for the old and new paths are **merged into a single canonical entry** (the newest path in the chain) during finalization. + +Effects: +- `Total files touched` reflects canonical files, not distinct paths seen in history. +- `firstChange`, `monthChurn`, and `devLines` on the canonical entry span the **full history** including pre-rename commits. +- Rename chains (`A → B → C`) collapse to `C`. +- Files renamed to each other and later co-changing don't produce self-pairs in coupling. +- `FilesTouched` in developer profiles counts canonical paths, so one dev editing a file before and after a rename counts as **one** file. + +Limits: +- Git's rename detection defaults to ~50% similarity. A rename with heavy edits may not be detected, resulting in separate delete + add entries. +- Copies (`C*` status) are **not** merged — copied files legitimately live as two entries. +- If the rename commit falls outside a `--since` filter, the edge isn't captured and the old/new paths stay separate within the filtered window. The reuse heuristics (`oldPathCounts`, `isRenameTarget`, `maxEdgeDate`) are computed only over edges *inside* the filter window, so a rename that happened outside the window cannot participate in the signal — `wasRecreated` may be false even when the repo history contains a recreation, which can cause the single-edge reuse check to fire (or not fire) differently than it would on the full history. +- **Path reuse patterns.** gitcortex detects four distinct shapes where `oldPath` is ambiguous and handles each explicitly: + + 1. **Multi-edge reuse (different lineages):** `A → B`, then later the name `A` is reused for an unrelated file that gets renamed to `D`. Two edges with `oldPath = A`, no intermediate edge pointing at `A`. gitcortex **refuses to migrate** either edge — A stays put with merged lineages, B and D keep only their own post-rename data. Detected by `oldPathCounts[A] > 1 && !isRenameTarget[A]`. + + 2. **Rename-back chain (same lineage):** `A → B → A → C`. The repeated `A` is recreated by an explicit `B → A` edge, so the whole chain belongs to one lineage and collapses into `C`. Detected by `oldPathCounts[A] > 1 && isRenameTarget[A]`. 
+ + 3. **Single-edge reuse (different lineages, common in the wild):** `A → B` (one edge), then a new unrelated file is created at `A` and never renamed further. `oldPathCounts[A] = 1`, so the multi-edge heuristic does not fire. Detected **temporally**: `ds.files["A"].lastChange` is after `maxEdgeDate[A]` (the latest commitDate of any edge involving `A`, as source or target). Activity on `A` that continued past every rename involving `A` means a new lineage. Skip migration. + + 4. **Chain + reuse (same temporal check, harder pattern):** `D → A → B` chain followed by a new file at `A`. `A` is both a rename target (`D → A`) and a rename source (`A → B`), so pattern #2 would classify it as rename-back. The temporal check correctly fires because `lastChange(A) > maxEdgeDate[A]` (= `A → B` commit date). Real-data impact: 136 such files in the kubernetes snapshot. + + **Known limitation:** the exotic mixed pattern `A → B` (lineage 1), then `C → A` (lineage 2 recreation-via-rename), then `A → D` (lineage 2 renamed) — here `wasRecreated[A] = true` via `C → A`, `maxEdgeDate[A]` = `A → D` commit date, and typical `lastChange(A)` is bounded by that. The heuristic treats it as chain and misattributes lineage 1's pre-rename commits to `D`. A full fix requires per-commit temporal segmentation of `ds.files["A"]`, a larger architectural change. + +### `--since` filter + ChurnRisk age + +`firstChange` is the first time a file appears **in the filtered dataset**, not in repo history. When you run `--since=30d`, a file created 4 years ago but touched yesterday gets `age_days ≈ 0` and classifies as `active-core` — even though it's genuinely legacy. + +If you need the label to reflect true age, either extract without `--since` (then filter queries in post), or treat label output under a `--since` run as "what's happening in this window" rather than "what kind of file is this". + +### Classification degenerate edge cases + +- **Renames reverted (cycle A→B→A).** The resolver bails out of the cycle with the current path; it doesn't crash but the "canonical" is implementation-defined for cyclic inputs. +- **Repo with single file.** The median-based `cold` threshold degenerates (median is that file's churn); the single file is never classified `cold`. +- **All files with identical churn.** Median equals every value, `lowChurn = median × 0.5`, so nothing is `cold`. Everything falls into the bf/age/trend tree. diff --git a/internal/extract/extract.go b/internal/extract/extract.go index 8a54911..ad14bf9 100644 --- a/internal/extract/extract.go +++ b/internal/extract/extract.go @@ -93,6 +93,15 @@ func LoadState(stateFile string, flagOffset int, flagSHA string) (State, error) } func Run(ctx context.Context, cfg Config) error { + // Warn when the repo has a .mailmap but the flag wasn't set — identity + // stats (bus factor, dev network, profiles) will split the same person + // across aliases without it. Silent divergence is the main concern. 
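+ // The err == nil check below is deliberately loose: any Stat failure
+ // (file absent, permission error) just skips the warning; a
+ // best-effort hint is all this needs.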
+ if !cfg.Mailmap { + if _, err := os.Stat(filepath.Join(cfg.Repo, ".mailmap")); err == nil { + log.Printf("warning: %s contains .mailmap but --mailmap was not set; identity metrics (bus factor, dev network, profiles) may split the same person across aliases", cfg.Repo) + } + } + initialState, err := LoadState(cfg.StateFile, cfg.StartOffset, cfg.StartSHA) if err != nil { return fmt.Errorf("load state: %w", err) diff --git a/internal/report/profile_template.go b/internal/report/profile_template.go index ca3dc3c..0c23585 100644 --- a/internal/report/profile_template.go +++ b/internal/report/profile_template.go @@ -77,12 +77,12 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col {{if .Profile.Collaborators}}
Collaboration
-
Developers who touch the same files as this developer (number = shared files). Higher counts signal close collaboration or shared ownership; few connections with high counts suggest a tight pair rather than broad reach.
+
Developers who touch the same files as this developer. files = shared file count; lines = Σ min(linesA, linesB) across those files — the honest overlap signal that discounts trivial one-line touches. Sorted by shared lines.
{{range .Profile.Collaborators}} {{.Email}} - {{.SharedFiles}} + {{.SharedFiles}}f · {{.SharedLines}}l {{end}}
diff --git a/internal/report/report.go b/internal/report/report.go index 6a3fa2f..6cbe586 100644 --- a/internal/report/report.go +++ b/internal/report/report.go @@ -35,15 +35,17 @@ type ReportData struct { } type ParetoData struct { - FilesPct80Churn float64 // % of files that account for 80% of churn - DevsPct80Commits float64 // % of devs that account for 80% of commits - DirsPct80Churn float64 // % of dirs that account for 80% of churn - TopChurnFiles int - TotalFiles int - TopCommitDevs int - TotalDevs int - TopChurnDirs int - TotalDirs int + FilesPct80Churn float64 // % of files that account for 80% of churn + DevsPct80Commits float64 // % of devs that account for 80% of commits + DevsPct80Churn float64 // % of devs that account for 80% of churn (see METRICS.md — complements commits) + DirsPct80Churn float64 // % of dirs that account for 80% of churn + TopChurnFiles int + TotalFiles int + TopCommitDevs int + TopChurnDevs int + TotalDevs int + TopChurnDirs int + TotalDirs int } func ComputePareto(ds *stats.Dataset) ParetoData { @@ -57,37 +59,84 @@ func ComputePareto(ds *stats.Dataset) ParetoData { totalChurn += h.Churn } p.TotalFiles = len(hotspots) - threshold := float64(totalChurn) * 0.8 - var cum int64 - for _, h := range hotspots { - cum += h.Churn - p.TopChurnFiles++ - if float64(cum) >= threshold { - break + // Guard: when totalChurn is zero (merges-only dataset, or all empty + // commits), skip the loop entirely. Without this, the first iteration + // trips on `cum >= 0` and leaves TopChurnFiles = 1 for empty signal. + if totalChurn > 0 { + threshold := float64(totalChurn) * 0.8 + var cum int64 + for _, h := range hotspots { + cum += h.Churn + p.TopChurnFiles++ + if float64(cum) >= threshold { + break + } + } + if p.TotalFiles > 0 { + p.FilesPct80Churn = math.Round(float64(p.TopChurnFiles) / float64(p.TotalFiles) * 1000) / 10 } - } - if p.TotalFiles > 0 { - p.FilesPct80Churn = math.Round(float64(p.TopChurnFiles) / float64(p.TotalFiles) * 1000) / 10 } - // Devs: % of devs for 80% of commits + // Devs: two complementary lenses. + // - 80% of commits: rewards frequent committers (bots, squash-off teams). + // - 80% of churn: rewards volume of lines written/removed. + // Divergence between the two is informative (bot author vs feature author). contribs := stats.TopContributors(ds, 0) + p.TotalDevs = len(contribs) + var totalCommits int for _, c := range contribs { totalCommits += c.Commits } - p.TotalDevs = len(contribs) - commitThreshold := float64(totalCommits) * 0.8 - var cumCommits int - for _, c := range contribs { - cumCommits += c.Commits - p.TopCommitDevs++ - if float64(cumCommits) >= commitThreshold { - break + // Guard: when the aggregate is zero, the 80% threshold is zero and the + // first iteration trips it, producing TopX=1 for an empty signal. Skip. + if totalCommits > 0 { + commitThreshold := float64(totalCommits) * 0.8 + var cumCommits int + for _, c := range contribs { + cumCommits += c.Commits + p.TopCommitDevs++ + if float64(cumCommits) >= commitThreshold { + break + } + } + if p.TotalDevs > 0 { + p.DevsPct80Commits = math.Round(float64(p.TopCommitDevs) / float64(p.TotalDevs) * 1000) / 10 } } - if p.TotalDevs > 0 { - p.DevsPct80Commits = math.Round(float64(p.TopCommitDevs) / float64(p.TotalDevs) * 1000) / 10 + + // Dev churn ranking: re-sort contribs by lines changed, apply same 80% + // cumulative cutoff. Tiebreaker on email asc for determinism. The copy + // preserves the commits-ordered `contribs` slice in case of future reuse. 
+ byChurn := make([]stats.ContributorStat, len(contribs)) + copy(byChurn, contribs) + sort.Slice(byChurn, func(i, j int) bool { + li := byChurn[i].Additions + byChurn[i].Deletions + lj := byChurn[j].Additions + byChurn[j].Deletions + if li != lj { + return li > lj + } + return byChurn[i].Email < byChurn[j].Email + }) + var totalDevChurn int64 + for _, c := range byChurn { + totalDevChurn += c.Additions + c.Deletions + } + // Same zero-aggregate guard as above. Without it, zero-churn datasets + // (e.g., all empty commits) would report 1 dev as the 80% owner. + if totalDevChurn > 0 { + devChurnThreshold := float64(totalDevChurn) * 0.8 + var cumDevChurn int64 + for _, c := range byChurn { + cumDevChurn += c.Additions + c.Deletions + p.TopChurnDevs++ + if float64(cumDevChurn) >= devChurnThreshold { + break + } + } + if p.TotalDevs > 0 { + p.DevsPct80Churn = math.Round(float64(p.TopChurnDevs) / float64(p.TotalDevs) * 1000) / 10 + } } // Dirs: % of dirs for 80% of churn @@ -97,17 +146,20 @@ func ComputePareto(ds *stats.Dataset) ParetoData { totalDirChurn += d.Churn } p.TotalDirs = len(dirs) - dirThreshold := float64(totalDirChurn) * 0.8 - var cumDirChurn int64 - for _, d := range dirs { - cumDirChurn += d.Churn - p.TopChurnDirs++ - if float64(cumDirChurn) >= dirThreshold { - break + // Same zero-churn guard as files. + if totalDirChurn > 0 { + dirThreshold := float64(totalDirChurn) * 0.8 + var cumDirChurn int64 + for _, d := range dirs { + cumDirChurn += d.Churn + p.TopChurnDirs++ + if float64(cumDirChurn) >= dirThreshold { + break + } + } + if p.TotalDirs > 0 { + p.DirsPct80Churn = math.Round(float64(p.TopChurnDirs) / float64(p.TotalDirs) * 1000) / 10 } - } - if p.TotalDirs > 0 { - p.DirsPct80Churn = math.Round(float64(p.TopChurnDirs) / float64(p.TotalDirs) * 1000) / 10 } return p diff --git a/internal/report/report_test.go b/internal/report/report_test.go index a41d815..0a5a872 100644 --- a/internal/report/report_test.go +++ b/internal/report/report_test.go @@ -2,6 +2,7 @@ package report import ( "bytes" + "fmt" "os" "path/filepath" "strings" @@ -186,6 +187,120 @@ func TestBuildActivityGrid_Monthly(t *testing.T) { } } +func TestComputeParetoDivergenceBotVsAuthor(t *testing.T) { + // Scenario: bot commits 100 tiny commits, human commits 3 big ones. + // Commits-based lens says bot dominates (100/103 ≈ 97%). + // Churn-based lens says human dominates (3000/3100 ≈ 97%). + // The two numbers must diverge — that is the whole reason the card exists. 
+ var lines []string
+ // 100 tiny commits from bot (1 line each)
+ for i := 0; i < 100; i++ {
+ sha := fmt.Sprintf("%040d", i+1)
+ lines = append(lines,
+ fmt.Sprintf(`{"type":"commit","sha":"%s","author_name":"bot","author_email":"bot@ci","author_date":"2024-01-15T10:00:00Z","additions":1,"deletions":0,"files_changed":1}`, sha),
+ fmt.Sprintf(`{"type":"commit_file","commit":"%s","path_current":"log.txt","status":"M","additions":1,"deletions":0}`, sha),
+ )
+ }
+ // 3 big commits from a human (1000 lines each)
+ for i := 0; i < 3; i++ {
+ sha := fmt.Sprintf("h%039d", i+1)
+ lines = append(lines,
+ fmt.Sprintf(`{"type":"commit","sha":"%s","author_name":"Human","author_email":"h@x","author_date":"2024-01-15T10:00:00Z","additions":1000,"deletions":0,"files_changed":1}`, sha),
+ fmt.Sprintf(`{"type":"commit_file","commit":"%s","path_current":"feature.go","status":"A","additions":1000,"deletions":0}`, sha),
+ )
+ }
+ dir := t.TempDir()
+ path := filepath.Join(dir, "divergence.jsonl")
+ if err := os.WriteFile(path, []byte(strings.Join(lines, "\n")+"\n"), 0644); err != nil {
+ t.Fatal(err)
+ }
+ ds, err := stats.LoadJSONL(path)
+ if err != nil {
+ t.Fatal(err)
+ }
+ p := ComputePareto(ds)
+
+ // Both lenses should identify 1 dev (out of 2) as holding 80%.
+ // But WHICH dev is different: commits → bot, churn → human.
+ if p.TopCommitDevs != 1 {
+ t.Errorf("TopCommitDevs = %d, want 1 (bot dominates commits)", p.TopCommitDevs)
+ }
+ if p.TopChurnDevs != 1 {
+ t.Errorf("TopChurnDevs = %d, want 1 (human dominates churn)", p.TopChurnDevs)
+ }
+ // The percentage is the same for both lenses (1 of 2 devs = 50%); the
+ // divergence is in WHICH dev, which the checks above pin down. Assert
+ // the concrete values so both code paths are actually exercised.
+ if p.DevsPct80Commits != 50.0 {
+ t.Errorf("DevsPct80Commits = %.1f, want 50.0", p.DevsPct80Commits)
+ }
+ if p.DevsPct80Churn != 50.0 {
+ t.Errorf("DevsPct80Churn = %.1f, want 50.0", p.DevsPct80Churn)
+ }
+}
+
+func TestComputeParetoZeroChurn(t *testing.T) {
+ // All commits have zero additions and zero deletions (e.g., pure merges
+ // or empty commits). TopChurnDevs must stay 0, not leak to 1 via the
+ // zero-threshold-tripped-on-first-iteration bug.
+ jsonl := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","author_name":"A","author_email":"a@x","author_date":"2024-01-10T10:00:00Z","additions":0,"deletions":0,"files_changed":0}
+{"type":"commit","sha":"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb","author_name":"B","author_email":"b@x","author_date":"2024-01-11T10:00:00Z","additions":0,"deletions":0,"files_changed":0}
+`
+ dir := t.TempDir()
+ path := filepath.Join(dir, "zero.jsonl")
+ if err := os.WriteFile(path, []byte(jsonl), 0644); err != nil {
+ t.Fatal(err)
+ }
+ ds, err := stats.LoadJSONL(path)
+ if err != nil {
+ t.Fatal(err)
+ }
+ p := ComputePareto(ds)
+
+ if p.TopChurnDevs != 0 {
+ t.Errorf("TopChurnDevs = %d, want 0 on zero-churn dataset", p.TopChurnDevs)
+ }
+ if p.DevsPct80Churn != 0 {
+ t.Errorf("DevsPct80Churn = %.1f, want 0", p.DevsPct80Churn)
+ }
+}
+
+func TestComputeParetoFilesAndDirsZeroChurn(t *testing.T) {
+ // Files and dirs exist but every commit_file record has zero churn
+ // (e.g. a sequence of pure renames with no content change). Previously
+ // the Files and Dirs loops would trip the zero-threshold on the first
+ // iteration and leave TopChurnFiles = TopChurnDirs = 1 — producing a
+ // false "extremely concentrated" label. Guards added to ComputePareto
+ // now skip the loops entirely when the aggregate is zero. 
+ jsonl := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","author_name":"A","author_email":"a@x","author_date":"2024-01-10T10:00:00Z","additions":0,"deletions":0,"files_changed":2} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"src/foo.go","path_previous":"src/foo.go","status":"M","additions":0,"deletions":0} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"src/bar.go","path_previous":"src/bar.go","status":"M","additions":0,"deletions":0} +` + dir := t.TempDir() + path := filepath.Join(dir, "zero_file_churn.jsonl") + if err := os.WriteFile(path, []byte(jsonl), 0644); err != nil { + t.Fatal(err) + } + ds, err := stats.LoadJSONL(path) + if err != nil { + t.Fatal(err) + } + p := ComputePareto(ds) + + if p.TotalFiles != 2 { + t.Errorf("TotalFiles = %d, want 2 (files exist in dataset)", p.TotalFiles) + } + if p.TopChurnFiles != 0 { + t.Errorf("TopChurnFiles = %d, want 0 (no churn signal)", p.TopChurnFiles) + } + if p.FilesPct80Churn != 0 { + t.Errorf("FilesPct80Churn = %.1f, want 0", p.FilesPct80Churn) + } + if p.TopChurnDirs != 0 { + t.Errorf("TopChurnDirs = %d, want 0 (no churn signal)", p.TopChurnDirs) + } + if p.DirsPct80Churn != 0 { + t.Errorf("DirsPct80Churn = %.1f, want 0", p.DirsPct80Churn) + } +} + func TestComputePareto(t *testing.T) { ds := loadFixture(t) p := ComputePareto(ds) @@ -213,11 +328,16 @@ func TestComputePareto(t *testing.T) { } // Percentages must be in [0, 100]. - for _, v := range []float64{p.FilesPct80Churn, p.DevsPct80Commits, p.DirsPct80Churn} { + for _, v := range []float64{p.FilesPct80Churn, p.DevsPct80Commits, p.DevsPct80Churn, p.DirsPct80Churn} { if v < 0 || v > 100 { t.Errorf("pct out of range: %.1f", v) } } + + // DevsPct80Churn should also be populated (fixture has non-zero churn). + if p.TopChurnDevs == 0 { + t.Errorf("TopChurnDevs = 0, want > 0 (devs have non-zero churn in fixture)") + } } // Documents current behavior: buildActivityGrid only parses "YYYY-MM". diff --git a/internal/report/template.go b/internal/report/template.go index 963ef40..1b643ab 100644 --- a/internal/report/template.go +++ b/internal/report/template.go @@ -56,24 +56,31 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col

Pareto distribution across files, developers, and directories. Few items carrying 80% of activity means high concentration. Red and yellow markers deserve a closer look — concentration may signal a critical core module or a knowledge bottleneck, depending on context.

- {{if eq .Pareto.TotalFiles 0}}⚪{{else if le .Pareto.FilesPct80Churn 10.0}}🔴{{else if le .Pareto.FilesPct80Churn 25.0}}🟡{{else}}🟢{{end}} + {{if eq .Pareto.TopChurnFiles 0}}⚪{{else if le .Pareto.FilesPct80Churn 10.0}}🔴{{else if le .Pareto.FilesPct80Churn 25.0}}🟡{{else}}🟢{{end}}
{{.Pareto.TopChurnFiles}} files concentrate 80% of all churn
-
out of {{.Pareto.TotalFiles}} total files — {{if eq .Pareto.TotalFiles 0}}no data{{else if le .Pareto.FilesPct80Churn 10.0}}extremely concentrated{{else if le .Pareto.FilesPct80Churn 25.0}}moderately concentrated{{else}}well distributed{{end}}
+
out of {{.Pareto.TotalFiles}} total files — {{if eq .Pareto.TopChurnFiles 0}}no data{{else if le .Pareto.FilesPct80Churn 10.0}}extremely concentrated{{else if le .Pareto.FilesPct80Churn 25.0}}moderately concentrated{{else}}well distributed{{end}}
- {{if eq .Pareto.TotalDevs 0}}⚪{{else if le .Pareto.DevsPct80Commits 10.0}}🔴{{else if le .Pareto.DevsPct80Commits 25.0}}🟡{{else}}🟢{{end}} + {{if eq .Pareto.TopCommitDevs 0}}⚪{{else if le .Pareto.DevsPct80Commits 10.0}}🔴{{else if le .Pareto.DevsPct80Commits 25.0}}🟡{{else}}🟢{{end}}
{{.Pareto.TopCommitDevs}} devs produce 80% of all commits
-
out of {{.Pareto.TotalDevs}} total devs — {{if eq .Pareto.TotalDevs 0}}no data{{else if le .Pareto.DevsPct80Commits 10.0}}extremely concentrated, key-person dependence{{else if le .Pareto.DevsPct80Commits 25.0}}moderately concentrated{{else}}well distributed{{end}}
+
out of {{.Pareto.TotalDevs}} total devs — {{if eq .Pareto.TopCommitDevs 0}}no data{{else if le .Pareto.DevsPct80Commits 10.0}}extremely concentrated, key-person dependence{{else if le .Pareto.DevsPct80Commits 25.0}}moderately concentrated{{else}}well distributed{{end}}
- {{if eq .Pareto.TotalDirs 0}}⚪{{else if le .Pareto.DirsPct80Churn 10.0}}🔴{{else if le .Pareto.DirsPct80Churn 25.0}}🟡{{else}}🟢{{end}} + {{if eq .Pareto.TopChurnDevs 0}}⚪{{else if le .Pareto.DevsPct80Churn 10.0}}🔴{{else if le .Pareto.DevsPct80Churn 25.0}}🟡{{else}}🟢{{end}} +
+
{{.Pareto.TopChurnDevs}} devs produce 80% of all line churn
+
out of {{.Pareto.TotalDevs}} total devs — {{if eq .Pareto.TopChurnDevs 0}}no data{{else if le .Pareto.DevsPct80Churn 10.0}}extremely concentrated{{else if le .Pareto.DevsPct80Churn 25.0}}moderately concentrated{{else}}well distributed{{end}}. Compare to the commits card: divergence reveals bots (commits ≫ churn) or feature owners (churn ≫ commits).
+
+
+
+ {{if eq .Pareto.TopChurnDirs 0}}⚪{{else if le .Pareto.DirsPct80Churn 10.0}}🔴{{else if le .Pareto.DirsPct80Churn 25.0}}🟡{{else}}🟢{{end}}
{{.Pareto.TopChurnDirs}} directories concentrate 80% of all churn
-
out of {{.Pareto.TotalDirs}} total directories — {{if eq .Pareto.TotalDirs 0}}no data{{else if le .Pareto.DirsPct80Churn 10.0}}extremely concentrated{{else if le .Pareto.DirsPct80Churn 25.0}}moderately concentrated{{else}}well distributed{{end}}
+
out of {{.Pareto.TotalDirs}} total directories — {{if eq .Pareto.TopChurnDirs 0}}no data{{else if le .Pareto.DirsPct80Churn 10.0}}extremely concentrated{{else if le .Pareto.DirsPct80Churn 25.0}}moderately concentrated{{else}}well distributed{{end}}
@@ -156,13 +163,13 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col {{if .Directories}}

Directories

-

Module-level health. Commits and churn aggregated by directory. Low bus factor = knowledge concentrated in few people.

+

Module-level health. File touches is the sum of per-file commit counts (one commit touching N files contributes N), not distinct commits. Low bus factor = knowledge concentrated in few people.

- + {{range .Directories}} - + @@ -174,17 +181,19 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col {{if .ChurnRisk}}

Churn Risk

-

Files ranked by recent churn weighted by bus factor. High risk = lots of recent changes owned by few people. Prioritize knowledge transfer here.

+

Files ranked by recent churn. Label classifies context so you can judge action: legacy-hotspot (old code + concentrated + declining) is the urgent alarm; silo suggests knowledge transfer; active-core is young code with a single author (often fine); active is shared healthy work; cold is quiet.

DirectoryCommitsChurnFilesDevsBus Factor
DirectoryFile TouchesChurnFilesDevsBus Factor
{{.Dir}}{{.Commits}}{{.FileTouches}} {{.Churn}} {{.Files}} {{.UniqueDevs}}
- -{{$maxRisk := 0.0}}{{range .ChurnRisk}}{{if gt .RiskScore $maxRisk}}{{$maxRisk = .RiskScore}}{{end}}{{end}} + +{{$maxChurn := 0.0}}{{range .ChurnRisk}}{{if gt .RecentChurn $maxChurn}}{{$maxChurn = .RecentChurn}}{{end}}{{end}} {{range .ChurnRisk}} - - + + + + {{end}} @@ -258,14 +267,15 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col {{if .DevNetwork}}

Developer Network

-

Developers who modify the same files. Strong connections indicate collaboration or shared ownership. Isolated nodes may signal silos.

+

Developers who modify the same files. Shared lines = Σ min(lines_A, lines_B) per file — measures real overlap, not trivial one-line touches. Sorted by shared lines.

PathRiskRecent ChurnBus FactorLast Change
PathLabelRecent ChurnBFAgeTrendLast Change
{{.Path}}{{printf "%.1f" .RiskScore}}
{{if eq .Label "legacy-hotspot"}}🔴 {{.Label}}{{else if eq .Label "silo"}}🟡 {{.Label}}{{else if eq .Label "active-core"}}{{.Label}}{{else if eq .Label "active"}}{{.Label}}{{else}}{{.Label}}{{end}} {{printf "%.1f" .RecentChurn}}
{{.BusFactor}}{{.AgeDays}}d{{if lt .Trend 0.5}}↓ {{printf "%.2f" .Trend}}{{else if gt .Trend 1.5}}↑ {{printf "%.2f" .Trend}}{{else}}→ {{printf "%.2f" .Trend}}{{end}} {{.LastChangeDate}}
- + {{range .DevNetwork}} + {{end}} @@ -291,7 +301,7 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col {{printf "%.1f" .Pace}} commits/active dayCollaboration - {{if .Collaborators}}{{range $i, $c := .Collaborators}}{{if $i}}, {{end}}{{$c.Email}} ({{$c.SharedFiles}}){{end}}{{else}}solo contributor{{end}} + {{if .Collaborators}}{{range $i, $c := .Collaborators}}{{if $i}}, {{end}}{{$c.Email}} ({{$c.SharedFiles}} files, {{$c.SharedLines}} lines){{end}}{{else}}solo contributor{{end}}Weekend{{printf "%.1f" .WeekendPct}}% diff --git a/internal/stats/format.go b/internal/stats/format.go index 6fd0b2f..806777b 100644 --- a/internal/stats/format.go +++ b/internal/stats/format.go @@ -124,20 +124,20 @@ func (f *Formatter) PrintDirectories(dirs []DirStat) error { for i, d := range dirs { rows[i] = []string{ d.Dir, - fmt.Sprintf("%d", d.Commits), + fmt.Sprintf("%d", d.FileTouches), fmt.Sprintf("%d", d.Churn), fmt.Sprintf("%d", d.Files), fmt.Sprintf("%d", d.UniqueDevs), fmt.Sprintf("%d", d.BusFactor), } } - return f.writeCSV([]string{"directory", "commits", "churn", "files", "devs", "bus_factor"}, rows) + return f.writeCSV([]string{"directory", "file_touches", "churn", "files", "devs", "bus_factor"}, rows) default: tw := tabwriter.NewWriter(f.w, 0, 4, 2, ' ', 0) - fmt.Fprintf(tw, "DIRECTORY\tCOMMITS\tCHURN\tFILES\tDEVS\tBUS FACTOR\n") - fmt.Fprintf(tw, "---------\t-------\t-----\t-----\t----\t----------\n") + fmt.Fprintf(tw, "DIRECTORY\tFILE TOUCHES\tCHURN\tFILES\tDEVS\tBUS FACTOR\n") + fmt.Fprintf(tw, "---------\t------------\t-----\t-----\t----\t----------\n") for _, d := range dirs { - fmt.Fprintf(tw, "%s\t%d\t%d\t%d\t%d\t%d\n", d.Dir, d.Commits, d.Churn, d.Files, d.UniqueDevs, d.BusFactor) + fmt.Fprintf(tw, "%s\t%d\t%d\t%d\t%d\t%d\n", d.Dir, d.FileTouches, d.Churn, d.Files, d.UniqueDevs, d.BusFactor) } return tw.Flush() } @@ -230,20 +230,23 @@ func (f *Formatter) PrintChurnRisk(results []ChurnRiskResult) error { for i, r := range results { rows[i] = []string{ r.Path, - fmt.Sprintf("%.1f", r.RiskScore), + r.Label, fmt.Sprintf("%.1f", r.RecentChurn), fmt.Sprintf("%d", r.BusFactor), + fmt.Sprintf("%d", r.AgeDays), + fmt.Sprintf("%.2f", r.Trend), fmt.Sprintf("%d", r.TotalChanges), + r.FirstChangeDate, r.LastChangeDate, } } - return f.writeCSV([]string{"path", "risk_score", "recent_churn", "bus_factor", "total_changes", "last_change"}, rows) + return f.writeCSV([]string{"path", "label", "recent_churn", "bus_factor", "age_days", "trend", "total_changes", "first_change", "last_change"}, rows) default: tw := tabwriter.NewWriter(f.w, 0, 4, 2, ' ', 0) - fmt.Fprintf(tw, "PATH\tRISK\tRECENT CHURN\tBUS FACTOR\tTOTAL CHANGES\tLAST CHANGE\n") - fmt.Fprintf(tw, "----\t----\t------------\t----------\t-------------\t-----------\n") + fmt.Fprintf(tw, "PATH\tLABEL\tRECENT CHURN\tBF\tAGE\tTREND\tLAST CHANGE\n") + fmt.Fprintf(tw, "----\t-----\t------------\t--\t---\t-----\t-----------\n") for _, r := range results { - fmt.Fprintf(tw, "%s\t%.1f\t%.1f\t%d\t%d\t%s\n", r.Path, r.RiskScore, r.RecentChurn, r.BusFactor, r.TotalChanges, r.LastChangeDate) + fmt.Fprintf(tw, "%s\t%s\t%.1f\t%d\t%dd\t%.2f\t%s\n", r.Path, r.Label, r.RecentChurn, r.BusFactor, r.AgeDays, r.Trend, r.LastChangeDate) } return tw.Flush() } @@ -301,16 +304,17 @@ func (f *Formatter) PrintDevNetwork(edges []DevEdge) error { rows[i] = []string{ e.DevA, e.DevB, fmt.Sprintf("%d", e.SharedFiles), + fmt.Sprintf("%d", e.SharedLines), fmt.Sprintf("%.1f", e.Weight), } } - return f.writeCSV([]string{"dev_a", "dev_b", "shared_files", 
"weight_pct"}, rows) + return f.writeCSV([]string{"dev_a", "dev_b", "shared_files", "shared_lines", "weight_pct"}, rows) default: tw := tabwriter.NewWriter(f.w, 0, 4, 2, ' ', 0) - fmt.Fprintf(tw, "DEV A\tDEV B\tSHARED FILES\tWEIGHT\n") - fmt.Fprintf(tw, "-----\t-----\t------------\t------\n") + fmt.Fprintf(tw, "DEV A\tDEV B\tSHARED FILES\tSHARED LINES\tWEIGHT\n") + fmt.Fprintf(tw, "-----\t-----\t------------\t------------\t------\n") for _, e := range edges { - fmt.Fprintf(tw, "%s\t%s\t%d\t%.1f%%\n", e.DevA, e.DevB, e.SharedFiles, e.Weight) + fmt.Fprintf(tw, "%s\t%s\t%d\t%d\t%.1f%%\n", e.DevA, e.DevB, e.SharedFiles, e.SharedLines, e.Weight) } return tw.Flush() } diff --git a/internal/stats/reader.go b/internal/stats/reader.go index 9c25944..d75952c 100644 --- a/internal/stats/reader.go +++ b/internal/stats/reader.go @@ -8,6 +8,7 @@ import ( "math" "os" "path/filepath" + "sort" "strings" "time" @@ -30,11 +31,23 @@ type fileEntry struct { devLines map[string]int64 devCommits map[string]int recentChurn float64 + firstChange time.Time lastChange time.Time + monthChurn map[string]int64 // key: "YYYY-MM"; used for trend classification } type filePair struct{ a, b string } +type renameEdge struct { + oldPath, newPath string + // commitDate captures when the rename was made. applyRenames sorts by + // this descending so the "first-seen wins" dedup corresponds to the + // most recent rename even if the JSONL arrives out of order (e.g. if + // the extractor ever emits --reverse or a future format change + // reshuffles the stream). + commitDate time.Time +} + type Dataset struct { // Summary CommitCount int @@ -58,6 +71,9 @@ type Dataset struct { couplingPairs map[filePair]int couplingFileChanges map[string]int + // Rename edges captured during ingest; resolved in finalizeDataset. + renameEdges []renameEdge + // Internal accumulators contribDays map[string]map[string]struct{} // email → set of active dates contribFiles map[string]map[string]struct{} // email → set of file paths @@ -154,6 +170,7 @@ func streamLoadInto(ds *Dataset, r io.Reader, opt LoadOptions, pathPrefix string // Coupling streaming state var coupCurrentSHA string var coupCurrentFiles []string + var coupCurrentChurn int64 dayIndex := map[time.Weekday]int{ time.Monday: 0, time.Tuesday: 1, time.Wednesday: 2, @@ -222,9 +239,12 @@ func streamLoadInto(ds *Dataset, r io.Reader, opt LoadOptions, pathPrefix string cs.Additions += c.Additions cs.Deletions += c.Deletions - // Contributor detail: active days, first/last date + // Contributor detail: active days, first/last date. + // Active days bucket in UTC so the count is stable across the + // distributed-team case where two commits near midnight local + // would otherwise land in different local dates. if !t.IsZero() { - dayKey := t.Format("2006-01-02") + dayKey := t.UTC().Format("2006-01-02") if ds.contribDays[c.AuthorEmail] == nil { ds.contribDays[c.AuthorEmail] = make(map[string]struct{}) } @@ -272,6 +292,23 @@ func streamLoadInto(ds *Dataset, r io.Reader, opt LoadOptions, pathPrefix string } } + // Capture rename edges (git log emits status "R" followed by + // similarity score, e.g. "R100"). Chains are resolved in + // finalizeDataset once all edges are known — required because + // the log is emitted newest-first, so pre-rename history comes + // later in the stream than the rename commit itself. 
+ if strings.HasPrefix(cf.Status, "R") && cf.PathPrevious != "" && cf.PathCurrent != "" && cf.PathPrevious != cf.PathCurrent { + var edgeDate time.Time + if cm := ds.commits[cf.Commit]; cm != nil { + edgeDate = cm.date + } + ds.renameEdges = append(ds.renameEdges, renameEdge{ + oldPath: pathPrefix + cf.PathPrevious, + newPath: pathPrefix + cf.PathCurrent, + commitDate: edgeDate, + }) + } + path := cf.PathCurrent if path == "" { path = cf.PathPrevious @@ -286,7 +323,11 @@ func streamLoadInto(ds *Dataset, r io.Reader, opt LoadOptions, pathPrefix string // File aggregation (hotspots + busfactor + churn-risk) fe, ok := ds.files[path] if !ok { - fe = &fileEntry{devLines: make(map[string]int64), devCommits: make(map[string]int)} + fe = &fileEntry{ + devLines: make(map[string]int64), + devCommits: make(map[string]int), + monthChurn: make(map[string]int64), + } ds.files[path] = fe } fe.commits++ @@ -311,16 +352,22 @@ func streamLoadInto(ds *Dataset, r io.Reader, opt LoadOptions, pathPrefix string if cm.date.After(fe.lastChange) { fe.lastChange = cm.date } + if fe.firstChange.IsZero() || cm.date.Before(fe.firstChange) { + fe.firstChange = cm.date + } + fe.monthChurn[cm.date.UTC().Format("2006-01")] += cf.Additions + cf.Deletions } } // Coupling: streaming pair computation if cf.Commit != coupCurrentSHA { - flushCoupling(ds, coupCurrentFiles, opt.CoupMaxFiles) + flushCoupling(ds, coupCurrentFiles, coupCurrentChurn, opt.CoupMaxFiles) coupCurrentSHA = cf.Commit coupCurrentFiles = coupCurrentFiles[:0] + coupCurrentChurn = 0 } coupCurrentFiles = append(coupCurrentFiles, path) + coupCurrentChurn += cf.Additions + cf.Deletions case model.DevType: // dev records are skipped — DevCount is derived from contributors (authors only) @@ -331,7 +378,7 @@ func streamLoadInto(ds *Dataset, r io.Reader, opt LoadOptions, pathPrefix string return fmt.Errorf("read: %w", err) } - flushCoupling(ds, coupCurrentFiles, opt.CoupMaxFiles) + flushCoupling(ds, coupCurrentFiles, coupCurrentChurn, opt.CoupMaxFiles) ds.UniqueFileCount += len(uniqueFiles) return nil @@ -346,14 +393,18 @@ func finalizeDataset(ds *Dataset) { } } + applyRenames(ds) + // UniqueFileCount reflects post-merge canonical paths. + ds.UniqueFileCount = len(ds.files) + for email, cs := range ds.contributors { cs.ActiveDays = len(ds.contribDays[email]) cs.FilesTouched = len(ds.contribFiles[email]) if t, ok := ds.contribFirst[email]; ok { - cs.FirstDate = t.Format("2006-01-02") + cs.FirstDate = t.UTC().Format("2006-01-02") } if t, ok := ds.contribLast[email]; ok { - cs.LastDate = t.Format("2006-01-02") + cs.LastDate = t.UTC().Format("2006-01-02") } } ds.contribDays = nil @@ -362,7 +413,205 @@ func finalizeDataset(ds *Dataset) { ds.contribLast = nil } -func flushCoupling(ds *Dataset, files []string, maxFiles int) { +// applyRenames resolves rename edges captured during ingest and re-keys +// ds.files / coupling maps to their canonical (latest) path. Safe to call +// with no edges. Defensive against cycles (A→B→A) — rare but possible if +// a repo reverted a rename. +func applyRenames(ds *Dataset) { + if len(ds.renameEdges) == 0 { + return + } + + // Distinguish two scenarios that both produce duplicate oldPaths: + // + // Path reuse (different lineages): + // t1: A → B t2: (unrelated file created at A) t3: A → D + // The repeated A has no intermediate edge pointing at it — it + // simply reappears. ds.files["A"] contains two lineages merged at + // ingest time and we cannot disambiguate without per-commit + // temporal tracking. Skip migration. 
+ // + // Rename-back chain (same lineage): + // t1: A → B t2: B → A (rename back) t3: A → C + // The repeated A is recreated by an explicit rename edge (B → A). + // Both segments of the chain belong to the same lineage and both + // should end up at C. Migrate normally. + // + // Heuristic: treat oldPath as reuse iff count > 1 AND no edge has + // it as newPath. Imperfect on exotic mixed cases (A→B, C→A, A→D) + // but correctly handles pure reuse and the common rename-back flow. + + // Sort edges by commitDate desc so the dedup below ("first seen for + // each oldPath wins") corresponds to the chronologically newest + // rename regardless of the order the stream arrived in. Otherwise we + // would silently break if the extractor ever changed from newest- + // first emission. + sort.SliceStable(ds.renameEdges, func(i, j int) bool { + return ds.renameEdges[i].commitDate.After(ds.renameEdges[j].commitDate) + }) + + oldPathCounts := make(map[string]int) + isRenameTarget := make(map[string]bool) + // maxEdgeDate[path] holds the most recent commitDate of any edge + // involving `path` — as source OR target. The single-edge reuse + // check below compares ds.files[path].lastChange against this, so + // legitimate rename-back activity (which is bounded by a later + // rename's commitDate) never looks like reuse. + maxEdgeDate := make(map[string]time.Time) + for _, e := range ds.renameEdges { + oldPathCounts[e.oldPath]++ + isRenameTarget[e.newPath] = true + if e.commitDate.After(maxEdgeDate[e.oldPath]) { + maxEdgeDate[e.oldPath] = e.commitDate + } + if e.commitDate.After(maxEdgeDate[e.newPath]) { + maxEdgeDate[e.newPath] = e.commitDate + } + } + + direct := make(map[string]string, len(ds.renameEdges)) + for _, e := range ds.renameEdges { + isDuplicate := oldPathCounts[e.oldPath] > 1 + wasRecreated := isRenameTarget[e.oldPath] + if isDuplicate && !wasRecreated { + continue // path reused — refuse to migrate (multi-edge pattern) + } + // Single-edge reuse: oldPath has activity AFTER every rename it + // participates in. Common case: A → B, then an unrelated file + // is created at A and never renamed. Catches both simple reuse + // (A→B + new-A) and the chain+reuse case (D→A, A→B, then new-A) + // where the wasRecreated signal alone can't tell them apart. + // + // Rename-back chains (A→B→A→C) and cycles (A↔B) naturally pass + // this check because maxEdgeDate includes later edges that + // recreate the path, bounding lastChange from above. + // + // Both dates must be non-zero — defensive when fileEntry is + // hand-built in tests without dates. In real ingest both are + // populated from the JSONL stream. + if fe, ok := ds.files[e.oldPath]; ok && + !maxEdgeDate[e.oldPath].IsZero() && !fe.lastChange.IsZero() && + fe.lastChange.After(maxEdgeDate[e.oldPath]) { + continue + } + // First edge wins per oldPath; edges are pre-sorted so this is + // the chronologically newest rename (the direction the lineage + // continues in). + if _, ok := direct[e.oldPath]; !ok { + direct[e.oldPath] = e.newPath + } + } + + canonical := func(p string) string { + seen := map[string]struct{}{} + for { + if _, ok := seen[p]; ok { + return p // cycle detected — bail with current + } + seen[p] = struct{}{} + next, ok := direct[p] + if !ok || next == p { + return p + } + p = next + } + } + + // Re-key file entries, merging colliders. 
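+ // Illustrative chain (hypothetical paths): given edges a.go → b.go
+ // and b.go → c.go, canonical() resolves both a.go and b.go to c.go,
+ // so entries keyed at a.go or b.go fold into c.go via mergeFileEntry.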
+ newFiles := make(map[string]*fileEntry, len(ds.files)) + for path, fe := range ds.files { + c := canonical(path) + if existing, ok := newFiles[c]; ok { + mergeFileEntry(existing, fe) + } else { + newFiles[c] = fe + } + } + ds.files = newFiles + + // Re-key coupling denominator. + newChanges := make(map[string]int, len(ds.couplingFileChanges)) + for path, count := range ds.couplingFileChanges { + newChanges[canonical(path)] += count + } + ds.couplingFileChanges = newChanges + + // Re-key coupling pairs. Drop pairs that collapse onto themselves + // (both sides resolved to the same canonical path = rename chain). + newPairs := make(map[filePair]int, len(ds.couplingPairs)) + for pair, count := range ds.couplingPairs { + ca, cb := canonical(pair.a), canonical(pair.b) + if ca == cb { + continue + } + if ca > cb { + ca, cb = cb, ca + } + newPairs[filePair{a: ca, b: cb}] += count + } + ds.couplingPairs = newPairs + + // Re-key contributor file sets so FilesTouched reflects canonical paths. + // Without this, a dev who edited old.go and new.go across a rename would + // be counted as touching two files instead of one. + for email, paths := range ds.contribFiles { + newSet := make(map[string]struct{}, len(paths)) + for p := range paths { + newSet[canonical(p)] = struct{}{} + } + ds.contribFiles[email] = newSet + } +} + +// mergeFileEntry folds src into dst: sums scalars, unions maps, keeps the +// widest firstChange→lastChange span. +func mergeFileEntry(dst, src *fileEntry) { + dst.commits += src.commits + dst.additions += src.additions + dst.deletions += src.deletions + dst.recentChurn += src.recentChurn + + if dst.devLines == nil { + dst.devLines = make(map[string]int64) + } + for k, v := range src.devLines { + dst.devLines[k] += v + } + if dst.devCommits == nil { + dst.devCommits = make(map[string]int) + } + for k, v := range src.devCommits { + dst.devCommits[k] += v + } + if dst.monthChurn == nil { + dst.monthChurn = make(map[string]int64) + } + for k, v := range src.monthChurn { + dst.monthChurn[k] += v + } + + if !src.firstChange.IsZero() { + if dst.firstChange.IsZero() || src.firstChange.Before(dst.firstChange) { + dst.firstChange = src.firstChange + } + } + if src.lastChange.After(dst.lastChange) { + dst.lastChange = src.lastChange + } +} + +// isMechanicalRefactor returns true when a commit's shape matches a likely +// rename / format / lint pass: many files with very low mean churn per file. +// Such commits generate spurious coupling pairs and are filtered in +// flushCoupling. Thresholds are package constants (refactor*). +func isMechanicalRefactor(fileCount int, commitChurn int64) bool { + if fileCount < refactorMinFiles { + return false + } + return float64(commitChurn)/float64(fileCount) < refactorMaxChurnPerFile +} + +func flushCoupling(ds *Dataset, files []string, commitChurn int64, maxFiles int) { // Always count file changes (denominator for coupling %) // so single-file commits are included in the base rate. seen := make(map[string]bool, len(files)) @@ -380,6 +629,13 @@ func flushCoupling(ds *Dataset, files []string, maxFiles int) { return } + // Global renames, format/lint passes, and similar mechanical commits + // would otherwise generate spurious coupling pairs. Denominator above + // is still incremented so ChangesA/ChangesB stay honest. 
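+ // Hypothetical shapes: a gofmt pass over 40 files at ~1 changed
+ // line each (mean churn 1.0 < 5.0) is skipped, while a 12-file
+ // feature commit with 600 changed lines (mean 50.0) still counts.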
+ if isMechanicalRefactor(len(unique), commitChurn) { + return + } + for i := 0; i < len(unique); i++ { for j := i + 1; j < len(unique); j++ { a, b := unique[i], unique[j] diff --git a/internal/stats/stats.go b/internal/stats/stats.go index c0d9341..3620f18 100644 --- a/internal/stats/stats.go +++ b/internal/stats/stats.go @@ -5,6 +5,35 @@ import ( "math" "sort" "strings" + "time" +) + +// Thresholds for classification and profile categorization. Exposed as named +// constants so the values are discoverable and consistent across the package. +// See docs/METRICS.md for rationale. +const ( + // Churn Risk classification + classifyColdChurnRatio = 0.5 // recent_churn ≤ 0.5 × median → cold + classifyActiveBusFactor = 3 // bf ≥ 3 → active (shared) + classifyOldAgeDays = 180 // age ≥ 180d is "old" + classifyDecliningTrend = 0.5 // trend < 0.5 = declining + classifyTrendWindowMonths = 3 // recent vs earlier split + + // Developer profile contribution type (del/add ratio) + contribRefactorRatio = 0.8 // ratio ≥ 0.8 → refactor + contribBalancedRatio = 0.4 // 0.4 ≤ ratio < 0.8 → balanced; else growth + + // Coupling — mechanical refactor heuristic. A commit touching many + // files with very low mean churn per file is almost always a rename, + // format, or lint fix, not meaningful co-change. Pairs from such + // commits are excluded to reduce false coupling. Denominators + // (couplingFileChanges) are still counted so ChangesA/ChangesB + // remain honest totals. + refactorMinFiles = 10 + // Unit is line-churn = additions + deletions per file, not just + // additions. Strict < threshold: a commit with mean exactly 5.0 is + // NOT filtered. + refactorMaxChurnPerFile = 5.0 ) type ContributorStat struct { @@ -68,9 +97,13 @@ type ChurnRiskResult struct { Path string RecentChurn float64 BusFactor int - RiskScore float64 + RiskScore float64 // kept for CI gate compatibility; not used for ranking TotalChanges int LastChangeDate string + FirstChangeDate string + AgeDays int + Trend float64 // recent 3mo churn / earlier churn; 1 = flat, <0.5 declining, >1.5 growing + Label string // "cold" | "active" | "active-core" | "silo" | "legacy-hotspot" } type WorkingPattern struct { @@ -82,8 +115,9 @@ type WorkingPattern struct { type DevEdge struct { DevA string DevB string - SharedFiles int - Weight float64 + SharedFiles int // files where both devs contributed at least one line + SharedLines int64 // Σ min(linesA, linesB) across shared files — measures real overlap + Weight float64 // shared_files / max(files_A, files_B) * 100 (legacy) } type StatsFlags struct { @@ -110,10 +144,10 @@ func ComputeSummary(ds *Dataset) Summary { } if !ds.Earliest.IsZero() { - s.FirstCommitDate = ds.Earliest.Format("2006-01-02") + s.FirstCommitDate = ds.Earliest.UTC().Format("2006-01-02") } if !ds.Latest.IsZero() { - s.LastCommitDate = ds.Latest.Format("2006-01-02") + s.LastCommitDate = ds.Latest.UTC().Format("2006-01-02") } return s @@ -125,8 +159,12 @@ func TopContributors(ds *Dataset, n int) []ContributorStat { result = append(result, *cs) } + // Deterministic ordering under ties: commits desc, then email asc. sort.Slice(result, func(i, j int) bool { - return result[i].Commits > result[j].Commits + if result[i].Commits != result[j].Commits { + return result[i].Commits > result[j].Commits + } + return result[i].Email < result[j].Email }) if n > 0 && n < len(result) { @@ -148,8 +186,12 @@ func FileHotspots(ds *Dataset, n int) []FileStat { }) } + // Deterministic ordering under ties: commits desc, then path asc. 
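+ // (sort.Slice is not stable and Go map iteration order is
+ // randomized, so equal-commit files need the explicit path
+ // tiebreak for reproducible top-N output.)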
sort.Slice(result, func(i, j int) bool { - return result[i].Commits > result[j].Commits + if result[i].Commits != result[j].Commits { + return result[i].Commits > result[j].Commits + } + return result[i].Path < result[j].Path }) if n > 0 && n < len(result) { @@ -160,19 +202,23 @@ func FileHotspots(ds *Dataset, n int) []FileStat { type DirStat struct { Dir string - Commits int - Churn int64 - Files int - UniqueDevs int - BusFactor int + // FileTouches is the sum of per-file commit counts across files in this + // directory. A single commit touching N files in the directory contributes + // N to this number — it is NOT distinct commits. Named accordingly to + // avoid the prior "Commits" misnomer. + FileTouches int + Churn int64 + Files int + UniqueDevs int + BusFactor int } func DirectoryStats(ds *Dataset, n int) []DirStat { type dirAcc struct { - commits int - churn int64 - files int - devs map[string]int64 + fileTouches int + churn int64 + files int + devs map[string]int64 } dirs := make(map[string]*dirAcc) @@ -187,7 +233,7 @@ func DirectoryStats(ds *Dataset, n int) []DirStat { dirs[dir] = d } d.files++ - d.commits += fe.commits + d.fileTouches += fe.commits d.churn += fe.additions + fe.deletions for email, lines := range fe.devLines { d.devs[email] += lines @@ -222,16 +268,22 @@ func DirectoryStats(ds *Dataset, n int) []DirStat { } result = append(result, DirStat{ - Dir: dir, - Commits: d.commits, - Churn: d.churn, - Files: d.files, - UniqueDevs: len(d.devs), - BusFactor: bf, + Dir: dir, + FileTouches: d.fileTouches, + Churn: d.churn, + Files: d.files, + UniqueDevs: len(d.devs), + BusFactor: bf, }) } - sort.Slice(result, func(i, j int) bool { return result[i].Commits > result[j].Commits }) + // Deterministic ordering under ties: file touches desc, then dir asc. + sort.Slice(result, func(i, j int) bool { + if result[i].FileTouches != result[j].FileTouches { + return result[i].FileTouches > result[j].FileTouches + } + return result[i].Dir < result[j].Dir + }) if n > 0 && n < len(result) { result = result[:n] } @@ -247,17 +299,20 @@ func ActivityOverTime(ds *Dataset, granularity string) []ActivityBucket { continue } + // Bucket in UTC so the same commit can't fall into different periods + // depending on the author's local timezone. + d := cm.date.UTC() var key string switch granularity { case "day": - key = cm.date.Format("2006-01-02") + key = d.Format("2006-01-02") case "week": - y, w := cm.date.ISOWeek() + y, w := d.ISOWeek() key = fmt.Sprintf("%04d-W%02d", y, w) case "year": - key = cm.date.Format("2006") + key = d.Format("2006") default: - key = cm.date.Format("2006-01") + key = d.Format("2006-01") } b, ok := buckets[key] @@ -323,8 +378,14 @@ func BusFactor(ds *Dataset, n int) []BusFactorResult { }) } + // Deterministic ordering under ties: bus factor asc, then path asc. + // Ties on bf=1 are universal in real repos; without a tiebreaker the + // top-N varies between invocations (map iteration order is random). sort.Slice(result, func(i, j int) bool { - return result[i].BusFactor < result[j].BusFactor + if result[i].BusFactor != result[j].BusFactor { + return result[i].BusFactor < result[j].BusFactor + } + return result[i].Path < result[j].Path }) if n > 0 && n < len(result) { @@ -363,11 +424,18 @@ func FileCoupling(ds *Dataset, n, minCoChanges int) []CouplingResult { }) } + // Co-changes desc, coupling % desc, then path-asc final tiebreak. 
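+ // E.g. two pairs tied at 5 co-changes and 50% coupling order by
+ // FileA, then FileB; pinned by TestFileCouplingTertiaryTiebreaker.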
sort.Slice(results, func(i, j int) bool { if results[i].CoChanges != results[j].CoChanges { return results[i].CoChanges > results[j].CoChanges } - return results[i].CouplingPct > results[j].CouplingPct + if results[i].CouplingPct != results[j].CouplingPct { + return results[i].CouplingPct > results[j].CouplingPct + } + if results[i].FileA != results[j].FileA { + return results[i].FileA < results[j].FileA + } + return results[i].FileB < results[j].FileB }) if n > 0 && n < len(results) { @@ -376,11 +444,92 @@ func FileCoupling(ds *Dataset, n, minCoChanges int) []CouplingResult { return results } +// churnTrend compares churn from the last 3 months (relative to latest) to +// churn from earlier months. Returns 1 when there isn't enough signal to tell. +// +// Uses string comparison on "YYYY-MM" keys so the classification is stable +// regardless of the day-of-month of the dataset's latest commit. +// +// When the dataset span is shorter than the trend window (e.g. under a tight +// --since filter), returns 1 — without this, every file appears in the +// "recent" bucket only, which would falsely return 2 (growing from nothing) +// for the entire dataset. +// +// A file with a single month of churn is still a valid signal: if that month +// falls before the cutoff, the file's activity ended in the past (declining +// to zero → 0); if it falls at or after the cutoff, the file has only recent +// activity (growing from nothing → 2). Only an entirely empty map returns +// neutral. +func churnTrend(monthChurn map[string]int64, earliest, latest time.Time) float64 { + if len(monthChurn) == 0 || latest.IsZero() { + return 1 + } + cutoffKey := latest.UTC().AddDate(0, -classifyTrendWindowMonths, 0).Format("2006-01") + + // Dataset too narrow: the trend window extends before the earliest commit, + // so nothing can fall into the "earlier" bucket. No meaningful signal. + if !earliest.IsZero() && earliest.UTC().Format("2006-01") >= cutoffKey { + return 1 + } + + var recent, earlier int64 + for month, v := range monthChurn { + if month < cutoffKey { + earlier += v + } else { + recent += v + } + } + if earlier == 0 { + if recent == 0 { + return 1 + } + return 2 // recent-only: growing from nothing + } + if recent == 0 { + return 0 // earlier-only: declined to nothing (single-month-in-past case) + } + return float64(recent) / float64(earlier) +} + +// classifyFile assigns an actionable label based on churn, ownership, age, +// and trend. Thresholds are package constants (classify*). +func classifyFile(recentChurn, lowChurn float64, bf, ageDays int, trend float64) string { + if recentChurn <= lowChurn { + return "cold" + } + if bf >= classifyActiveBusFactor { + return "active" // shared, healthy + } + // Concentrated ownership (bf 1-2) with meaningful churn. + if ageDays < classifyOldAgeDays { + return "active-core" // new code, single author is expected + } + if trend < classifyDecliningTrend { + return "legacy-hotspot" // old + concentrated + declining → urgent + } + return "silo" // old + concentrated + stable/growing → knowledge bottleneck +} + func ChurnRisk(ds *Dataset, n int) []ChurnRiskResult { + // Compute median recentChurn as the "cold" threshold. 
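+ // Worked example (hypothetical values): churns {10, 40, 100} give
+ // median 40 and lowChurn 20, so files at or below 20 recent churn
+ // classify as "cold" regardless of ownership or age.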
+ churns := make([]float64, 0, len(ds.files)) + for _, fe := range ds.files { + if fe.recentChurn > 0 { + churns = append(churns, fe.recentChurn) + } + } + sort.Float64s(churns) + lowChurn := 0.0 + if len(churns) > 0 { + median := churns[len(churns)/2] + lowChurn = median * classifyColdChurnRatio + } + var results []ChurnRiskResult for path, fe := range ds.files { - // Compute real bus factor (80% threshold), same as BusFactor stat + // Compute bus factor (80% threshold), same as BusFactor stat. type dl struct{ lines int64 } devs := make([]dl, 0, len(fe.devLines)) var totalLines int64 @@ -406,23 +555,47 @@ func ChurnRisk(ds *Dataset, n int) []ChurnRiskResult { risk := fe.recentChurn / float64(bf) - lastDate := "" + lastDate, firstDate := "", "" if !fe.lastChange.IsZero() { - lastDate = fe.lastChange.Format("2006-01-02") + lastDate = fe.lastChange.UTC().Format("2006-01-02") + } + if !fe.firstChange.IsZero() { + firstDate = fe.firstChange.UTC().Format("2006-01-02") + } + + ageDays := 0 + if !fe.firstChange.IsZero() && !ds.Latest.IsZero() { + ageDays = int(ds.Latest.Sub(fe.firstChange).Hours() / 24) } + trend := churnTrend(fe.monthChurn, ds.Earliest, ds.Latest) + label := classifyFile(fe.recentChurn, lowChurn, bf, ageDays, trend) + results = append(results, ChurnRiskResult{ - Path: path, - RecentChurn: math.Round(fe.recentChurn*10) / 10, - BusFactor: bf, - RiskScore: math.Round(risk*10) / 10, - TotalChanges: fe.commits, - LastChangeDate: lastDate, + Path: path, + RecentChurn: math.Round(fe.recentChurn*10) / 10, + BusFactor: bf, + RiskScore: math.Round(risk*10) / 10, + TotalChanges: fe.commits, + LastChangeDate: lastDate, + FirstChangeDate: firstDate, + AgeDays: ageDays, + Trend: math.Round(trend*100) / 100, + Label: label, }) } + // Primary sort: recent churn descending (attention = where the activity is). + // Tiebreak: lower bus factor first (more concentrated = more exposed). + // Final tiebreak on path asc for determinism when integer bus_factor ties. sort.Slice(results, func(i, j int) bool { - return results[i].RiskScore > results[j].RiskScore + if results[i].RecentChurn != results[j].RecentChurn { + return results[i].RecentChurn > results[j].RecentChurn + } + if results[i].BusFactor != results[j].BusFactor { + return results[i].BusFactor < results[j].BusFactor + } + return results[i].Path < results[j].Path }) if n > 0 && n < len(results) { @@ -478,7 +651,7 @@ func TopCommits(ds *Dataset, n int) []BigCommit { SHA: sha, AuthorName: authorName, AuthorEmail: cm.email, - Date: cm.date.Format("2006-01-02"), + Date: cm.date.UTC().Format("2006-01-02"), Message: msg, Additions: cm.add, Deletions: cm.del, @@ -487,8 +660,12 @@ func TopCommits(ds *Dataset, n int) []BigCommit { }) } + // Deterministic ordering under ties: lines desc, then SHA asc. sort.Slice(result, func(i, j int) bool { - return result[i].LinesChanged > result[j].LinesChanged + if result[i].LinesChanged != result[j].LinesChanged { + return result[i].LinesChanged > result[j].LinesChanged + } + return result[i].SHA < result[j].SHA }) if n > 0 && n < len(result) { @@ -529,7 +706,8 @@ type DirScope struct { type DevCollaborator struct { Email string - SharedFiles int + SharedFiles int // files where both devs contributed at least one line + SharedLines int64 // Σ min(linesA, linesB) across shared files — mirrors DeveloperNetwork } type DevFileContrib struct { @@ -558,6 +736,65 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile { } } + // Per-dev collaborator pairs pre-computed in one pass over ds.files. 
+ // Previously computed inside the per-dev loop below as O(D × F × D_avg), + // which is O(1.2e9) on kubernetes. Pre-computing is O(F × D_per_file²) + // = O(7e5), roughly three orders of magnitude faster. + // + // Trade-off: peak memory holds one acc entry per ordered dev-pair that + // shares at least one file (~50 MB on kubernetes, ~150 MB on linux). + // Acceptable for modern machines; if it matters, a map[devPair]*collabAcc + // keyed by sorted pair would halve memory at the cost of a read-time + // sort step. + type collabAcc struct { + files int + lines int64 + } + allCollabs := make(map[string]map[string]*collabAcc) + for _, fe := range ds.files { + if len(fe.devLines) < 2 { + continue + } + devs := make([]string, 0, len(fe.devLines)) + for email := range fe.devLines { + devs = append(devs, email) + } + // Undirected pair accumulation: write to both a→b and b→a so each + // dev's map contains everyone they share the file with. + for i := 0; i < len(devs); i++ { + a := devs[i] + la := fe.devLines[a] + for j := i + 1; j < len(devs); j++ { + b := devs[j] + lb := fe.devLines[b] + overlap := la + if lb < overlap { + overlap = lb + } + if allCollabs[a] == nil { + allCollabs[a] = make(map[string]*collabAcc) + } + accA := allCollabs[a][b] + if accA == nil { + accA = &collabAcc{} + allCollabs[a][b] = accA + } + accA.files++ + accA.lines += overlap + if allCollabs[b] == nil { + allCollabs[b] = make(map[string]*collabAcc) + } + accB := allCollabs[b][a] + if accB == nil { + accB = &collabAcc{} + allCollabs[b][a] = accB + } + accB.files++ + accB.lines += overlap + } + } + } + // Per-dev work grid + monthly activity devGrid := make(map[string]*[7][24]int) devMonthly := make(map[string]map[string]*ActivityBucket) @@ -574,10 +811,13 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile { if devGrid[cm.email] == nil { devGrid[cm.email] = &[7][24]int{} } + // devGrid uses local TZ on purpose — it describes the author's + // work rhythm (when *they* were typing), not UTC instants. di := dayIdx[cm.date.Weekday()] devGrid[cm.email][di][cm.date.Hour()]++ - month := cm.date.Format("2006-01") + // Monthly bucket uses UTC for stable grouping. + month := cm.date.UTC().Format("2006-01") if devMonthly[cm.email] == nil { devMonthly[cm.email] = make(map[string]*ActivityBucket) } @@ -602,8 +842,14 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile { for path, fa := range files { topFiles = append(topFiles, DevFileContrib{Path: path, Commits: fa.commits, Churn: fa.churn}) } + // Deterministic: churn desc, then path asc. Without the path + // tiebreaker, devs with several files tied on churn would get + // different top-N across runs. sort.Slice(topFiles, func(i, j int) bool { - return topFiles[i].Churn > topFiles[j].Churn + if topFiles[i].Churn != topFiles[j].Churn { + return topFiles[i].Churn > topFiles[j].Churn + } + return topFiles[i].Path < topFiles[j].Path }) if len(topFiles) > 10 { topFiles = topFiles[:10] @@ -659,7 +905,13 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile { } scope = append(scope, DirScope{Dir: dir, Files: count, Pct: pct}) } - sort.Slice(scope, func(i, j int) bool { return scope[i].Files > scope[j].Files }) + // Deterministic: file count desc, then dir asc. 
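+ // E.g. dirs "b" and "c" tied at one file each order as b, then c;
+ // pinned by TestDevProfilesInnerSortsDeterministic.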
+ sort.Slice(scope, func(i, j int) bool { + if scope[i].Files != scope[j].Files { + return scope[i].Files > scope[j].Files + } + return scope[i].Dir < scope[j].Dir + }) if len(scope) > 5 { scope = scope[:5] } @@ -670,9 +922,9 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile { if cs.Additions > 0 { contribRatio = math.Round(float64(cs.Deletions)/float64(cs.Additions)*100) / 100 } - if contribRatio >= 0.8 { + if contribRatio >= contribRefactorRatio { contribType = "refactor" - } else if contribRatio >= 0.4 { + } else if contribRatio >= contribBalancedRatio { contribType = "balanced" } @@ -682,23 +934,25 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile { pace = math.Round(float64(cs.Commits)/float64(cs.ActiveDays)*10) / 10 } - // Collaborators: devs sharing files with this dev - collabMap := make(map[string]int) - for _, fe := range ds.files { - if _, ok := fe.devLines[email]; !ok { - continue - } - for otherEmail := range fe.devLines { - if otherEmail != email { - collabMap[otherEmail]++ - } - } - } + // Collaborators: looked up from the pre-computed allCollabs map + // (built once before this loop). SharedLines uses the min-per-file + // overlap (same semantics as DeveloperNetwork). var collabs []DevCollaborator - for e, count := range collabMap { - collabs = append(collabs, DevCollaborator{Email: e, SharedFiles: count}) + if m, ok := allCollabs[email]; ok { + for e, acc := range m { + collabs = append(collabs, DevCollaborator{Email: e, SharedFiles: acc.files, SharedLines: acc.lines}) + } } - sort.Slice(collabs, func(i, j int) bool { return collabs[i].SharedFiles > collabs[j].SharedFiles }) + // Deterministic: shared-lines desc, shared-files desc, email asc. + sort.Slice(collabs, func(i, j int) bool { + if collabs[i].SharedLines != collabs[j].SharedLines { + return collabs[i].SharedLines > collabs[j].SharedLines + } + if collabs[i].SharedFiles != collabs[j].SharedFiles { + return collabs[i].SharedFiles > collabs[j].SharedFiles + } + return collabs[i].Email < collabs[j].Email + }) if len(collabs) > 5 { collabs = collabs[:5] } @@ -715,8 +969,12 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile { }) } + // Deterministic ordering under ties: commits desc, then email asc. sort.Slice(profiles, func(i, j int) bool { - return profiles[i].Commits > profiles[j].Commits + if profiles[i].Commits != profiles[j].Commits { + return profiles[i].Commits > profiles[j].Commits + } + return profiles[i].Email < profiles[j].Email }) return profiles @@ -724,7 +982,11 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile { func DeveloperNetwork(ds *Dataset, n, minSharedFiles int) []DevEdge { type devPair struct{ a, b string } - pairFiles := make(map[devPair]int) + type pairAcc struct { + files int + sharedLines int64 + } + pairs := make(map[devPair]*pairAcc) devFileCount := make(map[string]int) for _, fe := range ds.files { @@ -741,14 +1003,28 @@ func DeveloperNetwork(ds *Dataset, n, minSharedFiles int) []DevEdge { if a > b { a, b = b, a } - pairFiles[devPair{a, b}]++ + acc, ok := pairs[devPair{a, b}] + if !ok { + acc = &pairAcc{} + pairs[devPair{a, b}] = acc + } + acc.files++ + // Real overlap signal: min of each dev's line contribution to + // the file. If Alice edited 1 line of README and Bob edited + // 200, they share 1 line of real collaboration, not 200. 
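+ // Across files the per-file minima sum: min(1,1000) + min(500,500)
+ // = 501; see TestDeveloperNetworkSharedLinesWeighsRealOverlap.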
+ la, lb := fe.devLines[a], fe.devLines[b] + if la < lb { + acc.sharedLines += la + } else { + acc.sharedLines += lb + } } } } var results []DevEdge - for p, shared := range pairFiles { - if shared < minSharedFiles { + for p, acc := range pairs { + if acc.files < minSharedFiles { continue } maxFiles := devFileCount[p.a] @@ -757,18 +1033,29 @@ func DeveloperNetwork(ds *Dataset, n, minSharedFiles int) []DevEdge { } weight := 0.0 if maxFiles > 0 { - weight = float64(shared) / float64(maxFiles) * 100 + weight = float64(acc.files) / float64(maxFiles) * 100 } results = append(results, DevEdge{ DevA: p.a, DevB: p.b, - SharedFiles: shared, + SharedFiles: acc.files, + SharedLines: acc.sharedLines, Weight: math.Round(weight*10) / 10, }) } + // Shared-lines desc, shared-files desc, then dev-pair-asc final tiebreak. sort.Slice(results, func(i, j int) bool { - return results[i].SharedFiles > results[j].SharedFiles + if results[i].SharedLines != results[j].SharedLines { + return results[i].SharedLines > results[j].SharedLines + } + if results[i].SharedFiles != results[j].SharedFiles { + return results[i].SharedFiles > results[j].SharedFiles + } + if results[i].DevA != results[j].DevA { + return results[i].DevA < results[j].DevA + } + return results[i].DevB < results[j].DevB }) if n > 0 && n < len(results) { diff --git a/internal/stats/stats_test.go b/internal/stats/stats_test.go index d19dd13..743bf48 100644 --- a/internal/stats/stats_test.go +++ b/internal/stats/stats_test.go @@ -1,6 +1,7 @@ package stats import ( + "fmt" "os" "strings" "testing" @@ -169,6 +170,363 @@ func TestBusFactor(t *testing.T) { } } +// makeTiesDataset builds a dataset where every primary sort key ties, so +// only tiebreakers determine ordering. Used to guard against non-determinism +// regressions across every sort function in the package. 
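+// The fixture also spreads its six files across three directories
+// (a/, b/, c/) so DirectoryStats ties on every column as well.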
+func makeTiesDataset() *Dataset { + t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC) + + ds := &Dataset{ + CommitCount: 6, + Earliest: t1, + Latest: t1, + commits: map[string]*commitEntry{ + "sha1": {email: "dev-a@x", date: t1, add: 10, del: 5, files: 1}, + "sha2": {email: "dev-b@x", date: t1, add: 10, del: 5, files: 1}, + "sha3": {email: "dev-c@x", date: t1, add: 10, del: 5, files: 1}, + "sha4": {email: "dev-d@x", date: t1, add: 10, del: 5, files: 1}, + "sha5": {email: "dev-e@x", date: t1, add: 10, del: 5, files: 1}, + "sha6": {email: "dev-f@x", date: t1, add: 10, del: 5, files: 1}, + }, + contributors: map[string]*ContributorStat{ + "dev-a@x": {Email: "dev-a@x", Name: "A", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5}, + "dev-b@x": {Email: "dev-b@x", Name: "B", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5}, + "dev-c@x": {Email: "dev-c@x", Name: "C", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5}, + "dev-d@x": {Email: "dev-d@x", Name: "D", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5}, + "dev-e@x": {Email: "dev-e@x", Name: "E", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5}, + "dev-f@x": {Email: "dev-f@x", Name: "F", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5}, + }, + files: map[string]*fileEntry{ + "a/one.go": {commits: 1, additions: 10, deletions: 5, devLines: map[string]int64{"dev-a@x": 15}, monthChurn: map[string]int64{"2024-01": 15}, firstChange: t1, lastChange: t1}, + "a/two.go": {commits: 1, additions: 10, deletions: 5, devLines: map[string]int64{"dev-b@x": 15}, monthChurn: map[string]int64{"2024-01": 15}, firstChange: t1, lastChange: t1}, + "b/one.go": {commits: 1, additions: 10, deletions: 5, devLines: map[string]int64{"dev-c@x": 15}, monthChurn: map[string]int64{"2024-01": 15}, firstChange: t1, lastChange: t1}, + "b/two.go": {commits: 1, additions: 10, deletions: 5, devLines: map[string]int64{"dev-d@x": 15}, monthChurn: map[string]int64{"2024-01": 15}, firstChange: t1, lastChange: t1}, + "c/one.go": {commits: 1, additions: 10, deletions: 5, devLines: map[string]int64{"dev-e@x": 15}, monthChurn: map[string]int64{"2024-01": 15}, firstChange: t1, lastChange: t1}, + "c/two.go": {commits: 1, additions: 10, deletions: 5, devLines: map[string]int64{"dev-f@x": 15}, monthChurn: map[string]int64{"2024-01": 15}, firstChange: t1, lastChange: t1}, + }, + couplingPairs: map[filePair]int{}, + couplingFileChanges: map[string]int{}, + } + return ds +} + +// TestAllSortsDeterministicUnderTies ensures every sort function in the +// package produces identical ordering across runs when the primary sort key +// ties. Without tiebreakers, Go map iteration order + unstable sort.Slice +// cause the CLI and the HTML report to show different top-N entries. This +// test guards every function that ranks results. 
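+// To chase a suspected flake locally, rerun it many times, e.g.:
+//
+//	go test -run TestAllSortsDeterministicUnderTies -count=50 ./internal/stats
+//
+// (package path assumed from this diff's file headers).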
+func TestAllSortsDeterministicUnderTies(t *testing.T) { + ds := makeTiesDataset() + + identify := func(v interface{}) []string { + switch r := v.(type) { + case []BusFactorResult: + out := make([]string, len(r)) + for i, e := range r { + out[i] = e.Path + } + return out + case []FileStat: + out := make([]string, len(r)) + for i, e := range r { + out[i] = e.Path + } + return out + case []ContributorStat: + out := make([]string, len(r)) + for i, e := range r { + out[i] = e.Email + } + return out + case []DirStat: + out := make([]string, len(r)) + for i, e := range r { + out[i] = e.Dir + } + return out + case []BigCommit: + out := make([]string, len(r)) + for i, e := range r { + out[i] = e.SHA + } + return out + case []DevProfile: + out := make([]string, len(r)) + for i, e := range r { + out[i] = e.Email + } + return out + } + t.Fatalf("unknown type: %T", v) + return nil + } + + cases := []struct { + name string + run func() []string + }{ + {"BusFactor", func() []string { return identify(BusFactor(ds, 0)) }}, + {"FileHotspots", func() []string { return identify(FileHotspots(ds, 0)) }}, + {"TopContributors", func() []string { return identify(TopContributors(ds, 0)) }}, + {"DirectoryStats", func() []string { return identify(DirectoryStats(ds, 0)) }}, + {"TopCommits", func() []string { return identify(TopCommits(ds, 0)) }}, + {"DevProfiles", func() []string { return identify(DevProfiles(ds, "")) }}, + } + + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + first := c.run() + if len(first) < 2 { + t.Fatalf("fixture produced too few entries (%d) — need ties to exercise tiebreakers", len(first)) + } + for i := 0; i < 20; i++ { + got := c.run() + for j := range got { + if j >= len(first) || got[j] != first[j] { + t.Fatalf("iteration %d: non-deterministic at index %d (got %q, first run had %q)", + i, j, got[j], first[j]) + } + } + } + }) + } +} + +func TestDevProfilesCollaboratorsSharedLines(t *testing.T) { + // Alice co-edits 2 files with each of Bob and Carol. Alice has 500 + // lines in each file. Bob's overlap is trivial (1 line per file); + // Carol wrote substantive lines (500 per file). SharedFiles ties at + // 2 for both collaborators — only SharedLines can distinguish. + // Expected overlap: Bob = min(500,1)*2 = 2; Carol = min(500,500)*2 = 1000. + ds := &Dataset{ + files: map[string]*fileEntry{ + "a.go": {devLines: map[string]int64{"alice@x": 500, "bob@x": 1, "carol@x": 500}}, + "b.go": {devLines: map[string]int64{"alice@x": 500, "bob@x": 1, "carol@x": 500}}, + }, + contributors: map[string]*ContributorStat{ + "alice@x": {Email: "alice@x", Name: "A", Commits: 1, ActiveDays: 1, FilesTouched: 2}, + "bob@x": {Email: "bob@x", Name: "B", Commits: 1, ActiveDays: 1, FilesTouched: 2}, + "carol@x": {Email: "carol@x", Name: "C", Commits: 1, ActiveDays: 1, FilesTouched: 2}, + }, + } + profiles := DevProfiles(ds, "alice@x") + if len(profiles) != 1 { + t.Fatalf("profiles = %d", len(profiles)) + } + collabs := profiles[0].Collaborators + if len(collabs) != 2 { + t.Fatalf("collabs = %d, want 2", len(collabs)) + } + // SharedFiles ties at 2; SharedLines distinguishes Carol (1000) from Bob (2). 
+ if collabs[0].Email != "carol@x" || collabs[0].SharedLines != 1000 { + t.Errorf("top collaborator = {%s, lines=%d}, want {carol@x, 1000}", + collabs[0].Email, collabs[0].SharedLines) + } + if collabs[1].Email != "bob@x" || collabs[1].SharedLines != 2 { + t.Errorf("second collaborator = {%s, lines=%d}, want {bob@x, 2}", + collabs[1].Email, collabs[1].SharedLines) + } + // SharedFiles must still be populated for both. + for _, c := range collabs { + if c.SharedFiles != 2 { + t.Errorf("%s SharedFiles = %d, want 2", c.Email, c.SharedFiles) + } + } +} + +func TestFileCouplingTertiaryTiebreaker(t *testing.T) { + // Two pairs with identical CoChanges AND identical CouplingPct must + // still order deterministically by file name. + ds := &Dataset{ + couplingPairs: map[filePair]int{ + {a: "m/x.go", b: "m/y.go"}: 5, + {a: "a/x.go", b: "a/y.go"}: 5, + }, + couplingFileChanges: map[string]int{ + "m/x.go": 10, "m/y.go": 10, + "a/x.go": 10, "a/y.go": 10, + }, + } + // CouplingPct ties at 50% for both. CoChanges ties at 5. + first := FileCoupling(ds, 0, 1) + for i := 0; i < 20; i++ { + got := FileCoupling(ds, 0, 1) + if got[0].FileA != first[0].FileA { + t.Fatalf("iter %d non-deterministic: %q vs %q", i, got[0].FileA, first[0].FileA) + } + } + if first[0].FileA != "a/x.go" { + t.Errorf("top FileA = %q, want a/x.go (path asc)", first[0].FileA) + } +} + +func TestDeveloperNetworkTertiaryTiebreaker(t *testing.T) { + // Two dev pairs with identical SharedLines AND SharedFiles must order + // by (DevA, DevB) asc. + ds := &Dataset{ + files: map[string]*fileEntry{ + "f1.go": {devLines: map[string]int64{"m@x": 10, "n@x": 10}}, + "f2.go": {devLines: map[string]int64{"a@x": 10, "b@x": 10}}, + }, + } + // Both pairs: SharedFiles=1, SharedLines=10. + first := DeveloperNetwork(ds, 0, 1) + for i := 0; i < 20; i++ { + got := DeveloperNetwork(ds, 0, 1) + if got[0].DevA != first[0].DevA { + t.Fatalf("iter %d non-deterministic: %q vs %q", i, got[0].DevA, first[0].DevA) + } + } + if first[0].DevA != "a@x" || first[0].DevB != "b@x" { + t.Errorf("top pair = {%s,%s}, want {a@x,b@x} (pair asc)", first[0].DevA, first[0].DevB) + } +} + +func TestChurnRiskTertiaryTiebreaker(t *testing.T) { + // Two files with identical RecentChurn AND BusFactor must order by path asc. + t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC) + ds := &Dataset{ + Earliest: t1, + Latest: t1, + files: map[string]*fileEntry{ + "z.go": {commits: 1, recentChurn: 100, devLines: map[string]int64{"a@x": 50}, monthChurn: map[string]int64{"2024-01": 100}, firstChange: t1, lastChange: t1}, + "a.go": {commits: 1, recentChurn: 100, devLines: map[string]int64{"a@x": 50}, monthChurn: map[string]int64{"2024-01": 100}, firstChange: t1, lastChange: t1}, + "m.go": {commits: 1, recentChurn: 100, devLines: map[string]int64{"a@x": 50}, monthChurn: map[string]int64{"2024-01": 100}, firstChange: t1, lastChange: t1}, + }, + } + first := ChurnRisk(ds, 0) + for i := 0; i < 20; i++ { + got := ChurnRisk(ds, 0) + for j := range got { + if got[j].Path != first[j].Path { + t.Fatalf("iter %d [%d]: %q vs %q", i, j, got[j].Path, first[j].Path) + } + } + } + // All three tie on RecentChurn (100) and BusFactor (1). Expect a, m, z. + want := []string{"a.go", "m.go", "z.go"} + for i, p := range want { + if first[i].Path != p { + t.Errorf("[%d] = %q, want %q (tertiary path-asc tiebreak)", i, first[i].Path, p) + } + } +} + +func TestDevProfilesInnerSortsDeterministic(t *testing.T) { + // Profile-internal slices (topFiles, scope, collaborators) are assembled + // via map iteration. 
Before the tiebreaker fix they could land in any + // order on ties, producing different top-N per run. This test forces + // multiple internal ties and asserts stability across 20 invocations. + t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC) + ds := &Dataset{ + CommitCount: 4, + Earliest: t1, + Latest: t1, + commits: map[string]*commitEntry{ + "sha1": {email: "alice@x", date: t1, add: 10, del: 5, files: 4}, + "sha2": {email: "bob@x", date: t1, add: 10, del: 5, files: 1}, + "sha3": {email: "carol@x", date: t1, add: 10, del: 5, files: 1}, + "sha4": {email: "dan@x", date: t1, add: 10, del: 5, files: 1}, + }, + contributors: map[string]*ContributorStat{ + "alice@x": {Email: "alice@x", Name: "A", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5, FilesTouched: 4}, + "bob@x": {Email: "bob@x", Name: "B", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5, FilesTouched: 1}, + "carol@x": {Email: "carol@x", Name: "C", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5, FilesTouched: 1}, + "dan@x": {Email: "dan@x", Name: "D", Commits: 1, ActiveDays: 1, Additions: 10, Deletions: 5, FilesTouched: 1}, + }, + // Alice touches 4 files across 3 dirs; each of bob/carol/dan overlaps + // with Alice on exactly one file (equal collaborator strength). + files: map[string]*fileEntry{ + "a/one.go": {commits: 2, devLines: map[string]int64{"alice@x": 15, "bob@x": 15}, devCommits: map[string]int{"alice@x": 1, "bob@x": 1}}, + "a/two.go": {commits: 2, devLines: map[string]int64{"alice@x": 15, "carol@x": 15}, devCommits: map[string]int{"alice@x": 1, "carol@x": 1}}, + "b/one.go": {commits: 2, devLines: map[string]int64{"alice@x": 15, "dan@x": 15}, devCommits: map[string]int{"alice@x": 1, "dan@x": 1}}, + "c/one.go": {commits: 1, devLines: map[string]int64{"alice@x": 15}, devCommits: map[string]int{"alice@x": 1}}, + }, + } + + // Run DevProfiles(ds, "alice@x") 20 times; all three inner slices must + // be identical across iterations. + first := DevProfiles(ds, "alice@x") + if len(first) != 1 { + t.Fatalf("expected 1 profile, got %d", len(first)) + } + for i := 0; i < 20; i++ { + next := DevProfiles(ds, "alice@x")[0] + for k, f := range next.TopFiles { + if f.Path != first[0].TopFiles[k].Path { + t.Fatalf("TopFiles iter %d: [%d] got %q, first %q", i, k, f.Path, first[0].TopFiles[k].Path) + } + } + for k, s := range next.Scope { + if s.Dir != first[0].Scope[k].Dir { + t.Fatalf("Scope iter %d: [%d] got %q, first %q", i, k, s.Dir, first[0].Scope[k].Dir) + } + } + for k, c := range next.Collaborators { + if c.Email != first[0].Collaborators[k].Email { + t.Fatalf("Collaborators iter %d: [%d] got %q, first %q", i, k, c.Email, first[0].Collaborators[k].Email) + } + } + } + + // Verify the tiebreakers produced the expected asc order. + // TopFiles all tied at churn=15; path asc expected. + wantFiles := []string{"a/one.go", "a/two.go", "b/one.go", "c/one.go"} + for i, want := range wantFiles { + if first[0].TopFiles[i].Path != want { + t.Errorf("TopFiles[%d] = %q, want %q (path-asc tiebreak)", i, first[0].TopFiles[i].Path, want) + } + } + // Scope: dirs "a" (2 files), "b" (1), "c" (1) — dirs "b" and "c" tie at + // 1 file and must appear in dir-asc order. + if first[0].Scope[1].Dir != "b" || first[0].Scope[2].Dir != "c" { + t.Errorf("Scope dir-asc tiebreak failed: got %+v", first[0].Scope) + } + // Collaborators: bob, carol, dan all share exactly 1 file with alice; + // email-asc order. 
+ wantCollab := []string{"bob@x", "carol@x", "dan@x"} + for i, want := range wantCollab { + if first[0].Collaborators[i].Email != want { + t.Errorf("Collaborators[%d] = %q, want %q", i, first[0].Collaborators[i].Email, want) + } + } +} + +func TestBusFactorOrderDeterministicUnderTies(t *testing.T) { + // 6 files, all with bus factor = 1. Without a deterministic tiebreaker + // the top-N varies between invocations because Go map iteration is + // randomized. The CLI and HTML report call BusFactor separately and + // must return the same ordering — this test guards against regression. + ds := &Dataset{ + files: map[string]*fileEntry{ + "z/last.go": {commits: 1, devLines: map[string]int64{"a@x": 10}}, + "a/first.go": {commits: 1, devLines: map[string]int64{"b@x": 10}}, + "m/mid.go": {commits: 1, devLines: map[string]int64{"c@x": 10}}, + "a/second.go": {commits: 1, devLines: map[string]int64{"d@x": 10}}, + "b/one.go": {commits: 1, devLines: map[string]int64{"e@x": 10}}, + "b/two.go": {commits: 1, devLines: map[string]int64{"f@x": 10}}, + }, + } + first := BusFactor(ds, 3) + // Run 20 times; order must be identical every time. + for i := 0; i < 20; i++ { + run := BusFactor(ds, 3) + for j := range run { + if run[j].Path != first[j].Path { + t.Fatalf("iteration %d index %d: got %q, want %q", + i, j, run[j].Path, first[j].Path) + } + } + } + // Expected ordering: all bf=1, so tiebreaker is path asc. + want := []string{"a/first.go", "a/second.go", "b/one.go"} + for i, p := range want { + if first[i].Path != p { + t.Errorf("[%d] got %q, want %q", i, first[i].Path, p) + } + } +} + func TestBusFactorSingleDev(t *testing.T) { ds := &Dataset{ files: map[string]*fileEntry{ @@ -212,12 +570,175 @@ func TestChurnRisk(t *testing.T) { if len(result) != 3 { t.Fatalf("len = %d", len(result)) } - // main.go: recentChurn=100, 2 devs → risk=50 - // util.go: recentChurn=50, 1 dev → risk=50 - // readme.md: recentChurn=10, 1 dev → risk=10 - // Sorted by risk descending - if result[0].RiskScore <= result[len(result)-1].RiskScore { - t.Errorf("not sorted descending: first=%f, last=%f", result[0].RiskScore, result[len(result)-1].RiskScore) + // Sorted by RecentChurn descending. + // main.go=100, util.go=50, readme.md=10. 
+ for i := 1; i < len(result); i++ { + if result[i-1].RecentChurn < result[i].RecentChurn { + t.Errorf("not sorted by RecentChurn desc at index %d: %.1f < %.1f", + i, result[i-1].RecentChurn, result[i].RecentChurn) + } + } + if result[0].Path != "main.go" { + t.Errorf("top file = %q, want main.go", result[0].Path) + } +} + +func TestClassifyFile(t *testing.T) { + cases := []struct { + name string + recentChurn float64 + lowChurn float64 + bf, ageDays int + trend float64 + want string + }{ + {"cold: churn below threshold", 5, 50, 1, 365, 1.0, "cold"}, + {"active: shared ownership", 200, 50, 3, 365, 1.0, "active"}, + {"active-core: new code, single author", 200, 50, 1, 30, 1.0, "active-core"}, + {"silo: old + concentrated + stable", 200, 50, 2, 365, 1.0, "silo"}, + {"silo: old + concentrated + growing", 200, 50, 2, 365, 2.0, "silo"}, + {"legacy-hotspot: old + concentrated + declining", 200, 50, 1, 365, 0.3, "legacy-hotspot"}, + {"cold wins over everything when churn low", 10, 50, 1, 365, 0.1, "cold"}, + } + for _, c := range cases { + got := classifyFile(c.recentChurn, c.lowChurn, c.bf, c.ageDays, c.trend) + if got != c.want { + t.Errorf("%s: got %q, want %q", c.name, got, c.want) + } + } +} + +func TestActivityBucketUsesUTC(t *testing.T) { + // Two commits at the same UTC instant but parsed with different TZs. + // Both must fall into the same month bucket — without the UTC fix they + // would split across months depending on the author's offset. + jsonl := `{"type":"commit","sha":"a","author_name":"A","author_email":"a@x","author_date":"2024-03-31T23:00:00-05:00","additions":5,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"a","path_current":"x.go","additions":5,"deletions":0} +{"type":"commit","sha":"b","author_name":"B","author_email":"b@x","author_date":"2024-04-01T06:00:00+02:00","additions":3,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"b","path_current":"x.go","additions":3,"deletions":0} +` + // Same UTC instant (2024-04-01T04:00Z) expressed in -05:00 and +02:00. + ds, err := streamLoad(strings.NewReader(jsonl), LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + + buckets := ActivityOverTime(ds, "month") + if len(buckets) != 1 { + t.Fatalf("got %d buckets, want 1 (same UTC instant): %+v", len(buckets), buckets) + } + if buckets[0].Period != "2024-04" { + t.Errorf("bucket period = %q, want 2024-04", buckets[0].Period) + } +} + +func TestChurnTrend(t *testing.T) { + latest := time.Date(2024, 6, 15, 0, 0, 0, 0, time.UTC) + // "earliest" far before the trend window so the short-span guard does + // not fire for these cases. + earliestWide := time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC) + + // No data → neutral. + if got := churnTrend(map[string]int64{}, earliestWide, latest); got != 1 { + t.Errorf("empty → %.2f, want 1", got) + } + + // Only recent activity (nothing earlier) → growing signal (2). + recentOnly := map[string]int64{"2024-05": 100, "2024-06": 100} + if got := churnTrend(recentOnly, earliestWide, latest); got != 2 { + t.Errorf("recent-only → %.2f, want 2", got) + } + + // Declining: old months dominate. + declining := map[string]int64{ + "2024-01": 1000, "2024-02": 1000, // earlier + "2024-05": 50, "2024-06": 50, // recent + } + if got := churnTrend(declining, earliestWide, latest); got >= 0.5 { + t.Errorf("declining trend = %.2f, want < 0.5", got) + } + + // Stability across day-of-month. 
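+ // Both cutoffs resolve to the same month key: AddDate(0, -3, 0)
+ // maps 2024-06-01 and 2024-06-30 alike to "2024-03", so early must
+ // equal late.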
+ data := map[string]int64{"2024-02": 100, "2024-03": 100, "2024-06": 100} + early := churnTrend(data, earliestWide, time.Date(2024, 6, 1, 0, 0, 0, 0, time.UTC)) + late := churnTrend(data, earliestWide, time.Date(2024, 6, 30, 23, 0, 0, 0, time.UTC)) + if early != late { + t.Errorf("trend depends on day-of-month: early=%.2f late=%.2f", early, late) + } + + // Short-span guard: dataset whose entire span is inside the trend window. + // earliest=2024-05, latest=2024-06, window=3 months → cutoff=2024-03. + // earliestKey="2024-05" >= cutoffKey="2024-03" → no "earlier" bucket. + // Previously returned 2 (falsely "growing from nothing") for every file; + // now must return 1 (no signal). + shortSpan := time.Date(2024, 5, 1, 0, 0, 0, 0, time.UTC) + if got := churnTrend(recentOnly, shortSpan, latest); got != 1 { + t.Errorf("short-span → %.2f, want 1 (no signal when window < trend window)", got) + } + + // Single-month histories used to short-circuit to 1 via len(monthChurn)<2, + // silencing the two strongest trend signals: earlier-only (declined to + // nothing) and recent-only (grew from nothing). Both must now come + // through so old concentrated files can be classified as legacy-hotspot. + + // Earlier-only: a single month well before the cutoff. + earlierOnly := map[string]int64{"2023-05": 500} + if got := churnTrend(earlierOnly, earliestWide, latest); got != 0 { + t.Errorf("earlier-only single-month → %.2f, want 0 (declined to nothing)", got) + } + + // Recent-only: a single month inside the trend window. + recentSingle := map[string]int64{"2024-06": 100} + if got := churnTrend(recentSingle, earliestWide, latest); got != 2 { + t.Errorf("recent-only single-month → %.2f, want 2 (grew from nothing)", got) + } + + // Zero-value entry (possible when a month only contains renames with + // no content change: cf.Additions + cf.Deletions = 0). Both buckets + // end up zero — defensive fallthrough must return 1, not 0 or 2. + zeroOnly := map[string]int64{"2024-05": 0} + if got := churnTrend(zeroOnly, earliestWide, latest); got != 1 { + t.Errorf("zero-value single-month → %.2f, want 1 (no signal)", got) + } +} + +func TestChurnRiskLegacyHotspotFromSingleOldMonth(t *testing.T) { + // Integration: file touched only in one old month, bf=1, age > 180. + // Before the churnTrend fix (len<2 guard), this file returned + // trend=1 (stable) and landed at label=silo. After the fix, + // trend=0 (declined to nothing) routes it through the legacy-hotspot + // branch. Pins the end-to-end wiring so a future regression in + // churnTrend, classifyFile, or ChurnRisk can't silently send such + // files back to silo. 
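+ // Expected path through classifyFile: recentChurn 100 > lowChurn 50
+ // (single file, median 100), bf 1 < 3, age 1613 days (2020-01-15 to
+ // 2024-06-15) ≥ 180, trend 0 < 0.5, hence legacy-hotspot.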
+ t1 := time.Date(2020, 1, 15, 10, 0, 0, 0, time.UTC) + latest := time.Date(2024, 6, 15, 10, 0, 0, 0, time.UTC) + ds := &Dataset{ + Earliest: t1, + Latest: latest, + files: map[string]*fileEntry{ + "old.go": { + commits: 1, + recentChurn: 100, // kept high enough to clear the cold threshold + devLines: map[string]int64{"alice@x": 500}, + monthChurn: map[string]int64{"2020-01": 500}, + firstChange: t1, + lastChange: t1, + }, + }, + } + results := ChurnRisk(ds, 0) + if len(results) != 1 { + t.Fatalf("len = %d, want 1", len(results)) + } + r := results[0] + if r.Label != "legacy-hotspot" { + t.Errorf("Label = %q, want legacy-hotspot (single-old-month + bf=1 + age>180)", r.Label) + } + if r.Trend != 0 { + t.Errorf("Trend = %.2f, want 0 (earlier-only — the fix)", r.Trend) + } + if r.BusFactor != 1 { + t.Errorf("BusFactor = %d, want 1", r.BusFactor) } } @@ -251,6 +772,31 @@ func TestDeveloperNetwork(t *testing.T) { } } +func TestDeveloperNetworkSharedLinesWeighsRealOverlap(t *testing.T) { + // Two files, both touched by Alice and Bob. + // File 1: Alice=1 line, Bob=1000 lines → real overlap = 1 (the minimum) + // File 2: Alice=500, Bob=500 → real overlap = 500 + // A raw "shared files" count shows 2 in both extremes; SharedLines + // correctly reports 501, not 1001 or 2000. + ds := &Dataset{ + files: map[string]*fileEntry{ + "f1.go": {devLines: map[string]int64{"alice@x": 1, "bob@x": 1000}}, + "f2.go": {devLines: map[string]int64{"alice@x": 500, "bob@x": 500}}, + }, + } + edges := DeveloperNetwork(ds, 10, 1) + if len(edges) != 1 { + t.Fatalf("edges = %d, want 1", len(edges)) + } + e := edges[0] + if e.SharedFiles != 2 { + t.Errorf("SharedFiles = %d, want 2", e.SharedFiles) + } + if e.SharedLines != 501 { + t.Errorf("SharedLines = %d, want 501 (min(1,1000)+min(500,500))", e.SharedLines) + } +} + func TestDeveloperNetworkMinThreshold(t *testing.T) { ds := makeDataset() result := DeveloperNetwork(ds, 10, 100) // min 100 shared files @@ -385,6 +931,162 @@ func TestStreamLoadCoupling(t *testing.T) { } } +func TestCouplingExcludesMechanicalRefactor(t *testing.T) { + // A commit touching 10 files with ~1 line each (a rename or format pass) + // would generate 45 coupling pairs and swamp the real signal. The heuristic + // skips pair accumulation for such commits. + var lines []string + // c1: mechanical refactor — 10 files, 1 line each, mean churn = 1 < 5 + lines = append(lines, `{"type":"commit","sha":"c1","author_name":"A","author_email":"a@x","author_date":"2024-01-01T10:00:00Z","additions":10,"deletions":0,"files_changed":10}`) + for i := 0; i < 10; i++ { + lines = append(lines, + fmt.Sprintf(`{"type":"commit_file","commit":"c1","path_current":"f%d.go","status":"M","additions":1,"deletions":0}`, i)) + } + // c2: real multi-file change — 3 files, 100 lines each + lines = append(lines, `{"type":"commit","sha":"c2","author_name":"A","author_email":"a@x","author_date":"2024-01-02T10:00:00Z","additions":300,"deletions":0,"files_changed":3}`) + for i := 0; i < 3; i++ { + lines = append(lines, + fmt.Sprintf(`{"type":"commit_file","commit":"c2","path_current":"f%d.go","status":"M","additions":100,"deletions":0}`, i)) + } + jsonl := strings.Join(lines, "\n") + "\n" + + ds, err := streamLoad(strings.NewReader(jsonl), LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + + // c1 must NOT contribute pairs, c2 must. For files f0..f2 the only pair + // counts should come from c2 (3 pairs: {f0,f1}, {f0,f2}, {f1,f2}). 
+	for pair, count := range ds.couplingPairs {
+		if count > 1 {
+			t.Errorf("pair %v has count %d — mechanical refactor leaked into coupling", pair, count)
+		}
+	}
+	p01 := filePair{a: "f0.go", b: "f1.go"}
+	if ds.couplingPairs[p01] != 1 {
+		t.Errorf("pair {f0,f1} count = %d, want 1 (only c2 contributes)", ds.couplingPairs[p01])
+	}
+
+	// But file-change denominators MUST still include c1 — those are honest
+	// counts of how often each file appears, not a coupling signal.
+	if ds.couplingFileChanges["f0.go"] != 2 {
+		t.Errorf("f0.go changes = %d, want 2 (c1 + c2, even though c1 didn't contribute pairs)",
+			ds.couplingFileChanges["f0.go"])
+	}
+	if ds.couplingFileChanges["f9.go"] != 1 {
+		t.Errorf("f9.go changes = %d, want 1", ds.couplingFileChanges["f9.go"])
+	}
+}
+
+func TestCouplingRefactorFilterBoundaries(t *testing.T) {
+	// Exercise the threshold boundaries directly. The refactor filter uses
+	// `len(unique) >= refactorMinFiles` (10) AND
+	// `meanChurn < refactorMaxChurnPerFile` (5.0, strict). These cases
+	// cover both dimensions plus the pure-rename (zero churn) case that
+	// motivated the filter.
+	cases := []struct {
+		name         string
+		files        int
+		churnPerFile int64
+		wantPairs    int // 0 = filtered; non-zero = pairs produced
+	}{
+		{"below minFiles: 9 files zero churn", 9, 0, 9 * 8 / 2},
+		{"pure rename: 10 files zero churn", 10, 0, 0},
+		{"below threshold: 10 files × 4 lines (mean 4.0)", 10, 4, 0},
+		{"at threshold: 10 files × 5 lines (mean 5.0, strict <)", 10, 5, 10 * 9 / 2},
+		{"above threshold: 10 files × 6 lines (mean 6.0)", 10, 6, 10 * 9 / 2},
+		{"large substantive: 20 files × 50 lines", 20, 50, 20 * 19 / 2},
+	}
+
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			total := int64(c.files) * c.churnPerFile
+			var lines []string
+			lines = append(lines,
+				fmt.Sprintf(`{"type":"commit","sha":"c1","author_name":"A","author_email":"a@x","author_date":"2024-01-01T10:00:00Z","additions":%d,"deletions":0,"files_changed":%d}`,
+					total, c.files))
+			for i := 0; i < c.files; i++ {
+				lines = append(lines,
+					fmt.Sprintf(`{"type":"commit_file","commit":"c1","path_current":"f%d.go","status":"M","additions":%d,"deletions":0}`,
+						i, c.churnPerFile))
+			}
+			ds, err := streamLoad(strings.NewReader(strings.Join(lines, "\n")+"\n"),
+				LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50})
+			if err != nil {
+				t.Fatalf("streamLoad: %v", err)
+			}
+			if got := len(ds.couplingPairs); got != c.wantPairs {
+				t.Errorf("pairs = %d, want %d", got, c.wantPairs)
+			}
+			// Regardless of whether pairs were counted, each file's change
+			// denominator must equal 1 (all participated in the single commit).
+			for i := 0; i < c.files; i++ {
+				path := fmt.Sprintf("f%d.go", i)
+				if got := ds.couplingFileChanges[path]; got != 1 {
+					t.Errorf("%s file changes = %d, want 1 (denominator always counted)", path, got)
+				}
+			}
+		})
+	}
+}
+
+func TestCouplingRenameBelowFilterThreshold(t *testing.T) {
+	// Pins down current behavior at the intersection of two subsystems:
+	// rename tracking and the refactor filter. A commit renaming 8 files
+	// (below refactorMinFiles = 10) has zero churn per file, so mean < 5,
+	// BUT the fileCount threshold isn't met — the filter does not fire.
+	// Pairs between the old paths are generated during streaming, then
+	// re-keyed to the canonical (new) paths in applyRenames.
+	//
+	// Each pair has CoChanges = 1, so a sensible --coupling-min-changes
+	// threshold filters them out of user-facing output. But for callers
+	// who inspect the raw ds.couplingPairs map, the pairs are present.
+ // This test exists so any future tightening (lower refactorMinFiles, + // or explicit skip-when-all-renames) is a conscious decision backed + // by a failing test, not a silent behavior drift. + var lines []string + lines = append(lines, `{"type":"commit","sha":"c1","author_name":"A","author_email":"a@x","author_date":"2024-01-01T10:00:00Z","additions":0,"deletions":0,"files_changed":8}`) + for i := 0; i < 8; i++ { + lines = append(lines, + fmt.Sprintf(`{"type":"commit_file","commit":"c1","path_current":"new%d.go","path_previous":"old%d.go","status":"R100","additions":0,"deletions":0}`, i, i)) + } + ds, err := streamLoad(strings.NewReader(strings.Join(lines, "\n")+"\n"), + LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + + // 8 files, filter does not fire → C(8,2) = 28 pairs exist. + if got := len(ds.couplingPairs); got != 28 { + t.Errorf("pairs = %d, want 28 (current behavior — rename commits below refactorMinFiles leak)", got) + } + // Pairs must be keyed to the NEW paths (post-rename canonical), not old. + for pair := range ds.couplingPairs { + if !strings.HasPrefix(pair.a, "new") || !strings.HasPrefix(pair.b, "new") { + t.Errorf("pair %v contains old-path key; applyRenames should have re-keyed", pair) + } + } +} + +func TestCouplingKeepsNormalMultiFileCommits(t *testing.T) { + // Counter-test: a commit touching 10 files with substantial churn per + // file (40+ lines avg) must still contribute pairs — it's not a rename. + var lines []string + lines = append(lines, `{"type":"commit","sha":"c1","author_name":"A","author_email":"a@x","author_date":"2024-01-01T10:00:00Z","additions":500,"deletions":0,"files_changed":10}`) + for i := 0; i < 10; i++ { + lines = append(lines, + fmt.Sprintf(`{"type":"commit_file","commit":"c1","path_current":"g%d.go","status":"M","additions":50,"deletions":0}`, i)) + } + ds, err := streamLoad(strings.NewReader(strings.Join(lines, "\n")+"\n"), + LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + // 10 choose 2 = 45 pairs + if got := len(ds.couplingPairs); got != 45 { + t.Errorf("coupling pairs = %d, want 45 (substantial per-file churn, should not trigger refactor filter)", got) + } +} + func TestStreamLoadCouplingSingleFileCommits(t *testing.T) { // x.go and y.go co-change in c1, but x.go also changes alone in c2. // Coupling denominator for x.go should be 2 (both commits), not 1 (only multi-file). @@ -675,6 +1377,470 @@ func TestDevProfilesContribType(t *testing.T) { } } +func TestRenameMergesHistory(t *testing.T) { + // JSONL newest-first (as git log emits). 
Historical sequence: + // 1) 2024-01 c1 creates old.go + // 2) 2024-02 c2 edits old.go + // 3) 2024-03 c3 renames old.go → new.go + edits + // 4) 2024-04 c4 edits new.go + jsonl := `{"type":"commit","sha":"c4","author_name":"A","author_email":"a@x","author_date":"2024-04-15T10:00:00Z","additions":5,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c4","path_current":"new.go","path_previous":"new.go","status":"M","additions":5,"deletions":0} +{"type":"commit","sha":"c3","author_name":"A","author_email":"a@x","author_date":"2024-03-15T10:00:00Z","additions":3,"deletions":2,"files_changed":1} +{"type":"commit_file","commit":"c3","path_current":"new.go","path_previous":"old.go","status":"R100","additions":3,"deletions":2} +{"type":"commit","sha":"c2","author_name":"A","author_email":"a@x","author_date":"2024-02-10T10:00:00Z","additions":8,"deletions":1,"files_changed":1} +{"type":"commit_file","commit":"c2","path_current":"old.go","path_previous":"old.go","status":"M","additions":8,"deletions":1} +{"type":"commit","sha":"c1","author_name":"A","author_email":"a@x","author_date":"2024-01-05T10:00:00Z","additions":20,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c1","path_current":"old.go","path_previous":"old.go","status":"A","additions":20,"deletions":0} +` + ds, err := streamLoad(strings.NewReader(jsonl), LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + + if _, ok := ds.files["old.go"]; ok { + t.Errorf("old.go should be merged into new.go, but still exists") + } + fe, ok := ds.files["new.go"] + if !ok { + t.Fatalf("new.go missing") + } + // All 4 commits merged into new.go. + if fe.commits != 4 { + t.Errorf("new.go commits = %d, want 4", fe.commits) + } + // firstChange must come from c1 (oldest), not c3 (the rename). + if got := fe.firstChange.UTC().Format("2006-01-02"); got != "2024-01-05" { + t.Errorf("firstChange = %q, want 2024-01-05 (pre-rename history preserved)", got) + } + if got := fe.lastChange.UTC().Format("2006-01-02"); got != "2024-04-15" { + t.Errorf("lastChange = %q, want 2024-04-15", got) + } + // monthChurn must span all 4 months. + if len(fe.monthChurn) != 4 { + t.Errorf("monthChurn months = %d, want 4", len(fe.monthChurn)) + } + if ds.UniqueFileCount != 1 { + t.Errorf("UniqueFileCount = %d, want 1 (canonical)", ds.UniqueFileCount) + } +} + +func TestRenameChain(t *testing.T) { + // A → B (in c2), B → C (in c3). Canonical = C. 
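+	// Resolution must follow old→new edges transitively: resolve(A)
+	// walks A→B, then B→C, and stops at C (degenerate edge sets are
+	// covered separately in TestRenameCycleDoesNotCrash).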
+ jsonl := `{"type":"commit","sha":"c3","author_name":"A","author_email":"a@x","author_date":"2024-03-10T10:00:00Z","additions":1,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c3","path_current":"C.go","path_previous":"B.go","status":"R100","additions":1,"deletions":0} +{"type":"commit","sha":"c2","author_name":"A","author_email":"a@x","author_date":"2024-02-10T10:00:00Z","additions":1,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c2","path_current":"B.go","path_previous":"A.go","status":"R100","additions":1,"deletions":0} +{"type":"commit","sha":"c1","author_name":"A","author_email":"a@x","author_date":"2024-01-10T10:00:00Z","additions":10,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c1","path_current":"A.go","path_previous":"A.go","status":"A","additions":10,"deletions":0} +` + ds, err := streamLoad(strings.NewReader(jsonl), LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + for _, orphan := range []string{"A.go", "B.go"} { + if _, ok := ds.files[orphan]; ok { + t.Errorf("%s should not exist after chain rename", orphan) + } + } + fe, ok := ds.files["C.go"] + if !ok { + t.Fatal("C.go missing — chain rename not resolved") + } + if fe.commits != 3 { + t.Errorf("C.go commits = %d, want 3 (A+B+C merged)", fe.commits) + } +} + +func TestRenameCollapsesCouplingSelfPair(t *testing.T) { + // Two files A.go and B.go co-change in c1. Later c2 renames A→B (both + // end up as B.go). The pair {A, B} must NOT survive as a self-pair. + jsonl := `{"type":"commit","sha":"c2","author_name":"A","author_email":"a@x","author_date":"2024-02-10T10:00:00Z","additions":1,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c2","path_current":"B.go","path_previous":"A.go","status":"R100","additions":1,"deletions":0} +{"type":"commit","sha":"c1","author_name":"A","author_email":"a@x","author_date":"2024-01-10T10:00:00Z","additions":10,"deletions":0,"files_changed":2} +{"type":"commit_file","commit":"c1","path_current":"A.go","path_previous":"A.go","status":"A","additions":5,"deletions":0} +{"type":"commit_file","commit":"c1","path_current":"B.go","path_previous":"B.go","status":"A","additions":5,"deletions":0} +` + ds, err := streamLoad(strings.NewReader(jsonl), LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + for pair := range ds.couplingPairs { + if pair.a == pair.b { + t.Errorf("self-pair survived rename collapse: %+v", pair) + } + } +} + +func TestRenameDevFilesTouchedUsesCanonical(t *testing.T) { + // A single dev edits old.go, then renames it to new.go and edits again. + // FilesTouched should be 1 (one canonical file), not 2. 
+ jsonl := `{"type":"commit","sha":"c3","author_name":"Alice","author_email":"alice@x","author_date":"2024-03-10T10:00:00Z","additions":2,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c3","path_current":"new.go","path_previous":"new.go","status":"M","additions":2,"deletions":0} +{"type":"commit","sha":"c2","author_name":"Alice","author_email":"alice@x","author_date":"2024-02-10T10:00:00Z","additions":1,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c2","path_current":"new.go","path_previous":"old.go","status":"R100","additions":1,"deletions":0} +{"type":"commit","sha":"c1","author_name":"Alice","author_email":"alice@x","author_date":"2024-01-10T10:00:00Z","additions":5,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c1","path_current":"old.go","path_previous":"old.go","status":"A","additions":5,"deletions":0} +` + ds, err := streamLoad(strings.NewReader(jsonl), LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + alice := ds.contributors["alice@x"] + if alice == nil { + t.Fatal("alice missing") + } + if alice.FilesTouched != 1 { + t.Errorf("FilesTouched = %d, want 1 (old.go and new.go collapse to same canonical)", alice.FilesTouched) + } +} + +func TestRenameSkipsReusedOldPath(t *testing.T) { + // Path-reuse scenario: A was renamed to B in old history, then the + // name "A" was reused for an unrelated file, which was renamed to D. + // JSONL is newest-first, so renameEdges hold {A→D} first and {A→B} + // second. Both have oldPath="A" — this is the signal that lineages + // cannot be disambiguated safely. + // + // Before the fix, the first edge won the dedup and applyRenames + // migrated the merged ds.files["A"] (containing both lineages' + // pre-rename data) into "D", polluting D with lineage-1 commits + // that belong to B and stripping B of its pre-rename history. + // + // After the fix, applyRenames detects the duplicate oldPath and + // refuses to migrate either edge. A stays where it is (still with + // merged lineages, but no cross-contamination into B or D), and + // B and D each keep only their own post-rename data. This is + // underattribution, which is safer than active misattribution. + ds := newDataset() + ds.renameEdges = []renameEdge{ + {oldPath: "A", newPath: "D"}, + {oldPath: "A", newPath: "B"}, + } + ds.files = map[string]*fileEntry{ + "A": {commits: 2, monthChurn: map[string]int64{}}, + "B": {commits: 1, monthChurn: map[string]int64{}}, + "D": {commits: 1, monthChurn: map[string]int64{}}, + } + applyRenames(ds) + + a, ok := ds.files["A"] + if !ok { + t.Fatal("A should remain — path reuse must disable migration") + } + if a.commits != 2 { + t.Errorf("A commits = %d, want 2 (unchanged)", a.commits) + } + if ds.files["B"].commits != 1 { + t.Errorf("B commits = %d, want 1 (A must not have been merged in)", ds.files["B"].commits) + } + if ds.files["D"].commits != 1 { + t.Errorf("D commits = %d, want 1 (A must not have been merged in)", ds.files["D"].commits) + } +} + +func TestRenameSkipsSingleEdgeReusedOldPath(t *testing.T) { + // Path reuse with a SINGLE rename edge: A → B in old history, then + // an unrelated file is created at "A" and never renamed further. + // Only one edge involves A, so oldPathCounts[A] = 1 and the + // multi-edge heuristic cannot detect reuse. The tell is temporal: + // ds.files["A"].lastChange is AFTER the rename's commitDate, which + // means activity on "A" continued past the rename — the new file. 
+ // + // Without this check, direct[A] = B is applied and the new file's + // history silently merges into B, corrupting B's stats. + tRename := time.Date(2024, 2, 10, 0, 0, 0, 0, time.UTC) + tNewFile := time.Date(2024, 5, 10, 0, 0, 0, 0, time.UTC) + ds := newDataset() + ds.renameEdges = []renameEdge{ + {oldPath: "A", newPath: "B", commitDate: tRename}, + } + ds.files = map[string]*fileEntry{ + "A": { + commits: 2, // 1 pre-rename (lineage 1) + 1 post-rename (lineage 2) + lastChange: tNewFile, + monthChurn: map[string]int64{}, + }, + "B": { + commits: 1, + lastChange: tRename, + monthChurn: map[string]int64{}, + }, + } + applyRenames(ds) + + if _, ok := ds.files["A"]; !ok { + t.Error("A should remain — single-edge path reuse must skip migration") + } + if a, ok := ds.files["A"]; ok && a.commits != 2 { + t.Errorf("A.commits = %d, want 2 (unchanged)", a.commits) + } + if ds.files["B"].commits != 1 { + t.Errorf("B.commits = %d, want 1 (lineage 2 must not have leaked in)", ds.files["B"].commits) + } +} + +func TestRenameSingleEdgeReuseViaStream(t *testing.T) { + // End-to-end equivalent of TestRenameSkipsSingleEdgeReusedOldPath, + // exercising the full ingest path so fileEntry.lastChange and the + // renameEdge.commitDate are populated naturally from commit_file + // records. Covers the wiring between streamLoadInto and applyRenames. + // + // Timeline (encoded newest-first as git log emits): + // c6 2024-06-01: modify new A (lineage 2) + // c5 2024-05-01: new file created at A (lineage 2 starts) + // c4 2024-04-01: modify B + // c3 2024-03-15: rename A → B (ends lineage 1) + // c2 2024-02-10: modify A (lineage 1) + // c1 2024-01-05: create A (lineage 1 starts) + jsonl := `{"type":"commit","sha":"c6","author_name":"X","author_email":"x@x","author_date":"2024-06-01T10:00:00Z","additions":30,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c6","path_current":"A.go","path_previous":"A.go","status":"M","additions":30,"deletions":0} +{"type":"commit","sha":"c5","author_name":"X","author_email":"x@x","author_date":"2024-05-01T10:00:00Z","additions":20,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c5","path_current":"A.go","path_previous":"A.go","status":"A","additions":20,"deletions":0} +{"type":"commit","sha":"c4","author_name":"Y","author_email":"y@x","author_date":"2024-04-01T10:00:00Z","additions":10,"deletions":2,"files_changed":1} +{"type":"commit_file","commit":"c4","path_current":"B.go","path_previous":"B.go","status":"M","additions":10,"deletions":2} +{"type":"commit","sha":"c3","author_name":"Y","author_email":"y@x","author_date":"2024-03-15T10:00:00Z","additions":0,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c3","path_current":"B.go","path_previous":"A.go","status":"R100","additions":0,"deletions":0} +{"type":"commit","sha":"c2","author_name":"Y","author_email":"y@x","author_date":"2024-02-10T10:00:00Z","additions":15,"deletions":3,"files_changed":1} +{"type":"commit_file","commit":"c2","path_current":"A.go","path_previous":"A.go","status":"M","additions":15,"deletions":3} +{"type":"commit","sha":"c1","author_name":"Y","author_email":"y@x","author_date":"2024-01-05T10:00:00Z","additions":40,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c1","path_current":"A.go","path_previous":"A.go","status":"A","additions":40,"deletions":0} +` + ds, err := streamLoad(strings.NewReader(jsonl), LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + a, ok := ds.files["A.go"] 
+ if !ok { + t.Fatal("A.go missing — reuse detection should have skipped migration") + } + // A.go should contain c1 + c2 (lineage 1 pre-rename) + c5 + c6 (lineage + // 2 post-reuse) = 4 commits. They are conflated at the path key level + // (known limitation of non-temporal-segmented ingest), but at least + // don't leak into B. + if a.commits != 4 { + t.Errorf("A.go commits = %d, want 4 (c1+c2+c5+c6 conflated at the reused path)", a.commits) + } + b, ok := ds.files["B.go"] + if !ok { + t.Fatal("B.go missing") + } + // B.go should have c3 (rename commit) + c4 (modify) = 2 commits. + if b.commits != 2 { + t.Errorf("B.go commits = %d, want 2 (c3+c4, no leak from A)", b.commits) + } +} + +func TestRenameChainPlusReuse(t *testing.T) { + // Refinement catches a case that the pure wasRecreated guard missed: + // D → A → B chain followed by a NEW unrelated file created at A. + // A is both a rename target (via D→A) and a rename source (via + // A→B), so wasRecreated[A] = true. Under the old single-edge + // check the reuse detection was disabled here, letting the new A + // lineage leak into B. Using maxEdgeDate (the latest edge touching + // A), we see that lastChange(A) is AFTER the last A-involving + // rename and correctly skip. + tD_to_A := time.Date(2024, 1, 10, 0, 0, 0, 0, time.UTC) + tA_to_B := time.Date(2024, 3, 10, 0, 0, 0, 0, time.UTC) + tReuse := time.Date(2024, 6, 10, 0, 0, 0, 0, time.UTC) + ds := newDataset() + ds.renameEdges = []renameEdge{ + {oldPath: "A", newPath: "B", commitDate: tA_to_B}, + {oldPath: "D", newPath: "A", commitDate: tD_to_A}, + } + ds.files = map[string]*fileEntry{ + "D": {commits: 1, lastChange: tD_to_A, monthChurn: map[string]int64{}}, + "A": {commits: 3, lastChange: tReuse, monthChurn: map[string]int64{}}, // post-reuse + "B": {commits: 1, lastChange: tA_to_B, monthChurn: map[string]int64{}}, + } + applyRenames(ds) + + // A must stay — reuse detected. + if _, ok := ds.files["A"]; !ok { + t.Fatal("A should remain — reuse detected by lastChange > maxEdgeDate") + } + // D still migrates through its own rename edge (D has no post- + // rename activity). D → A because direct[A] was not set. + if _, ok := ds.files["D"]; ok { + t.Error("D should have been migrated (clean rename with no reuse on D)") + } + // B stays — A did not leak in. + if b := ds.files["B"]; b.commits != 1 { + t.Errorf("B.commits = %d, want 1 (A's reuse lineage must not leak into B)", b.commits) + } +} + +func TestRenameCleanRenameStillMigrates(t *testing.T) { + // Counter-test: the single-edge reuse check must not fire for a + // clean rename where all activity on oldPath predates the rename. + // lastChange(A) ≤ commitDate(A→B) → migration proceeds, A merges + // into B. + tLastA := time.Date(2024, 1, 10, 0, 0, 0, 0, time.UTC) + tRename := time.Date(2024, 2, 10, 0, 0, 0, 0, time.UTC) + ds := newDataset() + ds.renameEdges = []renameEdge{ + {oldPath: "A", newPath: "B", commitDate: tRename}, + } + ds.files = map[string]*fileEntry{ + "A": {commits: 3, lastChange: tLastA, monthChurn: map[string]int64{}}, + "B": {commits: 1, lastChange: tRename, monthChurn: map[string]int64{}}, + } + applyRenames(ds) + + if _, ok := ds.files["A"]; ok { + t.Error("A should have been migrated into B (clean rename)") + } + if ds.files["B"].commits != 4 { + t.Errorf("B.commits = %d, want 4 (A=3 + B=1 merged)", ds.files["B"].commits) + } +} + +func TestRenameBackThenRenameAgain(t *testing.T) { + // Same-lineage chain with a rename-back: A → B → A → C. 
The repeated + // "A" is NOT path reuse — the intermediate edge B → A recreates A + // from B, so both A-segments belong to the same lineage and both + // should collapse into C. Without the hasIncomingRename check this + // would be mistaken for reuse and skipped, leaving stale B files + // and an un-migrated A with merged-but-untouched history. + ds := newDataset() + ds.renameEdges = []renameEdge{ + {oldPath: "A", newPath: "C"}, // newest rename + {oldPath: "B", newPath: "A"}, // middle: rename back + {oldPath: "A", newPath: "B"}, // oldest rename + } + ds.files = map[string]*fileEntry{ + "A": {commits: 2, monthChurn: map[string]int64{}}, // pre-first + post-back, merged + "B": {commits: 1, monthChurn: map[string]int64{}}, // mid-chain activity + "C": {commits: 1, monthChurn: map[string]int64{}}, // post-final activity + } + applyRenames(ds) + + for _, orphan := range []string{"A", "B"} { + if _, ok := ds.files[orphan]; ok { + t.Errorf("%s should have been merged into C (rename-back chain, same lineage)", orphan) + } + } + c, ok := ds.files["C"] + if !ok { + t.Fatal("C missing — chain should have collapsed into it") + } + if c.commits != 4 { + t.Errorf("C.commits = %d, want 4 (A=2 + B=1 + C=1 merged)", c.commits) + } +} + +func TestRenameChainNotMistakenForReuse(t *testing.T) { + // Guard against the fix being too aggressive: a genuine chain + // A→B→C has DIFFERENT oldPaths (A once, B once) and must still + // be collapsed into C as before. + ds := newDataset() + ds.renameEdges = []renameEdge{ + {oldPath: "B", newPath: "C"}, + {oldPath: "A", newPath: "B"}, + } + ds.files = map[string]*fileEntry{ + "A": {commits: 1, monthChurn: map[string]int64{}}, + "B": {commits: 1, monthChurn: map[string]int64{}}, + "C": {commits: 1, monthChurn: map[string]int64{}}, + } + applyRenames(ds) + + if _, ok := ds.files["A"]; ok { + t.Error("A should have been merged — not a reuse case") + } + if _, ok := ds.files["B"]; ok { + t.Error("B should have been merged — not a reuse case") + } + if ds.files["C"].commits != 3 { + t.Errorf("C.commits = %d, want 3 (A+B+C chain)", ds.files["C"].commits) + } +} + +func TestRenameCycleDoesNotCrash(t *testing.T) { + // Degenerate: A→B then B→A. Canonical resolution should bail out of + // the cycle instead of infinite-looping, and — since migration is + // ambiguous — both files survive with their own data rather than + // one being silently merged into the other. + jsonl := `{"type":"commit","sha":"c2","author_name":"A","author_email":"a@x","author_date":"2024-02-10T10:00:00Z","additions":1,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c2","path_current":"A.go","path_previous":"B.go","status":"R100","additions":1,"deletions":0} +{"type":"commit","sha":"c1","author_name":"A","author_email":"a@x","author_date":"2024-01-10T10:00:00Z","additions":1,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"c1","path_current":"B.go","path_previous":"A.go","status":"R100","additions":1,"deletions":0} +` + ds, err := streamLoad(strings.NewReader(jsonl), LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}) + if err != nil { + t.Fatalf("streamLoad: %v", err) + } + // Both files must survive the cycle. If the bail logic accidentally + // routed one into the other, this would fail. 
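+	// Stricter form of the same invariant: exactly the two cycle files.
+	if got := len(ds.files); got != 2 {
+		t.Errorf("files = %d, want 2 (neither side of the cycle merged away)", got)
+	}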
+ if _, ok := ds.files["A.go"]; !ok { + t.Error("A.go missing — cycle bail should leave both files intact") + } + if _, ok := ds.files["B.go"]; !ok { + t.Error("B.go missing — cycle bail should leave both files intact") + } +} + +func TestRenameMixedLineageKnownLimitation(t *testing.T) { + // Exotic mixed case documented in METRICS.md as a known limitation: + // + // t1: A → B (lineage 1 renames A away) + // t2: C → A (lineage 2: a different file C takes the name A) + // t3: A → D (lineage 2 renames away) + // + // Because the repeated oldPath "A" has an incoming edge (C → A), the + // heuristic treats the two A-edges as a rename-back chain and migrates. + // Lineage 1's pre-A→B commits end up misattributed to D. + // + // Pinning the current behavior here so that a future refactor cannot + // silently change the outcome — any future fix (proper temporal + // segmentation) should update or remove this test consciously. + // + // NOTE: fileEntry.lastChange is intentionally left zero. The single- + // edge reuse check (`lastChange.After(maxEdgeDate)`) requires non- + // zero timestamps to fire; leaving them unset exercises the mixed- + // lineage "chain-like" path that the heuristic currently cannot + // disambiguate. If a test update accidentally sets lastChange, this + // fixture becomes a different scenario (chain+reuse) which IS + // detected by the refined check and produces a different result. + ds := newDataset() + tNew := time.Date(2024, 3, 1, 0, 0, 0, 0, time.UTC) + tMid := time.Date(2024, 2, 1, 0, 0, 0, 0, time.UTC) + tOld := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC) + ds.renameEdges = []renameEdge{ + {oldPath: "A", newPath: "D", commitDate: tNew}, // t3 + {oldPath: "C", newPath: "A", commitDate: tMid}, // t2 + {oldPath: "A", newPath: "B", commitDate: tOld}, // t1 + } + ds.files = map[string]*fileEntry{ + "A": {commits: 2, monthChurn: map[string]int64{}}, // merged lineages (no lastChange on purpose) + "B": {commits: 1, monthChurn: map[string]int64{}}, + "C": {commits: 1, monthChurn: map[string]int64{}}, + "D": {commits: 1, monthChurn: map[string]int64{}}, + } + applyRenames(ds) + + // Current (imperfect) behavior: heuristic sees hasIncomingRename[A]=true + // via C→A, so it proceeds with migration. Newest edge wins: A → D. + // Both C and A collapse into D. B keeps only its post-rename data. + if _, ok := ds.files["A"]; ok { + t.Error("A should have been migrated (current behavior — chain-like)") + } + if _, ok := ds.files["C"]; ok { + t.Error("C should have been migrated into D") + } + d, ok := ds.files["D"] + if !ok { + t.Fatal("D missing") + } + if d.commits != 4 { + t.Errorf("D.commits = %d, want 4 (A=2 + C=1 + D=1 merged — lineage 1's pre-A→B data is misattributed here)", d.commits) + } + // B keeps only its post-rename data. Lineage 1's pre-rename history + // that should live here is instead at D. 
+ if ds.files["B"].commits != 1 { + t.Errorf("B.commits = %d, want 1 (lineage-1 pre-rename data leaked to D)", ds.files["B"].commits) + } +} + func TestStreamLoadFullPipeline(t *testing.T) { jsonl := `{"type":"commit","sha":"c1","tree":"t","parents":["p1","p2"],"author_name":"Alice","author_email":"alice@x.com","author_date":"2024-03-15T10:00:00Z","committer_name":"Alice","committer_email":"alice@x.com","committer_date":"2024-03-15T10:00:00Z","message":"merge feature","additions":50,"deletions":10,"files_changed":3} {"type":"commit_parent","sha":"c1","parent_sha":"p1"} @@ -730,3 +1896,82 @@ func TestStreamLoadFullPipeline(t *testing.T) { t.Error("alice scope empty") } } + +// buildSyntheticLargeDataset creates a deterministic dataset shaped like a +// mid-size repo: thousands of devs, tens of thousands of files, with each +// file touched by a few devs. Used by BenchmarkDevProfiles* to exercise the +// allCollabs pre-compute path at a scale where the old O(D × F × D_avg) +// implementation was measurable. +func buildSyntheticLargeDataset(numDevs, numFiles, avgDevsPerFile, numCommits int) *Dataset { + t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC) + ds := &Dataset{ + CommitCount: numCommits, + Earliest: t1, + Latest: t1.AddDate(0, 6, 0), + commits: make(map[string]*commitEntry, numCommits), + contributors: make(map[string]*ContributorStat, numDevs), + files: make(map[string]*fileEntry, numFiles), + } + for i := 0; i < numDevs; i++ { + email := fmt.Sprintf("dev-%04d@x", i) + ds.contributors[email] = &ContributorStat{ + Email: email, Name: email, + Commits: numCommits / numDevs, ActiveDays: 1, FilesTouched: avgDevsPerFile, + Additions: 100, Deletions: 10, + } + } + for i := 0; i < numCommits; i++ { + email := fmt.Sprintf("dev-%04d@x", i%numDevs) + sha := fmt.Sprintf("c%06d", i) + ds.commits[sha] = &commitEntry{ + email: email, date: t1.Add(time.Duration(i) * time.Hour), + add: 10, del: 1, files: 1, + } + } + // Deterministic dev assignment per file: stride through dev list. + for i := 0; i < numFiles; i++ { + path := fmt.Sprintf("src/pkg%03d/file%04d.go", i/100, i) + fe := &fileEntry{ + commits: avgDevsPerFile, + additions: 100 * int64(avgDevsPerFile), + deletions: 10 * int64(avgDevsPerFile), + devLines: make(map[string]int64, avgDevsPerFile), + devCommits: make(map[string]int, avgDevsPerFile), + monthChurn: map[string]int64{"2024-01": 100 * int64(avgDevsPerFile)}, + firstChange: t1, lastChange: t1, + } + for d := 0; d < avgDevsPerFile; d++ { + email := fmt.Sprintf("dev-%04d@x", (i*7+d*11)%numDevs) // stride + prime for spread + fe.devLines[email] = 100 + fe.devCommits[email] = 1 + } + ds.files[path] = fe + } + return ds +} + +// BenchmarkDevProfilesAll measures the full-dataset path (generate profiles +// for every dev). Before the allCollabs pre-compute, this was the hot path +// in `report` on large repos — kubernetes-scale took >80s even though stats +// on the same data took <12s. The pre-compute makes this roughly linear in +// F × D_per_file² instead of D × F × D_per_file. +func BenchmarkDevProfilesAll(b *testing.B) { + ds := buildSyntheticLargeDataset(1000, 10000, 5, 30000) + b.ResetTimer() + b.ReportAllocs() + for i := 0; i < b.N; i++ { + _ = DevProfiles(ds, "") + } +} + +// BenchmarkDevProfilesFilterEmail measures the single-dev query path. +// The pre-compute does more work than strictly necessary for a single dev, +// but the absolute cost is bounded by F × D_per_file² and remains fast. 
+func BenchmarkDevProfilesFilterEmail(b *testing.B) { + ds := buildSyntheticLargeDataset(1000, 10000, 5, 30000) + b.ResetTimer() + b.ReportAllocs() + for i := 0; i < b.N; i++ { + _ = DevProfiles(ds, "dev-0042@x") + } +}
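The churnTrend contract those earlier tests pin down is compact enough to sketch in one function. The following is a minimal illustration, not the shipped implementation: the 3-month window, the month-key cutoff (which is what makes the result independent of the day of month), the short-span guard, and the 0 / 1 / 2 endpoints come straight from the tests; the clamped ratio used for in-between cases is an assumption.

```go
// Sketch only: mirrors the churnTrend contract pinned by the tests
// above. churnTrendSketch is an illustrative name, not the real symbol.
package main

import (
	"fmt"
	"time"
)

func churnTrendSketch(monthChurn map[string]int64, earliest, latest time.Time) float64 {
	// Truncate to the first of the month before stepping back, so the
	// cutoff depends only on the month of latest, never its day.
	base := time.Date(latest.Year(), latest.Month(), 1, 0, 0, 0, 0, time.UTC)
	cutoff := base.AddDate(0, -3, 0).Format("2006-01")

	// Short-span guard: the dataset's whole span sits inside the trend
	// window, so there is no "earlier" bucket to compare against.
	if earliest.Format("2006-01") >= cutoff {
		return 1
	}

	var earlier, recent int64
	for month, churn := range monthChurn {
		if month >= cutoff {
			recent += churn
		} else {
			earlier += churn
		}
	}
	switch {
	case earlier == 0 && recent > 0:
		return 2 // grew from nothing
	case earlier > 0 && recent == 0:
		return 0 // declined to nothing
	case earlier == 0 && recent == 0:
		return 1 // defensive fallthrough: zero-value months only
	}
	// Assumption: in-between cases scale by ratio, clamped to [0, 2].
	ratio := float64(recent) / float64(earlier)
	if ratio > 2 {
		ratio = 2
	}
	return ratio
}

func main() {
	earliest := time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC)
	latest := time.Date(2024, 6, 15, 0, 0, 0, 0, time.UTC)
	fmt.Println(churnTrendSketch(map[string]int64{"2023-05": 500}, earliest, latest)) // 0: earlier-only
	fmt.Println(churnTrendSketch(map[string]int64{"2024-06": 100}, earliest, latest)) // 2: recent-only
}
```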
[dev-network table in the HTML report template; markup lost in extraction.
Recoverable change: the header row gains a "Shared Lines" column, going from
"Developer A | Developer B | Shared Files | Weight" to
"Developer A | Developer B | Shared Files | Shared Lines | Weight", and the
row template emits {{.SharedLines}} alongside {{.DevA}}, {{.DevB}},
{{.SharedFiles}}, and {{printf "%.1f" .Weight}}%.]
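The Shared Lines column added above is backed by the overlap metric that TestDeveloperNetworkSharedLinesWeighsRealOverlap pins: per shared file, the smaller of the two developers' line counts, summed across files. A minimal sketch, with the map layout as an illustrative stand-in for the real Dataset internals:

```go
// Sketch only: the SharedLines overlap metric, per the test fixture
// above (min(1,1000) + min(500,500) = 501).
package main

import "fmt"

func sharedLines(files map[string]map[string]int64, devA, devB string) int64 {
	var total int64
	for _, devLines := range files {
		a, okA := devLines[devA]
		b, okB := devLines[devB]
		if !okA || !okB {
			continue // not a shared file
		}
		if a < b {
			total += a
		} else {
			total += b
		}
	}
	return total
}

func main() {
	files := map[string]map[string]int64{
		"f1.go": {"alice@x": 1, "bob@x": 1000},
		"f2.go": {"alice@x": 500, "bob@x": 500},
	}
	fmt.Println(sharedLines(files, "alice@x", "bob@x")) // 501
}
```

Summing per-file minima is what makes the metric resistant to lopsided overlap: a file where one developer wrote a single line contributes at most that single line, however much the other developer wrote.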