20 changes: 11 additions & 9 deletions README.md
@@ -42,10 +42,12 @@ gitcortex extract --repo . \
--ignore "*_generated.go"
```

Patterns match against the file path as emitted by `git log --raw` (forward-slash, repo-relative). Directory patterns like `vendor/*` exclude anything under that prefix. File-name patterns like `*.pb.go` match at any depth.
Patterns match against the file path as emitted by `git log --raw` (forward-slash, repo-relative). Directory patterns like `vendor/*` are **repo-root prefixes** — they exclude everything under `vendor/` at the top of the tree, but **not** nested occurrences like `pkg/vendor/foo.go` or `services/auth/vendor/bar.go`. For those you need explicit entries such as `--ignore "pkg/vendor/*"`. File-name patterns like `*.pb.go` and `package-lock.json` match at any depth via extract's basename match, so one entry covers every occurrence.

Start permissive, run `gitcortex stats --stat hotspots --top 20` and `--stat churn-risk --top 20`, and add `--ignore` entries for whatever generated file type dominates the output. Re-extract until the top list represents real changes worth understanding.

**You don't need to get this right on the first try.** When `stats` runs on an unfiltered dataset and likely vendor/generated paths account for ≥10% of repo churn, it prints a warning to stderr that lists the matched buckets and a copy-pasteable `--ignore` invocation. The warning enumerates the exact nested prefixes it found (e.g. `wp-includes/js/dist/*`, `services/auth/vendor/*`), so monorepos and subproject-heavy layouts get the specific entries they need without guessing. Applying the suggested `--ignore` entries and re-extracting is the fastest path from raw repo to usable stats.

> Both commit-level (`Summary.TotalAdditions/Deletions`) and file-level aggregations recompute from the filtered set, so all totals stay consistent after `--ignore` — the extract step recalculates commit additions/deletions as the sum of non-ignored file records before writing them to JSONL.

## Privacy and reliability
@@ -256,27 +258,27 @@ gitcortex stats --input data.jsonl --stat churn-risk --top 15
gitcortex stats --input data.jsonl --stat churn-risk --churn-half-life 60 # faster decay
```

Real output from the Pi-hole repository (one sample per label):
Real output:

```
PATH                                      LABEL           CHURN  BF    AGE  TREND
automated install/basic-install.sh        active          115.3  15  4121d  0.00
.github/workflows/codeql-analysis.yml     legacy-hotspot   66.2   2  1640d  0.26
advanced/bash-completion/pihole-ftl.bash  silo             16.5   1   240d  1.00
test/_alpine_3_23.Dockerfile              active-core       7.1   1   120d  1.00
advanced/Templates/gravity.db.schema      cold              0.0   1  2616d  1.00
PATH                                   LABEL                             RECENT CHURN  BF    AGE  TREND
automated install/basic-install.sh     active (age P90, trend P87)              115.3  15  4121d  0.00
.github/workflows/codeql-analysis.yml  active-core (age P30, trend P95)          66.2   2  1640d  0.26
advanced/Scripts/utils.sh              active-core (age P27, trend P94)          53.3   2  1523d  0.10
```

| Label | Meaning |
|-------|---------|
| `cold` | Low recent churn — ignore. |
| `active` | Shared ownership (bus factor ≥ 3). Healthy. |
| `active-core` | New code (< 180d), single author. Usually fine. |
| `active-core` | New code (younger than most of the repo), single author. Usually fine. |
| `silo` | Old + concentrated + stable/growing. Knowledge bottleneck — plan transfer. |
| `legacy-hotspot` | **Urgent.** Old + concentrated + declining. Deprecated paths still being touched. |

Sort key is `recent_churn`; the label answers "is this activity a problem?". The composite `risk_score` field (`recent_churn / bus_factor`) is still emitted for CI gate back-compat.

**The `(age PXX, trend PYY)` suffix** reports where the file sits in this repo's distribution: `age P90` = older than 90% of tracked files, `trend P08` = declining more sharply than 92%. Classification thresholds are not absolute — they adapt to each dataset (P75 age and P25 trend, with a fallback to fixed constants for repos under 8 files). A `legacy-hotspot` with `(age P76, trend P24)` barely qualifies; one at `(age P98, trend P03)` is the real alarm. Distance from the boundary is now visible instead of hidden. See `docs/METRICS.md` for the adaptive-thresholds section.
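
One way to read the `(age PXX, trend PYY)` suffix is as a percentile rank within the dataset. The following is a minimal sketch, not gitcortex's implementation: `percentileRank` is a hypothetical helper, and counting only strictly smaller values (the tie handling) is an assumption.

```go
package main

import "fmt"

// percentileRank returns the share (0-100) of values strictly below v.
// Hypothetical helper for illustration; the real tool may treat ties
// or rounding differently.
func percentileRank(values []float64, v float64) int {
	if len(values) == 0 {
		return 0
	}
	below := 0
	for _, x := range values {
		if x < v {
			below++
		}
	}
	return below * 100 / len(values)
}

func main() {
	// Made-up file ages in days for a ten-file repo.
	ages := []float64{10, 40, 90, 200, 365, 700, 1200, 1800, 2500, 4121}
	// Nine of ten files are younger than 4121 days.
	fmt.Printf("age P%02d\n", percentileRank(ages, 4121)) // prints "age P90"
}
```

Under this reading, a file at `age P90` is older than 90% of the files it is compared against, which matches the description above.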

`--churn-half-life` controls how fast old changes lose weight. The default is 90 days: a change loses half its weight every 90 days.
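
A half-life implies exponential decay. As a rough sketch of that weighting (the function name `decayWeight` is hypothetical, and the tool's exact formula is assumed rather than confirmed here):

```go
package main

import (
	"fmt"
	"math"
)

// decayWeight returns the weight of a change that is ageDays old when
// weights halve every halfLifeDays. Illustrative only; assumed formula.
func decayWeight(ageDays, halfLifeDays float64) float64 {
	return math.Pow(0.5, ageDays/halfLifeDays)
}

func main() {
	for _, age := range []float64{0, 90, 180} {
		fmt.Printf("age %3.0fd: half-life 90 -> %.3f, half-life 60 -> %.3f\n",
			age, decayWeight(age, 90), decayWeight(age, 60))
	}
}
```

With a 60-day half-life, a 180-day-old change keeps one eighth of its weight instead of one quarter, which is why a shorter half-life emphasizes recent activity.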

### Working patterns
48 changes: 48 additions & 0 deletions cmd/gitcortex/main.go
@@ -4,6 +4,7 @@ import (
"context"
"encoding/json"
"fmt"
"io"
"os"
"os/signal"
"path/filepath"
@@ -180,6 +181,16 @@ func statsCmd() *cobra.Command {
			fmt.Fprintf(os.Stderr, "Loaded %d commits, %d files, %d devs\n\n",
				ds.CommitCount, ds.UniqueFileCount, ds.DevCount)

			// Suspect vendor/generated warning only fires when the
			// aggregate matched churn exceeds suspectWarningMinChurnRatio
			// of the total. Text-format stats only: JSON/CSV consumers
			// typically pipe the output and don't want a chatter prefix.
			if sf.format == "" || sf.format == "table" {
				if buckets, worth := stats.DetectSuspectFiles(ds); worth {
					printSuspectWarning(os.Stderr, buckets)
				}
			}

			return renderStats(ds, &sf)
		},
	}
Expand All @@ -188,6 +199,43 @@ func statsCmd() *cobra.Command {
return cmd
}

// printSuspectWarning emits a stderr block listing likely vendor/generated
// patterns that matched, with a copy-pasteable --ignore suggestion. Called
// only when DetectSuspectFiles reports the matched churn crosses the
// noise floor, so repos with one incidental .lock file don't get spammed.
func printSuspectWarning(w io.Writer, buckets []stats.SuspectBucket) {
	if len(buckets) == 0 {
		return
	}
	// Top 6 buckets — enough to be useful, not enough to drown the prompt.
	const maxShown = 6
	shown := buckets
	if len(shown) > maxShown {
		shown = shown[:maxShown]
	}
	fmt.Fprintln(w, "⚠ Suspect vendor/generated paths detected — they inflate churn and bus factor")
	fmt.Fprintln(w, " without reflecting hand-authored code. Top matches:")
	for _, b := range shown {
		fmt.Fprintf(w, " %-22s %4d files, %8d churn (%s)\n",
			b.Pattern.Glob, len(b.Paths), b.Churn, b.Pattern.Reason)
	}
	if len(buckets) > len(shown) {
		fmt.Fprintf(w, " ... and %d more bucket(s) — see suggestion below for full set\n",
			len(buckets)-len(shown))
	}
	// Suggestions cover ALL buckets, not just the shown subset — the
	// warning threshold is computed over every bucket, so a remediation
	// that skips unshown ones would leave the warning firing after the
	// suggested fix.
	suggestions := stats.CollectAllSuggestions(buckets)
	fmt.Fprint(w, " Rerun extract with --ignore to drop them, e.g.:\n gitcortex extract --repo .")
	for _, s := range suggestions {
		fmt.Fprintf(w, " --ignore %s", stats.ShellQuoteSingle(s))
	}
	fmt.Fprintln(w)
	fmt.Fprintln(w)
}

func renderStats(ds *stats.Dataset, sf *statsFlags) error {
	showAll := sf.stat == ""
	f := stats.NewFormatter(os.Stdout, sf.format)
35 changes: 31 additions & 4 deletions docs/METRICS.md
@@ -134,14 +134,26 @@ rows implicitly assume the earlier rows didn't match.
|---|-------|------|--------|
| 1 | **cold** | `recent_churn ≤ 0.5 × median(recent_churn)` | Ignore. |
| 2 | **active** | `bus_factor ≥ 3` | Healthy, shared. |
| 3 | **active-core** | `bus_factor ≤ 2` and `age < 180 days` | New code, single author is expected. |
| 4 | **legacy-hotspot** | `bus_factor ≤ 2`, `age ≥ 180 days`, and `trend < 0.5` | **Urgent.** Old + concentrated + declining. |
| 3 | **active-core** | `bus_factor ≤ 2` and `age < oldAgeThreshold` | New code, single author is expected. |
| 4 | **legacy-hotspot** | `bus_factor ≤ 2`, `age ≥ oldAgeThreshold`, and `trend < decliningTrendThreshold` | **Urgent.** Old + concentrated + declining. |
| 5 | **silo** | default (everything the rules above didn't catch) | Knowledge bottleneck — plan transfer. |

Where:
- `age = days between firstChange and latest commit in dataset`
- `trend = churn_last_3_months / churn_earlier`. Edge cases: an empty history returns 1 (no signal); a recent-only history returns 2 (grew from nothing); an earlier-only history returns 0 (declined to nothing, the strongest `legacy-hotspot` signal); a short-span dataset whose entire history fits inside the trend window returns 1, to avoid false "growing" reports.
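
The rule table and the `trend` edge cases can be sketched in Go as follows. This is an illustration under stated assumptions, not the code in `internal/stats/stats.go`: the `fileStats` type, the function names, the parameter lists, and the order in which the edge cases are checked are all hypothetical.

```go
package main

import "fmt"

// fileStats is a hypothetical container for the inputs the rules use.
type fileStats struct {
	RecentChurn float64
	BusFactor   int
	AgeDays     float64
	Trend       float64
}

// churnTrend applies the documented edge cases before taking the ratio.
// spanFitsWindow marks a dataset whose whole history is inside the
// trend window (assumed to be checked before the recent-only case).
func churnTrend(recent, earlier float64, spanFitsWindow bool) float64 {
	switch {
	case recent == 0 && earlier == 0:
		return 1 // empty history: no signal
	case spanFitsWindow:
		return 1 // avoid a false "growing" report
	case earlier == 0:
		return 2 // recent-only: grew from nothing
	case recent == 0:
		return 0 // earlier-only: declined to nothing
	default:
		return recent / earlier
	}
}

// classify evaluates the rules top to bottom; the first match wins, so
// later cases implicitly assume bus factor <= 2.
func classify(f fileStats, medianChurn, oldAge, decliningTrend float64) string {
	switch {
	case f.RecentChurn <= 0.5*medianChurn:
		return "cold"
	case f.BusFactor >= 3:
		return "active"
	case f.AgeDays < oldAge:
		return "active-core"
	case f.Trend < decliningTrend:
		return "legacy-hotspot"
	default:
		return "silo"
	}
}

func main() {
	f := fileStats{RecentChurn: 66.2, BusFactor: 2, AgeDays: 1640, Trend: churnTrend(0, 40, false)}
	fmt.Println(classify(f, 10, 1500, 0.3)) // prints "legacy-hotspot"
}
```

Passing the two thresholds in as parameters keeps the rule order independent of how the thresholds are chosen.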

### Adaptive thresholds (per-dataset calibration)

`oldAgeThreshold` and `decliningTrendThreshold` are not fixed constants: they are derived from the dataset's own distribution each run. With at least `classifyMinSample` (8) files present:
- `oldAgeThreshold` = **P75** of file ages in this dataset
- `decliningTrendThreshold` = **P25** of file trends in this dataset, clamped to at least `adaptiveDecliningTrendFloor` (0.01). The floor matters on mature repos where ≥25% of files are dormant (trend=0 via the earlier-only path): P25 would otherwise collapse to 0 and the strict `trend < threshold` check would never fire, silently misclassifying every dormant concentrated file as `silo` instead of `legacy-hotspot`. The floor keeps the threshold strictly positive so the trend=0 signal — the strongest legacy-hotspot alarm — still reaches the rule.

This makes "old" mean "older than 75% of tracked files in this repo" instead of an absolute 180 days. A 4-year-old file in a 12-year-old codebase was previously tagged `legacy-hotspot` even though it was newer than most of the repo — now the same file lands in `active-core`. Below the sample threshold, the absolute fallbacks `classifyOldAgeDays` and `classifyDecliningTrend` apply so tiny repos still produce labels.
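
The threshold selection described above can be sketched like this. The constant names and values come from this document; the `percentile` and `thresholds` functions are hypothetical, and the nearest-rank percentile method is an assumption (the real code may interpolate).

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Constant names and values as documented; everything else is illustrative.
const (
	classifyMinSample           = 8
	classifyOldAgeDays          = 180.0
	classifyDecliningTrend      = 0.5
	adaptiveDecliningTrendFloor = 0.01
)

// percentile uses the nearest-rank method on a sorted copy (assumption).
// The caller must pass a non-empty slice.
func percentile(values []float64, p float64) float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// thresholds returns the age and trend cutoffs, plus whether the
// adaptive path was used instead of the fixed fallbacks.
func thresholds(ages, trends []float64) (oldAge, declining float64, adaptive bool) {
	if len(ages) < classifyMinSample || len(trends) < classifyMinSample {
		return classifyOldAgeDays, classifyDecliningTrend, false
	}
	oldAge = percentile(ages, 75)
	// Clamp so a P25 of 0 (many dormant files) still lets trend=0 match.
	declining = math.Max(percentile(trends, 25), adaptiveDecliningTrendFloor)
	return oldAge, declining, true
}

func main() {
	ages := []float64{100, 200, 300, 400, 500, 600, 700, 800}
	trends := []float64{0, 0, 0, 0.5, 1, 1, 2, 2}
	fmt.Println(thresholds(ages, trends)) // prints "600 0.01 true"
}
```

In the example, P25 of the trends is 0, so the floor lifts the threshold to 0.01 and a dormant file with `trend = 0` still satisfies `trend < threshold`.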

Each `ChurnRiskResult` also exposes `AgePercentile` and `TrendPercentile` (0-100) showing where the file sits in the distribution. The fields are nil (omitted from JSON, empty in CSV) when the fallback path ran. The CLI and HTML surface these alongside the label — `legacy-hotspot (age P92, trend P08)` tells you the file is both old and sharply declining relative to peers; `legacy-hotspot (age P76, trend P24)` barely qualifies. Distance from the classification boundary is now readable, not hidden.

> **Degenerate trend distribution.** When every file's entire history fits inside the trend window (e.g. a repo with <3 months of commits), `churnTrend` returns the flat-signal sentinel `1.0` for all of them. The adaptive P25 then lands on `1.0` too, and the `trend < P25` predicate matches nobody — no file reaches `legacy-hotspot` through the trend check. Old + concentrated files fall through to `silo` instead. This is mathematically correct (there's no variation to classify on) but can surprise readers of short-lived repos. Pinned by `TestChurnRiskAdaptiveDegenerateTrendDistribution` so future refactors don't silently flip it.

> **Sensitivity note.** Files touched a single time long ago and never again correctly route to `legacy-hotspot` via the earlier-only trend=0 path. On large mature repos this pattern is the common case, not the exception — e.g. validation on a kubernetes snapshot classified ~29k files this way. If the label distribution looks heavy on `legacy-hotspot` for a long-lived codebase, that is usually diagnosing real dormant code, not a bug.

### Additional columns
@@ -264,8 +276,11 @@ Every classification boundary is a named constant in `internal/stats/stats.go`.
|----------|---------|----------|
| `classifyColdChurnRatio` | `0.5` | A file is `cold` when `recent_churn ≤ ratio × median(recent_churn)`. |
| `classifyActiveBusFactor` | `3` | A file is `active` (shared, healthy) when `bus_factor ≥ this`. |
| `classifyOldAgeDays` | `180` | Age cutoff for `active-core` vs `silo`/`legacy-hotspot`. |
| `classifyDecliningTrend` | `0.5` | Trend ratio below this marks `legacy-hotspot` (old + declining). |
| `classifyOldAgeDays` | `180` | **Fallback only** (dataset < `classifyMinSample` files). Adaptive path uses P75 of the dataset's own age distribution. |
| `classifyDecliningTrend` | `0.5` | **Fallback only**. Adaptive path uses P25 of the dataset's own trend distribution. |
| `classifyMinSample` | `8` | Below this many files, percentile estimates are too noisy to trust and the two thresholds above revert to absolutes. |
| `adaptiveDecliningTrendFloor` | `0.01` | Minimum value for the adaptive `decliningTrendThreshold`. Prevents P25 from collapsing to 0 on mature repos where dormant files dominate, which would hide every legacy-hotspot. |
| `suspectWarningMinChurnRatio` | `0.10` | Vendor/generated path warning fires only when matched paths together exceed this fraction of total repo churn — prevents a single incidental `.lock` file from triggering noise. |
| `classifyTrendWindowMonths` | `3` | Window (months, relative to latest commit) for the recent vs earlier split in `trend`. |
| `contribRefactorRatio` | `0.8` | `del/add ≥ this` → dev profile `contribType = refactor`. |
| `contribBalancedRatio` | `0.4` | `0.4 ≤ del/add < 0.8` → `balanced`; below 0.4 → `growth`. |
@@ -299,6 +314,18 @@ A third-level tiebreaker on path/sha/email asc is applied where primary and seco

Inside `busfactor`, the per-file `TopDevs` list is sorted by lines desc with an email asc tiebreaker. Without it, binary assets and small files where two devs contribute equal lines (e.g. `.gif`, `.png`, one-line configs) produced a different `TopDevs` email order on every run.
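
A deterministic ordering of this kind can be written with `sort.Slice` and an explicit tiebreaker. The sketch below is illustrative: the `devLines` type and `sortTopDevs` function are hypothetical names, not the ones used in the repository.

```go
package main

import (
	"fmt"
	"sort"
)

// devLines is a hypothetical stand-in for one TopDevs entry.
type devLines struct {
	Email string
	Lines int
}

// sortTopDevs orders by lines descending, then email ascending, so
// entries with equal line counts always come out in the same order.
func sortTopDevs(devs []devLines) {
	sort.Slice(devs, func(i, j int) bool {
		if devs[i].Lines != devs[j].Lines {
			return devs[i].Lines > devs[j].Lines
		}
		return devs[i].Email < devs[j].Email
	})
}

func main() {
	devs := []devLines{
		{"bob@example.com", 10},
		{"alice@example.com", 10},
		{"carol@example.com", 25},
	}
	sortTopDevs(devs)
	for _, d := range devs {
		fmt.Println(d.Email, d.Lines)
	}
}
```

`sort.Slice` is not stable, so without the email comparison the relative order of the two 10-line entries would depend on input order, which map iteration does not guarantee.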

### Vendor/generated path warning

When `stats` renders table output (the format flag is unset or `table`), it scans the loaded dataset for paths matching a conservative list of vendor/generated heuristics: `vendor/`, `node_modules/`, `dist/`, `build/`, `third_party/`, `*.min.js`, `*.min.css`, `*.lock`, language-specific lockfiles (`package-lock.json`, `go.sum`, `Cargo.lock`, `poetry.lock`, `yarn.lock`, `pnpm-lock.yaml`), and common generated extensions (`*.pb.go`, `*_pb2.py`, `*.generated.*`).

If the matched paths together account for at least `suspectWarningMinChurnRatio` (10%) of total repo churn, a warning is emitted to stderr listing the top-6 buckets with a copy-pasteable `extract --ignore` invocation. Below the floor, no warning — a single incidental `.lock` file in an otherwise clean repo stays silent.
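
The noise floor amounts to a ratio check. A minimal sketch, assuming the "at least" reading of the threshold stated above (`worthWarning` is a hypothetical name; the constant name and value are the documented ones):

```go
package main

import "fmt"

// Documented constant; the comparison direction follows the prose above.
const suspectWarningMinChurnRatio = 0.10

// worthWarning reports whether matched churn reaches the noise floor.
func worthWarning(matchedChurn, totalChurn int64) bool {
	if totalChurn <= 0 {
		return false // nothing to compare against
	}
	return float64(matchedChurn)/float64(totalChurn) >= suspectWarningMinChurnRatio
}

func main() {
	fmt.Println(worthWarning(40, 10000))   // one small lockfile: false
	fmt.Println(worthWarning(3500, 10000)) // vendored tree dominates: true
}
```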

Directory-segment heuristics (`vendor`, `node_modules`, `dist`, `build`, `third_party`) match the segment wherever it appears in the path, but `extract --ignore` treats a bare `dist/*` glob as a repo-root prefix. To avoid suggesting a fix that wouldn't actually remove the matched files, each bucket carries a `Suggestions` list of the specific parent prefixes it matched (e.g. `wp-includes/js/dist/*`, `services/auth/vendor/*`), and the warning emits every unique prefix so the copy-pasteable command covers every source of distortion. Suffix and basename patterns (`*.min.js`, `package-lock.json`, etc.) collapse to a single glob because extract's basename match already handles them at any depth.
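
Deriving a parent-prefix suggestion from a matched path can be sketched as below. `suggestionFor` is a hypothetical helper written for illustration; how the real `Suggestions` list is built is not shown in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// suggestionFor returns the glob covering the first directory named
// segment in path, including its parent prefix. The last path element
// is the file name, so it is never treated as a directory.
func suggestionFor(path, segment string) (string, bool) {
	parts := strings.Split(path, "/")
	for i := 0; i < len(parts)-1; i++ {
		if parts[i] == segment {
			return strings.Join(parts[:i+1], "/") + "/*", true
		}
	}
	return "", false
}

func main() {
	for _, p := range []string{
		"vendor/foo.go",
		"services/auth/vendor/bar.go",
		"wp-includes/js/dist/editor.js",
	} {
		for _, seg := range []string{"vendor", "dist"} {
			if glob, ok := suggestionFor(p, seg); ok {
				fmt.Printf("%-32s -> --ignore '%s'\n", p, glob)
			}
		}
	}
}
```

A top-level match yields the bare `vendor/*`, while a nested match keeps its parent prefix, which is what a root-prefix glob needs in order to remove the nested files.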

The warning is advisory. Nothing is auto-filtered; the user decides whether to re-extract. Matches do not affect computed stats in that run. JSON/CSV output paths skip the warning since they're typically piped.

Statistical heuristics (very high churn-per-commit, single-author bulk updates) are deliberately out of scope — their false-positive rate on hand-authored code is higher than the path-based list and we'd rather stay quiet than cry wolf.

### `--mailmap` off by default

`extract` does not apply `.mailmap` unless you pass `--mailmap`. Without it, the same person with two emails (e.g. `alice@work.com` and `alice@personal.com`) splits into two contributors. Affected metrics: `contributors`, `bus factor`, `dev network`, `profiles`, churn-risk label (via bus factor).
9 changes: 7 additions & 2 deletions internal/extract/extract.go
@@ -244,7 +244,7 @@ func emitCommit(writer *bufio.Writer, commit *git.StreamCommit, sizeMap map[stri
if path == "" {
path = entry.PathOld
}
if shouldIgnore(path, ignorePatterns) {
if ShouldIgnore(path, ignorePatterns) {
continue
}
filteredRaw = append(filteredRaw, entry)
@@ -389,7 +389,12 @@ func loadDevEmails(path string) (map[string]struct{}, error) {
return cache, scanner.Err()
}

func shouldIgnore(path string, patterns []string) bool {
// ShouldIgnore reports whether path matches any of the ignore patterns.
// Exported so downstream packages (e.g. the stats suspect-warning)
// can verify that the globs they emit actually match the paths they
// describe — without a shared predicate the two surfaces can drift
// and users end up with --ignore suggestions that don't do anything.
func ShouldIgnore(path string, patterns []string) bool {
if len(patterns) == 0 {
return false
}
4 changes: 2 additions & 2 deletions internal/extract/extract_test.go
@@ -127,9 +127,9 @@ func TestShouldIgnore(t *testing.T) {
}

for _, tt := range tests {
got := shouldIgnore(tt.path, tt.patterns)
got := ShouldIgnore(tt.path, tt.patterns)
if got != tt.want {
t.Errorf("shouldIgnore(%q, %v) = %v, want %v", tt.path, tt.patterns, got, tt.want)
t.Errorf("ShouldIgnore(%q, %v) = %v, want %v", tt.path, tt.patterns, got, tt.want)
}
}
}
20 changes: 16 additions & 4 deletions internal/report/report.go
@@ -394,10 +394,22 @@ var funcMap = template.FuncMap{
"joinDevs": stats.JoinDevs,
"seq": seq,
"list": list,
"int64": toInt64,
"actColor": actColor,
"pctRatio": pctRatio,
"plusInt": plusInt,
"int64": toInt64,
"actColor": actColor,
"pctRatio": pctRatio,
"plusInt": plusInt,
"derefInt": derefInt,
}

// derefInt returns the value behind an *int, or 0 if nil. Template-side
// helper for optional percentile fields on ChurnRiskResult: nil becomes a
// safe zero so `{{derefInt .AgePercentile}}` never panics or prints a
// pointer address (which is what %d on *int would do).
func derefInt(p *int) int {
	if p == nil {
		return 0
	}
	return *p
}

var tmpl = template.Must(template.New("report").Funcs(funcMap).Parse(reportHTML))