Merged
42 changes: 42 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -143,6 +143,47 @@ The default branch is auto-detected from `origin/HEAD`, falling back to `main`,

The `--mailmap` flag uses git's built-in `.mailmap` support to unify developer identities. Without it, the same person with different emails (e.g., `alice@work.com` and `alice@personal.com`) appears as separate contributors.

### What gitcortex collects from git

Extraction runs two git commands against the local repository and streams their output. No source-code bytes are read.

```
git log -M --raw --numstat --format=<metadata> <branch> → commits, parents, per-file diffs (counts only)
git cat-file --batch-check → blob sizes (old/new) for each file change
```

Per-commit metadata (populates the `commit` record):

| Field | Source | Used by |
|---|---|---|
| `sha`, `tree`, `parents` | `git log --format` | commit graph, merge detection |
| `author_name`, `author_email`, `author_date` | `git log --format` | contributors, activity, working patterns, bus factor |
| `committer_name`, `committer_email`, `committer_date` | `git log --format` | committer identity feeds the `dev` registry (so a committer who is never an author still appears as a known developer); no other stat consumes these fields |
| `additions`, `deletions`, `files_changed` | summed from `--numstat` | summary totals, hotspots, churn-risk |
| `message` | `git log --format` | opt-in only (`--include-commit-messages`); truncated to 80 chars in `top-commits` when present |

Per-file-change metadata (populates the `commit_file` record):

| Field | Source | Used by |
|---|---|---|
| `path_current`, `path_previous`, `status` | `git log --raw` | hotspots, directories, extensions, rename tracking (`R100` / `C075` trigger merges) |
| `additions`, `deletions` | `git log --numstat` | per-file churn, recent churn, coupling |
| `old_hash`, `new_hash`, `old_size`, `new_size` | `git cat-file --batch-check` | retained but not currently used in stats |

**Not collected:**
- File contents / diff hunks — only line counts from `--numstat`.
- Commit messages (unless `--include-commit-messages` is passed).
- Tags, refs other than the traversed branch, reflog, notes.
- Any network traffic — extraction is 100% local to the git directory.

**Opt-ins that change what ships in the JSONL:**
- `--include-commit-messages` — adds the commit subject to each `commit` record (off by default).
- `--mailmap` — normalizes author/committer names+emails via git's `.mailmap` before recording (off by default; warned when a `.mailmap` exists but the flag is omitted).
- `--ignore <glob>` — drops matching `commit_file` records entirely at extract time (counts in the `commit` record are recomputed so totals remain consistent).
- `--first-parent` — traverses only the first-parent chain, skipping merged branch history.
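
As a rough illustration of the `--ignore` behavior described above, the sketch below drops matching file changes and recomputes commit totals from the survivors. The `fileChange` type and `applyIgnore` function are hypothetical, and Go's `path.Match` against the base name is only a stand-in: gitcortex's actual glob semantics are not specified here and may differ.

```go
package main

import (
	"fmt"
	"path"
)

// fileChange is a hypothetical, trimmed-down commit_file record.
type fileChange struct {
	Path                 string
	Additions, Deletions int
}

// matchesAny reports whether the base name of p matches any pattern.
// Matching on the base name is an assumption made for this sketch.
func matchesAny(p string, patterns []string) bool {
	for _, pat := range patterns {
		if ok, err := path.Match(pat, path.Base(p)); err == nil && ok {
			return true
		}
	}
	return false
}

// applyIgnore drops ignored changes and recomputes the totals from
// the files that remain, so the commit-level counts stay consistent.
func applyIgnore(files []fileChange, patterns []string) (kept []fileChange, adds, dels int) {
	for _, f := range files {
		if matchesAny(f.Path, patterns) {
			continue
		}
		kept = append(kept, f)
		adds += f.Additions
		dels += f.Deletions
	}
	return kept, adds, dels
}

func main() {
	files := []fileChange{
		{"web/yarn.lock", 5000, 4000},
		{"src/app.ts", 20, 5},
	}
	kept, adds, dels := applyIgnore(files, []string{"*.lock"})
	fmt.Println(len(kept), adds, dels) // 1 20 5
}
```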

Full per-record schema (every field, types, enums): see [`docs/RUNBOOK.md`](docs/RUNBOOK.md#jsonl-format).

Output is a JSONL file with one record per line. Four record types:

```jsonl
@@ -211,6 +252,7 @@ Available stats:
| `top-commits` | Largest commits ranked by lines changed (includes message if extracted with `--include-commit-messages`) |
| `pareto` | Concentration (80% threshold) across files, devs (two lenses: commits and churn), and directories |
| `structure` | Repo layout as a `tree(1)`-style view, dirs sorted by aggregate churn, capped by `--tree-depth` (default 3) |
| `extensions` | File extensions ranked by recent churn, with file count, unique devs, and first/last-seen — the historical lens on language distribution |

Output formats: `table` (default, human-readable), `csv` (single clean table per `--stat`, header row on line 1), `json` (unified object with all sections).

17 changes: 13 additions & 4 deletions cmd/gitcortex/main.go
@@ -105,8 +105,8 @@ func isValidGranularity(s string) bool {

func isValidStat(s string) bool {
switch s {
- case "summary", "contributors", "hotspots", "directories", "activity",
- "busfactor", "coupling", "churn-risk", "working-patterns",
+ case "summary", "contributors", "hotspots", "directories", "extensions",
+ "activity", "busfactor", "coupling", "churn-risk", "working-patterns",
"dev-network", "profile", "top-commits", "pareto", "structure":
return true
}
@@ -133,7 +133,7 @@ func addStatsFlags(cmd *cobra.Command, sf *statsFlags) {
cmd.Flags().StringVar(&sf.format, "format", "table", "Output format: table, csv, json")
cmd.Flags().IntVar(&sf.topN, "top", 10, "Number of top entries to show (0 = all)")
cmd.Flags().StringVar(&sf.granularity, "granularity", "month", "Activity granularity: day, week, month, year")
- cmd.Flags().StringVar(&sf.stat, "stat", "", "Show a specific stat: summary, contributors, hotspots, directories, activity, busfactor, coupling, churn-risk, working-patterns, dev-network, profile, top-commits, pareto, structure")
+ cmd.Flags().StringVar(&sf.stat, "stat", "", "Show a specific stat: summary, contributors, hotspots, directories, extensions, activity, busfactor, coupling, churn-risk, working-patterns, dev-network, profile, top-commits, pareto, structure")
cmd.Flags().IntVar(&sf.couplingMaxFiles, "coupling-max-files", 50, "Max files per commit for coupling analysis")
cmd.Flags().IntVar(&sf.couplingMinChanges, "coupling-min-changes", 5, "Min co-changes for coupling results")
cmd.Flags().IntVar(&sf.churnHalfLife, "churn-half-life", 90, "Half-life in days for churn decay (churn-risk)")
@@ -151,7 +151,7 @@ func validateStatsFlags(sf *statsFlags) error {
return fmt.Errorf("invalid --granularity %q; must be one of: day, week, month, year", sf.granularity)
}
if sf.stat != "" && !isValidStat(sf.stat) {
- return fmt.Errorf("invalid --stat %q; valid: summary, contributors, hotspots, directories, activity, busfactor, coupling, churn-risk, working-patterns, dev-network, profile, top-commits, pareto, structure", sf.stat)
+ return fmt.Errorf("invalid --stat %q; valid: summary, contributors, hotspots, directories, extensions, activity, busfactor, coupling, churn-risk, working-patterns, dev-network, profile, top-commits, pareto, structure", sf.stat)
}
return nil
}
@@ -271,6 +271,12 @@ func renderStats(ds *stats.Dataset, sf *statsFlags) error {
return err
}
}
if showAll || sf.stat == "extensions" {
fmt.Fprintf(os.Stderr, "\n=== Top %d Extensions ===\n", sf.topN)
if err := f.PrintExtensions(stats.ExtensionStats(ds, sf.topN)); err != nil {
return err
}
}
if showAll || sf.stat == "activity" {
fmt.Fprintf(os.Stderr, "\n=== Activity (%s) ===\n", sf.granularity)
if err := f.PrintActivity(stats.ActivityOverTime(ds, sf.granularity)); err != nil {
@@ -369,6 +375,9 @@ func renderStatsJSON(f *stats.Formatter, ds *stats.Dataset, sf *statsFlags) erro
if showAll || sf.stat == "directories" {
report["directories"] = stats.DirectoryStats(ds, sf.topN)
}
if showAll || sf.stat == "extensions" {
report["extensions"] = stats.ExtensionStats(ds, sf.topN)
}
if showAll || sf.stat == "activity" {
report["activity"] = stats.ActivityOverTime(ds, sf.granularity)
}
26 changes: 26 additions & 0 deletions docs/METRICS.md
@@ -215,6 +215,7 @@ Per-developer report combining multiple metrics.
| Pace | commits / active_days (smooths bursts — a dev with 100 commits on 2 days and silence for 28 shows pace=50, which reads as a steady rate but isn't) |
| Weekend % | commits on Saturday+Sunday / total commits × 100 |
| Scope | Top 5 directories by unique file count, as % of total files touched |
| Extensions | Top 5 file extensions the dev touched, sorted by **files desc** (tiebreak churn desc, then ext asc) so the displayed `Pct` is monotonic with the sort order and HTML bar widths read correctly. `Pct` is `Files/FilesTouched * 100`; the raw dev-attributable `Churn` (sum of `devLines[email]` across bucket files) is kept on the struct for JSON consumers who want a churn-ranked view. Answers the "language/skill fingerprint" question (`.go` + `.yaml` → backend+infra; `.tsx` + `.ts` + `.css` → frontend). **Caveats:** (1) bucket is derived from the file's canonical (post-rename) path — a dev who worked on `foo.js` pre-migration still shows up under `.ts` if it was later renamed; per-era per-dev attribution would need `byExt` to carry a dev dimension, which isn't tracked. (2) `Pct` values may sum to less than 100% when the dev appears as a contributor on files without adding lines (pure-rename contributions), since the extension aggregation only walks files with non-zero `devLines[email]`. |
| Specialization | Herfindahl index over the **full** per-directory file-count distribution: Σ pᵢ² where pᵢ is the share of the dev's files in directory i. 1 = all files in one directory (narrow specialist); 1/N for a uniform spread across N directories; approaches 0 as the distribution widens. Computed before the top-5 Scope truncation so it reflects actual breadth. Labels (see `specBroadGeneralistMax`, `specBalancedMax`, `specFocusedMax` constants): `< 0.15` broad generalist, `< 0.35` balanced, `< 0.7` focused specialist, `≥ 0.7` narrow specialist. Herfindahl, not Gini, because Gini would collapse "1 file in 1 dir" and "1 file in each of 5 dirs" to the same value (both have zero inequality among buckets), which misses the specialization distinction. **Measures file distribution, not domain expertise** — see caveat below. **Display vs raw:** CLI and HTML show the value rounded to 3 decimals (`%.3f`) for readability; JSON output preserves the full float64. Band classification runs against the raw float, so a value like 0.149 lands in `broad generalist` even though %.2f would have rounded it to `0.15`. JSON consumers that reproduce the banding must use the raw value, not a rounded version. |
| Contribution type | Based on del/add ratio: growth (<0.4), balanced (0.4-0.8), refactor (>0.8) |
| Collaborators | Top 5 devs sharing code with this dev. Ranked by `shared_lines` (Σ min(linesA, linesB) across shared files), tiebreak `shared_files`, then email. Same `shared_lines` semantics as the Developer Network metric — discounts trivial one-line touches so "collaborator" reflects real overlap. |
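
The Specialization row can be made concrete with a small sketch of the Herfindahl index and the banding thresholds quoted in the table (0.15, 0.35, 0.7). The function names `herfindahl` and `specLabel` are hypothetical. Only the formula Σ pᵢ² and the band boundaries come from the text; everything else is an assumption for illustration.

```go
package main

import "fmt"

// herfindahl computes the sum of squared shares over a per-directory
// file-count distribution (hypothetical helper, not project code).
func herfindahl(filesPerDir map[string]int) float64 {
	total := 0
	for _, n := range filesPerDir {
		total += n
	}
	if total == 0 {
		return 0
	}
	var h float64
	for _, n := range filesPerDir {
		p := float64(n) / float64(total)
		h += p * p
	}
	return h
}

// specLabel bands the raw (unrounded) value using the documented
// thresholds.
func specLabel(h float64) string {
	switch {
	case h < 0.15:
		return "broad generalist"
	case h < 0.35:
		return "balanced"
	case h < 0.7:
		return "focused specialist"
	default:
		return "narrow specialist"
	}
}

func main() {
	oneDir := map[string]int{"cmd": 1}
	fiveDirs := map[string]int{"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}
	fmt.Println(specLabel(herfindahl(oneDir)))   // narrow specialist
	fmt.Println(specLabel(herfindahl(fiveDirs))) // balanced
	fmt.Println(specLabel(0.149))                // broad generalist
}
```

The first two calls show the distinction the row draws against Gini: one file in one directory scores 1, while one file in each of five directories scores about 0.2.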
@@ -254,6 +255,31 @@ Two dev lenses are surfaced because commit count alone is a flawed proxy for con

**How to interpret**: "20 files concentrate 80% of all churn" describes where change lands — it can indicate a healthy core module under active development, or a bottleneck if combined with low bus factor. Cross-reference with the Churn Risk section before drawing conclusions.
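
A statement such as "20 files concentrate 80% of all churn" can be derived with a cumulative-share count. The sketch below is one plausible way to compute it, with a hypothetical `itemsForShare` helper. How gitcortex handles ties or the exact threshold comparison is not stated here, so the "at least 80%" rule is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// itemsForShare returns how many of the largest values are needed to
// reach at least pct percent of the total. Integer arithmetic avoids
// floating-point edge cases at the threshold.
func itemsForShare(values []int, pct int) int {
	sorted := append([]int(nil), values...) // do not mutate the input
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	total := 0
	for _, v := range sorted {
		total += v
	}
	if total == 0 {
		return 0
	}
	cum := 0
	for i, v := range sorted {
		cum += v
		if cum*100 >= total*pct {
			return i + 1
		}
	}
	return len(sorted)
}

func main() {
	churnPerFile := []int{5, 50, 10, 30, 5}
	fmt.Println(itemsForShare(churnPerFile, 80)) // 2
}
```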

## Extensions

File extensions aggregated from `ds.files`, ranked by **recent churn** (decay-weighted — see "Recent churn" below). The historical lens is the point: `cloc`/`tokei` answer "what languages exist on disk"; this answers "which extensions is the team spending effort on right now".
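
To make "decay-weighted" concrete, here is a minimal sketch assuming the conventional half-life form, where a change loses half its weight every `halfLifeDays`. The 90-day figure matches the `--churn-half-life` default, but the exact formula gitcortex applies is an assumption, and `decayWeight`, `recentChurn`, and `change` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// change is a hypothetical record: lines touched and age in days.
type change struct {
	Lines   int
	AgeDays float64
}

// decayWeight halves the weight every halfLifeDays (assumed formula).
func decayWeight(ageDays, halfLifeDays float64) float64 {
	return math.Exp2(-ageDays / halfLifeDays)
}

// recentChurn sums lines weighted by how recent each change is.
func recentChurn(changes []change, halfLifeDays float64) float64 {
	var sum float64
	for _, c := range changes {
		sum += float64(c.Lines) * decayWeight(c.AgeDays, halfLifeDays)
	}
	return sum
}

func main() {
	changes := []change{{100, 0}, {100, 90}, {100, 180}}
	fmt.Println(recentChurn(changes, 90)) // 175
}
```

Under this assumption, 100 lines changed today count fully, the same 100 lines from 90 days ago count as 50, and from 180 days ago as 25.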

**Extraction policy** (`extractExtension`):
- Last path segment (after the final `/`).
- Multi-dot names report the final segment: `foo.tar.gz` → `.gz`, `.eslintrc.json` → `.json`.
- Single-dot dotfiles keep their full name: `.gitignore` → `.gitignore`, `.env` → `.env`. Merging these into "(none)" would erase a meaningful group.
- No-dot names collapse into the `(none)` bucket: `Makefile`, `LICENSE`, `bin/run`.
- Extensions lowercased so `.PNG` and `.png` aggregate.
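
The policy above can be sketched in a few lines. `extensionBucket` is a hypothetical re-implementation written from the bullet list, not the project's `extractExtension`. Names with a trailing dot (for example `foo.`) are not covered by the policy, and this sketch leaves them as `.`.

```go
package main

import (
	"fmt"
	"strings"
)

// extensionBucket maps a path to its extension bucket following the
// listed policy (hypothetical helper for illustration).
func extensionBucket(p string) string {
	base := p[strings.LastIndex(p, "/")+1:] // last path segment
	dot := strings.LastIndex(base, ".")
	switch {
	case dot < 0:
		return "(none)" // Makefile, LICENSE, bin/run
	case dot == 0:
		return strings.ToLower(base) // .gitignore, .env keep their full name
	default:
		return strings.ToLower(base[dot:]) // foo.tar.gz reports .gz
	}
}

func main() {
	for _, p := range []string{
		"foo.tar.gz", ".eslintrc.json", ".gitignore",
		"bin/run", "assets/Logo.PNG",
	} {
		fmt.Printf("%s -> %s\n", p, extensionBucket(p))
	}
}
```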

**Per-bucket fields**:
- `files` — distinct file lineages that ever held this extension. A file renamed across extensions (foo.js → foo.ts) counts once in each bucket; totals across buckets can therefore exceed the dataset's file count in migration-heavy repos.
- `churn` — lifetime additions + deletions attributed to this extension specifically. A foo.js → foo.ts migration with 1000 lines of pre-rename churn and 500 post-rename does **not** collapse all 1500 onto `.ts`; `.js` keeps its 1000 and `.ts` gets 500. The attribution comes from capturing the path's extension at each change before `applyRenames` merges the lineage.
- `recent_churn` — same per-era semantics, decay-weighted (same half-life as other stats, set at load time). Leads the sort so a dormant extension with high lifetime churn won't displace an active one.
- `unique_devs` — distinct emails that touched any file that ever held this extension. **Over-counts across migrations**: a dev who only worked on `foo.js` pre-migration still appears under `.ts` if that file was migrated. Splitting devs per era would need per-commit dev tracking that `fileEntry` does not retain. Read this as "people with context on files that at some point were this extension" rather than "active contributors in this extension".
- `first_seen` / `last_seen` — min/max within the bucket's era, UTC date. For the `.js` bucket in a TypeScript migration, `last_seen` is the migration cutoff, not today's date.
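
The per-era semantics of `files` and `churn` can be sketched as follows: each change is credited to the extension its path had at the time, before any rename merging. The `change` and `bucket` types and the `Lineage` ID are hypothetical, and `path.Ext` is used as a simplification that skips the `(none)` bucket rule from the extraction policy.

```go
package main

import (
	"fmt"
	"path"
)

// change is a hypothetical per-commit file change.
type change struct {
	Lineage int    // ID shared by one file across renames (assumed)
	Path    string // path at the time of the change
	Lines   int    // additions + deletions
}

// bucket accumulates churn and the distinct lineages seen.
type bucket struct {
	Churn    int
	lineages map[int]bool
}

// Files reports how many distinct lineages ever held this extension.
func (b *bucket) Files() int { return len(b.lineages) }

// byExtension credits each change to the extension of its own path.
func byExtension(history []change) map[string]*bucket {
	out := map[string]*bucket{}
	for _, c := range history {
		ext := path.Ext(c.Path)
		b := out[ext]
		if b == nil {
			b = &bucket{lineages: map[int]bool{}}
			out[ext] = b
		}
		b.Churn += c.Lines
		b.lineages[c.Lineage] = true
	}
	return out
}

func main() {
	got := byExtension([]change{
		{1, "src/foo.js", 600},
		{1, "src/foo.js", 400},
		{1, "src/foo.ts", 500}, // same lineage after the rename
	})
	for _, ext := range []string{".js", ".ts"} {
		fmt.Println(ext, got[ext].Churn, got[ext].Files())
	}
}
```

With this input `.js` keeps 1000 lines and `.ts` gets 500, and the single lineage is counted once in each bucket, which is why bucket file totals can exceed the dataset's file count.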

**Reading signals**:
- `.yaml` recent churn high + unique_devs low → config owned by one person; schedule handoff before they leave.
- `.md` recent churn high → docs-heavy phase (release prep?) or repeated README rewrites.
- Cross-read with Directories: `.yaml` concentrated in one dir is config-as-code; `.yaml` spread across many dirs is config sprawl.

**What it does not do**: no language-family grouping (`.js`+`.ts`+`.tsx` stay distinct). Aggregate downstream if you need "frontend vs backend"; the tool does not prescribe the taxonomy. Generated-file buckets (`.lock`, `.pb.go`, `.min.js`) will dominate unless filtered via `--ignore` at extract time — the suspect-paths warning flags these.

## Repo Structure

A `tree(1)`-style view of the repository's directory layout, built from paths seen in history (`FileHotspots`), not from the filesystem at HEAD. Deleted files are included — the view answers "what shaped the codebase", not "what is present today".
1 change: 1 addition & 0 deletions docs/RUNBOOK.md
@@ -156,6 +156,7 @@ Section headers go to stderr, data to stdout. To capture only data:
./gitcortex stats --input data.jsonl --stat profile --email alice@company.com
./gitcortex stats --input data.jsonl --stat top-commits --top 20
./gitcortex stats --input data.jsonl --stat structure --tree-depth 3
./gitcortex stats --input data.jsonl --stat extensions --top 15
```

### Time filtering
13 changes: 13 additions & 0 deletions internal/report/profile_template.go
@@ -91,6 +91,19 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col
</div>
</div>

{{if .Profile.Extensions}}
<div style="margin-bottom:16px;">
<div style="font-size:13px; font-weight:600; margin-bottom:2px;">Extensions</div>
<div class="hint" style="margin-bottom:6px;">The dev's language/skill fingerprint by share of files touched. Extension attribution uses the file's current canonical path, so cross-extension renames (e.g. <code>.js → .ts</code>) credit pre-rename work to the new extension. · {{docRef "profile"}}</div>
<div style="display:flex; height:28px; border-radius:4px; overflow:hidden; gap:1px;">
{{range $i, $e := .Profile.Extensions}}<div style="flex:{{printf "%.0f" $e.Pct}}; background:{{index (list "#0969da" "#2da44e" "#8250df" "#bf8700" "#cf222e") $i}}; display:flex; align-items:center; justify-content:center; color:#fff; font-size:10px; min-width:30px; overflow:hidden;" title="{{$e.Ext}} — {{$e.Files}} files ({{printf "%.0f" $e.Pct}}%)">{{if gt $e.Pct 8.0}}{{$e.Ext}} {{printf "%.0f" $e.Pct}}%{{end}}</div>{{end}}
</div>
<div style="display:flex; flex-wrap:wrap; gap:8px; margin-top:4px; font-size:11px; color:#656d76;">
{{range $i, $e := .Profile.Extensions}}<span><span style="display:inline-block; width:8px; height:8px; border-radius:2px; background:{{index (list "#0969da" "#2da44e" "#8250df" "#bf8700" "#cf222e") $i}};"></span> {{$e.Ext}} ({{printf "%.0f" $e.Pct}}%)</span>{{end}}
</div>
</div>
{{end}}

<div style="margin-bottom:16px; font-size:13px;">
<div style="margin-bottom:2px;">
<span style="font-weight:600;">Contribution</span>
16 changes: 16 additions & 0 deletions internal/report/report.go
@@ -26,6 +26,7 @@ type ReportData struct {
Contributors []stats.ContributorStat
Hotspots []stats.FileStat
Directories []stats.DirStat
Extensions []stats.ExtensionStat
ActivityRaw []stats.ActivityBucket
ActivityYears []string
ActivityGrid [][]ActivityCell // [year][month 0-11]
@@ -341,6 +342,7 @@ func Generate(w io.Writer, ds *stats.Dataset, repoName string, topN int, sf stat
Contributors: stats.TopContributors(ds, topN),
Hotspots: stats.FileHotspots(ds, topN),
Directories: stats.DirectoryStats(ds, topN),
Extensions: stats.ExtensionStats(ds, topN),
ActivityRaw: actRaw,
ActivityYears: actYears,
ActivityGrid: actGrid,
@@ -392,6 +394,19 @@ func pctInt(val, max int) string {
return fmt.Sprintf("%.1f", float64(val)/float64(max)*100)
}

// pctFloat is the float-domain sibling of pct. Needed for metrics like
// RecentChurn that carry sub-1 fractional values after heavy decay
// (small repos, or --since restricting the window): casting through
// int64 truncates every bucket to 0 and the bar reads 0% across the
// board even though the table shows non-zero churn. Accepting
// float64 straight through preserves the relative scale.
func pctFloat(val, max float64) string {
	if max == 0 {
		return "0"
	}
	return fmt.Sprintf("%.1f", val/max*100)
}
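
The truncation problem described in the comment can be reproduced in isolation. In this sketch `pctTruncating` is a hypothetical stand-in for an integer-domain helper (the real `pct` signature is not shown in this diff), while `pctFloat` is copied from the hunk above.

```go
package main

import "fmt"

// pctTruncating is a hypothetical integer-domain helper used only to
// demonstrate the regression; it is not the project's pct.
func pctTruncating(val, max int64) string {
	if max == 0 {
		return "0"
	}
	return fmt.Sprintf("%.1f", float64(val)/float64(max)*100)
}

// pctFloat keeps the float domain so sub-1 values keep their scale.
func pctFloat(val, max float64) string {
	if max == 0 {
		return "0"
	}
	return fmt.Sprintf("%.1f", val/max*100)
}

func main() {
	recent, top := 0.25, 0.5 // heavily decayed RecentChurn values
	// Both casts truncate to 0, so the bar collapses to "0".
	fmt.Println(pctTruncating(int64(recent), int64(top))) // 0
	// The float version preserves the 1:2 ratio.
	fmt.Println(pctFloat(recent, top)) // 50.0
}
```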

func heatColor(val, max int) string {
if max == 0 || val == 0 {
return "#f0f0f0"
@@ -451,6 +466,7 @@ func actColor(commits, max int) string {
var funcMap = template.FuncMap{
"pct": pct,
"pctInt": pctInt,
"pctFloat": pctFloat,
"heatColor": heatColor,
"joinDevs": stats.JoinDevs,
"seq": seq,
29 changes: 29 additions & 0 deletions internal/report/report_test.go
@@ -436,6 +436,35 @@ func TestBuildLabelCountListOmitsEmpty(t *testing.T) {
}
}

// Regression: pct(int64(x), int64(y)) collapsed every sub-1 float to
// 0 before this helper existed, so extension/churn-risk bars all
// rendered as 0% on datasets with heavily decayed RecentChurn (small
// repos, aggressive --since filters). pctFloat preserves the relative
// scale.
func TestPctFloat(t *testing.T) {
	cases := []struct {
		val, max float64
		want     string
	}{
		// Sub-1 values: relative scale preserved (would all be 0 under int64 cast).
		{0.5, 1.0, "50.0"},
		{0.25, 0.5, "50.0"},
		{0.1, 0.9, "11.1"},
		// Mixed small + large.
		{50.0, 200.0, "25.0"},
		// max at zero → safe zero string, no NaN or division by zero.
		{5.0, 0.0, "0"},
		{0.0, 0.0, "0"},
		// val > max (can happen under rounding noise in sort+display).
		{10.0, 5.0, "200.0"},
	}
	for _, c := range cases {
		if got := pctFloat(c.val, c.max); got != c.want {
			t.Errorf("pctFloat(%v, %v) = %q, want %q", c.val, c.max, got, c.want)
		}
	}
}

func TestThousands(t *testing.T) {
cases := []struct {
in interface{}