diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index bc8016d..4995dc5 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -31,5 +31,5 @@ jobs:
run: ./gitcortex extract --repo .
- name: Quality gates
- run: ./gitcortex ci --fail-on-churn-risk 2500 --format github-actions
+ run: ./gitcortex ci --fail-on-churn-risk 5500 --format github-actions
# Add --fail-on-busfactor 1 when team grows
diff --git a/README.md b/README.md
index 8c2fc70..db44fad 100644
--- a/README.md
+++ b/README.md
@@ -196,7 +196,7 @@ Available stats:
| `churn-risk` | Files ranked by recent churn, classified into `cold` / `active` / `active-core` / `silo` / `legacy-hotspot` |
| `working-patterns` | Commit heatmap by hour and day of week |
| `dev-network` | Developer collaboration graph based on shared file ownership |
-| `profile` | Per-developer report: scope, contribution type, pace, collaboration, top files |
+| `profile` | Per-developer report: scope, specialization index, contribution type, pace, collaboration, top files |
| `top-commits` | Largest commits ranked by lines changed (includes message if extracted with `--include-commit-messages`) |
| `pareto` | Concentration (80% threshold) across files, devs (two lenses: commits and churn), and directories |
@@ -206,7 +206,7 @@ See [`docs/METRICS.md`](docs/METRICS.md) for how each metric is calculated, incl
### Developer profile
-Manager-facing report per developer showing scope, contribution type, pace, collaboration, and top files.
+Manager-facing report per developer showing scope, specialization, contribution type, pace, collaboration, and top files.
```bash
# All developers, ranked by commits
@@ -221,9 +221,10 @@ gitcortex stats --input data.jsonl --stat profile --format json
Each profile includes:
- **Scope**: top directories where the dev works (by unique files, %)
+- **Specialization**: Herfindahl concentration over the dev's full directory distribution; 1 = all files in one dir (narrow specialist), approaches 0 for broad generalists. Labelled `broad generalist` / `balanced` / `focused specialist` / `narrow specialist`. *Measures file distribution on disk, not domain expertise — a security engineer who refactored auth across four dirs looks like a generalist even though they are a domain specialist. See METRICS.md for the caveat in full.*
- **Contribution**: growth (add >> del), balanced, or refactor (del >> add)
- **Pace**: commits per active day
-- **Collaboration**: top devs sharing the same files
+- **Collaboration**: top devs sharing the same files (ranked by `shared_lines` = Σ min(linesA, linesB))
- **Weekend %**: off-hours work ratio
- **Top files**: most impacted files by churn
diff --git a/docs/METRICS.md b/docs/METRICS.md
index 843ceeb..f8625cd 100644
--- a/docs/METRICS.md
+++ b/docs/METRICS.md
@@ -203,6 +203,7 @@ Per-developer report combining multiple metrics.
| Pace | commits / active_days (smooths bursts — a dev with 100 commits on 2 days and silence for 28 shows pace=50, which reads as a steady rate but isn't) |
| Weekend % | commits on Saturday+Sunday / total commits × 100 |
| Scope | Top 5 directories by unique file count, as % of total files touched |
+| Specialization | Herfindahl index over the **full** per-directory file-count distribution: Σ pᵢ² where pᵢ is the share of the dev's files in directory i. 1 = all files in one directory (narrow specialist); 1/N for a uniform spread across N directories; approaches 0 as the distribution widens. Computed before the top-5 Scope truncation so it reflects actual breadth. Labels (see `specBroadGeneralistMax`, `specBalancedMax`, `specFocusedMax` constants): `< 0.15` broad generalist, `< 0.35` balanced, `< 0.7` focused specialist, `≥ 0.7` narrow specialist. Herfindahl, not Gini, because Gini would collapse "1 file in 1 dir" and "1 file in each of 5 dirs" to the same value (both have zero inequality among buckets), which misses the specialization distinction. **Measures file distribution, not domain expertise** — see caveat below. **Display vs raw:** CLI and HTML show the value rounded to 3 decimals (`%.3f`) for readability; JSON output preserves the full float64. Band classification runs against the raw float, so a value like 0.149 lands in `broad generalist` even though %.2f would have rounded it to `0.15`. JSON consumers that reproduce the banding must use the raw value, not a rounded version. |
| Contribution type | Based on del/add ratio: growth (<0.4), balanced (0.4-0.8), refactor (>0.8) |
| Collaborators | Top 5 devs sharing code with this dev. Ranked by `shared_lines` (Σ min(linesA, linesB) across shared files), tiebreak `shared_files`, then email. Same `shared_lines` semantics as the Developer Network metric — discounts trivial one-line touches so "collaborator" reflects real overlap. |
@@ -270,6 +271,9 @@ Every classification boundary is a named constant in `internal/stats/stats.go`.
| `contribBalancedRatio` | `0.4` | `0.4 ≤ del/add < 0.8` → `balanced`; below 0.4 → `growth`. |
| `refactorMinFiles` | `10` | Minimum files for a commit to be a mechanical-refactor candidate (coupling filter). |
| `refactorMaxChurnPerFile` | `5.0` | Mean churn per file below this in a candidate commit → treated as refactor; its pairs are excluded from coupling. |
+| `specBroadGeneralistMax` | `0.15` | Specialization Herfindahl `< 0.15` → `broad generalist` label in dev profile. |
+| `specBalancedMax` | `0.35` | `0.15 ≤ H < 0.35` → `balanced`. |
+| `specFocusedMax` | `0.7` | `0.35 ≤ H < 0.7` → `focused specialist`; `H ≥ 0.7` → `narrow specialist`. |
### Reproducibility
@@ -350,3 +354,13 @@ If you need the label to reflect true age, either extract without `--since` (the
- **Renames reverted (cycle A→B→A).** The resolver bails out of the cycle with the current path; it doesn't crash but the "canonical" is implementation-defined for cyclic inputs.
- **Repo with single file.** The median-based `cold` threshold degenerates (median is that file's churn); the single file is never classified `cold`.
- **All files with identical churn.** Median equals every value, `lowChurn = median × 0.5`, so nothing is `cold`. Everything falls into the bf/age/trend tree.
+
+### Dev specialization measures distribution, not expertise
+
+The `Specialization` number and its label (`broad generalist` … `narrow specialist`) describe **where the dev's files live on disk**, not their semantic area of expertise. The two diverge whenever the person's domain cuts across the directory structure rather than aligning with it:
+
+- A security engineer who audited and refactored auth across `api/`, `web/`, `gateway/`, and `services/` touches four dirs. Herfindahl is low, the label says "broad generalist" — but the person is a domain specialist whose domain happens to be cross-cutting.
+- A release engineer who maintains CI/CD config scattered across `.github/`, `docker/`, `scripts/`, and `deploy/` lands the same way.
+- Conversely, a generalist who happened to do a big one-off refactor of a single module in the recent window looks like a "narrow specialist" for the snapshot.
+
+The label is a shortcut for reading the Herfindahl value. Use it when directory structure aligns with domains (one dir per module); cross-reference with `TopFiles`, `Scope`, and `Collaborators` to confirm when the repo is organized along another axis (e.g. monorepo with service boundaries cutting across dirs, or a library where concerns are horizontal). The raw Herfindahl value is objective; the interpretation of the label is not.
diff --git a/internal/report/profile_template.go b/internal/report/profile_template.go
index 0c23585..8d12a8a 100644
--- a/internal/report/profile_template.go
+++ b/internal/report/profile_template.go
@@ -53,8 +53,8 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col
-
Scope
-
Where this developer works, by share of files touched per directory. One dominant bar = specialist; evenly split = generalist or cross-team.
+
Scope Specialization {{printf "%.3f" .Profile.Specialization}} — {{if lt .Profile.Specialization 0.15}}broad generalist{{else if lt .Profile.Specialization 0.35}}balanced{{else if lt .Profile.Specialization 0.7}}focused specialist{{else}}narrow specialist{{end}}
+
Where this developer works, by share of files touched per directory. The specialization number is the Herfindahl index over the full per-directory distribution: 1 = all files in a single directory, 1/N for a uniform spread across N directories (approaches 0 as N grows).
{{range $i, $s := .Profile.Scope}}
{{if gt $s.Pct 8.0}}{{$s.Dir}} {{printf "%.0f" $s.Pct}}%{{end}}
{{end}}
diff --git a/internal/report/template.go b/internal/report/template.go
index 1b643ab..8a7fb2d 100644
--- a/internal/report/template.go
+++ b/internal/report/template.go
@@ -294,6 +294,9 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col
Scope
{{range $i, $s := .Scope}}{{if $i}}, {{end}}{{$s.Dir}} ({{printf "%.0f" $s.Pct}}%){{end}}
+
Specialization
+
{{printf "%.3f" .Specialization}} ({{if lt .Specialization 0.15}}broad generalist{{else if lt .Specialization 0.35}}balanced{{else if lt .Specialization 0.7}}focused specialist{{else}}narrow specialist{{end}})
+
Contribution
{{if eq .ContribType "growth"}}{{.ContribType}}{{else if eq .ContribType "refactor"}}{{.ContribType}}{{else}}{{.ContribType}}{{end}} (ratio {{printf "%.2f" .ContribRatio}} · +{{.Additions}} −{{.Deletions}})
diff --git a/internal/stats/format.go b/internal/stats/format.go
index 806777b..54dc853 100644
--- a/internal/stats/format.go
+++ b/internal/stats/format.go
@@ -9,6 +9,22 @@ import (
"text/tabwriter"
)
+// specLabel turns a DevProfile.Specialization (Herfindahl) value into a
+// short human-readable classification. Thresholds live in stats.go as
+// named constants so templates can reuse the same values.
+func specLabel(h float64) string {
+ switch {
+ case h < specBroadGeneralistMax:
+ return "broad generalist"
+ case h < specBalancedMax:
+ return "balanced"
+ case h < specFocusedMax:
+ return "focused specialist"
+ default:
+ return "narrow specialist"
+ }
+}
+
func JoinDevs(devs []string) string {
if len(devs) <= 3 {
return strings.Join(devs, ", ")
@@ -397,6 +413,12 @@ func (f *Formatter) PrintProfiles(profiles []DevProfile) error {
fmt.Fprintf(f.w, "%s (%.0f%%)", s.Dir, s.Pct)
}
fmt.Fprintln(f.w)
+ // %.3f (not %.2f): labels are assigned at thresholds 0.15 / 0.35
+ // / 0.7 using the unrounded float. With %.2f a value like
+ // 0.149 displays as "0.15" and the "broad generalist" label
+ // reads as inconsistent with the shown number. %.3f keeps
+ // the boundary distinguishable (0.149 vs 0.150).
+ fmt.Fprintf(f.w, " Specialization: %.3f (%s)\n", p.Specialization, specLabel(p.Specialization))
fmt.Fprintf(f.w, " Contribution: %s (ratio %.2f — add: %d, del: %d)\n", p.ContribType, p.ContribRatio, p.Additions, p.Deletions)
fmt.Fprintf(f.w, " Pace: %.1f commits/active day\n", p.Pace)
fmt.Fprintf(f.w, " Collaboration: ")
diff --git a/internal/stats/stats.go b/internal/stats/stats.go
index 3620f18..e80ed54 100644
--- a/internal/stats/stats.go
+++ b/internal/stats/stats.go
@@ -34,6 +34,17 @@ const (
// additions. Strict < threshold: a commit with mean exactly 5.0 is
// NOT filtered.
refactorMaxChurnPerFile = 5.0
+
+ // Developer specialization labels, applied to DevProfile.Specialization
+ // (Herfindahl over per-directory file distribution). Tuned so that
+ // plausible repo shapes land in the expected band:
+ // uniform spread over 7+ dirs → broad generalist
+ // 2-4 dirs with one somewhat dominant → balanced
+ // one dir clearly dominant (~60-85% of files) → focused specialist
+ // ≥ 85% of files in one dir → narrow specialist
+ specBroadGeneralistMax = 0.15
+ specBalancedMax = 0.35
+ specFocusedMax = 0.7
)
type ContributorStat struct {
@@ -120,6 +131,51 @@ type DevEdge struct {
Weight float64 // shared_files / max(files_A, files_B) * 100 (legacy)
}
+// herfindahl returns the Herfindahl–Hirschman concentration index of a
+// sample of non-negative values: Σ (pᵢ)² where pᵢ = valueᵢ / Σ value.
+//
+// Unlike Gini (which measures inequality between buckets and so returns 0
+// for both "100% in 1 bucket" and "evenly across N buckets"), Herfindahl
+// distinguishes these cases:
+// 100% in 1 bucket → 1 (maximal concentration / specialization)
+// evenly across N buckets → 1/N (approaches 0 as N grows)
+// This matches the specialization semantics needed here: a developer
+// working in a single directory is maximally specialized, a developer
+// spread across many directories is a generalist.
+//
+// Returns 0 for empty input or zero-sum input; returns 1 for a single
+// non-zero bucket. Returns full float64 precision — callers that need
+// to display the value should round at format time (the CLI and HTML
+// templates use %.2f). Rounding inside this function caused quantization-
+// induced label misclassification at band boundaries: a true value of
+// 0.1496 would round to 0.150 and flip from "broad generalist" to
+// "balanced".
+func herfindahl(values []int) float64 {
+ if len(values) == 0 {
+ return 0
+ }
+ var sum int64
+ for _, v := range values {
+ if v < 0 {
+ v = 0
+ }
+ sum += int64(v)
+ }
+ if sum == 0 {
+ return 0
+ }
+ total := float64(sum)
+ var h float64
+ for _, v := range values {
+ if v <= 0 {
+ continue
+ }
+ p := float64(v) / total
+ h += p * p
+ }
+ return h
+}
+
type StatsFlags struct {
CouplingMinChanges int
NetworkMinFiles int
@@ -689,6 +745,7 @@ type DevProfile struct {
LastDate string
TopFiles []DevFileContrib
Scope []DirScope
+ Specialization float64 // Gini over dir file-count distribution: 0 = broad generalist, 1 = single-dir specialist
ContribRatio float64 // del/add — 0=growth, ~1=rewrite, >1=cleanup
ContribType string // "growth", "balanced", "refactor"
Pace float64 // commits per active day
@@ -886,11 +943,17 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile {
wpct = math.Round(float64(weekend)/float64(total)*1000) / 10
}
- // Scope: top directories by file count
+ // Scope: top directories by file count. Root-level files (no "/"
+ // in path) collapse into "." so they form a single bucket instead
+ // of each filename becoming its own pseudo-directory. Matches the
+ // convention in DirectoryStats and keeps Specialization honest —
+ // otherwise a dev who only touches README, Makefile, go.mod, etc.
+ // appears as a broad generalist across N pseudo-dirs instead of
+ // a narrow specialist on the repo root.
dirCount := make(map[string]int)
if files, ok := devFiles[email]; ok {
for path := range files {
- dir := path
+ dir := "."
if idx := strings.LastIndex(path, "/"); idx >= 0 {
dir = path[:idx]
}
@@ -912,6 +975,16 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile {
}
return scope[i].Dir < scope[j].Dir
})
+ // Specialization index: Herfindahl over the FULL per-directory
+ // file-count distribution (before truncation to top 5). 1.0 = all
+ // files in one directory (narrow specialist); ~0 = spread across
+ // many dirs (broad generalist). See herfindahl() for why this
+ // captures concentration rather than inequality.
+ specValues := make([]int, 0, len(dirCount))
+ for _, count := range dirCount {
+ specValues = append(specValues, count)
+ }
+ specialization := herfindahl(specValues)
if len(scope) > 5 {
scope = scope[:5]
}
@@ -962,7 +1035,7 @@ func DevProfiles(ds *Dataset, filterEmail string) []DevProfile {
Commits: cs.Commits, Additions: cs.Additions, Deletions: cs.Deletions,
LinesChanged: cs.Additions + cs.Deletions, FilesTouched: cs.FilesTouched,
ActiveDays: cs.ActiveDays, FirstDate: cs.FirstDate, LastDate: cs.LastDate,
- TopFiles: topFiles, Scope: scope,
+ TopFiles: topFiles, Scope: scope, Specialization: specialization,
ContribRatio: contribRatio, ContribType: contribType,
Pace: pace, Collaborators: collabs,
MonthlyActivity: monthly, WorkGrid: grid, WeekendPct: wpct,
diff --git a/internal/stats/stats_test.go b/internal/stats/stats_test.go
index 743bf48..ff02d7d 100644
--- a/internal/stats/stats_test.go
+++ b/internal/stats/stats_test.go
@@ -1,6 +1,7 @@
package stats
import (
+ "bytes"
"fmt"
"os"
"strings"
@@ -1897,6 +1898,291 @@ func TestStreamLoadFullPipeline(t *testing.T) {
}
}
+func TestHerfindahlHelper(t *testing.T) {
+ cases := []struct {
+ name string
+ in []int
+ want float64
+ }{
+ {"empty", nil, 0},
+ {"single", []int{5}, 1}, // 1 bucket = fully concentrated
+ {"single zero", []int{0}, 0}, // sum=0 short-circuits
+ {"zeros only", []int{0, 0, 0}, 0},
+ {"uniform 2", []int{5, 5}, 0.5}, // 0.25 + 0.25
+ {"uniform 5", []int{1, 1, 1, 1, 1}, 0.2}, // 5 × (1/5)²
+ {"70/30", []int{7, 3}, 0.58}, // 0.49 + 0.09
+ {"90/10", []int{9, 1}, 0.82}, // 0.81 + 0.01
+ {"100-in-one", []int{0, 0, 100}, 1}, // single non-zero bucket
+ }
+ for _, c := range cases {
+ t.Run(c.name, func(t *testing.T) {
+ got := herfindahl(c.in)
+ diff := got - c.want
+ if diff < 0 {
+ diff = -diff
+ }
+ if diff > 0.005 {
+ t.Errorf("herfindahl(%v) = %.3f, want ≈ %.3f", c.in, got, c.want)
+ }
+ })
+ }
+}
+
+func TestPrintProfilesSpecializationDisplayPrecision(t *testing.T) {
+ // The Specialization display must show enough decimals that the
+ // rendered number is self-consistent with the band label. At %.2f a
+ // true value of 0.149 rounds to "0.15" and the shown label
+ // "broad generalist" (correct: 0.149 < 0.15) appears to contradict
+ // the displayed number (0.15 is NOT < 0.15). Using %.3f renders
+ // "0.149" and the reader can verify the classification at a glance.
+ p := DevProfile{
+ Name: "N", Email: "n@x", Commits: 1, ActiveDays: 1,
+ FirstDate: "2024-01-01", LastDate: "2024-01-01",
+ Specialization: 0.149, // just under specBroadGeneralistMax
+ }
+ var buf bytes.Buffer
+ f := NewFormatter(&buf, "table")
+ if err := f.PrintProfiles([]DevProfile{p}); err != nil {
+ t.Fatalf("PrintProfiles: %v", err)
+ }
+ out := buf.String()
+ if !strings.Contains(out, "0.149") {
+ t.Errorf("output should contain %q to match the classification band (%%.3f), got:\n%s", "0.149", out)
+ }
+ if !strings.Contains(out, "broad generalist") {
+ t.Errorf("output should contain label 'broad generalist' for H=0.149, got:\n%s", out)
+ }
+}
+
+func TestHerfindahlPreservesPrecision(t *testing.T) {
+ // herfindahl must return the full float64 value; rounding happens only
+ // at display time. A prior version rounded to 3 decimals inside the
+ // function, which would misclassify boundary cases (e.g. true 0.1496
+ // rounding to 0.150 and flipping from "broad generalist" to "balanced").
+ // Three uniform buckets produce H = 1/3 exactly; the stored value must
+ // be the full-precision float, not a rounded approximation.
+ h := herfindahl([]int{1, 1, 1})
+ if h == 0.333 {
+ t.Fatal("herfindahl returned 0.333 — the function is rounding internally again")
+ }
+ oneThird := 1.0 / 3.0
+ if diff := h - oneThird; diff < -1e-12 || diff > 1e-12 {
+ t.Errorf("herfindahl([1,1,1]) = %.18f, want 1/3 = %.18f", h, oneThird)
+ }
+}
+
+func TestSpecLabelBandsBoundaryPrecision(t *testing.T) {
+ // Thresholds are strict <, so a value exactly at the threshold lands in
+ // the next band. Before the precision fix, internal rounding could move
+ // a value JUST under a threshold (say 0.14999) up to 0.150, crossing
+ // the band. Verify that values near boundaries classify by their true
+ // precision, not by a rounded approximation.
+ cases := []struct {
+ h float64
+ want string
+ }{
+ {0.149999, "broad generalist"}, // just under specBroadGeneralistMax
+ {0.150001, "balanced"}, // just over
+ {0.349999, "balanced"},
+ {0.350001, "focused specialist"},
+ {0.699999, "focused specialist"},
+ {0.700001, "narrow specialist"},
+ }
+ for _, c := range cases {
+ if got := specLabel(c.h); got != c.want {
+ t.Errorf("specLabel(%.6f) = %q, want %q (boundary precision)", c.h, got, c.want)
+ }
+ }
+}
+
+func TestSpecLabelBands(t *testing.T) {
+ // Guard the four-band classification: boundaries are defined by the
+ // specBroadGeneralistMax / specBalancedMax / specFocusedMax constants.
+ // Constants drift without this test would silently change label output
+ // in both CLI and HTML.
+ cases := []struct {
+ h float64
+ want string
+ }{
+ {0.0, "broad generalist"},
+ {0.14, "broad generalist"},
+ {specBroadGeneralistMax, "balanced"}, // boundary: < is strict
+ {0.34, "balanced"},
+ {specBalancedMax, "focused specialist"},
+ {0.69, "focused specialist"},
+ {specFocusedMax, "narrow specialist"},
+ {1.0, "narrow specialist"},
+ }
+ for _, c := range cases {
+ if got := specLabel(c.h); got != c.want {
+ t.Errorf("specLabel(%.3f) = %q, want %q", c.h, got, c.want)
+ }
+ }
+}
+
+func TestDevProfilesSpecialization(t *testing.T) {
+ // Three devs with deliberately distinct scope patterns:
+ // - narrow: 100% in one dir
+ // - focused: 70/30 across two dirs
+ // - broad: evenly spread across 5 dirs
+ // Specialization (Gini over per-dir file counts) must rank them
+ // narrow > focused > broad.
+ t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC)
+ ds := &Dataset{
+ Earliest: t1, Latest: t1,
+ commits: map[string]*commitEntry{
+ "c1": {email: "narrow@x", date: t1, add: 10, del: 0, files: 1},
+ "c2": {email: "focused@x", date: t1, add: 10, del: 0, files: 1},
+ "c3": {email: "broad@x", date: t1, add: 10, del: 0, files: 1},
+ },
+ contributors: map[string]*ContributorStat{
+ "narrow@x": {Email: "narrow@x", Name: "N", Commits: 1, ActiveDays: 1, FilesTouched: 5, Additions: 10},
+ "focused@x": {Email: "focused@x", Name: "F", Commits: 1, ActiveDays: 1, FilesTouched: 10, Additions: 10},
+ "broad@x": {Email: "broad@x", Name: "B", Commits: 1, ActiveDays: 1, FilesTouched: 5, Additions: 10},
+ },
+ files: map[string]*fileEntry{},
+ }
+ // narrow@x: 5 files all in one dir
+ for i := 0; i < 5; i++ {
+ path := fmt.Sprintf("auth/f%d.go", i)
+ ds.files[path] = &fileEntry{commits: 1, devLines: map[string]int64{"narrow@x": 10}, devCommits: map[string]int{"narrow@x": 1}, monthChurn: map[string]int64{}}
+ }
+ // focused@x: 7 in one dir, 3 in another
+ for i := 0; i < 7; i++ {
+ path := fmt.Sprintf("api/f%d.go", i)
+ ds.files[path] = &fileEntry{commits: 1, devLines: map[string]int64{"focused@x": 10}, devCommits: map[string]int{"focused@x": 1}, monthChurn: map[string]int64{}}
+ }
+ for i := 0; i < 3; i++ {
+ path := fmt.Sprintf("web/f%d.go", i)
+ ds.files[path] = &fileEntry{commits: 1, devLines: map[string]int64{"focused@x": 10}, devCommits: map[string]int{"focused@x": 1}, monthChurn: map[string]int64{}}
+ }
+ // broad@x: 1 file in each of 5 different dirs
+ for i, d := range []string{"a", "b", "c", "d", "e"} {
+ path := fmt.Sprintf("%s/f%d.go", d, i)
+ ds.files[path] = &fileEntry{commits: 1, devLines: map[string]int64{"broad@x": 10}, devCommits: map[string]int{"broad@x": 1}, monthChurn: map[string]int64{}}
+ }
+
+ profiles := DevProfiles(ds, "")
+ get := func(email string) float64 {
+ for _, p := range profiles {
+ if p.Email == email {
+ return p.Specialization
+ }
+ }
+ t.Fatalf("missing profile %s", email)
+ return 0
+ }
+ narrow := get("narrow@x")
+ focused := get("focused@x")
+ broad := get("broad@x")
+
+ // Herfindahl semantics:
+ // narrow@x (5 files all in 1 dir) → H = 1
+ // focused@x (7 in api, 3 in web) → H = 0.49 + 0.09 = 0.58
+ // broad@x (1 file in each of 5 dirs) → H = 5 × 0.04 = 0.2
+ // Ordering: narrow > focused > broad. The old Gini collapsed narrow
+ // and broad to 0 and was the reason this test was rewritten.
+ if narrow != 1.0 {
+ t.Errorf("narrow@x (1 dir) specialization = %.3f, want 1.0 (fully concentrated)", narrow)
+ }
+ if !(focused > 0.5 && focused < 0.65) {
+ t.Errorf("focused@x (7+3 split) specialization = %.3f, want ~0.58", focused)
+ }
+ if !(broad > 0.15 && broad < 0.25) {
+ t.Errorf("broad@x (5 dirs uniform) specialization = %.3f, want ~0.2", broad)
+ }
+ if !(narrow > focused && focused > broad) {
+ t.Errorf("ordering failed: narrow=%.2f focused=%.2f broad=%.2f; want narrow > focused > broad",
+ narrow, focused, broad)
+ }
+}
+
+func TestDevProfilesSpecializationRootFilesBucket(t *testing.T) {
+ // Bug reported in review: when a dev touches only repo-root files
+ // (no slash in path), DevProfiles used to treat each filename as its
+ // own "directory". A dev with README, Makefile, go.mod, LICENSE
+ // ended up with 4 pseudo-dirs × 1 file → Herfindahl = 0.25
+ // ("balanced") instead of 1.0 ("narrow specialist on the repo root").
+ // Fix collapses root-level files into the "." bucket, matching the
+ // convention in DirectoryStats.
+ t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC)
+ ds := &Dataset{
+ Earliest: t1, Latest: t1,
+ commits: map[string]*commitEntry{"c1": {email: "root@x", date: t1, add: 10, del: 0, files: 4}},
+ contributors: map[string]*ContributorStat{
+ "root@x": {Email: "root@x", Name: "R", Commits: 1, ActiveDays: 1, FilesTouched: 4, Additions: 10},
+ },
+ files: map[string]*fileEntry{
+ "README.md": {commits: 1, devLines: map[string]int64{"root@x": 5}, devCommits: map[string]int{"root@x": 1}, monthChurn: map[string]int64{}},
+ "Makefile": {commits: 1, devLines: map[string]int64{"root@x": 5}, devCommits: map[string]int{"root@x": 1}, monthChurn: map[string]int64{}},
+ "go.mod": {commits: 1, devLines: map[string]int64{"root@x": 5}, devCommits: map[string]int{"root@x": 1}, monthChurn: map[string]int64{}},
+ "LICENSE": {commits: 1, devLines: map[string]int64{"root@x": 5}, devCommits: map[string]int{"root@x": 1}, monthChurn: map[string]int64{}},
+ },
+ }
+ profiles := DevProfiles(ds, "")
+ if len(profiles) != 1 {
+ t.Fatalf("profiles = %d", len(profiles))
+ }
+ p := profiles[0]
+ // All four files are at the root, so they must collapse into one
+ // bucket named ".". Specialization must be 1.0 (narrow specialist).
+ if p.Specialization != 1.0 {
+ t.Errorf("root-only dev specialization = %.3f, want 1.0 (all four files collapse to one bucket)", p.Specialization)
+ }
+ if len(p.Scope) != 1 {
+ t.Fatalf("Scope = %d entries, want 1 (the '.' bucket)", len(p.Scope))
+ }
+ if p.Scope[0].Dir != "." {
+ t.Errorf("Scope[0].Dir = %q, want \".\"", p.Scope[0].Dir)
+ }
+ if p.Scope[0].Files != 4 {
+ t.Errorf("Scope[0].Files = %d, want 4", p.Scope[0].Files)
+ }
+}
+
+func TestDevProfilesSpecializationEdgeCases(t *testing.T) {
+ t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC)
+
+ // Case 1: dev listed as contributor but no files touched.
+ // dirCount empty → Herfindahl returns 0. Label falls through to
+ // "broad generalist", which is semantically stretchy but consistent
+ // with "no signal means no specialization".
+ ds := &Dataset{
+ Earliest: t1, Latest: t1,
+ commits: map[string]*commitEntry{"c1": {email: "ghost@x", date: t1, add: 0, del: 0, files: 0}},
+ contributors: map[string]*ContributorStat{
+ "ghost@x": {Email: "ghost@x", Name: "G", Commits: 1, ActiveDays: 1, FilesTouched: 0},
+ },
+ files: map[string]*fileEntry{},
+ }
+ profiles := DevProfiles(ds, "")
+ if len(profiles) != 1 {
+ t.Fatalf("profiles = %d", len(profiles))
+ }
+ if profiles[0].Specialization != 0 {
+ t.Errorf("no-files dev specialization = %.2f, want 0", profiles[0].Specialization)
+ }
+
+ // Case 2: dev with 1 file in 1 dir. Should be maximally specialized.
+ ds2 := &Dataset{
+ Earliest: t1, Latest: t1,
+ commits: map[string]*commitEntry{"c1": {email: "solo@x", date: t1, add: 10, del: 0, files: 1}},
+ contributors: map[string]*ContributorStat{
+ "solo@x": {Email: "solo@x", Name: "S", Commits: 1, ActiveDays: 1, FilesTouched: 1, Additions: 10},
+ },
+ files: map[string]*fileEntry{
+ "auth/login.go": {commits: 1, devLines: map[string]int64{"solo@x": 10}, devCommits: map[string]int{"solo@x": 1}, monthChurn: map[string]int64{}},
+ },
+ }
+ profiles = DevProfiles(ds2, "")
+ if len(profiles) != 1 {
+ t.Fatalf("profiles = %d", len(profiles))
+ }
+ if profiles[0].Specialization != 1.0 {
+ t.Errorf("single-file-single-dir specialization = %.2f, want 1.0 (fully concentrated — the canonical narrow specialist)", profiles[0].Specialization)
+ }
+}
+
// buildSyntheticLargeDataset creates a deterministic dataset shaped like a
// mid-size repo: thousands of devs, tens of thousands of files, with each
// file touched by a few devs. Used by BenchmarkDevProfiles* to exercise the