Add extensions stat: file extensions ranked by recent churn #5
Historical lens on language distribution — "which extension is the
team spending effort on now", not "what exists in the tree" (which
cloc/tokei answer from the filesystem).
ExtensionStats aggregates ds.files by extension bucket with files,
churn, recent_churn, unique_devs, first_seen, last_seen. Sort is
recent_churn desc so dormant extensions with high lifetime churn
can't displace active ones.
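A minimal sketch of the ranking described above, assuming a per-bucket struct whose Go field names and types are invented for illustration (the real ExtensionStats layout is not shown in this PR text). The tiebreak on extension name is also an assumption, added only to keep the output deterministic, and the sample numbers are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// ExtStat mirrors a subset of the fields named above.
// Field names and types are assumptions, not the project's actual struct.
type ExtStat struct {
	Ext         string
	Files       int
	Churn       int64
	RecentChurn float64
}

// sortByRecentChurn orders buckets by RecentChurn descending, so a dormant
// extension with large lifetime Churn cannot outrank an active one.
// The Ext-ascending tiebreak is an assumption.
func sortByRecentChurn(stats []ExtStat) {
	sort.SliceStable(stats, func(i, j int) bool {
		if stats[i].RecentChurn != stats[j].RecentChurn {
			return stats[i].RecentChurn > stats[j].RecentChurn
		}
		return stats[i].Ext < stats[j].Ext
	})
}

func main() {
	// Illustrative values only.
	stats := []ExtStat{
		{Ext: ".php", Files: 120, Churn: 90000, RecentChurn: 3.5},
		{Ext: ".go", Files: 40, Churn: 12000, RecentChurn: 410.2},
	}
	sortByRecentChurn(stats)
	for _, s := range stats {
		fmt.Printf("%-6s files=%d churn=%d recent=%.1f\n", s.Ext, s.Files, s.Churn, s.RecentChurn)
	}
}
```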
extractExtension policy: last segment after the final dot, single-dot
dotfiles kept verbatim (".gitignore" as its own bucket), multi-dot
takes the final segment (.env.local → .local), extensionless and
degenerate inputs collapse into "(none)".
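The policy above could be sketched roughly as follows. This is not the project's extractExtension: the lowercasing step and the treatment of a trailing dot as "degenerate" are assumptions, and the function body is a reconstruction from the prose only.

```go
package main

import (
	"fmt"
	"strings"
)

// noneBucket is the catch-all bucket named in the description.
const noneBucket = "(none)"

// extractExtension is a hypothetical reconstruction of the stated policy.
func extractExtension(path string) string {
	// Work on the basename only.
	base := path
	if i := strings.LastIndex(base, "/"); i >= 0 {
		base = base[i+1:]
	}
	dot := strings.LastIndex(base, ".")
	// No dot, empty basename, or a trailing dot: treated as degenerate (assumption).
	if dot < 0 || dot == len(base)-1 {
		return noneBucket
	}
	// dot == 0 keeps single-dot dotfiles verbatim (".gitignore");
	// otherwise only the final segment survives (".env.local" -> ".local").
	// Lowercasing is an assumption.
	return strings.ToLower(base[dot:])
}

func main() {
	for _, p := range []string{"src/main.go", ".gitignore", "config/.env.local", "Makefile", "README."} {
		fmt.Printf("%-20q -> %s\n", p, extractExtension(p))
	}
}
```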
Included in CLI --stat default sweep (output is compact), new HTML
section below Directories, new Extensions section in METRICS.md
documenting the policy + reading signals + explicit non-goals
(no language-family grouping).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5143f4d487
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The pct template helper returns a string, so wrapping its output in printf "%.0f" produced %!f(string=...), which html/template then replaced with the ZgotmplZ safe-escape sentinel in CSS contexts. The rendered width was never a valid value and the bars collapsed.
Drop the printf wrapper in both places (the Extensions table that just shipped, and ChurnRisk, which inherited the same pattern) and pass pct straight through: its "68.6" output is already a valid CSS numeric value. Hotspots already did this and was the only churn-bar surface rendering correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codex review
💡 Codex Review
Reviewed commit: 3f61643f3f
ds.files is keyed by canonical (post-rename) path, so bucketing on extractExtension(path) lumped all of a foo.js → foo.ts lineage onto .ts and left .js at zero. Migration-heavy repos were misattributed.
Fix: capture the extension at each change time (pre-rename) into a new fileEntry.byExt map populated in lockstep with additions / deletions / recentChurn. mergeFileEntry folds byExt across rename collapse. ExtensionStats consumes the per-era split, falling back to the canonical path's extension when byExt is nil (hand-built test dataset).
Caveats documented in METRICS.md: "files" counts once per extension the lineage ever held (the total across buckets can exceed len(ds.files)); "unique_devs" is still a lineage-wide union, so a dev who only touched the file before the migration appears under the post-migration extension too. Fixing this would need per-ext dev tracking on extContribution.
Tests: rename split on the consumer side (TestExtensionStatsHonorsPerEraSplit), and three TestMergeFileEntryByExt* variants pinning the merger so an accidental drop/clobber in the producer path can't pass unit tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
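To make the per-era bookkeeping concrete, here is a rough sketch. The names fileEntry, byExt, extContribution, and mergeFileEntry come from the commit message; every field, the record method, and the merge body are assumptions about how such a structure might look, not the actual code.

```go
package main

import "fmt"

// extContribution holds the share of a lineage's churn earned under one extension.
// Field layout is an assumption.
type extContribution struct {
	additions, deletions int64
	recentChurn          float64
}

// fileEntry is a simplified stand-in for the per-lineage record.
type fileEntry struct {
	additions, deletions int64
	recentChurn          float64
	byExt                map[string]*extContribution
}

// bucket returns the contribution for ext, creating it on first use.
func (f *fileEntry) bucket(ext string) *extContribution {
	if f.byExt == nil {
		f.byExt = make(map[string]*extContribution)
	}
	c := f.byExt[ext]
	if c == nil {
		c = &extContribution{}
		f.byExt[ext] = c
	}
	return c
}

// record (hypothetical) applies one change, crediting the extension the path
// had at change time in lockstep with the lineage totals.
func (f *fileEntry) record(ext string, add, del int64, recent float64) {
	f.additions += add
	f.deletions += del
	f.recentChurn += recent
	c := f.bucket(ext)
	c.additions += add
	c.deletions += del
	c.recentChurn += recent
}

// mergeFileEntry folds src into dst when a rename collapses two entries,
// summing per-extension buckets instead of dropping or overwriting them.
func mergeFileEntry(dst, src *fileEntry) {
	dst.additions += src.additions
	dst.deletions += src.deletions
	dst.recentChurn += src.recentChurn
	for ext, c := range src.byExt {
		d := dst.bucket(ext)
		d.additions += c.additions
		d.deletions += c.deletions
		d.recentChurn += c.recentChurn
	}
}

func main() {
	oldName := &fileEntry{} // history recorded while the file was foo.js
	oldName.record(".js", 10, 2, 0.5)
	newName := &fileEntry{} // history recorded after the rename to foo.ts
	newName.record(".ts", 5, 1, 3.0)
	mergeFileEntry(newName, oldName)
	for _, ext := range []string{".js", ".ts"} {
		c := newName.byExt[ext]
		fmt.Println(ext, c.additions, c.deletions, c.recentChurn)
	}
}
```

With this shape, a consumer can iterate byExt to credit each extension separately, and fall back to the canonical path's extension when byExt is nil.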
@codex review
💡 Codex Review
Reviewed commit: f111e731f1
pct cast RecentChurn through int64 before scaling, so any dataset where every bucket's RecentChurn was below 1 (heavy decay, aggressive --since, or --churn-half-life shrunk to a day or two) collapsed every bar to 0%. The visualization became useless precisely when the table still carried meaningful relative differences.
Add pctFloat(val, max float64) and route the Extensions and ChurnRisk bars through it.
Verified end-to-end with gitcortex report --churn-half-life 1 ... where Extensions buckets ranged 0.4 → 0.1 RecentChurn and bars now render at 100% / 25% / 0% (proportional) instead of flat-zero.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
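The truncation problem is easy to see side by side. In this sketch pctTruncating is a hypothetical reconstruction of the old behavior, and pctFloat only shares its name and parameter list with the helper named in the commit; the string return type and one-decimal formatting are assumptions.

```go
package main

import "fmt"

// pctTruncating reconstructs the described bug: values are cast to int64
// before scaling, so anything below 1 becomes 0.
func pctTruncating(val, max float64) string {
	v, m := int64(val), int64(max)
	if m <= 0 {
		return "0"
	}
	return fmt.Sprintf("%d", v*100/m)
}

// pctFloat scales in floating point, preserving relative differences
// between sub-1 values. Return type and formatting are assumptions.
func pctFloat(val, max float64) string {
	if max <= 0 {
		return "0"
	}
	return fmt.Sprintf("%.1f", val/max*100)
}

func main() {
	max := 0.4
	for _, v := range []float64{0.4, 0.1, 0.0} {
		fmt.Printf("recent=%.1f truncating=%s%% float=%s%%\n", v, pctTruncating(v, max), pctFloat(v, max))
	}
}
```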
@codex review
💡 Codex Review
Reviewed commit: e3b14e40b5
LoadMultiJSONL prepends "<stem>:" to every tracked path so multi-repo
reports can disambiguate colliding filenames. For nested paths the
slash-split in extractExtension already discards the prefix, but a
root-level extensionless file (Makefile, LICENSE) keeps it — and if
the stem contains a dot (repo.v1, project.2024), LastIndex(".")
picked the stem's dot and emitted a bogus bucket like ".v1:makefile"
instead of "(none)". Silent corruption of counts and ranking in
multi-repo mode.
The fix lives in extractExtension so that the producer (reader.go ingest)
and the fallback in ExtensionStats both benefit from one change.
Stripping on the first ":" accepts a theoretical false positive for
filenames that genuinely contain ":". Such names are rare on POSIX, not
permitted on Windows, and in practice absent from every ds.files key
except multi-input prefixes.
Verified end-to-end: reproducing the case by copying a fixture to
pi-hole.v1.jsonl and running multi-input stats showed zero ".v1:*"
buckets and a correctly populated "(none)" bucket. Unit test covers
7 prefix patterns including nested paths (where the fix is a no-op)
and empty basenames.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
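A sketch of where the prefix strip could sit relative to the slash split and the dot search. As before, this is a reconstruction from the commit message rather than the shipped function; lowercasing and the trailing-dot rule are assumptions, and the sample paths are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// extractExtension (hypothetical) with multi-repo "<stem>:" prefix handling.
func extractExtension(path string) string {
	base := path
	// Nested paths: the slash split already discards any "<stem>:" prefix.
	if i := strings.LastIndex(base, "/"); i >= 0 {
		base = base[i+1:]
	}
	// Root-level files still carry the prefix; drop everything through the
	// first ":" so a dot inside the stem cannot be mistaken for an extension.
	if i := strings.Index(base, ":"); i >= 0 {
		base = base[i+1:]
	}
	dot := strings.LastIndex(base, ".")
	if dot < 0 || dot == len(base)-1 {
		return "(none)"
	}
	return strings.ToLower(base[dot:])
}

func main() {
	for _, p := range []string{
		"repo.v1:Makefile",
		"repo.v1:src/main.go",
		"project.2024:LICENSE",
		"repo.v1:.gitignore",
		"repo.v1:",
	} {
		fmt.Printf("%-24q -> %s\n", p, extractExtension(p))
	}
}
```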
@codex review
Codex Review: Didn't find any major issues. Already looking forward to the next diff.
Consolidates the "what data leaves the repo" story that was previously scattered across README (JSONL example), RUNBOOK (per-record schema), and METRICS.md (data-flow diagram). Readers evaluating adoption on a sensitive repo had to stitch three docs together to answer "does this ever read source code" and "what's opt-in vs default".
New section covers: the two git commands run, a per-field table for commit and commit_file records showing source + which stat consumes each field, an explicit "not collected" list (file contents, messages off by default, refs other than the branch, zero network), and the four opt-in flags that change what ships.
Verified the "consumed by" column against the source: committer_* feeds emitDev (so committers appear in the dev registry) but no other stat reads it from the commit record; old_hash/new_hash/old/new_size are written by extract but never read by stats.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DevProfile now surfaces the top 5 extensions a dev has worked on: their language/skill fingerprint. Answers questions the rest of the profile can't: "who on the team writes .rs?", "is this dev .yaml-heavy (infra) or .go-heavy (app code)?", "did Carol stop touching .sql after the migration?".
Aggregation reuses the devFiles map already built for Scope and TopFiles, so there is no new ingest cost. Sort is Files desc (tiebreak Churn desc, Ext asc) so the displayed Pct is monotonic in both CLI and HTML bar widths, matching Scope's UX. Churn is still exposed on the DevExtContrib struct for JSON consumers who want a churn-ranked view.
Rendered in three surfaces: CLI PrintProfiles adds an Extensions line after Scope; main report profile cards add a row in the grid; the dedicated profile page (gitcortex report --email ...) gets a full block with a proportional horizontal bar mirroring the Scope widget.
Caveats documented in METRICS.md: bucket is derived from the file's canonical (post-rename) path so cross-extension renames credit pre-rename work to the new extension; Pct values may sum < 100% when the dev's contribution to a file was pure rename with no line change. Per-era per-dev attribution would require byExt to carry a dev dimension, which isn't tracked.
Tests cover the sort discrimination (churn-first vs files-first diverges on a hand-built case), top-5 truncation, all-(none) edge, and empty-Extensions guarding. Verified on pi-hole: CLI monotonic, main report 20/20 cards monotonic by script, dedicated profile bars render 32→32→5→5→3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
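The sort and truncation described in this commit might look roughly like the following. DevExtContrib is named in the commit message, but its fields here, the topExtensions helper, and the sample data are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// DevExtContrib fields are assumed; only the struct name comes from the commit.
type DevExtContrib struct {
	Ext   string
	Files int
	Churn int64
}

// topExtensions (hypothetical helper) sorts by Files desc, then Churn desc,
// then Ext asc, and keeps at most limit entries. The input slice is not modified.
func topExtensions(all []DevExtContrib, limit int) []DevExtContrib {
	out := append([]DevExtContrib(nil), all...)
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.Files != b.Files {
			return a.Files > b.Files
		}
		if a.Churn != b.Churn {
			return a.Churn > b.Churn
		}
		return a.Ext < b.Ext
	})
	if len(out) > limit {
		out = out[:limit]
	}
	return out
}

func main() {
	// Made-up data where churn-first and files-first orderings diverge:
	// .sql has the most churn but the fewest files.
	all := []DevExtContrib{
		{Ext: ".sql", Files: 2, Churn: 9000},
		{Ext: ".go", Files: 30, Churn: 4000},
		{Ext: ".yaml", Files: 12, Churn: 800},
	}
	for _, e := range topExtensions(all, 5) {
		fmt.Printf("%-6s files=%d churn=%d\n", e.Ext, e.Files, e.Churn)
	}
}
```

Sorting on the same quantity that drives the displayed percentage is what keeps the bar widths monotonic from top to bottom.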
@codex review
Codex Review: Didn't find any major issues. You're on a roll.