Add extensions stat: file extensions ranked by recent churn #5

Merged
lex0c merged 7 commits into main from feat/extension-stats
Apr 20, 2026

lex0c commented Apr 19, 2026

Historical lens on language distribution — "which extension is the team spending effort on now", not "what exists in the tree" (which cloc/tokei answer from the filesystem).

ExtensionStats aggregates ds.files by extension bucket with files, churn, recent_churn, unique_devs, first_seen, last_seen. Sort is recent_churn desc so dormant extensions with high lifetime churn can't displace active ones.
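The ranking rule can be sketched as follows. This is a minimal illustration, not the code in internal/stats: `ExtStat` is a hypothetical row type whose fields mirror a subset of the columns listed above, and the tiebreak on the extension name is an assumption added only to make the order deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// ExtStat is a hypothetical row type; the field names mirror some of the
// columns named in the description, not the real struct.
type ExtStat struct {
	Ext         string
	Files       int
	Churn       int64   // lifetime churn
	RecentChurn float64 // decayed churn, the ranking key
}

// rankByRecentChurn sorts rows by RecentChurn descending.
// The Ext tiebreak is an assumption, added for a stable order.
func rankByRecentChurn(rows []ExtStat) {
	sort.Slice(rows, func(i, j int) bool {
		if rows[i].RecentChurn != rows[j].RecentChurn {
			return rows[i].RecentChurn > rows[j].RecentChurn
		}
		return rows[i].Ext < rows[j].Ext
	})
}

func main() {
	rows := []ExtStat{
		// Dormant: large lifetime churn, almost no recent activity.
		{Ext: ".php", Files: 400, Churn: 90000, RecentChurn: 3.5},
		// Active: small lifetime churn, most of it recent.
		{Ext: ".go", Files: 60, Churn: 8000, RecentChurn: 412},
	}
	rankByRecentChurn(rows)
	for _, r := range rows {
		fmt.Println(r.Ext, r.Churn, r.RecentChurn)
	}
}
```

With made-up numbers like these, the active extension sorts first even though its lifetime churn is an order of magnitude smaller.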

extractExtension policy: last segment after the final dot, single-dot dotfiles kept verbatim (".gitignore" as its own bucket), multi-dot takes the final segment (.env.local → .local), extensionless and degenerate inputs collapse into "(none)".
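A minimal sketch of that policy, assuming "/" as the path separator and treating a trailing dot as one of the degenerate inputs. The function body is an illustration written from the description above, not the code in the PR, and it leaves out any case folding the real function may apply.

```go
package main

import (
	"fmt"
	"strings"
)

// noneBucket is the bucket for extensionless and degenerate inputs.
const noneBucket = "(none)"

// extractExtension returns the text from the final dot of the basename
// onward. A single-dot dotfile such as ".gitignore" therefore comes back
// verbatim, and ".env.local" yields ".local".
func extractExtension(path string) string {
	base := path
	if i := strings.LastIndex(base, "/"); i >= 0 {
		base = base[i+1:]
	}
	dot := strings.LastIndex(base, ".")
	// No dot at all, or a trailing dot with nothing after it.
	if dot < 0 || dot == len(base)-1 {
		return noneBucket
	}
	return base[dot:]
}

func main() {
	paths := []string{"cmd/main.go", ".gitignore", "config/.env.local", "Makefile", "notes.", ""}
	for _, p := range paths {
		fmt.Printf("%-20q %s\n", p, extractExtension(p))
	}
}
```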

Included in CLI --stat default sweep (output is compact), new HTML section below Directories, new Extensions section in METRICS.md documenting the policy + reading signals + explicit non-goals (no language-family grouping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lex0c commented Apr 19, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5143f4d487

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/report/template.go Outdated
The pct template helper returns a string, so wrapping its output in
printf "%.0f" produced %!f(string=...) which html/template then
replaced with the ZgotmplZ safe-escape sentinel in CSS contexts. The
rendered width was never a valid length and the bars collapsed.

Drop the printf wrapper in both places (Extensions table just shipped,
ChurnRisk inherited the same pattern) and pass pct straight through —
its "68.6" output is already a valid CSS length. Hotspots already did
this and was the only bar-churn surface rendering correctly.
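The failure mode can be reproduced in isolation. The sketch below uses text/template rather than html/template, so it shows only the formatting error and not the ZgotmplZ substitution that html/template applies afterwards in CSS contexts. The `pct` body, the `bar` type, and the template strings are assumptions made for the illustration; only the helper's string return type and its "68.6"-style output come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// bar is a hypothetical data type for one row of a bar chart.
type bar struct{ Val, Max int64 }

// pct stands in for the template helper: it returns an already
// formatted string such as "68.6", not a number.
func pct(val, max int64) string {
	if max <= 0 {
		return "0.0"
	}
	return fmt.Sprintf("%.1f", float64(val)/float64(max)*100)
}

// render executes a one-line template with pct registered as a helper.
func render(src string, b bar) string {
	t := template.Must(template.New("bar").
		Funcs(template.FuncMap{"pct": pct}).Parse(src))
	var sb strings.Builder
	if err := t.Execute(&sb, b); err != nil {
		return "error: " + err.Error()
	}
	return sb.String()
}

// hasFmtError reports whether s carries fmt's marker for a float verb
// applied to a string argument.
func hasFmtError(s string) bool {
	return strings.Contains(s, "%!f(string=")
}

const (
	broken = `width: {{printf "%.0f" (pct .Val .Max)}}%` // float verb on a string
	fixed  = `width: {{pct .Val .Max}}%`                 // helper output passed through
)

func main() {
	b := bar{Val: 686, Max: 1000}
	fmt.Println(render(broken, b), hasFmtError(render(broken, b)))
	fmt.Println(render(fixed, b)) // width: 68.6%
}
```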

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lex0c commented Apr 20, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f61643f3f

Comment thread internal/stats/stats.go Outdated
ds.files is keyed by canonical (post-rename) path, so bucketing on
extractExtension(path) lumped all of a foo.js → foo.ts lineage onto
.ts and left .js at zero. Migration-heavy repos were misattributed.

Fix: capture the extension at each change time (pre-rename) into a
new fileEntry.byExt map populated in lockstep with additions /
deletions / recentChurn. mergeFileEntry folds byExt across rename
collapse. ExtensionStats consumes the per-era split, falling back to
the canonical path's extension when byExt is nil (hand-built test
dataset).

Caveats documented in METRICS.md: "files" counts once per extension
the lineage ever held (total across buckets can exceed len(ds.files));
"unique_devs" is still lineage-union so a dev who only touched
pre-migration appears under the post-migration extension too — fixing
this would need per-ext dev tracking on extContribution.

Tests: rename split on the consumer side
(TestExtensionStatsHonorsPerEraSplit), and three
TestMergeFileEntryByExt* variants pinning the merger so an accidental
drop/clobber in the producer path can't pass unit tests.
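The per-era split can be pictured with a small stand-in. In this sketch the names `fileEntry`, `byExt`, `extContribution`, and `mergeFileEntry` come from the commit message, but every field, the `record` method, and both function bodies are assumptions; the real types in internal/stats carry more state.

```go
package main

import "fmt"

// extContribution holds the share of a lineage's activity that happened
// while the file carried one particular extension (assumed fields).
type extContribution struct {
	Additions, Deletions int64
	RecentChurn          float64
}

// fileEntry is a simplified stand-in for one file lineage.
type fileEntry struct {
	additions, deletions int64
	recentChurn          float64
	byExt                map[string]extContribution
}

// record (hypothetical) books one change against the lineage totals and,
// in lockstep, against the extension the path had at change time.
func (f *fileEntry) record(ext string, add, del int64, recent float64) {
	f.additions += add
	f.deletions += del
	f.recentChurn += recent
	if f.byExt == nil {
		f.byExt = make(map[string]extContribution)
	}
	c := f.byExt[ext]
	c.Additions += add
	c.Deletions += del
	c.RecentChurn += recent
	f.byExt[ext] = c
}

// mergeFileEntry folds src into dst when a rename collapses two entries
// into one lineage, summing byExt per extension instead of overwriting.
func mergeFileEntry(dst, src *fileEntry) {
	dst.additions += src.additions
	dst.deletions += src.deletions
	dst.recentChurn += src.recentChurn
	for ext, c := range src.byExt {
		if dst.byExt == nil {
			dst.byExt = make(map[string]extContribution)
		}
		d := dst.byExt[ext]
		d.Additions += c.Additions
		d.Deletions += c.Deletions
		d.RecentChurn += c.RecentChurn
		dst.byExt[ext] = d
	}
}

func main() {
	js := &fileEntry{} // changes made while the file was foo.js
	js.record(".js", 120, 30, 0.5)
	ts := &fileEntry{} // changes made after the rename to foo.ts
	ts.record(".ts", 40, 10, 6.0)

	mergeFileEntry(ts, js) // rename collapse onto the canonical path
	fmt.Println(ts.additions, ts.byExt[".js"].Additions, ts.byExt[".ts"].Additions)
}
```

After the merge the lineage totals are combined, while the `.js` and `.ts` contributions stay separate for the extension table to consume.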

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lex0c commented Apr 20, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f111e731f1

Comment thread internal/report/template.go Outdated
pct cast RecentChurn through int64 before scaling, so any dataset
where every bucket's RecentChurn was below 1 (heavy decay, aggressive
--since, or --churn-half-life shrunk to a day or two) collapsed every
bar to 0% — the visualization became useless precisely when the
table still carried meaningful relative differences.

Add pctFloat(val, max float64) and route the Extensions and ChurnRisk
bars through it. Verified end-to-end with
  gitcortex report --churn-half-life 1 ...
where Extensions buckets ranged 0.4 → 0.1 RecentChurn and bars now
render at 100% / 25% / 0% (proportional) instead of flat-zero.
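The difference between the two helpers can be sketched as below. Only the `pctFloat(val, max float64)` signature is taken from the commit message; the string return type, the one-decimal formatting, the zero guard, and the `pctTruncating` stand-in for the old behaviour are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// pctTruncating mimics the old behaviour: both operands pass through
// int64 before scaling, so any value below 1 becomes 0.
func pctTruncating(val, max float64) string {
	v, m := int64(val), int64(max)
	if m <= 0 {
		return "0.0"
	}
	return strconv.FormatFloat(float64(v)/float64(m)*100, 'f', 1, 64)
}

// pctFloat scales in floating point, so sub-unit values keep their
// relative proportions.
func pctFloat(val, max float64) string {
	if max <= 0 {
		return "0.0"
	}
	return strconv.FormatFloat(val/max*100, 'f', 1, 64)
}

func main() {
	const max = 0.4 // largest RecentChurn in the table
	for _, v := range []float64{0.4, 0.1} {
		fmt.Println(v, pctTruncating(v, max), pctFloat(v, max))
	}
}
```

For the 0.4 and 0.1 buckets quoted above, the truncating version yields "0.0" for both, while the float version yields "100.0" and "25.0".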

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lex0c commented Apr 20, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3b14e40b5

Comment thread internal/stats/stats.go
LoadMultiJSONL prepends "<stem>:" to every tracked path so multi-repo
reports can disambiguate colliding filenames. For nested paths the
slash-split in extractExtension already discards the prefix, but a
root-level extensionless file (Makefile, LICENSE) keeps it — and if
the stem contains a dot (repo.v1, project.2024), LastIndex(".")
picked the stem's dot and emitted a bogus bucket like ".v1:makefile"
instead of "(none)". Silent corruption of counts and ranking in
multi-repo mode.

Fix in extractExtension so the producer (reader.go ingest) and the
fallback in ExtensionStats both benefit from one change. Stripping on
the first ":" accepts a theoretical false positive for filenames that
genuinely contain ":" — rare on POSIX/Windows and absent from every
ds.files key except multi-input prefixes in practice.

Verified end-to-end: reproducing the case by copying a fixture to
pi-hole.v1.jsonl and running multi-input stats showed zero ".v1:*"
buckets and a correctly populated "(none)" bucket. Unit test covers
7 prefix patterns including nested paths (where the fix is a no-op)
and empty basenames.
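The before and after behaviour can be sketched like this. Both functions are illustrations written from the commit message, not the PR's code: the helper names `basename`, `extOf`, and `naiveExtension` are hypothetical, applying the ":" strip after the slash split is an assumption, and the lowercasing implied by the ".v1:makefile" bucket is not modelled.

```go
package main

import (
	"fmt"
	"strings"
)

const noneBucket = "(none)"

// basename returns the segment after the last "/".
func basename(path string) string {
	if i := strings.LastIndex(path, "/"); i >= 0 {
		return path[i+1:]
	}
	return path
}

// extOf applies the final-dot rule to an already isolated basename.
func extOf(base string) string {
	dot := strings.LastIndex(base, ".")
	if dot < 0 || dot == len(base)-1 {
		return noneBucket
	}
	return base[dot:]
}

// naiveExtension ignores the "<stem>:" prefix, reproducing the bug for
// root-level files when the stem contains a dot.
func naiveExtension(path string) string {
	return extOf(basename(path))
}

// extractExtension drops everything up to the first ":" in the basename
// before looking for the extension.
func extractExtension(path string) string {
	base := basename(path)
	if i := strings.Index(base, ":"); i >= 0 {
		base = base[i+1:]
	}
	return extOf(base)
}

func main() {
	paths := []string{"repo.v1:Makefile", "repo.v1:src/main.go", "repo.v1:.gitignore", "repo.v1:"}
	for _, p := range paths {
		fmt.Printf("%-22s naive=%-14s fixed=%s\n", p, naiveExtension(p), extractExtension(p))
	}
}
```

For nested paths the two functions agree, since the slash split has already discarded the prefix; they diverge only for root-level entries.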

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lex0c commented Apr 20, 2026

@codex review

@chatgpt-codex-connector
Codex Review: Didn't find any major issues. Already looking forward to the next diff.

lex0c and others added 2 commits April 19, 2026 22:22
Consolidates the "what data leaves the repo" story that was previously
scattered across README (JSONL example), RUNBOOK (per-record schema),
and METRICS.md (data-flow diagram). Readers evaluating adoption on a
sensitive repo had to stitch three docs together to answer "does this
ever read source code" and "what's opt-in vs default".

New section covers: the two git commands run, a per-field table for
commit and commit_file records showing source + which stat consumes
each field, an explicit "not collected" list (file contents, messages
off by default, refs other than the branch, zero network), and the
four opt-in flags that change what ships.

Verified the "consumed by" column against the source — committer_*
feeds emitDev (so committers appear in the dev registry) but no other
stat reads it from the commit record; old_hash/new_hash/old/new_size
are written by extract but never read by stats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DevProfile now surfaces the top 5 extensions a dev has worked on —
their language/skill fingerprint. Answers questions the rest of the
profile can't: "who on the team writes .rs?", "is this dev .yaml-heavy
(infra) or .go-heavy (app code)?", "did Carol stop touching .sql
after the migration?".

Aggregation reuses the devFiles map already built for Scope and
TopFiles — no new ingest cost. Sort is Files desc (tiebreak Churn
desc, Ext asc) so the displayed Pct is monotonic in both CLI and
HTML bar widths, matching Scope's UX. Churn is still exposed on the
DevExtContrib struct for JSON consumers who want a churn-ranked
view.

Rendered in three surfaces: CLI PrintProfiles adds an Extensions line
after Scope; main report profile cards add a row in the grid; the
dedicated profile page (gitcortex report --email ...) gets a full
block with a proportional horizontal bar mirroring the Scope widget.

Caveats documented in METRICS.md: bucket is derived from the file's
canonical (post-rename) path so cross-extension renames credit
pre-rename work to the new extension; Pct values may sum < 100% when
the dev's contribution to a file was pure rename with no line change.
Per-era per-dev attribution would require byExt to carry a dev
dimension, which isn't tracked.

Tests cover the sort discrimination (churn-first vs files-first
diverges on a hand-built case), top-5 truncation, all-(none) edge,
and empty-Extensions guarding. Verified on pi-hole: CLI monotonic,
main report 20/20 cards monotonic by script, dedicated profile bars
render 32→32→5→5→3.
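The ordering and truncation described in this commit can be sketched as follows. `DevExtContrib` and its `Ext`, `Files`, and `Churn` fields are named in the message; the `topExtensions` helper, its signature, and the omission of `Pct` (whose formula the message does not give) are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// DevExtContrib mirrors the struct named in the commit message, reduced
// to the three fields the sort uses.
type DevExtContrib struct {
	Ext   string
	Files int
	Churn int64
}

// topExtensions (hypothetical helper) returns at most n rows ordered by
// Files desc, then Churn desc, then Ext asc. The input is not modified.
func topExtensions(rows []DevExtContrib, n int) []DevExtContrib {
	out := append([]DevExtContrib(nil), rows...)
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.Files != b.Files {
			return a.Files > b.Files
		}
		if a.Churn != b.Churn {
			return a.Churn > b.Churn
		}
		return a.Ext < b.Ext
	})
	if len(out) > n {
		out = out[:n]
	}
	return out
}

func main() {
	rows := []DevExtContrib{
		{Ext: ".sql", Files: 2, Churn: 9000}, // churn-heavy, few files
		{Ext: ".go", Files: 32, Churn: 4000},
		{Ext: ".md", Files: 5, Churn: 300},
		{Ext: ".yaml", Files: 5, Churn: 300}, // ties with .md on Files and Churn
	}
	for _, r := range topExtensions(rows, 5) {
		fmt.Println(r.Ext, r.Files, r.Churn)
	}
}
```

A churn-first sort would put `.sql` at the top here; the files-first order keeps it last, which is the kind of divergence the sort-discrimination test above pins.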

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lex0c commented Apr 20, 2026

@codex review

@chatgpt-codex-connector
Codex Review: Didn't find any major issues. You're on a roll.

@lex0c lex0c merged commit 2dcbaa8 into main Apr 20, 2026
1 check passed
@lex0c lex0c deleted the feat/extension-stats branch April 20, 2026 01:51