@jaried jaried commented Jan 27, 2026

Summary

This PR improves the deduplication logic in ccusage to handle the edge case where API proxies omit the requestId field, and ensures that the record with the highest token count is preserved during deduplication.

Changes

1. Modified createUniqueHash

  • When requestId is missing, fall back to using only messageId for deduplication
  • Previously returned null when requestId was missing, causing records to bypass deduplication
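A minimal sketch of the fallback described above, not the PR's actual code. The `UsageEntry` type is a simplified stand-in for `UsageData`, and both the `messageId:requestId` key format and the `null` return for a missing message ID are assumptions made for illustration.

```typescript
// Simplified stand-in for UsageData; only the fields this sketch needs.
type UsageEntry = {
	requestId?: string;
	message: { id?: string };
};

// Assumed key format: "messageId:requestId", or messageId alone as the fallback.
function createUniqueHash(data: UsageEntry): string | null {
	const messageId = data.message.id;
	const requestId = data.requestId;

	// Assumption: without a message ID there is nothing stable to key on.
	if (messageId == null) {
		return null;
	}

	// Fallback: records from proxies that omit requestId still deduplicate by messageId.
	if (requestId == null) {
		return messageId;
	}

	return `${messageId}:${requestId}`;
}
```

With this shape, a record that lacks requestId gets a usable key instead of `null`, so it no longer bypasses deduplication.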

2. Added getTotalTokensFromUsageData

  • New function to calculate total tokens using the complete formula
  • Formula: input_tokens + output_tokens + cache_creation_input_tokens + cache_read_input_tokens

3. Updated deduplication logic

  • Changed from Set<string> to Map<string, {index, tokens}> for tracking processed entries
  • When a duplicate is found, compare token counts and keep the record with more tokens
  • This prevents data loss from streaming intermediate states (which often have 0 or very few tokens)
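The replacement policy above could be sketched roughly as follows. This is an illustration under stated assumptions, not the loaders' real code: `Entry`, `totalTokens`, and `dedupeKeepingMostTokens` are hypothetical names, entries are assumed to carry a precomputed hash, and entries whose hash is `null` are assumed to pass through without deduplication.

```typescript
// Hypothetical, simplified shapes for illustration only.
type Usage = {
	input_tokens: number;
	output_tokens: number;
	cache_creation_input_tokens?: number;
	cache_read_input_tokens?: number;
};
type Entry = { hash: string | null; usage: Usage };

// Same formula as the PR description: input + output + cache creation + cache read.
function totalTokens(usage: Usage): number {
	return (
		usage.input_tokens +
		usage.output_tokens +
		(usage.cache_creation_input_tokens ?? 0) +
		(usage.cache_read_input_tokens ?? 0)
	);
}

function dedupeKeepingMostTokens(entries: Entry[]): Entry[] {
	// Maps each hash to the slot it occupies in `kept` and the token total stored there.
	const seen = new Map<string, { index: number; tokens: number }>();
	const kept: Entry[] = [];

	for (const entry of entries) {
		// Assumption: entries without a hash cannot be compared, so they are kept as-is.
		if (entry.hash == null) {
			kept.push(entry);
			continue;
		}

		const tokens = totalTokens(entry.usage);
		const existing = seen.get(entry.hash);

		if (existing == null) {
			seen.set(entry.hash, { index: kept.length, tokens });
			kept.push(entry);
		} else if (tokens > existing.tokens) {
			// A later duplicate with more tokens replaces the earlier one in place.
			kept[existing.index] = entry;
			seen.set(entry.hash, { index: existing.index, tokens });
		}
	}

	return kept;
}
```

Storing the index alongside the token total is what allows the replacement to happen in place; a plain Set can only answer whether a hash was seen, not where the earlier record lives or how many tokens it had.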

4. Updated functions

  • loadDailyUsageData
  • loadSessionData
  • loadSessionBlockData

Testing

  • Added unit tests for createUniqueHash (including fallback behavior)
  • Added unit tests for getTotalTokensFromUsageData
  • Added unit tests for shouldReplaceExisting
  • Updated existing deduplication tests to reflect new behavior

Summary by CodeRabbit

  • Bug Fixes
    • Improved duplicate entry handling by retaining entries with higher token counts for more accurate data reporting.
    • Enhanced resilience when processing records with incomplete or missing identifiers.

- Modified createUniqueHash to fall back to messageId when requestId is missing
- Added getTotalTokensFromUsageData function for complete token calculation
- Changed deduplication logic to keep records with highest token count
- Updated loadDailyUsageData, loadSessionData, and loadSessionBlockData
- Added comprehensive unit tests for new deduplication behavior

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

The data-loader.ts module's deduplication logic is refactored to prioritize entries with higher token counts instead of simple duplicate skipping. A new exported utility function getTotalTokensFromUsageData computes total tokens from usage entries. The deduplication map switches from Set-based tracking to Map-based with token storage for comparison.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Deduplication Logic Overhaul: `apps/ccusage/src/data-loader.ts` | Replaced per-entry duplicate skipping with token-count–based replacement policy. Updated createUniqueHash to handle missing messageId/requestId with fallback logic. Added getTotalTokensFromUsageData utility to sum input, output, cache tokens. Converted deduplication tracking from Set to Map with { index, tokens } structure. Modified daily/session/session-block loaders to replace existing entries when new duplicates have higher token counts. Updated tests to expect higher-token entries in dedup scenarios. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #58: Modifies src/data-loader.ts deduplication logic including createUniqueHash and load functions; this PR evolves the Set-based chronological deduplication into token-count–based replacement policy.
  • PR #65: Introduces the isDuplicateEntry/markAsProcessed deduplication helpers that this PR replaces with token-count–based replacement policy and new data structures.

Suggested reviewers

  • ryoppippi
  • skylinesales

Poem

🐰✨ Tokens counted, not just skipped away,
Duplicates compared by their token display—
Keep the richer entry, let the weaker fade,
A smarter dedup dance our loader has made! 🎭

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: improving deduplication logic to retain records with the highest token counts. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@apps/ccusage/src/data-loader.ts`:
- Around line 529-543: The function getTotalTokensFromUsageData is exported but
not used externally; remove the export to make it module-private by changing its
declaration to a non-exported function (keep the name
getTotalTokensFromUsageData and signature using UsageData so callers inside this
file still work), then run a quick grep across the repo to confirm no external
imports reference it and update any internal callers if needed.
🧹 Nitpick comments (1)
apps/ccusage/src/data-loader.ts (1)

490-496: Avoid a test-only helper drifting from production logic.
shouldReplaceExisting is only exercised in the test block, while the dedup paths inline the comparison. Consider using this helper in those paths (or removing it) to keep a single source of truth.
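One way to act on this nitpick is to route the dedup path through the helper so the comparison lives in a single place. The sketch below is hypothetical: the real signature of `shouldReplaceExisting` is not shown in this thread, so the two-number form is an assumption, and `Tracked` and `track` are illustrative names.

```typescript
type Tracked = { index: number; tokens: number };

// Assumed signature: replace only when the new record has strictly more tokens.
function shouldReplaceExisting(existingTokens: number, newTokens: number): boolean {
	return newTokens > existingTokens;
}

// Illustrative dedup step that delegates the comparison to the helper
// instead of inlining `newTokens > existing.tokens`.
function track(
	seen: Map<string, Tracked>,
	hash: string,
	index: number,
	tokens: number,
): "added" | "replaced" | "skipped" {
	const existing = seen.get(hash);

	if (existing == null) {
		seen.set(hash, { index, tokens });
		return "added";
	}

	if (shouldReplaceExisting(existing.tokens, tokens)) {
		// Keep the original slot so the caller can overwrite the earlier record in place.
		seen.set(hash, { index: existing.index, tokens });
		return "replaced";
	}

	return "skipped";
}
```

If each loader calls something like `track`, the unit tests for `shouldReplaceExisting` exercise the same comparison that production uses.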

Comment on lines +529 to +543
/**
 * Calculate total tokens from UsageData for deduplication comparison
 * Uses complete formula: input + output + cache_creation + cache_read
 * @param data - UsageData object
 * @returns Total number of tokens
 */
export function getTotalTokensFromUsageData(data: UsageData): number {
	const usage = data.message.usage;
	return (
		usage.input_tokens +
		usage.output_tokens +
		(usage.cache_creation_input_tokens ?? 0) +
		(usage.cache_read_input_tokens ?? 0)
	);
}
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "getTotalTokensFromUsageData" -g "!apps/ccusage/src/data-loader.ts"

Repository: ryoppippi/ccusage

Length of output: 43


Remove the export keyword from getTotalTokensFromUsageData. This function is not used outside this module and violates the guideline to export only symbols actually used by other modules.
