
Fix CacheDataset.itemlen returning wrong length#1029

Open
eyupcanakman wants to merge 1 commit into ml-explore:main from eyupcanakman:fix/cache-dataset-itemlen-596

Conversation

@eyupcanakman
Contributor

Fixes #596.

CacheDataset.itemlen calls len(self._data[idx]), which returns the number of dict keys in the raw item (always 1 for ChatDataset), not the token count. As a result, length-based batch sorting in the trainer is effectively disabled: every item reports the same length.

Changed to len(self[idx][0]) which processes the item through the tokenizer and returns the real sequence length.
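A minimal sketch of the bug and the fix. The class and method names follow the PR description, but the tokenizer and dataset internals here are simplified stand-ins, not the real mlx-lm implementation:

```python
class ChatDatasetStub:
    """Stand-in: each raw item is a dict with a single key (hypothetical)."""

    def __init__(self, messages):
        self._data = [{"messages": m} for m in messages]

    def __getitem__(self, idx):
        return self._data[idx]

    def __len__(self):
        return len(self._data)


class CacheDataset:
    def __init__(self, data, tokenize):
        self._data = data
        self._tokenize = tokenize  # stand-in for the real tokenizer step
        self._cache = {}

    def __getitem__(self, idx):
        # Process lazily and cache; yields (tokens, ...) like the real class.
        if idx not in self._cache:
            self._cache[idx] = (self._tokenize(self._data[idx]),)
        return self._cache[idx]

    def itemlen_buggy(self, idx):
        # len() of a one-key dict is always 1, so length sorting is a no-op.
        return len(self._data[idx])

    def itemlen(self, idx):
        # Fixed: tokenize through __getitem__ and measure the real sequence.
        return len(self[idx][0])


ds = ChatDatasetStub(["a b", "a b c d"])
cd = CacheDataset(ds, lambda item: item["messages"].split())
print(cd.itemlen_buggy(0), cd.itemlen_buggy(1))  # 1 1  (both look identical)
print(cd.itemlen(0), cd.itemlen(1))              # 2 4  (real token counts)
```

With the buggy version every item reports length 1, so sorting leaves the order unchanged; the fixed version exposes the actual token counts the trainer needs.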



Development

Successfully merging this pull request may close these issues.

When the LoRA-invoked trainer sorts a CacheDataset by length, it doesn't appear to change ordering
