
Fix CacheDataset.itemlen returning wrong length#1029

Open
eyupcanakman wants to merge 1 commit into ml-explore:main from eyupcanakman:fix/cache-dataset-itemlen-596

Conversation

@eyupcanakman
Contributor

Fixes #596.

CacheDataset.itemlen calls len(self._data[idx]), which returns the number of dict keys in the raw item (always 1 for ChatDataset), not the token count. As a result, length-based batch sorting in the trainer is effectively disabled: every item reports the same length.

Changed to len(self[idx][0]) which processes the item through the tokenizer and returns the real sequence length.
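A minimal sketch of the bug and the fix. The class and method names follow the PR description, but the tokenizer and dataset internals here are simplified stand-ins, not the real mlx-lm implementation:

```python
class ChatDatasetStub:
    """Stand-in: each raw item is a dict with a single key (hypothetical)."""

    def __init__(self, messages):
        self._data = [{"messages": m} for m in messages]

    def __getitem__(self, idx):
        return self._data[idx]

    def __len__(self):
        return len(self._data)


class CacheDataset:
    def __init__(self, data, tokenize):
        self._data = data
        self._tokenize = tokenize  # stand-in for the real tokenizer step
        self._cache = {}

    def __getitem__(self, idx):
        # Process lazily and cache; yields (tokens, ...) like the real class.
        if idx not in self._cache:
            self._cache[idx] = (self._tokenize(self._data[idx]),)
        return self._cache[idx]

    def itemlen_buggy(self, idx):
        # len() of a one-key dict is always 1, so length sorting is a no-op.
        return len(self._data[idx])

    def itemlen(self, idx):
        # Fixed: tokenize through __getitem__ and measure the real sequence.
        return len(self[idx][0])


ds = ChatDatasetStub(["a b", "a b c d"])
cd = CacheDataset(ds, lambda item: item["messages"].split())
print(cd.itemlen_buggy(0), cd.itemlen_buggy(1))  # 1 1  (both look identical)
print(cd.itemlen(0), cd.itemlen(1))              # 2 4  (real token counts)
```

With the buggy version every item reports length 1, so sorting leaves the order unchanged; the fixed version exposes the actual token counts the trainer needs.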



Development

Successfully merging this pull request may close these issues.

When the LoRA-invoked trainer sorts a CacheDataset by length, it doesn't appear to change ordering
