20 changes: 16 additions & 4 deletions src/slicegpt/data_utils.py
@@ -150,13 +150,25 @@ def prepare_dataloader(
             start_idx = torch.randint(0, len(indices), (1,)).item()
             idx = start_idx
             tokens = []
-            while len(tokens) < max_seqlen and idx < len(indices):
+            while len(tokens) < max_seqlen:
                 item = data_list[indices[idx]]
                 sep = "" if not tokens else "\n\n"
                 tokens += tokenizer.tokenize(sep + item)
-                idx += 1
-
-            indices = indices[:start_idx] + indices[idx:]  # remove the used indices
+                idx = (idx + 1) % len(indices)
+                if idx == start_idx:
+                    # In this case, idx has wrapped around and caught up with start_idx.
+                    # There is no more data left to continue.
+                    break
+
+            if idx <= start_idx:
+                # We wrapped around and used the indices in the ranges
+                # [start_idx:len(indices)) and [0:idx).
+                # Remaining indices are:
+                indices = indices[idx:start_idx]
+            else:
+                # We used the indices in the range [start_idx:idx).
+                # Remaining indices are:
+                indices = indices[:start_idx] + indices[idx:]
 
             if len(tokens) >= max_seqlen:
                 tokens = tokens[:max_seqlen]  # truncate to max_seqlen
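To make the new index bookkeeping easier to check, here is a minimal, self-contained sketch of the wrap-around logic with a toy whitespace tokenizer. The helper name consume_tokens, the toy documents, and str.split as the tokenizer are illustrative assumptions, not part of this PR; only the loop and the slicing of the remaining indices mirror the diff above.

# Minimal standalone sketch of the wrap-around index bookkeeping above.
# The function name `consume_tokens`, the toy data and the whitespace
# tokenizer are hypothetical illustrations, not part of the PR.
def consume_tokens(data_list, indices, start_idx, max_seqlen, tokenize):
    idx = start_idx
    tokens = []
    while len(tokens) < max_seqlen:
        item = data_list[indices[idx]]
        sep = "" if not tokens else "\n\n"
        tokens += tokenize(sep + item)
        idx = (idx + 1) % len(indices)
        if idx == start_idx:
            # idx wrapped all the way around: every index has been used.
            break

    if idx <= start_idx:
        # Wrapped: used [start_idx:len(indices)) and [0:idx); the unused
        # indices form the contiguous middle slice.
        remaining = indices[idx:start_idx]
    else:
        # No wrap: used [start_idx:idx); keep everything outside that range.
        remaining = indices[:start_idx] + indices[idx:]

    return tokens[:max_seqlen], remaining

# Toy usage: six one-token documents, whitespace "tokenizer".
data = [f"doc{i}" for i in range(6)]
toks, rest = consume_tokens(data, list(range(6)), start_idx=4, max_seqlen=3, tokenize=str.split)
assert toks == ["doc4", "doc5", "doc0"]  # wrapped past the end of the list
assert rest == [1, 2, 3]                 # indices[idx:start_idx] with idx=1, start_idx=4

With the pre-PR loop condition (len(tokens) < max_seqlen and idx < len(indices)), this same call would stop at the end of the list with only two tokens; wrapping around avoids that early termination and keeps the remaining-indices arithmetic consistent.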