Description
This issue covers a deterministic dataset encoding pipeline and a minimal implementation of a LLaMA-style architecture.
- The dataset is currently processed into `wiki_clean.txt` and a tokenizer has been trained. We need to implement a memory-efficient script to encode the entire dataset into binary format for training.
- Implement a minimal PyTorch LLaMA-style architecture to ensure deterministic behavior and full control over initialization.
- Read the dataset in chunks to avoid high memory usage.
- Use the trained tokenizer (BPE/SentencePiece) to convert text into token IDs.
- Stream token IDs into a binary dataset file (`.bin`, `uint16` or similar).
- Compute a SHA256 hash of the resulting file.
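The encoding steps above could be sketched roughly as follows. This is a minimal, hedged example, not the final implementation: the file names, chunk size, and the `encode` callable are placeholders, and the real script should plug in the trained tokenizer (e.g. a loaded SentencePiece model's `encode` method). Note that splitting the text at fixed character boundaries can split tokens at chunk edges; for exact reproducibility the real script may want to chunk on newline boundaries instead.

```python
# Hypothetical sketch: stream-encode a text corpus into a uint16 binary file
# and hash the result. Names and parameters here are assumptions.
import hashlib
import numpy as np

def encode_dataset(text_path, bin_path, encode, chunk_chars=1 << 20):
    """Read `text_path` in fixed-size character chunks, tokenize each chunk
    with `encode` (str -> list[int]), and append the IDs to `bin_path`
    as uint16 so memory usage stays bounded by the chunk size."""
    with open(text_path, "r", encoding="utf-8") as src, \
         open(bin_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            ids = np.asarray(encode(chunk), dtype=np.uint16)
            dst.write(ids.tobytes())

def sha256_file(path, block=1 << 20):
    """SHA256 of a file, read in blocks to avoid loading it whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while blk := f.read(block):
            h.update(blk)
    return h.hexdigest()
```

Because the writer only ever holds one chunk of token IDs in memory, the peak footprint is independent of the dataset size, and running the function twice on the same input yields byte-identical output.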
Verification Criteria
- Running the dataset encoding pipeline twice must produce identical binary files and SHA256 hashes.
- Initializing the model twice with the same seed must produce identical parameter hashes.
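The seeded-init criterion could be checked with something like the sketch below. The `build_model` body is a stand-in, not the real LLaMA-style architecture; the point is the pattern: seed before any parameter is created, then hash the parameters in registration order.

```python
# Hypothetical determinism check: build the model twice with the same seed
# and compare SHA256 hashes of the flattened parameters.
import hashlib
import torch
import torch.nn as nn

def param_hash(model: nn.Module) -> str:
    """SHA256 over all parameters in registration order (float32 bytes)."""
    h = hashlib.sha256()
    for p in model.parameters():
        h.update(p.detach().to(torch.float32).cpu().numpy().tobytes())
    return h.hexdigest()

def build_model(seed: int) -> nn.Module:
    torch.manual_seed(seed)  # seed *before* any parameter tensor is created
    return nn.Sequential(    # stand-in for the real LLaMA-style model
        nn.Embedding(256, 64),
        nn.Linear(64, 64),
        nn.Linear(64, 256),
    )
```

Same seed, same hash; different seed, different hash. The same `param_hash` helper can be reused after training steps to compare full runs.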
Additional Context
No response
Code of Conduct
- I have joined the Discord server and will post updates there
- I have searched existing issues to avoid duplicates