
[FEATURE]: Implement Deterministic Dataset Encoding Pipeline and Verifiable LLaMA Model (Model Architecture) #55

@Shubhamx404

Description


This issue covers two deliverables: a deterministic dataset encoding pipeline and a minimal, verifiable implementation of a LLaMA-style architecture.

  • The dataset has already been processed into wiki_clean.txt and a tokenizer has been trained. We now need a memory-efficient script that encodes the entire dataset into a binary format for training.

  • Implement a minimal PyTorch LLaMA-style architecture to ensure deterministic behavior and full control over initialization.

  • Read the dataset in chunks to avoid high memory usage.

  • Use the trained tokenizer (BPE/SentencePiece) to convert text into token IDs.

  • Stream token IDs into a binary dataset file (.bin, uint16 or similar).

  • Compute a SHA256 hash of the resulting file.
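The steps above can be sketched as follows. This is a minimal illustration, not the final implementation: `encode_dataset` and `encode_fn` are hypothetical names, and `encode_fn` stands in for the trained BPE/SentencePiece tokenizer's encode call (e.g. `sp.encode(line)`).

```python
import hashlib
import struct

def encode_dataset(src_path, out_path, encode_fn):
    """Stream-encode a text file into a little-endian uint16 binary file.

    `encode_fn` maps one line of text to a list of token IDs < 65536
    (a stand-in for the trained tokenizer's encode()).
    """
    with open(src_path, "r", encoding="utf-8") as src, \
         open(out_path, "wb") as out:
        # Encode line by line: memory stays bounded by the longest line,
        # and no token ever straddles a chunk boundary.
        for line in src:
            ids = encode_fn(line)
            # '<' fixes little-endian byte order, so the file (and its
            # SHA256) is identical across platforms.
            out.write(struct.pack(f"<{len(ids)}H", *ids))

def sha256_file(path, buf_size=1 << 20):
    """SHA256 of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(buf_size):
            h.update(chunk)
    return h.hexdigest()
```

Because the writer emits tokens in file order with a fixed byte order, running the pipeline twice on the same input and tokenizer should yield byte-identical `.bin` files, which is exactly what the SHA256 comparison below verifies.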

Verification Criteria

  • Running the dataset encoding pipeline twice should produce identical binary files and SHA256 hashes.

  • Initializing the model twice with the same seed must produce identical initial parameter hashes.
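The parameter-hash check could look like the sketch below. It is framework-agnostic for brevity (NumPy instead of PyTorch, and `init_params`/`params_hash` are hypothetical names); with the actual model, the same pattern is `torch.manual_seed(seed)` before construction, then hashing the tensors in `model.state_dict()`.

```python
import hashlib
import numpy as np

def init_params(seed, shapes):
    """Deterministically initialize named parameter arrays from one seed.

    Iterates `shapes` in a fixed order so the RNG draw sequence, and
    hence every value, is reproducible across runs.
    """
    rng = np.random.default_rng(seed)
    return {name: rng.standard_normal(shape).astype(np.float32)
            for name, shape in shapes.items()}

def params_hash(params):
    """SHA256 over all parameters, visited in sorted-name order.

    A fixed visiting order is required: hashing in dict order that
    varies between runs would break reproducibility of the digest.
    """
    h = hashlib.sha256()
    for name in sorted(params):
        h.update(name.encode("utf-8"))
        h.update(params[name].tobytes())
    return h.hexdigest()
```

Two initializations with the same seed then hash identically, and a different seed changes the hash, which is the pass/fail signal for this criterion.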

Additional Context

No response

Code of Conduct

  • I have joined the Discord server and will post updates there
  • I have searched existing issues to avoid duplicates

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
