Skip to content

Conversation

@anshumanpatil
Copy link
Contributor

Isolated data loading and later loading data from csv/json

@anshumanpatil
Copy link
Contributor Author

@tekaratzas any suggestions ?

@tekaratzas
Copy link
Owner

Hi! Thanks for the PR!

I am not super clear what the value add is here.

Can you explain what the benefit of this change is?

@hobson
Copy link
Contributor

hobson commented Sep 16, 2025

As a 3rd party observer, it's critical to separate data from code on any software project, even more so for a machine learning model when you want to work with various training and test sets for different applications, and even more so for an ML model that has ambitions to be a large language model where you will need to scrape all of wikipedia (at a minimum), saving the text to structured data files, to achieve useful results.

@tekaratzas
Copy link
Owner

As a 3rd party observer, it's critical to separate data from code on any software project, even more so for a machine learning model when you want to work with various training and test sets for different applications, and even more so for an ML model that has ambitions to be a large language model where you will need to scrape all of wikipedia (at a minimum), saving the text to structured data files, to achieve useful results.

OK I didn't notice the later commits that were added here. I just saw the initial one yesterday when the PR was made.

Importing the data from a json file was something I had in the back of my head.

Yea this makes more sense. I was actually thinking of bringing some of the Huggingface datasets to see what happens.

anshumanpatil and others added 3 commits September 17, 2025 20:31
* isolate data loading

* pair

* encode to bytes for vocab

* data loading from json

* data loading from csv

* csv files added

* cargo run works!

* cargo update and dataset_loader redundant paren

---------

Co-authored-by: anshumanpatil <info@anshumanpatil.com>
Co-authored-by: Nikhil Sriram <nikhil.sriram5@gmail.com>
Co-authored-by: hobs <github@totalgood.com>
@anshumanpatil
Copy link
Contributor Author

@tekaratzas
With help of @hobson I have updated PR
Please share your thoughts

@tekaratzas
Copy link
Owner

Hey!

Didn't forget about this. Just tough to juggle time.

I took a glance. Appreciate the import cleanup!

I like the idea of moving data to JSON/CSV. I wanted to investigate using external datasets too. I hear huggingface has some public ones. If we think this puts us in a place to use that, then I say let's merge this.

@tekaratzas
Copy link
Owner

Hey! Merged the PR that added the github actions. Unfortunately, it also had some linting. So caused a few conflcits

@tekaratzas
Copy link
Owner

OK remove all linting for now. Going to put a push hook and add auto linting with standard linting rules

@anshumanpatil
Copy link
Contributor Author

OK remove all linting for now. Going to put a push hook and add auto linting with standard linting rules

Pushed - lint conflicts resolved.

@tekaratzas
Copy link
Owner

@anshumanpatil You still working on getting the HF datasets in here?

@anshumanpatil
Copy link
Contributor Author

@anshumanpatil You still working on getting the HF datasets in here?

It worked I will remove that json from data folder.

Copy link
Owner

@tekaratzas tekaratzas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good.

A few little nits I think should be fixed before merging

This is the first "real" change to go in though!

@anshumanpatil
Copy link
Contributor Author

This is looking good.

A few little nits I think should be fixed before merging

This is the first "real" change to go in though!

Changes done - sorry for mess - still learning.

@tekaratzas
Copy link
Owner

Let me know when you want me to merge

@anshumanpatil
Copy link
Contributor Author

Let me know when you want me to merge

We can merge this one, so that I will continue for saving model on disk.

@tekaratzas tekaratzas merged commit 97ba6c9 into tekaratzas:main Sep 25, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants