-
Notifications
You must be signed in to change notification settings - Fork 255
Isolate data loading #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Isolate data loading #2
Conversation
|
@tekaratzas any suggestions ? |
|
Hi! Thanks for the PR! I am not super clear what the value add is here. Can you explain what the benefit of this change is? |
|
As a 3rd party observer, it's critical to separate data from code on any software project, even more so for a machine learning model when you want to work with various training and test sets for different applications, and even more so for an ML model that has ambitions to be a large language model where you will need to scrape all of wikipedia (at a minimum), saving the text to structured data files, to achieve useful results. |
OK I didn't notice the later commits that were added here. I just saw the initial one yesterday when the PR was made. Importing the data from a json file was something I had in the back of my head. Yea this makes more sense. I was actually thinking of bringing some of the Huggingface datasets to see what happens. |
* isolate data loading * pair * encode to bytes for vocab * data loading from json * data loading from csv * csv files added * cargo run works! * cargo update and dataset_loader redundant paren --------- Co-authored-by: anshumanpatil <info@anshumanpatil.com> Co-authored-by: Nikhil Sriram <nikhil.sriram5@gmail.com> Co-authored-by: hobs <github@totalgood.com>
|
@tekaratzas |
|
Hey! Didn't forget about this. Just tough to juggle time. I took a glance. Appreciate the import cleanup! I like the idea of moving data to JSON/CSV. I wanted to investigate using external datasets too. I hear huggingface has some public ones. If we think this puts us in a place to use that, then I say let's merge this. |
|
Hey! Merged the PR that added the github actions. Unfortunately, it also had some linting. So caused a few conflcits |
|
OK remove all linting for now. Going to put a push hook and add auto linting with standard linting rules |
Pushed - lint conflicts resolved. |
|
@anshumanpatil You still working on getting the HF datasets in here? |
It worked I will remove that json from data folder. |
tekaratzas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good.
A few little nits I think should be fixed before merging
This is the first "real" change to go in though!
Changes done - sorry for mess - still learning. |
|
Let me know when you want me to merge |
We can merge this one, so that I will continue for saving model on disk. |
Isolated data loading and later loading data from csv/json