Isolate data loading #2
Merged. tekaratzas merged 17 commits into tekaratzas:main from anshumanpatil:feature/isolate-data-loading on Sep 25, 2025.
Commits
- 869d60e isolate data loading (anshumanpatil)
- 710d086 data loading from json (anshumanpatil)
- 7e876e3 data loading from csv (anshumanpatil)
- 4a506b3 csv files added (anshumanpatil)
- 813a011 code format (anshumanpatil)
- efa2b04 Added verbose printing of vocab to confirm correct data loading (#1) (hobson)
- babb0e5 Merge master and PR (anshumanpatil)
- 830ae33 refactoring (anshumanpatil)
- 7c90d1c refactoring (anshumanpatil)
- 4e2df4f fmt conflicts solved (anshumanpatil)
- 362bde4 fmt conflicts solved (anshumanpatil)
- 6e9b67f logs removed (anshumanpatil)
- e04156e cargo fmt (anshumanpatil)
- 1540b5a remove HF dataset (anshumanpatil)
- d3e7ed8 merge master (anshumanpatil)
- 4c84326 merge master (anshumanpatil)
- 27b6cae PR comments addressed (anshumanpatil)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| [ | ||
| "User: What causes rain? Assistant: Rain is caused by water vapor in clouds condensing into droplets that become too heavy to remain airborne </s>", | ||
| "User: How do mountains form? Assistant: Mountains are formed through tectonic forces or volcanism over long geological time periods </s>", | ||
| "User: What is the Amazon rainforest? Assistant: The Amazon rainforest is one of the most biodiverse places on earth, home to countless species </s>", | ||
| "User: At what temperature does water boil? Assistant: Water boils at 100 degrees celsius at standard atmospheric pressure </s>", | ||
| "User: How long does it take the moon to orbit Earth? Assistant: The moon orbits the earth approximately every 27.3 days </s>", | ||
| "User: What is photosynthesis? Assistant: Photosynthesis is the process by which green plants use sunlight to synthesize food from carbon dioxide </s>", | ||
| "User: How does gravity work? Assistant: Gravity is a force that attracts two bodies toward each other based on their mass </s>", | ||
| "User: How many neurons are in the human brain? Assistant: The human brain contains about 86 billion neurons that transmit information throughout the body </s>", | ||
| "User: What is electricity? Assistant: Electricity is the flow of electrons through a conductor, often used to power devices </s>", | ||
| "User: What is climate change? Assistant: Climate change refers to long-term shifts in temperatures and weather patterns on Earth </s>", | ||
| "User: How long do oak trees live? Assistant: Oak trees can live for hundreds of years and produce acorns as their fruit </s>", | ||
| "User: What happened to Pluto? Assistant: Pluto was reclassified from a planet to a dwarf planet in 2006 by astronomers </s>", | ||
| "User: How is glass made? Assistant: Glass is made by heating sand, soda ash, and limestone to very high temperatures until they melt </s>", | ||
| "User: What do volcanoes do? Assistant: Volcanoes can erupt with lava, ash, and gases, altering landscapes and ecosystems </s>", | ||
| "User: Why was the Great Wall built? Assistant: The great wall of china was built to protect ancient china from invasions </s>", | ||
| "User: What are penguins like? Assistant: Penguins are flightless birds that are well adapted to life in cold environments </s>", | ||
| "User: How much rain do deserts get? Assistant: Deserts receive less than 250 millimeters of precipitation each year </s>", | ||
| "User: What is Jupiter like? Assistant: Jupiter is the largest planet in our solar system and has dozens of moons </s>", | ||
| "User: How fast does light travel? Assistant: Light travels at approximately 299,792 kilometers per second in a vacuum </s>", | ||
| "User: What is gold used for? Assistant: Gold is a dense, soft metal often used in jewelry and electronics due to its conductivity </s>", | ||
| "User: What covers most of Earth? Assistant: Most of the earth's surface is covered by water, primarily in oceans </s>", | ||
| "User: How do bicycles work? Assistant: Bicycles are an efficient mode of transport that convert human energy into motion through gears </s>", | ||
| "User: How is chocolate made? Assistant: Chocolate is made from roasted and ground cacao seeds, often sweetened and flavored </s>", | ||
| "User: What is the internet? Assistant: The internet is a global network that allows for digital communication and information sharing </s>", | ||
| "User: How do wind turbines work? Assistant: Wind energy is harnessed using turbines that convert wind motion into electricity </s>", | ||
| "User: What are cats like? Assistant: Cats are domesticated mammals known for their independence and hunting instincts </s>", | ||
| "User: How do languages change? Assistant: Languages evolve over time through cultural, social, and technological influences </s>", | ||
| "User: What did the printing press do? Assistant: The printing press revolutionized the spread of information in the 15th century </s>", | ||
| "User: What is sound? Assistant: Sound is a vibration that travels through air, water, or solid materials to reach our ears </s>", | ||
| "User: Why is carbon important? Assistant: Carbon is an essential element in organic chemistry, forming the basis of all life on Earth </s>", | ||
| "User: What was the Library of Alexandria? Assistant: The library of alexandria was one of the most significant libraries of the ancient world </s>", | ||
| "User: What do honeybees do? Assistant: Honeybees play a vital role in pollination, which supports ecosystems and agriculture </s>", | ||
| "User: Are electric cars better? Assistant: Electric vehicles produce less air pollution than traditional gasoline-powered cars </s>", | ||
| "User: How do you make bread? Assistant: Bread is typically made from flour, water, yeast, and salt through a baking process </s>", | ||
| "User: What is the Sahara desert? Assistant: The sahara desert is the largest hot desert in the world, spanning multiple countries </s>", | ||
| "User: What are renewable resources? Assistant: Renewable resources replenish naturally and include sunlight, wind, and water </s>", | ||
| "User: What are eclipses? Assistant: Eclipses occur when one celestial body moves into the shadow of another </s>", | ||
| "User: How are language models trained? Assistant: Language models are trained using vast amounts of text to learn patterns in language </s>", | ||
| "User: How do compasses work? Assistant: Compasses work by aligning a magnetic needle with the earth's magnetic field </s>", | ||
| "User: What do vaccines do? Assistant: Vaccines help the immune system recognize and fight off specific pathogens </s>", | ||
| "User: Hello! Assistant: Hello! How can I help you today? </s>", | ||
| "User: Hi! Assistant: Hi there! What can I do for you? </s>", | ||
| "User: Hey! Assistant: Hey! How's it going? What would you like to know? </s>", | ||
| "User: Good morning! Assistant: Good morning! Hope you're having a great day. How can I assist you? </s>", | ||
| "User: Good afternoon! Assistant: Good afternoon! What can I help you with today? </s>", | ||
| "User: Good evening! Assistant: Good evening! How may I help you tonight? </s>", | ||
| "User: How are you? Assistant: I'm doing well, thank you for asking! How can I help you today? </s>", | ||
| "User: What's up? Assistant: Not much, just here to help! What would you like to know? </s>", | ||
| "User: Greetings! Assistant: Greetings! It's nice to meet you. How can I be of assistance? </s>", | ||
| "User: Thank you. Assistant: You're welcome! I'm happy to help </s>", | ||
| "User: Can you explain that? Assistant: Of course! I'd be happy to explain in more detail </s>", | ||
| "User: I don't understand. Assistant: No problem! Let me try to explain it differently </s>", | ||
| "User: What do you think? Assistant: Based on the information available, I think it's important to consider multiple perspectives </s>" | ||
| ] |
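
Every entry in the chat file above follows the same shape: a `User:` turn, an `Assistant:` turn, and a closing `</s>` marker. As a minimal sketch of that layout (the names `ChatTurn` and `split_chat_entry` are hypothetical and not part of this PR), an entry could be split into its two turns with the standard library alone:

```rust
/// One chat example split into its two turns.
/// Hypothetical helper type, not part of the PR.
#[derive(Debug, PartialEq)]
pub struct ChatTurn<'a> {
    pub user: &'a str,
    pub assistant: &'a str,
}

/// Splits an entry of the form "User: <question> Assistant: <answer> </s>".
/// Returns `None` if the trailing `</s>` or either speaker label is missing.
pub fn split_chat_entry(entry: &str) -> Option<ChatTurn<'_>> {
    // Drop the end-of-sequence marker and the whitespace before it.
    let body = entry.trim().strip_suffix("</s>")?.trim_end();
    // The entry must open with the user label.
    let rest = body.strip_prefix("User:")?;
    // Everything before the first "Assistant:" is the user turn.
    let (user, assistant) = rest.split_once("Assistant:")?;
    Some(ChatTurn {
        user: user.trim(),
        assistant: assistant.trim(),
    })
}
```

A check like this could be run over the loaded `Vec<String>` to confirm that no entry is malformed before training, assuming the training code relies on this layout.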
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| [ | ||
| "The sun rises in the east and sets in the west </s>", | ||
| "Water flows downhill due to gravity </s>", | ||
| "Birds fly through the air using their wings </s>", | ||
| "Fish swim in rivers, lakes, and oceans </s>", | ||
| "Trees grow tall and produce leaves </s>", | ||
| "Rain falls from clouds in the sky </s>", | ||
| "Fire is hot and produces light </s>", | ||
| "Ice is frozen water that melts when heated </s>", | ||
| "Mountains are tall and rocky formations </s>", | ||
| "The moon orbits around planet Earth </s>", | ||
| "Flowers bloom in spring and summer </s>", | ||
| "Snow is cold and white </s>", | ||
| "Wind moves air from place to place </s>", | ||
| "Rivers flow into larger bodies of water </s>", | ||
| "Sand is found on beaches and in deserts </s>", | ||
| "Grass grows in fields and yards </s>", | ||
| "Rocks are hard and can be different colors </s>", | ||
| "Stars shine bright in the night sky </s>", | ||
| "Waves move across the surface of water </s>", | ||
| "Clouds form when water vapor rises </s>", | ||
| "Lightning is bright and makes thunder </s>", | ||
| "Storms bring rain and strong winds </s>", | ||
| "Seasons change throughout the year </s>", | ||
| "Animals eat food to survive </s>", | ||
| "Plants need sunlight and water to grow </s>" | ||
| ] |
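
One of the commits in this PR mentions printing the vocabulary to confirm that data loaded correctly. The sketch below shows one way such a check might look, assuming plain whitespace tokenization; the project's real tokenizer is not shown in this diff, and `whitespace_vocab` is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Collects the distinct whitespace-separated tokens across all entries.
/// Assumption: whitespace splitting only; punctuation stays attached to words.
pub fn whitespace_vocab(entries: &[&str]) -> BTreeSet<String> {
    entries
        .iter()
        .flat_map(|entry| entry.split_whitespace())
        .map(str::to_string)
        .collect()
}
```

Because the result is a `BTreeSet`, iterating over it yields the tokens in sorted order, which makes a printed vocabulary easy to compare between runs. The `</s>` marker appears as its own token because each entry separates it from the text with a space.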
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| use csv::ReaderBuilder; | ||
| use serde_json; | ||
| use std::fs; | ||
| | ||
| pub struct Dataset { | ||
| pub pretraining_data: Vec<String>, | ||
| pub chat_training_data: Vec<String>, | ||
| } | ||
| | ||
| #[allow(dead_code)] | ||
| pub enum DatasetType { | ||
| JSON, | ||
| CSV, | ||
| } | ||
| | ||
| impl Dataset { | ||
| pub fn new( | ||
| pretraining_data_path: String, | ||
| chat_training_data_path: String, | ||
| type_of_data: DatasetType, | ||
| ) -> Self { | ||
| let pretraining_data: Vec<String>; | ||
| let chat_training_data: Vec<String>; | ||
| | ||
| match type_of_data { | ||
| DatasetType::CSV => { | ||
| pretraining_data = get_data_from_csv(pretraining_data_path); | ||
| chat_training_data = get_data_from_csv(chat_training_data_path); | ||
| } | ||
| DatasetType::JSON => { | ||
| pretraining_data = get_data_from_json(pretraining_data_path); | ||
| chat_training_data = get_data_from_json(chat_training_data_path); | ||
| } | ||
| } | ||
| | ||
| Dataset { | ||
| pretraining_data: pretraining_data.clone(), | ||
| chat_training_data: chat_training_data.clone(), | ||
| } | ||
| } | ||
| } | ||
| | ||
| fn get_data_from_json(path: String) -> Vec<String> { | ||
| // convert json file to Vec<String> | ||
| let data_json = fs::read_to_string(path).expect("Failed to read data file"); | ||
| let data: Vec<String> = serde_json::from_str(&data_json).expect("Failed to parse data file"); | ||
| data | ||
| } | ||
| | ||
| fn get_data_from_csv(path: String) -> Vec<String> { | ||
| // convert csv file to Vec<String> | ||
| let file = fs::File::open(path).expect("Failed to open CSV file"); | ||
| let mut rdr = ReaderBuilder::new().has_headers(false).from_reader(file); | ||
| let mut data = Vec::new(); | ||
| | ||
| for result in rdr.records() { | ||
| let record = result.expect("Failed to read CSV record"); | ||
| // Each record is a row, join all columns into a single string | ||
| data.push(record.iter().collect::<Vec<_>>().join(",")); | ||
| } | ||
| data | ||
| } | ||
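
The merged loader panics through `expect` on any I/O or parse failure and clones both vectors when building the struct. As a hedged alternative, the sketch below returns a `Result` and moves the vectors into place instead. It uses only the standard library, so `load_lines` (one example per non-empty line) is a hypothetical stand-in for the `csv` and `serde_json` readers in the diff; `Loader` and `try_new` are likewise illustrative names, not part of the PR.

```rust
use std::error::Error;
use std::fs;
use std::path::Path;

pub struct Dataset {
    pub pretraining_data: Vec<String>,
    pub chat_training_data: Vec<String>,
}

/// Signature shared by every loader in this sketch (hypothetical alias).
pub type Loader = fn(&Path) -> Result<Vec<String>, Box<dyn Error>>;

/// Stand-in loader: reads the file and keeps each non-empty, trimmed line.
/// Assumption: this replaces the crate-backed JSON/CSV readers for illustration.
pub fn load_lines(path: &Path) -> Result<Vec<String>, Box<dyn Error>> {
    let text = fs::read_to_string(path)?;
    Ok(text
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect())
}

impl Dataset {
    /// Loads both files with the same loader.
    /// Any failure is returned to the caller instead of panicking.
    pub fn try_new(
        pretraining_path: &Path,
        chat_path: &Path,
        load: Loader,
    ) -> Result<Self, Box<dyn Error>> {
        Ok(Dataset {
            // The loaded vectors are moved into the struct; no clone is needed.
            pretraining_data: load(pretraining_path)?,
            chat_training_data: load(chat_path)?,
        })
    }
}
```

Passing the loader as a function pointer is only one option. Keeping the `DatasetType` enum from the PR and selecting the loader in a `match` would work equally well; the point of the sketch is the error propagation with `?`.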