Merged
254 changes: 143 additions & 111 deletions Cargo.lock

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions Cargo.toml
@@ -1,13 +1,18 @@
[package]
name = "llm"
version = "0.1.0"
# 2024 = latest edition
edition = "2024"

[dependencies]
bincode = "2.0.1"
ndarray = "0.16.1"
rand = "0.9.0"
rand_distr = "0.5.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

csv = "1.3"
[dev-dependencies]
# Add any test-specific dependencies here if needed

55 changes: 55 additions & 0 deletions data/chat_training_data.json
@@ -0,0 +1,55 @@
[
"User: What causes rain? Assistant: Rain is caused by water vapor in clouds condensing into droplets that become too heavy to remain airborne </s>",
"User: How do mountains form? Assistant: Mountains are formed through tectonic forces or volcanism over long geological time periods </s>",
"User: What is the Amazon rainforest? Assistant: The Amazon rainforest is one of the most biodiverse places on earth, home to countless species </s>",
"User: At what temperature does water boil? Assistant: Water boils at 100 degrees celsius at standard atmospheric pressure </s>",
"User: How long does it take the moon to orbit Earth? Assistant: The moon orbits the earth approximately every 27.3 days </s>",
"User: What is photosynthesis? Assistant: Photosynthesis is the process by which green plants use sunlight to synthesize food from carbon dioxide </s>",
"User: How does gravity work? Assistant: Gravity is a force that attracts two bodies toward each other based on their mass </s>",
"User: How many neurons are in the human brain? Assistant: The human brain contains about 86 billion neurons that transmit information throughout the body </s>",
"User: What is electricity? Assistant: Electricity is the flow of electrons through a conductor, often used to power devices </s>",
"User: What is climate change? Assistant: Climate change refers to long-term shifts in temperatures and weather patterns on Earth </s>",
"User: How long do oak trees live? Assistant: Oak trees can live for hundreds of years and produce acorns as their fruit </s>",
"User: What happened to Pluto? Assistant: Pluto was reclassified from a planet to a dwarf planet in 2006 by astronomers </s>",
"User: How is glass made? Assistant: Glass is made by heating sand, soda ash, and limestone to very high temperatures until they melt </s>",
"User: What do volcanoes do? Assistant: Volcanoes can erupt with lava, ash, and gases, altering landscapes and ecosystems </s>",
"User: Why was the Great Wall built? Assistant: The great wall of china was built to protect ancient china from invasions </s>",
"User: What are penguins like? Assistant: Penguins are flightless birds that are well adapted to life in cold environments </s>",
"User: How much rain do deserts get? Assistant: Deserts receive less than 250 millimeters of precipitation each year </s>",
"User: What is Jupiter like? Assistant: Jupiter is the largest planet in our solar system and has dozens of moons </s>",
"User: How fast does light travel? Assistant: Light travels at approximately 299,792 kilometers per second in a vacuum </s>",
"User: What is gold used for? Assistant: Gold is a dense, soft metal often used in jewelry and electronics due to its conductivity </s>",
"User: What covers most of Earth? Assistant: Most of the earth's surface is covered by water, primarily in oceans </s>",
"User: How do bicycles work? Assistant: Bicycles are an efficient mode of transport that convert human energy into motion through gears </s>",
"User: How is chocolate made? Assistant: Chocolate is made from roasted and ground cacao seeds, often sweetened and flavored </s>",
"User: What is the internet? Assistant: The internet is a global network that allows for digital communication and information sharing </s>",
"User: How do wind turbines work? Assistant: Wind energy is harnessed using turbines that convert wind motion into electricity </s>",
"User: What are cats like? Assistant: Cats are domesticated mammals known for their independence and hunting instincts </s>",
"User: How do languages change? Assistant: Languages evolve over time through cultural, social, and technological influences </s>",
"User: What did the printing press do? Assistant: The printing press revolutionized the spread of information in the 15th century </s>",
"User: What is sound? Assistant: Sound is a vibration that travels through air, water, or solid materials to reach our ears </s>",
"User: Why is carbon important? Assistant: Carbon is an essential element in organic chemistry, forming the basis of all life on Earth </s>",
"User: What was the Library of Alexandria? Assistant: The library of alexandria was one of the most significant libraries of the ancient world </s>",
"User: What do honeybees do? Assistant: Honeybees play a vital role in pollination, which supports ecosystems and agriculture </s>",
"User: Are electric cars better? Assistant: Electric vehicles produce less air pollution than traditional gasoline-powered cars </s>",
"User: How do you make bread? Assistant: Bread is typically made from flour, water, yeast, and salt through a baking process </s>",
"User: What is the Sahara desert? Assistant: The sahara desert is the largest hot desert in the world, spanning multiple countries </s>",
"User: What are renewable resources? Assistant: Renewable resources replenish naturally and include sunlight, wind, and water </s>",
"User: What are eclipses? Assistant: Eclipses occur when one celestial body moves into the shadow of another </s>",
"User: How are language models trained? Assistant: Language models are trained using vast amounts of text to learn patterns in language </s>",
"User: How do compasses work? Assistant: Compasses work by aligning a magnetic needle with the earth's magnetic field </s>",
"User: What do vaccines do? Assistant: Vaccines help the immune system recognize and fight off specific pathogens </s>",
"User: Hello! Assistant: Hello! How can I help you today? </s>",
"User: Hi! Assistant: Hi there! What can I do for you? </s>",
"User: Hey! Assistant: Hey! How's it going? What would you like to know? </s>",
"User: Good morning! Assistant: Good morning! Hope you're having a great day. How can I assist you? </s>",
"User: Good afternoon! Assistant: Good afternoon! What can I help you with today? </s>",
"User: Good evening! Assistant: Good evening! How may I help you tonight? </s>",
"User: How are you? Assistant: I'm doing well, thank you for asking! How can I help you today? </s>",
"User: What's up? Assistant: Not much, just here to help! What would you like to know? </s>",
"User: Greetings! Assistant: Greetings! It's nice to meet you. How can I be of assistance? </s>",
"User: Thank you. Assistant: You're welcome! I'm happy to help </s>",
"User: Can you explain that? Assistant: Of course! I'd be happy to explain in more detail </s>",
"User: I don't understand. Assistant: No problem! Let me try to explain it differently </s>",
"User: What do you think? Assistant: Based on the information available, I think it's important to consider multiple perspectives </s>"
]
27 changes: 27 additions & 0 deletions data/pretraining_data.json
@@ -0,0 +1,27 @@
[
"The sun rises in the east and sets in the west </s>",
"Water flows downhill due to gravity </s>",
"Birds fly through the air using their wings </s>",
"Fish swim in rivers, lakes, and oceans </s>",
"Trees grow tall and produce leaves </s>",
"Rain falls from clouds in the sky </s>",
"Fire is hot and produces light </s>",
"Ice is frozen water that melts when heated </s>",
"Mountains are tall and rocky formations </s>",
"The moon orbits around planet Earth </s>",
"Flowers bloom in spring and summer </s>",
"Snow is cold and white </s>",
"Wind moves air from place to place </s>",
"Rivers flow into larger bodies of water </s>",
"Sand is found on beaches and in deserts </s>",
"Grass grows in fields and yards </s>",
"Rocks are hard and can be different colors </s>",
"Stars shine bright in the night sky </s>",
"Waves move across the surface of water </s>",
"Clouds form when water vapor rises </s>",
"Lightning is bright and makes thunder </s>",
"Storms bring rain and strong winds </s>",
"Seasons change throughout the year </s>",
"Animals eat food to survive </s>",
"Plants need sunlight and water to grow </s>"
]
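Every sample in both data files ends with a `</s>` marker, which appears to act as an end-of-sequence token. A minimal sketch of a check for that convention follows; `missing_end_marker` is a hypothetical helper introduced here for illustration and is not part of this PR.

```rust
/// Hypothetical validator (not part of this PR): returns the indices of
/// samples that do not end with the `</s>` marker seen in the data files.
/// Trailing whitespace after the marker is tolerated.
pub fn missing_end_marker(samples: &[&str]) -> Vec<usize> {
    samples
        .iter()
        .enumerate()
        // Keep only the samples whose trimmed text lacks the marker.
        .filter(|(_, s)| !s.trim_end().ends_with("</s>"))
        .map(|(i, _)| i)
        .collect()
}
```

Running such a check over the loaded strings before training would flag any entry added later without the marker.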
62 changes: 62 additions & 0 deletions src/dataset_loader.rs
@@ -0,0 +1,62 @@
use csv::ReaderBuilder;
use serde_json;
use std::fs;

pub struct Dataset {
    pub pretraining_data: Vec<String>,
    pub chat_training_data: Vec<String>,
}

#[allow(dead_code)]
pub enum DatasetType {
    JSON,
    CSV,
}

impl Dataset {
    pub fn new(
        pretraining_data_path: String,
        chat_training_data_path: String,
        type_of_data: DatasetType,
    ) -> Self {
        let pretraining_data: Vec<String>;
        let chat_training_data: Vec<String>;

        match type_of_data {
            DatasetType::CSV => {
                pretraining_data = get_data_from_csv(pretraining_data_path);
                chat_training_data = get_data_from_csv(chat_training_data_path);
            }
            DatasetType::JSON => {
                pretraining_data = get_data_from_json(pretraining_data_path);
                chat_training_data = get_data_from_json(chat_training_data_path);
            }
        }

        Dataset {
            pretraining_data: pretraining_data.clone(),
            chat_training_data: chat_training_data.clone(),
        }
    }
}

fn get_data_from_json(path: String) -> Vec<String> {
    // convert json file to Vec<String>
    let data_json = fs::read_to_string(path).expect("Failed to read data file");
    let data: Vec<String> = serde_json::from_str(&data_json).expect("Failed to parse data file");
    data
}

fn get_data_from_csv(path: String) -> Vec<String> {
    // convert csv file to Vec<String>
    let file = fs::File::open(path).expect("Failed to open CSV file");
    let mut rdr = ReaderBuilder::new().has_headers(false).from_reader(file);
    let mut data = Vec::new();

    for result in rdr.records() {
        let record = result.expect("Failed to read CSV record");
        // Each record is a row, join all columns into a single string
        data.push(record.iter().collect::<Vec<_>>().join(","));
    }
    data
}
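`Dataset` stores its samples as `Vec<String>`, while `LLM::train` in `src/llm.rs` takes `data: Vec<&str>`, so a caller has to borrow the strings before training. A minimal sketch of that conversion follows; `as_str_refs` is a hypothetical helper that is not part of this PR, and the struct is repeated only so the sketch compiles on its own.

```rust
// Stand-in for the `Dataset` struct from src/dataset_loader.rs,
// repeated here so the sketch is self-contained.
pub struct Dataset {
    pub pretraining_data: Vec<String>,
    pub chat_training_data: Vec<String>,
}

/// Hypothetical helper (not part of this PR): borrow each owned `String`
/// as a `&str`, producing the `Vec<&str>` shape that `LLM::train` takes.
/// The returned references are only valid while `samples` is alive.
pub fn as_str_refs(samples: &[String]) -> Vec<&str> {
    samples.iter().map(String::as_str).collect()
}
```

With this in place, a call such as `as_str_refs(&dataset.pretraining_data)` borrows from the dataset without copying any sample text.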
3 changes: 1 addition & 2 deletions src/layer_norm.rs
@@ -1,7 +1,6 @@
use crate::adam::Adam;
use crate::llm::Layer;
-use ndarray::Array2;
-use ndarray::Axis;
+use ndarray::{Array2, Axis};

pub struct LayerNorm {
    epsilon: f32, // Small constant for stability
5 changes: 3 additions & 2 deletions src/lib.rs
@@ -1,4 +1,5 @@
pub mod adam;
+pub mod dataset_loader;
pub mod embeddings;
pub mod feed_forward;
pub mod layer_norm;
@@ -8,9 +9,9 @@ pub mod self_attention;
pub mod transformer;
pub mod vocab;
// Re-export key structs for easier access
+pub use dataset_loader::{Dataset, DatasetType};
pub use embeddings::Embeddings;
-pub use llm::LLM;
-pub use llm::Layer;
+pub use llm::{LLM, Layer};
pub use vocab::Vocab;

// Constants
6 changes: 3 additions & 3 deletions src/llm.rs
@@ -5,9 +5,9 @@ use crate::MAX_SEQ_LEN;
use crate::Vocab;
use crate::output_projection::OutputProjection;
use crate::transformer::TransformerBlock;
-use ndarray::Array1;
-use ndarray::{Array2, Axis};
+use ndarray::{Array1, Array2, Axis};
use std::cmp::Ordering;
+
pub trait Layer {
    fn layer_type(&self) -> &str;

@@ -138,7 +138,7 @@ impl LLM {
    pub fn train(&mut self, data: Vec<&str>, epochs: usize, lr: f32) {
        let tokenized_data = data
            .iter()
-            .map(|input| (self.tokenize(input)))
+            .map(|input| self.tokenize(input))
            .collect::<Vec<Vec<usize>>>();

        for epoch in 0..epochs {