LLM training scripts #5

Open

erosika wants to merge 9 commits into fractal-nyc:main from erosika:LLM-training-scripts

Conversation

@erosika (Collaborator) commented Feb 20, 2025

Added some LLM scripts to train a GPT-2 model on tweets with Modal.


Important

Add scripts to fetch, process, and train a GPT-2 model on Twitter data using Modal.

  • Data Processing:
    • dataProcessing.py: Functions to load, clean, preprocess, and tokenize Twitter data for model training.
    • Handles JSON data conversion to HuggingFace Dataset and tokenization with AutoTokenizer.
  • Tweet Fetching:
    • getTweets.ts: Fetches tweets from a specified URL and saves them as JSON for training.
    • Includes functions getUserInput, fetchUserTweets, and saveTweetsToJson.
  • Model Training:
    • trainLLM.py: Defines a Modal app to train a GPT-2 model using the processed Twitter data.
    • Uses Trainer from transformers for training and saving the model.
  • Misc:
    • Adds .gitattributes for LFS management of large JSON files.
    • Removes mycustomchat.tsx from src/ui.

This description was created by Ellipsis for 87432da. It will automatically update as commits are pushed.
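The cleaning step described for dataProcessing.py could look roughly like this. This is a minimal sketch, not the PR's actual implementation; the function name `clean_tweet` and the exact regexes are assumptions based on the description "removing URLs, mentions, hashtags, and special characters":

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, @mentions, #hashtags, and special characters from a tweet.
    Hypothetical sketch of the cleaning described in the PR."""
    text = re.sub(r"http\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)              # mentions
    text = re.sub(r"#\w+", "", text)              # hashtags
    text = re.sub(r"[^\w\s]", "", text)           # special characters
    return " ".join(text.split())                 # collapse leftover whitespace

print(clean_tweet("Check this out https://t.co/abc @user #ml !!"))  # → Check this out
```

Note that, as one of the draft review comments points out, the `[^\w\s]` pass also strips emojis; whether that matters depends on the intended use of the trained model.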

@ellipsis-dev (Bot) left a comment

❌ Changes requested. Reviewed everything up to 87432da in 2 minutes and 46 seconds

More details
  • Looked at 505 lines of code in 6 files
  • Skipped 3 files when reviewing.
  • Skipped posting 7 drafted comments based on config settings.
1. src/twitter/dataProcessing.py:86
  • Draft comment:
    Ensure removing all columns is intended in dataset.map (using dataset.column_names).
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    The comment is asking the author to confirm their intention of removing all columns in a dataset. This violates the rule against asking the author to confirm their intention. The comment does not provide a specific suggestion or point out a clear issue with the code.
2. src/twitter/trainLLM.py:98
  • Draft comment:
    Consider awaiting the remote call or handling its future appropriately.
  • Reason this comment was not posted:
    Marked as duplicate.
3. .gitattributes:1
  • Draft comment:
    LFS config looks correct for managing large training data files.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
4. src/twitter/dataProcessing.py:45
  • Draft comment:
    Regex removing non-word characters may strip out emojis that could be important for sentiment analysis. Consider refining the cleaning if emojis are needed.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
    The code is preparing data for training with GPT-2 tokenizer. There's no indication in the code or comments that sentiment analysis or emoji preservation is a requirement. The docstring simply states it's for "removing URLs, mentions, hashtags, and special characters". Making assumptions about needing emojis would be speculative without clear requirements.
    The comment raises a valid technical point about emoji handling. If this were for sentiment analysis, preserving emojis would be important.
    However, there's no evidence in the code that emoji preservation is needed for this use case. The code appears to be doing exactly what its docstring claims - removing special characters.
    The comment makes assumptions about requirements that aren't evident in the code. It's a speculative "if X, then Y" type comment that should be removed.
5. src/twitter/dataProcessing.py:80
  • Draft comment:
    Returning PyTorch tensors directly via 'return_tensors="pt"' in tokenization may conflict with Trainer's collator. Consider letting the tokenizer return lists instead.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    This is a somewhat technical comment about HuggingFace's internals. While it could be valid, we don't see the actual training code that would use this data. The comment is speculative ("may conflict") rather than definitive. Without seeing the training code, we can't be certain this is an actual issue rather than a theoretical one.
    The comment could be correct - HuggingFace's documentation does sometimes recommend letting the collator handle tensor conversion. But we're missing the training code context to know if this is actually causing problems.
    Given our rules about speculative comments ("If X, then Y is an issue"), and the requirement for strong evidence, we should err on the side of removing this comment.
    Delete the comment because it's speculative and we lack the context to confirm it's actually causing problems. The current code may work perfectly fine as is.
6. src/twitter/trainLLM.py:21
  • Draft comment:
    Ensure that the local python source added via add_local_python_source is correctly mapped. Verify that 'dataProcessing' is accessible as expected in the remote container.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    The comment is asking the PR author to ensure and verify certain behaviors, which violates the rule against asking for confirmation or verification. It does not provide a specific code suggestion or ask for a specific test to be written.
7. src/ui/mycustomchat.tsx:1
  • Draft comment:
    Removal of the chat component file is noted. Confirm that this component is no longer needed to avoid breaking any dependent UI flows.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_dsdRQLTufc6eG1Ja


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Comment thread src/twitter/dataProcessing.py

from transformers import AutoTokenizer
import pandas as pd
import re
import numpy as np

Remove unused import 'numpy' (np) if not needed.

Suggested change
import numpy as np

Comment thread src/twitter/trainLLM.py
logging_steps=100,
save_strategy="epoch",
evaluation_strategy="epoch",
load_best_model_at_end=True,

Specify evaluation metrics when using load_best_model_at_end.

Suggested change
load_best_model_at_end=True,
load_best_model_at_end=True, metric_for_best_model='accuracy',

Comment thread src/twitter/dataProcessing.py

# Extract tweets into a list
tweets = data.get('tweets', [])
tweet_texts = [tweet['content'] for tweet in tweets]

Accessing tweet['content'] assumes every tweet object has a 'content' key. Consider adding error handling or a default value.

Suggested change
tweet_texts = [tweet['content'] for tweet in tweets]
tweet_texts = [tweet.get('content', '') for tweet in tweets]
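Beyond defaulting to an empty string, filtering out tweets that lack usable text avoids feeding blank examples to the tokenizer. A sketch, assuming the `'tweets'`/`'content'` JSON shape shown in the diff context (the sample `data` dict is illustrative):

```python
# Illustrative payload: one valid tweet, one missing 'content', one empty
data = {"tweets": [{"content": "hello world"}, {"id": 42}, {"content": ""}]}

tweets = data.get("tweets", [])
# Keep only tweets that actually carry non-empty text
tweet_texts = [t["content"] for t in tweets if t.get("content")]
print(tweet_texts)  # → ['hello world']
```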

Comment thread src/twitter/getTweets.ts
}

async function fetchUserTweets(username: username): Promise<user> {
const baseUrl = 'https://fabxmporizzqflnftavs.supabase.co/storage/v1/object/public/archives';

Consider externalizing the base URL instead of hard-coding it to improve maintainability and configurability.

Comment thread src/twitter/getTweets.ts

// Save to JSON file
const outputPath = `${outputDir}/${username}_tweets.json`;
fs.writeFileSync(

Using synchronous file writing (fs.writeFileSync) can block the process; if scaling or non-blocking behavior is desired, consider using asynchronous file I/O.

Comment thread src/twitter/trainLLM.py
username = input("Enter Twitter username to train model on: ")

# Train on the specified user's data
model, tokenizer = train_model.remote(username=username)

The remote function call 'train_model.remote' returns a future. Consider awaiting its result (e.g., using .result() if supported) to ensure that training completes before printing success.

Suggested change
model, tokenizer = train_model.remote(username=username)
model, tokenizer = train_model.remote(username=username).result()

Comment thread src/twitter/trainLLM.py

# Save the final model
if username:
final_output_dir = os.path.join(output_dir, f"{username}_model")

When constructing file paths using the 'username', consider sanitizing the input to prevent potential path injection issues.

Suggested change
final_output_dir = os.path.join(output_dir, f"{username}_model")
final_output_dir = os.path.join(output_dir, f"{os.path.basename(username)}_model")
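`os.path.basename` only strips directory components; a stricter allow-list is a common follow-up so that arbitrary characters in the username cannot leak into the path. A sketch (the helper name `sanitize_username` and the allow-list are assumptions, not part of the PR):

```python
import os
import re

def sanitize_username(username: str) -> str:
    """Reduce a username to a safe path component: drop any directory
    parts, then keep only word characters, dashes, and dots."""
    base = os.path.basename(username)    # strip path separators
    return re.sub(r"[^\w.-]", "", base)  # allow-list the rest

output_dir = "/models"
final_output_dir = os.path.join(output_dir, f"{sanitize_username('../../etc/passwd')}_model")
print(final_output_dir)  # → /models/passwd_model
```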
