LLM training scripts #5

Open

erosika wants to merge 9 commits into fractal-nyc:main from erosika:LLM-training-scripts

Conversation

@erosika (Collaborator) commented Feb 20, 2025

Added some LLM scripts to train a GPT-2 model on tweets with Modal.


Important

Add scripts to fetch, process, and train a GPT-2 model on Twitter data using Modal.

  • Data Processing:
    • dataProcessing.py: Functions to load, clean, preprocess, and tokenize Twitter data for model training.
    • Handles JSON data conversion to HuggingFace Dataset and tokenization with AutoTokenizer.
  • Tweet Fetching:
    • getTweets.ts: Fetches tweets from a specified URL and saves them as JSON for training.
    • Includes functions getUserInput, fetchUserTweets, and saveTweetsToJson.
  • Model Training:
    • trainLLM.py: Defines a Modal app to train a GPT-2 model using the processed Twitter data.
    • Uses Trainer from transformers for training and saving the model.
  • Misc:
    • Adds .gitattributes for LFS management of large JSON files.
    • Removes mycustomchat.tsx from src/ui.

This description was created by Ellipsis for 87432da. It will automatically update as commits are pushed.
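The cleaning step described for dataProcessing.py could look roughly like this. This is a minimal sketch, not the PR's actual implementation; the function name `clean_tweet` and the exact regexes are assumptions based on the description "removing URLs, mentions, hashtags, and special characters":

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, @mentions, #hashtags, and special characters from a tweet.
    Hypothetical sketch of the cleaning described in the PR."""
    text = re.sub(r"http\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)              # mentions
    text = re.sub(r"#\w+", "", text)              # hashtags
    text = re.sub(r"[^\w\s]", "", text)           # special characters
    return " ".join(text.split())                 # collapse leftover whitespace

print(clean_tweet("Check this out https://t.co/abc @user #ml !!"))  # → Check this out
```

Note that, as one of the draft review comments points out, the `[^\w\s]` pass also strips emojis; whether that matters depends on the intended use of the trained model.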

@ellipsis-dev (Bot) left a comment

❌ Changes requested. Reviewed everything up to 87432da in 2 minutes and 46 seconds

More details
  • Looked at 505 lines of code in 6 files
  • Skipped 3 files when reviewing.
  • Skipped posting 7 drafted comments based on config settings.
1. src/twitter/dataProcessing.py:86
  • Draft comment:
    Ensure removing all columns is intended in dataset.map (using dataset.column_names).
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    The comment is asking the author to confirm their intention of removing all columns in a dataset. This violates the rule against asking the author to confirm their intention. The comment does not provide a specific suggestion or point out a clear issue with the code.
2. src/twitter/trainLLM.py:98
  • Draft comment:
    Consider awaiting the remote call or handling its future appropriately.
  • Reason this comment was not posted:
    Marked as duplicate.
3. .gitattributes:1
  • Draft comment:
    LFS config looks correct for managing large training data files.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
4. src/twitter/dataProcessing.py:45
  • Draft comment:
    Regex removing non-word characters may strip out emojis that could be important for sentiment analysis. Consider refining the cleaning if emojis are needed.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
    The code is preparing data for training with GPT-2 tokenizer. There's no indication in the code or comments that sentiment analysis or emoji preservation is a requirement. The docstring simply states it's for "removing URLs, mentions, hashtags, and special characters". Making assumptions about needing emojis would be speculative without clear requirements.
    The comment raises a valid technical point about emoji handling. If this were for sentiment analysis, preserving emojis would be important.
    However, there's no evidence in the code that emoji preservation is needed for this use case. The code appears to be doing exactly what its docstring claims - removing special characters.
    The comment makes assumptions about requirements that aren't evident in the code. It's a speculative "if X, then Y" type comment that should be removed.
5. src/twitter/dataProcessing.py:80
  • Draft comment:
    Returning PyTorch tensors directly via 'return_tensors="pt"' in tokenization may conflict with Trainer's collator. Consider letting the tokenizer return lists instead.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    This is a somewhat technical comment about HuggingFace's internals. While it could be valid, we don't see the actual training code that would use this data. The comment is speculative ("may conflict") rather than definitive. Without seeing the training code, we can't be certain this is an actual issue rather than a theoretical one.
    The comment could be correct - HuggingFace's documentation does sometimes recommend letting the collator handle tensor conversion. But we're missing the training code context to know if this is actually causing problems.
    Given our rules about speculative comments ("If X, then Y is an issue"), and the requirement for strong evidence, we should err on the side of removing this comment.
    Delete the comment because it's speculative and we lack the context to confirm it's actually causing problems. The current code may work perfectly fine as is.
6. src/twitter/trainLLM.py:21
  • Draft comment:
    Ensure that the local python source added via add_local_python_source is correctly mapped. Verify that 'dataProcessing' is accessible as expected in the remote container.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    The comment is asking the PR author to ensure and verify certain behaviors, which violates the rule against asking for confirmation or verification. It does not provide a specific code suggestion or ask for a specific test to be written.
7. src/ui/mycustomchat.tsx:1
  • Draft comment:
    Removal of the chat component file is noted. Confirm that this component is no longer needed to avoid breaking any dependent UI flows.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_dsdRQLTufc6eG1Ja


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Comment thread src/twitter/dataProcessing.py

from transformers import AutoTokenizer
import pandas as pd
import re
import numpy as np

Remove unused import 'numpy' (np) if not needed.

Suggested change
import numpy as np

Comment thread src/twitter/trainLLM.py
logging_steps=100,
save_strategy="epoch",
evaluation_strategy="epoch",
load_best_model_at_end=True,

Specify evaluation metrics when using load_best_model_at_end.

Suggested change
load_best_model_at_end=True,
load_best_model_at_end=True, metric_for_best_model='accuracy',

Comment thread src/twitter/dataProcessing.py

# Extract tweets into a list
tweets = data.get('tweets', [])
tweet_texts = [tweet['content'] for tweet in tweets]

Accessing tweet['content'] assumes every tweet object has a 'content' key. Consider adding error handling or a default value.

Suggested change
tweet_texts = [tweet['content'] for tweet in tweets]
tweet_texts = [tweet.get('content', '') for tweet in tweets]
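Beyond defaulting to an empty string, filtering out tweets that lack usable text avoids feeding blank examples to the tokenizer. A sketch, assuming the `'tweets'`/`'content'` JSON shape shown in the diff context (the sample `data` dict is illustrative):

```python
# Illustrative payload: one valid tweet, one missing 'content', one empty
data = {"tweets": [{"content": "hello world"}, {"id": 42}, {"content": ""}]}

tweets = data.get("tweets", [])
# Keep only tweets that actually carry non-empty text
tweet_texts = [t["content"] for t in tweets if t.get("content")]
print(tweet_texts)  # → ['hello world']
```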

Comment thread src/twitter/getTweets.ts
}

async function fetchUserTweets(username: username): Promise<user> {
const baseUrl = 'https://fabxmporizzqflnftavs.supabase.co/storage/v1/object/public/archives';

Consider externalizing the base URL instead of hard-coding it to improve maintainability and configurability.

Comment thread src/twitter/getTweets.ts

// Save to JSON file
const outputPath = `${outputDir}/${username}_tweets.json`;
fs.writeFileSync(

Using synchronous file writing (fs.writeFileSync) can block the process; if scaling or non-blocking behavior is desired, consider using asynchronous file I/O.

Comment thread src/twitter/trainLLM.py
username = input("Enter Twitter username to train model on: ")

# Train on the specified user's data
model, tokenizer = train_model.remote(username=username)

The remote function call 'train_model.remote' returns a future. Consider awaiting its result (e.g., using .result() if supported) to ensure that training completes before printing success.

Suggested change
model, tokenizer = train_model.remote(username=username)
model, tokenizer = train_model.remote(username=username).result()

Comment thread src/twitter/trainLLM.py

# Save the final model
if username:
final_output_dir = os.path.join(output_dir, f"{username}_model")

When constructing file paths using the 'username', consider sanitizing the input to prevent potential path injection issues.

Suggested change
final_output_dir = os.path.join(output_dir, f"{username}_model")
final_output_dir = os.path.join(output_dir, f"{os.path.basename(username)}_model")
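`os.path.basename` only strips directory components; a stricter allow-list is a common follow-up so that arbitrary characters in the username cannot leak into the path. A sketch (the helper name `sanitize_username` and the allow-list are assumptions, not part of the PR):

```python
import os
import re

def sanitize_username(username: str) -> str:
    """Reduce a username to a safe path component: drop any directory
    parts, then keep only word characters, dashes, and dots."""
    base = os.path.basename(username)    # strip path separators
    return re.sub(r"[^\w.-]", "", base)  # allow-list the rest

output_dir = "/models"
final_output_dir = os.path.join(output_dir, f"{sanitize_username('../../etc/passwd')}_model")
print(final_output_dir)  # → /models/passwd_model
```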
