LLM training scripts #5
Conversation
❌ Changes requested. Reviewed everything up to 87432da in 2 minutes and 46 seconds
More details
- Looked at 505 lines of code in 6 files
- Skipped 3 files when reviewing
- Skipped posting 7 drafted comments based on config settings
1. src/twitter/dataProcessing.py:86
- Draft comment:
Ensure removing all columns is intended in dataset.map (using dataset.column_names).
- Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
The comment is asking the author to confirm their intention of removing all columns in a dataset. This violates the rule against asking the author to confirm their intention. The comment does not provide a specific suggestion or point out a clear issue with the code.
2. src/twitter/trainLLM.py:98
- Draft comment:
Consider awaiting the remote call or handling its future appropriately.
- Reason this comment was not posted:
Marked as duplicate.
3. .gitattributes:1
- Draft comment:
LFS config looks correct for managing large training data files.
- Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%
None
4. src/twitter/dataProcessing.py:45
- Draft comment:
Regex removing non-word characters may strip out emojis that could be important for sentiment analysis. Consider refining the cleaning if emojis are needed.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
The code is preparing data for training with GPT-2 tokenizer. There's no indication in the code or comments that sentiment analysis or emoji preservation is a requirement. The docstring simply states it's for "removing URLs, mentions, hashtags, and special characters". Making assumptions about needing emojis would be speculative without clear requirements.
The comment raises a valid technical point about emoji handling. If this were for sentiment analysis, preserving emojis would be important.
However, there's no evidence in the code that emoji preservation is needed for this use case. The code appears to be doing exactly what its docstring claims - removing special characters.
The comment makes assumptions about requirements that aren't evident in the code. It's a speculative "if X, then Y" type comment that should be removed.
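Whether stripping all non-word characters is the right call depends on the downstream task, as the draft comment notes. The sketch below illustrates the trade-off; the function name `clean_tweet` and the `keep_emojis` flag are hypothetical and not part of the PR's actual dataProcessing.py:

```python
import re

def clean_tweet(text: str, keep_emojis: bool = False) -> str:
    """Remove URLs, mentions, and hashtags; optionally strip remaining
    special characters (which also removes emojis)."""
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)          # @mentions
    text = re.sub(r"#\w+", "", text)          # hashtags
    if not keep_emojis:
        # [^\w\s] matches punctuation AND emojis, so both are stripped here.
        text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())             # collapse leftover whitespace

clean_tweet("Great day! 😀 @friend #sunny https://t.co/x")
# keep_emojis=False drops the "!" and the emoji; keep_emojis=True retains both.
```

The point of the draft comment is that the `[^\w\s]` class is blunt: it cannot distinguish noise punctuation from sentiment-bearing emojis, so a task that needs emojis must remove URLs/mentions/hashtags with targeted patterns instead.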
5. src/twitter/dataProcessing.py:80
- Draft comment:
Returning PyTorch tensors directly via 'return_tensors="pt"' in tokenization may conflict with Trainer's collator. Consider letting the tokenizer return lists instead.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
This is a somewhat technical comment about HuggingFace's internals. While it could be valid, we don't see the actual training code that would use this data. The comment is speculative ("may conflict") rather than definitive. Without seeing the training code, we can't be certain this is an actual issue rather than a theoretical one.
The comment could be correct - HuggingFace's documentation does sometimes recommend letting the collator handle tensor conversion. But we're missing the training code context to know if this is actually causing problems.
Given our rules about speculative comments ("If X, then Y is an issue"), and the requirement for strong evidence, we should err on the side of removing this comment.
Delete the comment because it's speculative and we lack the context to confirm it's actually causing problems. The current code may work perfectly fine as is.
6. src/twitter/trainLLM.py:21
- Draft comment:
Ensure that the local python source added via add_local_python_source is correctly mapped. Verify that 'dataProcessing' is accessible as expected in the remote container.
- Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
The comment is asking the PR author to ensure and verify certain behaviors, which violates the rule against asking for confirmation or verification. It does not provide a specific code suggestion or ask for a specific test to be written.
7. src/ui/mycustomchat.tsx:1
- Draft comment:
Removal of the chat component file is noted. Confirm that this component is no longer needed to avoid breaking any dependent UI flows.
- Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_dsdRQLTufc6eG1Ja
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
from transformers import AutoTokenizer
import pandas as pd
import re
import numpy as np
Remove unused import 'numpy' (np) if not needed.
Suggested change:
- import numpy as np
logging_steps=100,
save_strategy="epoch",
evaluation_strategy="epoch",
load_best_model_at_end=True,
Specify evaluation metrics when using load_best_model_at_end.
Suggested change:
- load_best_model_at_end=True,
+ load_best_model_at_end=True, metric_for_best_model='accuracy',
# Extract tweets into a list
tweets = data.get('tweets', [])
tweet_texts = [tweet['content'] for tweet in tweets]
Accessing tweet['content'] assumes every tweet object has a 'content' key. Consider adding error handling or a default value.
Suggested change:
- tweet_texts = [tweet['content'] for tweet in tweets]
+ tweet_texts = [tweet.get('content', '') for tweet in tweets]
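The difference the suggestion makes can be shown on a toy input (the `retweet_of` key below is a made-up example of a malformed entry, not a field from the actual archive format):

```python
tweets = [
    {"content": "hello world"},
    {"retweet_of": "123"},  # hypothetical malformed entry with no 'content' key
]

# Direct indexing raises KeyError on the malformed entry:
#   [tweet["content"] for tweet in tweets]  ->  KeyError: 'content'

# .get() with a default tolerates missing keys:
tweet_texts = [tweet.get("content", "") for tweet in tweets]

# Optionally drop the empty placeholders so they don't enter training:
tweet_texts = [t for t in tweet_texts if t]
```

Filtering out the empty strings afterwards is worth considering, since feeding empty examples to the tokenizer only adds noise.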
async function fetchUserTweets(username: username): Promise<user> {
const baseUrl = 'https://fabxmporizzqflnftavs.supabase.co/storage/v1/object/public/archives';
Consider externalizing the base URL instead of hard-coding it to improve maintainability and configurability.
// Save to JSON file
const outputPath = `${outputDir}/${username}_tweets.json`;
fs.writeFileSync(
Using synchronous file writing (fs.writeFileSync) can block the process; if scaling or non-blocking behavior is desired, consider using asynchronous file I/O.
username = input("Enter Twitter username to train model on: ")
# Train on the specified user's data
model, tokenizer = train_model.remote(username=username)
The remote function call 'train_model.remote' returns a future. Consider awaiting its result (e.g., using .result() if supported) to ensure that training completes before printing success.
Suggested change:
- model, tokenizer = train_model.remote(username=username)
+ model, tokenizer = train_model.remote(username=username).result()
# Save the final model
if username:
    final_output_dir = os.path.join(output_dir, f"{username}_model")
When constructing file paths using the 'username', consider sanitizing the input to prevent potential path injection issues.
Suggested change:
- final_output_dir = os.path.join(output_dir, f"{username}_model")
+ final_output_dir = os.path.join(output_dir, f"{os.path.basename(username)}_model")
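A small standalone sketch of why the `os.path.basename` suggestion helps (the helper name `safe_model_dir` and the example inputs are illustrative, not from the PR):

```python
import os

def safe_model_dir(output_dir: str, username: str) -> str:
    # basename strips any directory components smuggled into the input,
    # e.g. "../../etc/cron" -> "cron", so the joined path cannot escape
    # output_dir via "..".
    safe_name = os.path.basename(username)
    return os.path.join(output_dir, f"{safe_name}_model")

safe_model_dir("models", "karpathy")        # -> 'models/karpathy_model' on POSIX
safe_model_dir("models", "../../etc/cron")  # -> 'models/cron_model' on POSIX
```

Note that `basename` only handles path separators; if usernames could contain other unsafe characters, an allowlist check (e.g. alphanumerics and underscores) would be stricter.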
Added some LLM scripts to train a GPT-2 model on tweets with Modal.
Important
Add scripts to fetch, process, and train a GPT-2 model on Twitter data using Modal.
- dataProcessing.py: Functions to load, clean, preprocess, and tokenize Twitter data for model training; uses Dataset and tokenization with AutoTokenizer.
- getTweets.ts: Fetches tweets from a specified URL and saves them as JSON for training; includes getUserInput, fetchUserTweets, and saveTweetsToJson.
- trainLLM.py: Defines a Modal app to train a GPT-2 model using the processed Twitter data; uses Trainer from transformers for training and saving the model.
- Adds .gitattributes for LFS management of large JSON files.
- Removes mycustomchat.tsx from src/ui.

This description was created by Ellipsis for 87432da. It will automatically update as commits are pushed.