This repository contains a project that trains and generates text using a GPT-based language model enhanced with Retrieval-Augmented Generation (RAG). The model uses PyTorch for deep learning, Sentence-Transformers for encoding documents, and FAISS for efficient similarity search. Training can be paused and resumed later, and early stopping is supported to avoid overfitting.
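For context, the retrieval step with Sentence-Transformers and FAISS typically works as in the minimal sketch below. The encoder name, example documents, and query are illustrative assumptions, not the repository's exact code:

```python
# Sketch: encode documents with Sentence-Transformers, index them with FAISS,
# and retrieve the nearest document for a query (to be prepended to the prompt in RAG).
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder model
docs = ["Transformers use attention.", "FAISS enables fast similarity search."]
doc_embeddings = encoder.encode(docs, convert_to_numpy=True)

index = faiss.IndexFlatL2(doc_embeddings.shape[1])  # exact L2 search index
index.add(doc_embeddings)

query_embedding = encoder.encode(["How does retrieval work?"], convert_to_numpy=True)
distances, ids = index.search(query_embedding, 1)
print(docs[ids[0][0]])  # nearest document for the query
```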
To set up the project, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/gpt-language-model-rag.git
  cd gpt-language-model-rag
  ```

- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
To train the model, use the following command:
```bash
python MyLLM.py <path_to_text_file> [options]
```

Arguments:

- `filepath`: Path to the input text, PDF, or Parquet file.

Options:

- `--normalize`: Normalize text during preprocessing.
- `--use_subword`: Use subword tokenization.
- `--handle_unicode`: Handle Unicode characters during preprocessing.
- `--resume`: Resume training from the interim checkpoint.
Example:
```bash
python MyLLM.py data/sample.txt --normalize --use_subword --handle_unicode
```

To generate text using the trained model, use the following command:
```bash
python MyLLM.py <path_to_text_file> --generate --start_text "<starting text>" [options]
```

Options:

- `--start_text`: Starting text for text generation.
- `--max_new_tokens`: Maximum number of new tokens to generate (default: 100).
- `--temperature`: Sampling temperature for text generation (default: 1.0).
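`--temperature` controls sampling in the usual way: logits are divided by the temperature before the softmax, so values below 1.0 make generation more deterministic and values above 1.0 more diverse. A generic sketch of this idea (not the repository's exact generation code):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```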
Example:
```bash
python MyLLM.py data/sample.txt --generate --start_text "Once upon a time" --max_new_tokens 50 --temperature 0.7
```

The training process can be stopped and resumed using checkpoints. If the training is interrupted or the `--resume` flag is used, the model will continue training from the last saved checkpoint. Additionally, if the model reaches a specified target loss, it will stop training early, even if it hasn't completed the full number of epochs.
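The checkpoint and early-stopping logic follows the usual PyTorch pattern; the sketch below illustrates the idea. The file name, `TARGET_LOSS` value, and loop details are assumptions, not the repository's exact code:

```python
import os
import torch

CHECKPOINT_PATH = "checkpoints/interim.pt"  # assumed checkpoint file name
TARGET_LOSS = 1.5                           # assumed early-stopping threshold

def save_checkpoint(model, optimizer, iteration):
    # Persist everything needed to resume training later.
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "iteration": iteration},
        CHECKPOINT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Return the iteration to resume from, or 0 if no checkpoint exists."""
    if not os.path.exists(CHECKPOINT_PATH):
        return 0
    state = torch.load(CHECKPOINT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["iteration"]

# Inside the training loop, the two behaviors described above reduce to:
#   - call save_checkpoint(model, optimizer, it) periodically, and
#   - `if loss.item() <= TARGET_LOSS: break` to stop early once the target loss is reached.
```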
The configuration for the model is handled via the `ModelConfig` class within `MyLLM.py`. Key parameters include:

- `device`: Device to run the model on (`cuda` if available, else `cpu`).
- `batch_size`: Batch size for training.
- `block_size`: Size of input blocks.
- `max_iters`: Maximum number of training iterations.
- `learning_rate`: Learning rate for the optimizer.
- `eval_interval`: Interval for evaluation.
- `n_embd`: Dimension of the embeddings.
- `n_head`: Number of attention heads.
- `n_layer`: Number of transformer layers.
- `dropout`: Dropout rate.
- `num_workers`: Number of worker threads for data loading.
- `model_dir`, `log_dir`, `checkpoint_dir`: Directories for saving models, logs, and checkpoints.
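As an illustration, a config class with these fields might look like the following sketch. The default values shown here are placeholders and will differ from those in `MyLLM.py`:

```python
from dataclasses import dataclass
import torch

@dataclass
class ModelConfig:
    # All defaults below are illustrative assumptions, not the repository's values.
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
    batch_size: int = 64
    block_size: int = 256
    max_iters: int = 5000
    learning_rate: float = 3e-4
    eval_interval: int = 500
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    dropout: float = 0.2
    num_workers: int = 4
    model_dir: str = "saved_models"
    log_dir: str = "logs"
    checkpoint_dir: str = "checkpoints"
```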
- `MyLLM.py`: Main script for training and text generation; contains the `ModelConfig` class.
- `data/`: Directory for storing input data.
- `logs/`: Directory for storing logs.
- `saved_models/`: Directory for storing saved models.
- `checkpoints/`: Directory for storing interim checkpoints.
- `requirements.txt`: List of required dependencies.
Contributions are welcome! Please fork the repository and submit a pull request with your changes.
This project is licensed under the MIT License. See the LICENSE file for more details.
