This framework enables distributed training of text-to-text models (like T5 and BART) across multiple machines using federated learning. It consists of three main components:
- Coordinator: Manages the training process and aggregates model updates
- Worker: Performs local training on data and sends updates to the coordinator
- Monitor: Web-based dashboard to track training progress
Features:

- Support for multiple text-to-text models (T5, BART)
- Real-time training monitoring via web dashboard
- Automatic model aggregation using federated averaging
- GPU/CPU support
- Fault tolerance and automatic node recovery
- Easy to extend with new models
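The aggregation step (federated averaging) weights each worker's update by the size of its local dataset. A simplified sketch, using plain Python lists of parameters rather than the framework's actual coordinator code:

```python
def federated_average(state_dicts, sample_counts):
    """Weighted average of workers' flat parameter lists.

    state_dicts: list of {param_name: [float, ...]} from each worker.
    sample_counts: local dataset size for each worker (same order).
    """
    total = sum(sample_counts)
    avg = {}
    for name in state_dicts[0]:
        n_values = len(state_dicts[0][name])
        avg[name] = [
            sum(sd[name][i] * n / total
                for sd, n in zip(state_dicts, sample_counts))
            for i in range(n_values)
        ]
    return avg
```

In the real framework the averaged quantities would be model weight tensors, but the weighting scheme is the same: nodes with more local data contribute proportionally more to the aggregated model.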
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/decentralized-text-training.git
  cd decentralized-text-training
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

Start the coordinator on one machine:

```bash
python coordinator.py --host 0.0.0.0 --port 5000 --model t5-small
```

On each machine that will participate in training, start a worker:

```bash
python worker.py --coordinator-address <coordinator-ip>:5000 --model t5-small [--gpu]
```

Start the web monitor:

```bash
python web_monitor.py --coordinator-address <coordinator-ip>:5000 --port 8080
```

Then open your browser and navigate to http://localhost:8080 to view the training dashboard.
The framework can be configured through config.py. Key settings include:
- `BATCH_SIZE`: Batch size for local training
- `LEARNING_RATE`: Learning rate for model updates
- `LOCAL_EPOCHS`: Number of local training epochs per round
- `MIN_NODES_TO_START`: Minimum number of nodes required to start training
- `NODE_TIMEOUT`: Timeout after which an inactive node is dropped
- `NODE_HEARTBEAT_INTERVAL`: Interval between node heartbeats
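A `config.py` with these settings might look like the following; the values shown are illustrative defaults, not the project's shipped configuration:

```python
# config.py -- example values; tune for your cluster and model size.
BATCH_SIZE = 8                 # per-worker batch size for local training
LEARNING_RATE = 5e-5           # learning rate for local model updates
LOCAL_EPOCHS = 2               # local epochs per federated round
MIN_NODES_TO_START = 2         # wait for this many workers before round 1
NODE_TIMEOUT = 60              # seconds of silence before a node is dropped
NODE_HEARTBEAT_INTERVAL = 10   # seconds between worker heartbeats
```

The heartbeat interval should be well below the timeout, so a node gets several chances to report before being dropped.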
To add a new model:
- Add the model configuration to `model_registry.py`:

  ```python
  model_registry.register_model(
      name="your-model-name",
      model_class=YourModelClass,
      tokenizer_class=YourTokenizerClass,
      pretrained_name="pretrained-model-name",
      config={
          "max_length": 512,
          "num_beams": 4,
          "early_stopping": True
      }
  )
  ```

- Update the worker to handle the new model's data format and training process.
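The registration call above implies a registry keyed by model name. A minimal sketch of what `model_registry.py` might contain, assuming a simple dict-backed design (the project's actual implementation may differ):

```python
class ModelRegistry:
    """Maps model names to their classes, pretrained weights, and config."""

    def __init__(self):
        self._models = {}

    def register_model(self, name, model_class, tokenizer_class,
                       pretrained_name, config=None):
        # Store everything a worker needs to build this model.
        self._models[name] = {
            "model_class": model_class,
            "tokenizer_class": tokenizer_class,
            "pretrained_name": pretrained_name,
            "config": config or {},
        }

    def get(self, name):
        # Raises KeyError for unregistered models.
        return self._models[name]

# Module-level singleton, matching the usage shown above.
model_registry = ModelRegistry()
```

With this shape, a worker can look up `model_registry.get(args.model)` at startup and instantiate the model and tokenizer from the stored classes and pretrained name.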
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.