
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

Federated Learning (FL) with GRPO Setup Guide

This guide provides setup instructions for running GRPO (Group Relative Policy Optimization) experiments using FedML.

Initial Setup

Note: Server and Client(s) use the same initial setup process.

1. Clone Repository

git clone --recurse-submodules https://github.com/bagel-org/FedML.git
cd FedML

2. Install Dependencies

pip install -r python/spotlight_prj/fedllm/requirements.txt
pip install "trl>=0.9.0" "accelerate>=0.27.0"
pip install -e python/
cd python/spotlight_prj/fedllm

3. Environment Configuration

Set up AWS credentials:

export AWS_ACCESS_KEY_ID=<your_key>
export AWS_SECRET_ACCESS_KEY=<your_other_key>

Generate a unique run ID:

export RUN_ID=$(python -c "import uuid; print(uuid.uuid4().hex)")

Important: Server and Client(s) should all use the same run ID for a given run to avoid data conflicts in the S3 bucket.
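As a sketch of this step, the run ID can be generated once on any machine and then copied to every participant. The length check below is an assumption based on `uuid.uuid4().hex` producing 32 lowercase hex characters; it is not part of the FedML scripts.

```shell
# Generate the run ID once, then copy the same value to the server and all clients.
export RUN_ID=$(python -c "import uuid; print(uuid.uuid4().hex)")

# Sanity check: uuid4().hex is expected to be 32 lowercase hex characters.
if [ "${#RUN_ID}" -eq 32 ]; then
  echo "RUN_ID looks valid: $RUN_ID"
else
  echo "Unexpected RUN_ID format: $RUN_ID" >&2
fi
```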

4. Weights & Biases Setup

Configure wandb for experiment logging:

wandb login

Running Experiments

1-Client Test

Server

bash scripts/run_fedml_server_custom.sh 0 "$RUN_ID" localhost 29500 1 auto fedml_config/scenario1.yaml

Client

bash scripts/run_fedml_client_custom.sh 1 "$RUN_ID" localhost 29500 1 auto fedml_config/scenario1.yaml

Note: To run with two or more clients, the first argument to scripts/run_fedml_client_custom.sh is the client ID. Run the client script on each client machine, each with a distinct client ID.
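As a dry-run sketch of the multi-client case, the loop below prints the launch command for each client. It assumes the fifth positional argument is the total client count (by analogy with the 1-client example above); verify against the actual script before running.

```shell
# Hypothetical dry run: print one launch command per client.
# Each client gets a distinct ID (1..N); all share the same RUN_ID.
NUM_CLIENTS=2
RUN_ID=${RUN_ID:-example_run_id}
for CLIENT_ID in $(seq 1 "$NUM_CLIENTS"); do
  echo "bash scripts/run_fedml_client_custom.sh $CLIENT_ID $RUN_ID localhost 29500 $NUM_CLIENTS auto fedml_config/scenario1.yaml"
done
```

Execute each printed command on its corresponding client machine (drop the `echo` to launch directly).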

Notes

  • The RUN_ID should be unique for each experimental run to prevent data conflicts across different experiments.
  • All participants (server and clients) must use the same RUN_ID for a given experimental run.
  • Make sure AWS credentials are properly configured before starting the experiments.

The YAML Configuration File

All GRPO parameters can be set via YAML configuration files, usually stored in python/spotlight_prj/fedllm/fedml_config. In these files you can configure GRPO parameters such as the batch size, number of rollouts, and completion length. The client and server scripts read their parameters from these files.

The table below lists the YAML configuration file for each scenario analyzed in the paper.

| Scenario | YAML file      |
|----------|----------------|
| #1       | scenario1.yaml |
| #2       | scenario2.yaml |
| #3       | scenario3.yaml |
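For illustration only, a scenario file might set the GRPO parameters mentioned above roughly as follows. The key names here are assumptions, not taken from the repository; consult the actual files in python/spotlight_prj/fedllm/fedml_config for the real structure.

```yaml
# Hypothetical fragment -- key names are illustrative assumptions.
train_args:
  batch_size: 4               # per-step training batch size
  num_rollouts: 8             # completions sampled per prompt (the GRPO "group")
  max_completion_length: 512  # token cap for each generated completion
```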
