feat: add dynamic data seed option for flowertune examples #6831

xiaoyanshen799 wants to merge 2 commits into flwrlabs:main
Conversation
Pull request overview
Adds an opt-in configuration to vary the per-round data sampling seed in FlowerTune LLM example clients, addressing repeated training on the same early subset when max_steps is used.
Changes:
- Introduces `train.dynamic-data-seed = false` (default off) in each FlowerTune LLM example `pyproject.toml`.
- Updates each example `client_app.py` to (optionally) derive a per-round `TrainingArguments.data_seed` based on `server-round`.
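As a rough sketch, the new toggle would sit alongside the other training keys in each example's `pyproject.toml`; the `[tool.flwr.app.config]` section path is assumed from Flower's usual app-config convention:

```toml
[tool.flwr.app.config]
# Opt-in: vary the data sampling seed per round (default off for
# backward compatibility with existing runs).
train.dynamic-data-seed = false
```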
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| examples/flowertune-llm-medical/pyproject.toml | Adds train.dynamic-data-seed config toggle. |
| examples/flowertune-llm-medical/flowertune_medical/client_app.py | Optionally sets data_seed per round and reuses server_round variable. |
| examples/flowertune-llm-general-nlp/pyproject.toml | Adds train.dynamic-data-seed config toggle. |
| examples/flowertune-llm-general-nlp/flowertune_generalnlp/client_app.py | Optionally sets data_seed per round and reuses server_round variable. |
| examples/flowertune-llm-finance/pyproject.toml | Adds train.dynamic-data-seed config toggle. |
| examples/flowertune-llm-finance/flowertune_finance/client_app.py | Optionally sets data_seed per round and reuses server_round variable. |
| examples/flowertune-llm-code/pyproject.toml | Adds train.dynamic-data-seed config toggle. |
| examples/flowertune-llm-code/flowertune_code/client_app.py | Optionally sets data_seed per round and reuses server_round variable. |
```python
training_arguments = TrainingArguments(**cfg.train.training_arguments)

if cfg.train.get("dynamic_data_seed", False):
    base_seed = int(training_arguments.data_seed or training_arguments.seed)
```
`base_seed = int(training_arguments.data_seed or training_arguments.seed)` treats `data_seed=0` as falsy and will incorrectly fall back to `seed`. Use an explicit `is not None` check (or a `getattr` default) so `0` remains a valid seed value.
Suggested change:

```diff
- base_seed = int(training_arguments.data_seed or training_arguments.seed)
+ if training_arguments.data_seed is not None:
+     base_seed = int(training_arguments.data_seed)
+ elif training_arguments.seed is not None:
+     base_seed = int(training_arguments.seed)
+ else:
+     raise ValueError("Neither data_seed nor seed is set in TrainingArguments.")
```
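The pitfall flagged in this comment can be reproduced in isolation; a minimal sketch, independent of `TrainingArguments`:

```python
# `x or y` substitutes y whenever x is falsy, and 0 is falsy in Python,
# so an explicitly configured data_seed of 0 is silently replaced.
data_seed = 0
seed = 42

buggy = data_seed or seed                              # -> 42 (the 0 is lost)
fixed = data_seed if data_seed is not None else seed   # -> 0 (the 0 is kept)

print(buggy, fixed)
```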
```python
training_arguments = TrainingArguments(**cfg.train.training_arguments)

if cfg.train.get("dynamic_data_seed", False):
    base_seed = int(training_arguments.data_seed or training_arguments.seed)
```
`base_seed = int(training_arguments.data_seed or training_arguments.seed)` treats `data_seed=0` as falsy and will incorrectly fall back to `seed`. Use an explicit `is not None` check (or a `getattr` default) so `0` remains a valid seed value.
Suggested change:

```diff
- base_seed = int(training_arguments.data_seed or training_arguments.seed)
+ base_seed = int(
+     training_arguments.data_seed
+     if training_arguments.data_seed is not None
+     else training_arguments.seed
+ )
```
```python
training_arguments = TrainingArguments(**cfg.train.training_arguments)

if cfg.train.get("dynamic_data_seed", False):
    base_seed = int(training_arguments.data_seed or training_arguments.seed)
```
`base_seed = int(training_arguments.data_seed or training_arguments.seed)` treats `data_seed=0` as falsy and will incorrectly fall back to `seed`. Use an explicit `is not None` check (or a `getattr` default) so `0` remains a valid seed value.
Suggested change:

```diff
- base_seed = int(training_arguments.data_seed or training_arguments.seed)
+ base_seed_source = (
+     training_arguments.data_seed
+     if training_arguments.data_seed is not None
+     else training_arguments.seed
+ )
+ base_seed = int(base_seed_source)
```
```python
training_arguments = TrainingArguments(**cfg.train.training_arguments)

if cfg.train.get("dynamic_data_seed", False):
    base_seed = int(training_arguments.data_seed or training_arguments.seed)
```
`base_seed = int(training_arguments.data_seed or training_arguments.seed)` treats `data_seed=0` as falsy and will incorrectly fall back to `seed`. Use an explicit `is not None` check (or a `getattr` default) so `0` remains a valid seed value.
Suggested change:

```diff
- base_seed = int(training_arguments.data_seed or training_arguments.seed)
+ base_seed = (
+     int(training_arguments.data_seed)
+     if getattr(training_arguments, "data_seed", None) is not None
+     else int(training_arguments.seed)
+ )
```
Issue
Description
When `max_steps` is used in FlowerTune LLM examples, clients repeatedly train on the same data subset every round due to a fixed default seed, so the remaining data in each client's partition is never seen during training.

Related issues/PRs
#6808
Proposal
Explanation
- Add `train.dynamic-data-seed` to `pyproject.toml` to allow clients to cover more training data across rounds. Default is `false` to maintain backward compatibility.
- Update `client_app.py` in each example to dynamically set `data_seed` when `train.dynamic-data-seed = true`:

  ```python
  base_seed = int(training_arguments.data_seed or training_arguments.seed)
  training_arguments.data_seed = base_seed + server_round - 1
  ```

Closes #6808
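The per-round derivation above can be checked in isolation with a small helper; the `derive_data_seed` name is illustrative, not part of the PR:

```python
def derive_data_seed(base_seed: int, server_round: int) -> int:
    """Shift the data seed by the (1-indexed) round number so each round
    samples a different subset when max_steps truncates training."""
    return base_seed + server_round - 1

# Round 1 keeps the configured seed; each later round advances it by one.
print([derive_data_seed(17, r) for r in (1, 2, 3)])  # -> [17, 18, 19]
```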
Checklist

Any other comments?