
feat: add dynamic data seed option for flowertune examples#6831

Open
xiaoyanshen799 wants to merge 2 commits into flwrlabs:main from xiaoyanshen799:feature/dynamic-data-seed-flowertune

Conversation

@xiaoyanshen799

Issue

Description

When max_steps is used in the FlowerTune LLM examples, clients repeatedly train on the same data subset every round because the data-sampling seed defaults to a fixed value, so the remaining data in each client's partition is never seen during training.

Related issues/PRs

#6808

Proposal

Explanation

  • Added train.dynamic-data-seed to pyproject.toml to allow clients to cover more training data across rounds.
    Default is false to maintain backward compatibility.
  • Updated client_app.py in each example to dynamically set data_seed when train.dynamic-data-seed = true:
    base_seed = int(training_arguments.data_seed or training_arguments.seed)
    training_arguments.data_seed = base_seed + server_round - 1
  • All changes are applied consistently across all four FlowerTune LLM examples (finance, medical, code, general-nlp).
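The derivation above can be sketched end to end as follows. TrainingArgs here is a hypothetical stand-in for transformers.TrainingArguments (only the two seed fields matter), and an explicit None check is used so that data_seed=0 remains a valid seed:

```python
# Sketch of the per-round seed derivation described above. TrainingArgs is a
# hypothetical stand-in for transformers.TrainingArguments; the None check
# keeps an explicit data_seed=0 from being treated as "unset".
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingArgs:
    seed: int = 42                   # global RNG seed
    data_seed: Optional[int] = None  # data-sampling seed; falls back to seed


def derive_data_seed(args: TrainingArgs, server_round: int) -> int:
    """Offset the data-sampling seed by the (1-based) server round so each
    round samples differently when max_steps truncates the epoch."""
    base_seed = args.data_seed if args.data_seed is not None else args.seed
    return int(base_seed) + server_round - 1


print([derive_data_seed(TrainingArgs(seed=42), r) for r in (1, 2, 3)])  # [42, 43, 44]
```

With the default seed of 42, rounds 1, 2, 3 yield data seeds 42, 43, 44, so each round draws a different shuffle of the client's partition.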

Closes #6808

Checklist

  • Implement proposed change
  • Write tests
  • Update documentation
  • Address LLM-reviewer comments, if applicable (e.g., GitHub Copilot)
  • Make CI checks pass
  • Ping maintainers on Slack (channel #contributions)

Any other comments?

Contributor

Copilot AI left a comment

Pull request overview

Adds an opt-in configuration to vary the per-round data sampling seed in FlowerTune LLM example clients, addressing repeated training on the same early subset when max_steps is used.

Changes:

  • Introduces train.dynamic-data-seed = false (default off) in each FlowerTune LLM example pyproject.toml.
  • Updates each example client_app.py to (optionally) derive a per-round TrainingArguments.data_seed from the server round.
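For illustration, the toggle might look like this in an example's pyproject.toml. The exact table path is an assumption here (Flower apps typically keep run config under [tool.flwr.app.config]), so check each example's existing layout:

```toml
# Hypothetical fragment; the table path follows Flower's usual app-config
# convention and may differ per example.
[tool.flwr.app.config]
# Opt-in: when true, client_app.py offsets data_seed by (server_round - 1).
train.dynamic-data-seed = false
```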

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:

  • examples/flowertune-llm-medical/pyproject.toml: adds the train.dynamic-data-seed config toggle.
  • examples/flowertune-llm-medical/flowertune_medical/client_app.py: optionally sets data_seed per round and reuses the server_round variable.
  • examples/flowertune-llm-general-nlp/pyproject.toml: adds the train.dynamic-data-seed config toggle.
  • examples/flowertune-llm-general-nlp/flowertune_generalnlp/client_app.py: optionally sets data_seed per round and reuses the server_round variable.
  • examples/flowertune-llm-finance/pyproject.toml: adds the train.dynamic-data-seed config toggle.
  • examples/flowertune-llm-finance/flowertune_finance/client_app.py: optionally sets data_seed per round and reuses the server_round variable.
  • examples/flowertune-llm-code/pyproject.toml: adds the train.dynamic-data-seed config toggle.
  • examples/flowertune-llm-code/flowertune_code/client_app.py: optionally sets data_seed per round and reuses the server_round variable.


training_arguments = TrainingArguments(**cfg.train.training_arguments)

if cfg.train.get("dynamic_data_seed", False):
    base_seed = int(training_arguments.data_seed or training_arguments.seed)
Copilot AI Mar 24, 2026

base_seed = int(training_arguments.data_seed or training_arguments.seed) treats data_seed=0 as falsy and will incorrectly fall back to seed. Use an explicit is not None check (or getattr default) so 0 remains a valid seed value.

Suggested change:

  - base_seed = int(training_arguments.data_seed or training_arguments.seed)
  + if training_arguments.data_seed is not None:
  +     base_seed = int(training_arguments.data_seed)
  + elif training_arguments.seed is not None:
  +     base_seed = int(training_arguments.seed)
  + else:
  +     raise ValueError("Neither data_seed nor seed is set in TrainingArguments.")
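The pitfall this comment describes is easy to reproduce in isolation: Python's `or` falls back on any falsy value, including a legitimate seed of 0.

```python
# `x or y` substitutes y whenever x is falsy, so a valid data_seed of 0 is
# silently replaced by the fallback; the explicit None check preserves it.
data_seed = 0       # explicitly chosen, valid seed
fallback_seed = 42  # plays the role of training_arguments.seed

buggy = int(data_seed or fallback_seed)
fixed = int(data_seed if data_seed is not None else fallback_seed)

print(buggy, fixed)  # 42 0
```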

training_arguments = TrainingArguments(**cfg.train.training_arguments)

if cfg.train.get("dynamic_data_seed", False):
    base_seed = int(training_arguments.data_seed or training_arguments.seed)
Copilot AI Mar 24, 2026

base_seed = int(training_arguments.data_seed or training_arguments.seed) treats data_seed=0 as falsy and will incorrectly fall back to seed. Use an explicit is not None check (or getattr default) so 0 remains a valid seed value.

Suggested change:

  - base_seed = int(training_arguments.data_seed or training_arguments.seed)
  + base_seed = int(
  +     training_arguments.data_seed
  +     if training_arguments.data_seed is not None
  +     else training_arguments.seed
  + )

training_arguments = TrainingArguments(**cfg.train.training_arguments)

if cfg.train.get("dynamic_data_seed", False):
    base_seed = int(training_arguments.data_seed or training_arguments.seed)
Copilot AI Mar 24, 2026

base_seed = int(training_arguments.data_seed or training_arguments.seed) treats data_seed=0 as falsy and will incorrectly fall back to seed. Use an explicit is not None check (or getattr default) so 0 remains a valid seed value.

Suggested change:

  - base_seed = int(training_arguments.data_seed or training_arguments.seed)
  + base_seed_source = (
  +     training_arguments.data_seed
  +     if training_arguments.data_seed is not None
  +     else training_arguments.seed
  + )
  + base_seed = int(base_seed_source)

training_arguments = TrainingArguments(**cfg.train.training_arguments)

if cfg.train.get("dynamic_data_seed", False):
    base_seed = int(training_arguments.data_seed or training_arguments.seed)
Copilot AI Mar 24, 2026

base_seed = int(training_arguments.data_seed or training_arguments.seed) treats data_seed=0 as falsy and will incorrectly fall back to seed. Use an explicit is not None check (or getattr default) so 0 remains a valid seed value.

Suggested change:

  - base_seed = int(training_arguments.data_seed or training_arguments.seed)
  + base_seed = (
  +     int(training_arguments.data_seed)
  +     if getattr(training_arguments, "data_seed", None) is not None
  +     else int(training_arguments.seed)
  + )

@github-actions github-actions bot added the Contributor Used to determine what PRs (mainly) come from external contributors. label Mar 24, 2026

Labels

Contributor Used to determine what PRs (mainly) come from external contributors.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Repeated local training on nearly the same subset of a client partition across rounds in examples/flowertune-llm-finance

2 participants