Skip to content

Conversation

kiukchung
Copy link
Contributor

Summary:
So far you can only specify the workspace in the runner's API:

runner.dryrun(appdef, cfg, workspace=...)

in which case the workspace applies to ONLY role[0].

This behavior was intentional since multi-role usecases of TorchX typically had a single "main" role that the application owner actually owned and the other roles were prepackaged apps (not part of your project).

This is no longer the case with applications such as reenforcement learning where the project encompasses multiple applications (e.g. trainer, generator, etc) therefore we need a more flexible way to specify a workspace per Role.

For BC this I'm maintaining the following behavior:

  1. If workspace is specified as a runner argument then it takes precedence over role[0].workspace

  2. Non-zero roles (e.g. role[1], role[2], ...) are unaffected by the workspace argument. That is their workspace attributes (e.g. role[1].workspace) are respected as is.

  3. "disabling" workspace (e.g. passing workspace=None from the runner argument) can still build a workspace if the role's workspace attribute is not None.

NOTE: we need to do further optimization for cases where multiple roles have the same "image" and "workspace". In this case we only need to build the image+workspace once. But as it stands we end up building a separate ephemeral per role (even if the ephemeral is the SAME across all the roles). This isn't an issue practically since image builders like Docker are content addressed and caches layers.

Reviewed By: AbishekS

Differential Revision: D83793199

Summary:
So far you can only specify the workspace in the runner's API:

```
runner.dryrun(appdef, cfg, workspace=...)
```

in which case the workspace applies to ONLY `role[0]`.

This behavior was intentional since multi-role usecases of TorchX typically had a single "main" role that the application owner actually owned and the other roles were prepackaged apps (not part of your project).

This is no longer the case with applications such as reenforcement learning where the project encompasses multiple applications (e.g. trainer, generator, etc) therefore we need a more flexible way to specify a workspace per Role.

For BC this I'm maintaining the following behavior:

1. If `workspace` is specified as a runner argument then it takes precedence over `role[0].workspace`

2. Non-zero roles (e.g. `role[1], role[2], ...`) are unaffected by the workspace argument. That is their workspace attributes (e.g. `role[1].workspace`) are respected as is.

3. "disabling" workspace (e.g. passing `workspace=None` from the runner argument) can still build a workspace if the role's workspace attribute is not `None`.

NOTE: we need to do further optimization for cases where multiple roles have the same "image" and "workspace". In this case we only need to build the image+workspace once. But as it stands we end up building a separate ephemeral per role (even if the ephemeral is the SAME across all the roles). This isn't an issue practically since image builders like Docker are content addressed and caches layers.

Reviewed By: AbishekS

Differential Revision: D83793199
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 3, 2025
@facebook-github-bot
Copy link
Contributor

@kiukchung has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83793199.

@facebook-github-bot
Copy link
Contributor

@kiukchung has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83793199.

@codecov-commenter
Copy link

codecov-commenter commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.64%. Comparing base (1e3df20) to head (b748dcd).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1139      +/-   ##
==========================================
+ Coverage   91.60%   91.64%   +0.04%     
==========================================
  Files          83       83              
  Lines        6431     6441      +10     
==========================================
+ Hits         5891     5903      +12     
+ Misses        540      538       -2     
Flag Coverage Δ
unittests 91.64% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants