
Control flow support #124

Open
HeydrichBeillschmidt wants to merge 13 commits into alpa-projects:master from HeydrichBeillschmidt:control-flow-support

Conversation

@HeydrichBeillschmidt

[WIP] support tuple-shaped parameters for while instruction

@tdietert

Hi @HeydrichBeillschmidt, when I merge your changes into my fork and try to call run_auto_sharding_pass on a simple MNIST model, I get this error:

  File "/workspaces/alpa/alpa/shard_parallel/auto_sharding.py", line 355, in run_auto_sharding_pass
    xe.run_auto_sharding(hlo_module, compile_options)
IndexError: absl::container_internal::raw_hash_map<>::at

The source of the error is the CreateStrategyVector code: apparently a select operation was never added to the strategy_map, so the lookup fails when iterating through the operands of the dot.268 instruction. Below is some HLO from an intermediate stage of compilation, after the spmd_simplify pipeline and before the spmd_pipeline that runs the auto sharding pass:

  broadcast.6 = f32[2048,1600]{1,0} broadcast(constant.171), dimensions={}
  select = f32[2048,1600]{1,0} select(compare.183, reshape.29, broadcast.6), metadata={op_type="Mul" op_name="mnist/sequential/dropout/dropout/Mul_1" source_file="/home/vscode/.local/lib/python3.8/site-packages/keras/backend.py" source_line=1940}
  arg34.35 = f32[1600,10]{1,0} parameter(34), parameter_replication={false}, metadata={op_name="XLA_Args"}
  dot.268 = f32[2048,10]{1,0} dot(select, arg34.35), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_type="MatMul" op_name="mnist/sequential/dense/MatMul" source_file="/home/vscode/.local/lib/python3.8/site-packages/keras/layers/core/dense.py" source_line=221}

And finally, here is some logging output I've generated that shows the sequence of events leading up to this failed indexing into the strategy map:

HandleDot[0]: dot.268
CreateLeafStrategyVector: dot.268
Potential Failing operand instruction: %select = f32[2048,1600]{1,0} select(pred[2048,1600]{1,0} %compare.183, f32[2048,1600]{1,0} %reshape.29, f32[2048,1600]{1,0} %broadcast.6), metadata={op_type="Mul" op_name="mnist/sequential/dropout/dropout/Mul_1" source_file="/home/vscode/.local/lib/python3.8/site-packages/keras/backend.py" source_line=1940}

Do you have any idea what could be the problem?

@tdietert

@HeydrichBeillschmidt I've solved this problem by undoing the part of the diff where you build an instruction sequence from the entry_computation->instructions() list. You passed this entry_sequence value to BuildStrategyAndCost, instead of the sequence value constructed from the hlo_live_range, but it doesn't actually contain all the instructions in the computation: https://github.com/alpa-projects/tensorflow-alpa/pull/124/files#diff-83aa23c5123bde398bcd2002e8bf5d5bdf79341e11f461715a127f9547357a13R2806

Is there a reason you did this? Replacing entry_sequence with sequence (from the hlo_live_range value, like in the master branch) passed to BuildStrategyAndCost solved my issue.

@HeydrichBeillschmidt
Author

> @HeydrichBeillschmidt I've solved this problem by undoing the part of the diff where you build an instruction sequence from the entry_computation->instructions() list. You passed this entry_sequence value to BuildStrategyAndCost, instead of the sequence value constructed from the hlo_live_range, but it doesn't actually contain all the instructions in the computation: https://github.com/alpa-projects/tensorflow-alpa/pull/124/files#diff-83aa23c5123bde398bcd2002e8bf5d5bdf79341e11f461715a127f9547357a13R2806
>
> Is there a reason you did this? Replacing entry_sequence with sequence (from the hlo_live_range value, like in the master branch) passed to BuildStrategyAndCost solved my issue.

Hi @tdietert, thank you for reporting this issue. BuildStrategyAndCost is designed as a recursive structure, and entry_sequence is passed to avoid repeatedly constructing strategies for instructions in nested computations such as a while body. However, simply setting entry_sequence = entry_computation->instructions() was incorrect. The problem is addressed in the latest commit.

merrymercy and others added 9 commits August 30, 2022 11:12
Co-authored-by: Yonghao Zhuang <zhuangyh@sjtu.edu.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Hexu Zhao <zhaohx19@mails.tsinghua.edu.cn>
Co-authored-by: Yonghao Zhuang <zhuangyh@sjtu.edu.cn>
Co-authored-by: Hexu Zhao <zhaohx19@mails.tsinghua.edu.cn>
Co-authored-by: Yonghao Zhuang <zhuangyh@sjtu.edu.cn>
@tdietert
Copy link
Copy Markdown

tdietert commented Sep 2, 2022

@HeydrichBeillschmidt Thanks for your response! We have tried your latest changes and they work well for us, thank you. We have not yet validated that the while loops are parallelized correctly, but we no longer experience any of the issues we saw before.
