Fix issue with moe trace generation #22

Open
daweilics wants to merge 1 commit into astra-sim:main from daweilics:main-fix-issue-with-moe-run

@daweilics

Without the fix, generating MoE traces fails with the following error:

MoE model: Converting Chakra
processing zero rank before cross bucket ((pp, 0), (ep, 0), (cp, 0), (tp, 0), (dp, 0))
Traceback (most recent call last):
  File "/home/dli/workspace/symbolic_tensor_graph/main.py", line 568, in <module>
    main()
    ~~~~^^
  File "/home/dli/workspace/symbolic_tensor_graph/main.py", line 478, in main
    distributed_chakra_graph_moe = BundledConvertChakra.apply(
        distributed_tensor_graph_moe,
    ...<2 lines>...
        mixed_precision=args.mixed_precision,
    )
  File "/home/dli/workspace/symbolic_tensor_graph/symbolic_tensor_graph/graph/convert_chakra.py", line 634, in apply
    tensor_map_nodes = cls._ConvertChakra.apply_before_cross_bucket_comms(
        tensor_graph, symbol_map_value, bundled_graph.spatial_parallel_dims
    )
  File "/home/dli/workspace/symbolic_tensor_graph/symbolic_tensor_graph/graph/convert_chakra.py", line 500, in apply_before_cross_bucket_comms
    nodes_this_tensor = cls._tensor_to_nodes(
        tensor, symbol_map_value, spatial_parallel_syms
    )
  File "/home/dli/workspace/symbolic_tensor_graph/symbolic_tensor_graph/graph/convert_chakra.py", line 258, in _tensor_to_nodes
    cls._insert_comp(tensor, symbol_map_value, nodes_this_tensor)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dli/workspace/symbolic_tensor_graph/symbolic_tensor_graph/graph/convert_chakra.py", line 68, in _insert_comp
    tensor_ops = Tensor.eval_expr(tensor.ops, symbol_map_value)
  File "/home/dli/workspace/symbolic_tensor_graph/symbolic_tensor_graph/tensor.py", line 114, in eval_expr
    target_eval_expr_cache[expr] = float(
                                   ~~~~~^
        expr.evalf(subs=target_symbol_value_dict)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/dli/miniconda3/lib/python3.13/site-packages/sympy/core/expr.py", line 375, in __float__
    raise TypeError("Cannot convert expression to float")
TypeError: Cannot convert expression to float

With the fix, MoE traces are generated successfully.

I think this block of code is not supposed to run here; as the surrounding context suggests, it should belong to the else branch. Let me know if this is a valid fix.
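
For reviewers unfamiliar with the `TypeError` in the traceback: SymPy raises "Cannot convert expression to float" when `float()` is applied to an expression that still contains free symbols after `evalf(subs=...)`. The sketch below reproduces that behavior in isolation. The symbols `batch`, `hidden`, and `experts` and the helper `eval_to_float` are hypothetical and not taken from the repository, and whether a missing symbol is the actual trigger in this code path is an assumption, not something the traceback confirms.

```python
import sympy as sp

# Hypothetical symbols for illustration only; not names from the repository.
batch, hidden, experts = sp.symbols("batch hidden experts")
ops = 2 * batch * hidden * experts


def eval_to_float(expr, symbol_values):
    """Follow the float(expr.evalf(subs=...)) pattern seen in the traceback."""
    return float(expr.evalf(subs=symbol_values))


# Every symbol has a value, so the expression reduces to a plain number.
full = {batch: 8, hidden: 1024, experts: 4}
complete_value = eval_to_float(ops, full)

# One symbol has no value, so evalf returns a symbolic expression
# and float() raises TypeError.
partial = {batch: 8, hidden: 1024}
try:
    eval_to_float(ops, partial)
    partial_error = None
except TypeError as exc:
    partial_error = str(exc)

# Listing the unresolved symbols is a quick way to see what is missing.
missing = ops.free_symbols - set(partial)
print(complete_value, partial_error, missing)
```

Comparing `expr.free_symbols` with the keys of the substitution dictionary before calling `float()` can help confirm whether the failing expression belongs to a code path that should not be evaluated for this configuration.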
