Fix potential deadlocks from execution of mutex lock and release by mxgrey · Pull Request #490 · open-rmf/rmf_ros2

mxgrey · 2025-11-19T13:40:42Z

This PR fixes #448

I found two unrelated problems which had a risk of causing deadlocks in mutex usage:

1. Premature release of a mutex after replanning

The MoveRobot phase is set up to automatically release mutexes that are no longer needed by the remaining movement that the robot is performing. In general this works fine since a robot should only need to retain mutexes that it is currently moving through.

However, there is an edge case: If replanning gets forced by a negotiation then a new plan may involve one initial step where the robot is commanded to the midpoint of a lane. In the current implementation, this midpoint is not explicitly associated with any part of the graph, i.e. it has no "graph elements" to reference to determine what mutexes should be locked or not. This leaves the MoveRobot phase unaware that it needs to retain the mutex of the lane that the robot is currently on, as well as the mutex of any vertex that the robot is moving towards. Without that awareness, the MoveRobot phase will naturally release those mutexes.

In this PR, we now detect this edge case directly, and we block the MoveRobot phase from doing any automatic release of mutexes when it's handling this type of midpoint. This is a very blunt force approach to solving the problem, but I don't think we'll find a better approach until we reimplement the traffic system to do geometric reasoning instead of pure graph vertex/edge reasoning.
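The guard described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `Waypoint`, `graph_mutexes`, and `releasable_mutexes` are hypothetical names, and the real MoveRobot phase works with different types.

```cpp
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a plan waypoint.
struct Waypoint
{
  // Mutex groups of the graph elements (vertex and/or lane) that this
  // waypoint references. An empty optional models a waypoint with no graph
  // elements at all, such as a lane midpoint produced by a replan.
  std::optional<std::set<std::string>> graph_mutexes;
};

// Returns the held mutex groups that are safe to release, given the
// waypoints that the robot still has to move through.
std::set<std::string> releasable_mutexes(
  const std::set<std::string>& held,
  const std::vector<Waypoint>& remaining)
{
  std::set<std::string> still_needed;
  for (const auto& wp : remaining)
  {
    if (!wp.graph_mutexes.has_value())
    {
      // We cannot tell which mutexes this midpoint depends on, so block
      // automatic release entirely (the blunt-force approach).
      return {};
    }
    still_needed.insert(wp.graph_mutexes->begin(), wp.graph_mutexes->end());
  }

  std::set<std::string> release;
  for (const auto& m : held)
  {
    if (still_needed.count(m) == 0)
      release.insert(m);
  }
  return release;
}
```

With only graph-backed waypoints remaining, any held mutex that no remaining waypoint needs is returned for release. As soon as one remaining waypoint has no graph elements, the function returns an empty set and every held mutex is retained.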

2. Incorrect cleanup of `current_mutex_groups` in `ExecutePlan::make`

The logic when creating the plan execution is terribly convoluted because of the need to iterate through waypoints while only inserting events as needed. The variable `current_mutex_groups` is meant to track which mutex groups need to be locked before the next phase can be started, but its value was not getting reset to `nullopt` after being used in certain cases. This meant that the plan would involve repeatedly trying to relock mutexes that had just been correctly released.

The fix proposed in this comment does have the effect of forcing `current_mutex_groups` to be reset, which is why it appeared to resolve this part of the deadlocking problem. However, it also has the side effect of forcing a LockMutexGroup phase in between every waypoint in the plan, which is why I did not go with that proposed solution.
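The reset pattern can be illustrated with a small sketch. It is not the actual `ExecutePlan::make` logic, which is considerably more involved: `Step`, `needs_lock`, and `make_phases` are hypothetical names, and phases are reduced to plain strings.

```cpp
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

using MutexGroups = std::set<std::string>;

// Hypothetical, simplified plan step: the mutex groups that must be newly
// locked before moving to this step (empty when nothing new is needed).
struct Step
{
  MutexGroups needs_lock;
};

// Builds a flat list of phase labels from the plan steps.
std::vector<std::string> make_phases(const std::vector<Step>& steps)
{
  std::vector<std::string> phases;
  std::optional<MutexGroups> current_mutex_groups;

  for (const auto& step : steps)
  {
    if (!step.needs_lock.empty())
      current_mutex_groups = step.needs_lock;

    // std::exchange hands back the pending groups and resets the tracker to
    // nullopt in one step, so a lock phase is emitted at most once per
    // request instead of being repeated for every following waypoint.
    if (auto pending = std::exchange(current_mutex_groups, std::nullopt))
    {
      std::string label = "lock";
      for (const auto& m : *pending)
        label += " " + m;
      phases.push_back(label);
    }

    phases.push_back("move");
  }

  return phases;
}
```

If the tracker were left set after use, every later step in this sketch would emit another lock phase for groups that may already have been released, which is the relocking behavior described above.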

Other notes

Besides fixing the above problems, I've left some debug logging in this PR with the information that helped me resolve this issue. If we notice anything else suspicious about the mutex behavior in the future, those debug outputs should help resolve it a bit faster.

Signed-off-by: Michael X. Grey <greyxmike@gmail.com>

xiyuoh · 2025-11-21T07:49:59Z

I took a look at the code changes, and it seems reasonable to add an additional filter to make sure that we account for robots currently on a lane during replanning.

I'm testing this PR out, and though it was working great without deadlock for some time, after tweaking the graph a little (not the mutexes, just spaced out the charging stations at the bottom further apart), I can consistently create a deadlock where TinyRobot4 starts waiting for mutex_3 after a replan.

Relevant logs are available here, and I'm copying the suspicious lines below, where it suggests that TinyRobot4 should be moving up towards loc_6 instead of moving down towards the charger:

[fleet_adapter-19] [INFO] [1763709514.899506501] [TinyRobot_fleet_adapter]: Replanning for [TinyRobot/TinyRobot_4] after locking mutexes [mutex_4][mutex_5] because the recommended plan has changed from [loc_8][loc_10][#21][#24][#25][TinyRobot_4_charger] to [loc_8][loc_10][#21][#7][#23][#24][#25][TinyRobot_4_charger]
[fleet_adapter-19] [INFO] [1763709514.899614731] [TinyRobot_fleet_adapter]: Replanning requested for [TinyRobot/TinyRobot_4]
[fleet_adapter-19] [INFO] [1763709514.899701921] [TinyRobot_fleet_adapter]: Planning for [TinyRobot/TinyRobot_4] to [TinyRobot_4_charger] from one of these locations:
[fleet_adapter-19]  -- lane 8: { L1 <7.41765 2.27364> [loc_6] [mutex: mutex_3] } -> { L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] } [mutex: mutex_3] | location <  7.40485 -0.165181> | orientation -3.73284e-05
[fleet_adapter-19]  -- lane 9: { L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] } -> { L1 <7.41765 2.27364> [loc_6] [mutex: mutex_3] } [mutex: mutex_3] | location <  7.40485 -0.165181> | orientation -3.73284e-05
[fleet_adapter-19]  -- L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] | location <  7.40485 -0.165181> | orientation -3.73284e-05
[fleet_adapter-19]  -- lane 54: { L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] } -> { L1 <   0.8548 -0.165181> [loc_7] [mutex: mutex_4] } [mutex: mutex_4] | location <  7.40485 -0.165181> | orientation -3.73284e-05
[fleet_adapter-19]  -- lane 55: { L1 <   0.8548 -0.165181> [loc_7] [mutex: mutex_4] } -> { L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] } [mutex: mutex_4] | location <  7.40485 -0.165181> | orientation -3.73284e-05

For convenience I pushed my test case maps/configs to this xiyu/test_mutex branch, but it requires some setup to use them. The launch command is:

ros2 launch rmf_site_demos test_mutex.launch.xml

and I've also added charger.sh and send_robots_back.sh to send patrol tasks to all 5 robots to the chargers and start locations respectively.

I'm not entirely sure if this is a map setup problem, because it had worked successfully when I drew the charger waypoints differently (closer together, with a straight lane connecting all leaf nodes towards the chargers), but the replanning portion is a little weird to me.

mxgrey · 2025-11-24T07:34:55Z

@xiyuoh thanks for finding this additional case. I think it's worth opening a PR for this map you created in rmf_site_ros2 so we have this as a test case that we can refer back to.

I believe what's happening in the deadlock case that you've found is TinyRobot_4 is within the "merge lane" distance of the { loc_6 -> loc_8 } lane, making its current position on that lane an "acceptable" starting point for the planner. Geometrically the cost calculation is no worse whether the robot starts "on" that lane or "on" the loc_8 waypoint since it's still the same point in space. The planner will end up arbitrarily choosing between those two semantically different starting points because the apparent cost is the same between them. Unfortunately the { loc_6 -> loc_8 } lane has the constraint that it needs to have mutex_3 locked, which is something that doesn't get accounted for in the cost calculation or anywhere in the planner.
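To make the tie concrete, here is a small sketch using the coordinates from the logs above. The types and functions are hypothetical simplifications and not the planner's real cost model; the only point is that a purely geometric cost cannot distinguish the two starts, while their mutex requirements differ.

```cpp
#include <cmath>
#include <optional>
#include <set>
#include <string>

struct Point
{
  double x;
  double y;
};

// Hypothetical, simplified plan start; not the real rmf_traffic type.
struct Start
{
  Point target;                     // position of the waypoint to reach first
  std::optional<std::string> lane;  // lane the robot is treated as being on
  std::string required_mutex;       // mutex group of that lane or waypoint
};

// Purely geometric cost: the distance from the robot to the start's target.
// The mutex requirement plays no part in it.
double start_cost(const Point& robot, const Start& start)
{
  return std::hypot(start.target.x - robot.x, start.target.y - robot.y);
}

// True when the start would need a mutex that the robot does not hold.
bool needs_unheld_mutex(const Start& start, const std::set<std::string>& held)
{
  return held.count(start.required_mutex) == 0;
}
```

For a robot at `<7.40485, -0.165181>` holding `mutex_4` and `mutex_5`, a start on the { loc_6 -> loc_8 } lane and a start directly on loc_8 both target `<7.41765, -0.165181>` and get the same cost, yet only the lane start requires the unheld `mutex_3`.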

The only solution I can think of for this is that LockMutexGroup needs to forcibly override the start points used by the planner for as long as it's active. I'll try that out and see if it resolves this scenario.

Signed-off-by: Michael X. Grey <greyxmike@gmail.com>

mxgrey · 2026-01-24T05:52:08Z

@xiyuoh I finally had time to get back to this. I've updated the way RobotContext::location works so that events can set a _current_event_waypoint value which overrides the default behavior of RobotContext::location. Instead of naively using whatever we get from the user, it will narrow down the plan start values to only start from the event waypoint if possible. If the robot's location no longer corresponds to that waypoint at all, then we fall back to whatever the user has passed in.
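The narrowing and fallback behavior might look roughly like the sketch below. It is an illustration under stated assumptions rather than the actual `RobotContext::location` code: `Start` is a simplified stand-in for the real plan start type, `narrow_starts` is a hypothetical helper, and the rule that only starts sitting directly on the event waypoint (with no lane) are kept is an assumption of this sketch.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, simplified plan start.
struct Start
{
  std::size_t waypoint;             // index of the waypoint to start from
  std::optional<std::size_t> lane;  // lane the robot is treated as being on
  double orientation;               // orientation reported for the robot
};

// Narrows the reported starts to the current event waypoint when possible,
// otherwise falls back to whatever was reported.
std::vector<Start> narrow_starts(
  const std::vector<Start>& reported,
  std::optional<std::size_t> current_event_waypoint)
{
  // No event has claimed a waypoint: use the reported starts unchanged.
  if (!current_event_waypoint.has_value())
    return reported;

  std::vector<Start> narrowed;
  for (const auto& s : reported)
  {
    // Assumption: keep only starts directly on the event waypoint. Each
    // kept start is copied whole, so its reported orientation is preserved.
    if (s.waypoint == *current_event_waypoint && !s.lane.has_value())
      narrowed.push_back(s);
  }

  // The robot's location no longer corresponds to the event waypoint:
  // fall back to the reported starts.
  if (narrowed.empty())
    return reported;

  return narrowed;
}
```

In this sketch, a robot that reports starts on several lanes plus the event waypoint itself is planned from the waypoint alone, so the planner can no longer pick a lane start that requires an unheld mutex.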

Your test scenario is now succeeding. You can A/B test between 2b4d740 and 5045c2f to see that the original version of the PR consistently fails on the test case while the updated PR consistently succeeds.

xiyuoh

Thanks for nailing down the issue and adding the fix! Tested this a bunch of times and it consistently works now, all 5 robots are able to move to their respective chargers successfully. It'd be very meaningful to get this fix in.

yutaroha · 2026-01-29T09:42:54Z

rmf_fleet_adapter/src/rmf_fleet_adapter/agv/RobotContext.cpp

+      nav_params()->max_merge_lane_distance);
+
+    std::optional<Eigen::Vector2d> p = std::nullopt;
+    double orientation = 0.0;


Thank you for your work on this. We’ve tested the changes on our side and confirmed that it works well.
However, we’d like to report the following issue:
In the rmf_traffic::agv::Plan::StartSet RobotContext::location() const function, the orientation is set to a fixed value of 0.0. This causes the AMR to appear facing right in RViz when stopping at each point (though it doesn’t actually rotate in reality). We believe a specific orientation value needs to be set here instead of the hardcoded 0.0.

mxgrey · 2026-02-10

Thanks for catching this!

You're right that I forgot one line of code that was supposed to set this properly. Fix in this commit: 621b807

Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>

mxgrey added 2 commits November 19, 2025 02:43

Fix mutex lock placement within execution

da456f1

Signed-off-by: Michael X. Grey <greyxmike@gmail.com>

Clean up debug code

7245791

Signed-off-by: Michael X. Grey <greyxmike@gmail.com>

mxgrey added this to PMC Board Nov 19, 2025

github-project-automation bot moved this to Inbox in PMC Board Nov 19, 2025

mxgrey mentioned this pull request Nov 19, 2025

[Bug]: Title: Unexpected deadlock caused by Mutex #448

Open

1 task

Merge branch 'main' into fix_mutex_lock_execution

846b5db

Introduce the concept of a current event waypoint

a2a847a

Signed-off-by: Michael X. Grey <greyxmike@gmail.com>

xiyuoh mentioned this pull request Nov 27, 2025

New test_mutex site map open-rmf/rmf_site_ros2#5

Open

aaronchongth moved this from Inbox to In Progress in PMC Board Dec 2, 2025

mxgrey added 6 commits January 23, 2026 20:13

Merge remote-tracking branch 'origin/main' into fix_mutex_lock_execution

2b4d740

Merge branch 'fix_mutex_lock_execution' into current_event_waypoint

81aeffe

Set current_event_waypoint during mutex lock event

4f0efeb

Signed-off-by: Michael X. Grey <greyxmike@gmail.com>

Debugging

aa5093d

Signed-off-by: Michael X. Grey <greyxmike@gmail.com>

Eliminate undefined behavior from RobotContext::location change

6add417

Signed-off-by: Michael X. Grey <greyxmike@gmail.com>

Finish debugging

5045c2f

Signed-off-by: Michael X. Grey <greyxmike@gmail.com>

mxgrey added 2 commits January 24, 2026 14:41

Merge branch 'main' into fix_mutex_lock_execution

028109d

Merge branch 'main' into fix_mutex_lock_execution

7d7de88

xiyuoh previously approved these changes Jan 29, 2026

View reviewed changes

yutaroha reviewed Jan 29, 2026

View reviewed changes

JeremiahLim-HT added a commit to T2HOPETECHNIK/rmf_ros2 that referenced this pull request Jan 30, 2026

Manual Chery Pick open-rmf#490 from open-rmf/rmf_ros2

67c105e

jeff994 pushed a commit to T2HOPETECHNIK/rmf_ros2 that referenced this pull request Jan 30, 2026

Manual Chery Pick open-rmf#490 from open-rmf/rmf_ros2

765595f

jeff994 pushed a commit to T2HOPETECHNIK/rmf_ros2 that referenced this pull request Feb 2, 2026

Manual Chery Pick open-rmf#490 from open-rmf/rmf_ros2

c7276d5

jeff994 pushed a commit to T2HOPETECHNIK/rmf_ros2 that referenced this pull request Feb 3, 2026

Manual Chery Pick open-rmf#490 from open-rmf/rmf_ros2

f1929f7

mxgrey moved this from In Progress to In Review in PMC Board Feb 10, 2026

Set orientation based on location update

621b807

Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>

mxgrey dismissed xiyuoh’s stale review via 621b807 February 10, 2026 05:46

mxgrey wants to merge 13 commits into main from fix_mutex_lock_execution
