Fix potential deadlocks from execution of mutex lock and release#490
Fix potential deadlocks from execution of mutex lock and release#490
Conversation
Signed-off-by: Michael X. Grey <greyxmike@gmail.com>
Signed-off-by: Michael X. Grey <greyxmike@gmail.com>
|
I took a look at the code changes, it seems reasonable to add an additional filter to make sure that we account for robots currently on a lane during replanning. I'm testing this PR out, and though it was working great without deadlock for some time, after tweaking the graph a little (not the mutexes, just spaced out the charging stations at the bottom further apart), I can consistently create a deadlock where TinyRobot4 starts waiting for
Relevant logs are available here, and I'm copying the suspicious lines below, where it suggests that TinyRobot4 should be moving up towards For convenience I pushed my test case maps/configs to this xiyu/test_mutex branch, but it requires some setup to use them. The launch command is: and I've also added I'm not entirely sure if this is a map setup problem, because it had worked successfully when I drew the charger waypoints differently (closer together, with a straight lane connecting all leaf nodes towards the chargers), but the replanning portion is a little weird to me. |
|
@xiyuoh thanks for finding this additional case. I think it's worth opening a PR for this map you created in I believe what's happening in the deadlock case that you've found is The only solution I can think of for this is that |
Signed-off-by: Michael X. Grey <greyxmike@gmail.com>
Signed-off-by: Michael X. Grey <greyxmike@gmail.com>
Signed-off-by: Michael X. Grey <greyxmike@gmail.com>
Signed-off-by: Michael X. Grey <greyxmike@gmail.com>
|
@xiyuoh I finally had time to get back to this. I've updated the way Your test scenario is now succeeding. You can A/B test between 2b4d740 and 5045c2f to see that the original version of the PR consistently fails on the test case while the updated PR consistently succeeds. |
xiyuoh
left a comment
There was a problem hiding this comment.
Thanks for nailing down the issue and adding the fix! Tested this a bunch of times and it consistently works now, all 5 robots are able to move to their respective chargers successfully. It'd be very meaningful to get this fix in.
| nav_params()->max_merge_lane_distance); | ||
|
|
||
| std::optional<Eigen::Vector2d> p = std::nullopt; | ||
| double orientation = 0.0; |
There was a problem hiding this comment.
Thank you for your work on this. We’ve tested the changes on our side and confirmed that it works well.
However, we’d like to report the following issue:
In the rmf_traffic::agv::Plan::StartSet RobotContext::location() const function, the orientation is set to a fixed value of 0.0. This causes the AMR to appear facing right in RViz when stopping at each point (though it doesn’t actually rotate in reality). We believe a specific orientation value needs to be set here instead of the hardcoded 0.0.
There was a problem hiding this comment.
Thanks for catching this!
You're right that I forget one line of code that was supposed to set this properly. Fix in this commit: 621b807
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>

This PR fixes #448
I found two unrelated problems which had a risk of causing deadlocks in mutex usage:
1. Premature release of a mutex after replanning
The
MoveRobotphase is set up to automatically release mutexes that are no longer needed by the remaining movement that the robot is performing. In general this works fine since a robot should only need to retain mutexes that it is currently moving through.However, there is an edge case: If replanning gets forced by a negotiation then a new plan may involve one initial step where the robot is commanded to the midpoint of a lane. In the current implementation, this midpoint is not explicitly associated with any part of the graph, i.e. it has no "graph elements" to reference to determine what mutexes should be locked or not. This leaves the
MoveRobotphase unaware that it needs to retain the mutex of the lane that the robot is currently on, nor the mutex of any vertex that the robot is moving towards. Without that awareness, theMoveRobotphase will naturally release those mutexes.In this PR, we now detect this edge case directly, and we block the
MoveRobotphase from doing any automatic release of mutexes when it's handling this type of midpoint. This is a very blunt force approach to solving the problem, but I don't think we'll find a better approach until we reimplement the traffic system to do geometric reasoning instead of pure graph vertex/edge reasoning.2. Incorrect cleanup of
current_mutex_groupsinExecutePlan::makeThe logic when creating the plan execution is terribly convoluted because of the need to iterate through waypoints while only inserting events as needed. The variable
current_mutex_groupsis meant to track which mutex groups need to be locked before the next phase can be started, but its value was not getting reset tonulloptafter being used in certain cases. This meant that the plan would involve repeatedly trying to relock mutexes that had just been correctly released.The fix proposed in this comment does have the effect of forcing
current_mutex_groupsto be reset, which is why it appeared to resolve this part of the deadlocking problem. However it also has a side-effect of forcing aLockMutexGroupphase in between every waypoint in the plan, which is why I did not go with that proposed solution.Other notes
Besides fixing the above problems, I've left some debug logging in this PR with the information that helped me resolve this issue. If we notice anything else suspicious about the mutex behavior in the future, those debug outputs should help resolve it a bit faster.