Retry on libssh SSH_AGAIN return code
#756
I wonder if this could result in an infinite loop.
Also, without tests there is no proof that this works as intended. Add them.
Yes, this would go into an infinite loop if the server dies and does not properly disconnect. Timeouts exist to handle this issue. Retrying unconditionally and infinitely is OK for tests, but for a real-world application, pylibssh should at the very least do some check with Or setting some limit on how many times you can retry. But in this case, why not raise the timeout itself?
Thank you for your comments and review.
I tried to add a test for this but unfortunately I couldn't reproduce the scenario we see in the test environment. I'll try again.
I can change this PR to add a call to Whichever you prefer is acceptable for us, just let me know and I'll make those changes.
I am actually wondering how you are getting SSH_AGAIN in these two places with pylibssh. Sessions in libssh are blocking by default. The only way to change the session to non-blocking mode is to use But there might be the oddness that setting a low timeout might actually return SSH_AGAIN in places where it should not, according to the documentation, which would be a bug in libssh that needs to be fixed. What initially led you to set smaller timeouts? Is raising the timeouts a viable workaround?
The error we currently see in our PRCI is specific to
I added the
In our code we set
If I understand correctly, setting this low -- for reference https://github.com/next-actions/pytest-mh/blob/master/pytest_mh/conn/ssh.py
Workaround ansible pylibssh issue which causes test failures pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2] PR ansible/pylibssh#756 is under review but workaround it in the meantime.
51b9a9b to 7afce6d
OK, the libssh timeout is how long you give libssh before it must return control to you. But if you are setting a low timeout to get signals delivered, then either pylibssh or the caller needs to retry. The pylibssh code is really not written to support retries here, so my proposal would be to add a pylibssh timeout/retry counter to avoid an infinite cycle when things go wrong. What do you think? It can be either a separate pylibssh option, or it can be intercepted when we set the libssh timeout, setting it to some multiple of the user-specified value so that handling returns to the Python code. Or the second option by default with a possible override. And obviously we need some tests with this option, otherwise it's untested, broken code. I bet we can find some slow CI runners where this would show up from time to time.
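A minimal sketch of the bounded-retry idea proposed above, not the code that was eventually merged. `open_once` is a hypothetical stand-in for a single `ssh_channel_open_session()` attempt, and the numeric values follow libssh's `SSH_OK` and `SSH_AGAIN` constants (the `[-2]` seen in the exception message quoted earlier).

```python
SSH_OK = 0
SSH_AGAIN = -2  # libssh return code: the call did not complete, try again


def open_with_retries(open_once, retries=0):
    """Call open_once() until it stops returning SSH_AGAIN or retries run out.

    open_once is a hypothetical zero-argument callable standing in for one
    ssh_channel_open_session() attempt; it returns a libssh-style code.
    retries=0 means a single attempt and no retry.
    """
    if retries < 0:
        raise ValueError("retries must be non-negative")
    rc = open_once()
    attempts_left = retries
    # The counter bounds the loop, so a dead server cannot spin forever.
    while rc == SSH_AGAIN and attempts_left > 0:
        attempts_left -= 1
        rc = open_once()
    return rc
```

With a counter like this, the worst case is `retries + 1` attempts, each limited by the libssh timeout, after which the last return code is handed back to the caller.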
7afce6d to 571272e
I went ahead and updated the PR with your suggestion, by adding a new option
I added some test scaffolding that is not done yet. In the test environment I always see |
The libssh has an option https://api.libssh.org/master/group__libssh__session.html#ga7a801b85800baa3f4e16f5b47db0a73d |
55f23f6 to 94a3c16
I clicked "rebase" so this PR pulls in the CI fixes.
The change note was lost in your last force-push.
94a3c16 to d3528df
52ed2eb to 30192f1
@Jakuje any idea what might be causing these timeouts? https://github.com/ansible/pylibssh/actions/runs/18018035693/job/51268709820?pr=756#step:17:101
30192f1 to 7f4e098
The timeouts are in the |
Thanks, merged that one!
7f4e098 to f7e46bb
I clicked |
f7e46bb to bea82cb
Improve pylibssh handling when libssh ssh_channel_open_session() returns SSH_AGAIN. Add a new 'open_session_retries' session connect() parameter to allow a configurable number of retries. SSH_AGAIN may be returned when setting a low SSH options timeout value. The default option value is 0, no retries will be attempted.
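For illustration, a caller might pass the new parameter roughly as follows. Only `open_session_retries` and its default of 0 come from the change note above; the helper name, the host and user values, and the other keyword names (`host`, `user`, `timeout`) are assumptions to be checked against the pylibssh documentation.

```python
def build_connect_kwargs(host, user, timeout, open_session_retries=0):
    """Collect keyword arguments for a pylibssh session connect() call.

    Hypothetical helper for illustration; 'host', 'user' and 'timeout'
    are assumed keyword names, not confirmed by the change note.
    """
    if open_session_retries < 0:
        raise ValueError("open_session_retries must be non-negative")
    return {
        "host": host,
        "user": user,
        "timeout": timeout,
        # Default 0 keeps the old behaviour: no retry on SSH_AGAIN.
        "open_session_retries": open_session_retries,
    }


kwargs = build_connect_kwargs("server.test", "root", timeout=2, open_session_retries=3)

# With a pylibssh release that includes this change, the dictionary
# could be used along these lines:
#   from pylibsshext.session import Session
#   session = Session()
#   session.connect(**kwargs)
#   channel = session.new_channel()
```

Leaving the parameter at its default means a low timeout can still surface as an exception, so callers that deliberately use short timeouts would opt in with a small positive value.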
bea82cb to 717e4ca
Thanks @justin-stephenson @Jakuje for polishing this to a great shape!
ssh_session.close()

def test_open_session_small_timeout_with_retries(ssh_session_connect_retries):
@justin-stephenson this test failed at least on one occasion post-merge: https://github.com/ansible/pylibssh/actions/runs/18431666269#summary-52519922027
Follow-up from ansible#756 which introduced this test. Signed-off-by: Jakub Jelen <jjelen@redhat.com>
ansible-pylibssh 1.3.0 adds support for the 'open_session_retries' parameter. Set 3 retry attempts to ensure ssh_channel_open_session() succeeds if libssh returns SSH_AGAIN due to a low timeout (ansible/pylibssh#756).
ansible-pylibssh 1.3.0 adds support for the 'open_session_retries' parameter. Set 10 retry attempts to ensure ssh_channel_open_session() succeeds if libssh returns SSH_AGAIN due to a low timeout (ansible/pylibssh#756).
SUMMARY
When a low SSH options timeout value is set, we sometimes see calls to new_channel() and ssh_channel_open_session fail when libssh returns SSH_AGAIN. Currently, pylibssh raises an exception:
The SSH_AGAIN return code is documented at https://api.libssh.org/master/group__libssh__channel.html#gaf051dd30d75bf6dc45d1a5088cf970bd
The documentation does not state this clearly, but SSH_AGAIN can also be returned because of a timeout.
ISSUE TYPE
ADDITIONAL INFORMATION
This issue happens in our https://github.com/next-actions/pytest-mh project.
CC @pbrezina