Conversation

@justin-stephenson
Contributor

@justin-stephenson justin-stephenson commented Jul 30, 2025

SUMMARY

When a low SSH options timeout value is set, calls to new_channel() and ssh_channel_open_session() sometimes fail because libssh returns SSH_AGAIN. Currently, pylibssh raises an exception:


../../../../pytest-mh/pytest_mh/conn/ssh.py:285: in _run
    self.__channel = self.__conn.new_channel()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
src/pylibsshext/session.pyx:514: in pylibsshext.session.Session.new_channel
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]
src/pylibsshext/channel.pyx:71: LibsshChannelException

The SSH_AGAIN return code is documented at https://api.libssh.org/master/group__libssh__channel.html#gaf051dd30d75bf6dc45d1a5088cf970bd

The documentation does not state this clearly, but SSH_AGAIN can also be returned when the timeout expires.

ssh_channel_open_session()

Returns
    SSH_OK on success, SSH_ERROR if an error occurred, SSH_AGAIN if in nonblocking mode and call has to be done again.
ISSUE TYPE
  • Bugfix Pull Request
ADDITIONAL INFORMATION

This issue happens in our https://github.com/next-actions/pytest-mh project.

CC @pbrezina

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Jul 30, 2025
Member

@webknjaz webknjaz left a comment

I wonder if this could result in an infinite loop..

Also, I don't think there's a proof that this works as intended without tests. Add them.

@KB-perByte KB-perByte self-requested a review August 6, 2025 09:42
@Jakuje
Contributor

Jakuje commented Aug 6, 2025

I wonder if this could result in an infinite loop..

Yes, this would go into an infinite loop if the server dies and does not properly disconnect. Timeouts exist to handle exactly this case.

Retrying unconditionally and infinitely is OK for tests, but for real-world applications pylibssh should at the very least do some check with ssh_is_connected() or something similar.

Or set some limit on how many times you can retry. But in that case, why not raise the timeout itself?

@justin-stephenson
Contributor Author

Thank you for your comments and review.

I wonder if this could result in an infinite loop..

Also, I don't think there's a proof that this works as intended without tests. Add them.

I tried to add a test for this but unfortunately I couldn't reproduce the scenario we see in the test environment. I'll try again.

I wonder if this could result in an infinite loop..

Yes, this would go into an infinite loop if the server dies and does not properly disconnect. Timeouts exist to handle exactly this case.

Retrying unconditionally and infinitely is OK for tests, but for real-world applications pylibssh should at the very least do some check with ssh_is_connected() or something similar.

Or set some limit on how many times you can retry. But in that case, why not raise the timeout itself?

I can change this PR to add a call to ssh_is_connected() to avoid an infinite loop, or I can raise a different exception when SSH_AGAIN is returned (like LibsshChannelAgain) so that we can handle this exception in our calls to pylibssh methods.

Whichever you prefer is acceptable for us; just let me know and I'll make those changes.
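
For illustration, the two alternatives could be combined on the caller side roughly as follows. This is only a sketch: LibsshChannelAgain is the hypothetical exception named above (it does not exist in pylibssh), and new_channel and is_connected are stand-in callables supplied by the caller rather than real pylibssh methods.

```python
import time


class LibsshChannelAgain(Exception):
    """Hypothetical exception: libssh asked the caller to try again."""


def open_channel_with_retry(new_channel, is_connected, attempts=5, delay=0.0):
    """Call new_channel() until it succeeds or the attempt budget is spent.

    new_channel and is_connected are caller-supplied callables standing in
    for session methods; neither name is taken from pylibssh.
    """
    last_error = None
    for _ in range(attempts):
        # Give up immediately if the connection is gone, so a dead
        # server cannot keep the loop spinning.
        if not is_connected():
            raise ConnectionError("session is no longer connected")
        try:
            return new_channel()
        except LibsshChannelAgain as exc:
            last_error = exc
            time.sleep(delay)
    # The attempt limit bounds the loop even if the peer never answers.
    raise TimeoutError(f"channel not opened after {attempts} attempts") from last_error
```

The attempt limit and the connectivity check are independent safeguards; either one alone is enough to prevent an unbounded loop.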

@Jakuje
Contributor

Jakuje commented Aug 7, 2025

I am actually wondering how you are getting SSH_AGAIN in these two places with pylibssh. Sessions in libssh are blocking by default. The only way to change a session to non-blocking mode is to use ssh_set_blocking() or some variation of ssh_channel_read_nonblocking(), but I see your changes completely elsewhere, so this should not come into effect, and I do not see these functions exposed in pylibssh either.

But there might be an oddity where setting a low timeout actually returns SSH_AGAIN in places where, according to the documentation, it should not, which would be a bug in libssh that needs to be fixed.

What initially brought you to set smaller timeouts? Is raising the timeouts a viable workaround?

@justin-stephenson
Contributor Author

I am actually wondering how you are getting SSH_AGAIN in these two places with pylibssh. Sessions in libssh are blocking by default. The only way to change a session to non-blocking mode is to use ssh_set_blocking() or some variation of ssh_channel_read_nonblocking(), but I see your changes completely elsewhere, so this should not come into effect, and I do not see these functions exposed in pylibssh either.

The error we currently see in our PRCI is specific to ssh_channel_open_session failure:

FAILED tests/test_authentication.py::test_authentication__user_login_with_overriding_home_directory[domain] (ldap) - pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]

I added the session: commit just as a nice-to-have because ssh_userauth_password() can return SSH_AGAIN per the libssh API docs, but the channel.pyx commit addresses the main issue we are currently hitting.

But there might be an oddity where setting a low timeout actually returns SSH_AGAIN in places where, according to the documentation, it should not, which would be a bug in libssh that needs to be fixed.

What initially brought you to set smaller timeouts? Is raising the timeouts a viable workaround?

In our code we set .set_ssh_options("timeout", 1) because pytest-mh allows users to execute commands over SSH on hosts with an arbitrary timeout value set, such as:

client.host.conn.run(..., timeout=X)

If I understand correctly, setting this low set_ssh_options("timeout") value is necessary for the above to work as expected, because Python will not deliver a signal while the code is blocked in a C library. The signal is delivered only after control returns to the Python code. @pbrezina can correct me here.

-- for reference https://github.com/next-actions/pytest-mh/blob/master/pytest_mh/conn/ssh.py
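
To make the reasoning concrete, here is a minimal sketch of the general pattern: a long wait is cut into short blocking slices so that control keeps returning to Python, where a deadline (or a pending signal) can be acted on. The names run_with_deadline and poll are illustrative only and are not taken from pytest-mh or pylibssh.

```python
import time


def run_with_deadline(poll, slice_timeout, total_timeout, clock=time.monotonic):
    """Call poll(slice_timeout) until it yields a result or the deadline passes.

    poll stands in for a blocking C-level call that gives up after
    slice_timeout and returns None when nothing happened in that slice.
    """
    deadline = clock() + total_timeout
    while True:
        result = poll(slice_timeout)
        if result is not None:
            return result
        # Back in Python between slices: this is where the overall
        # timeout is enforced and where signal handlers get to run.
        if clock() >= deadline:
            raise TimeoutError(f"no result within {total_timeout} seconds")
```

A small slice_timeout makes the overall deadline more precise but means more round trips through the blocking call, which is also why a slice can expire before the operation has finished.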

justin-stephenson added a commit to justin-stephenson/sssd that referenced this pull request Aug 11, 2025
Workaround ansible pylibssh issue which causes test failures

   pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]

PR ansible/pylibssh#756 is under review
but workaround it in the meantime.
justin-stephenson added a commit to justin-stephenson/sssd that referenced this pull request Aug 12, 2025
Workaround ansible pylibssh issue which causes test failures

   pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]

PR ansible/pylibssh#756 is under review
but workaround it in the meantime.
@Jakuje
Contributor

Jakuje commented Aug 14, 2025

OK, the libssh timeout is the time you are giving libssh before it must return to you. But if you are setting a low timeout to get signals delivered, then either pylibssh or the caller needs to retry. The pylibssh code is really not written to support retries around here, so my proposal would be to create some pylibssh timeout/retry counter to avoid an infinite cycle when things go wrong. What do you think?

It can either be a separate pylibssh option, or it can be somehow intercepted when we set the libssh timeout, setting it to some multiple of the user-specified value to return the handling to the Python code. Or the second option by default with a possible override.

And obviously we need some tests with this option, otherwise it's untested, broken code. I bet we can get some slow CI runners where this would show up from time to time.

@justin-stephenson
Contributor Author

OK, the libssh timeout is the time you are giving libssh before it must return to you. But if you are setting a low timeout to get signals delivered, then either pylibssh or the caller needs to retry. The pylibssh code is really not written to support retries around here, so my proposal would be to create some pylibssh timeout/retry counter to avoid an infinite cycle when things go wrong. What do you think?

It can either be a separate pylibssh option, or it can be somehow intercepted when we set the libssh timeout, setting it to some multiple of the user-specified value to return the handling to the Python code. Or the second option by default with a possible override.

I went ahead and updated the PR with your suggestion by adding a new option, open_session_retries, which can be provided to the connect() method in src/pylibsshext/session.pyx. To make it less invasive, I set the default value to 0, so the default libssh behavior will not change. Please take a look.
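
As a rough model of the behavior described (not the actual Cython code in the PR), a bounded retry on SSH_AGAIN might look like the sketch below. The function open_session and the open_once callable are stand-ins for illustration; the numeric return codes are libssh's, with SSH_AGAIN being the -2 seen in the traceback.

```python
# libssh return codes.
SSH_OK = 0
SSH_ERROR = -1
SSH_AGAIN = -2


def open_session(open_once, open_session_retries=0):
    """Call open_once() and retry on SSH_AGAIN a limited number of times.

    open_once stands in for a single ssh_channel_open_session() call and
    returns one of the codes above. The final return code is returned.
    """
    rc = open_once()
    retries_left = open_session_retries
    # Only SSH_AGAIN is retried; SSH_ERROR is reported right away.
    while rc == SSH_AGAIN and retries_left > 0:
        retries_left -= 1
        rc = open_once()
    return rc
```

With the default of 0 the loop body never runs, so the first SSH_AGAIN is surfaced to the caller exactly as before.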

And obviously we need some tests with this option, otherwise it's untested, broken code. I bet we can get some slow CI runners where this would show up from time to time.

I added some test scaffolding that is not done yet. In the test environment I always see ssh_channel_open_session() return SSH_OK instead of SSH_AGAIN, even with a low timeout, so I don't see how to test the retries properly there.

@Jakuje
Contributor

Jakuje commented Aug 20, 2025

I added some test scaffolding that is not done yet. In the test environment I always see ssh_channel_open_session() return SSH_OK instead of SSH_AGAIN, even with a low timeout, so I don't see how to test the retries properly there.

libssh has an option, SSH_OPTIONS_TIMEOUT_USEC, to set subsecond timeouts, so if this could help reproduce the issue (or spend less time in C code), we can either expose this option or do some mangling to allow float input on the Python side and then convert it to SSH_OPTIONS_TIMEOUT or SSH_OPTIONS_TIMEOUT_USEC. I am not sure which would be nicer/easier from the API point of view, but I think the second approach sounds mostly transparent for users.

https://api.libssh.org/master/group__libssh__session.html#ga7a801b85800baa3f4e16f5b47db0a73d
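
A possible shape for the float conversion on the Python side is sketched below. The helper split_timeout is hypothetical, and the sketch assumes the two options are meant to be set together as whole seconds plus a microsecond remainder.

```python
def split_timeout(timeout):
    """Split a timeout in seconds (int or float) into (seconds, microseconds).

    The first element would go to SSH_OPTIONS_TIMEOUT and the second to
    SSH_OPTIONS_TIMEOUT_USEC (assumed usage, see the note above).
    """
    if timeout < 0:
        raise ValueError("timeout must be non-negative")
    # Round to whole microseconds first to avoid float artifacts such
    # as 0.1 * 1_000_000 == 100000.00000000001.
    total_usec = round(timeout * 1_000_000)
    seconds, usec = divmod(total_usec, 1_000_000)
    return seconds, usec
```

Integer input keeps working unchanged, since a whole number of seconds simply yields a zero microsecond part.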

@webknjaz
Member

I clicked "rebase" so this PR pulls in the CI fixes.

@webknjaz webknjaz removed the bot:chronographer:provided There is a change note present in this PR label Aug 21, 2025
@webknjaz
Member

The change note was lost in your last force-push.

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Aug 21, 2025
@webknjaz
Member

@Jakuje
Contributor

Jakuje commented Sep 29, 2025

@Jakuje any idea what might be causing these timeouts? https://github.com/ansible/pylibssh/actions/runs/18018035693/job/51268709820?pr=756#step:17:101

The timeouts are in the ssh_clientkey_path(), which generates 8k RSA test keys. Does it fail reproducibly or just randomly? Can you try with #768, which is using ecdsa keys for tests, which are more lightweight?

@webknjaz
Member

webknjaz commented Oct 1, 2025

@Jakuje any idea what might be causing these timeouts? https://github.com/ansible/pylibssh/actions/runs/18018035693/job/51268709820?pr=756#step:17:101

The timeouts are in the ssh_clientkey_path(), which generates 8k RSA test keys. Does it fail reproducibly or just randomly? Can you try with #768, which is using ecdsa keys for tests, which are more lightweight?

Thanks, merged that one!

@webknjaz
Member

webknjaz commented Oct 1, 2025

I clicked rebase to see if that other PR helped with this.

Improve pylibssh handling when libssh ssh_channel_open_session()
returns SSH_AGAIN. Add a new 'open_session_retries' session connect()
parameter to allow a configurable number of retries. SSH_AGAIN may be
returned when setting a low SSH options timeout value.

The default option value is 0, no retries will be attempted.
Member

@webknjaz webknjaz left a comment

Thanks @justin-stephenson @Jakuje for polishing this to a great shape!

@webknjaz webknjaz changed the title Retry on libssh SSH_AGAIN return code Retry on libssh SSH_AGAIN return code Oct 10, 2025
@webknjaz webknjaz merged commit 3654bb8 into ansible:devel Oct 10, 2025
62 checks passed
ssh_session.close()


def test_open_session_small_timeout_with_retries(ssh_session_connect_retries):
Jakuje added a commit to Jakuje/pylibssh that referenced this pull request Oct 13, 2025
Follow-up from ansible#756 which introduced this test.

Signed-off-by: Jakub Jelen <jjelen@redhat.com>
justin-stephenson added a commit to justin-stephenson/pytest-mh that referenced this pull request Oct 13, 2025
ansible-pylibssh 1.3.0 adds support for the 'open_session_retries'
parameter. Set 3 retry attempts to ensure ssh_channel_open_session()
succeeds if libssh returns SSH_AGAIN due to low
timeout (ansible/pylibssh#756).
justin-stephenson added a commit to justin-stephenson/pytest-mh that referenced this pull request Oct 14, 2025
ansible-pylibssh 1.3.0 adds support for the 'open_session_retries'
parameter. Set 10 retry attempts to ensure ssh_channel_open_session()
succeeds if libssh returns SSH_AGAIN due to low
timeout (ansible/pylibssh#756).
pbrezina pushed a commit to next-actions/pytest-mh that referenced this pull request Oct 16, 2025
ansible-pylibssh 1.3.0 adds support for the 'open_session_retries'
parameter. Set 10 retry attempts to ensure ssh_channel_open_session()
succeeds if libssh returns SSH_AGAIN due to low
timeout (ansible/pylibssh#756).