Contributor

@orlitzky orlitzky commented Aug 17, 2025

Replace cysignals' sig_on() and sig_off() with custom gap_sig_on() and gap_sig_off() wrappers that use GAP's own SIGINT handling. This fixes an uncommon, but ultimately reproducible segfault when Ctrl-C is used to interrupt a GAP computation.

In concert with #40594 this has allowed sage/libs/gap/element.pyx to pass many thousands of test iterations.

Fixes

Dependencies

user202729 and others added 10 commits August 16, 2025 22:19
We plan to install InterruptExecStat (gap/stats.h) as the SIGINT and
SIGALRM handler while GAP code is running, since that seems to work
better than mixing cysignals with GAP_Enter/GAP_Leave does.

Add two new pre/post-GAP functions to enable/disable GAP's own SIGINT
(Ctrl-C) handler. These avoid the setjmp/longjmp issues we've
encountered with the sig_on() and sig_off() from cysignals.

Replace the sig_on() and sig_off() wrappers from cysignals with the
new custom gap_sig_on() and gap_sig_off(). We've lost the sig_on() and
sig_off() around GAP_initialize() because I don't think we can trust
GAP's own signal handler until after we've initialized GAP.

Replace the sig_on() and sig_off() wrappers from cysignals with the
new custom gap_sig_on() and gap_sig_off(). This fixes the segfault
that occurs after repeated testing of _pow_.

We are no longer using cysignals for this; instead we have our own
gap_sig_on() and gap_sig_off() functions.

The way we handle signals in libgap is now unusual, so a few
paragraphs have been added to the libgap module docstring explaining
how and why it is unusual.

We've found that GAP's own SIGINT handler is less crashy than if we
mix GAP_Enter/GAP_Leave with cysignals' sig_on and sig_off. We've
installed that same handler for SIGALRM, but the code that catches it
can't tell the difference and converts them both into
KeyboardInterrupt. To retain the ability to doctest this stuff, we have
to catch those KeyboardInterrupts and poke at them to see if they arose
from GAP.

This function used cysignals for sig_error(), and we have opted not to
mix cysignals code with libgap code. It has in any case been replaced
by plain GAP_Enter calls.

    # Ctrl-C
    raise KeyboardInterrupt from e
else:
    raise
Contributor

this is some really bad duplication of code. I suggest

  • fold GAP_Enter and GAP_Leave into gap_sig_on and gap_sig_off respectively
  • instead of setting the interrupt handler directly to InterruptExecStat, write a custom wrapper function (or hook into cysignals) that sets a global variable recording which kind of signal was raised, then calls InterruptExecStat
  • reset that global variable at the appropriate places
  • modify error_handler so that, instead of indiscriminately raising GAPError, it raises the correct exception by reading that global variable
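
The list above could be sketched roughly as follows, in plain Python rather than the C/Cython that libgap actually uses. Everything here is an assumption for illustration: `GAPError` and `AlarmInterrupt` are local stand-ins, and `last_signum`, `record_signal` and `exception_for` are hypothetical names, not part of Sage, GAP or cysignals. The sketch does nothing about a second signal arriving and overwriting the recorded value.

```python
import signal


class GAPError(ValueError):
    """Local stand-in for libgap's GAPError (hypothetical here)."""


class AlarmInterrupt(KeyboardInterrupt):
    """Local stand-in for cysignals' AlarmInterrupt (hypothetical here)."""


# 0 means "no signal recorded since the last reset".
last_signum = 0


def record_signal(signum, frame):
    """Signal handler: remember which signal arrived.

    A real wrapper would go on to call GAP's InterruptExecStat here.
    """
    global last_signum
    last_signum = signum


def exception_for(message):
    """Choose the exception type from the recorded signal, then reset it."""
    global last_signum
    signum, last_signum = last_signum, 0
    if signum == signal.SIGINT:
        return KeyboardInterrupt(message)
    if signum == signal.SIGALRM:
        return AlarmInterrupt(message)
    # No signal was recorded, so this is an ordinary GAP error.
    return GAPError(message)
```

In this sketch the handler would be installed with `signal.signal(signal.SIGINT, record_signal)` (and likewise for `signal.SIGALRM`), and the error handler would do `raise exception_for(msg)`.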

Contributor Author

Yeah it's pretty bad but I was trying to keep it as simple as possible for the first iteration. First get it to not crash, then make it pretty.

I don't think we can use a global for the signal type because signals are asynchronous. There's no reason why a SIGINT can't be triggered while handling a SIGALRM, or vice-versa. It's not very likely, but if there's a lesson here it's that we shouldn't count on unlikely things not happening.

We may be able to combine GAP_Enter and gap_sig_on(), though.

Contributor

> I don't think we can use a global for the signal type because signals are asynchronous. There's no reason why a SIGINT can't be triggered while handling a SIGALRM, or vice-versa. It's not very likely, but if there's a lesson here it's that we shouldn't count on unlikely things not happening.

you're right but… I think currently the signal handler is calling PyErr_Restore() or something anyway, and that one isn't reentrant either.

Searching up a bit https://stackoverflow.com/a/3127697/, that's indeed a possibility. But doesn't the same thing happen for cysignals as well?

Contributor Author

> We may be able to combine GAP_Enter and gap_sig_on(), though.

Even this immediately leads to a segfault.

    e.__traceback__ = None
    alarm_raised = True
else:
    raise
Contributor

@user202729 user202729 Aug 17, 2025

if the logic below (raise AlarmInterrupt instead of KeyboardInterrupt) is correctly implemented, this change wouldn't be necessary.

Contributor Author

I've further simplified this down to a one-line change, using the fact that AlarmInterrupt is a subclass of KeyboardInterrupt. Instead of catching AlarmInterrupt, we can catch KeyboardInterrupt, and then check,

  • is it an AlarmInterrupt?
  • does it have "user interrupt" as the message?
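
A minimal sketch of that two-part check, assuming a local stand-in for cysignals' `AlarmInterrupt` and two hypothetical helper names (`looks_like_alarm`, `run_and_detect_alarm`); the real doctest code may be shaped differently:

```python
class AlarmInterrupt(KeyboardInterrupt):
    """Local stand-in for cysignals' AlarmInterrupt (hypothetical here)."""


def looks_like_alarm(exc):
    """Return True if a caught KeyboardInterrupt should count as an alarm."""
    if isinstance(exc, AlarmInterrupt):
        return True
    # Interrupts that came through GAP carry the "user interrupt" message.
    return "user interrupt" in str(exc)


def run_and_detect_alarm(func):
    """Run func(); return True if it was stopped by an alarm, else False.

    Any other KeyboardInterrupt (e.g. a real Ctrl-C) is re-raised.
    """
    try:
        func()
    except KeyboardInterrupt as e:
        if looks_like_alarm(e):
            return True
        raise
    return False
```

Because `AlarmInterrupt` is a subclass of `KeyboardInterrupt`, the single `except KeyboardInterrupt` clause covers both cases.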

I prefer this to hacking AlarmInterrupt into the GAP code at this point because sage.libs.gap is otherwise free of cysignals. To raise the AlarmInterrupt from sage.libs.gap we'd have to,

  1. Add cysignals back as a dep in meson.build
  2. Import AlarmInterrupt
  3. Add the global last_signal variable
  4. Set last_signal whenever a handler is called
  5. Check last_signal when raising a GAPError and convert it to the right thing

Since this is all for the benefit of one doctest method, it just seems easier to add the one line in that doctest method?

@user202729
Contributor

user202729 commented Aug 17, 2025

I find the current implementation (before this pull request) questionable also; one shouldn't be calling any Python code between GAP_Enter…GAP_Leave, and if there's no Python code, try…finally isn't necessary.

If I understood correctly, the way the current exception handler works is that it just calls PyErr_Restore and hopes that Python checks the exception quickly enough.

One implementation idea:

whenever you need to call GAP API, you only need to

gap_sig_on()
gap_check_exceptions()
<GAP code here, must not contain any Python code, therefore cannot raise any exception>
gap_sig_off()

where you do something like

int gap_enter_ok; // global variable internal to that file
#define gap_sig_on() gap_set_signal_handlers(); gap_enter_ok = GAP_Enter()
#define gap_check_exceptions() if (!gap_enter_ok) { GAP_Leave(); raise_appropriate_exception_type() }
#define gap_sig_off() GAP_Leave(); gap_reset_signal_handler()


github-actions bot commented Aug 17, 2025

Documentation preview for this PR (built with commit c1adeb9; changes) is ready! 🎉
This preview will update shortly after each push to this PR.

@orlitzky
Contributor Author

> I find the current implementation (before this pull request) questionable also; one shouldn't be calling any Python code between GAP_Enter…GAP_Leave, and if there's no Python code, try…finally isn't necessary.
>
> If I understood correctly, the way the current exception handler works is that it just calls PyErr_Restore and hopes that Python checks the exception quickly enough.

I think the error handler returns control to GAP, though, which eventually jumps back to GAP_Enter. If I move the GAP_Enter outside of try/except, we won't catch the GAPError that is raised. I left comments on the GAP_Enter lines because this surprised me at first.

@user202729
Contributor

> I think the error handler returns control to GAP, though, which eventually jumps back to GAP_Enter. If I move the GAP_Enter outside of try/except, we won't catch the GAPError that is raised. I left comments on the GAP_Enter lines because this surprised me at first.

In my understanding, the way it works is the following. GAP_Enter is just a C macro, defined by

cdef extern from "gap/libgap-api.h" nogil:
    cdef int GAP_Enter() except 0

that by itself cannot raise an exception. But PyErr_Restore() can, and the exception is raised the next time Python checks for pending exceptions.

But the problem I see is that this looks dangerous, since to me there doesn't seem to be any guarantee it will be checked right after GAP_Enter call.

Anyway, need to double check.

@orlitzky
Contributor Author

> int gap_enter_ok; // global variable internal to that file
> #define gap_sig_on() gap_set_signal_handlers(); gap_enter_ok = GAP_Enter()
> #define gap_check_exceptions() if (!gap_enter_ok) { GAP_Leave(); raise_appropriate_exception_type() }
> #define gap_sig_off() GAP_Leave(); gap_reset_signal_handler()

I'm trying a simpler version of this to start, with just gap_set_signal_handlers(), gap_sig_on(), and gap_sig_off() in an .h file, and it's already crashing as soon as I send Ctrl-C or an alarm. I'm giving up for the night. I'm getting nowhere.

@user202729
Contributor

user202729 commented Aug 18, 2025

I guess if you don't understand the segmentation fault very well it would be unfair. Not sure if the rare segmentation fault is worth the code duplication though.

Anyway what is your current implementation?

(Note that the GAP_Enter cannot be nested/hidden in anything, which is why I'm forced to separate the gap_check_exceptions from gap_sig_on and make them macros instead of inline functions.)

By the way, for quick hacking, you can use inline raw C embedded in pxd/pyx file with extern from *, instead of separate .h file.

If we're going to raise a GAPError that resulted from a SIGINT or
SIGALRM, we might as well turn it into a KeyboardInterrupt right
away.

We are no longer retaining the __cause__ of KeyboardInterrupts
that arise from GAP, but the error message is still there, so
we can use that to filter these out.

We are now converting "GAPError: Error, user interrupt" into a
KeyboardInterrupt in our libgap error handler, so we don't have
to do it at every call site.

@user202729
Contributor

user202729 commented Aug 22, 2025

> I'm not sure why this function is using sig_block() and sig_unblock() in the first place -- you're probably not supposed to run them in a loop.

to avoid interrupting in the middle of the user callback function which is likely to corrupt the Python state.

> I've seen it take 3s

okay, that's surprising. But then maybe that can happen if x is really close to a multiple of π, or something like the table maker's dilemma, or because trigonometric functions are just slow.

@user202729
Contributor

user202729 commented Aug 23, 2025

minor documentation formatting issue in rendered HTML https://doc-pr-40613--sagemath.netlify.app/html/en/reference/libs/sage/libs/gap/libgap#using-the-gap-c-library-from-cython (the indentation is interpreted as quotation, and some code blocks aren't interpreted as such because of the lack of ::), but no big deal. You can re-set positive review afterwards.

I may put in some other improvements to remove the string check some day (just store the signum received by the signal handler to some global variable, then read it from the error handler, instead of detecting the user interrupt string).

(and maybe expose some way to customize the class used for alarm interrupt, in order to raise AlarmInterrupt, for consistency with the rest of the code base.)

@user202729
Contributor

actually I think you want to rebase it on top of #40594 (instead of cherry-picking the first 2 commits as in the current situation) to avoid polluting the commit history.

These never should have been indented in the first place; the whole
block is being rendered as a quote.
@orlitzky
Contributor Author

> actually I think you want to rebase it on top of #40594 (instead of cherry-picking the first 2 commits as in the current situation) to avoid polluting the commit history.

git merge and git rebase will skip the two redundant commits so long as your branch is merged first. I added your PR as a dependency of this one in the description to ensure that the order is correct.

@orlitzky
Contributor Author

Docs should be fixed now.

@user202729
Contributor

> I added your PR as a dependency of this one in the description to ensure that the order is correct.

huh, does @vbraun actually read these? I thought they're for the benefit of reviewers.

@orlitzky
Contributor Author

> huh, does @vbraun actually read these? I thought they're for the benefit of reviewers.

I hope so, we don't have any other way to express dependencies between PRs and this is how we imported them from trac.

@vbraun vbraun merged commit b2392aa into sagemath:develop Aug 27, 2025
26 of 29 checks passed
@user202729
Contributor

bad news, segmentation fault again here?

https://github.com/sagemath/sage/actions/runs/17294515300

@orlitzky
Contributor Author

> bad news, segmentation fault again here?
>
> https://github.com/sagemath/sage/actions/runs/17294515300

I'm not sure, I can't reproduce that one.

@user202729
Contributor

I guess the cherry-picked commits get duplicated anyway (if you git log you'll see both e0b0c06 and 35e479a being in the commit history), but nothing too bad is caused by it.

@orlitzky
Contributor Author

> I guess the cherry-picked commits get duplicated anyway (if you git log you'll see both e0b0c06 and 35e479a being in the commit history), but nothing too bad is caused by it.

Ok, you were right. I tested and the history was clean even with a merge commit, so I don't know what's happening in the sage repo but it would be much better if the commit history were sane. As it stands git log -p will show you changes that could not possibly have happened, even with --no-merges or --first-parent.

@orlitzky
Contributor Author

orlitzky commented Sep 3, 2025

> bad news, segmentation fault again here?
>
> https://github.com/sagemath/sage/actions/runs/17294515300

I guess these are not new: #37295

vbraun pushed a commit to vbraun/sage that referenced this pull request Sep 7, 2025
sagemathgh-40727: Explicitly check signum in GAP error handler
    
Follow-up to sagemath#40613.

Instead of checking for the string `user interrupt` (which might change
between GAP versions, or if there's some unforeseen way the string might
be sneaked in), we store the signum from the signal handler then check
it in the GAP error handler.

Also optionally use `AlarmInterrupt` instead of `KeyboardInterrupt` if
cysignals is available.

URL: sagemath#40727
Reported by: user202729
Reviewer(s): Michael Orlitzky, user202729
vbraun pushed a commit to vbraun/sage that referenced this pull request Sep 11, 2025