Ambiguous failure on OOM when running locally #121

@sclaw

When running container jobs locally, the process is silently killed with exit code 137 if the container is not allocated enough resources. Exit code 137 is 128 + 9, meaning the process received SIGKILL, which is consistent with an out-of-memory condition: the application tries to use more memory than is allocated to the container.

This was observed with a LISFLOOD-FP run. The process starts successfully but is then killed with exit code 137:

[INFO] Running LISFLOOD-FP with parfile: /data/30831/runs/Q100/par.par
***************************
 LISFLOOD-FP version 8.1.0 (double)
CUDA supported
***************************
./entrypoint.sh: line 27:    19 Killed                  /usr/local/bin/lisflood "$parfilepath"
[ERROR] LISFLOOD-FP failed with exit code 137

It would be good if SEPEX reported that the job is being killed, so that the source of the job failure is clear.
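
As a stopgap on the container side, the entrypoint script could translate the exit code into a more descriptive message. The following is only a minimal sketch: `explain_exit_code` is a hypothetical helper, not part of SEPEX or LISFLOOD-FP, and the stand-in command merely simulates a process that receives SIGKILL. An exit code of 137 only shows that the process was killed with SIGKILL; an out-of-memory kill is the likely cause, not a certain one.

```bash
#!/usr/bin/env bash

# Hypothetical helper: turn a child's exit code into a readable hint.
explain_exit_code() {
    local code=$1
    if [ "$code" -eq 0 ]; then
        echo "succeeded"
    elif [ "$code" -eq 137 ]; then
        # 137 = 128 + 9, i.e. the process was terminated by SIGKILL.
        echo "was killed by SIGKILL (exit code 137); possibly out of memory, check the container memory limit"
    elif [ "$code" -gt 128 ]; then
        # By shell convention, codes above 128 usually mean 128 + signal number.
        echo "was likely terminated by signal $((code - 128)) (exit code $code)"
    else
        echo "failed with exit code $code"
    fi
}

# Stand-in for the real job: a child shell that sends itself SIGKILL.
status=0
bash -c 'kill -9 $$' || status=$?
echo "[ERROR] job $(explain_exit_code "$status")"
```

If the job runs under Docker, `docker inspect --format '{{.State.OOMKilled}}' <container>` reports whether the kernel OOM killer terminated the container, which would be a more reliable signal for SEPEX to surface than the exit code alone.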
