Replies: 1 comment
Thanks for opening the discussion and bringing this issue to everyone’s attention. To avoid confusion, I just want to add that this is only relevant when your simulation data is written to the scratch (local file system) of a compute node. At IMCS, we don’t do this and instead always write to the shared file system. While this has other downsides (performance can suffer), it spares us from having to ensure cleanup on the nodes.
I am currently struggling with queens’ simulation scheduling: after a queens run, a lot of simulation data was left behind on the compute nodes. In this particular case, the problem was so extreme that the node was almost full. While deleting the data, I also found many folders from other queens users. Since this causes problems for all cluster users, do you have any suggestions on what is going wrong and how to avoid it?
As @maxdinkel explained: if the workers are killed before a 4C simulation job has finished, the simulation outputs in the scratch folder are not deleted. Perhaps excessive killing of queens runs causes the issue. This explains some of the problems, but I also found folders that were very likely not caused by killed queens runs.
Possible solution
Deleting the data from the compute nodes might also be failing for other reasons (e.g. files still held open by a worker). We might retry the deletion several times before giving up, e.g.
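A minimal sketch of such a retry loop (the function name `remove_with_retries` and its parameters are hypothetical, not existing queens API):

```python
import shutil
import time


def remove_with_retries(path, max_attempts=3, delay_s=1.0):
    """Try to delete the directory *path* several times before giving up.

    Deletion on a compute node's scratch can fail transiently, e.g. when
    files are still held open by a dying worker, so a few retries with a
    short pause may succeed where a single call does not.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            shutil.rmtree(path)
            return True
        except OSError:
            if attempt == max_attempts:
                return False
            time.sleep(delay_s)
    return False
```

Logging the final failure (instead of silently returning `False`) would also make leftover folders visible to the user rather than quietly accumulating on the node.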
Thanks @maxdinkel!