Hi there,

I set up a SLURMCluster following the example below, including --lifetime and --lifetime-stagger to ensure jobs are closed gracefully by the Dask scheduler.

import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

GPU_CONFIG = {
    'queue': 'batch_gpu',
    'cores': 8,
    'memory': '8GB',
    'job_extra_directives': [
        '--gres=gpu:1',
    ],
    'walltime': '00:05:00',
    'worker_extra_args': ["--lifetime", "10s", "--lifetime-stagger", "10s"],
}

cluster = SLURMCluster(**GPU_CONFIG)
client = Client(cluster)
cluster.adapt(minimum_jobs=1, maximum_jobs=5)

while True:
    print(client)
    time.sleep(5)
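For context on the timing I expected: as I understand it (my assumption, not verified against the distributed source), --lifetime-stagger jitters each worker's retirement time by a random offset so that workers don't all retire at the same instant. Roughly:

```python
import random

def staggered_lifetime(base_s: float, stagger_s: float) -> float:
    # Sketch of my mental model: each worker draws an offset uniformly
    # from [-stagger, +stagger] around the base lifetime, so workers
    # retire at slightly different times instead of all at once.
    return base_s + random.uniform(-stagger_s, stagger_s)

# With --lifetime 10s and --lifetime-stagger 10s, workers would retire
# anywhere between ~0s and ~20s after starting, which matches the
# spread of close times in the worker log further down.
```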
My expectation was that this would submit new jobs as the original ones were closed, but I am not seeing this behavior. I am fairly sure that when I used this exact setup a few years ago it worked correctly.
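To be concrete about that expectation: the adaptive scaler should keep the live job count clamped to [minimum_jobs, maximum_jobs] and submit replacements as workers retire. A toy pure-Python statement of the invariant I had in mind (my mental model, not dask's actual implementation):

```python
def jobs_to_submit(alive: int, minimum: int, maximum: int, demand: int) -> int:
    # Clamp the demanded job count to [minimum, maximum], then submit
    # however many jobs are needed to reach it from the current count.
    desired = max(minimum, min(maximum, demand))
    return max(0, desired - alive)

# After every worker retires (alive=0) under adapt(minimum_jobs=1,
# maximum_jobs=5), I expected at least one replacement job:
print(jobs_to_submit(alive=0, minimum=1, maximum=5, demand=0))  # 1
```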
The stdout below clearly shows the initial two jobs submitting and connecting to the cluster, then being killed and never respawning.
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=8 threads=16, memory=14.88 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=7 threads=14, memory=13.02 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=4 threads=8, memory=7.44 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=4 threads=8, memory=7.44 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
The following worker log shows the worker processes dying as they reach their lifetime and closing gracefully, as expected.
2025-06-09 17:06:59,476 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:35379'
2025-06-09 17:06:59,480 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:33079'
2025-06-09 17:06:59,481 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:41377'
2025-06-09 17:06:59,483 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:35255'
2025-06-09 17:06:59,900 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-p715og9m', purging
2025-06-09 17:06:59,901 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-9_cdptn2', purging
2025-06-09 17:07:00,230 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:40503
2025-06-09 17:07:00,230 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:40503
2025-06-09 17:07:00,230 - distributed.worker - INFO - Worker name: SLURMCluster-0-0
2025-06-09 17:07:00,230 - distributed.worker - INFO - dashboard at: 10.164.25.82:44353
2025-06-09 17:07:00,230 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,231 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,231 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,231 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,231 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-guwx53y4
2025-06-09 17:07:00,231 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,240 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:45191
2025-06-09 17:07:00,240 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:45191
2025-06-09 17:07:00,240 - distributed.worker - INFO - Worker name: SLURMCluster-0-3
2025-06-09 17:07:00,240 - distributed.worker - INFO - dashboard at: 10.164.25.82:46333
2025-06-09 17:07:00,240 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,240 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,240 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,240 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,240 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-0j9vxv6a
2025-06-09 17:07:00,240 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,242 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:42045
2025-06-09 17:07:00,242 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:42045
2025-06-09 17:07:00,242 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:36879
2025-06-09 17:07:00,242 - distributed.worker - INFO - Worker name: SLURMCluster-0-1
2025-06-09 17:07:00,242 - distributed.worker - INFO - dashboard at: 10.164.25.82:43743
2025-06-09 17:07:00,242 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:36879
2025-06-09 17:07:00,242 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,242 - distributed.worker - INFO - Worker name: SLURMCluster-0-2
2025-06-09 17:07:00,242 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,242 - distributed.worker - INFO - dashboard at: 10.164.25.82:39849
2025-06-09 17:07:00,242 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,242 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,242 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,243 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,243 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-n29701zb
2025-06-09 17:07:00,243 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,243 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,243 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,243 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-cxvq1qyq
2025-06-09 17:07:00,243 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,245 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,246 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,246 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,246 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,253 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,253 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,253 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,254 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,255 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,256 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,256 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,256 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,256 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:02,099 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:02,100 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:02,102 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:33079'. Reason: worker-lifetime-reached
2025-06-09 17:07:02,102 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:02,103 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:02,103 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:04,104 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-06-09 17:07:04,202 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:33079'. Reason: nanny-close-gracefully
2025-06-09 17:07:04,203 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:33079' closed.
2025-06-09 17:07:04,590 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:36879. Reason: worker-lifetime-reached
2025-06-09 17:07:04,592 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:36879. Reason: worker-lifetime-reached
2025-06-09 17:07:04,593 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:35255'. Reason: worker-lifetime-reached
2025-06-09 17:07:04,593 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:04,594 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:04,594 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:06,693 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:35255'. Reason: nanny-close-gracefully
2025-06-09 17:07:06,693 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:35255' closed.
2025-06-09 17:07:15,959 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:42045. Reason: worker-lifetime-reached
2025-06-09 17:07:15,961 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:42045. Reason: worker-lifetime-reached
2025-06-09 17:07:15,962 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:35379'. Reason: worker-lifetime-reached
2025-06-09 17:07:15,962 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:15,963 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:15,964 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:16,803 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
2025-06-09 17:07:16,805 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
2025-06-09 17:07:16,806 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:41377'. Reason: worker-lifetime-reached
2025-06-09 17:07:16,806 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:16,807 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:16,808 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:17,965 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-06-09 17:07:18,063 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:35379'. Reason: nanny-close-gracefully
2025-06-09 17:07:18,063 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:35379' closed.
2025-06-09 17:07:18,993 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:41377'. Reason: nanny-close-gracefully
2025-06-09 17:07:18,993 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:41377' closed.
2025-06-09 17:07:18,993 - distributed.dask_worker - INFO - End worker
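One can confirm from the log above that every closure was lifetime-driven (rather than, say, an OOM or walltime kill) with a throwaway check like the following; two representative log lines are inlined here, but in practice one would read the full SLURM output file:

```python
import re

# Two representative lines copied from the worker log above.
LOG = """\
2025-06-09 17:07:02,099 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:16,803 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
"""

# Pull out the close reason reported for each graceful worker shutdown.
reasons = re.findall(r"Closing worker gracefully: \S+ Reason: ([\w-]+)", LOG)
print(reasons)  # ['worker-lifetime-reached', 'worker-lifetime-reached']
```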
I also confirmed with squeue that no new workers appear in the SLURM job queue.

Is this a bug, or am I doing something obviously wrong here? Many thanks in advance for any help with this.