Hi there,

I set up a SLURMCluster following the example below, including --lifetime and --lifetime-stagger to ensure jobs are closed gracefully by the Dask scheduler.

import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

GPU_CONFIG = {
    'queue': 'batch_gpu',
    'cores': 8,
    'memory': '8GB',
    'job_extra_directives': [
        '--gres=gpu:1',
    ],
    'walltime': '00:05:00',
    'worker_extra_args': ["--lifetime", "10s", "--lifetime-stagger", "10s"],
}

cluster = SLURMCluster(**GPU_CONFIG)
client = Client(cluster)
cluster.adapt(minimum_jobs=1, maximum_jobs=5)

while True:
    print(client)
    time.sleep(5)
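For context on the timing I expected: as I understand it (my assumption, not verified against the distributed source), --lifetime-stagger jitters each worker's retirement time by a random offset so that workers don't all retire at the same instant. Roughly:

```python
import random

def staggered_lifetime(base_s: float, stagger_s: float) -> float:
    # Sketch of my mental model: each worker draws an offset uniformly
    # from [-stagger, +stagger] around the base lifetime, so workers
    # retire at slightly different times instead of all at once.
    return base_s + random.uniform(-stagger_s, stagger_s)

# With --lifetime 10s and --lifetime-stagger 10s, workers would retire
# anywhere between ~0s and ~20s after starting, which matches the
# spread of close times in the worker log further down.
```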
My expectation was that this would submit new jobs as the original ones were closed, but I am not seeing this behavior. I am fairly sure that when I used this exact setup a few years ago it worked correctly.
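To be concrete about that expectation: the adaptive scaler should keep the live job count clamped to [minimum_jobs, maximum_jobs] and submit replacements as workers retire. A toy pure-Python statement of the invariant I had in mind (my mental model, not dask's actual implementation):

```python
def jobs_to_submit(alive: int, minimum: int, maximum: int, demand: int) -> int:
    # Clamp the demanded job count to [minimum, maximum], then submit
    # however many jobs are needed to reach it from the current count.
    desired = max(minimum, min(maximum, demand))
    return max(0, desired - alive)

# After every worker retires (alive=0) under adapt(minimum_jobs=1,
# maximum_jobs=5), I expected at least one replacement job:
print(jobs_to_submit(alive=0, minimum=1, maximum=5, demand=0))  # 1
```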
The stdout below clearly shows the initial two jobs submitting and connecting to the cluster, then being killed and never respawning.
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=8 threads=16, memory=14.88 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=7 threads=14, memory=13.02 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=4 threads=8, memory=7.44 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=4 threads=8, memory=7.44 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
The following worker log shows the worker processes dying as they reach their lifetime and closing gracefully, as expected.
2025-06-09 17:06:59,476 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:35379'
2025-06-09 17:06:59,480 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:33079'
2025-06-09 17:06:59,481 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:41377'
2025-06-09 17:06:59,483 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:35255'
2025-06-09 17:06:59,900 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-p715og9m', purging
2025-06-09 17:06:59,901 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-9_cdptn2', purging
2025-06-09 17:07:00,230 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:40503
2025-06-09 17:07:00,230 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:40503
2025-06-09 17:07:00,230 - distributed.worker - INFO - Worker name: SLURMCluster-0-0
2025-06-09 17:07:00,230 - distributed.worker - INFO - dashboard at: 10.164.25.82:44353
2025-06-09 17:07:00,230 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,231 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,231 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,231 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,231 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-guwx53y4
2025-06-09 17:07:00,231 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,240 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:45191
2025-06-09 17:07:00,240 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:45191
2025-06-09 17:07:00,240 - distributed.worker - INFO - Worker name: SLURMCluster-0-3
2025-06-09 17:07:00,240 - distributed.worker - INFO - dashboard at: 10.164.25.82:46333
2025-06-09 17:07:00,240 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,240 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,240 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,240 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,240 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-0j9vxv6a
2025-06-09 17:07:00,240 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,242 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:42045
2025-06-09 17:07:00,242 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:42045
2025-06-09 17:07:00,242 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:36879
2025-06-09 17:07:00,242 - distributed.worker - INFO - Worker name: SLURMCluster-0-1
2025-06-09 17:07:00,242 - distributed.worker - INFO - dashboard at: 10.164.25.82:43743
2025-06-09 17:07:00,242 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:36879
2025-06-09 17:07:00,242 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,242 - distributed.worker - INFO - Worker name: SLURMCluster-0-2
2025-06-09 17:07:00,242 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,242 - distributed.worker - INFO - dashboard at: 10.164.25.82:39849
2025-06-09 17:07:00,242 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,242 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,242 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,243 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,243 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-n29701zb
2025-06-09 17:07:00,243 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,243 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,243 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,243 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-cxvq1qyq
2025-06-09 17:07:00,243 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,245 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,246 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,246 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,246 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,253 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,253 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,253 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,254 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,255 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,256 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,256 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,256 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,256 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:02,099 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:02,100 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:02,102 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:33079'. Reason: worker-lifetime-reached
2025-06-09 17:07:02,102 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:02,103 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:02,103 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:04,104 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-06-09 17:07:04,202 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:33079'. Reason: nanny-close-gracefully
2025-06-09 17:07:04,203 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:33079' closed.
2025-06-09 17:07:04,590 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:36879. Reason: worker-lifetime-reached
2025-06-09 17:07:04,592 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:36879. Reason: worker-lifetime-reached
2025-06-09 17:07:04,593 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:35255'. Reason: worker-lifetime-reached
2025-06-09 17:07:04,593 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:04,594 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:04,594 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:06,693 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:35255'. Reason: nanny-close-gracefully
2025-06-09 17:07:06,693 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:35255' closed.
2025-06-09 17:07:15,959 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:42045. Reason: worker-lifetime-reached
2025-06-09 17:07:15,961 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:42045. Reason: worker-lifetime-reached
2025-06-09 17:07:15,962 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:35379'. Reason: worker-lifetime-reached
2025-06-09 17:07:15,962 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:15,963 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:15,964 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:16,803 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
2025-06-09 17:07:16,805 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
2025-06-09 17:07:16,806 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:41377'. Reason: worker-lifetime-reached
2025-06-09 17:07:16,806 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:16,807 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:16,808 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:17,965 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-06-09 17:07:18,063 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:35379'. Reason: nanny-close-gracefully
2025-06-09 17:07:18,063 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:35379' closed.
2025-06-09 17:07:18,993 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:41377'. Reason: nanny-close-gracefully
2025-06-09 17:07:18,993 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:41377' closed.
2025-06-09 17:07:18,993 - distributed.dask_worker - INFO - End worker
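One can confirm from the log above that every closure was lifetime-driven (rather than, say, an OOM or walltime kill) with a throwaway check like the following; two representative log lines are inlined here, but in practice one would read the full SLURM output file:

```python
import re

# Two representative lines copied from the worker log above.
LOG = """\
2025-06-09 17:07:02,099 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:16,803 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
"""

# Pull out the close reason reported for each graceful worker shutdown.
reasons = re.findall(r"Closing worker gracefully: \S+ Reason: ([\w-]+)", LOG)
print(reasons)  # ['worker-lifetime-reached', 'worker-lifetime-reached']
```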
I also confirmed with squeue that no new workers appear in the SLURM job queue.

Is this a bug, or am I doing something obviously wrong here? Many thanks in advance for any help with this.