====== Submitting Many Jobs ======

You may need to run many (possibly similar) tasks on the cluster. You can submit each task to the cluster as a separate job (techniques for doing that are discussed below), or you can submit a small number of jobs (possibly even just one) and have those jobs execute the many tasks you need to run. Try not to submit thousands of separate jobs into the queue.

To submit many jobs to the cluster you can:

  * Use a script (in your favorite scripting language).
    * In Perl or PHP, use something like the **system** call to run sbatch.
    * In Python, use **call** from the **subprocess** module.
    * In a bash script, just run sbatch.
    * This method allows you to submit as many jobs as you like. If the cluster fills up the jobs will be queued and you will see them in "pending" state.
    * Be considerate: don't fill the cluster completely for a long time.
      * Restrict the number of nodes you use: the **-x** option excludes some nodes.
      * Limit the number of jobs you submit at one time.

File: submit_burden_pc.php
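
As a rough sketch of the scripted approach (this is not the contents of submit_burden_pc.php), a bash script could throttle its own submissions by checking how many of your jobs are already queued. The names submit_throttled, count_my_jobs, MAX_QUEUED and POLL_SECONDS are invented for illustration, and the limit of 20 is an arbitrary assumption. The script only echoes the sbatch commands unless you set SBATCH=sbatch.

```shell
#!/bin/bash
# Sketch: submit one job per input file while keeping at most MAX_QUEUED
# of your own jobs (running + pending) in the queue.
# Dry run by default: set SBATCH=sbatch in the environment to really submit.
SBATCH="${SBATCH:-echo sbatch}"
MAX_QUEUED="${MAX_QUEUED:-20}"       # assumed limit; choose one that suits the cluster
POLL_SECONDS="${POLL_SECONDS:-30}"   # how long to wait before checking again

# Number of jobs this user currently has in the queue.
# squeue -h suppresses the header line, -u restricts the list to one user.
count_my_jobs() {
    squeue -h -u "${USER:-$(id -un)}" | wc -l | tr -d ' '
}

# submit_throttled SCRIPT INPUT...
submit_throttled() {
    script=$1
    shift
    for input in "$@"; do
        # Wait until we are below the limit before submitting the next job.
        while [ "$(count_my_jobs)" -ge "$MAX_QUEUED" ]; do
            sleep "$POLL_SECONDS"
        done
        $SBATCH "$script" "$input"
    done
}

# Only act when arguments are given on the command line.
if [ "$#" -gt 0 ]; then
    submit_throttled "$@"
fi
```

Saved under a hypothetical name such as submit_many.sh, it could be run as ''SBATCH=sbatch ./submit_many.sh my_script.bash file_*.fasta''. An array job with a "%" limit achieves the same throttling inside Slurm itself and is usually the simpler choice.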

  * Use an array job 
    * sbatch --array="1-10%3" my_job.bash

  * Use the **-n** option on **sbatch** and **srun** within the script to start multiple copies of a program.
    * You can't exceed the number of cores available with this method (the job will be rejected).
    * You can get each of your tasks to do something a little different (e.g. processing a different file) by using the SLURM_PROCID environment variable.
    * The "-l" option on srun will label output lines with the task number.

Files: multi_job and run_multi_job

<code>
srun -n 2 -l multi_job
</code>

<code>
sbatch -n 2 run_multi_job
</code>

In the second case 2 copies of multi_job are run because we put the **-n** on the sbatch command.

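The output of the srun tasks is written to the same file, so lines from different tasks are interleaved. You can separate the output by putting %t (the task number) in the output file name given to the **--output** option of srun, for example on the srun line inside the batch script:

<code>
srun -l --output=multi_job_%t.out multi_job
</code>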
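
To illustrate the SLURM_PROCID technique mentioned above, here is a minimal sketch of a task script. It is a hypothetical stand-in, not the actual multi_job file; the function name input_for_rank and the file_N.fasta naming scheme are assumptions.

```shell
#!/bin/bash
# Hypothetical task script: each copy started by srun picks its own input.
# srun sets SLURM_PROCID to the task rank: 0, 1, ..., ntasks-1.

# Map a task rank to an input file name (file_1.fasta for rank 0, and so on).
input_for_rank() {
    echo "file_$(( $1 + 1 )).fasta"
}

rank="${SLURM_PROCID:-0}"            # fall back to 0 when run outside srun
input=$(input_for_rank "$rank")
echo "task $rank processing $input"
# my_program "$input"                # replace with the real command
```

Started with ''srun -n 2 -l'', the two copies would report file_1.fasta and file_2.fasta respectively.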


===== Limiting the Number of Nodes/Cores Your Jobs Use =====

If you submit many jobs to the cluster (perhaps using individual **sbatch** commands) they will use any of the nodes/cores available on the cluster (in the partition to which you submitted your job). If you submit, say, 1000 jobs you may well fill all available cores on the cluster and leave no resources for other users. So it is a good idea to limit the number of jobs that are running at any one time. An **array job** is an ideal way of doing this: it makes only a single entry in the job queue (rather than 1000 in our example) and it lets you specify the maximum number that should be run at any one time. 

Array jobs are described in more detail on the [[sbatch]] page.
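
As a minimal sketch of the 1000-job example above, an array job script might look like this. The file inputs.txt (one input path per line), the helper nth_line and the limit of 20 simultaneous tasks are assumptions for illustration.

```shell
#!/bin/bash
# 1000 array tasks, at most 20 running at any one time.
#SBATCH --array=1-1000%20
# One output file per task: %A is the array job ID, %a the array task index.
#SBATCH --output=array_%A_%a.out

# Print line N of a list file that holds one input path per line.
nth_line() {
    sed -n "${2}p" "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task. Outside an array job
# the variable is unset and this script does nothing.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    input=$(nth_line inputs.txt "$SLURM_ARRAY_TASK_ID")   # inputs.txt is hypothetical
    echo "array task $SLURM_ARRAY_TASK_ID processing $input"
    # my_program "$input"                                 # replace with the real command
fi
```

The whole array is submitted with a single sbatch command and appears as one entry in the queue.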

Some other options for limiting the number of cores your jobs use are described below.

==== Singleton Jobs ====

You can use the **--dependency** option of sbatch to make Slurm run just one of your jobs at a time. If you submit 10 jobs with the same job name and the **--dependency=singleton** option, Slurm will run them one at a time: each job starts only after all previously submitted jobs with the same name and user have finished.

<code>
for i in $(seq 1 10); do
sbatch --job-name oneatatime --dependency=singleton my_script.bash file_${i}.fasta
done
</code>

==== Sending a Job to a Specific Node ====

You can use the **-w** option to select a specific node. Strictly speaking, it asks for "at least" the nodes in the node list you specify, so a command like:

<code>
sbatch -n 20 -w node2 my_script
</code>

would get you some cores on node2 and some on another node (since node2 has only 16 cores in total). If no cores were free on node2, the job would be queued until some became available.

Note that using the -w option with multiple nodes is not a way of queueing jobs on just those nodes: it will actually allocate cores across all the nodes you specify and run the job on just the first of them. For example:

<code>
sbatch -w node[2-4] my_script
</code>

would allocate one core on each of nodes 2, 3 and 4 and run my_script on node2. You would then use srun from within your script to run job steps within this allocation of cores. Don't use this to limit the nodes you want your jobs to run on.

==== Excluding Some Nodes ====

You can use the **-x** option to avoid specific nodes. A list of node names looks like this:

<code>
node[1-4,7,11]
</code>

Read as "nodes 1 to 4, 7 and 11" i.e. 1,2,3,4,7,11.

You can use **-c 16** to request all cores on a (standard) node.

You can use the **--exclusive** option to ask for exclusive access to all the nodes your job is allocated. This is especially useful if you have a program which attempts to use all the cores it finds. Please only use it if you need it.
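
To make the node list syntax above concrete, the following sketch expands a simple list into individual node names. The function expand_nodelist is a hypothetical helper that handles only one bracket group of plain numbers and ranges; on a Slurm system, ''scontrol show hostnames'' does this properly.

```shell
#!/bin/bash
# Expand a simple node list such as node[1-4,7,11] into individual names.
# Simplified illustration only: one bracket group, no zero-padded numbers.
expand_nodelist() {
    spec=$1
    case $spec in
        *\[*\]) ;;                        # has a bracket group: expand it below
        *) echo "$spec"; return ;;        # plain name: nothing to expand
    esac
    prefix=${spec%%\[*}                   # text before the "["
    body=${spec#*\[}                      # text after the "["
    body=${body%\]}                       # ... without the closing "]"
    out=
    old_ifs=$IFS
    IFS=,                                 # split the body on commas
    for part in $body; do
        case $part in
            *-*) i=${part%-*}; last=${part#*-} ;;   # a range such as 1-4
            *)   i=$part; last=$part ;;             # a single number
        esac
        while [ "$i" -le "$last" ]; do
            out="$out${out:+ }$prefix$i"
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
    echo "$out"
}

expand_nodelist 'node[1-4,7,11]'
# prints: node1 node2 node3 node4 node7 node11
```

When passing such a list to sbatch, quote it so that the shell does not treat the brackets as a file name pattern, e.g. ''sbatch -x 'node[1-4,7,11]' my_script.bash''.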

==== Using Job Steps ====

Files: test-multi.sh, pause.sh

If you use **srun** within an **sbatch** script, the cores used by each srun job step are sub-allocated from the cores allotted to the sbatch script. For example, the following script, test-multi.sh, which is submitted using **sbatch**, requests 3 tasks (one core each by default).

<code>
#!/bin/bash
#SBATCH --ntasks 3

for i in {1..7}; do
echo $i
srun --ntasks 1 --exclusive pause.sh &
done

wait
</code>

Note that the "--exclusive" flag to srun here has a different meaning than when submitting a job from a terminal using srun or sbatch. From "man srun":

<code>
This option can also be used when initiating more than one job step within an existing resource
allocation, where you want separate processors to be dedicated to each job step. If  sufficient
processors are not available to initiate the job step, it will be deferred. This can be thought 
of as providing a mechanism for resource management to the job within it's allocation.
</code>

The "--ntasks 1" on each srun command is important because without it srun will start the pause.sh script on each of the cores allocated to the sbatch command (3 in this case).

The "--exclusive" on each srun command is important as discussed above.

The "&" at the end of the srun line is important: without it each srun would block, and the job steps would be executed one after another.

Similarly the "wait" at the end of the script is important: it stops the script from exiting before all the (now background) srun jobs have finished. Without the wait some of the jobs would likely still be running at the end of the script and would be terminated when their parent script ends.

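Some example code for pause.sh could be as follows. It prints its first argument as a label; test-multi.sh as listed above passes none, whereas the sample output further down was evidently produced with the label loop1.

<code>
#!/bin/bash
# Print a START line, sleep for a minute, then print an END line.
# $1 is an optional label; "date +%s" is the time in seconds since 1970/1/1.
echo "START $1 $(hostname) $(date +%s) SLURM_STEP_ID = $SLURM_STEP_ID SLURM_PROCID = $SLURM_PROCID"
sleep 60
echo "END $1 $(hostname) $(date +%s) SLURM_STEP_ID = $SLURM_STEP_ID SLURM_PROCID = $SLURM_PROCID"
</code>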

Now, if you run:

<code>
sbatch test-multi.sh
</code>

The output file (after about three minutes, when the job has completed) will look something like this:

<code>
1
2
3
4
5
6
7
START loop1 node1 1583960070 SLURM_STEP_ID = 0 SLURM_PROCID = 0
START loop1 node1 1583960070 SLURM_STEP_ID = 1 SLURM_PROCID = 0
START loop1 node1 1583960070 SLURM_STEP_ID = 2 SLURM_PROCID = 0
END loop1 node1 1583960130 SLURM_STEP_ID = 0 SLURM_PROCID = 0
END loop1 node1 1583960130 SLURM_STEP_ID = 1 SLURM_PROCID = 0
END loop1 node1 1583960130 SLURM_STEP_ID = 2 SLURM_PROCID = 0
srun: Job step creation temporarily disabled, retrying
START loop1 node1 1583960130 SLURM_STEP_ID = 3 SLURM_PROCID = 0
START loop1 node1 1583960130 SLURM_STEP_ID = 4 SLURM_PROCID = 0
START loop1 node1 1583960130 SLURM_STEP_ID = 5 SLURM_PROCID = 0
END loop1 node1 1583960190 SLURM_STEP_ID = 3 SLURM_PROCID = 0
END loop1 node1 1583960190 SLURM_STEP_ID = 4 SLURM_PROCID = 0
srun: Job step created
END loop1 node1 1583960190 SLURM_STEP_ID = 5 SLURM_PROCID = 0
START loop1 node1 1583960190 SLURM_STEP_ID = 6 SLURM_PROCID = 0
END loop1 node1 1583960250 SLURM_STEP_ID = 6 SLURM_PROCID = 0
</code>

Going through this output, we see the 7 iterations of the loop in test-multi.sh being executed (the "echo $i" in the loop prints the numbers 1-7). These happen in quick succession because of the "&" at the end of each srun line. Then we see three "START" messages from pause.sh followed 60 seconds later by three "END" messages (the "date +%s" command prints the time in seconds since 1970/1/1). Then we see three more "START" messages and their "END" messages, and finally the seventh step runs on its own. The "srun:" lines are printed by srun commands that had to wait for a free core.

So the "--ntasks 3" sbatch option, and the "--ntasks 1 --exclusive" on the srun command limited the number of processes running at any one time to 3.
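
The timing in the sample output can be checked with a little arithmetic: 7 steps on 3 cores need 3 "waves" of 60 seconds each. The sketch below does the calculation; the function name waves is made up for illustration, and it assumes every step takes the same time.

```shell
#!/bin/bash
# Number of "waves" needed to run STEPS job steps when SLOTS can run at once,
# i.e. STEPS / SLOTS rounded up to a whole number.
waves() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

steps=7
slots=3
seconds_per_step=60
total=$(( $(waves "$steps" "$slots") * seconds_per_step ))
echo "$(waves "$steps" "$slots") waves, about $total seconds"
# prints: 3 waves, about 180 seconds
```

This agrees with the timestamps above: the last END (1583960250) comes 180 seconds after the first START (1583960070).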

This technique also works "across nodes": if you specify "--ntasks 50" as an sbatch option, the job steps will be run on multiple nodes (because the nodes have fewer than 50 cores each). In this case you will see messages from Slurm saying:

<code>
srun: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1
</code>