====== GPU Computing ======

===== SLURM partition =====

The GPUs are accessible through compute nodes in the ''gpu'' SLURM partition. You can send a script to this queue using a command like ''sbatch -p gpu myscript.sh''.

===== Hardware =====

The GPU compute node has two Tesla K20 GPUs:

<code>
$ srun -p gpu lspci | grep -i nvidia
05:00.0 3D controller: NVIDIA Corporation Device 1028 (rev a1)
42:00.0 3D controller: NVIDIA Corporation Device 1028 (rev a1)
</code>

===== Directory structure =====

The development tools for GPU computing, including libraries and compilers, are installed in ''/usr/local/cuda-5.0/''. Of particular interest are ''nvcc'' in ''/usr/local/cuda-5.0/bin/'' and the shared object files such as ''libcudart.so'' and ''libcurand.so'' in ''/usr/local/cuda-5.0/lib64/''.

===== Hardware-specific nvcc build options =====

Because we are using K20 GPUs, you will want to tell the ''nvcc'' compiler that it can use their modern features, such as double-precision floating-point arithmetic. To do this, use %%-arch sm_35%% to specify the 'compute capability' of 3.5. Other options that you will probably want to use for scientific computing are %%--prec-div=true%% and %%--prec-sqrt=true%%. For more information, see the [[http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc#nvcc-command-options|nvcc options page]].

===== Headless GPGPU profiling =====

The NVIDIA Visual Profiler [[https://developer.nvidia.com/nvidia-visual-profiler|nvvp]] (packaged as ''nvidia-visual-profiler'' for Ubuntu) is a cross-platform way to profile GPGPU applications, but it is less useful when we can run the applications only on the command line through the SLURM resource manager. In this case we can use [[http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview|nvprof]] (packaged as ''nvidia-profiler'') to produce a log file which can be downloaded and viewed on a desktop using ''nvvp''.
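Putting the build options and the headless profiler together, a typical workflow might look like the following sketch. The application name ''myapp'' and the profile file name are illustrative, not part of the cluster setup:

<code bash>
# Compile with the K20-appropriate options: compute capability 3.5
# and IEEE-compliant division and square root
/usr/local/cuda-5.0/bin/nvcc -arch sm_35 --prec-div=true --prec-sqrt=true \
    -o myapp myapp.cu

# Run under nvprof on the gpu partition; --output-profile writes a
# profile file that can be copied to a desktop and opened in nvvp
srun -p gpu nvprof --output-profile myapp.prof ./myapp
</code>

The ''myapp.prof'' file can then be transferred (for example with ''scp'') to a machine with a display and imported into ''nvvp''.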
===== Running a sample script =====

**1) Interactive shell**

Below are the contents of a sample CUDA script ''hello_world.cu'', which prints ''Hello, CUDA!'' on execution.

<code c>
#include <cstdio>

__global__ void helloCUDA() {
    printf("Hello, CUDA!\n");
}

int main() {
    helloCUDA<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
</code>

You can now compile this code with ''nvcc'' (the NVIDIA CUDA compiler) and start an interactive session using ''srun''. The ''srun'' parameters are:

  * ''-p gpu'' selects the ''gpu'' partition.
  * ''--nodes=1 --ntasks-per-node=1'' requests one node with one task per node.
  * ''--time=01:00:00'' sets a time limit of one hour for the job.
  * ''--pty bash -i'' requests an interactive Bash shell on the allocated node, which allows for command-line interaction.

You can then execute the code on the allocated node with ''./hello_world''.

<code>
> nvcc -o hello_world hello_world.cu
> srun -p gpu --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
> ./hello_world
Hello, CUDA!
> exit
</code>

**2) Job script**

You can submit the same CUDA program with a job script. Below are the contents of ''script.sh'', which specifies the job parameters and captures the program output.

<code bash>
#!/bin/bash
#SBATCH --job-name CudaJob
#SBATCH --output result.out
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --time=0-00:10:00

## Run the program
srun $HOME/hello_world
</code>

You can now submit the job with this script using ''sbatch'' and expect the output in ''result.out''.

<code>
> sbatch script.sh
Submitted batch job 2432982
> cat result.out
Hello, CUDA!
</code>
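While the batch job is queued or running, you can monitor it with standard SLURM commands. A minimal sketch, assuming the job ID from the example output above:

<code bash>
# List your own queued and running jobs
squeue -u $USER

# Show accounting information for a specific job,
# e.g. its final state and elapsed time
sacct -j 2432982 --format=JobID,JobName,State,Elapsed
</code>

If the job is no longer listed by ''squeue'', it has finished (successfully or not), and ''sacct'' will show its final state.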