===== Environment Modules =====
Environment modules allow you to control which software (and which version of that software) is available in your environment. For instance, at the time of writing, the cluster has four different versions of standard R installed: 3.5.3, 3.6.3, 4.0.5 and 4.1.0. When you first log in and try to run R, the OS will respond with "command not found". To activate R in your environment you would type:
module add R
That would then give you access to the most recent version of R available (4.1.0 in this case).
To use a specific version instead, you would type something like:
module add R/3.6.3
To get a list of all available software you can type:
module avail
To get a full list of module commands:
module --help
There's a shorthand version of the **module** command: **ml**. To load a module you can use just:
ml fastp
to, for instance, load the fastp program.
You can also issue the other module commands using ml:
ml avail
ml list
ml purge
...
Just typing **ml** on its own is the same as **ml list**.
Whether using **module** or **ml** you can load multiple modules with a single command:
module add R/4.1.0 samtools bedtools
or
ml R/4.1.0 samtools bedtools
==== Where to Run Module Load Commands ====
You have several options for where to run "module load" commands.
- Put module load commands into your .bashrc or .bash_profile file
* Modules/programs loaded here will be available immediately upon login.
* This is a good choice for programs you use a lot from the command line. R might be a good example.
* The specific choices you make in your .bashrc file can be overridden in shell scripts that you run on the head node or submit to compute nodes.
- Run module load commands as needed "by hand" (on the command line).
* This may be OK if you usually submit jobs by running scripts, and those scripts load modules themselves.
- Put module load commands into scripts that you run on the head node to submit jobs to the compute nodes e.g. scripts that run **sbatch** commands.
  * Doing this does not affect the modules you have loaded in your login shell, and any modules loaded within the script affect only commands run within the script.
- Put the module load commands in the scripts that you submit to the compute nodes.
* This is subject to the same comment as technique 3.
- Put the required module load commands into a text file and then "source" (bash command) the text file as necessary.
* This might be a good way to specify what programs/modules are needed for specific pipelines that you commonly run.
* You could have your scripts source the list of programs needed for the pipeline rather than explicitly listing module load commands in each script.
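As a sketch of the last two techniques combined, a job script might source a shared file of module load commands rather than listing them itself. The file name ''~/pipeline_modules.sh'' and the script names below are hypothetical:

```shell
#!/bin/bash
# pipeline_job.sh -- hypothetical job script, submitted with sbatch.
# Instead of listing module load commands here, source a shared file.
# ~/pipeline_modules.sh is a hypothetical file containing lines like:
#   module purge
#   module load R/3.6.3 samtools bedtools
source ~/pipeline_modules.sh

# ...pipeline commands now run with those modules loaded...
Rscript run_pipeline.R   # hypothetical script name
```

Keeping the module list in one file means every script in the pipeline sees the same software versions, and updating the pipeline's requirements is a one-file change.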
==== Module Conflicts ====
When you issue a module load command the modules program checks whether you already have a different version of the same program loaded as a module. If you do, it reports an error and does not load the module a second time.
Suppose you have the latest version of R loaded in your .bashrc file, but also have a pipeline that has been installed and thoroughly tested with a previous version of R, and you have in the scripts for that pipeline something like:
module load R/3.6.3
This would generate an error because the latest version of R is already loaded.
So, in your script, you could unload R before loading the required version.
module unload R
module load R/3.6.3
You might also consider unloading all modules, and loading only those you need, before starting up your pipeline:
module purge
module load R/3.6.3
==== Not using Modules ====
In general, the environment modules just edit your PATH variable (they usually prepend a directory to your PATH). The convention is that all programs loaded by "module load" can be found in a sub-directory of **/opt**. The naming convention is:
/opt/SOFTWARE/VERSION
Where SOFTWARE would be replaced by the name of the software, e.g. R, and VERSION would be replaced by a version number for that software, e.g. 3.6.3 (for R).
Usually (but not always) the executable programs will be in /opt/SOFTWARE/VERSION/bin.
You can find specifically what "module load" does for a piece of software by looking at the modulefile for that piece of software. It can be found at:
/opt/modules/modulefiles/SOFTWARE/VERSION
(This is a file - not a directory.)
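Rather than reading the modulefile directly, you can also ask the module system to summarise what loading it would do (the environment variables it sets, the directories it prepends to your PATH, and so on):

```shell
module show R/3.6.3
```

''module display'' is a synonym for ''module show''.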
If you prefer not to use modules, you can just update your PATH to include the directories of the pieces of software you want to use (possibly updating your PATH in .bashrc).
In most cases, modules is just a nice, easy way of updating your PATH, so it is usually preferable to use the module command rather than updating your PATH explicitly.
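For example, to put R 3.6.3 on your PATH by hand, you would prepend its bin directory (a sketch, assuming the directory exists on your cluster and follows the /opt/SOFTWARE/VERSION convention):

```shell
# Prepend R 3.6.3's bin directory to the PATH, much as
# "module load R/3.6.3" would do.
export PATH=/opt/R/3.6.3/bin:$PATH
```

You could put the same line in your .bashrc if you wanted it in every login shell.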
==== Special Purpose Modules ====
There are a couple of special purpose modules.
* dot
* The dot module just puts your current working directory into your path allowing you to run a program or script that is in your cwd with just "programname" rather than "./programname".
* This can lead to confusion if you have programs or scripts with the same name in different directories and forget where you are.
* use.own
* This module allows you to use modulefiles of your own.
* Try "module help use.own" for more information.
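A hedged sketch of what using these two modules looks like (''myscript.sh'' is a hypothetical script; the ''~/privatemodules'' location is the usual default for use.own, but check "module help use.own" on your cluster):

```shell
module load dot       # adds your current working directory to PATH
myscript.sh           # now finds ./myscript.sh without the "./" prefix

module load use.own   # looks for your own modulefiles (typically under ~/privatemodules)
```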
==== Oddities and Exceptions ====
=== Python and Perl ===
Since various OS-level tools need python and perl, there are versions of these languages installed system-wide; no module is needed. These are python3 (version 3.8.5) and perl (version 5.30.0). You are welcome to use these, but there are also modules with slightly different versions:
* python 2.7.18 for older software that requires python2
* python 3.9.5
* perl 5.34.0
The system-wide python is accessible only as "python3". When a python module is loaded just "python" will start up the relevant version of python.
Python and perl packages that users request will be installed into the module versions of these programs. You can install python and perl packages locally as you wish (using any of these versions).
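For example, a local python package install might look like this (the package name is hypothetical; with --user, pip installs into ~/.local rather than the module's system-wide directory):

```shell
ml python                                # load a python module first
python -m pip install --user biopython   # hypothetical package; lands in ~/.local
```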
More information about environment modules can be found here: https://modules.readthedocs.io/en/latest/
=== Installing R Packages ===
If you try to install an R package (as an ordinary user) and get a "permission denied" message like this:
* installing to library '/opt/R/4.1.0/lib/R/library'
Error: ERROR: no permission to install to directory '/opt/R/4.1.0/lib/R/library'
Then you might need to create the correct directory for R to use for package installation within your home directory. For R 4.1.0 this would be:
~/R/x86_64-pc-linux-gnu-library/4.1
Where "~" means "your home directory". You should only specify the first two parts of the full version number (hence the 4.1 for version 4.1.0).
You can create the directory from the command line, like this:
cd
mkdir -p ~/R/x86_64-pc-linux-gnu-library/4.1
Or you can do it from within R, and then you won't need to know any details like the specific version number - the R program that you have started will fill them in for you:
dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE)
=== Rscript and the "#!" Hack ===
If you have used
#!/usr/bin/Rscript
as the first line of your R scripts so that you can run them just like programs on the old cluster, they will no longer work on the new cluster. This is because there is no interpreter at /usr/bin/Rscript on the new cluster.
On the new cluster you should load an R module (possibly from within your .bashrc file so that R is always available when you log in), and then use:
#!/usr/bin/env Rscript
at the top of your R scripts.
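As a minimal sketch (''hello.R'' is a hypothetical file name), assuming an R module is loaded so that Rscript is on your PATH:

```shell
# Create a small R script that locates Rscript via env,
# then mark it executable.
cat > hello.R <<'EOF'
#!/usr/bin/env Rscript
cat("Hello from Rscript\n")
EOF
chmod +x hello.R
# ./hello.R   # runs with whichever Rscript is first on your PATH
```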