RMS - Run My Samples¶
RMS is a “cluster scripting language” and execution engine that makes it easy to create computational pipelines and run them across a compute cluster. The software takes a templated RMS script plus spreadsheet data files, generates commands by using the spreadsheet data to fill in the templates (e.g., for each file, each sample, or each trio), and runs them on the current computer or across a cluster.
And, calling it a language overstates the case just a bit. It really consists of extra lines to organize the Bash, Perl, Python or R scripts of a pipeline, along with fill-in-the-blank “template elements” that are replaced when creating the executed commands. You write the pipeline steps in any combination of the four languages, include the RMS lines and template elements, and then run RMS with the script and the spreadsheet data as you would any program. It handles all of the details of creating, queueing and running jobs across the cluster.
For example, if you have the file ‘myfiles.txt’, containing the names of three files plus a column header on the first line,
File
myfile1
myfile2
myfile3
and have the following RMS script ‘myscript.rms’, where the first line is an RMS line saying to execute the lines below as a command (and to execute that for each “File” in the spreadsheet data),
#### runTheFiles File
mycommand <File>
then you can run rms as follows:
rms myscript.rms myfiles.txt
and RMS will execute the three commands “mycommand myfile1”, “mycommand myfile2” and “mycommand myfile3” across your cluster, handling all of the details of cluster queues and job submissions.
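Conceptually, the template expansion is equivalent to the following shell loop (an illustrative sketch only, not how RMS is implemented); here the generated commands are echoed rather than run:

```shell
# Recreate the example spreadsheet: a header line plus three file names.
printf 'File\nmyfile1\nmyfile2\nmyfile3\n' > myfiles.txt

# Skip the header line, then fill the <File> template once per data row.
tail -n +2 myfiles.txt | while read -r file; do
    echo "mycommand $file"
done
```

Unlike this sequential loop, RMS queues the generated commands and runs them concurrently across the cluster.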
The best place to get started is with Installation and Configuration, because you likely will need to add a configuration file describing your cluster, even if the RMS software has been installed already. Getting Started - Hello World!, part 1 will let you test the software on your cluster and see another simple RMS example. Then, the rest of the documents contain more examples and go into detail about RMS scripts and running RMS.
Installation and Configuration¶
Installation¶
The software is available at https://github.com/knightjimr/rms, and the latest stable release is available at https://github.com/knightjimr/rms/releases. No pip installation script is available yet, so the easiest way to install is to download the release, untar it, and add the directory to your PATH environment variable. The software is a pure Python implementation with no dependencies. It was developed using Python 2.7, but should be compatible with 2.6+ and 3.* versions.
Configuration¶
Configuring RMS to use your cluster takes a little more work, because every cluster’s job scheduler has a different setup, and many of the resource limit rules are not detectable automatically. So, the first thing you need to do is learn what job queues you can submit jobs to, and what resource limits, if any, they have. If you’ve been using the cluster for a while, you likely know much of this information already.
Another helpful thing to do with some clusters is to test the resource limit options using interactive jobs, to see which options ensure that you get compute node resources quickly. The documentation for one cluster I tried said it had 20-core compute nodes, and that jobs are killed after 24 hours. When I tried those options with interactive jobs, I found that setting the time limit to 24 hours was fine, but trying to ask for 20 cores on a single host caused the job to just sit in the queue. When I changed it to 10 cores, my jobs started quickly. Setting your resource limits so that jobs start quickly allows RMS to begin your computations quickly, and then ramp up when the cluster is not busy.
Defining the Cluster¶
RMS first reads the file “RMSRC” in the same directory as the RMS software, to get the system-level configuration, then tries to read “~/.rmsrc” for any user changes to the configuration. Similar to ulimit, user changes only reduce the resource limits defined in the system RMSRC, although there is an override option to allow a user to change the limits, regardless of what the system configuration is (at their own risk…since, if the resource limits don’t match the actual cluster configuration, their jobs will just sit in the queue forever…).
The RMSRC (or ~/.rmsrc) lines have the format “queueName option=value…” to define the job queues. For example, the first cluster I used was a PBS/Torque cluster with three job queues, “default”, “highcore” and “bigmem”, each with different configurations and where highcore and bigmem were to be used only when needed. So, the RMSRC file for that cluster was the following:
default type=torque;cpulimit=100;ppn=8;default=true
bigmem type=torque;joblimit=4;ppn=64
highcore type=torque;ppn=48
All were of type “torque”; the default queue had a limit of 100 cores, with compute nodes of 8 cores each. The bigmem queue allowed only 4 jobs at a time, and each of its compute nodes had 64 cores. The highcore queue had no resource limits, and had 48 cores per node.
The current cluster is a Slurm cluster with “general” and “scavenge” queues, each with a 300-core limit, compute nodes that have 20 cores and 124GB of memory (Slurm memory management is turned on, so you must request and stay within the memory limit or the job is killed), and a one-week job time limit. The RMSRC file for that cluster is the following:
general type=slurm;ppn=20;mem=124;cpulimit=300;walltime=168;wallbuffer=12;default=true
scavenge type=slurm;ppn=20;mem=124;cpulimit=300;walltime=168;wallbuffer=12
The options string for each queue line contains a semi-colon separated list of “name=value” pairs, defining the properties of that queue. The core properties that must be defined for each queue are the following:
- type
- This defines the type of the job scheduler (see below for the list of supported job schedulers and their values).
- ppn
- This defines how many cores to request when submitting an RMS worker to run on a compute node. RMS creates one long-lived worker for each compute node it uses, and passes multiple commands to that worker, so the ppn value should be the number of cores that the compute nodes in this job queue have, not how many might be used in the RMS script steps. However, you can set this to request only partial access to the nodes (i.e., 10 out of 20 cores), and RMS will only use that proportion of the cores, memory and tmp space on the compute nodes.
- default
- This says whether to use this queue by default, if the list of queues is not explicitly named in the “-n” command-line option. [default: false]
There are a number of properties that can be used to define the resource limits for the queue. RMS interacts with the job queue by submitting one job for one compute node, in order to run the RMS worker process on that node (and will do that multiple times, as needed, so that RMS can expand and contract the number of compute nodes used, based on the commands ready to be run in the script).
- cpulimit
- This defines the limit of cores that this queue will allocate to a user. RMS will stop submitting jobs for RMS workers when it reaches this limit.
- joblimit
- This defines the limit on the number of jobs that a user can submit to the queue. RMS will only submit at most this number of RMS worker jobs.
- nodelimit
- This defines the limit on the number of nodes that a user can submit to the queue. RMS will only request at most this number of compute nodes.
- mem
- This defines the memory limit that a job can use, in gigabytes. This memory limit will be passed as an option to the job scheduler (and be used as the limit for what can be run by the job).
- walltime
- This defines the time limit that a job can run, in hours. This time limit will be passed as an option to the job scheduler.
- wallbuffer
- This defines the buffer time, in hours, that RMS should use for any job where walltime is defined, in order to stop sending commands to the worker that is near being killed. To avoid interrupted commands, this should be set longer than any command may run. For example, if a queue defines “walltime=24;wallbuffer=2”, RMS will submit a worker job with a time limit of 24 hours, send commands to that worker for the first 22 hours, then stop sending commands at the 22 hour mark, and wait for that worker to die as soon as all of the executing commands are completed, or the worker is killed. (If more commands need to be run, new worker jobs will be created.)
- account
- This defines the account to be passed as an option to the job scheduler.
- queue
- This defines the actual cluster queue to be used to submit jobs for this RMS “queue”. This option allows you to create multiple RMS “queues” for a single cluster queue, in order to either use different numbers of cores or amounts of memory (if your cluster has different kinds of compute nodes on the same queue) or use different accounts (if your cluster has account processing and you have multiple accounts that can be used). The value of this option can also be a previously defined RMS “queue”, in which case the options set for that “queue” are copied as the defaults for this “queue”.
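As an illustration of the “queue” property, the following lines (with hypothetical queue and account names) define two RMS “queues” that both submit to a single Slurm queue named “general”: one requesting full nodes under one account, and a second that copies the first’s options as defaults, then overrides them to request half nodes under another account:

```
biglab type=slurm;queue=general;ppn=20;mem=124;account=biglab
smalllab type=slurm;queue=biglab;ppn=10;mem=62;account=smalllab
```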
Supported Job Schedulers¶
The software is set up to handle a number of different job schedulers, but not all are supported (because I don’t have access to clusters with other schedulers to test the functionality). The list of supported job schedulers is the following (listed by the value to use for the “type” property above):
- torque
- The PBS-Torque job scheduler.
- lsf
- The Platform LSF job scheduler.
- slurm
- The SLURM job scheduler.
The list of schedulers that the software is ready to support, but has not been tested, is the following:
- pbs
- The PBS job scheduler.
- sge
- The SunGrid Engine job scheduler.
If you are willing to help test one of these schedulers, or have a different job scheduler on your cluster, please contact knightjimr@gmail.com.
Aliases¶
Because of some of the complexities of interactive vs. non-interactive bash shells, aliases defined in your ~/.bashrc file cannot be used in RMS scripts (trying to work around that actually caused more initialization errors in more people’s scripts). RMS automatically sets the “expand_aliases” option at the beginning of every bash script it runs, so even if your version of bash disables aliases by default, aliases can be defined and used anywhere in the bash sections of RMS scripts. However, an alias defined only in the ~/.bashrc file will not work.
To support those aliases, RMS has adopted the “standard” workaround that other software uses. RMS looks for and loads a file ~/.alias at the beginning of each shell script, if that file exists. So, if you have defined aliases in your ~/.bash_profile or ~/.bashrc file that you would like to use in RMS scripts, copy those aliases into ~/.alias, and then add the following lines to your ~/.bashrc script:
if [ -f ~/.alias ]; then
. ~/.alias
fi
(and possibly to your ~/.bash_profile, if that script does not contain the standard lines that load your ~/.bashrc file every time it runs.)
Note that some bash shells have alias expansion turned on by default, in which case this may not be necessary (I don’t currently have access to such a machine, so I have not tested it).
Getting Started - Hello World!, part 1¶
The RMS software has been packaged with three executable scripts, hello1, hello2 and hello3. Running these scripts will test your access to the software, and its ability to run across your cluster. The source code for the scripts is shown and explained in Getting Started - Hello World!, part 2.
The first script to run is hello1, which tests your access to RMS. The example here just uses “a b c d” as the arguments on the command-line, but the script can take any text as arguments. When you run hello1, the results should appear similar to this (“$” is the command-line prompt, and the text after it is the command that was run):
$ hello1 a b c d
Hello a!
Hello b!
Hello c!
Hello d!
Said hello to all of the arguments!
The hello1 script just runs on the current computer, not across the cluster. The second script, hello2, generates equivalent output, but it executes across the cluster. Running it should generate the following output:
$ hello2 a b c d
Hello a, from the cluster!
Hello b, from the cluster!
Hello c, from the cluster!
Hello d, from the cluster!
Said hello to all of the arguments from the cluster!
Be aware that this script will run slower (possibly much slower) than hello1. The reason is that RMS queues a job for a compute node on the cluster, then sends the commands to the RMS program running on that node for execution. So, the time it takes to generate the output will depend on how long it takes to allocate a compute node and then communicate the execution of those commands. To get a gauge of how long that might be, start up an interactive job on the cluster, and time how long it takes to get the command prompt.
If hello2 takes much longer than that to execute, use qstat (or your cluster’s equivalent) to determine whether there is a compute job queued or running. RMS names its jobs “worker1”, “worker2”, … for each compute node it allocates. If the job is still queued, then RMS is waiting for the cluster to run the remote worker program. If there is no job queued or running, run “cat RMS_hello2*/worker1.pbs.err” to see if the RMS worker reported an error before it was able to contact the head RMS process.
If you get an error message from hello2, that is a signal that the cluster configuration is not right, and you’ll need to configure access to the cluster properly, so that RMS can run jobs on the cluster.
The third script, hello3, does the same computation across the cluster as hello2, but displays the output in the form that you will typically see when you run RMS with your own scripts. Running hello3 should display output similar to the following (where the lines between “Pipeline execution starting” and “Pipeline execution completed” actually overwrite each other on the screen as the scripts are run across the cluster, to show the progress messages about the computation):
$ hello3 a b c d
Commands: 5 commands to be executed.
[Wed Jan 20, 11:46am]:Pipeline execution starting.
[Wed Jan 20, 11:46am]: hello[4]: 4q,0r,0f,0c
[Wed Jan 20, 11:46am]: hello[4]: 3q,1r,0f,0c
[Wed Jan 20, 11:46am]: hello[4]: 2q,2r,0f,0c
[Wed Jan 20, 11:46am]: hello[4]: 1q,3r,0f,0c
[Wed Jan 20, 11:46am]: hello[4]: 0q,4r,0f,0c
[Wed Jan 20, 11:46am]: hello[4]: 0q,3r,0f,1c
[Wed Jan 20, 11:46am]: hello[4]: 0q,2r,0f,2c
[Wed Jan 20, 11:46am]: hello[4]: 0q,1r,0f,3c
[Wed Jan 20, 11:46am]: helloAll[1]: 1q,0r,0f,0c
[Wed Jan 20, 11:46am]: helloAll[1]: 0q,1r,0f,0c
[Wed Jan 20, 11:46am]:Pipeline execution completed.
In this output, “hello” and “helloAll” are the names of the two steps in the hello3 RMS script, the number in brackets is the count of how many commands of that step will run (4 hello commands and 1 helloAll command), and the abbreviation ‘q’ stands for queued, ‘r’ stands for running, ‘f’ stands for failed and ‘c’ stands for completed. These progress message lines display the currently queued and running commands, giving a real-time view of how the computation is executing across the cluster, and whether it has made any progress recently.
Whenever rms is executed in this mode (and this is the default execution mode), rms writes two files, in this case RMS_hello3.stdout and RMS_hello3.stderr. They contain the ordered standard output and standard error text from the scripts, output in the order the commands would be run if they were executed sequentially, regardless of the order they were actually executed across the cluster. So, if you run “cat RMS_hello3.stdout”, you will see the same output as was generated by running hello2.
Getting Started - Hello World!, part 2¶
Okay, so technically, none of the three hello scripts from Getting Started - Hello World!, part 1 implement a true “Hello World!” script. The main reason for that is that most RMS scripts will involve more than one computational step, and so the hello scripts give an example of how to do that. Second, the correct “Hello World!” involves setting two RMS options to change its default behavior, and so it requires a bit more explanation. The real “Hello World!” script is shown at the bottom of this page.
Hello3 Script¶
The simplest of the three hello scripts is actually hello3, and that script is the following:
#!/usr/bin/env rms
##argv=Arg
#### hello Arg
echo "Hello <Arg>, from the cluster!"
#### helloAll all
echo "Said hello to all of the arguments from the cluster!"
The first line of this script is the standard shell script shebang line used to make the file directly executable, just as can be done with Perl or Python scripts. In this case, the RMS command “rms” is executed, and given the script to run.
The two most important lines of the script are the two that begin with “#### “:
#### hello Arg
#### helloAll all
Any line beginning with exactly four ‘#’ followed by a space is an RMS “step divider” line, and it marks the beginning of each step of the pipeline script. The format of those lines is “#### name column [exists]”, where
- “name” is the name of the step, and can be any string. These names will be displayed in the progress messages that appear when a script is running.
- “column” is either a column header from a column of the spreadsheet data, or the word “all”, which tells RMS to run exactly one command for this step. When RMS runs, it will create one executable command for each distinct value found in that column of the spreadsheet.
- “exists” is an optional filename or path to a file. If a name or path is given, that name/path is tested to see if it exists in the filesystem. If it exists, then that command is skipped during the execution.
So, the two lines in the above script specify the two steps of the pipeline, one named “hello” which is executed for each value in the “Arg” column of the spreadsheet data, and one named “helloAll” which is executed once (i.e., for all of the spreadsheet data).
The lines after each of the step divider lines are the lines that are executed in the commands for the step, and can be written in bash, perl, python or R. Here, they are written in bash (the default), and just use the echo command to write text to standard output.
The one unusual part of the first echo line is the “<Arg>” string appearing on the fourth line.
echo "Hello <Arg>, from the cluster!"
That is an RMS “template element”, which RMS replaces as it creates the executable command. The basic format for a template element is “<column>” where “column” is a column header name from the spreadsheet data.
If you have been wondering where the spreadsheet data is, that is the purpose of the second line, “##argv=Arg”. In an RMS script, header text before the first step divider line can be used to process the command-line arguments, and, in the header, a line of the form “##argv=column” tells RMS to make a single column spreadsheet out of the command-line arguments, and use “column” as the column header for that spreadsheet.
More details of the syntax and structure of RMS scripts can be found in Writing RMS Scripts - Basics and Writing RMS Scripts - Details.
Hello2 Script¶
The script hello2 is very similar to that of hello3, but it contains one extra line, used to adjust the behavior of RMS when it runs:
#!/usr/bin/env rms
##argv=Arg
##option=--log=-
#### hello Arg
echo "Hello <Arg>, from the cluster!"
#### helloAll all
echo "Said hello to all of the arguments from the cluster!"
In this script, the added line “##option=--log=-” is another RMS header section line, which sets RMS command-line options. In this case, the string “--log=-” is an RMS option that tells RMS where to write each command’s stdout and stderr text. Setting --log to “-” tells RMS to direct the stdout and stderr of the commands it runs to the stdout and stderr of the RMS program. As a result, hello2 executes the same commands as hello3, but the command output is written (in order) to the screen.
Hello1 Script¶
The final script, hello1, is essentially the same as hello2, but it sets one additional RMS option, “-s”, telling RMS to perform a sequential execution of the commands on the current computer, instead of running across the cluster:
#!/usr/bin/env rms
##argv=Arg
##option="-s --log=-"
#### hello Arg
echo "Hello <Arg>!"
#### helloAll all
echo "Said hello to all of the arguments!"
(plus the text of the echo commands is slightly different). For more information on the RMS command-line options, see Command-Line Help Text.
Hello World! Script¶
Finally, the script that implements “Hello World!” is the following:
#!/usr/bin/env rms
##option="-s --log=-"
#### HelloWorld all
echo "Hello World!"
or, you can run it from the command-line directly using
rms -s --log=- -e 'echo "Hello World!"'
Writing RMS Scripts - Basics¶
An RMS script is a text file, just like any Perl or Python script, and the bulk of the script will actually be written in bash, Perl, Python or R (whichever language you choose for each step). RMS itself adds three elements that define the RMS-specific parts of the script:
- “Step divider” lines that begin with “#### ” (four pound signs followed by a space). These lines separate the steps of the pipeline.
- Option lines that begin with “##word” or “##word=”, which are used to define optional values for the header or steps of the pipeline. The specific allowed values of “word” are different for each section of the RMS script.
- Template elements of the form “<column>” or “<column,option…>” in the text of the steps, which define the fill-in-the-blank elements that RMS replaces when creating each executable command that is run.
An RMS script can be as simple as a single step divider line followed by the script for that step, but typically there are up to three common sections in most scripts:
- An optional “header” section which processes the command-line arguments and/or provides hard-coded input to RMS.
- An optional “environment” section which is typically used to setup the environment variables ($PATH, …) for each of the commands run by RMS.
- One or more “step” sections, which are the steps of the script.
Writing RMS Scripts - Details describes the other optional sections that can be included in RMS scripts, and goes into complete detail about the features of each section.
Header Section¶
The header section occurs before the first step divider line (and before the environment section), and is used either to hard-code some input for RMS or to provide a script that should be run to process the command-line arguments. The examples in Getting Started - Hello World!, part 2 contained header sections with hard-coded RMS directives, and the four most common directives that can be included in the header are the following:
The “argv” directive to tell RMS to convert the command-line arguments into a single column spreadsheet with the given column header name:
##argv=name
The “sheet” directive to give RMS a hard-coded spreadsheet of values. The format of the sheet directive is similar to a bash here-document, where lines are passed as standard input to a program using a marker like “EOF” at the beginning and end of the text. An example is the following:
##sheet=EOF
Sample File Type
S1 S1_reads_R1.fastq R1
S1 S1_reads_R2.fastq R2
S2 S2_reads_R1.fastq R1
S2 S2_reads_R2.fastq R2
EOF
The lines between “##sheet=EOF” and “EOF” contain the spreadsheet data that RMS will parse and include in its input (in this case, a three column spreadsheet with “Sample”, “File” and “Type” columns).
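To make the row and column structure concrete, here is a small Python sketch that parses that sheet text into rows (assuming whitespace-separated columns; RMS’s actual column-splitting rules may differ):

```python
# The sheet text between the EOF markers, as in the directive above.
sheet = """\
Sample File Type
S1 S1_reads_R1.fastq R1
S1 S1_reads_R2.fastq R2
S2 S2_reads_R1.fastq R1
S2 S2_reads_R2.fastq R2
"""

lines = sheet.strip().splitlines()
header = lines[0].split()   # column names: Sample, File, Type
rows = [dict(zip(header, line.split())) for line in lines[1:]]
```

Each row becomes a mapping from column header to value, which is how the template elements described below can look up a value like <File> for a given row.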
The “option” directive to pass RMS command-line options. These options are parsed as if they actually occurred on the command-line; however, options found on the command-line override options given in the RMS script header.
##option="-s --log=-"
User-defined, name-value lines of the form “##name=( values )”, where the “name” can then be used in template elements in the script, and is replaced with the value or values found within the parentheses. This is commonly used to define an option value for the rest of the script (like the genome to be used), or to create a set of values to generate commands for (like a range of k-mer values to test when aligning reads):
##genome=( hg19 )
##kmer=( 10 12 14 16 18 20 22 24 26 28 30 )
The full set of header directives is described in Writing RMS Scripts - Details. Each of these directive lines must occur at the beginning of the line, and must be of the form “##name”, “##name=value” or “##name=(…)”, and may not contain spaces, except between the parentheses or the quotes. The reason for this is to limit the interference the RMS line format may have with real Python or Perl program comment lines, so that they are not mistaken for RMS lines.
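Putting these directives together, a complete header plus a first step might look like the following sketch (the tool name “mytool” and its options are hypothetical, used only to show where the directives and template elements fit):

```
##argv=Sample
##genome=( hg19 )
##option="--log=-"
#### align Sample
mytool --genome <genome> <Sample>.fastq > <Sample>.out
```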
Header Script¶
The header section may also contain a bash, Python, Perl or R script, which allows the script, rather than RMS, to process the command-line arguments. If there is at least one line in the header that is not an RMS directive line and is not a comment line beginning with “#”, then RMS assumes that the header contains a script, and will pass it the command-line arguments for processing.
What this header section script needs to do is to output RMS directive lines on its standard output, which is piped into RMS to define the input for the program. For example, a Python script which implements the “##argv=Arg” functionality is the following:
##python
import sys
print "##sheet=EOF"
print "Arg"
for arg in sys.argv[1:]:
    print arg
print "EOF"
The first line of this script is “##python”, telling RMS that the language for the script is Python (there is also “##perl”, “##R” and “##bash” for those languages). The rest of the script is Python code which outputs the line for a “##sheet” directive, defining a one-column spreadsheet (with column header “Arg”) containing the command-line arguments.
Any functionality is permitted in this script: you can read files, call commands using subprocess, or do whatever else is necessary to parse the command-line arguments and output (on its standard output) the spreadsheet data and options to be used in the RMS execution. Once this script terminates, RMS processes the directives and begins the execution.
Environment Section¶
Many clusters don’t support the inheritance of environment variables (PATH, PWD, …) for the jobs that are submitted, so the commands that RMS executes across the cluster may not begin with the environment values that exist when you execute the RMS command.
RMS takes care of loading your ~/.bash_profile and ~/.bashrc files (so, no need for “source ~/.bashrc” in your scripts), and also sets the current working directory for the command to be the same as when you started the RMS command (so, no need for “cd /my/hardcoded/starting/directory” in your scripts either). But, it may not have the other environment variables, and, in particular for writing scripts to be run by other users, there may not be an assurance that the software you want to run in the RMS script is already setup in the users’ environment.
The environment section is used to set up the environment variables for each command’s script execution. It begins with a “##env” line before the first step divider line, and all of the lines between “##env” and the first step divider line are assumed to be the environment section.
For example, if you want to write an RMS script to use samtools to index one or more bam files, but are not sure that the samtools executable is on each user’s PATH (but you know the executable is in /opt/bioinfo/software/samtools-1.2), then the following script will ensure that the samtools executable is found for each execution of the command:
##argv=file
##env
export PATH=/opt/bioinfo/software/samtools-1.2:$PATH
#### index file -
samtools index <file>
Whatever lines you would normally put at the beginning of a bash script to setup the environment can be put here, and it will get loaded at the beginning of every command execution.
Environment sections are also used for Python, Perl or R scripts. When RMS creates an executable command, it creates a bash script that contains (1) RMS initialization lines, (2) the environment section lines and (3) a language-specific body. For RMS steps whose language is bash, RMS just adds the lines from the RMS step directly into the bash script. For the other languages, the bash script contains a launcher which runs python, perl or Rscript on a file containing the lines from the RMS step.
Step Section¶
The rest of the RMS script consists of the “step sections” that make up the pipeline steps to be executed. Each section begins with a step divider line and optional RMS option lines. The rest of the section is a bash, python, perl or R script, written (with two exceptions) just as it would be written as a separate script. The first exception is that the script can contain “template elements”, fill-in-the-blank elements like “<file>” that RMS will replace when it generates the command script. The second exception is that any lines occurring in the environment section (described above) do not need to be included in each step.
The step divider line that begins a step section serves three purposes, (1) mark the beginning of a new step, (2) define what commands to generate for the step and (3) support incremental execution with a “file test” to determine when to skip the execution of the step. The format of the step divider line is the following:
#### name column(s) [filetest]
The “name” value is the name of the step, is shown in progress output and error messages, and can be any non-whitespace string. The “column(s)” value is a comma-separated list of the spreadsheet column headers which will be used to determine what commands are generated for this step during execution. The optional “filetest” value is a filename or path to be checked for existence during execution. If that file/path does exist when that command is ready to be run, the command is skipped (not run) as part of the execution.
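For example, a step with a file test might look like the following sketch (using samtools as an illustrative command, and assuming that template elements can appear in the file test position, just as they can in the step body):

```
#### index file <file>.bai
samtools index <file>
```

If “<file>.bai” already exists when the command for that file is ready to run, the command is skipped, which supports incremental re-execution of a pipeline.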
The “column(s)” value is what RMS uses to determine what commands will be generated for the pipeline step. RMS takes the spreadsheet data given as input, finds the distinct sets of values in those columns of the spreadsheet, and will create one command for each distinct set. So, if the column(s) value is “Sample”, one command will be created for each sample in the column. If it is “Sample,File”, one command is created for each distinct pair of Sample and File values. If the column value is the special keyword “all”, then RMS will create one command for the step (covering “all” of the spreadsheet data). If a name-value line was defined in the header of the script, that can also be used as a column(s) value, so if this was in the script header:
##kmer=( 15 20 25 30 35 40 45 50 )
then a column(s) value of “Sample,kmer” will create one command for each pair of samples and kmers.
As part of this computation, the rows of the input spreadsheet data are partitioned into the sets of rows for each distinct set of values found (i.e., each command generated for the step will use the spreadsheet rows that contain those specific column(s) values). These rows are what will be used when performing the replacement of the template elements in the body of the step (see below).
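The grouping described above can be sketched in a few lines of Python (illustrative only, not RMS’s actual code):

```python
def partition(rows, columns):
    """Group rows by their distinct combination of values in `columns`;
    each group corresponds to one generated command for the step."""
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in columns)
        groups.setdefault(key, []).append(row)
    return groups

rows = [
    {"Sample": "S1", "File": "S1_reads_R1.fastq"},
    {"Sample": "S1", "File": "S1_reads_R2.fastq"},
    {"Sample": "S2", "File": "S2_reads_R1.fastq"},
]
by_sample = partition(rows, ["Sample"])        # 2 groups -> 2 commands
by_pair = partition(rows, ["Sample", "File"])  # 3 groups -> 3 commands
```

Each group’s rows are what the template elements in that command’s body are filled from.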
Step Options¶
The step option lines are used to tell RMS what resources (cores, memory, tmp space) will be used during the computation of the step commands, as well as telling RMS about the step language and parallelism/recovery.
There are four resource option lines, each of which takes a number as its value (except for ##tmp, which may also take a path; see Writing RMS Scripts - Details). They are the following:
##ppn=5 [number of cores, default 1]
##io=3 [i/o concurrency limit, default no limit]
##mem=40 [main memory limit in GB, default 10]
##tmp=60 [tmp space limit in GB, default 10]
These are not hard limits, but they help RMS run the commands on the compute nodes quickly and safely. Too high a CPU load or too much I/O will slow your computation down significantly; using too much memory on a compute node has, in my experience, been the cause of nearly every “mysterious” crash of a cluster job; and running out of disk space is usually an unrecoverable error. Making an effort to avoid these issues will make your script faster and more robust.
The ##io limit restricts how many commands from the step can run at the same time on the same compute node, in order to avoid thrashing on the I/O channel (as the compute node tries to satisfy all of the reads and writes the commands are making). To get a sense of the problem, try running the following script with and without the ##io line on a compute node (you can substitute a command you know is heavily I/O bound for the samtools sort command), making sure you are running at least as many commands as cores on the compute node, i.e., like “rms -s sort.rms *.bam”:
##argv=bam
#### sort bam
##io=3
samtools sort <bam> > <bam>.sorted.bam
Particularly with nodes that have 10-20 cores, the running time without the ##io line should be much longer than with it (as the ##io limit keeps the overall execution within the capability of the compute node’s and the NFS server’s I/O hardware, so that the file contents can be served up efficiently).
The other option lines commonly used are the lines defining the language for the step:
##python, ##perl, ##R or ##bash
as well as two option lines, ##after and ##redo, that allow for step parallelism and error recovery. The “##after=…” takes a comma-separated list of step names (these steps must be defined earlier in the RMS script), and tells RMS that this step can execute immediately after those steps complete, instead of when the previous step in the script completes. For example, the “scatter-gather” pattern can use step parallelism with a script as follows:
...
#### step1 sample
...
#### step2 sample,file
...
#### step3 sample,file
##after=step1
...
#### step4 sample
##after=step2,step3
...
(Note that this script performs both step parallelism, using the ##after option, and data parallelism, using “sample,file” column-based parallelism for step2 and step3.)
The ##redo option line takes a value which is the number of times the command should be restarted after an error, along with an optional command to run when restarting the command (i.e., to clean up temp files or reset necessary files/values):
##redo=1
##redo=2;rm -f <sample>/tmp_*.bam
RMS will restart any step command that fails to complete (i.e., returns a non-zero exit status). If the command still fails after the given number of restarts, it is considered failed.
Step Lines¶
The remaining lines of the step form the step body: the script lines, written in the step’s language, that RMS turns into commands. They may contain template elements, such as “<column>” or “<column,glob=True>”, and the RMS helper commands (rmssync, rmscp, rmslock).
Writing RMS Scripts - Details¶
Sections are:
- Setup/Command-Line Section
- Environment Section
- Language Initialization Section
- Step Section
- Step Divider Line
- Step Options
- Step Body
- Template Element Format
Setup/Command-Line Section¶
- ##python, ##perl, ##R or ##bash
- ##option="…" or #option=…
- ##lang=…
- ##sheet=EOF … EOF
- ##argv=…
- ##redo=…
- ##totalio=…
- ##outline=…
- ##errline=…
- ##…=(…)
Environment Section¶
Bash-only. Goes at beginning of every command script. Template element substitution occurs in this section.
##local=…
Language Initialization Section¶
Starts with “##init …”, where “…” is “bash”, “perl”, “python”, or “R”.
Each script in that language will begin with lines in this section. Multiple sections for different languages allowed, one section per language.
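For example, a script whose python steps all share the same modules might include an init section like this (the specific imports are just illustrative):

```
##init python
import os
import re
import sys
```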
Step Section¶
Structure of step section. Template element substitution occurs in each section.
- Step Divider Line
- Step Option Lines
- Step Body
Step Divider Line¶
Format structure
- “#### “
- “stepName”
- column or columns (comma-separated, no spaces allowed)
- Path or “-” for existence test (optional, single file only)
Columns must be spreadsheet columns or variables. Multiple columns allowed, separated by commas, no whitespace allowed, or the word “all” for all of the data. RMS creates one command for each distinct value in that column of the input data (or each distinct set of values in the columns of data).
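For example, a divider line using all three fields (step name, columns, and the optional existence-test path) might look like the following; the samtools command and file names are purely illustrative:

```
#### mergeAll all results/merged.bam
samtools merge results/merged.bam <Sample,glob=True>.bam
```

Here one command is created covering all of the spreadsheet data, and it is skipped if results/merged.bam already exists from a previous run.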
Step Option Lines¶
Describe the resources for the step and/or set the options for the step’s execution.
- ##python, ##perl, ##R or ##bash - step language; the default is the ##lang= value if set, otherwise ##bash
- ##after=… - step dependency; this step runs after the listed steps (comma-separated list of previous step names)
- ##ppn=… - number of cores needed by this step
- ##io=… - maximum number of this step’s commands to be run on a single node (for I/O limitations)
- ##mem=… - GB of memory needed by this step
- ##tmp=… - GB of local disk space needed by this step (number or 3x<path> string)
- ##redo=… - number of times to retry the command on failure, with optional reset command
- ##name=column if column=value - set a new RMS variable called “name” to be equal to the set of distinct values where “column=value” is True in the spreadsheet data
- ##outline=… - limit number of lines reported from each command’s stdout
- ##errline=… - limit number of lines reported from each command’s stderr
Step Body¶
Written in the language defined by the step option lines or the default language option. Describes command generation and command script organization (a bash script, with a launcher for a python/perl/R script). Line numbers and error messages. Bash scripts die on the first error (be careful with status logic). RMS commands rmscp, rmssync, rmslock.
Template Element Format¶
Basic template element is “<column>”, where column is a spreadsheet column or RMS variable in the input data. No whitespace is allowed, except as defined in the options below. There are also predefined template elements, set from the environment created for a command:
- <ppn> - Current step’s ppn option value
- <tmp> - Temp directory created for the command’s execution (separate for each command, with auto cleanup)
- <local> - Temp directory created for each compute node (shared across all commands executed on the node), loaded by ##local RMS lines
- <script>, <bin>, <bin..>
RMS performs a string replacement of <column> with the value, or distinct set of values, corresponding to column (as determined by the step’s distinct sets of values). If multiple values are found, the default replacement separates the values with a single space.
Options can appear as comma-separated strings after the column name.
- glob=true|false - use globbing of all non-whitespace text around the element for multiple-value replacement
- sep=',' - use the given separator for multiple values, instead of a space
- quote='"' - surround each value with the given quote character, after globbing and prefix/suffix addition
- prefix="…" - add the given string before each value, whitespace permitted within quotes
- suffix="…" - add the given suffix after each value, whitespace permitted within quotes
Examples of element replacement. If the distinct values of column “sample” are the values “me”, “my” and “mine”, then the following template replacements occur
"ls <sample>" -> "ls me my mine"
"rm <sample>.bam" -> "rm me my mine.bam" (likely not what you want)
"rm <sample,glob=True>.bam" -> "rm me.bam my.bam mine.bam"
"myscript <sample,prefix="-V ",suffix=".bam">" -> "myscript -V me.bam -V my.bam -V mine.bam"
"samples = [ <sample,quote='"',sep=", "> ]" -> "samples = [ "me", "my", "mine" ]" (useful for python)
Recursive replacement is supported, but each replacement operation occurs separately. If column “project” is defined as the single value “prj”, then the following replacements occur:
"ls <project>/<sample>.bam" -> "ls prj/me my mine.bam" (likely not what you want...)
"ls <project>/<sample,glob=True>.bam" -> "ls prj/me.bam prj/my.bam prj/mine.bam"
"ls <sample>/<sample>.bam" -> "ls me my mine/me my mine.bam" (likely not what you want...)
For this last example, there is not currently an RMS way to get what you want, namely “ls me/me.bam my/my.bam mine/mine.bam”, because each replacement occurs separately.
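The replacement rules above, including the one-element-at-a-time recursive behavior, can be sketched in Python. The expand function below is a hypothetical illustration of the documented semantics, not RMS's implementation:

```python
import re

def expand(text, element, values, sep=" ", quote="", prefix="", suffix="", glob=False):
    # Hypothetical sketch of the documented replacement semantics.
    # Each value gets the prefix/suffix added, then the quote characters.
    decorated = [quote + prefix + v + suffix + quote for v in values]
    if not glob:
        return text.replace(element, sep.join(decorated))
    # glob=True: the non-whitespace text around the element is repeated
    # for each value ("rm <s,glob=True>.bam" -> "rm a.bam b.bam").
    pattern = re.compile(r"(\S*)" + re.escape(element) + r"(\S*)")
    return pattern.sub(
        lambda m: sep.join(m.group(1) + d + m.group(2) for d in decorated),
        text)

samples = ["me", "my", "mine"]
expand("ls <sample>", "<sample>", samples)
# -> "ls me my mine"
expand("rm <sample>.bam", "<sample>", samples, glob=True)
# -> "rm me.bam my.bam mine.bam"
expand("myscript <sample>", "<sample>", samples, prefix="-V ", suffix=".bam")
# -> "myscript -V me.bam -V my.bam -V mine.bam"
```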
Language-Specific Template Element Tips¶
For each of the four languages (bash, python, perl and R), here are examples of how you can (1) assign a template element’s values to a variable, (2) perform an if test on a single-value element and (3) loop over the values of a template element. These should be helpful building blocks for communicating between RMS and the step script.
Bash script lines:
PROJECT="<project>"
echo $PROJECT
SAMPLE=( <sample,quote='"'> )
echo ${SAMPLE[0]}
if [ "<project>" == "prj" ] ; then
    echo This is the prj project.
else
    echo This is not the prj project.
fi
for sample in <sample,quote='"'> ; do
    echo $sample
done
Python script lines:
project = "<project>"
print(project)
sample = [ <sample,quote='"',sep=','> ]
print(sample[0])
if "<project>" == "prj":
    print("This is the prj project.")
else:
    print("This is not the prj project.")
for sample in [ <sample,quote='"',sep=','> ]:
    print(sample)
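The bash and python examples above follow the same three-part pattern; analogous lines for the other two step languages might look like the following sketches (the patterns mirror the examples above, though the exact idioms may differ from the original RMS documentation).

Perl script lines:

```
my $project = "<project>";
print "$project\n";
my @samples = ( <sample,quote='"',sep=','> );
print "$samples[0]\n";
if ("<project>" eq "prj") {
    print "This is the prj project.\n";
} else {
    print "This is not the prj project.\n";
}
foreach my $sample ( <sample,quote='"',sep=','> ) {
    print "$sample\n";
}
```

R script lines:

```
project <- "<project>"
print(project)
samples <- c( <sample,quote='"',sep=','> )
print(samples[1])
if (project == "prj") {
    print("This is the prj project.")
} else {
    print("This is not the prj project.")
}
for (s in c( <sample,quote='"',sep=','> )) {
    print(s)
}
```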
RMS Execution Process¶
When RMS runs, it performs the following operations (understanding this can help with writing RMS scripts).
- Reading and parsing the RMS script
- Processing the RMS command options
- Constructing the spreadsheet data
- Executing the step commands
- Determining commands and dependencies
- Writing a command’s script
- Starting and Ending Worker Jobs
- <local> and <tmp> directories
- Running a command
- Collecting stdout and stderr
- Reporting progress
Troubleshooting tips.
Command-Line Help Text¶
RMS executes pipeline scripts on the current machine or across a cluster. Similar to perl and python, the script can be run either as “rms myscript …” or just “myscript” (if the first line of the script file is “#!/usr/bin/env rms”).
rmsScript [conditionally processed arguments]
rms [options] rmsScript [conditionally processed arguments]
The command line arguments after the script file are “conditionally processed”, meaning that if the script defines command line processing, the arguments are processed by the script. If the script does not define command line processing, the following default command line processing occurs:
myscript sheet...
rms [options] myscript sheet...
where “sheet” is one or more tab-, comma- or space-delimited spreadsheet files (see scriptbasic for more details).
The following options tell RMS where to execute the script commands [default mode: cluster]:
- -t, --test - Test the script for syntax errors (by compiling only)
- -s, --single - Run the script sequentially on the current computer
- -p, --parallel - Execute the script just on the current computer (like GNU parallel)
- -c, --cluster - Execute the script across the cluster
In parallel and cluster mode, the number of cluster compute nodes or number of cores can be set using these options, to limit how many commands are executed at the same time [parallel default: 0, cluster default: defined in ~/.rmsrc file, or “default:0” if no ~/.rmsrc file]:
- -n N, --num=N - Limit for the number of nodes to use (cluster mode) or the number of cores to use (parallel mode), where 0 specifies no limit.
- -n queuestr, --num=queuestr - The queues to use and node limits for each queue, as a comma-separated list of “queue[:N]” strings (queue name, plus optional number limit).

For example, “-n default:6,highcore:4” tells RMS it can use up to 6 nodes of the default queue and 4 nodes of the highcore queue, if needed.
The following options limit the steps that are executed (so that a part of the script can be run, instead of the whole script):
- -S step, --start step - Start the pipeline with step “step” (skipping initial steps)
- -E step, --end step - End the pipeline with step “step” (skipping later steps)
- -O step, --only step - Only run step “step” (equivalent to “-S step -E step”)
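For example, assuming a script with steps named step1 through step4, as in the scatter-gather example earlier, partial runs might look like the following (the script and sheet names are illustrative):

```
rms -O step2 myscript.rms samples.txt      # run only step2
rms -S step3 myscript.rms samples.txt      # resume from step3 onward
rms -E step2 myscript.rms samples.txt      # stop after step2 completes
```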
And these options provide additional miscellaneous functions:
- -f, --force - Ignore existing files and force the pipeline commands to run
- -o dir, --output=dir - Set the output directory (and current working directory for the script) to “dir”. [default: .]
- -l prefix, --log=prefix - Log the script execution stdout and stderr to “prefix.stdout” and “prefix.stderr”. Passing “-” outputs the execution stdout/stderr to the command’s stdout/stderr. [default: RMS_myscript_YMD_HMS]
When the command-line explicitly begins with “rms”, as in “rms [options] myscript …”, all of the above options may be specified between “rms” and “myscript” (so that you can specify the runtime execution of the program, even if the script redefines its command-line processing).