uap – Robust, Consistent, and Reproducible Data Analysis¶
Description:
uap executes, controls and keeps track of the analysis of large data sets. It enables users to perform robust, consistent, and reproducible data analysis. uap encapsulates the usage of (bioinformatic) tools and handles data flow and processing during an analysis. Users can use predefined or self-made analysis steps to create custom analyses. Analysis steps encapsulate best-practice usage of bioinformatic software tools. uap focuses on the analysis of high-throughput sequencing (HTS) data, but its plugin architecture allows users to add functionality, so it can be used for any kind of large-scale data analysis.
Usage:
uap is a command-line tool, implemented in Python. It requires a user-defined configuration file, which describes the analysis, as input.
Supported Platforms:
Important Information
uap does NOT include all tools necessary for the data analysis. It expects that the required tools are already installed.
Table of contents¶
Introducing uap¶
Core aspects¶
Robustness¶
- Data is processed in a temporary location. If and only if ALL involved processes exited gracefully, the output files are copied to the final output directory.
- The final output directory names are suffixed with a hash that is computed from the commands executed to generate the output data. Data is not easily overwritten, and this helps to check whether recomputations are necessary.
- Processing can be aborted and continued from the command line at any time. Failures during data processing do not leave the analysis in an unstable state.
- Errors are reported as early as possible (fail-fast). Tools are checked for availability, and the entire processing pipeline is calculated in advance before jobs are started or submitted to a cluster.
Consistency¶
- Steps and files are defined in a directed acyclic graph (DAG). The DAG defines dependencies between in- and output files.
- Prior to any execution the dependencies between files are calculated. If a file is newer, or an option for a calculation has changed, all dependent files are marked for recalculation.
Reproducibility¶
- Comprehensive annotations are written to the output directories. They allow for later investigation of errors or review of executed commands. They also contain the versions of the used tools, the required runtime, memory and CPU usage, etc.
Usability¶
- Single configuration file describes entire processing pipeline.
- Single command-line tool interacts with the pipeline. It can be used to execute, monitor, and analyse the pipeline.
Software Design¶
uap is designed as a plugin architecture. The plugins are called steps because they correspond to the steps of an analysis.
Source and Processing Steps: Building Blocks of the Analysis¶
There are two different types of steps: source steps and processing steps. Source steps are used to include data from outside the destination path (see destination_path Section) into the analysis. Processing steps are blueprints that describe, on an abstract level, how to process input data into output data. uap controls the ordered execution of the steps as defined in the analysis configuration file.
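As a hedged sketch, the contrast looks like this in a configuration file (the step names and the pattern path are illustrative; both step types reappear in the examples later in this document):
steps:
    # source step: brings external FASTQ files into the analysis
    input_step (fastq_source):
        pattern: /path/to/fastq-files/*.gz
    # processing step: consumes the output of the source step
    cutadapt:
        _depends: input_step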
Runs: Atomic Units of the Analysis¶
Steps define runs which represent the concrete commands for a part of the analysis. You can think of steps as classes and runs as their instances, as in object-oriented programming. A run is an atomic unit of the analysis: it can only succeed or fail entirely. Typically a single run computes the data of a single sample. Runs compute output files from input files and provide these output files to subsequent steps via so-called connections.
Connections: Propagation of Data¶
Connections are like tubes that connect steps. A step can have any number of connections, and runs have to assign output file(s) to each connection of the step. If a run can not assign files to a connection, it has to define the connection as empty. Downstream steps can access the connections to find out which run created which file.
The names of the connections can be chosen arbitrarily. A name should not just be the file format of the contained files but a description of their content. For example, an out/alignment connection can contain gzipped SAM and/or BAM files. That is why the file type is often checked in steps and influences the issued commands or set parameters.
Analysis as Directed Acyclic Graph¶
The steps and connections are the building blocks of the analysis graph. Steps are the nodes and connections are the edges of the analysis graph. That graph has to be a directed acyclic graph (DAG). This implies that every step has one or more parent steps, which may in turn have parents themselves. The analysis graph is not allowed to contain cycles. Steps without parents have to be source steps. They provide the initial input data, for example FASTQ files with raw sequencing reads, genome sequences, genome annotations, etc.
Recommended uap Workflow¶
The recommended workflow to analyse data with uap is:
- Install uap (see Installation of uap)
- Optionally: Extend uap by adding new steps (see Add New Functionality)
- Write a configuration file to setup your analysis (see Analysis Configuration File)
- Start the analysis locally (see run-locally) or submit it to a cluster (see submit-to-cluster)
- Follow the progress of the analysis (see status)
- Share your extensions with the public (send us a pull request via github)
A finished analysis leaves the user with:
- The original input files (which are, of course, left untouched).
- The experiment-specific configuration file (see Analysis Configuration File). You should keep this configuration file for later reference and you could even make it publicly available along with your input files for anybody to re-run the entire data analysis or parts thereof.
- The output files and comprehensive annotations of the analysis (see Annotation Files). These files are stored in the destination path defined in the configuration file.
Quick Start uap¶
At first, you need to install uap (see Installation of uap). After a successful installation, example analyses can be found in the example-configurations folder.
Let’s jump head first into uap and have a look at some examples:
$ cd <uap-path>/example-configurations/
$ ls *.yaml
2007-CD4+_T_Cell_ChIPseq-Barski_et_al_download.yaml
2007-CD4+_T_Cell_ChIPseq-Barski_et_al.yaml
2014-RNA_CaptureSeq-Mercer_et_al_download.yaml
2014-RNA_CaptureSeq-Mercer_et_al.yaml
download_human_gencode_release.yaml
index_homo_sapiens_hg19_genome.yaml
index_mycoplasma_genitalium_ASM2732v1_genome.yaml
These example configurations differ in their usage of computational resources. Some example configurations download or work on small datasets and are thus feasible for machines with limited resources. Others require a very powerful stand-alone machine or a cluster system. The examples are marked accordingly in the examples below.
Handle Genomic Data¶
A usual analysis of High-Throughput Sequencing (HTS) data relies on different publicly available data. Most important is probably the genomic sequence of the species under investigation. That sequence is required to construct the indices (data structures used by read aligners). Other publicly available data sets (such as reference annotations or the chromosome sizes) might also be required for an analysis. The following configurations showcase how to get or generate that data:
index_mycoplasma_genitalium_ASM2732v1_genome.yaml
- Downloads the Mycoplasma genitalium genome, generates the indices for bowtie2, bwa, segemehl, and samtools. This workflow is quite fast because it uses the very small genome of Mycoplasma genitalium.
index_homo_sapiens_hg19_genome.yaml
- Downloads the Homo sapiens genome, generates the indices for bowtie2, bwa, and samtools. This workflow requires substantial computational resources due to the size of the human genome. The segemehl index creation is commented out due to its high memory consumption (~50-60 GB). Please make sure to only run it on well equipped machines.
download_human_gencode_release.yaml
- Downloads the human Gencode main annotation v24 and a subset for long non-coding RNA genes. This workflow only downloads files from the internet and thus should work on any machine.
Let’s have a look at the Mycoplasma genitalium example workflow by checking its status:
$ cd <uap-path>/example-configurations/
$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml status
[uap] Set log level to ERROR
[uap][ERROR]: index_mycoplasma_genitalium_ASM2732v1_genome.yaml: Destination path does not exist: genomes/bacteria/Mycoplasma_genitalium/
Oops, the destination_path does not exist (see destination_path Section). Create it and start again:
$ mkdir -p genomes/bacteria/Mycoplasma_genitalium/
$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml status
Waiting tasks
-------------
[w] bowtie2_index/Mycoplasma_genitalium_index-download
[w] bwa_index/Mycoplasma_genitalium_index-download
[w] fasta_index/download
[w] segemehl_index/Mycoplasma_genitalium_genome-download
Ready tasks
-----------
[r] M_genitalium_genome/download
tasks: 5 total, 4 waiting, 1 ready
A list with all runs and their respective state should be displayed. A run is always in one of these states:
[r]eady
[w]aiting
[q]ueued
[e]xecuting
[f]inished
If the command still fails, please check that the tools defined in index_mycoplasma_genitalium_ASM2732v1_genome.yaml are available in your environment (see the tools Section).
If you really want to download and index the genome, tell uap to start the workflow:
$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml run-locally
uap should have created a symbolic link named index_mycoplasma_genitalium_ASM2732v1_genome.yaml-out pointing to the destination_path. The content should look something like this:
$ tree --charset=ascii
.
|-- bowtie2_index
| |-- Mycoplasma_genitalium_index-download-cMQPtBxs
| | |-- Mycoplasma_genitalium_index-download.1.bt2
| | |-- Mycoplasma_genitalium_index-download.2.bt2
| | |-- Mycoplasma_genitalium_index-download.3.bt2
| | |-- Mycoplasma_genitalium_index-download.4.bt2
| | |-- Mycoplasma_genitalium_index-download.rev.1.bt2
| | `-- Mycoplasma_genitalium_index-download.rev.2.bt2
| `-- Mycoplasma_genitalium_index-download-ZsvbSjtK
| |-- Mycoplasma_genitalium_index-download.1.bt2
| |-- Mycoplasma_genitalium_index-download.2.bt2
| |-- Mycoplasma_genitalium_index-download.3.bt2
| |-- Mycoplasma_genitalium_index-download.4.bt2
| |-- Mycoplasma_genitalium_index-download.rev.1.bt2
| `-- Mycoplasma_genitalium_index-download.rev.2.bt2
|-- bwa_index
| `-- Mycoplasma_genitalium_index-download-XRyj5AnJ
| |-- Mycoplasma_genitalium_index-download.amb
| |-- Mycoplasma_genitalium_index-download.ann
| |-- Mycoplasma_genitalium_index-download.bwt
| |-- Mycoplasma_genitalium_index-download.pac
| `-- Mycoplasma_genitalium_index-download.sa
|-- fasta_index
| `-- download-HA439DGO
| `-- Mycoplasma_genitalium.ASM2732v1.fa.fai
|-- M_genitalium_genome
| `-- download-5dych7Xj
|-- Mycoplasma_genitalium.ASM2732v1.fa
|-- segemehl_index
| |-- Mycoplasma_genitalium_genome-download-2UKxxupJ
| | |-- download-segemehl-generate-index-log.txt
| | `-- Mycoplasma_genitalium_genome-download.idx
| `-- Mycoplasma_genitalium_genome-download-zgtEpQmV
| |-- download-segemehl-generate-index-log.txt
| `-- Mycoplasma_genitalium_genome-download.idx
`-- temp
Congratulations, you’ve finished your first uap workflow!
Go on and try to run some more workflows. Most examples require the human genome, so you might turn to the index_homo_sapiens_hg19_genome.yaml workflow from here:
$ uap index_homo_sapiens_hg19_genome.yaml status
[uap] Set log level to ERROR
[uap][ERROR]: Output directory (genomes/animalia/chordata/mammalia/primates/homo_sapiens/hg19/chromosome_sizes) does not exist. Please create it.
$ mkdir -p genomes/animalia/chordata/mammalia/primates/homo_sapiens/hg19/chromosome_sizes
$ uap index_homo_sapiens_hg19_genome.yaml run-locally
<Analysis starts>
Again you need to create the output folder (you get the idea). Be aware that by default only the smallest chromosome, chromosome 21, is downloaded and indexed. This reduces the required memory and computation time. If you uncomment the download steps for the other chromosomes, the index for the complete genome will be created.
Sequencing Data Analysis¶
Now that you possess the genome sequences, indices, and annotations, let’s have a look at some example analyses.
General Steps¶
The analysis of high-throughput sequencing (HTS) data usually starts with some basic steps.
- Conversion of the raw sequencing data to, most likely, fastq(.gz) files
- Removal of adapter sequences from the sequencing reads
- Alignment of the sequencing reads onto the reference genome
These basic steps can be followed up with a lot of different analysis steps. The following analysis examples illustrate how to perform the basic as well as some more specific steps.
RNAseq Example – Reanalysing Data from Mercer et al., Nature Protoc. (2014)¶
RNAseq analysis often aims at the discovery of differentially expressed (known) transcripts. Therefore, mapped reads for at least two different samples have to be available.
- Differential Expression Analysis
- Get an annotation set (e.g. genes, transcripts, ...)
- Count the number of reads overlapping the annotation
- Perform statistical analysis, based on counts
Another common analysis performed with RNAseq data is the identification of novel transcripts. This approach is useful to identify tissue-specific transcripts.
- De novo Transcript Assembly
- Apply transcript assembly tool on mapped reads
2014-RNA_CaptureSeq-Mercer_et_al_download.yaml
- Downloads the data published in the paper Mercer et al., Nature Protoc. (2014).
2014-RNA_CaptureSeq-Mercer_et_al.yaml
The downloaded FASTQ files are analysed by FastQC and FASTX-Toolkit. The reads are then mapped to the human genome with tophat2 and sorted by position using samtools. htseq-count is used to count the mapped reads for every exon of the annotation. cufflinks is used to perform de novo transcript assembly. The usage of segemehl is disabled by default, but it can be enabled and combined with the cufflinks de novo transcript assembly by employing our s2c python script.
This workflow is not going to work, because the initial data set is too small.
ChIPseq Example – Reanalysing Data from Barski et al., Cell (2007)¶
ChIPseq analysis aims at the discovery of genomic loci at which protein(s) of interest were bound. The experiment is an enrichment procedure using specific antibodies. The enrichment detection is normally performed by so-called peak calling programs. The data is prone to duplicate reads from PCR due to the relatively low amounts of input DNA. So these steps follow the basic ones:
- Duplicate removal
- Peak calling
The analysis of data published in the paper Barski et al., Cell (2007) is contained in these files:
2007-CD4+_T_Cell_ChIPseq-Barski_et_al_download.yaml
- Downloads the data published in the paper Barski et al., Cell (2007).
2007-CD4+_T_Cell_ChIPseq-Barski_et_al.yaml
At first, the downloaded FASTQ files are grouped by sample and all files per sample are merged. Sequencing quality is controlled by FastQC and FASTX-Toolkit. Adapter sequences are removed from the reads before they are mapped to the human genome. Reads are mapped with bowtie2, bwa, and tophat2. Again, mapping with segemehl is disabled by default due to its high resource requirements. Library complexity is estimated using preseq. After the mapping, duplicate reads are removed using Picard. Finally, enriched regions are detected with MACS2.
This workflow will take some time due to the number of steps and multiple mapping tools used.
Create Your Own Workflow¶
Finished checking out the examples? Then go ahead and create your own workflow. Although writing the configuration may seem a bit complicated at first, the trouble pays off later because further interaction with the pipeline is quite simple. The structure and content of the configuration files are described in detail on another page (see Analysis Configuration File).
Installation of uap¶
Prerequisites¶
The installation requires virtualenv, git and graphviz. Please install them if they are not already installed:
$ sudo apt-get install python-virtualenv git graphviz
uap does NOT include any tools necessary for the data analysis. It is expected that the required tools are already installed.
Downloading the Software¶
Download the software from uap's github repository like this:
$ git clone https://github.com/kmpf/uap.git
Setting Up Python Environment¶
After cloning the repository, change into the created directory and run the bootstrapping script bootstrap.sh:
$ cd uap
$ ./bootstrap.sh
The script creates the required Python environment (which will be located in ./python_env/). Afterwards it installs PyYAML, NumPy, biopython and psutil into the freshly created environment.
There is no harm in accidentally running this script multiple times.
Making uap Globally Available¶
uap can be used globally. On Unix-type operating systems it is advised to add the installation path to your $PATH variable. To do so, change into the uap directory and execute:
$ echo ""PATH=$PATH:$(pwd)" >> ~/.bashrc
$ source ~/.bashrc
OR
$ echo ""PATH=$PATH:$(pwd)" >> ~/.bash_profile
$ source ~/.bash_profile
Analysis Configuration File¶
uap requires a YAML file which contains all information about the data analysis. These files are called configuration files.
A configuration file describes a complete analysis. Configurations consist of four mandatory sections and an optional one (let’s just call them sections, although technically they are keys):
Mandatory Sections
- destination_path – points to the directory where the result files, annotations and temporary files are written to
- constants – defines constants for later use (define repeatedly used values as constants to increase readability of the following sections)
- steps – defines the source and processing steps and their order
- tools – defines all tools used in the analysis and how to determine their versions (for later reference)
Optional Sections
- cluster – if uap is required to run on a HPC cluster, some default parameters can be set here
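A hedged skeleton combining these sections might look as follows; all paths, step names, and option values are placeholders taken from examples elsewhere in this document:
destination_path: "/path/to/workflow/output"

constants:
    - &my_constant /path/to/some/often/used/file

steps:
    input_step (fastq_source):
        pattern: /path/to/fastq-files/*.gz

tools:
    cat:
        path: cat
        get_version: --version

cluster:
    default_submit_options: "-cwd -S /bin/bash"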
Please refer to the YAML definition for the correct notation used in that file.
Sections of a Configuration File¶
destination_path Section¶
The value of destination_path is the directory where uap is going to store the created files.
destination_path: "/path/to/workflow/output"
constants Section¶
This section is the place where repeatedly used constants should be defined. For instance, absolute paths to genome index files can be defined as constants.
- &genome_faidx
genomes/animalia/chordata/mammalia/primates/homo_sapiens/hg19/samtools_faidx/hg19_all_chr_UCSC-download-B7ceRp9K/hg19_all_chr_UCSC-download.fasta.fai
Later on, the value can be reused by typing *genome_faidx. There are no restrictions on what can be defined here.
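As a hedged illustration of the anchor/alias mechanism (the step name and option name below are hypothetical; the YAML anchor and alias syntax itself works exactly like this):
constants:
    # define the value once under an anchor ...
    - &genome_faidx
      genomes/animalia/chordata/mammalia/primates/homo_sapiens/hg19/samtools_faidx/hg19_all_chr_UCSC-download-B7ceRp9K/hg19_all_chr_UCSC-download.fasta.fai

steps:
    some_step:
        # ... and reuse it anywhere via the alias
        genome-faidx: *genome_faidx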
steps Section¶
The steps section is the core of the analysis file, because it defines when steps are executed and how they depend on each other. All available steps are described in detail in the steps documentation: Available steps.
The steps section contains an entry (technically a key) for every step. Every step name must be unique.
Note
Please be aware that PyYAML, the YAML parser used by uap, does not complain about duplicate keys. It silently drops one of the duplicates without giving an error.
There are two ways to name a step to allow multiple steps of the same type and still ensure unique naming:
steps:
# here, the step name is unchanged, it's a cutadapt step which is also
# called 'cutadapt'
cutadapt:
... # options following
# here, we also insert a cutadapt step, but we give it a different name:
# 'clip_adapters'
clip_adapters (cutadapt):
... # options following
Now let’s have a look at the two different types of steps which constitute an uap analysis.
Source Steps¶
Source steps are the only steps which are allowed to use or create data outside the destination_path.
Features of source steps:
- they provide the input files for the following steps
- they can start processes e.g. to download files or demultiplex reads
- they do not depend on previous steps
- they are the root nodes of the analysis graph
If you want to work with FASTQ files, you should use the fastq_source step to import the required files. Such a step definition would look like this:
steps:
input_step (fastq_source):
pattern: /Path/to/fastq-files/*.gz
group: ([SL]\w+)_R[12]-00[12].fastq.gz
sample_id_prefix: MyPrefix
first_read: '_R1'
second_read: '_R2'
paired_end: True
The options of the fastq_source step are described at Available steps. The group option takes a regular expression (regexp). You can test your regular expression at pythex.org.
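As a hedged Python sketch of how this regexp groups files into samples (the file names are made up for illustration):
import re

# the 'group' regexp from the step definition above
pattern = r'([SL]\w+)_R[12]-00[12].fastq.gz'

# two hypothetical file names matched by the 'pattern' glob
for name in ['L0042_R1-001.fastq.gz', 'L0042_R2-001.fastq.gz']:
    match = re.search(pattern, name)
    # both print 'L0042', so the files are grouped into one sample
    print(match.group(1))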
Processing Steps¶
Processing steps depend upon one or more preceding steps. They use their output files and process them. Output files of processing steps are automatically named and saved by uap. A complete list of available options per step can be found at Available steps or by using the steps subcommand (see steps Subcommand).
Reserved Keywords for Steps¶
_depends:
Dependencies are defined via the _depends key, which may either be null, a step name, or a list of step names.
steps:
# the source step which depends on nothing
fastq_source:
# ...
run_folder_source:
# ...
# the first processing step, which depends on the source step
cutadapt:
_depends: [fastq_source, run_folder_source]
# the second processing step, which depends on the cutadapt step
fix_cutadapt:
_depends: cutadapt
_connect:
Normally, steps connected with _depends pass data along by defining so-called connections. If the name of an output connection matches the name of an input connection of a succeeding step, the data gets passed on automatically. But sometimes the user wants to connect differently named connections. This can be done with the _connect keyword. A common usage is to connect downloaded data with a processing step.
steps:
# Source step to download i.e. sequence of chr1 of some species
chr1 (raw_url_source):
...
# Download chr2 sequence
chr2 (raw_url_source):
...
merge_fasta_files:
_depends:
- chr1
- chr2
# Equivalent to:
# _depends: [chr1, chr2]
_connect:
in/sequence:
- chr1/raw
- chr2/raw
# Equivalent to:
# _connect:
# in/sequence: [chr1/raw, chr2/raw]
The example shows how the ``raw_url_source`` output connection ``raw`` is connected to the input connection ``sequence`` of the ``merge_fasta_files`` step.
_BREAK:
If you want to cut off entire branches of the step graph, set the _BREAK flag in a step definition. This will force the step to produce no runs (which in turn gives all following steps nothing to do, thereby effectively disabling them):
steps:
fastq_source:
# ...
cutadapt:
_depends: fastq_source
# this step and all following steps will not be executed
fix_cutadapt:
_depends: cutadapt
_BREAK: true
_volatile:
Steps can be marked with _volatile: yes. This flag tells uap that the output files of the marked step are only intermediate results.
steps:
# the source step which depends on nothing
fastq_source:
# ...
# this steps output can be deleted if all depending steps are finished
cutadapt:
_depends: fastq_source
_volatile: yes
# same as:
# _volatile: True
# if fix_cutadapt is finished the output files of cutadapt can be
# volatilized
fix_cutadapt:
_depends: cutadapt
If all steps depending on the intermediate step are finished, uap tells the user that disk space can be freed. The message is output when the status is checked and looks like this:
Hint: You could save 156.9 GB of disk space by volatilizing 104 output files.
Call 'uap <project-config>.yaml volatilize --srsly' to purge the files.
uap is going to replace the output files with placeholder files if the user executes the volatilize command.
_cluster_submit_options
This string contains the entire submit options which will be set in the submit script. This option allows you to overwrite the value set in default_submit_options.
_cluster_pre_job_command
This string contains command(s) that are executed BEFORE uap is started on the cluster. This option allows you to overwrite the value set in default_pre_job_command.
_cluster_post_job_command
This string contains command(s) that are executed AFTER uap has finished on the cluster. This option allows you to overwrite the value set in default_post_job_command.
_cluster_job_quota
This option defines the number of jobs of the same type that can run simultaneously on a cluster. This option allows you to overwrite the value set in default_job_quota. A sketch of a step using these keys is shown below.
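A hedged sketch of a step overriding these cluster defaults; the step name, resource requests, and commands are placeholders:
steps:
    tophat2:
        _depends: cutadapt
        # overrides default_submit_options for this step only
        _cluster_submit_options: "-pe smp 8 -cwd -S /bin/bash -l h_vmem=8G"
        # overrides default_pre_job_command and default_post_job_command
        _cluster_pre_job_command: "echo 'Starting a tophat2 run!'"
        _cluster_post_job_command: "echo 'Finished a tophat2 run!'"
        # overrides default_job_quota
        _cluster_job_quota: 2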
tools Section¶
The tools section lists all programs required for the execution of a particular analysis. An example tool configuration looks like this:
An example tool configuration looks like this:
tools:
# you don't have to specify a path if the tool can be found in $PATH
cat:
path: cat
get_version: --version
module_load:
# you have to specify a path if the tool can not be found in $PATH
some-tool:
path: /path/to/some-tool
get_version: --version
pigz:
path: pigz
get_version: --version
exit_code: 0
uap uses the path, get_version, and exit_code information to control the availability of a tool. This is particularly useful on cluster systems where software can be dynamically loaded and unloaded. uap logs the version of every used tool. If get_version and exit_code are not set, uap tries to determine the version by calling the program without command-line arguments.
get_version is the command-line argument (e.g. --version) required to get the version information. exit_code is the value returned by echo $? after trying to determine the version, e.g. by running pigz --version. If not set, exit_code defaults to 0.
uap can use the module system if you are working on a cluster system (e.g. UGE or SLURM). The configuration for pigz would change a bit:
tools:
pigz:
path: pigz
get_version: --version
exit_code: 0
module_load: /path/to/modulecmd python load pigz
module_unload: /path/to/modulecmd python unload pigz
As you can see, you need to get the /path/to/modulecmd. So let’s investigate what happens when a module is loaded or unloaded:
$ module load <module-name>
$ module unload <module-name>
As far as I know, module is neither a command nor an alias; it is a BASH function. So use declare -f to find out what it is actually doing:
$ declare -f module
The output should look like this:
module ()
{
eval `/usr/local/modules/3.2.10-1/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
}
Another possible output is:
module ()
{
eval $($LMOD_CMD bash "$@");
[ $? = 0 ] && eval $(${LMOD_SETTARG_CMD:-:} -s sh)
}
In this case you have to look in $LMOD_CMD for the required path:
$ echo $LMOD_CMD
/usr/local/modules/3.2.10-1/Modules/$MODULE_VERSION/bin/modulecmd
You can use this path to assemble the module_load and module_unload options for pigz. Just replace $MODULE_VERSION with the current version of the module system.
tools:
pigz:
path: pigz
get_version: --version
exit_code: 0
module_load: /usr/local/modules/3.2.10-1/Modules/$MODULE_VERSION/bin/modulecmd python load pigz
module_unload: /usr/local/modules/3.2.10-1/Modules/$MODULE_VERSION/bin/modulecmd python unload pigz
Note
Use python instead of bash when loading modules via uap, because the module is loaded from within a Python environment and not from within a BASH shell.
cluster Section¶
The cluster section is required only if the analysis is executed on a system using a cluster engine like UGE or SLURM. This section interacts tightly with the submit script template (see Submit Script Template).
An example cluster section looks like this:
cluster:
default_submit_options: "-pe smp #{CORES} -cwd -S /bin/bash -m as -M me@example.com -l h_rt=1:00:00 -l h_vmem=2G"
default_pre_job_command: "echo 'Started the run!'"
default_post_job_command: "echo 'Finished the run!'"
default_job_quota: 5
default_submit_options
This is the default submit options string which replaces the #{SUBMIT_OPTIONS} placeholder in the submit script template. It is mandatory to set this value.
default_pre_job_command
This string contains the default commands which will be executed BEFORE uap is started on the cluster. It will replace the #{PRE_JOB_COMMAND} placeholder in the submit script template. If multiple commands shall be executed, separate them with \n. It is optional to set this value.
default_post_job_command
This string contains the default commands which will be executed AFTER uap has finished on the cluster. It will replace the #{POST_JOB_COMMAND} placeholder in the submit script template. If multiple commands shall be executed, separate them with \n. It is optional to set this value.
default_job_quota:
This option defines the number of jobs of the same type that can run simultaneously on a cluster. The number influences the way uap sets the job dependencies of submitted jobs. Setting this value is optional; if it is not provided, it defaults to 5.
Example Configurations¶
Example configurations can be found in uap’s example-configurations folder.
More information about these examples can be found in Quick Start uap.
Cluster Configuration File¶
The cluster configuration file resides at:
$ ls -la $(dirname $(which uap))/cluster/cluster-specific-commands.yaml
This YAML file contains a dictionary for every cluster type. An example file is shown here:
# Configuration for a UGE cluster engine
uge:
# Command to get version information
identity_test: ['qstat', '-help']
# The expected output of identity_test for this cluster engine
identity_answer: 'UGE'
# Command to submit job
submit: 'qsub'
# Command to check job status
stat: 'qstat'
# Relative path to submit script template
# The path has to be relative to:
# $ dirname $(which uap)
template: 'cluster/submit-scripts/qsub-template.sh'
# way to define job dependencies
hold_jid: '-hold_jid'
# Separator for job dependencies
hold_jid_separator: ';'
# Option to set job names
set_job_name: '-N'
# Option to set path of stderr file
set_stderr: '-e'
# Option to set path of stdout file
set_stdout: '-o'
# Regex to extract Job ID after submission
parse_job_id: 'Your job (\d+)'
# Configuration for a SLURM cluster engine
slurm:
identity_test: ['sbatch', '--version']
identity_answer: 'slurm'
submit: 'sbatch'
stat: 'squeue'
template: 'cluster/submit-scripts/sbatch-template.sh'
hold_jid: '--dependency=afterany:%s'
hold_jid_separator: ':'
set_job_name: '--job-name=%s'
set_stderr: '-e'
set_stdout: '-o'
parse_job_id: 'Submitted batch job (\d+)'
Let’s go over the options which need to be set per cluster engine:
identity_test:
- Command used to determine if uap has been started on a system running a cluster engine, e.g. sbatch --version.
identity_answer:
- uap checks if the output of the identity_test command starts with this value, e.g. slurm. If that is true, the cluster type has been detected.
submit:
- Command to submit a job onto the cluster, e.g. sbatch.
stat:
- Command to check the status of jobs on the cluster, e.g. squeue.
template:
- Path to the submit script template which has to be used for this cluster type, e.g. cluster/submit-scripts/sbatch-template.sh.
hold_jid:
- Option given to the submit command to define dependencies between jobs, e.g. --dependency=afterany:%s. The placeholder %s, if present, gets replaced with the jobs this job depends on.
hold_jid_separator:
- Separator used to concatenate multiple jobs for hold_jid, e.g. :.
set_job_name:
- Option given to the submit command to set the job name, e.g. --job-name=%s. %s, if present, is replaced by the job name.
set_stderr:
- Option given to the submit command to set the name of the stderr file, e.g. -e.
set_stdout:
- Option given to the submit command to set the name of the stdout file, e.g. -o.
parse_job_id:
- Python regular expression whose first parenthesized subgroup represents the cluster job ID, e.g. Submitted batch job (\d+).
Submit Script Template¶
The submit script template contains placeholders which are replaced with the actual commands and options when a job is submitted to the cluster.
The submit script templates reside at:
$ ls $(dirname $(which uap))/cluster/submit-scripts/*
qsub-template.sh
sbatch-template.sh
Feel free to add your own templates. The templates need to contain the following placeholders:
#{SUBMIT_OPTIONS}
- Will be replaced with the step's _cluster_submit_options value (see _cluster_submit_options), if present, or the default_submit_options value.
#{PRE_JOB_COMMAND}
- Will be replaced with the step's _cluster_pre_job_command value (see _cluster_pre_job_command), if present, or the default_pre_job_command value.
#{COMMAND}
- Will be replaced with uap <project-config>.yaml run-locally <run ID>.
#{POST_JOB_COMMAND}
- Will be replaced with the step's _cluster_post_job_command value (see _cluster_post_job_command), if present, or the default_post_job_command value.
The submit script template is required by submit-to-cluster for job submission to the cluster.
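A hedged sketch of what a minimal SLURM-style template could look like; the #SBATCH prefix is an assumption here, and the templates shipped with uap may contain additional engine-specific boilerplate:
#!/usr/bin/env bash
#SBATCH #{SUBMIT_OPTIONS}

#{PRE_JOB_COMMAND}
#{COMMAND}
#{POST_JOB_COMMAND}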
Command-Line Usage of uap¶
uap uses Python’s argparse. Therefore, uap provides help information on the command-line:
$ uap -h
usage: uap [-h] [-v] [--version]
[<project-config>.yaml]
{fix-problems,render,run-locally,status,steps,submit-to-cluster,run-info,volatilize}
...
This script starts and controls 'uap' analysis.
positional arguments:
<project-config>.yaml
Path to YAML file that contains the pipeline configuration.
The content of that file needs to follow the documentation.
optional arguments:
-h, --help show this help message and exit
-v, --verbose Increase output verbosity
--version Display version information.
subcommands:
Available subcommands.
{fix-problems,render,run-locally,status,steps,submit-to-cluster,run-info,volatilize}
fix-problems Fixes problematic states by removing stall files.
render Renders DOT-graphs displaying information of the analysis.
run-locally Executes the analysis on the local machine.
status Displays information about the status of the analysis.
steps Displays information about the steps available in uap.
submit-to-cluster Submits the jobs created by uap to a cluster
run-info Displays information about certain source or processing runs.
volatilize Saves disk space by volatilizing intermediate results
For further information please visit http://uap.readthedocs.org/en/latest/
For citation use ...
Almost all subcommands require a YAML configuration file (see Analysis Configuration File), except for uap steps, which works independently of an analysis configuration file.
Every time uap is started with an analysis configuration file, the following actions happen:
- Configuration file is read
- Tools given in the tools section are checked
- Input files are checked
- The states of all runs are calculated
If any of these steps fail, uap will exit and print an error message.
uap will create a symbolic link called <project-config>.yaml-out, if it does not exist already, pointing to the destination path. The symbolic link is created in the directory containing <project-config>.yaml.
There are a couple of global command line parameters which are valid for all scripts (well, actually, it’s only one):
--even-if-dirty (or short: --even):
- If this parameter appears, uap will work even if uncommitted changes to its source code are detected; uap would otherwise immediately stop working. If you specify this flag, the repository’s state is recorded in all annotation files created by this process. A full Git diff is included as well.
Subcommands¶
Here is an overview of all available subcommands.
steps Subcommand¶
The steps subcommand lists all available source and processing steps:
$ uap steps -h
usage: uap [<project-config>.yaml] steps [-h] [--even-if-dirty] [--show STEP]
This script displays by default a list of all steps the pipeline can use.
optional arguments:
-h, --help show this help message and exit
--even-if-dirty This option must be set if the local git repository
contains uncommited changes.
Otherwise uap will not run.
--show STEP Show the details of a specific step.
status Subcommand¶
The status subcommand lists all runs of an analysis. A run describes the concrete processing of a sample by a step. Samples are usually defined at the source steps and are then propagated through the analysis. Here is the help message:
$ uap <project-config>.yaml status -h
usage: uap [<project-config>.yaml] status [-h] [--even-if-dirty]
[--cluster CLUSTER] [--summarize]
[--graph] [--sources]
[-r [RUN [RUN ...]]]
This script displays by default information about all runs of the pipeline as
configured in '<project-config>.yaml'. But the displayed information can be
narrowed down via command line options.
IMPORTANT: Hints given by this script are just valid if the jobs were
submitted to the cluster.
optional arguments:
-h, --help show this help message and exit
--even-if-dirty This option must be set if the local git repository
contains uncommited changes.
Otherwise uap will not run.
--cluster CLUSTER Specify the cluster type. Default: [auto].
--summarize Displays summarized information of the analysis.
--graph Displays the dependency graph of the analysis.
--sources Displays only information about the source runs.
-r [RUN [RUN ...]], --run [RUN [RUN ...]]
The status of these runs are displayed.
At any time, each run is in one of the following states:
[w]aiting
– the run is waiting for input files to appear, or its input files are not up-to-date regarding their respective dependencies
[r]eady
– all input files are present and up-to-date regarding their upstream input files (and so on, recursively); the run is ready and can be started
[q]ueued
– the run is currently queued and will be started “soon” (only available if you use a compute cluster)
[e]xecuting
– the run is currently running on this or another machine
[f]inished
– all output files are in place and up-to-date
Here is an example output:
$ uap <project-config>.yaml status
Waiting tasks
-------------
[w] fasta_index/download
[w] segemehl_index/Mycoplasma_genitalium_genome-download
Ready tasks
-----------
[r] bowtie2_index/Mycoplasma_genitalium_index-download
[r] bwa_index/Mycoplasma_genitalium_index-download
Finished tasks
--------------
[f] M_genitalium_genome/download
tasks: 5 total, 2 waiting, 2 ready, 1 finished
To get a more concise summary, specify --summarize:
$ uap <project-config>.yaml status --summarize
Waiting tasks
-------------
[w] 1 fasta_index
[w] 1 segemehl_index
Ready tasks
-----------
[r] 1 bowtie2_index
[r] 1 bwa_index
Finished tasks
--------------
[f] 1 M_genitalium_genome
tasks: 5 total, 2 waiting, 2 ready, 1 finished
... or print a fancy ASCII art graph with --graph:
$ uap <project-config>.yaml status --graph
M_genitalium_genome (raw_url_source) [1 finished]
└─│─│─│─bowtie2_index (bowtie2_generate_index) [1 ready]
└─│─│─bwa_index (bwa_generate_index) [1 ready]
└─│─fasta_index (samtools_faidx) [1 waiting]
└─segemehl_index (segemehl_generate_index) [1 waiting]
Detailed information about a specific task can be obtained by specifying the run ID on the command line:
$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml status -r \
bowtie2_index/Mycoplasma_genitalium_index-download
[uap] Set log level to ERROR
output_directory: genomes/bacteria/Mycoplasma_genitalium/bowtie2_index/Mycoplasma_genitalium_index-download-ZsvbSjtK
output_files:
out/bowtie_index:
Mycoplasma_genitalium_index-download.1.bt2: &id001
- genomes/bacteria/Mycoplasma_genitalium/Mycoplasma_genitalium.ASM2732v1.fa
Mycoplasma_genitalium_index-download.2.bt2: *id001
Mycoplasma_genitalium_index-download.3.bt2: *id001
Mycoplasma_genitalium_index-download.4.bt2: *id001
Mycoplasma_genitalium_index-download.rev.1.bt2: *id001
Mycoplasma_genitalium_index-download.rev.2.bt2: *id001
private_info: {}
public_info: {}
run_id: Mycoplasma_genitalium_index-download
state: FINISHED
This is the known data for the run bowtie2_index/Mycoplasma_genitalium_index-download. It contains information about the output folder, the output files and the input files they depend on, as well as the run ID and the run state.
Source steps can be viewed separately by specifying --sources:
$ uap <project-config>.yaml status --sources
[uap] Set log level to ERROR
M_genitalium_genome/download
run-info Subcommand¶
The run-info subcommand displays the commands issued for a given run. The output looks like a BASH script, but it might not be functional, because output redirections for some commands are missing from it. The output also includes the information shown by the status -r <run-ID> subcommand.
An example output showing the download of the Mycoplasma genitalium genome:
$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml run-info --even -r M_genitalium_genome/download
[uap] Set log level to ERROR
#!/usr/bin/env bash
# M_genitalium_genome/download -- Report
# ======================================
#
# output_directory: genomes/bacteria/Mycoplasma_genitalium/M_genitalium_genome/download-7RncJ4tr
# output_files:
# out/raw:
# genomes/bacteria/Mycoplasma_genitalium/Mycoplasma_genitalium.ASM2732v1.fa: []
# private_info: {}
# public_info: {}
# run_id: download
# state: FINISHED
#
# M_genitalium_genome/download -- Commands
# ========================================
# 1. Group of Commands -- 1. Command
# ----------------------------------
curl ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/Mycoplasma_genitalium/latest_assembly_versions/GCA_000027325.1_ASM2732v1/GCA_000027325.1_ASM2732v1_genomic.fna.gz
# 2. Group of Commands -- 1. Command
# ----------------------------------
../tools/compare_secure_hashes.py --algorithm md5 --secure-hash f02c78b5f9e756031eeaa51531517f24 genomes/bacteria/Mycoplasma_genitalium/M_genitalium_genome/download-7RncJ4tr/L9PXBmbPKlemghJGNM97JwVuzMdGCA_000027325.1_ASM2732v1_genomic.fna.gz
# 3. Group of Commands -- 1. Pipeline
# -----------------------------------
pigz --decompress --stdout --processes 1 genomes/bacteria/Mycoplasma_genitalium/M_genitalium_genome/download-7RncJ4tr/L9PXBmbPKlemghJGNM97JwVuzMdGCA_000027325.1_ASM2732v1_genomic.fna.gz | dd bs=4M of=/home/hubert/develop/uap/example-configurations/genomes/bacteria/Mycoplasma_genitalium/Mycoplasma_genitalium.ASM2732v1.fa
This subcommand enables the user to manually run parts of the analysis without uap. That can be helpful for debugging steps during development.
run-locally Subcommand¶
The run-locally subcommand runs all non-finished runs (or a specified subset) sequentially on the local machine. The execution can be cancelled at any time; it won’t put your project in an unstable state. However, if the run-locally subcommand receives a SIGKILL signal, the currently executing job will continue to run, and the corresponding run will be reported as executing by the status subcommand for five more minutes (SIGTERM should be fine and exit gracefully, but doesn’t just yet). After that time, you will be warned that a job is marked as being currently run but no activity has been seen for a while, along with further instructions about what to do in such a case (don’t worry, it shouldn’t happen by accident).
Specify a set of run IDs to execute only those runs. Specify the name of a step to execute all ready runs of that step.
This subcommand’s usage information:
$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml run-locally -h
usage: uap [<project-config>.yaml] run-locally [-h] [--even-if-dirty]
[run [run ...]]
This command starts 'uap' on the local machine. It can be used to start:
* all runs of the pipeline as configured in <project-config>.yaml
* all runs defined by a specific step in <project-config>.yaml
* one or more steps
To start the complete pipeline as configured in <project-config>.yaml execute:
$ uap <project-config>.yaml run-locally
To start a specific step execute:
$ uap <project-config>.yaml run-locally <step_name>
To start a specific run execute:
$ uap <project-config>.yaml run-locally <step/run>
The step_name is the name of an entry in the 'steps:' section as defined in '<project-config>.yaml'. A specific run is defined via its run ID 'step/run'. To get a list of all run IDs please run:
$ uap <project-config>.yaml status
positional arguments:
run These runs are processed on the local machine.
optional arguments:
-h, --help show this help message and exit
--even-if-dirty This option must be set if the local git repository
contains uncommited changes.
Otherwise uap will not run.
Note
Why is it safe to cancel the pipeline? The pipeline is written in a way which expects processes to fail or cluster jobs to disappear without notice. This problem is mitigated by a design which relies on file presence and file timestamps to determine whether a run is finished or not. Output files are automatically written to temporary locations and later moved to their real target directory, and it is not until the last file rename operation has finished that a run is regarded as finished.
submit-to-cluster Subcommand¶
The submit-to-cluster subcommand determines which runs still need to be executed and which supported cluster engine is available. If a cluster engine could be detected, it submits a job for every run to the cluster. Dependencies are passed to the cluster engine in a way that jobs that depend on other jobs won’t get scheduled until their dependencies have been satisfied. For more information, read about the cluster configuration and the submit script template. Each submitted job calls uap with the run-locally subcommand on the executing cluster node. Here is the usage information:
$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml submit-to-cluster -h
usage: uap [<project-config>.yaml] submit-to-cluster [-h] [--even-if-dirty]
[--cluster CLUSTER]
[run [run ...]]
This script submits all runs configured in <project-config>.yaml to a cluster.
The configuration for the available cluster types is stored at
/<path-to-uap>/cluster/cluster-specific-commands.yaml.
The list of runs can be narrowed down to specific steps. All runs of the
specified step will be submitted to the cluster. Also, individual runs IDs
(step/run) can be used for submission.
positional arguments:
run Submit only these runs to the cluster.
optional arguments:
-h, --help show this help message and exit
--even-if-dirty This option must be set if the local git repository
contains uncommited changes.
Otherwise uap will not run.
--cluster CLUSTER Specify the cluster type. Default: [auto].
fix-problems Subcommand¶
The fix-problems subcommand removes temporary files written by uap if they are not required anymore. Here is the usage information:
$ uap <project-config>.yaml fix-problems -h
usage: uap [<project-config>.yaml] fix-problems [-h] [--even-if-dirty]
[--cluster CLUSTER]
[--details] [--srsly]
optional arguments:
-h, --help show this help message and exit
--even-if-dirty This option must be set if the local git repository
contains uncommited changes.
Otherwise uap will not run.
--cluster CLUSTER Specify the cluster type. Default: [auto].
--details Displays information about the files causing problems.
--srsly Delete problematic files.
uap writes temporary files to indicate whether a job is queued or executing. Sometimes (especially on a compute cluster) jobs fail without even starting uap. This leaves behind the temporary file written on job submission, indicating that a run is queued on the cluster although no corresponding process exists (because it already failed).
The status subcommand will inform the user if fix-problems needs to be executed to clean up the mess. The hint given by status would look like:
Warning: There are 10 tasks marked as queued, but they do not seem to be queued
Hint: Run 'uap <project-config>.yaml fix-problems --details' to see the details.
Hint: Run 'uap <project-config>.yaml fix-problems --srsly' to fix these problems
(that is, delete all problematic ping files).
Be nice and do as you’re told. Now you are able to resubmit your runs to the cluster. You’ve fixed the problem, haven’t you?
volatilize Subcommand¶
The volatilize subcommand is useful to reduce the required disk space of your analysis. It only works for steps that have the _volatile keyword set in the analysis configuration file. As already mentioned there, steps marked as _volatile compute their output files as normal, but these files can be replaced by placeholder files once all dependent steps are finished.
This subcommand provides usage information:
$ uap <project-config>.yaml volatilize -h
usage: uap [<project-config>.yaml] volatilize [-h] [--even-if-dirty]
[--details] [--srsly]
Save disk space by volatilizing intermediate results. Only steps marked with '_volatile: True' are considered.
optional arguments:
-h, --help show this help message and exit
--even-if-dirty This option must be set if the local git repository
contains uncommited changes.
Otherwise uap will not run.
--details Shows which files can be volatilized.
--srsly Replaces files marked for volatilization with a placeholder.
After running volatilize --srsly, the output files of the volatilized step are replaced by placeholder files. The placeholder files have the same name as the original files, suffixed with .volatile.placeholder.yaml.
render Subcommand¶
The render subcommand generates graphs using graphviz. The graphs either show the complete analysis or the execution of a single run. At the moment, --simple only has an effect in combination with --steps. This subcommand provides usage information:
$ uap <project-config>.yaml render -h
usage: uap [<project-config>.yaml] render [-h] [--even-if-dirty] [--files]
[--steps] [--simple]
[--orientation {left-to-right,right-to-left,top-to-bottom}]
[run [run ...]]
'render' generates DOT-graphs. Without arguments
it takes the annotation file of each run and generates a graph,
showing details of the computation.
positional arguments:
run Render only graphs for these runs.
optional arguments:
-h, --help show this help message and exit
--even-if-dirty This option must be set if the local git repository
contains uncommited changes.
Otherwise uap will not run.
--files Renders a graph showing all files of the analysis.
[Not implemented yet!]
--steps Renders a graph showing all steps of the analysis and
their connections.
--simple Simplify rendered graphs.
--orientation {left-to-right,right-to-left,top-to-bottom}
Defines orientation of the graph.
Default: 'top-to-bottom'
Add New Functionality¶
Implement New Steps¶
uap can be easily extended by implementing new source or processing steps. This requires basic Python programming skills. New steps are added to uap by placing a single Python file into one of these folders in the uap installation directory:
include/sources
- Place source step files here
include/steps
- Place processing step files here
Let’s talk about how to implement such uap steps.
Step 1: Import Statements and Logger¶
At the beginning of every step please import the required modules and create a logger object.
# First import standard libraries
import os
from logging import getLogger
# Secondly import third party libraries
import yaml
# Thirdly import local application files
from abstract_step import AbstractStep # or AbstractSourceStep
# Get application wide logger
logger = getLogger("uap_logger")
Essential imports are from logging import getLogger and from abstract_step import .... The former is necessary to get access to the application-wide logger, and the latter to be able to inherit either from AbstractStep or AbstractSourceStep.
Step 2: Class Definition¶
Now you need to define a class (which inherits either from AbstractStep or AbstractSourceStep) and its __init__ method.
class ConcatenateFiles(AbstractStep):

    # Overwrite initialisation
    def __init__(self, pipeline):
        # Call super class initialisation
        super(ConcatenateFiles, self).__init__(pipeline)
..
The new class needs to be derived from either AbstractStep, for processing steps, or AbstractSourceStep, for source steps.
Step 3: __init__ Method¶
The __init__ method is the place where you should declare:
- Tools via self.require_tool('tool_name'):
Steps usually require tools to perform their task. Each tool that is going to be used by a step needs to be requested via the method require_tool('tool_name'). uap tests the existence of the required tools whenever it constructs the directed acyclic graph (DAG) of the analysis. The test is based on the information provided in the tools section of the analysis configuration. An entry for tool_name has to exist and provide information to verify the tool’s accessibility.
- Connections via add_connection(...):
Connections are defined by the method add_connection(...). They are used to transfer data from one step to another. If a step defines an output connection out/something and a subsequent step defines an input connection named in/something, then the files belonging to out/something will be available via the connection in/something.
Please name connections in a way that describes the data itself and NOT the data type. For instance, use in/genome over in/fasta. The data type of the received input data should be checked by the steps to make sure the correct commands are executed.
TODO: Reanimate the constraints feature. It would often save some lines of code to be able to define constraints on the connections.
- Options via self.add_option():
Options allow the user to influence the commands executed by a step. It is advisable to provide as many meaningful options as possible to keep steps flexible. Steps can have any number of options. Options are defined via the method add_option().
The add_option() method allows to specify various information about the option. The method parameters are these:
key
The name of the option (if possible, include the name of the tool this option influences, e.g. dd-blocksize to set the dd blocksize).
option_type
The option type has to be at least one of int, float, str, bool, list, or dict.
optional (Boolean)
Defines if the option is mandatory (False) or optional (True).
choices
List of valid values for the option.
default
Defines the default value for the option.
description
The description of the functionality of the option.
..
# Define connections
self.add_connection('in/text')
self.add_connection('out/text')
# Request tools
self.require_tool('cat')
# Options for workflow
self.add_option('concatenate_all_files', bool, optional=False,
default=False, description="Concatenate all files from "
"all runs, if 'True'.")
# Options for 'cat' (see manpage)
self.add_option('show-all', bool, optional=True,
description="Show all characters")
self.add_option('number-nonblank', int, optional=True,
description="number nonempty output lines, "
"overrides --number")
self.add_option('show-ends', bool, optional=True,
description="display $ at end of each line")
self.add_option("number", int, optional=True,
description="number all output lines")
self.add_option("squeeze-blank", bool, optional=True,
description="suppress repeated empty output lines")
self.add_option("show-tabs", bool, optional=True,
description="display TAB characters as ^I")
self.add_option("show-nonprinting", bool, optional=True,
description="use ^ and M- notation, except for "
"LFD and TAB")
..
Step 4: runs Method¶
The runs method is where all the work is done. This method gets handed a dictionary of dictionaries. The keys of the first dictionary are the run IDs (often resembling the samples). The value for each run ID is another dictionary, whose keys are the connections, e.g. “in/text”, and whose values are the corresponding files belonging to that connection.
Let’s inspect all the run IDs, connections, and input files we got from our upstream steps. And let’s store all files we received in a list for later use.
..
def runs(self, run_ids_connections_files):
    all_files = list()
    # Let's inspect the run_ids_connections_files data structure
    for run_id in run_ids_connections_files.keys():
        logger.info("Run ID: %s" % run_id)
        for connection in run_ids_connections_files[run_id].keys():
            logger.info("Connection: %s" % connection)
            for in_file in run_ids_connections_files[run_id][connection]:
                logger.info("Input file: %s" % in_file)
                # Collect all files
                all_files.append(in_file)
..
It comes in handy to assemble a list with all options for cat here.
..
# List with options for 'cat'
cat_options = ['show-all', 'number-nonblank', 'show-ends', 'number',
               'squeeze-blank', 'show-tabs', 'show-nonprinting']
# Get all options which were set
set_options = [option for option in cat_options if \
               self.is_option_set_in_config(option)]
# Compile the list of options
cat_option_list = list()
for option in set_options:
    # bool options look different than ...
    if isinstance(self.get_option(option), bool):
        if self.get_option(option):
            cat_option_list.append('--%s' % option)
    # ... the rest ...
    else:
        cat_option_list.append('--%s' % option)
        # ... make sure to cast the values to string
        cat_option_list.append(str(self.get_option(option)))
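For example, if only 'show-ends' and 'number' (set to 3) were given in the configuration, cat_option_list would end up as ['--show-ends', '--number', '3'].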
What should happen if we are told to concatenate all files from all input runs?
We have to create a single run with a new run ID ‘all_files’.
The run consists of an exec_group that runs the cat command.
Note
An exec_group is a list of commands which are executed in one go.
You might create multiple exec_groups if you need to make sure a set of
commands has finished before another set is started.
An exec_group can contain commands and pipelines.
They can be added like this:
# Add a single command
exec_group.add_command(...)
# Add a pipeline to an exec_group
with exec_group.add_pipeline() as pipe:
...
# Add a command to a pipeline
pipe.add_command(...)
The result of the concatenation is written to an output file. The run object needs to know about each output file that is going to be created.
Note
An output file is announced via the run object's
add_output_file(tag, out_path, in_paths)
method.
The method parameters are:
- tag: The name of the out connection, e.g. ‘text’ for ‘out/text’
- out_path: The name of the output file (best practice is to add the run ID to the file name)
- in_paths: The input files this output file is based on
# Okay let's concatenate all files we get
if self.get_option('concatenate_all_files'):
run_id = 'all_files'
# New run named 'all_files' is created here
with self.declare_run(run_id) as run:
# Create an exec group
with run.new_exec_group() as exec_group:
# Assemble the cat command
cat = [ self.get_tool('cat') ]
# Add the options to the command
cat.extend( cat_option_list )
cat.extend( all_files )
# Now add the command to the execution group
exec_group.add_command(
cat,
stdout_path = run.add_output_file(
'text',
"%s_concatenated.txt" % run_id,
all_files)
)
What should happen if all files of an input run have to be concatenated? We create a new run for each input run and concatenate all files that belong to the input run.
# Concatenate all files from a run's 'in/text' connection
else:
# iterate over all run IDs ...
for run_id in run_ids_connections_files.keys():
input_paths = run_ids_connections_files[run_id]['in/text']
# ... and declare a new run for each of them.
with self.declare_run(run_id) as run:
with run.new_exec_group() as exec_group:
# Assemble the cat command
cat = [ self.get_tool('cat') ]
# Add the options to the command
cat.extend( cat_option_list )
cat.extend( input_paths )
# Now add the command to the execution group
exec_group.add_command(
cat,
stdout_path = run.add_output_file(
'text',
"%s_concatenated.txt" % run_id,
input_paths)
)
That’s it. You created your first uap processing step.
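Putting the fragments above together, the complete step file might look roughly like the following sketch. It assumes that uap provides an AbstractStep base class in its abstract_step module and that steps receive the pipeline object in __init__; the logger name and the omission of the cat options from above are simplifications, so check an existing file in include/steps for the exact conventions of your uap version.

import logging
# Assumption: uap step files import the base class like this
from abstract_step import AbstractStep

logger = logging.getLogger('uap_logger')


class Cat(AbstractStep):

    def __init__(self, pipeline):
        super(Cat, self).__init__(pipeline)
        # Define connections
        self.add_connection('in/text')
        self.add_connection('out/text')
        # Request tools
        self.require_tool('cat')
        # Options for workflow ('cat' pass-through options omitted for brevity)
        self.add_option('concatenate_all_files', bool, optional=False,
                        default=False, description="Concatenate all files "
                        "from all runs, if 'True'.")

    def runs(self, run_ids_connections_files):
        if self.get_option('concatenate_all_files'):
            # Single run that concatenates every input file of every run
            all_files = [f
                         for run_id in run_ids_connections_files.keys()
                         for f in run_ids_connections_files[run_id]['in/text']]
            run_id = 'all_files'
            with self.declare_run(run_id) as run:
                with run.new_exec_group() as exec_group:
                    cat = [self.get_tool('cat')] + all_files
                    exec_group.add_command(
                        cat,
                        stdout_path=run.add_output_file(
                            'text',
                            '%s_concatenated.txt' % run_id,
                            all_files))
        else:
            # One run per input run
            for run_id in run_ids_connections_files.keys():
                input_paths = run_ids_connections_files[run_id]['in/text']
                with self.declare_run(run_id) as run:
                    with run.new_exec_group() as exec_group:
                        cat = [self.get_tool('cat')] + input_paths
                        exec_group.add_command(
                            cat,
                            stdout_path=run.add_output_file(
                                'text',
                                '%s_concatenated.txt' % run_id,
                                input_paths))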
Step 5: Add the new step to uap¶
You have to make the new step known to uap.
Save the complete file into uap's include/steps folder.
Processing step files are located in uap's include/steps/ folder,
and source step files in uap's include/sources/ folder.
You can verify that your step is correctly “installed” by checking whether it is included in the list of all source and processing steps:
$ ls -la $(dirname $(which uap))/include/sources
... Lists all available source step files
$ ls -la $(dirname $(which uap))/include/steps
... Lists all available processing step files
You can also use uap's steps subcommand to get information about installed steps.
If the step file exists at the correct location, the step can be used in an analysis configuration file.
A potential example YAML file named test.yaml
could look like this:
destination_path: example-out/test/

steps:
    ##################
    ## Source steps ##
    ##################
    raw_file_source:
        pattern: example-data/text-files/*.txt
        group: (.*).txt

    ######################
    ## Processing steps ##
    ######################
    cat:
        _depends: raw_file_source
        _connect:
            in/text:
                - raw_file_source/raw
        concatenate_all_files: False

tools:
    cat:
        path: cat
        get_version: '--version'
        exit_code: 0
You need to create the destination path and some text files matching the pattern example-data/text-files/*.txt.
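For example, matching input files for the four runs shown below could be created like this (the file contents are arbitrary):
$ mkdir -p example-data/text-files
$ echo "Hello, world!" > example-data/text-files/Hello_world.txt
$ echo "Hello, europe!" > example-data/text-files/Hello_europe.txt
$ echo "Hello, asia!" > example-data/text-files/Hello_asia.txt
$ echo "Hello, america!" > example-data/text-files/Hello_america.txt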
Here you can also see the _connect keyword at work.
Check the status of the configured analysis:
$ uap test.yaml status
Ready runs
----------
[r] cat/Hello_america
[r] cat/Hello_asia
[r] cat/Hello_europe
[r] cat/Hello_world
runs: 4 total, 4 ready
Best Practices¶
There are a couple of things you should keep in mind while implementing new steps or modifying existing ones:
- NEVER remove files! If files need to be removed, report the issue and exit uap or force the user to call a specific subcommand. Never delete files without the user's permission.
- Make sure errors already show up when the step's runs() method is called for the first time. So, look out for things that may fail in runs(). Stick to fail early, fail often. That way errors show up before jobs are submitted to the cluster, and wasting precious cluster waiting time is avoided.
- Make sure that all tools which you use inside the runs() method are also required by the step via self.require_tool(). Use the __init__() method to require tools.
- Make sure your disk access is as cluster-friendly as possible (which primarily means using large block sizes and preferably no seek operations). If possible, use pipelines to wrap your commands in pigz or dd commands. Make the used block size configurable. Although this is not possible in every case (for example when seeking in files is involved), it is straightforward with tools that read a continuous stream from stdin and write a continuous stream to stdout.
- Always use os.path.join(...) to handle paths.
- Use bash commands like mkfifo over Python library equivalents like os.mkfifo(). The mkfifo command is hashed while an os.mkfifo() call is not.
- Keep your steps as flexible as possible. You don't know what other users might need, so let them decide.
Usage of dd and mkfifo¶
uap often relies on dd and FIFOs to process data with fewer
disk read-write operations.
Please provide a step option to adjust the dd blocksize (this option
is usually called dd-blocksize).
Create your steps in a way that they perform as few filesystem operations as possible.
Some systems might be very sensitive to huge numbers of read-write operations.
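As an illustration, a run that streams a gzipped input file through a stream-processing tool could be assembled like this. The sketch assumes the exec_group/pipeline API shown in the tutorial above, a 'dd-blocksize' option, and that 'dd', 'pigz', and the placeholder tool 'my_tool' were required via self.require_tool():

with run.new_exec_group() as exec_group:
    with exec_group.add_pipeline() as pipe:
        # Read the compressed input with a configurable block size
        pipe.add_command([self.get_tool('dd'),
                          'bs=%s' % self.get_option('dd-blocksize'),
                          'if=%s' % input_path])
        # Decompress the stream on the fly
        pipe.add_command([self.get_tool('pigz'),
                          '--decompress', '--stdout'])
        # Process the continuous stream ('my_tool' is hypothetical)
        pipe.add_command(
            [self.get_tool('my_tool')],
            stdout_path=run.add_output_file(
                'text',
                '%s_processed.txt' % run_id,
                [input_path]))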
Annotation Files¶
The annotation files contain detailed information about every output file. They also include the Git SHA1 hash of the uap repository at the time of data processing, and they list the executed commands. Furthermore, the annotation contains information about inter-process streams and output files, including SHA1 checksums, file sizes, and line counts.
Upon successful completion of a task, an extensive YAML-formatted annotation
is placed next to the output files in a file called
.[task_id]-annotation.yaml
.
Also, for every output file, a symbolic link to this file is created:
.[output_filename].annotation.yaml
.
Finally, the annotation is rendered via GraphViz, if available. Rendering can also be done at a later time using annotations as input (see uap's render subcommand). The annotation can be used to determine at a later time what exactly happened. Also, annotations may help to identify bottlenecks.
known_paths¶
Contains information about all directories and files used during the processing of a run. uap calculates the SHA1 hexdigest for each known file designated as ‘output’, i.e. output/result files.
Available steps¶
Source steps¶
bcl2fastq_source¶
- Connections:
- Output Connection:
- ‘out/configureBcl2Fastq_log_stderr’
- ‘out/make_log_stderr’
- ‘out/sample_sheet’
- Options:
- adapter-sequence (str, optional)
- adapter-stringency (str, optional)
- fastq-cluster-count (int, optional)
- filter-dir (str, optional)
- flowcell-id (str, optional)
- ignore-missing-bcl (bool, optional)
- ignore-missing-control (bool, optional)
- ignore-missing-stats (bool, optional)
- input-dir (str, required) – file URL
- intensities-dir (str, optional)
- mismatches (int, optional)
- no-eamss (str, optional)
- output-dir (str, optional)
- positions-dir (str, optional)
- positions-format (str, optional)
- sample-sheet (str, required)
- tiles (str, optional)
- use-bases-mask (str, optional) – Conversion mask characters: Y or y: use; N or n: discard; I or i: use for indexing. If not given, the mask will be guessed from the RunInfo.xml file in the run folder. For instance, in a 2x76 indexed paired end run, the mask Y76,I6n,y75n means: “use all 76 bases from the first end, discard the last base of the indexing read, and use only the first 75 bases of the second end”.
- with-failed-reads (str, optional)
Required tools: configureBclToFastq.pl, make, mkdir, mv
This step provides input files which already exist and therefore creates no tasks in the pipeline.
fastq_source¶
The FastqSource class acts as a source for FASTQ files. This source creates a run for every sample.
Specify a file name pattern in pattern and define how sample names should be determined from file names by specifying a regular expression in group.
Sample index barcodes may be specified by providing a file name of a CSV file containing the columns Sample_ID and Index, or directly by defining a dictionary which maps indices to sample names.
- Connections:
- Output Connection:
- ‘out/first_read’
- ‘out/second_read’
- Options:
- first_read (str, required) – Part of the file name that marks all files containing sequencing data of the first read. Example: ‘R1.fastq’ or ‘_1.fastq’
- group (str, optional) – A regular expression which is applied to found files, and which is used to determine the sample name from the file name. For example, (Sample_\d+)_R[12].fastq.gz, when applied to a file called Sample_1_R1.fastq.gz, would result in a sample name of Sample_1. You can specify multiple capture groups in the regular expression.
- indices (str/dict, optional) – path to a CSV file or a dictionary of sample_id: barcode entries.
- paired_end (bool, required) – Specify whether the samples are paired end or not.
- pattern (str, optional) – A file name pattern, for example /home/test/fastq/Sample_*.fastq.gz.
- sample_id_prefix (str, optional) – This optional prefix is prepended to every sample name.
- sample_to_files_map (dict/str, optional) – A listing of sample names and their associated files. This must be provided as a YAML dictionary.
- second_read (str, required) – Part of the file name that marks all files containing sequencing data of the second read. Example: ‘R2.fastq’ or ‘_2.fastq’
This step provides input files which already exist and therefore creates no tasks in the pipeline.
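For orientation, a configuration fragment for this step might look like this (the paths, read markers, and regular expression are examples, not defaults):

steps:
    fastq_source:
        pattern: example-data/fastq/Sample_*.fastq.gz
        group: (Sample_\d+)_R[12].fastq.gz
        first_read: R1.fastq
        second_read: R2.fastq
        paired_end: True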
fetch_chrom_sizes_source¶
- Connections:
- Output Connection:
- ‘out/chromosome_sizes’
- Options:
- path (str, required) – directory to move file to
- ucsc-database (str, required) – Name of UCSC database e.g. hg38, mm9
Required tools: cp, fetchChromSizes
This step provides input files which already exist and therefore creates no tasks in the pipeline.
raw_file_source¶
- Connections:
- Output Connection:
- ‘out/raw’
- Options:
- group (str, optional) – A regular expression which is applied to found files, and which is used to determine the sample name from the file name. For example, (Sample_\d+)_R[12].fastq.gz, when applied to a file called Sample_1_R1.fastq.gz, would result in a sample name of Sample_1. You can specify multiple capture groups in the regular expression.
- pattern (str, optional) – A file name pattern, for example /home/test/fastq/Sample_*.fastq.gz.
- sample_id_prefix (str, optional) – This optional prefix is prepended to every sample name.
- sample_to_files_map (dict/str, optional) – A listing of sample names and their associated files. This must be provided as a YAML dictionary.
This step provides input files which already exist and therefore creates no tasks in the pipeline.
raw_file_sources¶
The RawFileSources class acts as a temporary fix to get files into the pipeline. This source creates a run for every sample.
Specify a file name pattern in pattern and define how sample names should be determined from file names by specifying a regular expression in group.
- Connections:
- Output Connection:
- ‘out/raws’
- Options:
- group (str, required) – This is a LEGACY step. Do NOT use it; better use the raw_file_source step. A regular expression which is applied to found files, and which is used to determine the sample name from the file name. For example, (Sample_\d+)_R[12].fastq.gz, when applied to a file called Sample_1_R1.fastq.gz, would result in a sample name of Sample_1. You can specify multiple capture groups in the regular expression.
- paired_end (bool, required) – Specify whether the samples are paired end or not.
- pattern (str, required) – A file name pattern, for example /home/test/fastq/Sample_*.fastq.gz.
- sample_id_prefix (str, optional) – This optional prefix is prepended to every sample name.
This step provides input files which already exist and therefore creates no tasks in the pipeline.
raw_url_source¶
- Connections:
- Output Connection:
- ‘out/raw’
- Options:
- filename (str, optional) – local file name of downloaded file
- hashing-algorithm (str, optional) – hashing algorithm to use
- possible values: ‘md5’, ‘sha1’, ‘sha224’, ‘sha256’, ‘sha384’, ‘sha512’
- path (str, required) – directory to move downloaded file to
- secure-hash (str, optional) – expected secure hash of downloaded file
- uncompress (bool, optional) – File is uncompressed after download
- url (str, required) – Download URL
Required tools: compare_secure_hashes, cp, curl, dd, mkdir, pigz
This step provides input files which already exist and therefore creates no tasks in the pipeline.
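A configuration fragment for this step could look as follows (the URL and paths are placeholders):

steps:
    raw_url_source:
        url: http://example.com/reference/genome.fa.gz
        path: example-out/reference
        filename: genome.fa
        uncompress: True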
raw_url_sources¶
- Connections:
- Output Connection:
- ‘out/raw’
- Options:
- run-download-info (dict, required) – Dictionary of dictionaries. The keys are the names of the runs. The values are dictionaries whose keys are identical with the options of a ‘raw_url_source’ source step. An example:
  <name>:
      filename: <filename>
      hashing-algorithm: <hashing-algorithm>
      path: <path>
      secure-hash: <secure-hash>
      uncompress: <uncompress>
      url: <url>
Required tools: compare_secure_hashes, cp, curl, dd, mkdir, pigz
This step provides input files which already exist and therefore creates no tasks in the pipeline.
run_folder_source¶
This source looks for fastq.gz files in [path]/Unaligned/Project_*/Sample_* and pulls additional information from CSV sample sheets it finds. It also makes sure that index information for all samples is coherent and unambiguous.
- Connections:
- Output Connection:
- ‘out/first_read’
- ‘out/second_read’
- Options:
- first_read (str, required) – Part of the file name that marks all files containing sequencing data of the first read. Example: ‘_R1.fastq’ or ‘_1.fastq’
- default value: _R1
- paired_end (bool, required)
- path (str, required)
- project (str, required)
- default value: *
- second_read (str, required) – Part of the file name that marks all files containing sequencing data of the second read. Example: ‘R2.fastq’ or ‘_2.fastq’
- default value: _R2
This step provides input files which already exist and therefore creates no tasks in the pipeline.
Processing steps¶
bam_to_bedgraph_and_bigwig¶
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/bedgraph’
- ‘out/bigwig’
- Options:
- chromosome-sizes (str, required)
- temp-sort-dir (str, optional)
Required tools: bedGraphToBigWig, bedtools, sort
CPU Cores: 8
bam_to_genome_browser¶
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- Options:
- bedtools-bamtobed-color (str, optional)
- bedtools-bamtobed-tag (str, optional)
- bedtools-genomecov-3 (bool, optional)
- bedtools-genomecov-5 (bool, optional)
- bedtools-genomecov-max (int, optional)
- bedtools-genomecov-report-zero-coverage (bool, required)
- bedtools-genomecov-scale (float, optional)
- bedtools-genomecov-split (bool, required)
- default value: True
- bedtools-genomecov-strand (str, optional)
- possible values: ‘+’, ‘-’
- chromosome-sizes (str, required)
- dd-blocksize (str, optional)
- default value: 256k
- output-format (str, required)
- default value: bigWig
- possible values: ‘bed’, ‘bigBed’, ‘bedGraph’, ‘bigWig’
- trackline (dict, optional)
- trackopts (dict, optional)
Required tools: bedGraphToBigWig, bedToBigBed, bedtools, dd, mkfifo, pigz
CPU Cores: 8
bowtie2¶
Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
typical command line:
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} -S [<hit>]
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/alignments’
- Options:
- dd-blocksize (str, optional)
- default value: 256k
- index (str, required) – Path to bowtie2 index (not containing file suffixes).
Required tools: bowtie2, dd, mkfifo, pigz
CPU Cores: 6
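For orientation, a configuration fragment using this step might look like this (the upstream step name and the index path are placeholders):

steps:
    bowtie2:
        _depends: fastq_source
        index: example-data/genome/bowtie2_index/chr21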
bowtie2_generate_index¶
bowtie2-build builds a Bowtie index from a set of DNA sequences. bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. In the case of a large index these suffixes will have a bt2l termination. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence FASTA files are no longer used by Bowtie 2 once the index is built.
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer
typical command line:
bowtie2-build [options]* <reference_in> <bt2_index_base>
- Connections:
- Input Connection:
- ‘in/reference_sequence’
- Output Connection:
- ‘out/bowtie_index’
- Options:
- bmax (int, optional) – The maximum number of suffixes allowed in a block. Allowing more suffixes per block makes indexing faster, but increases peak memory usage. Setting this option overrides any previous setting for --bmax, or --bmaxdivn. Default (in terms of the --bmaxdivn parameter) is --bmaxdivn 4. This is configured automatically by default; use -a/--noauto to configure manually.
- bmaxdivn (int, optional) – The maximum number of suffixes allowed in a block, expressed as a fraction of the length of the reference. Setting this option overrides any previous setting for --bmax, or --bmaxdivn. Default: --bmaxdivn 4. This is configured automatically by default; use -a/--noauto to configure manually.
- cutoff (int, optional) – Index only the first <int> bases of the reference sequences (cumulative across sequences) and ignore the rest.
- dcv (int, optional) – Use <int> as the period for the difference-cover sample. A larger period yields less memory overhead, but may make suffix sorting slower, especially if repeats are present. Must be a power of 2 no greater than 4096. Default: 1024. This is configured automatically by default; use -a/--noauto to configure manually.
- dd-blocksize (str, optional)
- default value: 256k
- ftabchars (int, optional) – The ftab is the lookup table used to calculate an initial Burrows-Wheeler range with respect to the first <int> characters of the query. A larger <int> yields a larger lookup table but faster query times. The ftab has size 4^(<int>+1) bytes. The default setting is 10 (ftab is 4MB).
- index-basename (str, required) – Base name used for the bowtie2 index.
- large-index (bool, optional) – Force bowtie2-build to build a large index, even if the reference is less than ~ 4 billion nucleotides long.
- noauto (bool, optional) – Disable the default behavior whereby bowtie2-build automatically selects values for the --bmax, --dcv and --packed parameters according to available memory. Instead, the user may specify values for those parameters. If memory is exhausted during indexing, an error message will be printed; it is up to the user to try new parameters.
- nodc (bool, optional) – Disable use of the difference-cover sample. Suffix sorting becomes quadratic-time in the worst case (where the worst case is an extremely repetitive reference). Default: off.
- offrate (int, optional) – To map alignments back to positions on the reference sequences, it’s necessary to annotate (‘mark’) some or all of the Burrows-Wheeler rows with their corresponding location on the genome. -o/--offrate governs how many rows get marked: the indexer will mark every 2^<int> rows. Marking more rows makes reference-position lookups faster, but requires more memory to hold the annotations at runtime. The default is 5 (every 32nd row is marked; for the human genome, annotations occupy about 340 megabytes).
- packed (bool, optional) – Use a packed (2-bits-per-nucleotide) representation for DNA strings. This saves memory but makes indexing 2-3 times slower. Default: off. This is configured automatically by default; use -a/--noauto to configure manually.
- seed (int, optional) – Use <int> as the seed for pseudo-random number generator.
Required tools: bowtie2-build, dd, pigz
CPU Cores: 6
bwa_backtrack¶
bwa-backtrack is the bwa algorithm designed for Illumina sequence reads up to 100bp. The computation of the alignments is done by running ‘bwa aln’ first, to align the reads, followed by running ‘bwa samse’ or ‘bwa sampe’ afterwards to generate the final SAM output.
http://bio-bwa.sourceforge.net/
typical command line for single-end data:
bwa aln <bwa-index> <first-read.fastq> > <first-read.sai>
bwa samse <bwa-index> <first-read.sai> <first-read.fastq> > <sam-output>
typical command line for paired-end data:
bwa aln <bwa-index> <first-read.fastq> > <first-read.sai>
bwa aln <bwa-index> <second-read.fastq> > <second-read.sai>
bwa sampe <bwa-index> <first-read.sai> <second-read.sai> <first-read.fastq> <second-read.fastq> > <sam-output>
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/alignments’
- Options:
- aln-0 (bool, optional) – When aln-b is specified, only use single-end reads in mapping.
- aln-1 (bool, optional) – When aln-b is specified, only use the first read in a read pair in mapping (skip single-end reads and the second reads).
- aln-2 (bool, optional) – When aln-b is specified, only use the second read in a read pair in mapping.
- aln-B (int, optional) – Length of barcode starting from the 5’-end. When INT is positive, the barcode of each read will be trimmed before mapping and will be written at the BC SAM tag. For paired-end reads, the barcode from both ends are concatenated. [0]
- aln-E (int, optional) – Gap extension penalty [4]
- aln-I (bool, optional) – The input is in the Illumina 1.3+ read format (quality equals ASCII-64).
- aln-M (int, optional) – Mismatch penalty. BWA will not search for suboptimal hits with a score lower than (bestScore-misMsc). [3]
- aln-N (bool, optional) – Disable iterative search. All hits with no more than maxDiff differences will be found. This mode is much slower than the default.
- aln-O (int, optional) – Gap open penalty [11]
- aln-R (int, optional) – Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).
- aln-b (bool, optional) – Specify the input read sequence file is the BAM format. For paired-end data, two ends in a pair must be grouped together and options aln-1 or aln-2 are usually applied to specify which end should be mapped. Typical command lines for mapping pair-end data in the BAM format are:
bwa aln ref.fa -b1 reads.bam > 1.sai
bwa aln ref.fa -b2 reads.bam > 2.sai
bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam
- aln-c (bool, optional) – Reverse query but not complement it, which is required for alignment in the color space. (Disabled since 0.6.x)
- aln-d (int, optional) – Disallow a long deletion within INT bp towards the 3’-end [16]
- aln-e (int, optional) – Maximum number of gap extensions, -1 for k-difference mode (disallowing long gaps) [-1]
- aln-i (int, optional) – Disallow an indel within INT bp towards the ends [5]
- aln-k (int, optional) – Maximum edit distance in the seed [2]
- aln-l (int, optional) – Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35 for ‘-k 2’. [inf]
- aln-n (float, optional) – Maximum edit distance if the value is INT, or the fraction of missing alignments given 2% uniform base error rate if FLOAT. In the latter case, the maximum edit distance is automatically chosen for different read lengths. [0.04]
- aln-o (int, optional) – Maximum number of gap opens [1]
- aln-q (int, optional) – Parameter for read trimming. BWA trims a read down to argmax_x{sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]
- aln-t (int, optional) – Number of threads (multi-threading mode) [1]
- default value: 6
- dd-blocksize (str, optional)
- default value: 256k
- index (str, required) – Path to BWA index
- sampe-N (int, optional) – Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
- sampe-P (bool, optional) – Load the entire FM-index into memory to reduce disk operations (base-space reads only). With this option, at least 1.25N bytes of memory are required, where N is the length of the genome.
- sampe-a (int, optional) – Maximum insert size for a read pair to be considered being mapped properly. Since 0.4.5, this option is only used when there are not enough good alignment to infer the distribution of insert sizes. [500]
- sampe-n (int, optional) – Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
- sampe-o (int, optional) – Maximum occurrences of a read for pairing. A read with more occurrences will be treated as a single-end read. Reducing this parameter helps faster pairing. [100000]
- sampe-r (str, optional) – Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]
- samse-n (int, optional) – Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
- samse-r (str, optional) – Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]
Required tools: bwa, dd, mkfifo, pigz
CPU Cores: 8
bwa_generate_index¶
This step generates the index database from sequences in the FASTA format.
Typical command line:
bwa index -p <index-basename> <sequence.fasta>
- Connections:
- Input Connection:
- ‘in/reference_sequence’
- Output Connection:
- ‘out/bwa_index’
- Options:
- index-basename (str, required) – Prefix of the created index database
Required tools: bwa
CPU Cores: 6
bwa_mem¶
Align 70bp-1Mbp query sequences with the BWA-MEM algorithm. Briefly, the algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW).
http://bio-bwa.sourceforge.net/bwa.shtml
Typical command line:
bwa mem [options] <bwa-index> <first-read.fastq> [<second-read.fastq>] > <sam-output>
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/alignments’
- Options:
- A (int, optional) – score for a sequence match, which scales options -TdBOELU unless overridden [1]
- B (int, optional) – penalty for a mismatch [4]
- C (bool, optional) – append FASTA/FASTQ comment to SAM output
- D (float, optional) – drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
- E (str, optional) – gap extension penalty; a gap of size k cost ‘{-O} + {-E}*k’ [1,1]
- H (str, optional) – insert STR to header if it starts with @; or insert lines in FILE [null]
- L (str, optional) – penalty for 5’- and 3’-end clipping [5,5]
- M (str, optional) – mark shorter split hits as secondary
- O (str, optional) – gap open penalties for deletions and insertions [6,6]
- P (bool, optional) – skip pairing; mate rescue performed unless -S also in use
- R (str, optional) – read group header line such as '@RG ID:foo SM:bar’ [null]
- S (bool, optional) – skip mate rescue
- T (int, optional) – minimum score to output [30]
- U (int, optional) – penalty for an unpaired read pair [17]
- V (bool, optional) – output the reference FASTA header in the XR tag
- W (int, optional) – discard a chain if seeded bases shorter than INT [0]
- Y (str, optional) – use soft clipping for supplementary alignments
- a (bool, optional) – output all alignments for SE or unpaired PE
- c (int, optional) – skip seeds with more than INT occurrences [500]
- d (int, optional) – off-diagonal X-dropoff [100]
- dd-blocksize (str, optional)
- default value: 256k
- e (bool, optional) – discard full-length exact matches
- h (str, optional) – if there are <INT hits with score >80% of the max score, output all in XA [5,200]
- index (str, required) – Path to BWA index
- j (bool, optional) – treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
- k (int, optional) – minimum seed length [19]
- m (int, optional) – perform at most INT rounds of mate rescues for each read [50]
- p (bool, optional) – smart pairing (ignoring in2.fq)
- r (float, optional) – look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
- t (int, optional) – number of threads [6]
- default value: 6
- v (int, optional) – verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
- w (int, optional) – band width for banded alignment [100]
- x (str, optional) – read type. Setting -x changes multiple parameters unless overridden [null]
pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
intractg: -B9 -O16 -L5 (intra-species contigs to ref)
- y (int, optional) – seed occurrence for the 3rd round seeding [20]
Required tools: bwa, dd, mkfifo, pigz
CPU Cores: 6
chromhmm_binarizebam¶
This command converts coordinates of aligned reads into binarized data form from which a chromatin state model can be learned. The binarization is based on a Poisson background model. If no control data is specified, the parameter of the Poisson distribution is the global average number of reads per bin. If control data is specified, the global average number of reads is multiplied by the local enrichment for control reads as determined by the specified parameters. Optionally, intermediate signal files can also be output, and these signal files can later be directly converted into binary form using the BinarizeSignal command.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/cellmarkfiletable’
- ‘out/chromhmm_binarization’
- Options:
- b (int, optional) – The number of base pairs in a bin determining the resolution of the model learning and segmentation. By default this parameter value is set to 200 base pairs.
- cell_mark_files (dict, required) – A dictionary where the keys are the names of the run and the values are lists of lists. The lists of lists describe the content of a ‘cellmarkfiletable’ files as used by ‘BinarizeBam’. But instead of file names use the run ID for the mark and control per line. That is a tab delimited file where each row contains the cell type or other identifier for a groups of marks, then the associated mark, then the name of a BAM file, and optionally a corresponding control BAM file. If a mark is missing in one cell type, but not others it will receive a 2 for all entries in the binarization file and -1 in the signal file. If the same cell and mark combination appears on multiple lines, then the union of all the reads across entries is taken except for control data where each unique file is only counted once.
- center (bool, optional) – If this flag is present then the center of the interval is used to determine the bin to assign a read. This can make sense to use if the coordinates are based on already extended reads. If this option is selected, then the strand information of a read and the shift parameter are ignored. By default reads are assigned to a bin based on the position of its 5’ end as determined from the strand of the read after shifting an amount determined by the -n shift option.
- chrom_sizes_file (str, required)
- e (int, optional) – Specifies the amount that should be subtracted from the end coordinate of a read so that both coordinates are inclusive and 0 based. The default value is 1 corresponding to standard bed convention of the end interval being 0-based but not inclusive.
- f (int, optional) – This indicates a threshold for the fold enrichment over expected that must be met or exceeded by the observed count in a bin for a present call. The expectation is determined in the same way as the mean parameter for the Poisson distribution in terms of being based on a uniform background unless control data is specified. This parameter can be useful when dealing with very deeply and/or unevenly sequenced data. By default this parameter value is 0, meaning effectively it is not used.
- g (int, optional) – This indicates a threshold for the signal that must be met or exceeded by the observed count in a bin for a present call. This parameter can be useful when desiring to directly place a threshold on the signal. By default this parameter value is 0 meaning effectively it is not used.
- n (int, optional) – The number of bases a read should be shifted to determine a bin assignment. Bin assignment is based on the 5’ end of a read shifted this amount with respect to the strand orientation. By default this value is 100.
- p (float, optional) – This option specifies the tail probability of the Poisson distribution that the binarization threshold should correspond to. The default value of this parameter is 0.0001.
- s (int, optional) – The amount that should be subtracted from the interval start coordinate so the interval is inclusive and 0 based. Default is 0 corresponding to the standard bed convention.
- strictthresh (bool, optional) – If this flag is present then the Poisson threshold must be strictly greater than the tail probability; otherwise, by default, the largest integer count for which the tail includes the Poisson threshold probability is used.
- u (int, optional) – An integer pseudocount that is uniformly added to every bin in the control data in order to smooth the control data from 0. The default value is 1.
- w (int, optional) – This determines the extent of the spatial smoothing in computing the local enrichment for control reads. The local enrichment for control signal in the x-th bin on the chromosome after adding pseudocountcontrol is computed based on the average control counts for all bins within x-w and x+w. If no controldir is specified, then this option is ignored. The default value is 5.
Required tools: ChromHMM, ln, ls, mkdir, printf, tar, xargs
CPU Cores: 4
chromhmm_learnmodel¶
This command takes a directory with a set of binarized data files and learns a chromatin state model. Binarized data files have “_binary” in the file name. The format for the binarized data files is as follows: the first line contains the name of the cell separated by a tab from the name of the chromosome. The second line contains in tab delimited form the name of each mark. The remaining lines correspond to consecutive bins on the chromosome, in tab delimited form with one column per mark, containing a “1” for a present call, “0” for an absent call, and “2” if the data is considered missing at that interval for the mark.
- Connections:
- Input Connection:
- ‘in/cellmarkfiletable’
- ‘in/chromhmm_binarization’
- Output Connection:
- ‘out/chromhmm_model’
- Options:
- assembly (str, required) – specifies the genome assembly. Overlap and neighborhood enrichments will be called with default parameters using this genome assembly. Assembly names are e.g. hg18, hg19, GRCh38
- b (int, optional) – The number of base pairs in a bin determining the resolution of the model learning and segmentation. By default this parameter value is set to 200 base pairs.
- color (str, optional) – This specifies the color of the heat map. “r,g,b” are integer values between 0 and 255 separated by commas. By default this parameter value is 0,0,255 corresponding to blue.
- d (float, optional) – The threshold on the change on the estimated log likelihood that if it falls below this value, then parameter training will terminate. If this value is less than 0 then it is not used as part of the stopping criteria. The default value for this parameter is 0.001.
- e (float, optional) – This parameter is only applicable if the load option is selected for the init parameter. This parameter controls the smoothing away from 0 when loading a model. The emission value used in the model initialization is a weighted average of the value in the file and a uniform probability over the two possible emissions. The value in the file gets weight (1-loadsmoothemission) while uniform gets weight loadsmoothemission. The default value of this parameter is 0.02.
- h (float, optional) – A smoothing constant away from 0 for all parameters in the information based initialization. This option is ignored if random or load are selected for the initialization method. The default value of this parameter is 0.02.
- holdcolumnorder (bool, optional) – Including this flag suppresses the reordering of the mark columns in the emission parameter table display.
- init (str, optional) – This specifies the method for parameter initialization method. ‘information’ is the default method described in (Ernst and Kellis, Nature Methods 2012). ‘random’ - randomly initializes the parameters from a uniform distribution. ‘load’ loads the parameters specified in ‘-m modelinitialfile’ and smooths them based on the value of the ‘loadsmoothemission’ and ‘loadsmoothtransition’ parameters. The default is information.
- possible values: ‘information’, ‘random’, ‘load’
- l (str, optional) – This file specifies the length of the chromosomes. It is a two column tab delimited file with the first column specifying the chromosome name and the second column the length. If this file is provided then no end coordinate will exceed what is specified in this file. By default BinarizeBed excludes the last partial bin along the chromosome, but if that is included in the binarized data input files then this file should be included to give a valid end coordinate for the last interval.
- m (str, optional) – This specifies the model file containing the initial parameters which can then be used with the load option
- nobed (bool, optional) – If this flag is present, then this suppresses the printing of segmentation information in the four column format. The default is to generate a four column segmentation file
- nobrowser (bool, optional) – If this flag is present, then browser files are not printed. If -nobed is requested then browserfile writing is also suppressed.
- noenrich (bool, optional) – If this flag is present, then enrichment files are not printed. If -nobed is requested then enrichment file writing is also suppressed.
- numstates (int, required) - r (int, optional) – This option specifies the maximum number of iterations over all the input data in the training. By default this is set to 200.
- s (int, optional) – This allows the specification of the random seed. Randomization is used to determine the visit order of chromosomes in the incremental expectation-maximization algorithm used to train the parameters and also used to generate the initial values of the parameters if random is specified for the init method.
- stateordering (str, optional) – This determines whether the states are ordered based on the emission or transition parameters. See (Ernst and Kellis, Nature Methods) for details. Default is ‘emission’.
- possible values: ‘emission’, ‘transition’
- t (float, optional) – This parameter is only applicable if the load option is selected for the init parameter. This parameter controls the smoothing away from 0 when loading a model. The transition value used in the model initialization is a weighted average of the value in the file and a uniform probability over the transitions. The value in the file gets weight (1-loadsmoothtransition) while uniform gets weight loadsmoothtransition. The default value is 0.5.
- x (int, optional) – This parameter specifies the maximum number of seconds that can be spent optimizing the model parameters. If it is less than 0, then there is no limit and termination is based on maximum number of iterations or a log likelihood change criteria. The default value of this parameter is -1.
- z (int, optional) – This parameter determines the threshold at which to set extremely low transition probabilities to 0 during training. Setting extremely low transition probabilities makes model learning more efficient with essentially no impact on the final results. If a transition probability falls below 10^-zerotransitionpower during training it is set to 0. Making this parameter too low, and thus the cutoff too high, can potentially cause some numerical instability. By default this parameter is set to 8.
Required tools: ChromHMM, ls, mkdir, rm, tar, xargs
CPU Cores: 8
cuffcompare¶
CuffCompare is part of the ‘Cufflinks suite of tools’ for differential expression analysis of RNA-Seq data and their visualisation. This step compares a cufflinks assembly to known annotation. For details about cuffcompare we refer to the author’s webpage.
- Connections:
- Input Connection:
- ‘in/features’
- Output Connection:
- ‘out/features’
- ‘out/loci’
- ‘out/log_stderr’
- ‘out/stats’
- ‘out/tracking’
- Options:
- ref-gtf (str, optional) – A “reference” annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.
Required tools: cuffcompare
CPU Cores: 1
cufflinks¶
CuffLinks is part of the ‘Cufflinks suite of tools’ for differential expression analysis of RNA-Seq data and their visualisation. This step applies the cufflinks tool, which assembles transcriptomes from RNA-Seq data, quantifies their expression, and produces .gtf files with these annotations. For details on cufflinks we refer to the author’s webpage.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/features’
- ‘out/genes-fpkm’
- ‘out/isoforms_fpkm’
- ‘out/log_stderr’
- ‘out/skipped’
- Options:
- 3-overhang-tolerance (int, optional) – overhang allowed on 3’ end when merging with reference
- GTF (bool, optional) – quantitate against reference transcript annotations
- GTF-guide (bool, optional) – use reference transcript annotation to guide assembly
- compatible-hits-norm (bool, optional) – count hits compatible with reference RNAs only
- frag-bias-correct (str, optional) – use bias correction - reference fasta required
- frag-len-mean (int, optional) – average fragment length (unpaired reads only)
- frag-len-std-dev (int, optional) – fragment length std deviation (unpaired reads only)
- intron-overhang-tolerance (int, optional) – overhang allowed inside reference intron when merging
- junc-alpha (float, optional) – alpha for junction binomial test filter
- label (str, optional) – assembled transcripts have this ID prefix
- library-norm-method (str, optional) – Method used to normalize library sizes
- possible values: ‘classic-fpkm’
- library-type (str, required) – library prep used for input reads
- possible values: ‘ff-firststrand’, ‘ff-secondstrand’, ‘ff-unstranded’, ‘fr-firststrand’, ‘fr-secondstrand’, ‘fr-unstranded’, ‘transfrags’
- mask-file (str, optional) – ignore all alignment within transcripts in this file
- max-bundle-frags (int, optional) – maximum fragments allowed in a bundle before skipping
- max-bundle-length (int, optional) – maximum genomic length allowed for a given bundle
- max-frag-multihits (str, optional) – Maximum number of alignments allowed per fragment
- max-intron-length (int, optional) – ignore alignments with gaps longer than this
- max-mle-iterations (int, optional) – maximum iterations allowed for MLE calculation
- max-multiread-fraction (float, optional) – maximum fraction of allowed multireads per transcript
- min-frags-per-transfrag (int, optional) – minimum number of fragments needed for new transfrags
- min-intron-length (int, optional) – minimum intron size allowed in genome
- min-isoform-fraction (float, optional) – suppress transcripts below this abundance level
- multi-read-correct (bool, optional) – use ‘rescue method’ for multi-reads (more accurate)
- no-effective-length-correction (bool, optional) – No effective length correction
- no-faux-reads (bool, optional) – disable tiling by faux reads
- no-length-correction (bool, optional) – No length correction
- no-update-check (bool, optional) – do not contact server to check for update availability
- num-frag-assign-draws (int, optional) – Number of fragment assignment samples per generation
- num-frag-count-draws (int, optional) – Number of fragment generation samples
- num-threads (int, optional) – number of threads used during analysis
- overhang-tolerance (int, optional) – number of terminal exon bp to tolerate in introns
- overlap-radius (int, optional) – maximum gap size to fill between transfrags (in bp)
- pre-mrna-fraction (float, optional) – suppress intra-intronic transcripts below this level
- seed (int, optional) – value of random number generator seed
- small-anchor-fraction (float, optional) – percent read overhang taken as ‘suspiciously small’
- total-hits-norm (bool, optional) – count all hits for normalization
- trim-3-avgcov-thresh (int, optional) – minimum avg coverage required to attempt 3’ trimming
- trim-3-dropoff-frac (float, optional) – fraction of avg coverage below which to trim 3’ end
- verbose (bool, optional) – log-friendly verbose processing (no progress bar)
Required tools: cufflinks, mkdir, mv
CPU Cores: 6
cuffmerge¶
CuffMerge is part of the ‘Cufflinks suite of tools’ for differential expression analysis of RNA-Seq data and their visualisation. This step applies the cuffmerge tool, which merges several Cufflinks assemblies. For details on cuffmerge we refer to the author’s webpage.
- Connections:
- Input Connection:
- ‘in/features’
- Output Connection:
- ‘out/assemblies’
- ‘out/features’
- ‘out/log_stderr’
- ‘out/run_log’
- Options:
- num-threads (int, optional) – Use this many threads to merge assemblies.
- default value: 6
- ref-gtf (str, optional) – A “reference” annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.
- ref-sequence (str, optional) – This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig. If a multifasta file, all contigs should be present.
- run_id (str, optional) – An arbitrary name of the new run (which is a merge of all samples).
- default value: magic
Required tools: cuffmerge, mkdir, mv, printf
CPU Cores: 6
cutadapt¶
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read’
- ‘out/log_first_read’
- ‘out/log_second_read’
- ‘out/second_read’
- Options:
- adapter-R1 (str, optional) – Adapter sequence to be clipped off of the first read.
- adapter-R2 (str, optional) – Adapter sequence to be clipped off of the second read.
- adapter-file (str, optional) – File containing adapter sequences to be clipped off of the reads.
- adapter-type (str, optional) – a: 3’ adapter, b: 3’ or 5’ adapter, g: 5’ adapter
- default value: -a
- possible values: ‘-a’, ‘-g’, ‘-b’
- dd-blocksize (str, optional) - default value: 256k
- fix_qnames (bool, required) – If set to true, only the leftmost string without spaces of the QNAME field of the FASTQ data is kept. This might be necessary for downstream analysis.
- use_reverse_complement (bool, required) – The reverse complement of the adapter sequences ‘adapter-R1’ and ‘adapter-R2’ is used for adapter clipping.
Required tools: cat, cutadapt, dd, fix_qnames, mkfifo, pigz
CPU Cores: 4
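For orientation, a configuration fragment for this step might look like this (the upstream step name is a placeholder, and the adapter sequence is shown merely as an example):

steps:
    cutadapt:
        _depends: fastq_source
        adapter-type: '-a'
        adapter-R1: AGATCGGAAGAGC
        adapter-R2: AGATCGGAAGAGC
        fix_qnames: False
        use_reverse_complement: False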
discardLargeSplitsAndPairs¶
discardLargeSplitsAndPairs reads SAM formatted alignments of the mapped reads. It discards all split reads that skip more than N_splits nucleotides in their alignment to the ref genome. In addition, all read pairs that are mapped to distant regions such that the final template would exceed M_mates nucleotides will also be discarded. All remaining reads are returned in SAM format. The discarded reads are also collected in a SAM formatted file and a statistic is returned.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/log’
- ‘out/stats’
- Options:
- M_mates (str, required) – Size of template (in nucleotides) that would arise from a read pair. Read pairs that exceed this value are discarded.
- N_splits (str, required) – Size of the skipped region within a split read (in nucleotides). Split Reads that skip more nt than this value are discarded.
Required tools: dd, discardLargeSplitsAndPairs, pigz, samtools
CPU Cores: 4
fastqc¶
The fastqc step is a wrapper for the fastqc tool. It generates some quality metrics for fastq files. For this specific instance only the zip archive is preserved.
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read_fastqc_report’
- ‘out/first_read_fastqc_report_webpage’
- ‘out/first_read_log_stderr’
- ‘out/second_read_fastqc_report’
- ‘out/second_read_fastqc_report_webpage’
- ‘out/second_read_log_stderr’
Required tools: fastqc, mkdir, mv
CPU Cores: 1
fastx_quality_stats¶
fastx_quality_stats generates a text file containing quality information of the input FASTQ data.
Documentation:
http://hannonlab.cshl.edu/fastx_toolkit/
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read_quality_stats’
- ‘out/second_read_quality_stats’
- Options:
- dd-blocksize (str, optional)
- default value: 256k
- new_output_format (bool, optional)
- default value: True
- quality (int, optional)
- default value: 33
Required tools: cat, dd, fastx_quality_stats, mkfifo, pigz
CPU Cores: 4
fix_cutadapt¶
This step takes FASTQ data and removes both reads of a paired-end read, if one of them has been completely removed by cutadapt (or any other software).
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read’
- ‘out/second_read’
- Options:
- dd-blocksize (str, optional)
- default value: 256k
Required tools: cat, dd, fix_cutadapt, mkfifo, pigz
CPU Cores: 4
htseq_count¶
The htseq-count script counts the number of reads overlapping a feature. Input needs to be a file with aligned sequencing reads and a list of genomic features. For more information see the htseq-count documentation.
- Connections:
- Input Connection:
- ‘in/alignments’
- ‘in/features’
- Output Connection:
- ‘out/counts’
- Options:
- a (int, optional)
- dd-blocksize (str, optional)
- default value: 256k
- feature-file (str, optional)
- idattr (str, optional)
- default value: gene_id
- mode (str, optional)
- default value: union
- possible values: ‘union’, ‘intersection-strict’, ‘intersection-nonempty’
- order (str, required)
- possible values: ‘name’, ‘pos’
- stranded (str, required)
- possible values: ‘yes’, ‘no’, ‘reverse’
- type (str, optional)
- default value: exon
Required tools: dd, htseq-count, pigz, samtools
CPU Cores: 2
macs2¶
Model-based Analysis of ChIP-Seq (MACS) is an algorithm for the identification of transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions, and MACS improves the spatial resolution of binding sites through combining the information of both sequencing tag position and orientation. MACS can be easily used for ChIP-Seq data alone, or with control sample data to increase the specificity.
https://github.com/taoliu/MACS
typical command line for single-end data:
macs2 callpeak --treatment <aligned-reads> [--control <aligned-reads>] --name <run-id> --gsize 2.7e9
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/broadpeaks’
- ‘out/broadpeaks-xls’
- ‘out/diagnosis’
- ‘out/gappedpeaks’
- ‘out/log’
- ‘out/model’
- ‘out/narrowpeaks’
- ‘out/narrowpeaks-xls’
- ‘out/summits’
- Options:
- bdg (bool, optional)
- broad (bool, optional)
- broad-cutoff (float, optional)
- buffer-size (int, optional)
- bw (int, optional)
- call-summits (bool, optional)
- control (dict, required)
- down-sample (bool, optional)
- extsize (int, optional)
- format (str, required)
- default value: AUTO
- possible values: ‘AUTO’, ‘ELAND’, ‘ELANDMULTI’, ‘ELANDMULTIPET’, ‘ELANDEXPORT’, ‘BED’, ‘SAM’, ‘BAM’, ‘BAMPE’, ‘BOWTIE’
- gsize (str, required)
- default value: 2.7e9
- keep-dup (int, optional)
- llocal (str, optional)
- mfold (str, optional)
- nolambda (bool, optional)
- nomodel (bool, optional)
- pvalue (float, optional)
- qvalue (float, optional)
- read-length (int, optional)
- shift (int, optional)
- slocal (str, optional)
- to-large (bool, optional)
- verbose (int, optional)
- possible values: ‘0’, ‘1’, ‘2’, ‘3’
Required tools: macs2, mkdir, mv, pigz
CPU Cores: 4
merge_fasta_files¶
This step merges all .fasta(.gz) files belonging to a certain sample. The output files are gzipped.
- Connections:
- Input Connection:
- ‘in/sequence’
- Output Connection:
- ‘out/sequence’
- Options:
- compress-output (bool, optional) – If set to true output is gzipped.
- default value: True
- dd-blocksize (str, optional)
- default value: 256k
- merge-all-runs (bool, optional) – If set to true sequences from all runs are merged
- output-fasta-basename (str, optional) – Name used as prefix for FASTA output.
Required tools: cat, dd, mkfifo, pigz
CPU Cores: 4
merge_fastq_files¶
This step merges all .fastq(.gz) files belonging to a certain sample. First and second read files are merged separately. The output files are gzipped.
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read’
- ‘out/second_read’
- Options:
- dd-blocksize (str, optional)
- default value: 256k
Required tools: cat, dd, mkfifo, pigz
CPU Cores: 4
picard_add_replace_read_groups¶
Replace read groups in a BAM file. This tool enables the user to replace all read groups in the INPUT file with a single new read group and assign all reads to this read group in the OUTPUT BAM file.
Documentation:
https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- Options:
- COMPRESSION_LEVEL (int, optional) – Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to “null” to clear the default value.
- CREATE_INDEX (bool, optional) – Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to “null” to clear the default value.
- CREATE_MD5_FILE (bool, optional) – Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to “null” to clear the default value.
- GA4GH_CLIENT_SECRETS (str, optional) – Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to “null” to clear the default value.
- MAX_RECORDS_IN_RAM (int, optional) – When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to “null” to clear the default value.
- QUIET (bool, optional) – Whether to suppress job-summary info on System.err. Default value: false. This option can be set to “null” to clear the default value.
- REFERENCE_SEQUENCE (str, optional) – Reference sequence file. Default value: null.
- RGCN (str, optional) – Read Group sequencing center name. Default value: null.
- RGDS (str, optional) – Read Group description. Default value: null.
- RGDT (str, optional) – Read Group run date. Default value: null.
- RGID (str, optional) – Read Group ID. Default value: 1. This option can be set to ‘null’ to clear the default value.
- RGLB (str, required) – Read Group library
- RGPG (str, optional) – Read Group program group. Default value: null.
- RGPI (int, optional) – Read Group predicted insert size. Default value: null.
- RGPL (str, required) – Read Group platform (e.g. illumina, solid)
- RGPM (str, optional) – Read Group platform model. Default value: null.
- RGPU (str, required) – Read Group platform unit (e.g. run barcode)
- SORT_ORDER (str, optional) – Optional sort order to output in. If not supplied OUTPUT is in the same order as INPUT. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate}
- possible values: ‘unsorted’, ‘queryname’, ‘coordinate’, ‘duplicate’
- TMP_DIR (str, optional) – A file. Default value: null. This option may be specified 0 or more times.
- VALIDATION_STRINGENCY (str, optional) – Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to “null” to clear the default value.
- possible values: ‘STRICT’, ‘LENIENT’, ‘SILENT’
- VERBOSITY (str, optional) – Control verbosity of logging. Default value: INFO. This option can be set to “null” to clear the default value.
- possible values: ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’
Required tools: picard-tools
CPU Cores: 6
picard_markduplicates¶
Identifies duplicate reads. This tool locates and tags duplicate reads (both PCR and optical/sequencing-driven) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA. Duplicates are identified as read pairs having identical 5’ positions (coordinate and strand) for both reads in a mate pair (and optionally, matching unique molecular identifier reads; see the BARCODE_TAG option). Optical, or more broadly sequencing, duplicates are duplicates that appear clustered together spatially during sequencing and can arise from optical/image-processing artifacts or from bio-chemical processes during clonal amplification and sequencing; they are identified using the READ_NAME_REGEX and OPTICAL_DUPLICATE_PIXEL_DISTANCE options. The tool’s main output is a new SAM or BAM file in which duplicates have been identified in the SAM flags field, or optionally removed (see REMOVE_DUPLICATES and REMOVE_SEQUENCING_DUPLICATES), and optionally marked with a duplicate type in the ‘DT’ optional attribute. In addition, it also outputs a metrics file containing the numbers of READ_PAIRS_EXAMINED, UNMAPPED_READS, UNPAIRED_READS, UNPAIRED_READ_DUPLICATES, READ_PAIR_DUPLICATES, and READ_PAIR_OPTICAL_DUPLICATES.
Usage example:
java -jar picard.jar MarkDuplicates I=input.bam O=marked_duplicates.bam M=marked_dup_metrics.txt
Documentation:
https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/metrics’
- Options:
- ASSUME_SORTED (bool, optional)
- COMMENT (str, optional)
- COMPRESSION_LEVEL (int, optional) – Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to “null” to clear the default value.
- CREATE_INDEX (bool, optional) – Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to “null” to clear the default value.
- CREATE_MD5_FILE (bool, optional) – Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to “null” to clear the default value.
- GA4GH_CLIENT_SECRETS (str, optional) – Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to “null” to clear the default value.
- MAX_FILE_HANDLES (int, optional)
- MAX_RECORDS_IN_RAM (int, optional) – When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to “null” to clear the default value.
- OPTICAL_DUPLICATE_PIXEL_DISTANCE (int, optional)
- PROGRAM_GROUP_COMMAND_LINE (str, optional)
- PROGRAM_GROUP_NAME (str, optional)
- PROGRAM_GROUP_VERSION (str, optional)
- PROGRAM_RECORD_ID (str, optional)
- QUIET (bool, optional) – Whether to suppress job-summary info on System.err. Default value: false. This option can be set to “null” to clear the default value.
- READ_NAME_REGEX (str, optional)
- REFERENCE_SEQUENCE (str, optional) – Reference sequence file. Default value: null.
- SORTING_COLLECTION_SIZE_RATIO (float, optional)
- TMP_DIR (str, optional) – A file. Default value: null. This option may be specified 0 or more times.
- VALIDATION_STRINGENCY (str, optional) – Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to “null” to clear the default value.
- possible values: ‘STRICT’, ‘LENIENT’, ‘SILENT’
- VERBOSITY (str, optional) – Control verbosity of logging. Default value: INFO. This option can be set to “null” to clear the default value.
- possible values: ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’
Required tools: picard-tools
CPU Cores: 12
picard_merge_sam_bam_files¶
Documentation:
https://broadinstitute.github.io/picard/command-line-overview.html#MergeSamFiles
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- Options:
- ASSUME_SORTED (bool, optional) – If true, assume that the input files are in the same sort order as the requested output sort order, even if their headers say otherwise. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}
- COMMENT (str, optional) – Comment(s) to include in the merged output file’s header. Default value: null.
- COMPRESSION_LEVEL (int, optional) – Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to “null” to clear the default value.
- CREATE_INDEX (bool, optional) – Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to “null” to clear the default value.
- CREATE_MD5_FILE (bool, optional) – Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to “null” to clear the default value.
- GA4GH_CLIENT_SECRETS (str, optional) – Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to “null” to clear the default value.
- INTERVALS (str, optional) – An interval list file that contains the locations of the positions to merge. Assume bam are sorted and indexed. The resulting file will contain alignments that may overlap with genomic regions outside the requested region. Unmapped reads are discarded. Default value: null.
- MAX_RECORDS_IN_RAM (int, optional) – When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to “null” to clear the default value.
- MERGE_SEQUENCE_DICTIONARIES (bool, optional) – Merge the sequence dictionaries. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}
- QUIET (bool, optional) – Whether to suppress job-summary info on System.err. Default value: false. This option can be set to “null” to clear the default value.
- REFERENCE_SEQUENCE (str, optional) – Reference sequence file. Default value: null.
- SORT_ORDER (str, optional) – Sort order of output file. Default value: coordinate. This option can be set to ‘null’ to clear the default value. Possible values: {unsorted, queryname, coordinate, duplicate}
- possible values: ‘unsorted’, ‘queryname’, ‘coordinate’, ‘duplicate’
- TMP_DIR (str, optional) – A file. Default value: null. This option may be specified 0 or more times.
- USE_THREADING (bool, optional) – Option to create a background thread to encode, compress and write to disk the output file. The threaded version uses about 20% more CPU and decreases runtime by ~20% when writing out a compressed BAM file. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}
- VALIDATION_STRINGENCY (str, optional) – Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to “null” to clear the default value.
- possible values: ‘STRICT’, ‘LENIENT’, ‘SILENT’
- VERBOSITY (str, optional) – Control verbosity of logging. Default value: INFO. This option can be set to “null” to clear the default value.
- possible values: ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’
Required tools: ln, picard-tools
CPU Cores: 12
post_cufflinksSuite¶
The cufflinks suite can be used to assemble new transcripts and merge those with known annotations. However, the output .gtf files need to be reformatted in several aspects afterwards. This step can be used to reformat and filter the cufflinksSuite .gtf file.
- Connections:
- Input Connection:
- ‘in/features’
- Output Connection:
- ‘out/features’
- ‘out/log_stderr’
- Options:
- class_list (str, optional) – Class codes to be removed; possible ‘=,c,j,e,i,o,p,r,u,x,s,.’
- filter_by_class (bool, required) – Remove gtf lines if any class of class_list is found in the class_code field; requires class_list
- filter_by_class_and_gene_name (bool, required) – Combines remove-by-class and remove-by-gene-name
- gene_name (str, optional) – String to match in gtf field gene_name for discarding
- default value: ENS
- remove_by_gene_name (bool, required) – Remove gtf lines which match ‘string’ in the gene_name field
- remove_gencode (bool, required) – Hard removal of gtf lines which match ‘ENS’ in the gene_name field
- remove_unstranded (bool, required) – Removes transcripts without strand specificity
- run_id (str, optional) – An arbitrary name of the new run (which is a merge of all samples).
- default value: magic
Required tools: cat, post_cufflinks_merge
CPU Cores: 6
preseq_complexity_curve¶
The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
c_curve computes the expected yield of distinct reads for experiments smaller than the input experiment in a .bed or .bam file through resampling. The full set of parameters can be output by simply typing the program name. If output.txt is the desired output file name and input.sort.bed is the sorted input .bed file, then simply type:
preseq c_curve -o output.txt input.sort.bed
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/complexity_curve’
- Options:
- hist (bool, optional) – input is a text file containing the observed histogram
- pe (bool, required) – input is paired end read file
- seg_len (int, optional) – maximum segment length when merging paired end bam reads (default: 5000)
- step (int, optional) – step size in extrapolations (default: 1e+06)
- vals (bool, optional) – input is a text file containing only the observed counts
Required tools: preseq
CPU Cores: 4
preseq_future_genome_coverage¶
The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
gc_extrap computes the expected genomic coverage for deeper sequencing for single cell sequencing experiments. The input should be a mr or bed file. The tool bam2mr is provided to convert sorted bam or sam files to mapped read format.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/future_genome_coverage’
- Options:
- bin_size (int, optional) – bin size (default: 10)
- bootstraps (int, optional) – number of bootstraps (default: 100)
- cval (float, optional) – level for confidence intervals (default: 0.95)
- extrap (int, optional) – maximum extrapolation in base pairs (default: 1e+12)
- max_width (int, optional) – max fragment length, set equal to read length for single end reads
- quick (bool, optional) – quick mode: run gc_extrap without bootstrapping for confidence intervals
- step (int, optional) – step size in bases between extrapolations (default: 1e+08)
- terms (int, optional) – maximum number of terms
Required tools: preseq
CPU Cores: 4
preseq_future_yield¶
The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
lc_extrap computes the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/future_yield’
- Options:
- bootstraps (int, optional) – number of bootstraps (default: 100)
- cval (float, optional) – level for confidence intervals (default: 0.95)
- dupl_level (float, optional) – fraction of duplicate to predict (default: 0.5)
- extrap (int, optional) – maximum extrapolation (default: 1e+10)
- hist (bool, optional) – input is a text file containing the observed histogram
- pe (bool, required) – input is paired end read file
- quick (bool, optional) – quick mode, estimate yield without bootstrapping for confidence intervals
- seg_len (int, optional) – maximum segment length when merging paired end bam reads (default: 5000)
- step (int, optional) – step size in extrapolations (default: 1e+06)
- terms (int, optional) – maximum number of terms
- vals (bool, optional) – input is a text file containing only the observed counts
Required tools: preseq
CPU Cores: 4
remove_duplicate_reads_runs¶
Duplicates are removed by Picard tools ‘MarkDuplicates’.
typical command line:
MarkDuplicates INPUT=<SAM/BAM> OUTPUT=<SAM/BAM> METRICS_FILE=<metrics-out> REMOVE_DUPLICATES=true
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/metrics’
Required tools: MarkDuplicates
CPU Cores: 12
rgt_thor¶
THOR is an HMM-based approach to detect and analyze differential peaks in two sets of ChIP-seq data from distinct biological conditions with replicates. THOR performs genomic signal processing, peak calling and p-value calculation in an integrated framework. For differential peak calling without replicates use ODIN.
For more information, please refer to:
Allhoff, M., Sere K., Freitas, J., Zenke, M., Costa, I.G. (2016), Differential Peak Calling of ChIP-seq Signals with Replicates with THOR, Nucleic Acids Research, epub gkw680 [paper][supp].
Feel free to post your questions in our Google Group or write an e-mail: rgtusers@googlegroups.com
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/chip_seq_bigwig’
- ‘out/diff_narrow_peaks’
- ‘out/diff_peaks_thor_bed’
- ‘out/thor_config’
- ‘out/thor_setup_info’
- Options:
- binsize (int, optional) – Size of bins for creating the signal.
- chrom_sizes_file (str, required)
- config_file (dict, required) – A dictionary with
- deadzones (str, optional) – Define blacklisted genomic regions to be ignored by the peak caller.
- exts (str, optional) – Read’s extension size for BAM files (comma separated list for each BAM file in config file). If option is not chosen, estimate extension sizes from reads.
- factors-inputs (str, optional) – Normalization factors for input-DNA (comma separated list for each BAM file in config file). If option is not chosen, estimate factors.
- genome (str, required) – FASTA file containing the complete genome sequence
- housekeeping-genes (str, optional) – Define housekeeping genes (BED format) used for normalizing.
- merge (bool, optional) – Merge peaks which have a distance less than the estimated mean fragment size (recommended for histone data).
- name (str, optional) – Experiment’s name and prefix for all files that are created.
- no-correction (bool, optional) – Do not use multiple test correction for p-values (Benjamini/Hochberg).
- no-gc-content (bool, optional) – Do not normalize towards GC content.
- pvalue (float, optional) – P-value cutoff for peak detection. Call only peaks with p-value lower than cutoff.
- report (bool, optional) – Generate HTML report about experiment.
- save-input (bool, optional) – Save input DNA bigwig (if input was provided).
- scaling-factors (str, optional) – Scaling factor for each BAM file (not control input-DNA) as comma separated list for each BAM file in config file. If option is not chosen, follow normalization strategy (TMM or HK approach)
- step (int, optional) – Stepsize with which the window consecutively slides across the genome to create the signal.
Required tools: printf, rgt-THOR
CPU Cores: 4
rseqc¶
The RSeQC step can be used to evaluate aligned reads in a BAM file. RSeQC does not only report raw sequence-based metrics, but also quality control metrics like read distribution, gene coverage, and sequencing depth.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/bam_stat’
- ‘out/infer_experiment’
- ‘out/read_distribution’
- Options:
- reference (str, required) – Reference gene model in BED format.
Required tools: bam_stat.py, cat, infer_experiment.py, read_distribution.py
CPU Cores: 1
s2c¶
s2c formats the output of segemehl mapping to be compatible with the cufflinks suite of tools for differential expression analysis of RNA-Seq data and their visualisation. For details on cufflinks we refer to the author’s webpage:
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/log’
- Options:
- tmp_dir (str, required) – Temp directory for ‘s2c.py’. This can be in the /work/username/ path, since it is only temporary.
Required tools: cat, dd, fix_s2c, pigz, s2c, samtools
CPU Cores: 6
sam_to_sorted_bam¶
The step sam_to_sorted_bam builds on ‘samtools sort’ to sort SAM files and output BAM files.
Sort alignments by leftmost coordinates, or by read name when -n is used. An appropriate @HD-SO sort order header tag will be added or an existing one updated if necessary.
Documentation:
http://www.htslib.org/doc/samtools.html
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- Options:
- dd-blocksize (str, optional) - default value: 256k
- genome-faidx (str, required)
- sort-by-name (bool, required)
- temp-sort-dir (str, required) – Intermediate sort files are stored in this directory.
Required tools: dd, pigz, samtools
CPU Cores: 8
samtools_faidx¶
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.
- Connections:
- Input Connection:
- ‘in/sequence’
- Output Connection:
- ‘out/indices’
Required tools: mv, samtools
CPU Cores: 4
samtools_index¶
Index a coordinate-sorted BAM or CRAM file for fast random access. (Note that this does not work with SAM files, even if they are bgzip compressed; to index such files, use tabix(1) instead.)
Documentation:
http://www.htslib.org/doc/samtools.html
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/index_stats’
- ‘out/indices’
- Options:
- index_type (str, required) - possible values: ‘bai’, ‘csi’
Required tools: ln, samtools
CPU Cores: 4
samtools_stats¶
samtools stats collects statistics from BAM files and outputs them in a text format. The output can be visualized graphically using plot-bamstats.
Documentation:
http://www.htslib.org/doc/samtools.html
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/stats’
- Options:
- dd-blocksize (str, optional) - default value: 256k
Required tools: dd, pigz, samtools
CPU Cores: 1
segemehl¶
segemehl is a software tool to map short sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to map primer- or polyadenylation-contaminated reads correctly.
This step first creates two FIFOs. The first is used to provide the genome data to segemehl and the second is used for the output of the unmapped reads:
mkfifo genome_fifo unmapped_fifo
cat <genome-fasta> > genome_fifo
The executed segemehl command is this:
segemehl -d genome_fifo -i <genome-index-file> -q <read1-fastq> [-p <read2-fastq>] -u unmapped_fifo -H 1 -t 11 -s -S -D 0 -o /dev/stdout | pigz --blocksize 4096 --processes 2 -c
The unmapped reads are saved via these commands:
cat unmapped_fifo | pigz --blocksize 4096 --processes 2 -c > <unmapped-fastq>
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/alignments’
- ‘out/log’
- ‘out/unmapped’
- Options:
- MEOP (bool, optional) – output MEOP field for easier variance calling in SAM (XE:Z:)
- SEGEMEHL (bool, optional) – output SEGEMEHL format (needs to be selected for brief)
- accuracy (int, optional) – min percentage of matches per read in semi-global alignment (default:90)
- autoclip (bool, optional) – autoclip unknown 3prime adapter
- bisulfite (int, optional) – bisulfite mapping with methylC-seq/Lister et al. (=1) or bs-seq/Cokus et al. protocol (=2) (default:0)
- possible values: ‘0’, ‘1’, ‘2’
- brief (bool, optional) – brief output
- clipacc (int, optional) – clipping accuracy (default:70)
- dd-blocksize (str, optional) - default value: 256k
- differences (int, optional) – search seeds initially with <n> differences (default:1)
- default value: 1
- dropoff (int, optional) – dropoff parameter for extension (default:8)
- evalue (float, optional) – max evalue (default:5.000000)
- extensionpenalty (int, optional) – penalty for a mismatch during extension (default:4)
- extensionscore (int, optional) – score of a match during extension (default:2)
- fix-qnames (bool, optional) – The QNAMES field of the input will be purged from spaces and everything thereafter.
- genome (str, required) – Path to genome file
- hardclip (bool, optional) – enable hard clipping
- hitstrategy (int, optional) – report only best scoring hits (=1) or all (=0) (default:1)
- default value: 1
- possible values: ‘0’, ‘1’
- index (str, required) – Path to genome index for segemehl
- jump (int, optional) – search seeds with jump size <n> (0=automatic) (default:0)
- maxinsertsize (int, optional) – maximum size of the inserts (paired end) (default:5000)
- maxinterval (int, optional) – maximum width of a suffix array interval, i.e. a query seed will be omitted if it matches more than <n> times (default:100)
- maxsplitevalue (float, optional) – max evalue for splits (default:50.000000)
- minfraglen (int, optional) – min length of a spliced fragment (default:20)
- minfragscore (int, optional) – min score of a spliced fragment (default:18)
- minsize (int, optional) – minimum size of queries (default:12)
- minsplicecover (int, optional) – min coverage for spliced transcripts (default:80)
- nohead (bool, optional) – do not output header
- order (bool, optional) – sorts the output by chromosome and position (might take a while!)
- polyA (bool, optional) – clip polyA tail
- prime3 (str, optional) – add 3’ adapter (default:none)
- prime5 (str, optional) – add 5’ adapter (default:none)
- showalign (bool, optional) – show alignments
- silent (bool, optional) – shut up!
- default value: True
- splicescorescale (float, optional) – report spliced alignment with score s only if <f>*s is larger than next best spliced alignment (default:1.000000)
- splits (bool, optional) – detect split/spliced reads (default:none)
- default value: True
- threads (int, optional) – start <n> threads (default:10)
- default value: 10
Required tools: cat, dd, fix_qnames, mkfifo, pigz, segemehl
CPU Cores: 10
segemehl_generate_index¶
The step segemehl_generate_index generates an index for the given reference sequences.
Documentation:
http://www.bioinf.uni-leipzig.de/Software/segemehl/
- Connections:
- Input Connection:
- ‘in/reference_sequence’
- Output Connection:
- ‘out/log’
- ‘out/segemehl_index’
- Options:
- dd-blocksize (str, optional) - default value: 256k
- index-basename (str, required) – Basename for created segemehl index.
Required tools: dd, mkfifo, pigz, segemehl
CPU Cores: 4
sra_fastq_dump¶
SRA Tools is a suite from NCBI to handle SRA (Sequence Read Archive) files. fastq-dump is an SRA tool that dumps the content of an SRA file in FASTQ format.
The following options cannot be set, as they would interfere with the pipeline implemented in this step:
- -O|--outdir <path>: Output directory, default is working directory ‘.’
- -Z|--stdout: Output to stdout, all split data become joined into a single stream
- --gzip: Compress output using gzip
- --bzip2: Compress output using bzip2
- Multiple File Options: Setting these options will produce more than one file, each of which will be suffixed according to splitting criteria.
- --split-files: Dump each read into a separate file. Files will receive a suffix corresponding to the read number
- --split-3: Legacy 3-file splitting for mate-pairs: first biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present, it is placed in *.fastq. Biological reads and above are ignored.
- -G|--spot-group: Split into files by SPOT_GROUP (member name)
- -R|--read-filter <[filter]>: Split into files by READ_FILTER value; optionally filter by value: pass|reject|criteria|redacted
- -T|--group-in-dirs: Split into subdirectories instead of files
- -K|--keep-empty-files: Do not delete empty files
Details on fastq-dump can be found at https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
To make I/O cluster-friendly, fastq-dump does not read the sra file directly. Rather, dd with a configurable blocksize is used to provide the sra file to fastq-dump via a FIFO.
The executed calls look like this:
mkfifo sra_fifo
dd bs=4M if=<sra-file> of=sra_fifo
fastq-dump -Z sra_fifo | pigz --blocksize 4096 --processes 2 > file.fastq
- Connections:
- Input Connection:
- ‘in/sequence’
- Output Connection:
- ‘out/first_read’
- ‘out/log’
- ‘out/second_read’
- Options:
- accession (str, optional) – Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
- aligned (bool, optional) – Dump only aligned sequences
- aligned-region (str, optional) – Filter by position on genome. Name can either be accession.version (ex:NC_000001.10) or file specific name (ex:”chr1” or “1”). “from” and “to” are 1-based coordinates. <name[:from-to]>
- clip (bool, optional) – Apply left and right clips
- dd-blocksize (str, optional) - default value: 256k
- defline-qual (str, optional) – Defline format specification for quality.
- defline-seq (str, optional) – Defline format specification for sequence.
- disable-multithreading (bool, optional) – disable multithreading
- dumpbase (bool, optional) – Formats sequence using base space (default for other than SOLiD).
- dumpcs (bool, optional) – Formats sequence using color space (default for SOLiD),”cskey” may be specified for translation.
- fasta (int, optional) – FASTA only, no qualities, optional line wrap width (set to zero for no wrapping). <[line width]>
- helicos (bool, optional) – Helicos style defline
- legacy-report (bool, optional) – use legacy style “Written spots” for tool
- log-level (str, optional) – Logging level as number or enum string, one of (fatal|sys|int|err|warn|info) or (0-5). Current/default is warn. <level>
- matepair-distance (str, optional) – Filter by distance between matepairs. Use “unknown” to find matepairs split between the references. Use from-to to limit matepair distance on the same reference. <from-to|unknown>
- maxSpotId (int, optional) – Maximum spot id to be dumped. Use with “minSpotId” to dump a range.
- max_cores (int, optional) – Maximum number of cores available on the cluster
- default value: 10
- minReadLen (int, optional) – Filter by sequence length >= <len>
- minSpotId (int, optional) – Minimum spot id to be dumped. Use with “maxSpotId” to dump a range.
- ncbi_error_report (str, optional) – Control program execution environment report generation (if implemented). One of (never|error|always). Default is error. <error>
- offset (int, optional) – Offset to use for quality conversion, default is 33
- origfmt (bool, optional) – Defline contains only original sequence name
- qual-filter (bool, optional) – Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N
- qual-filter-1 (bool, optional) – Filter used in current 1000 Genomes data
- read-filter (str, optional) – Split into files by READ_FILTER value optionally filter by value: pass|reject|criteria|redacted
- readids (bool, optional) – Append read id after spot id as “accession.spot.readid” on defline.
- skip-technical (bool, optional) – Dump only biological reads
- split-spot (str, optional) – Split spots into individual reads
- spot-groups (str, optional) – Filter by SPOT_GROUP (member): name[,...]
- suppress-qual-for-cskey (bool, optional) – suppress quality-value for cskey
- table (str, optional) – Table name within cSRA object, default is “SEQUENCE”
- unaligned (bool, optional) – Dump only unaligned sequences
- verbose (bool, optional) – Increase the verbosity level of the program. Use multiple times for more verbosity.
Required tools: dd, fastq-dump, mkfifo, pigz
CPU Cores: 10
subsetMappedReads¶
subsetMappedReads selects a provided number of mapped reads from a file in .sam or .bam format. Depending on the set options the first N mapped reads and their mates (for paired end sequencing) are returned in .sam format. If the number of requested reads exceeds the number of available mapped reads, all mapped reads are returned.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/log’
- Options:
- Nreads (str, required) – Number of reads to extract from input file.
- genome-faidx (str, required)
- paired_end (bool, required) – The reads are expected to have a mate, due to paired end sequencing.
Required tools: cat, dd, head, pigz, samtools
CPU Cores: 1
tophat2¶
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
typical command line:
tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/align_summary’
- ‘out/alignments’
- ‘out/deletions’
- ‘out/insertions’
- ‘out/junctions’
- ‘out/log_stderr’
- ‘out/misc_logs’
- ‘out/prep_reads’
- ‘out/unmapped’
- Options:
- index (str, required) – Path to genome index for tophat2
- library_type (str, required) – The default is unstranded (fr-unstranded). If either fr-firststrand or fr-secondstrand is specified, every read alignment will have an XS attribute tag as explained below. Consider supplying library type options below to select the correct RNA-seq protocol. (https://ccb.jhu.edu/software/tophat/manual.shtml)
- possible values: ‘fr-unstranded’, ‘fr-firststrand’, ‘fr-secondstrand’
Required tools: mkdir, mv, tar, tophat2
CPU Cores: 6
API documentation¶
Pipeline-specific modules¶
abstract_step¶
Classes AbstractStep and AbstractSourceStep are defined here.
The class AbstractStep has to be inherited by all processing step classes. The class AbstractSourceStep has to be inherited by all source step classes.
Processing steps generate output files from input files, whereas source steps only provide output files. Both step types may generate tasks, but only source steps can introduce files from outside the destination path into the pipeline.
class abstract_step.AbstractStep(pipeline)¶
add_connection(connection, constraints=None)¶
Add a connection, which must start with ‘in/’ or ‘out/’.
add_dependency(parent)¶
Add a parent step to this step’s dependencies.
parent – the parent step this step depends on
declare_run(run_id)¶
Declare a run. Use it like this:
with self.declare_run(run_id) as run:
    # add output files and information to the run here
dependencies = None¶
All steps this step depends on.
finalize()¶
Finalizes the step. The intention is to make further changes to the step impossible; however, this is currently not checked anywhere.
find_upstream_info_for_input_paths(input_paths, key)¶
Find a piece of public information in all upstream steps. If the information is not found, or is defined in more than one upstream step, this will crash.
generate_one_report()¶
Gathers the output files for each outgoing connection and calls self.reports() to do the job of creating a report.
generate_report(run_id)¶
Gathers the output files for each outgoing connection and calls self.reports() to do the job of creating a report.
get_module_loads()¶
Return a dictionary with module load commands to execute before starting any other command of this step.
get_module_unloads()¶
Return a dictionary with module unload commands to execute before starting any other command of this step.
get_post_commands()¶
Return a dictionary with commands to execute after finishing any other command of this step.
get_pre_commands()¶
Return a dictionary with commands to execute before starting any other command of this step.
get_run_ids_in_connections_input_files()¶
Return a dictionary with all run IDs from parent steps, the in connections they provide data for, and the names of the files:
run_id_1:
    in_connection_1: [input_path_1, input_path_2, ...]
    in_connection_2: ...
run_id_2: ...
Format of in_connection: in/<connection>. Input paths are absolute.
get_run_ids_out_connections_output_files()¶
Return a dictionary with all run IDs of the current step, their out connections, and the files that belong to them:
run_id_1:
    out_connection_1: [output_path_1, output_path_2, ...]
    out_connection_2: ...
run_id_2: ...
Format of out_connection: out/<connection>. Output paths are absolute.
get_run_state(run_id)¶
Returns the run state of a run.
Determine the run state (that is, not the basic but the extended run state) of a run, building on the value returned by get_run_state_basic().
If a run is ready, this will:
- return executing if an up-to-date executing ping file is found
- otherwise return queued if a queued ping file is found
If a run is waiting, this will:
- return queued if a queued ping file is found
Otherwise, it will just return the value obtained from get_run_state_basic().
Attention: the status indicators executing and queued may be temporarily wrong due to the possibility of out-of-date ping files lying around.
get_run_state_basic(run_id)¶
Determines the basic run state of a run, which is, at any time, one of waiting, ready, or finished.
These states are determined from the current configuration and the timestamps of result files present in the file system. In addition to these three basic states, there are two additional states which are less reliable (see get_run_state()).
get_runs()¶
Getter method for the runs of this step. If there are no runs when this method is called, they are created here.
classmethod get_step_class_for_key(key)¶
Returns a step (or source step) class for a given key, which corresponds to the name of the module the class is defined in. Pass ‘cutadapt’ and you will get the cutadapt.Cutadapt class, which you may then instantiate.
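For illustration, a usage sketch (assuming a Pipeline instance named pipeline already exists):
from abstract_step import AbstractStep

cutadapt_class = AbstractStep.get_step_class_for_key('cutadapt')
cutadapt_step = cutadapt_class(pipeline)
# cutadapt_step is now an instance of cutadapt.Cutadapt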
get_step_name()¶
Returns this step’s name, which is initially equal to the step type (== module name) but can be changed via set_step_name() or via the YAML configuration.
is_option_set_in_config(key)¶
Determine whether an optional option (that is, a non-required option) has been set in the configuration.
reports(run_id, out_connection_output_files)¶
Abstract method; this must be implemented by the actual step. Raises NotImplementedError if the subclass does not override this method.
require_tool(tool)¶
Declare that this step requires an external tool. Query it later with get_tool().
run(run_id)¶
Create a temporary output directory and execute a run. After the run has finished, it is checked that all output files are in place and the output files are moved to the final output location. Finally, YAML annotations are written.
runs(run_ids_connections_files)¶
Abstract method; this must be implemented by the actual step. Raises NotImplementedError if the subclass does not override this method.
class abstract_step.AbstractSourceStep(pipeline)¶
A subclass all source steps inherit from, which distinguishes source steps from all real processing steps: source steps do not yield any tasks, because their “output files” are in fact files which are already there.
Note that the name might be a bit misleading, because this class only applies to source steps which ‘serve’ existing files. A step which has no input but produces input data for other steps, and actually has to do something for it, would be a normal AbstractStep subclass, because it produces tasks.
pipeline¶
class pipeline.Pipeline(**kwargs)¶
The Pipeline class represents the entire processing pipeline, which is defined and configured via the configuration file config.yaml.
Individual steps may be defined in a tree, and their combination with samples as generated by one or more sources leads to an array of tasks.
all_tasks_topologically_sorted = None¶
List of all tasks in topological order.
check_tools()¶
Checks whether all tools referenced by the configuration are available and records their versions as determined by [tool] --version etc.
cluster_type = None¶
The cluster type to be used (must be one of the keys specified in cluster_config).
config = None¶
Dictionary representation of the configuration YAML file.
file_dependencies = None¶
This dict stores file dependencies within this pipeline, regardless of step, output file tag, or run ID. For every output file generated by the pipeline, it holds the set of input files that output file depends on.
file_dependencies_reverse = None¶
The reverse of file_dependencies: for every input file required by the pipeline, it holds the set of output files which are generated using this input file.
input_files_for_task_id = None¶
This dict stores a set of input files for every task ID in the pipeline.
output_files_for_task_id = None¶
This dict stores a set of output files for every task ID in the pipeline.
states = Enum(['READY', 'EXECUTING', 'WAITING', 'QUEUED', 'FINISHED'])¶
Possible states a task can be in.
steps = None¶
This dict stores step objects by their name. Each step knows its dependencies.
task_for_task_id = None¶
This dict stores task objects by task IDs.
task_id_for_output_file = None¶
This dict stores a task ID for every output file created by the pipeline.
task_ids_for_input_file = None¶
This dict stores a set of task IDs for every input file used in the pipeline.
topological_step_order = None¶
List with topologically ordered steps.
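Taken together, these dictionaries let dependency questions be answered without touching the file system. A sketch, assuming a fully configured Pipeline instance p and a hypothetical output path:
out_file = '/destination/some_step/sample1.bam'  # hypothetical path
producing_task = p.task_id_for_output_file[out_file]
required_inputs = p.file_dependencies[out_file]
for in_file in required_inputs:
    # all tasks that consume this input file
    consuming_tasks = p.task_ids_for_input_file[in_file]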
run¶
class run.Run(step, run_id)¶
The Run class is a helper class which represents a run in a step. Declare runs inside AbstractStep.runs() via:
with self.new_run(run_id) as run:
    # declare output files, private and public info here
After that, use the available methods to configure the run. The run typically has no information about input connections, only about input files.
add_empty_output_connection(tag)¶
An empty output connection has ‘None’ as output file and ‘None’ as input file.
add_output_file(tag, out_path, in_paths)¶
Add an output file to this run. Output file names must be unique across all runs defined by a step, so it may be a good idea to include the run_id in the output file name.
- tag: You must specify the connection annotation, which must have been previously declared via AbstractStep.add_connection(“out/...”); this doesn’t have to be done in the step constructor, it’s also possible in declare_runs() right before this method is called.
- out_path: The output file path, without a directory. The pipeline assigns directories for you (this parameter must not contain a slash).
- in_paths: A list of input files this output file depends on. It is crucial to get this right, so that the pipeline can determine which steps are up-to-date at any given time. You have to specify absolute paths here, including a directory, and you can obtain them via AbstractStep.run_ids_and_input_files_for_connection and related functions.
add_private_info(key, value)¶
Add private information to a run. Use this to store data which you will need when the run is executed. As opposed to public information, private information is not visible to subsequent steps.
You can store paths to input files here, but not paths to output files, as their expected location is not defined until we’re in AbstractStep.execute (hint: they get written to a temporary directory inside execute()).
add_public_info(key, value)¶
Add public information to a run. For example, a FASTQ reader may store the index barcode here for subsequent steps to query via AbstractStep.find_upstream_info().
add_temporary_directory(prefix='', suffix='', designation=None)¶
Convenience method for the creation of temporary directories. Basically, it just calls self.add_temporary_file(). The magic happens in ProcessPool.__exit__().
get_basic_state()¶
Determines the basic run state of a run, which is, at any time, one of waiting, ready, or finished.
These states are determined from the current configuration and the timestamps of result files present in the file system. In addition to these three basic states, there are two additional states which are less reliable (see get_run_state()).
get_execution_hashtag()¶
Creates a hash tag based on the commands to be executed. This causes runs to be marked for rerunning if the commands to be executed change.
get_input_files_for_output_file(out_path)¶
Return all input files a given output file depends on.
get_output_directory_du_jour()¶
Returns the state-dependent output directory of the step:
- if we are currently calling a step’s declare_runs() method, this will return None
- if we are currently calling a step’s execute() method, this will return the temporary directory
- otherwise, it will return the real output directory
get_output_directory_du_jour_placeholder()¶
Returns a placeholder for the temporary output directory, which needs to be replaced by the actual temp directory inside the abstract_step.execute() method.
get_output_files_abspath()¶
Return a dictionary of all defined output files, grouped by connection annotation:
annotation_1:
    out_path_1: [in_path_1, in_path_2, ...]
    out_path_2: ...
annotation_2: ...
The out_path consists of the output directory du jour and the output file name.
get_output_files_for_annotation_and_tags(annotation, tags)¶
Retrieve a set of output files of the given annotation, assigned to the same number of specified tags. If you have two ‘alignment’ output files and they are called out-a.txt and out-b.txt, you can use this function like this:
- tags: [‘a’, ‘b’]
- result: {‘a’: ‘out-a.txt’, ‘b’: ‘out-b.txt’}
get_private_info(key)¶
Query private information which must have been previously stored via add_private_info().
get_public_info(key)¶
Query public information which must have been previously stored via add_public_info().
get_single_output_file_for_annotation(annotation)¶
Retrieve exactly one output file of the given annotation, and crash if there isn’t exactly one.
remove_temporary_paths()¶
Everything stored in self._temp_paths is examined and deleted if possible. The list elements are removed in LIFO order. Also, the ‘type’ info in self._known_paths is updated here. Note: additional stat checks are included to detect FIFOs as well as other special files.
update_public_info(key, value)¶
Update public information already existing in a run. For example, all steps which handle FASTQ files want to know how to distinguish between files of read 1 and files of read 2. So each step that provides FASTQ files should update this information if the file names are altered. The stored information can be acquired via AbstractStep.find_upstream_info().
Miscellaneous modules¶
process_pool¶
This module can be used to launch child processes and wait for them. Processes may either run on their own or pipelines can be built with them.
class process_pool.ProcessPool(run)¶
The process pool provides an environment for launching and monitoring processes. You can launch any number of unrelated processes plus any number of pipelines in which several processes are chained together.
Use it like this:
with process_pool.ProcessPool(self) as pool:
    # launch processes or create pipelines here
When the scope opened by the with statement is left, all processes are launched and being watched. The process pool then waits until all processes have finished. You cannot launch a process pool within another process pool, but you can launch multiple pipelines and independent processes within a single process pool. Also, you can launch several process pools sequentially.
COPY_BLOCK_SIZE = 4194304¶
When stdout or stderr streams should be written to output files, this is the buffer size which is used for writing.
class Pipeline(pool)¶
This class can be used to chain multiple processes together.
Use it like this:
with pool.Pipeline(pool) as pipeline:
    # append processes to the pipeline here
ProcessPool.SIGTERM_TIMEOUT = 10¶
After a SIGTERM signal is issued, wait this many seconds before going postal.
ProcessPool.TAIL_LENGTH = 1024¶
Size of the tail which gets recorded from both stdout and stderr streams of every process launched with this class, in bytes.
classmethod ProcessPool.kill()¶
Kills all user-launched processes. After that, the remaining process will end and a report will be written.
classmethod ProcessPool.kill_all_child_processes()¶
Kill all child processes of this process by sending a SIGTERM to each of them. This includes all children which were not launched by this module, their children, etc.
ProcessPool.launch(args, stdout_path=None, stderr_path=None, hints={})¶
Launch a process. Arguments, including the program itself, are passed in args. If the program is not a binary but a script which cannot be invoked directly from the command line, the first element of args must be a list like this: [‘python’, ‘script.py’].
Use stdout_path and stderr_path to redirect stdout and stderr streams to files. In any case, the output of both streams gets watched: the process pool calculates SHA1 checksums automatically and also keeps the last 1024 bytes of every stream. This may be useful if a process crashes and writes error messages to stderr, in which case you can see them even if you didn’t redirect stderr to a log file.
Hints can be specified but are not essential. They help to determine the direction of arrows for the run annotation graphs rendered by GraphViz (sometimes, it’s not clear from the command line whether a certain file is an input or output file of a given process).
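A usage sketch, assuming we are inside a step and with hypothetical file paths:
with process_pool.ProcessPool(self) as pool:
    # plain binary: stderr is additionally redirected to a log file
    pool.launch(['samtools', 'sort', '/tmp/in.bam'],
                stderr_path='/tmp/sort.log')
    # a script which cannot be invoked directly: wrap interpreter and
    # script in a list as the first element of args
    pool.launch([['python', 'script.py'], '--verbose'])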
fscache¶
class fscache.FSCache¶
Use this class if you expect to make the same os.path.* calls many times during a short time. The first time you call a method with certain arguments, the call is made, but all subsequent calls are served from a cache.
Usage example:
# Instantiate a new file system cache.
fsc = FSCache()
# This call will stat the file system.
print(fsc.exists('/home'))
# This call will leave the file system alone; the cached result is returned.
print(fsc.exists('/home'))
You may call any method which is available in os.path.
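The caching idea could be re-implemented in a few lines (a sketch, not uap's actual code):
import os.path

class FSCacheSketch(object):
    def __init__(self):
        self._cache = {}

    def __getattr__(self, name):
        # forward any os.path function, memoizing its results per argument tuple
        func = getattr(os.path, name)
        def cached(*args):
            key = (name, args)
            if key not in self._cache:
                self._cache[key] = func(*args)
            return self._cache[key]
        return cached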
misc¶
misc.append_suffix_to_path(path, suffix)¶
Append a suffix to a path, for example:
- path: /home/michael/chocolate-cookies.txt.gz
- suffix: done right
- result: /home/michael/chocolate-cookies-done-right.txt.gz
misc.assign_strings(paths, tags)¶
Assign N strings (path names, for example) to N tags. Example:
- paths = [‘RIB0000794-cutadapt-R1.fastq.gz’, ‘RIB0000794-cutadapt-R2.fastq.gz’]
- tags = [‘R1’, ‘R2’]
- result = { ‘R1’: ‘RIB0000794-cutadapt-R1.fastq.gz’, ‘R2’: ‘RIB0000794-cutadapt-R2.fastq.gz’ }
If this is not possible without ambiguities, a StandardError is thrown. Attention: the number of paths must be equal to the number of tags; a 1:1 relation is returned if possible.
misc.bytes_to_str(num)¶
Convert a number representing a number of bytes into a human-readable string such as “4.7 GB”.
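Such a conversion is commonly implemented like this (a sketch; uap's exact thresholds and formatting may differ):
def bytes_to_str_sketch(num):
    # climb the unit ladder until the value drops below 1024
    for unit in ['bytes', 'kB', 'MB', 'GB', 'TB']:
        if abs(num) < 1024.0:
            return '%3.1f %s' % (num, unit)
        num /= 1024.0
    return '%3.1f PB' % num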
misc.duration_to_str(duration, long=False)¶
Minor adjustment to Python’s duration-to-string conversion: removes microsecond accuracy and replaces ‘days’ with ‘d’.
misc.natsorted(l)¶
Return a ‘naturally sorted’ permutation of l.
Credits: http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html
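Following the credited approach, natural sorting can be sketched with a key that turns digit runs into integers (illustrative, not necessarily uap's exact implementation):
import re

def natsorted_sketch(l):
    def natural_key(s):
        # 'sample10' -> ['sample', 10, '']: digit runs compare numerically
        return [int(p) if p.isdigit() else p for p in re.split(r'(\d+)', s)]
    return sorted(l, key=natural_key)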
Remarks¶
This documentation has been created using Sphinx and reStructuredText.