Pipeliner Documentation¶
Requirements¶
The Pipeliner framework requires Nextflow and Anaconda. Nextflow can be used on any POSIX-compatible system (Linux, OS X, etc.) and requires Bash and Java 8 (or higher) to be installed. Third-party software tools used by individual pipelines will be installed and managed through a Conda virtual environment.
Testing Nextflow¶
Before continuing, test to make sure your environment is compatible with a Nextflow executable.
Note
You will download another Nextflow executable later when you clone the repository
Make sure your Java installation is version 8 or higher:
java -version
Create a new directory and install/test Nextflow:
mkdir nf-test
cd nf-test
curl -s https://get.nextflow.io | bash
./nextflow run hello
Output:
N E X T F L O W ~ version 0.31.0
Launching `nextflow-io/hello` [sad_curran] - revision: d4c9ea84de [master]
[warm up] executor > local
[4d/479eec] Submitted process > sayHello (4)
[a8/4bc038] Submitted process > sayHello (2)
[17/5be64e] Submitted process > sayHello (3)
[ee/0d879f] Submitted process > sayHello (1)
Hola world!
Ciao world!
Hello world!
Bonjour world!
Installing Anaconda¶
Pipeliner uses virtual environments managed by Conda, which is available through Anaconda. Download the distribution pre-packaged with Python 2.7.
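For example, on Linux the installer can be downloaded and run from the command line. This is a hypothetical sketch; the exact installer version and URL may differ:
# Hypothetical installer version; check the Anaconda archive for the latest Anaconda2 release
wget https://repo.anaconda.com/archive/Anaconda2-2019.10-Linux-x86_64.sh
bash Anaconda2-2019.10-Linux-x86_64.sh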
Make sure conda is installed and updated:
conda --version
conda update conda
Tip
If this is your first time working with Conda, you may need to edit your configuration paths to ensure Anaconda is invoked when calling conda
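On most systems this amounts to prepending the Anaconda binaries to your PATH. A minimal sketch, assuming Anaconda was installed to the default ~/anaconda2 location:
# Assumes the default Anaconda2 install location
export PATH="$HOME/anaconda2/bin:$PATH"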
Pre-Packaged Conda Environment¶
Yaml File¶
Clone Pipeliner:
git clone https://github.com/montilab/pipeliner
Environment for Linux:
conda env create -f pipeliner/envs/linux_env.yml
Environment for OS X:
conda env create -f pipeliner/envs/osx_env.yml
Note
Copies of pre-compiled binaries are hosted/maintained at https://anaconda.org/Pipeliner/repo
Warning
For those installing on the Shared Computing Cluster (SCC) at Boston University, instructions on how to set up a private conda environment can be found here.
Setting up Pipeliner¶
Tip
It is recommended to clone Pipeliner to a directory path that does not contain spaces
With all prerequisites satisfied, one can quickly set up Pipeliner by cloning the repository, configuring local paths to toy datasets, activating the Conda environment, and downloading the Nextflow executable:
# Clone Pipeliner
git clone https://github.com/montilab/pipeliner
# Activate conda environment
source activate pipeliner
# Configure local paths to toy datasets
python pipeliner/scripts/paths.py
# Move to pipelines directory
cd pipeliner/pipelines
# Download nextflow executable
curl -s https://get.nextflow.io | bash
# Run RNA-seq pipeline with toy data
./nextflow rnaseq.nf -c rnaseq.config
The output should look like this:
N E X T F L O W ~ version 0.31.1
Launching `rnaseq.nf` [nasty_pauling] - revision: cd3f572ab2
[warm up] executor > local
[31/1b2066] Submitted process > pre_fastqc (ggal_alpha)
[23/de6d60] Submitted process > pre_fastqc (ggal_theta)
[7c/28ee53] Submitted process > pre_fastqc (ggal_gamma)
[97/9ad6c1] Submitted process > check_reads (ggal_alpha)
[ab/c3eedf] Submitted process > check_reads (ggal_theta)
[2d/050633] Submitted process > check_reads (ggal_gamma)
[1d/f3af6d] Submitted process > pre_multiqc
[32/b1db1d] Submitted process > hisat_indexing (genome_reference.fa)
[3b/d93c6d] Submitted process > trim_galore (ggal_alpha)
[9c/3fa50b] Submitted process > trim_galore (ggal_theta)
[62/25fce0] Submitted process > trim_galore (ggal_gamma)
[66/ccc9db] Submitted process > hisat_mapping (ggal_alpha)
[28/69fff5] Submitted process > hisat_mapping (ggal_theta)
[5c/5ed2b6] Submitted process > hisat_mapping (ggal_gamma)
[b4/e559ab] Submitted process > gtftobed (genome_annotation.gtf)
[bc/6f490c] Submitted process > rseqc (ggal_alpha)
[71/80aa9e] Submitted process > rseqc (ggal_theta)
[17/ca0d9f] Submitted process > rseqc (ggal_gamma)
[d7/7d391b] Submitted process > counting (ggal_alpha)
[df/936854] Submitted process > counting (ggal_theta)
[11/143c2c] Submitted process > counting (ggal_gamma)
[31/4c11f9] Submitted process > expression_matrix
[1f/3af548] Submitted process > multiqc
Success: Pipeline Completed!
Basic Usage¶
Framework Structure¶
Pipeliner is a framework with various moving parts that support the development of multiple sequencing pipelines. The following is a simplified example of its directory structure:
/pipeliner
├── /docs
├── /envs
├── /scripts
├── /tests
└── /pipelines
├── /configs
├── /scripts
├── /templates
├── /toy_data
├── /rnaseq.nf
└── /rnaseq.config
- docs
- Markdown and reStructuredText documentation files associated with Pipeliner and existing pipelines
- envs
- Yaml files and scripts required to reproduce Conda environments
- scripts
- Various helper scripts for framework setup and maintenance
- tests
- Python testing module for multi-pipeline automatic test execution and reporting
- pipelines/configs
- Base config files inherited by pipeline configurations
- pipelines/scripts
- Various helper scripts for pipeline processes
- pipelines/templates
- Template processes inherited by pipeline workflows
- pipelines/toy_data
- Small datasets for rapid development and testing of pipelines. These datasets are modifications from original RNA-seq and scRNA-seq datasets.
- pipelines/rnaseq.nf
- Nextflow script for the RNA-seq pipeline
- pipelines/rnaseq.config
- Configuration file for the RNA-seq pipeline
Pipeline Configuration¶
Note
These examples are applicable to all pipelines
In the previous section, we gave instructions for processing the RNA-seq toy dataset. In that example, the configuration options were all preset; with real data, however, these settings must be reconfigured. The configuration file is therefore typically the first thing a user will have to modify to suit their needs. The following is a screenshot of the first half of the RNA-seq configuration file.
[Screenshot: first half of the RNA-seq configuration file]
Config Inheritance¶
Line 1: Configuration files can inherit basic properties that are reused across many pipelines. We have defined several inheritable configuration files, including configs for running pipelines on local machines, on Sun Grid Engine clusters, in Docker environments, and on AWS cloud computing.
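For example, a pipeline configuration might begin by inheriting one of these base configs. A minimal sketch; the relative path is an assumption based on the pipelines/configs layout described earlier:
// Inherit executor settings for running on a local machine
includeConfig "configs/local.config"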
Data Input and Output¶
Lines 9-15: All data paths are defined in the configuration file. This includes specifying where input data resides as well as where the pipeline should write all data it produces.
Basic Options¶
Lines 17-19: These are pipeline-specific parameters that make large changes to how the data is processed.
Providing an Index¶
Lines 25-27: A useful feature of a pipeline is the ability to use an existing alignment index.
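Pointing a pipeline at a previously built index might look like the following sketch; the parameter name index and its value are illustrative assumptions:
// Hypothetical parameter; reuse an index saved by a previous run
index = "${params.outdir}/aligner/index"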
Starting from Bams¶
Lines 29-30: Another useful feature of a pipeline is the ability to skip pre-processing steps and start directly from bam files, allowing users to begin their pipeline at the counting step.
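Such a configuration might look like the following sketch; both parameter names are illustrative assumptions, not the exact names used by Pipeliner:
// Hypothetical parameters; start the pipeline from existing alignment files
bams = "${params.indir}/ggal_bams.csv"
from.bam = true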
Temporary Files¶
Lines 33-34: By default, bam files are saved after alignment for future use. This can be useful; however, these files are quite large and serve only as an intermediate step, so users can opt out of storing them.
Skipping Steps¶
Lines 36-41: Users can skip entire pipeline steps, mixing and matching options to suit their needs, as sketched below. Note that not all combinations of steps are compatible.
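For example, using the skip parameters that appear in the development configuration shown later in this documentation (values are illustrative):
skip.rseqc = true
skip.multiqc = false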
Process Configuration¶
While the first half of the configuration is dedicated to controlling the pipeline, the second half is dedicated to modifying specific steps. We call these process-specific settings or parameters.
[Screenshot: second half of the RNA-seq configuration file]
Descriptive Arguments¶
Variables for common parameters used in each process are explicitly typed out. For example, trim_galore.quality refers to the quality threshold used by Trim Galore, and feature_counts.id refers to the gene identifier featureCounts looks for in the GTF file header. These variable names match the names used in each tool's original documentation, so one can refer to that documentation for more information.
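In a configuration file, such parameters might be set as follows; the values shown are illustrative assumptions:
trim_galore.quality = 20
feature_counts.id = "gene_id"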
Xargs¶
Because some software tools have hundreds of arguments, they cannot all be listed in the configuration file. Therefore, another variable called xargs can be used to extend the flexibility of each tool: users can supply additional arguments as a single string that will be injected into the shell command.
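For example, using the featureCounts template discussed later in this documentation:
feature_counts.xargs = "--ignoreDup -d 50 -D 600"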
Ainj¶
Sometimes users may want to add processing steps to a process without modifying the pipeline script or template directly. This can be done with the variable called ainj, which injects a secondary shell command after the original template process.
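For example, to sort the count matrix produced by featureCounts by gene name; the command itself is illustrative:
feature_counts.ainj = "sort -n -k1,1 counts.raw.txt -o counts.raw.txt;"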
Pipeline Execution¶
When the configuration file is set, run the pipeline with:
./nextflow rnaseq.nf -c rnaseq.config
If the pipeline encounters an error, start from where it left off with:
./nextflow rnaseq.nf -resume -c rnaseq.config
Warning
If running Pipeliner on a high performance cluster environment such as Sun Grid Engine, ensure that Nextflow is initially executed on a node that allows for long-running processes.
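For example, on an SGE system the parent Nextflow process might be submitted as a long-running batch job. A hypothetical sketch; the project name and resource flags are assumptions:
# Hypothetical SGE submission; project and runtime flags are assumptions
qsub -P montilab -l h_rt=96:00:00 -b y ./nextflow rnaseq.nf -c rnaseq.config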
Output and Results¶
Once the pipeline has finished, all results will be directed to a single output folder specified in the configuration file.
[Screenshot: results directory]
Sample Folders¶
Each sample has its own folder holding the temporary and processed data created by each process. In the screenshot, one can see the gene counts file specific to sample ggal_gamma that was generated by HTSeq.
Expression Matrix¶
The expression matrix folder contains the final count matrix as well as other normalized gene by sample matrices.
Bam Files¶
If the configuration file is set to store bam files, they will show up in the results directory.
Alignment Index¶
If an alignment index is built from scratch, it will be saved to the results directory so that it can be reused during future pipeline runs.
Reports¶
After a successful run, two reports are generated: one produced from the original data before any pre-processing steps, and a final report produced after the entire pipeline has finished. This allows one to see any potential issues that existed in the data before the pipeline ran, as well as whether those issues were resolved afterwards.
Pipeline Structure¶
The file paths for all data fed to a pipeline are specified in the configuration file. To ease the development process, Pipeliner includes toy datasets for each of the pipelines. This example will cover the RNA-seq pipeline.
Note
Data for this pipeline is located in pipelines/toy_data/rna-seq
Users must provide the following files:
- Sequencing files or alignment files
- Comma-delimited file containing file paths to reads/bams
- Genome reference file
- Genome annotation file
Configuration File¶
The configuration file is where all file paths are specified and where pipeline processes are parameterized. The configuration can be broken into three sections: file paths, executor and compute resources, and pipeline options and parameters.
File Paths¶
The configuration file specifies where to find all of the input data. Additionally, it provides a path to an output directory where the pipeline will output results. The following is a typical example for the RNA-seq configuration file:
indir = "/Users/anthonyfederico/pipeliner/pipelines/toy_data/rna-seq"
outdir = "/Users/anthonyfederico/pipeliner/pipelines/rna-seq-results"
fasta = "${params.indir}/genome_reference.fa"
gtf = "${params.indir}/genome_annotation.gtf"
reads = "${params.indir}/ggal_reads.csv"
Executor and Compute Resources¶
An abstraction layer between Nextflow and Pipeliner logic enables platform independence and seamless compatibility with high-performance computing executors. This allows users to execute pipelines on their local machine or through a computing cluster by simply specifying the executor in the configuration file.
Pipeliner provides two base configuration files that can be inherited depending on whether a pipeline is being executed using local resources or a Sun Grid Engine (SGE) queuing system.
If the latter is chosen, pipeline processes will be automatically parallelized. Additionally, each individual process can be allocated specific computing resource instructions when nodes are requested.
Local config example:
process {
  executor = 'local'
}
Cluster computing config example:
process {
  executor = 'sge'
  scratch = true
  $trim_galore.clusterOptions = "-P montilab -l h_rt=24:00:00 -pe omp 8"
  $star_mapping.clusterOptions = "-P montilab -l h_rt=24:00:00 -l mem_total=94G -pe omp 16"
  $counting.clusterOptions = "-P montilab -l h_rt=24:00:00 -pe omp 8"
  $expression_matrix.clusterOptions = "-P montilab -l h_rt=24:00:00 -pe omp 8"
  $multiqc.clusterOptions = "-P montilab -l h_rt=24:00:00 -pe omp 8"
}
Pipeline Options and Parameters¶
The rest of the configuration file is dedicated to the different pipeline options and process parameters that can be specified. Some important examples include the following:
# General pipeline parameters
aligner = "hisat"
quantifier = "htseq"
# Process-specific parameters
htseq.type = "exon"
htseq.mode = "union"
htseq.idattr = "gene_id"
htseq.order = "pos"
Pipeline Script¶
Template Processes¶
Pipelines written in Nextflow consist of a series of processes. Processes specify data I/O and typically wrap around third-party software tools to process this data. Processes are connected through channels – asynchronous FIFO queues – which manage the flow of data throughout the pipeline.
Processes have the following basic structure:
process <name> {
  input:
  <process inputs>
  output:
  <process outputs>
  script:
  <user script to be executed>
}
Often, the script portion of a process is reused across various sequencing pipelines. To help standardize pipeline development and ensure good practices are propagated to all pipelines, template processes are defined and inherited by pipeline processes.
Note
Templates are located in pipelines/templates
For example, these two processes execute the same code:
# Without inheritance
process htseq {
  input:
  <process inputs>
  output:
  <process outputs>
  script:
  '''
  samtools view ${bamfiles} | htseq-count - ${gtf} \\
    --type ${params.htseq.type} \\
    --mode ${params.htseq.mode} \\
    --idattr ${params.htseq.idattr} \\
    --order ${params.htseq.order} \\
    > counts.txt
  '''
}
# With inheritance
process htseq {
  input:
  <process inputs>
  output:
  <process outputs>
  script:
  template 'htseq.sh'
}
Output¶
The RNA-seq pipeline output has the following basic structure:
/pipeliner/RNA-seq
└── /results
│
├── /sample_1
│ ├── /trimgalore | Trimmed Reads (.fq.gz) for sample_1
│ ├── /fastqc
│ ├── /rseqc
│ └── /htseq
│
├── /alignments | Where (.bams) are saved
├── /aligner
│ └── /index | Index created and used during mapping
│
├── /expression_matrix | Aggregated count matrix
├── /expression_set | An expression set (.rds) object
├── /reports | Aggregated report across all samples pre/post pipeliner
└── /logs | Process-related logs
Each sample will have its own directory with sample-specific data and results for each process. Additionally, sequencing alignment files and the indexed reference genome will be saved for future use if specified. Summary reports pre/post-workflow can be found inside the reports directory.
Existing Pipelines¶
RNA-seq¶
Check Reads check_reads¶
input: List of read files (.fastq)
output: None
script: Ensures correct format of sequencing read files
Genome Indexing hisat_indexing/star_indexing¶
input: Genome reference file (.fa); Genome annotation file (.gtf)
output: Directory containing indexed genome files
script: Uses either STAR or HISAT2 to build an indexed genome
Pre-Quality Check pre_fastqc¶
input: List of read files (.fastq)
output: Report files (.html)
script: Uses FastQC to check quality of read files
Pre-MultiQC pre_multiqc¶
input: Log files (.log)
output: Summary report file (.html)
script: Uses MultiQC to generate a summary report
Read Trimming trim_galore¶
input: List of read files (.fastq)
output: Trimmed read files (.fastq); Report files (.html)
script: Trims low-quality reads with Trim Galore and checks quality with FastQC
Read Mapping hisat_mapping/star_mapping¶
input: List of read files (.fastq); Genome annotation file (.gtf); Directory containing indexed reference genome files
output: A list of alignment files (.bam); Log files (.log)
script: Uses either STAR or HISAT2 to align reads to a reference genome
Reformat Reference gtftobed¶
input: Genome annotation file (.gtf)
output: Genome annotation file (.bed)
script: Converts genome annotation file from GTF to BED format
Mapping Quality rseqc¶
input: A list of alignment files (.bam)
output: Report files (.txt)
script: Uses RSeQC to check quality of alignment files
Quantification counting¶
input: A list of alignment files (.bam); Genome annotation file (.gtf)
output: Read counts (.txt); Log files (.txt)
script: Uses either StringTie, HTSeq, or featureCounts to quantify reads
Expression Matrix expression_matrix¶
input: A list of count files (.txt)
output: An expression matrix (.txt)
script: Reformats a list of count files into a genes x samples matrix
Expression Features expression_features¶
input: Genome annotation file (.gtf); An expression matrix (.txt)
output: Gene feature data (.txt)
script: Parses the genome annotation file for gene feature data
Expression Set expression_set¶
input: An expression matrix (.txt); Gene feature data (.txt); Sample phenotypic data (.txt)
output: An expression set object (.rds)
script: Creates an expression set object with eData, fData, and pData attributes
Summary Report multiqc¶
input: Log files and summary reports from all processes
output: A summary report (.html)
script: Uses MultiQC to generate a summary report
scRNA-seq¶
Check Reads check_reads¶
input: List of read files (.fastq)
output: None
script: Ensures correct format of sequencing read files
Genome Indexing hisat_indexing/star_indexing¶
input: Genome reference file (.fa); Genome annotation file (.gtf)
output: Directory containing indexed genome files
script: Uses either STAR or HISAT2 to build an indexed genome
Quality Check fastqc¶
input: List of read files (.fastq)
output: Report files (.html)
script: Uses FastQC to check quality of read files
Whitelist whitelist¶
input: List of read files (.fastq)
output: A table of whitelisted barcodes (.txt)
script: Uses UMI-tools to extract and identify true cell barcodes
Extract extract¶
input: List of read files (.fastq); A table of whitelisted barcodes (.txt)
output: Extracted read files (.fastq)
script: Uses UMI-tools to extract barcodes from reads and append them to read names
Read Mapping hisat_mapping/star_mapping¶
input: List of read files (.fastq); Genome annotation file (.gtf); Directory containing indexed reference genome files
output: A list of alignment files (.bam); Log files (.log)
script: Uses either STAR or HISAT2 to align reads to a reference genome
Reformat Reference gtftobed¶
input: Genome annotation file (.gtf)
output: Genome annotation file (.bed)
script: Converts genome annotation file from GTF to BED format
Mapping Quality rseqc¶
input: A list of alignment files (.bam)
output: Report files (.txt)
script: Uses RSeQC to check quality of alignment files
Quantification counting¶
input: A list of alignment files (.bam); Genome annotation file (.gtf)
output: Read counts (.txt); Log files (.txt)
script: Uses featureCounts to quantify reads
Summary Report multiqc¶
input: Log files and summary reports from all processes
output: A summary report (.html)
script: Uses MultiQC to generate a summary report
DGE¶
Quantification counting¶
input: A list of alignment files (.bam); Genome annotation file (.gtf)
output: Read counts (.txt); Log files (.txt)
script: Uses featureCounts to quantify reads
Expression Matrix expression_matrix¶
input: A list of count files (.txt)
output: An expression matrix (.txt)
script: Reformats a list of count files into a genes x samples matrix
Sample Renaming rename_samples¶
input: An expression matrix (.txt)
output: An expression matrix (.txt)
script: Renames samples in the expression matrix based on a user-supplied table
Summary Report multiqc¶
input: Log files and summary reports from all processes
output: A summary report (.html)
script: Uses MultiQC to generate a summary report
Extending Pipelines¶
General Workflow¶
The framework provides multiple resources for users to extend and create sequencing pipelines. The first is toy datasets for all available pipelines, including sequencing files, alignment files, genome reference and annotation files, as well as phenotypic data. Additionally, there are pre-defined scripts, processes, and configuration files that can be inherited and easily modified for various pipelines. Together, these allow users to rapidly develop flexible and scalable pipelines. Lastly, there is a testing module enabling users to frequently test a series of different configurations with each change to the codebase.
Configuration Inheritance¶
An important property of configuration files is that they are inheritable. This allows developers to focus solely on the configuration components that change with each pipeline execution. Typically, a configuration file has four components:
Executor parameters:
process {
  executor = "local"
}
Input data file paths:
indir = "/Users/anthonyfederico/pipeliner/pipelines/toy_data/rna-seq"
outdir = "/Users/anthonyfederico/pipeliner/pipelines/rna-seq-results"
Pipeline parameters:
aligner = "hisat"
quantifier = "htseq"
Process-specific parameters:
htseq.type = "exon"
htseq.mode = "union"
htseq.idattr = "gene_id"
htseq.order = "pos"
When developing, typically the only parameters that change are the pipeline parameters, since these exercise the full scope of a pipeline's flexibility. Therefore, the development configuration file will look something like the following:
// paired / hisat / featurecounts
includeConfig "local.config"
includeConfig "dataio.config"
paired = true
aligner = "hisat"
quantifier = "featurecounts"
skip.counting = false
skip.rseqc = false
skip.multiqc = false
skip.eset = false
includeConfig "parameters.config"
Template Process Injections¶
Note
Sometimes it’s better to create a new template rather than heavily modify an existing one
Each pipeline is essentially a series of modules, connected through minimal Nextflow scripting, that execute pre-defined template processes. While templates are generally defined to be applicable to multiple pipelines and are parameterized in a configuration file, they have two additional components contributing to their flexibility.
The following is an example of a template process for the third-party software tool featureCounts:
 1 | featureCounts \\
 2 |
 3 | # Common flags directly defined by the user
 4 | -T ${params.feature_counts.cpus} \\
 5 | -t ${params.feature_counts.type} \\
 6 | -g ${params.feature_counts.id} \\
 7 |
 8 | # Flags handled by the pipeline
 9 | -a ${gtf} \\
10 | -o "counts.raw.txt" \\
11 |
12 | # Arguments indirectly defined by the user
13 | ${feature_counts_sargs} \\
14 |
15 | # Extra arguments
16 | ${params.feature_counts.xargs} \\
17 |
18 | # Input data
19 | ${bamfiles};
20 |
21 | # After injection
22 | ${params.feature_counts.ainj}
- Lines 4-6
- These are common keyword arguments that can be set to string/int/float values by the user and passed directly from the configuration file to the template. The params prefix means the variable is initialized in the configuration file.
- Lines 9-10
- These are flags that are typically non-dynamic and handled internally by the pipeline.
- Line 13
- These are common flags that must be indirectly defined by the user. For example, featureCounts requires a -p flag for paired reads. Because params.paired is a boolean, it makes more sense for the pipeline to create a string of supplemental arguments indirectly defined by the configuration file:
feature_counts_sargs = ""
if (params.paired) {
  feature_counts_sargs = feature_counts_sargs.concat("-p ")
}
- Line 16
- These are uncommon keyword arguments or flags that can be passed directly from the configuration file to the template. Because some software tools include hundreds of arguments, we explicitly state common arguments but allow the user to insert any number of additional arguments to maximize flexibility. For example, the user might want to perform a one-off test of the pipeline where they remove duplicate reads and only count fragments with a length between 50 and 600 base pairs. These options can be injected into the template by simply defining params.feature_counts.xargs = "--ignoreDup -d 50 -D 600" in the configuration file.
- Line 19
- These are required arguments, such as input data, handled internally by the pipeline.
- Line 22
- These are code injections, typically one-liner cleanup commands, that can be injected after the main script of a template. For example, the output of featureCounts is a genes x samples matrix, and the user may want to sort its rows by gene name. Setting params.feature_counts.ainj to "sort -n -k1,1 counts.raw.txt -o counts.raw.txt;" (using sort -o so the file can safely be sorted in place) would accomplish such a task.
After parameterization, the final result would look something like this:
featureCounts -T 1 -t "exon" -g "gene_id" \
-a "path/to/reference_annotation.gtf" \
-o "counts.raw.txt" \
-p --ignoreDup -d 50 -D 600 \
s1.bam s2.bam s3.bam;
sort -n -k1,1 counts.raw.txt -o counts.raw.txt;
Testing Module¶
Each major change to a pipeline should be followed with a series of tests. Because pipelines are so flexible, it’s infeasible to manually test even a limited set of typical configurations. To solve this problem we include an automated testing module.
Users can automatically test a series of configuration files by specifying a directory of user-defined tests:
/pipeliner
└── /tests
└── /configs
└── /rnaseq
├── /t1.config
├── /t2.config
└── /t3.config
To run this series of tests, users can execute python pipeliner/tests/test.py rnaseq, which will search the directory pipeliner/tests/configs/rnaseq and automatically pair and run each configuration file with the pipeline script named rnaseq.nf.
Note
The directory name of the tests must match the name of the pipeline script they are paired with
Warning
You must execute test.py from the /pipelines directory because Nextflow requires its executable to be in the working directory. Therefore, the testing command will look like python ../tests/test.py rnaseq