AssemblerFlow¶
A NextFlow pipeline assembler for genomics.
Overview¶
Assemblerflow is an assembler of pipelines written in nextflow for the analysis of genomic data. The premise is simple:
Software tools are container blocks → Build your lego-like pipeline → Execute it (almost) anywhere.
What is Nextflow¶
If you do not know nextflow, be sure to check it out. It’s an awesome framework based on the dataflow programming model used for building parallelized, scalable and reproducible workflows using software containers. It provides an abstraction layer between the execution and the logic of the pipeline, which means that the same pipeline code can be executed on multiple platforms, from a local laptop to clusters managed with SLURM, SGE, etc. These are quite attractive features since genomic pipelines are increasingly executed on large computer clusters to handle large volumes of data and/or tasks. Moreover, portability and reproducibility are becoming central pillars in modern data science.
What Assemblerflow does¶
Assemblerflow is a python engine that automatically builds nextflow pipelines by assembling pre-made, ready-to-use components. These components are modular pieces of software or scripts, such as fastqc, trimmomatic, spades, etc., that are written for nextflow and have a set of attributes, such as input and output types, parameters and directives. This modular nature allows them to be freely connected as long as they respect some basic rules, such as the input type of a component matching the output type of the preceding component. In this way, nextflow processes need to be written only once, and assemblerflow is the magic glue that connects them, handling the linking and forking of channels automatically. Moreover, each component is associated with a docker image, which means that there is no need to install any dependencies at all and all software runs in a transparent and reliable box. To illustrate:
A linear genome assembly pipeline can be easily built using assemblerflow with the following pipeline string:
trimmomatic fastqc spades
This will generate all the necessary files to run the nextflow pipeline on any linux system that has nextflow and a container engine installed.
You can easily add more components to perform assembly polishing, in this case pilon:
trimmomatic fastqc spades pilon
If a new assembler comes along and you want to switch that component in the pipeline, it's as easy as replacing spades (or any other component):
trimmomatic fastqc skesa pilon
And you can also fork the output of a component into multiple ones. For instance, we could annotate the resulting assemblies with multiple software:
trimmomatic fastqc spades pilon (abricate | prokka)
Or fork the execution of a pipeline early on to compare different software:
trimmomatic fastqc (spades pilon | skesa pilon)
This will fork the output of fastqc into spades and skesa, and the pipeline will proceed independently in these two new ‘lanes’.
Directives for each process can be dynamically set when building the pipeline, such as the cpu/RAM usage or the software version:
trimmomatic={'cpus':'4'} fastqc={'version':'0.11.5'} skesa={'memory':'10GB'} pilon (abricate | prokka)
And extra input can be directly inserted in any part of the pipeline. For example, it is possible to assemble genomes from both fastq files and SRR accessions (downloaded from public databases) in a single workflow:
download_reads trimmomatic={'extra_input':'reads'} fastqc skesa pilon
This pipeline can be executed by providing a file with accession numbers (--accessions parameter by default) and fastq reads, using the --reads parameter defined with the extra_input directive.
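For example, a hypothetical invocation (the accessions.txt file name is illustrative, with one accession per line) could look like:
nextflow run my_pipe.nf --accessions accessions.txt --reads "fastq/*_{1,2}.*"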
Who is Assemblerflow for¶
Assemblerflow can be useful for bioinformaticians with varied levels of expertise who need to execute genomic pipelines often, and potentially on different platforms. Building and executing pipelines requires no programming knowledge, but familiarity with nextflow is highly recommended to take full advantage of the generated pipelines.
At the moment, the available pre-made processes are mainly focused on bacterial genome assembly simply because that was how we started. However, our goal is to expand the library of existing components to other commonly used tools in the field of genomics and to widen the applicability and usefulness of assemblerflow pipelines.
Why not just write a Nextflow pipeline?¶
In many cases, building a static nextflow pipeline is sufficient for our goals.
However, when building our own pipelines, we often felt the need to add
dynamism to this process, particularly if we take into account how fast new
tools arise and existing ones change. Our biological goals also change over
time and we might need different pipelines to answer different questions.
Assemblerflow makes this very easy by providing a set of pre-made, ready-to-use components that can be freely assembled. By using components (fastqc, trimmomatic) as its atomic elements, very complex pipelines that take full advantage of nextflow can be built with little effort. Moreover, these components have explicit and standardized input and output types, which means that the addition of new modules does not require any changes to the existing code base. They just need to take into account how data will be received by the process and how data may be emitted from the process, to ensure that it can link with other components.
However, why not both?
Assemblerflow generates a complete Nextflow pipeline file, which can be used as a starting point for your customized processes!
Installation¶
User installation¶
Assemblerflow is available as a bioconda package, which already comes with nextflow:
conda install assemblerflow
Alternatively, you can install only Assemblerflow, via pip:
pip install assemblerflow
You will also need a container engine (see Container engine below).
Container engine¶
All components of assemblerflow are executed in docker containers, which means that you’ll need to have a container engine installed. The container engines available are the ones supported by Nextflow:
- Docker,
- Singularity
- Shifter (undocumented)
If you already have any one of these installed, you are good to go. If not, you’ll need to install one. We recommend singularity because it does not require the processes to run on a separate root daemon.
Singularity¶
Singularity is available as a bioconda package. Simply install it, and it’s ready to use:
conda install singularity
Docker¶
Docker can be installed following the instructions on the website: https://www.docker.com/community-edition#/download. To run docker as a non-root user, you’ll need to follow the instructions at: https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user
Developer installation¶
If you are looking to contribute to assemblerflow or simply interested in tweaking it, clone the github repository and its submodule and then run setup.py:
git clone https://github.com/ODiogoSilva/assemblerflow.git
cd assemblerflow
git submodule update --init --recursive
python3 setup.py install
Basic Usage¶
Assemblerflow currently has one execution mode, build, which is used to build the nextflow pipeline. More execution modes are slated for release.
Assembling a pipeline¶
Pipelines can be generated using the build execution mode of assemblerflow and the -t parameter to specify the components inside quotes:
assemblerflow build -t "trimmomatic fastqc spades" -o my_pipe.nf
All components should be written inside quotes and be space separated. This command will generate a linear pipeline with three components in the current working directory (for more features and tips on how pipelines can be built, see the Pipeline building section). A linear pipeline means that there are no bifurcations between components, and the input data will flow linearly. In this particular case, the input data of the pipeline will be paired-end fastq files, since that is the input data type of the first component, trimmomatic.
The rationale of how the data flows across the pipeline is simple and intuitive. Data enters a component and is processed in some way, which may result in the creation of results (stored in the results directory) and reports (stored in the reports directory) (see Results and reports below). If that component has an output_type, it will feed the processed data into the next component (or components), and this will be repeated until the end of the pipeline.
If you are interested in checking the pipeline DAG tree, open the my_pipe.html file (same name as the pipeline, with the html extension) in any browser.
The integrity_coverage component is a dependency of trimmomatic, so it was automatically added to the pipeline.
Note
Not all pipeline variations will work. You always need to ensure that the output type of a component matches the input type of the next component, otherwise assemblerflow will exit with an error.
Pipeline directory¶
In addition to the main nextflow pipeline file (my_pipe.nf), assemblerflow will write several auxiliary files that are necessary for the pipeline to run. The contents of the directory should look something like this:
$ ls
bin lib my_pipe.nf params.config templates
containers.config my_pipe.html nextflow.config profiles.config resources.config user.config
You do not have to worry about most of these files. However, the *.config files can be modified to change several aspects of the pipeline run (see Pipeline configuration for more details). Briefly:
- params.config: Contains all the available parameters of the pipeline (see Parameters below). These can be changed here, or provided directly on run-time (e.g.: nextflow run --fastq value).
- resources.config: Contains the resource directives of the pipeline processes, such as cpus, allocated RAM and other nextflow process directives.
- containers.config: Specifies the container and version tag of each process in the pipeline.
- profiles.config: Contains a number of predefined profiles of executor and container engine.
- user.config: Empty configuration file that is not over-written if you build another pipeline in the same directory. Used to set persistent configurations across different pipelines.
Parameters¶
The parameters of the pipeline can be viewed by running the pipeline file with nextflow and using the --help option:
$ nextflow my_pipe.nf --help
N E X T F L O W ~ version 0.28.0
Launching `my_pipe.nf` [stupefied_booth] - revision: 504208431f
============================================================
A S S E M B L E R F L O W
============================================================
Built using assemblerflow v1.0.2
Usage:
nextflow run my_pipe.nf
--fastq Path expression to paired-end fastq files. (default: fastq/*_{1,2}.*) (integrity_coverage)
--genomeSize Genome size estimate for the samples. It is used to estimate the coverage and other assembly parameters and checks (default: 2.1) (integrity_coverage)
--minCoverage Minimum coverage for a sample to proceed. Can be set to 0 to allow any coverage (default: 15) (integrity_coverage)
--adapters Path to adapters files, if any (default: None) (trimmomatic;fastqc)
--trimSlidingWindow Perform sliding window trimming, cutting once the average quality within the window falls below a threshold (default: 5:20) (trimmomatic)
--trimLeading Cut bases off the start of a read, if below a threshold quality (default: 3) (trimmomatic)
--trimTrailing Cut bases off the end of a read, if below a threshold quality (default: 3) (trimmomatic)
--trimMinLength Drop the read if it is below a specified length (default: 55) (trimmomatic)
--spadesMinCoverage The minimum number of reads to consider an edge in the de Bruijn graph during the assembly (default: 2) (spades)
--spadesMinKmerCoverage Minimum contigs K-mer coverage. After assembly only keep contigs with reported k-mer coverage equal or above this value (default: 2) (spades)
--spadesKmers If 'auto' the SPAdes k-mer lengths will be determined from the maximum read length of each assembly. If 'default', SPAdes will use the default k-mer lengths. (default: auto) (spades)
All these parameters are related to the components of the pipeline. However, the main input parameter (or parameters) of the pipeline is always available. Since this pipeline starts with fastq paired-end files as the main input, the --fastq parameter is available. If the pipeline started with any other input type, or with more than one input type, the appropriate parameters would appear. These parameters can be provided at run-time or edited in the params.config file.
Executing the pipeline¶
Most parameters in assemblerflow’s components already come with sensible defaults, which means that usually you’ll only need to provide a small number of arguments. In the example above, --fastq is the only required parameter. I have placed fastq files in the data directory:
$ ls data
sample_1.fastq.gz sample_2.fastq.gz
We’ll need to provide the pattern to the fastq files. This pattern is perhaps a bit confusing at first, but it’s necessary for the correct inference of the read pairs:
nextflow run my_pipe.nf --fastq "data/*_{1,2}.*"
In this case, the pairs are separated by the “_1.” or “_2.” substring, which leads to the pattern *_{1,2}.*. Another common nomenclature for paired fastq files is something like sample_R1_L001.fastq.gz. In this case, an acceptable pattern would be *_R{1,2}_*.
Important
Note the quotes around the fastq path pattern. These quotes are necessary to allow nextflow to resolve the pattern, otherwise your shell might try to resolve it and provide the wrong input to nextflow.
Changing executor and container engine¶
The default run mode of an assemblerflow pipeline is to execute locally using the singularity container engine. In nextflow terms, this is equivalent to having executor = "local" and singularity.enabled = true. If you want to change these settings, you can modify the nextflow.config file, or use one of the available profiles in the profiles.config file. These profiles provide combinations of common <executor>_<container_engine> pairs that are supported by nextflow. Therefore, if you want to run the pipeline on a cluster with SLURM and shifter, you’ll just need to specify the slurm_shifter profile:
nextflow run my_pipe.nf --fastq "data/*_{1,2}.*" -profile slurm_shifter
Common executors include:
- slurm
- sge
- lsf
- pbs
Other container engines are:
- docker
- singularity
- shifter
Docker images¶
All components of assemblerflow are executed in containers, which means that the first time they are executed on a machine, the corresponding image will have to be downloaded. In the case of docker, images are pulled and stored in /var/lib/docker by default. In the case of singularity, the nextflow.config generated by assemblerflow sets the cache directory for the images to $HOME/.singularity_cache. Note that while an image is downloading, nextflow does not display any informative message, except for singularity, where you’ll get something like:
Pulling Singularity image docker://ummidock/trimmomatic:0.36-2 [cache /home/diogosilva/.singularity_cache/ummidock-trimmomatic-0.36-2.img]
So, if a process seems to take too long to run the first time, it’s probably because the image is being downloaded.
Results and reports¶
As the pipeline runs, processes may write result and report files to the results and reports directories, respectively. For example, the reports of the pipeline above would look something like this:
reports
├── coverage_1_1
│ └── estimated_coverage_initial.csv
├── fastqc_1_3
│ ├── FastQC_2run_report.csv
│ ├── run_2
│ │ ├── sample_1_0_summary.txt
│ │ └── sample_1_1_summary.txt
│ ├── sample_1_1_trim_fastqc.html
│ └── sample_1_2_trim_fastqc.html
└── status
├── master_fail.csv
├── master_status.csv
└── master_warning.csv
The estimated_coverage_initial.csv file contains a very rough coverage estimation for each sample, the fastqc* directory contains the html reports and summary files of FastQC for each sample, and the status directory contains a log of the status, warnings and fails of each process for each sample.
The actual results for each process that produces them are stored in the results directory:
results
├── assembly
│ └── spades_1_4
│ └── sample_1_trim_spades3111.fasta
└── trimmomatic_1_2
├── sample_1_1_trim.fastq.gz
└── sample_1_2_trim.fastq.gz
If you are interested in checking the actual environment where the execution of a particular process occurred for any given sample, you can inspect the pipeline_stats.txt file in the root of the pipeline directory. This file contains rich information about the execution of each process, including the working directory:
task_id hash process tag status exit start container cpus duration realtime queue %cpu %mem rss vmem
5 7c/cae270 trimmomatic_1_2 sample_1 COMPLETED 0 2018-04-12 11:42:29.599 docker:ummidock/trimmomatic:0.36-2 2 1m 25s 1m 17s - 329.3% 1.1% 1.5 GB 33.3 GB
The hash column contains the start of the current working directory of that process. In this example, the directory would be:
work/7c/cae270*
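That work directory contains the executed .command.sh script, the .command.log file and the dotfiles described in the Dotfiles section, so a quick way to inspect a run is, for example:
$ ls -a work/7c/cae270*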
Pipeline building¶
Assemblerflow offers a few extra features when building pipelines using the build execution mode.
Raw input types¶
Forks¶
The output of any component in an assemblerflow pipeline can be forked into two or more components, using the following fork syntax:
trimmomatic fastqc (spades | skesa)
In this example, the output of fastqc will be forked into two new lanes, which will proceed independently from each other. In this syntax, a fork is triggered by the ( symbol (and the corresponding closing )), and lanes are separated by the | symbol. There is no limit to the number of forks or lanes a pipeline can have. For instance, we could add more components after the skesa module, including another fork:
trimmomatic fastqc (spades | skesa pilon (abricate | prokka | chewbbaca))
In this example, data will be forked after fastqc into two new lanes, processed by spades and skesa. In the skesa lane, data will continue to flow into the pilon component, and its output will fork into three new lanes.
It is also possible to start a fork at the beginning of the pipeline, which basically means that the pipeline will have multiple starting points. If we want to provide the raw input to multiple processes, the fork syntax can start at the beginning of the pipeline:
(seq_typing | trimmomatic fastqc (spades | skesa))
In this case, since both initial components (seq_typing and integrity_coverage) receive fastq files as input, the data provided via the --fastq parameter will be forked and provided to both processes.
Note
Some components have dependencies which need to be included earlier in the pipeline. For instance, trimmomatic requires integrity_coverage and pilon requires assembly_mapping. By default, assemblerflow will insert any missing dependencies right before the process, which is why these components appear in the figures above.
Warning
Pay special attention to the syntax of the pipeline string when using forks. When unable to parse it, assemblerflow will do its best to inform you where the parsing error occurred.
Directives¶
Several directives with information on cpu usage, RAM, version, etc. can be specified for each individual component when building the pipeline, using the ={} notation. These directives are written to the resources.config and containers.config files that are generated in the pipeline directory. You can pass any of the directives already supported by nextflow (https://www.nextflow.io/docs/latest/process.html#directives), but the most commonly used include:
- cpus
- memory
- queue
In addition, you can also pass the container and version directives, which are parsed by assemblerflow to dynamically change the container and/or version tag of any process.
Here is an example where we specify cpu usage, allocated memory and container version in the pipeline string:
assemblerflow build -t "fastqc={'version':'0.11.5'} \
trimmomatic={'cpus':'2'} \
spades={'memory':'\'10GB\''}" -o my_pipeline.nf
When a directive is not specified, it will assume the default value of the nextflow directive.
Warning
Take special care not to include any white space characters inside the directives field. A common mistake is to write directives like fastqc={'version': '0.11.5'} (note the space after the colon).
Note
The values specified in these directives are placed in the
respective config files exactly as they are. For instance,
spades={'memory':'10GB'}"
will appear in the config as
spades.memory = 10Gb
, which will raise an error in nextflow because
10Gb
should be a string. Therefore, if you want a string you’ll need to add
the '
as in this example: spades={'memory':'\'10GB\''}"
. The
reason why these directives are not automatically converted is to allow
the specification of dynamic computing resources, such as
spades={'memory':'{10.Gb*task.attempt}'}"
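To make the distinction concrete, here is a sketch of how these two variants could end up in the generated resources.config (the spades_1_3 identifier is hypothetical, and the exact selector syntax depends on the nextflow version):
process {
    // quoted value: rendered as a string
    $spades_1_3.memory = '10GB'
    // dynamic value: rendered literally as a closure
    // $spades_1_3.memory = {10.GB*task.attempt}
}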
Extra inputs¶
By default, only the first process (or processes) in a pipeline will receive the raw input data provided by the user. However, the extra_input special directive allows one or more processes to receive input from an additional parameter that is provided by the user:
reads_download integrity_coverage={'extra_input':'local'} trimmomatic spades
The default main input of this pipeline is a text file with accession numbers for the reads_download component. The extra_input creates a new parameter, named local in this example, that allows us to provide additional input data to the integrity_coverage component directly:
nextflow run pipe.nf --accessions accession_list.txt --local "fastq/*_{1,2}.*"
What will happen in this pipeline is that the fastq files provided to the integrity_coverage component will be mixed with the ones provided by the reads_download component. Therefore, if we provide 10 accessions and 10 fastq samples, we'll end up with 20 samples being processed by the end of the pipeline.
It is important to note that the extra input parameter expects data compliant with the input type of the process. If files other than fastq files were provided in the pipeline above, this would result in a pipeline error.
If the extra_input directive is used on a component that has a different input type from the first component in the pipeline, it is possible to use the default value:
trimmomatic spades abricate={'extra_input':'default'}
In this case, the input type of the first component is fastq and the input type of abricate is fasta. The default value will make available the default parameter for fasta raw input, which is fasta:
nextflow run pipe.nf --fastq "fastq/*_{1,2}.*" --fasta "fasta/*.fasta"
Pipeline file¶
Instead of providing the pipeline components via the command line, you can specify them in a text file:
# my_pipe.txt
trimmomatic fastqc spades
And then provide the pipeline file to the -t parameter:
assemblerflow build -t my_pipe.txt -o my_pipe.nf
Pipeline files are usually more readable, particularly when they become more complex. Consider the following example:
integrity_coverage (
spades={'memory':'\'50GB\''} |
skesa={'memory':'\'40GB\'','cpus':'4'} |
trimmomatic fastqc (
spades pilon (abricate={'extra_input':'default'} | prokka) |
skesa pilon (abricate | prokka)
)
)
In addition to being more readable, it is also easier to edit, re-use and share.
Pipeline configuration¶
When a nextflow pipeline is built with assemblerflow, a number of configuration files are automatically generated in the same directory. They are all imported at the end of the nextflow.config file and are sorted by their configuration role. All configuration files are overwritten if you build another pipeline in the same directory, with the exception of the user.config file, which is meant to be a persistent configuration file.
Parameters¶
The params.config file includes all available parameters for the pipeline and their respective default values. Most of these parameters already contain sensible defaults.
Resources¶
The resources.config file includes the majority of the directives provided for each process, including cpus and memory. You’ll note that each process name has a suffix like _1_1, which is a unique process identifier composed of <lane>_<process_number>. This ensures that even when the same component is specified multiple times in a pipeline, you’ll still be able to set directives for each one individually.
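For instance, if a pipeline used pilon in two lanes of a fork, a sketch of the relevant resources.config entries could look like this (the identifiers are hypothetical):
process {
    $pilon_1_5.memory = '7GB'
    $pilon_2_5.memory = '7GB'
}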
Containers¶
The containers.config file includes the container directive for each process in the pipeline. These containers are retrieved from dockerhub if they do not exist locally yet. You can change the container string to any other value, but it should point to an image that exists on dockerhub or locally.
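As a sketch, an entry in containers.config could look like this (identifier and image taken from the trace example shown earlier):
process {
    $trimmomatic_1_2.container = 'ummidock/trimmomatic:0.36-2'
}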
Profiles¶
The profiles.config file includes a set of pre-made profiles with all possible combinations of executors and container engines. You can add new ones or modify existing ones.
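For illustration, a profile such as slurm_shifter would combine an executor with a container engine along these lines (a sketch, not the exact generated file):
profiles {
    slurm_shifter {
        process.executor = 'slurm'
        shifter.enabled = true
    }
}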
User configurations¶
The user.config file is a configuration file that is not overwritten when a new pipeline is built in the same directory. It can contain any configuration that is supported by nextflow and will override all other configuration files.
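For example, to keep a persistent setting across every pipeline built in this directory, you could place something like this in user.config (the value is illustrative):
executor {
    queueSize = 20
}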
Components¶
These are the currently available assemblerflow components, with a short description of their tasks. For more detailed information, follow the links of each component.
Read Quality Control¶
- Integrity_coverage: Tests the integrity of the provided FastQ files, provides the option to filter FastQ files based on the expected assembly coverage and provides information about the maximum read length and sequence encoding.
- FastQC: Runs FastQC on paired-end FastQ files.
- Trimmomatic: Runs Trimmomatic on paired-end FastQ files.
- Fastqc_trimmomatic: Runs Trimmomatic on paired-end FastQ files informed by the FastQC report.
- Check_coverage: Estimates the coverage for each sample and filters FastQ files according to a specified minimum coverage threshold.
Assembly¶
Post-assembly¶
- Process_spades: Processes the assembly output from Spades and performs filtering based on quality criteria of GC content, k-mer coverage and read length.
- Process_skesa: Processes the assembly output from Skesa and performs filtering based on quality criteria of GC content, k-mer coverage and read length.
- Assembly_mapping: Performs a mapping procedure of FastQ files onto their assembly and performs filtering based on quality criteria of read coverage and genome size.
- Pilon: Corrects and filters assemblies using Pilon.
Annotation¶
MLST¶
Reads typing¶
- Seq_typing: Determines the type of a given sample from a set of reference sequences.
- Patho_typing: In silico pathogenic typing from raw illumina reads.
Plasmids¶
- mapping_patlas: Performs read mapping and generates a JSON input file for pATLAS.
- mash_screen: Performs mash screen against a reference index plasmid database and generates a JSON input file for pATLAS. This component searches for containment of a given sequence in read sequencing data. However, if a different database is provided, it can use mash screen for other purposes.
- mash_dist: Executes mash dist against a reference index plasmid database and generates a JSON input file for pATLAS. This component calculates pairwise distances between sequences (one from the database and the query sequence). However, if a different database is provided, it can use mash dist for other purposes.
General orientation¶
Codebase structure¶
The most important elements of assemblerflow’s directory structure are:
- generator:
  - components: Contains the Process classes for each component
  - templates: Contains the nextflow jinja template files for each component
  - engine.py: The engine of assemblerflow that builds the pipeline
  - process.py: Contains the abstract Process class that is inherited by all component classes
  - pipeline_parser.py: Functions that parse and check the pipeline string
  - recipe.py: Class responsible for creating recipes
- templates: A git submodule of the templates repository that contains the template scripts for the components.
Code style¶
- Style: the code base of assemblerflow should adhere (as best it can) to the PEP8 style guidelines.
- Docstrings: code should be generally well documented following the numpy docstring style.
- Quality: there is also an integration with the codacy service to evaluate code quality, which is useful for detecting several coding issues that may appear.
Testing¶
Tests are performed using pytest and the source files are stored in the assemblerflow/tests directory. Tests must be executed in the root directory of the repository.
Documentation¶
Documentation source files are stored in the docs directory. The general configuration file is found in docs/conf.py and the entry point to the documentation is docs/index.html.
Process creation guidelines¶
Basic process creation¶
The addition of a new process to assemblerflow requires three main steps:
1. Create process template: Create a jinja2 template in assemblerflow.generator.templates with the nextflow code.
2. Create Process class: Create a Process subclass in assemblerflow.generator.process with information about the process (e.g., expected input/output, secondary inputs, etc.).
3. Add to available processes: Add the process class to the dictionary of available processes in assemblerflow.generator.engine.process_map.
Create process template¶
First, create the nextflow template that will be integrated into the pipeline as a process. This file must be placed in assemblerflow.generator.templates and have the .nf extension. In order to allow the template to be dynamically added to a pipeline file, we use the jinja2 template language to substitute key variables in the process, such as input/output channels. A minimal example, created as a my_process.nf file, is as follows:
process myProcess_{{ pid }} {

    {% include "post.txt" ignore missing %}

    input:
    set sample_id, <data> from {{ input_channel }}

    // The output is optional
    output:
    set sample_id, <data> into {{ output_channel }}

    {% with task_name="my_process" %}
    {%- include "compiler_channels.txt" ignore missing -%}
    {% endwith %}

    """
    <process code/commands>
    """
}

{{ forks }}
The fields surrounded by curly brackets are jinja placeholders that will be dynamically interpolated when building the pipeline, ensuring that the processes and potential forks correctly link with each other. This example contains all placeholder variables that are currently supported by assemblerflow:
- pid (Mandatory): This placeholder is used as a unique process identifier that prevents issues from process duplication in the pipeline. It is also important for unique secondary output channels, such as those that send run status information (see Status channels).
- include "post.txt" (Mandatory): Inserts beforeScript and afterScript statements into the process that set up environment variables and a series of dotfiles for the process to log its status, warnings, fails and reports (see Dotfiles for more information). It also includes scripts for sending requests to REST APIs (only when certain pipeline parameters are used).
- input_channel (Mandatory): All processes must include one, and only one, input channel. In most cases, this channel should be defined as a two element tuple that contains the sample ID and then the actual data file/stream. We suggest naming the sample ID variable sample_id as a standard. If another variable name is used and you include the compiler_channels.txt in the process, you'll need to change the sample ID variable (see Sample ID variable).
- output_channel (Optional): Terminal processes may skip the output channel entirely. However, if you want to link the main output of this process with subsequent ones, this placeholder must be used exactly once. Like the input channel, this channel should be defined as a two element tuple with the sample ID and the data. The sample ID must match the one specified in the input_channel.
- include "compiler_channels.txt" (Mandatory): This will include the special channels that compile the status/logging of the processes throughout the pipeline. You must include the whole block (see Status channels):
{% with task_name="my_process" %}
{%- include "compiler_channels.txt" ignore missing -%}
{% endwith %}
- forks (Conditional): Inserts potential forks of the main output channel. It is mandatory if the output_channel is set.
As an example of a complete process, this is the template of spades.nf:
process spades_{{ pid }} {

    // Send POST request to platform
    {% include "post.txt" ignore missing %}

    tag { fastq_id + " getStats" }
    publishDir 'results/assembly/spades/', pattern: '*_spades.assembly.fasta', mode: 'copy'

    input:
    set fastq_id, file(fastq_pair), max_len from {{ input_channel }}.join(SIDE_max_len_{{ pid }})
    val opts from IN_spades_opts
    val kmers from IN_spades_kmers

    output:
    set fastq_id, file('*_spades.assembly.fasta') optional true into {{ output_channel }}
    set fastq_id, val("spades"), file(".status"), file(".warning"), file(".fail") into STATUS_{{ pid }}
    file ".report.json"

    when:
    params.stopAt != "spades"

    script:
    template "spades.py"
}

{{ forks }}
Create Process class¶
The process class will contain the information that assemblerflow uses to build the pipeline and assess potential conflicts/dependencies between processes. This class should be created in one of the category files in the assemblerflow.generator.components module (e.g.: assembly.py). If the new component does not fit in any of the existing categories, create a new one that imports assemblerflow.generator.process.Process and add your new class. This class should inherit from the Process base class:
class MyProcess(Process):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        self.input_type = "fastq"
        self.output_type = "fasta"
This is the simplest working example of a process class, which basically needs to inherit the parent class attributes (the super part). Then, we only need to define the expected input and output types of the process. There are no limitations on the input/output types. However, a pipeline will only build successfully when all processes correctly link their output with the input type of the next process.
Depending on the process, other attributes may be required:
- Parameters: Parameters provided by the user to be used in the process.
- Secondary inputs: Channels created from parameters provided by the user.
- Secondary Link start and Link end: Secondary links that connect secondary information between two processes.
- Dependencies: List of other processes that may be required for the current process.
- Directives: Default information for RAM/CPU/Container directives and more.
Add to available processes¶
The final step is to add your new process to the list of available processes.
This list is defined in assemblerflow.generator.engine.process_map
module, which is a dictionary
mapping the process template name to the corresponding template class:
process_map = {
    <other_process>
    "my_process_template": process.MyProcess
}
Note that the template string does not include the .nf
extension.
Process attributes¶
This section describes the main attributes of the Process class: what they do and how they impact the pipeline generation.
Input/Output types¶
The input_type and output_type attributes set the expected type of input and output of the process. There are no limitations on the type of input/output that can be provided. However, processes will only link when the output of one process matches the input of the subsequent process (unless the ignore_type attribute is set to True). Otherwise, assemblerflow will raise an exception stating that the two processes could not be linked.
Note
The input/output types currently in use are fastq and fasta.
Parameters¶
The params attribute sets the parameters that can be used by the process. For each parameter, a default value and a description should be provided. The default value will be set in the params.config file in the pipeline directory and the description will be used to generate the custom help message of the pipeline:
self.params = {
    "genomeSize": {
        "default": 2.1,
        "description": "Expected genome size (default: params.genomeSize)"
    },
    "minCoverage": {
        "default": 15,
        "description": "Minimum coverage to proceed (default: params.minCoverage)"
    }
}
These parameters can be simple values that are not fed into any channel, or they can be automatically set to a secondary input channel via Secondary inputs (see below). They can be specified when running the pipeline like any nextflow parameter (e.g.: --genomeSize 5) and used in the nextflow process as usual (e.g.: params.genomeSize).
Note
These pairs are then used to populate the params.config file that is generated in the pipeline directory. Note that the values are replaced literally in the config file. For instance, "genomeSize": 2.1 will appear as genomeSize = 2.1, whereas "adapters": "'None'" will appear as adapters = 'None'. If you want a value to appear as a string, the double and single quotes are necessary.
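Following the note above, these two example parameters would appear in params.config as:
genomeSize = 2.1
adapters = 'None'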
Secondary inputs¶
Any process can receive one or more input channels in addition to the main
channel. These are particularly useful when the process needs to receive
additional options from the parameters
scope of nextflow.
These additional inputs can be specified via the
secondary_inputs
attribute,
which should store a list of dictionaries (a dictionary for each input). Each dictionary should
contains a key:value pair with the name of the parameter (params
) and the
definition of the nextflow channel (channel
). Consider the example below:
self.secondary_inputs = [
    {
        "params": "genomeSize",
        "channel": "IN_genome_size = Channel.value(params.genomeSize)"
    },
    {
        "params": "minCoverage",
        "channel": "IN_min_coverage = Channel.value(params.minCoverage)"
    }
]
This process will receive two secondary inputs that are given by the genomeSize and minCoverage parameters. These should also be specified in the params attribute (see Parameters above). For each of these parameters, the dictionary also stores how the channel should be defined at the beginning of the pipeline file. Note that this channel definition mentions the parameter (e.g. params.genomeSize). An additional best practice for channel definitions is to include one or more sanity checks to ensure that the provided arguments are correct. These checks can be added in the nextflow template file, or literally in the channel string:
self.secondary_inputs = [
    {
        "params": "genomeSize",
        "channel":
            "IN_genome_size = Channel.value(params.genomeSize)"
            ".map{it -> it.toString().isNumber() ? it : exit(1, \"The genomeSize parameter must be a number or a float. Provided value: '${params.genomeSize}'\")}"
    }
]
Extra input¶
The extra_input attribute is mostly a user specified directive that allows the injection of additional input data from a parameter into the main input channel of the process. When a pipeline is defined as:
process1 process2={'extra_input':'var'}
assemblerflow will expose a new var parameter, set up an extra input channel and mix it with the main input channel of process2. A more detailed explanation follows below.
First, assemblerflow will create a nextflow channel from the parameter name provided via the extra_input directive. The channel string will depend on the input type of the process (this string is fetched from the RAW_MAPPING attribute). For instance, if the input type of process2 is fastq, the new extra channel will be:
IN_var_extraInput = Channel.fromFilePairs(params.var)
Since the same extra input parameter may be used by more than one process, the IN_var_extraInput channel will be automatically forked into the final destination channels:
// When there is a single destination channel
IN_var_extraInput.set{ EXTRA_process2_1_2 }
// When there are multiple destination channels for the same parameter
IN_var_extraInput.into{ EXTRA_process2_1_2; EXTRA_process3_1_3 }
The destination channels are the ones that will actually be mixed with the main input channels:
process process2 {
input:
(...) main_channel.mix(EXTRA_process2_1_2)
}
In these cases, the processes that receive the extra input will process the data provided by the preceding channel AND by the parameter. The data provided via the extra input parameter does not have to wait for the main_channel, which means that it can run in parallel, if there are enough resources.
Compiler¶
The compiler attribute allows one or more channels of the process to be fed into a compiler process (see Compiler processes). These are special processes that collect information from one or more processes to execute a given task. Therefore, this attribute can only be used when there is an appropriate compiler process available (the available compiler processes are set in the compilers dictionary). In order to provide one or more channels to a compiler process, simply add a key:value pair to the attribute, where the key is the id of the compiler process present in the compilers dictionary and the value is the list of channels:
self.compiler["patlas_consensus"] = ["mappingOutputChannel"]
Link start¶
The link_start attribute stores a list of strings with channel names that can be used as secondary channels in the pipeline (see the Secondary links between processes section). By default, this attribute contains the main output channel, which means that every process can fork the main channel to one or more receiving processes.
Link end¶
The link_end attribute stores a list of dictionaries with channel names that are meant to be received by the process as secondary channels, if the corresponding Link start exists in the pipeline. Each dictionary in this list defines one secondary channel and requires two key:value pairs:
self.link_end.append({
    "link": "SomeChannel",
    "alias": "OtherChannel"
})
If another process exists in the pipeline with self.link_start.extend(["SomeChannel"]), assemblerflow will automatically establish a secondary channel between the two processes. If there are multiple processes receiving from a single one, the channel from the latter will fork into any number of receiving processes.
Dependencies¶
If a process depends on the presence of one or more processes upstream in the pipeline, these can be specified via the dependencies attribute. When building the pipeline, if at least one of the dependencies is absent, assemblerflow will raise an exception informing you of the missing dependency.
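As a sketch, a component like trimmomatic (which, as mentioned in the Forks section, depends on integrity_coverage) would declare something like:
class Trimmomatic(Process):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        self.input_type = "fastq"
        self.output_type = "fastq"

        # components that must exist upstream in the pipeline
        self.dependencies = ["integrity_coverage"]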
Directives¶
The directives attribute allows information about cpu/RAM usage and containers to be specified for each nextflow process in the template file. For instance, consider the case where a Process has a template with two nextflow processes:
process proc_A_{{ pid }} {
// stuff
}
process proc_B_{{ pid }} {
// stuff
}
Then, information about each process can be specified individually in the directives attribute:
class MyProcess(Process):

    (...)

    self.directives = {
        "proc_A": {
            "cpus": 1,
            "memory": "4GB"
        },
        "proc_B": {
            "cpus": 4,
            "container": "my/container",
            "version": "1.0.0"
        }
    }
The information in this attribute will then be used to build the resources.config (containing the information about cpu/RAM) and containers.config (containing the container images) files. Whenever a directive is missing, such as container and version from proc_A or memory from proc_B, nothing will be written into the config files for it and the default pipeline values will be used:
- cpus: 1
- memory: 1GB
- container: assemblerflow_base image
Ignore type¶
The ignore_type attribute controls whether a match between the input of the current process and the output of the previous one is enforced. When there are multiple terminal processes that fork from the main channel, there is no need to enforce the type match, and in that case this attribute can be set to True.
Process ID¶
The process ID, set via the pid attribute, is an arbitrary, incremental number that is assigned to each process depending on its position in the pipeline. It is mainly used to ensure that there are no duplicated channels, even when the same process is used multiple times in the same pipeline.
Template¶
The template attribute is used to fetch the jinja2 template file that corresponds to the current process. The path to the template file is determined as follows:
join(<template directory>, template + ".nf")
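In Python terms, this resolution is roughly equivalent to (a sketch; template_dir stands in for the templates directory):
import os

template_path = os.path.join(template_dir, template + ".nf")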
Status channels¶
The status channels are special channels dedicated to passing information regarding the status, warnings, fails and logging from each process (see Dotfiles for more information). They are used only when the nextflow template file contains the appropriate jinja2 placeholder:
output:
{% with task_name="<nextflow_template_name>" %}
{%- include "compiler_channels.txt" ignore missing -%}
{% endwith %}
By default, every Process class contains a status_channels list attribute that contains the template string:
self.status_channels = ["STATUS_{}".format(template)]
If there is only one nextflow process in the template, and the task_name variable in the template matches the template attribute, then it's all automatically set up.
If the template file contains more than one nextflow process definition, multiple placeholders can be provided in the template:
process A {

    (...)

    output:
    {% with task_name="A" %}
    {%- include "compiler_channels.txt" ignore missing -%}
    {% endwith %}
}

process B {

    (...)

    output:
    {% with task_name="B" %}
    {%- include "compiler_channels.txt" ignore missing -%}
    {% endwith %}
}
In this case, the status_channels attribute would need to be changed to:
self.status_channels = ["A", "B"]
Sample ID variable¶
If you change the standard nextflow variable that stores the sample ID in the input of the process (sample_id), you also need to change it for the compiler_channels placeholder:
process A {

    input:
    set other_id, data from {{ input_channel }}

    output:
    {% with task_name="A", sample_id="other_id" %}
    {%- include "compiler_channels.txt" ignore missing -%}
    {% endwith %}
}
Advanced use cases¶
Compiler processes¶
Compilers are special processes that collect data from one or more processes and perform a given task with that compiled data. They are automatically included in the pipeline when at least one of the source channels is present. If there are multiple source channels, they are merged according to a specified operator.
Creating a compiler process¶
The creation of the compiler process is simpler than that of a regular process but follows the same three steps.
1. Create a nextflow template file in assemblerflow.generator.templates:
process fullConsensus {

    input:
    set id, file(infile_list) from {{ compile_channels }}

    output:
    <output channels>

    script:
    """
    <commands/code/template>
    """
}
The only requirement is the inclusion of the compile_channels jinja placeholder in the main input channel.
2. Create a Compiler class in the assemblerflow.generator.process module:
class PatlasConsensus(Compiler):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
This class must inherit from Compiler and does not require any more changes.
3. Map the compiler template file to the class in the compilers attribute:
self.compilers = {
    "patlas_consensus": {
        "cls": pc.PatlasConsensus,
        "template": "patlas_consensus",
        "operator": "join"
    }
}
Each compiler should contain a key:value entry. The key is the compiler id that is then specified in the compiler attribute of the component classes. The value is a json/dict object that specifies the compiler class in the cls key, the template string in the template key and the operator used to join the channels into the compiler in the operator key.
How a compiler process works¶
Consider the case where you have a compiler process named compiler_1 and two processes, process_1 and process_2, both of which feed a single channel to compiler_1. This means that the class definitions of these processes include:
class Process_1(Process):

    (...)

    self.compiler["compiler_1"] = ["channel1"]

class Process_2(Process):

    (...)

    self.compiler["compiler_1"] = ["channel2"]
If a pipeline is built with at least one of these processes, the compiler_1 process will be automatically included in the pipeline. If more than one channel is provided to the compiler, they will be merged with the specified operator:
process compiler_1 {
input:
set sample_id, file(infile_list) from channel2.join(channel1)
}
This allows the output of multiple separate processes to be handled by a single process in the pipeline, and it automatically adjusts according to the channels provided to the compiler.
Secondary links between process¶
In some cases, it might be necessary to perform additional links between two or more processes. For example, the maximum read length might be gathered in one process, and that information may be required by a subsequent process. Secondary channels allow this information to be passed between these processes.
These additional links are called secondary channels and they may be explicitly or implicitly declared.
Explicit secondary channels¶
To create an explicit secondary channel, the origin or source of this channel must be declared in the nextflow process that sends it:
// secondary channels can be created inside the process
output:
<main output> into {{ output_channel }}
<secondary output> into SIDE_max_read_len_{{ pid }}
// or outside
SIDE_phred_{{ pid }} = Channel.create()
Then, we add the information that this process has a secondary channel start via the link_start list attribute in the corresponding assemblerflow.generator.process.Process class:
class MyProcess(Process):

    (...)

    self.link_start.extend(["SIDE_max_read_len", "SIDE_phred"])
Notice that we extend the link_start
list, instead of simply assigning.
This is because all processes already have the main channel as an implicit
link start (See Implicit secondary channels).
Now, any process that is executed after this one can receive this secondary channel.
For another process to receive this channel, it will be necessary to add this information to the process class(es) via the link_end list attribute:
class OtherProcess(Process):

    (...)

    self.link_end.append({
        "link": "SIDE_phred",
        "alias": "OtherName"
    })
Notice that now we append a dictionary with two key:value pairs. The first, link, must match a string from the link_start list (in this case, SIDE_phred). The second, alias, will be the channel name in the receiving process nextflow template (which can be the same as the link value).
Now, we only need to add the secondary channel to the nextflow template, as in the example below:
input:
<main_input> from {{ input_channel }}.mix(OtherName_{{ pid }})
Implicit secondary channels¶
By default, the main output channel of every process is declared as a secondary channel start. This means that any process can receive the main output channel of a previous process as a secondary channel. This can be useful in situations where a post-assembly process (which has assembly as expected input and output) needs to receive the last channel with fastq files:
class AssemblyMapping(Process):

    (...)

    self.link_end.append({
        "link": "MAIN_fq",
        "alias": "_MAIN_assembly"
    })
In this example, the AssemblyMapping process will receive a secondary channel from the last process that outputs fastq files into a channel called _MAIN_assembly. This channel is then received in the nextflow template like this:
input:
<main input> from {{ input_channel }}.join(_{{ input_channel }})
Implicit secondary channels can also be used to fork the last output channel into multiple terminal processes:
class Abricate(Process):

    (...)

    self.link_end.append({
        "link": "MAIN_assembly",
        "alias": "MAIN_assembly"
    })
In this case, since MAIN_assembly is already the prefix of the main output channel of this process, there is no need for changes in the process template:
input:
<main input> from {{ input_channel }}
Template creation guidelines¶
Though none of these guidelines are mandatory, their usage is highly recommended for several reasons:
- Consistency in the outputs of the templates throughout the pipeline, particularly the status and report dotfiles (see Dotfiles section);
- Debugging purposes;
- Versioning;
- Proper documentation of the template scripts.
Preface header¶
After the script shebang, a header with a brief description of the purpose and the expected inputs and outputs should be provided. A complete example of such a description can be viewed in assemblerflow.templates.integrity_coverage.
Purpose¶
The Purpose section contains a brief description of the script's objective. E.g.:
Purpose
-------
This module is intended to parse the results of FastQC for paired end FastQ \
samples.
Expected input¶
The Expected input section contains a description of the variables that are provided to the main function of the template script. These variables are defined in the input channels of the process in which the template is supposed to be executed. E.g.:
Expected input
--------------
The following variables are expected whether using NextFlow or the
:py:func:`main` executor.
- ``mash_output`` : String with the name of the mash screen output file.
- e.g.: ``'sortedMashScreenResults_SampleA.txt'``
This means that the process that will execute this template will have the input defined as:
input:
file(mash_output) from <channel>
Generated output¶
The Generated output section contains a description of the output files that the template script is intended to generate. E.g.:
Generated output
----------------
The generated output are output files that contain an object, usually a string.
- ``fastqc_health`` : Stores the health check for the current sample. If it
passes all checks, it contains only the string 'pass'. Otherwise, contains
the summary categories and their respective results
These can then be passed to the output channel(s) in the nextflow process:
output:
file(fastqc_health) into <channel>
Note
Since templates can be re-used by multiple processes, not all generated outputs need to be passed to output channels. Depending on the job of the nextflow process, it may catch none or all of the output files generated by the template.
Versioning and logging¶
Assemblerflow has a specific logger (get_logger()) and a versioning system that can be imported from assemblerflow.templates.utils:
# the module that imports the logger and the decorator class for versioning
# of the script itself and other software used in the script
from utils.assemblerflow_base import get_logger, MainWrapper
Logger¶
A logger is also required to add logs to the script. The logs are written to the .command.log file in the work directory of each process. First, the logger must be instantiated, for example after the imports, as follows:
logger = get_logger(__file__)
Then, it may be used at will, using the default logging levels. E.g.:
logger.debug("Information tha may be important for debugging")
logger.info("Information related to the normal execution steps")
logger.warning("Events that may require the attention of the developer")
logger.error("Module exited unexpectedly with error:\\n{}".format(
traceback.format_exc()))
MainWrapper decorator¶
The MainWrapper class decorator allows the program to fetch information on the script version, build and template name. For example:
# This can also be declared after the imports
__version__ = "1.0.0"
__build__ = "15012018"
__template__ = "process_abricate-nf"
The MainWrapper should decorate the main function of the script. E.g.:
@MainWrapper
def main():
    # some awesome code
    ...
Besides searching for the script's version, build and template name, this decorator will also search for a specific set of functions whose names start with the substring __get_version. For example:
def __get_version_fastqc():

    try:
        cli = ["fastqc", "--version"]
        p = subprocess.Popen(cli, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, _ = p.communicate()
        version = stdout.strip().split()[1][1:].decode("utf8")
    except Exception as e:
        logger.debug(e)
        version = "undefined"

    # Note that it returns a dictionary that will then be written to the
    # .versions dotfile
    return {
        "program": "FastQC",
        "version": version,
        # some programs may also contain a build key
    }
These functions are used to fetch the version, name and other relevant information from third-party software. The only requirement is that they return a dictionary with at least two key:value pairs:
- program: String with the name of the program.
- version: String with the version of the program.
For more information, refer to the build_versions() method.
Nextflow .command.sh¶
When these templates are used as a Nextflow template, they are executed as a .command.sh file in the work directory of each process. In this case, we recommend including an if statement to parse the arguments sent from nextflow to the python template. For example, imagine we have a path to a file name to pass as an argument between nextflow and the required template:
# code check for nextflow execution
if __file__.endswith(".command.sh"):
    FILE_NAME = '$Nextflow_file_name'
    # logger output can also be included here, for example:
    logger.debug("Running {} with parameters:".format(
        os.path.basename(__file__)))
    logger.debug("FILE_NAME: {}".format(FILE_NAME))
Then, we could use this variable as the argument of a function, such as:
def main(FILE_NAME):
    # some awesome code
    ...
This way, we can use this function with nextflow arguments or without them, as is the case when the templates are used as standalone modules.
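Putting these pieces together, a complete template skeleton might look as follows (a minimal sketch, assuming the import path shown earlier; the template name and the '$Nextflow_file_name' placeholder are illustrative):
import os

from utils.assemblerflow_base import get_logger, MainWrapper

__version__ = "1.0.0"
__build__ = "15012018"
__template__ = "my_template-nf"  # hypothetical template name

logger = get_logger(__file__)

if __file__.endswith(".command.sh"):
    # running inside a nextflow work directory: fetch nextflow variables
    FILE_NAME = '$Nextflow_file_name'
    logger.debug("FILE_NAME: {}".format(FILE_NAME))

@MainWrapper
def main(file_name):
    # some awesome code
    logger.info("Processing {}".format(os.path.basename(file_name)))

if __name__ == "__main__":
    main(FILE_NAME)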
Dotfiles¶
Several dotfiles (files prefixed by a single .
, as in .status
) are
created at the beginning of every nextflow process that has the following
placeholder (see Create process template):
process myProcess {
{% include "post.txt" ignore missing %}
(...)
}
The actual script that creates the dotfiles is found in
assemblerflow/bin
, is called set_dotfiles.sh
and executes the
following command:
touch .status .warning .fail .report.json .versions
Status¶
The .status
file simply stores a string with the run status of the process.
The supported statuses are:
- pass: The process finished successfully.
- fail: The process ran without unexpected issues but failed due to some quality control check.
- error: The process exited with an unexpected error.
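For illustration, a template could record its status at the end of execution along these lines (a hedged sketch; the quality-control flag is a hypothetical stand-in):
# Hedged sketch: writing the run status to the .status dotfile.
# 'coverage_ok' is a hypothetical quality-control result computed by the template.
coverage_ok = True

with open(".status", "w") as status_fh:
    status_fh.write("pass" if coverage_ok else "fail")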
Warning¶
The .warning
file stores any warnings that may occur during the execution
of the process. There is no particular format for the warning messages other
than that each individual warning should be on a separate line.
Fail¶
The .fail
file stores any fail messages that may occur during the
execution of the process. When this occurs, the .status
file must contain
the fail
string as well. As in the warning dotfile, there is no
particular format for the fail message.
Report JSON¶
The .report.json
file stores any information from a given process that is
deemed worthy of being reported and displayed at the end of the pipeline.
Any information can be stored in this file, as long as it is in JSON format,
but there are a few recommendations that must be followed
for the data to be processed by a reporting web app (currently hosted at
report-nf). However, if
data processing will be performed with custom scripts, feel free to specify
your own format.
Information for tables¶
Information meant to be displayed in tables should be in the following format:
json_dic = {
    "tableRow": [
        {"header": "Raw BP",
         "value": chars,
         "table": "assembly",
         "columnBar": True},
    ]
}
This means that the chars
variable that is created during the execution
of the process should appear as a table entry with the specified header
and value
. The table
key specifies in which table of the reports
it will appear and the columnBar
key informs the report generator to
create a bar column in that particular cell.
Information for plots¶
Information meant to be displayed in plots should be in the following format:
json_dic = {
"plotData": {
"size_dist": size_dist
}
}
This is a simple key:value pair, where the key is the ID of the plot in the
reports and the size_dist
contains the plot data that was gathered
for a particular process.
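Both kinds of entries end up in the same dotfile. A minimal sketch of writing the combined report (the chars and size_dist values are placeholders for data computed by the process):
import json

# Hedged sketch: writing table and plot entries to the .report.json dotfile
chars = 1234567
size_dist = [100, 150, 150, 300]

json_dic = {
    "tableRow": [
        {"header": "Raw BP", "value": chars,
         "table": "assembly", "columnBar": True},
    ],
    "plotData": {"size_dist": size_dist},
}

with open(".report.json", "w") as report_fh:
    report_fh.write(json.dumps(json_dic))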
Other information¶
Other than tables and plots, which have a somewhat predefined format, there is no particular format for other information. These entries simply store the data of interest to report, and it is the job of a downstream report app to process that data into an actual visual report.
Versions¶
The .versions
dotfile should contain a list of JSON objects with the
version information of the programs used in any given process. There are
only two required key:value pairs:
- program: String with the name of the software/script/template.
- version: String with the version of said software.
As an example:
version = {
    "program": "abricate",
    "version": "0.3.7"
}
Key:value pairs with other metadata can be included at will for downstream processing.
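A sketch of how a template might write this dotfile, assuming version dictionaries like the ones collected above:
import json

# Hedged sketch: writing the collected version entries to the .versions dotfile
versions = [
    {"program": "abricate", "version": "0.3.7"},
    {"program": "my_template-nf", "version": "1.0.0"},  # hypothetical template entry
]

with open(".versions", "w") as versions_fh:
    versions_fh.write(json.dumps(versions))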
assemblerflow package¶
Subpackages¶
assemblerflow.generator package¶
assemblerflow.generator.components package¶
Submodules¶
- class assemblerflow.generator.components.annotation.Abricate(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Abricate mapping process template interface
This process is set with:
- input_type: assembly
- output_type: None
- ptype: post_assembly
It contains one secondary channel link end:
- MAIN_assembly (alias: MAIN_assembly): Receives the last assembly.
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.annotation.Prokka(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Prokka mapping process template interface
This process is set with:
- input_type: assembly
- output_type: None
- ptype: post_assembly
It contains one secondary channel link end:
- MAIN_assembly (alias: MAIN_assembly): Receives the last assembly.
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.assembly.Spades(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Spades process template interface
This process is set with:
- input_type: fastq
- output_type: assembly
- ptype: assembly
It contains one secondary channel link end:
- SIDE_max_len (alias: SIDE_max_len): Receives max read length
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.assembly.Skesa(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Skesa process template interface
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.assembly_processing.ProcessSkesa(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.assembly_processing.ProcessSpades(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Process spades process template interface
This process is set with:
- input_type: assembly
- output_type: assembly
- ptype: post_assembly
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.assembly_processing.AssemblyMapping(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Assembly mapping process template interface
This process is set with:
- input_type: assembly
- output_type: assembly
- ptype: post_assembly
It contains one secondary channel link end:
- MAIN_fq (alias: _MAIN_assembly): Receives the FastQ files from the last process with fastq output type.
It contains two status channels:
- STATUS_am: Status for the assembly_mapping process
- STATUS_amp: Status for the process_assembly_mapping process
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.assembly_processing.Pilon(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Pilon mapping process template interface
This process is set with:
- input_type: assembly
- output_type: assembly
- ptype: post_assembly
It contains one dependency process:
- assembly_mapping: Requires the BAM file generated by the assembly mapping process
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.distance_estimation.PatlasMashDist(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.distance_estimation.PatlasMashScreen(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.downloads.DownloadReads(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Process template interface for downloading reads from SRA and NCBI
This process is set with:
- input_type: accessions
- output_type: fastq
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.reads_quality_control.IntegrityCoverage(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Process template interface for first integrity_coverage process
This process is set with:
- input_type: fastq
- output_type: fastq
- ptype: pre_assembly
It contains two secondary channel link starts:
- SIDE_phred: Phred score of the FastQ files
- SIDE_max_len: Maximum read length
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.reads_quality_control.CheckCoverage(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Process template interface for additional integrity_coverage process
This process is set with:
- input_type: fastq
- output_type: fastq
- ptype: pre_assembly
It contains one secondary channel link start:
- SIDE_max_len: Maximum read length
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.reads_quality_control.TrueCoverage(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
TrueCoverage process template interface
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.reads_quality_control.FastQC(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
FastQC process template interface
This process is set with:
- input_type: fastq
- output_type: fastq
- ptype: pre_assembly
It contains two status channels:
- STATUS_fastqc: Status for the fastqc process
- STATUS_report: Status for the fastqc_report process
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
-
status_channels
= None¶ list: Setting status channels for FastQC execution and FastQC report
- class assemblerflow.generator.components.reads_quality_control.Trimmomatic(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Trimmomatic process template interface
This process is set with:
- input_type: fastq
- output_type: fastq
- ptype: pre_assembly
It contains one secondary channel link end:
- SIDE_phred (alias: SIDE_phred): Receives FastQ phred score
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
- class assemblerflow.generator.components.reads_quality_control.FastqcTrimmomatic(**kwargs)[source]¶
Bases: assemblerflow.generator.process.Process
Fastqc + Trimmomatic process template interface
This process executes FastQC only to inform the trim range for trimmomatic, not for QC checks.
This process is set with:
- input_type: fastq
- output_type: fastq
- ptype: pre_assembly
It contains one secondary channel link end:
- SIDE_phred (alias: SIDE_phred): Receives FastQ phred score
It contains three status channels:
- STATUS_fastqc: Status for the fastqc process
- STATUS_report: Status for the fastqc_report process
- STATUS_trim: Status for the trimmomatic process
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
Module contents¶
Submodules¶
assemblerflow.generator.engine module¶
-
assemblerflow.generator.engine.
process_map
= {'mlst': <class 'assemblerflow.generator.components.mlst.Mlst'>, 'true_coverage': <class 'assemblerflow.generator.components.reads_quality_control.TrueCoverage'>, 'fastqc_trimmomatic': <class 'assemblerflow.generator.components.reads_quality_control.FastqcTrimmomatic'>, 'integrity_coverage': <class 'assemblerflow.generator.components.reads_quality_control.IntegrityCoverage'>, 'fastqc': <class 'assemblerflow.generator.components.reads_quality_control.FastQC'>, 'reads_download': <class 'assemblerflow.generator.components.downloads.DownloadReads'>, 'mapping_patlas': <class 'assemblerflow.generator.components.patlas_mapping.PatlasMapping'>, 'patho_typing': <class 'assemblerflow.generator.components.typing.PathoTyping'>, 'check_coverage': <class 'assemblerflow.generator.components.reads_quality_control.CheckCoverage'>, 'seq_typing': <class 'assemblerflow.generator.components.typing.SeqTyping'>, 'spades': <class 'assemblerflow.generator.components.assembly.Spades'>, 'chewbbaca': <class 'assemblerflow.generator.components.mlst.Chewbbaca'>, 'prokka': <class 'assemblerflow.generator.components.annotation.Prokka'>, 'process_skesa': <class 'assemblerflow.generator.components.assembly_processing.ProcessSkesa'>, 'process_spades': <class 'assemblerflow.generator.components.assembly_processing.ProcessSpades'>, 'skesa': <class 'assemblerflow.generator.components.assembly.Skesa'>, 'assembly_mapping': <class 'assemblerflow.generator.components.assembly_processing.AssemblyMapping'>, 'trimmomatic': <class 'assemblerflow.generator.components.reads_quality_control.Trimmomatic'>, 'mash_dist': <class 'assemblerflow.generator.components.distance_estimation.PatlasMashDist'>, 'abricate': <class 'assemblerflow.generator.components.annotation.Abricate'>, 'mash_screen': <class 'assemblerflow.generator.components.distance_estimation.PatlasMashScreen'>, 'pilon': <class 'assemblerflow.generator.components.assembly_processing.Pilon'>}¶ dict: Maps the process ids to the corresponding template interface class with the format:
{ "<template_string>": module.TemplateClass }
- class assemblerflow.generator.engine.NextflowGenerator(process_connections, nextflow_file, pipeline_name='assemblerflow', ignore_dependencies=False, auto_dependency=True)[source]¶
Bases: object
Methods
- build(): Main pipeline builder
- render_pipeline(): Write pipeline attributes to json
- write_configs(project_root): Wrapper method that writes all configuration files to the pipeline directory
-
processes
= None¶ list: Stores the process interfaces in the specified order
-
_fork_tree
= None¶ dict: A dictionary with the fork tree of the pipeline, which consists of the paths of each lane. For instance, a single fork with two sinks is represented as: {1: [2,3]}. Subsequent forks are then added sequentially: {1:[2,3], 2:[3,4,5]}. This allows the path upstream of a process in a given lane to be traversed until the start of the pipeline.
-
lanes
= None¶ int: Stores the number of lanes in the pipeline
-
nf_file
= None¶ str: Path to file where the pipeline will be generated
-
pipeline_name
= None¶ str: Name of the pipeline, for customization and help purposes.
-
template
= None¶ str: String that will harbour the pipeline code
-
secondary_channels
= None¶ dict: Stores secondary channel links
-
main_raw_inputs
= None¶ list: Stores the main raw inputs from the user parameters into the first process(es).
-
secondary_inputs
= None¶ dict: Stores the secondary input channels that may be required by some processes. The key is the params variable and the value is the channel definition for nextflow:
{"genomeSize": "IN_genome_size = Channel.value(params.genomeSize)"}
-
extra_inputs
= None¶
-
status_channels
= None¶ list: Stores the status channels from each process
-
skip_class
= None¶ list: Stores the Process classes that should be skipped when iterating over the
processes
list.
-
resources
= None¶ str: Stores the resource directives string for each nextflow process. See
NextflowGenerator._get_resources_string()
.
-
containers
= None¶ str: Stores the container directives string for each nextflow process. See
NextflowGenerator._get_container_string()
.
-
params
= None¶ str: Stores the params directives string for the nextflow pipeline. See
NextflowGenerator._get_params_string()
-
user_config
= None¶ str: Stores the user configuration file placeholder. This is an empty configuration file that is only added the first time to a project directory. If the file already exists, it will not overwrite it.
-
compilers
= None¶ dict: Maps the information about each available compiler process in assemblerflow. The key of each entry is the name/signature of the compiler process. The value is a json/dict object that contains two key:value pairs:
- cls: The reference to the compiler class object.
- template: The nextflow template file of the process.
-
static
_parse_process_name
(name_str)[source]¶ Parses the process string and returns the process name and its directives
Process strings may contain directive information with the following syntax:
proc_name={'directive':'val'}
This method parses this string and returns the process name as a string and the directives information as a dictionary.
Parameters: - name_str : str
Raw string with process name and, potentially, directive information
Returns: - str
Process name
- dict or None
Process directives
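For illustration, the directive syntax described above maps a raw string to a name/dictionary pair roughly as follows (a behavioural sketch, not the actual implementation):
import ast

# Minimal sketch of the directive parsing behaviour described above
def parse_process_name(name_str):
    name, _, directive_str = name_str.partition("=")
    directives = ast.literal_eval(directive_str) if directive_str else None
    return name, directives

print(parse_process_name("trimmomatic={'cpus':'4'}"))
# ('trimmomatic', {'cpus': '4'})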
-
_build_connections
(process_list, ignore_dependencies, auto_dependency)[source]¶ Parses the process connections dictionaries into a process list
This method is called upon instantiation of the NextflowGenerator class. Essentially, it sets the main input/output channel names of the processes so that they can be linked correctly.
If a connection between two consecutive processes is not possible due to a mismatch in the input/output types, it exits with an error.
-
_get_process_names
(con, pid)[source]¶ Returns the input/output process names and output process directives
Parameters: - con : dict
Dictionary with the connection information between two processes.
- pid : int
Arbitrary and unique process ID.
Returns: - input_name : str
Name of the input process
- output_name : str
Name of the output process
- output_directives : dict
Parsed directives from the output process
-
_add_dependency
(p, template, inlane, outlane, pid)[source]¶ Automatically adds a dependency of a process.
This method adds a template to the process list attribute as a dependency. It will adapt the input lane, output lane and process id of the process that depends on it.
Parameters: - p : Process
Process class that contains the dependency.
- template : str
Template name of the dependency.
- inlane : int
Input lane.
- outlane : int
Output lane.
- pid : int
Process ID.
-
_search_tree_backwards
(template, parent_lanes)[source]¶ Searches the process tree backwards in search of a provided process
The search takes into consideration the provided parent lanes and searches only those
Parameters: - template : str
Name of the process template attribute being searched
- parent_lanes : list
List of integers with the parent lanes to be searched
Returns: - bool
Returns True when the template is found. Otherwise returns False.
-
static
_test_connection
(parent_process, child_process)[source]¶ Tests if two processes can be connected by input/output type
Parameters: - parent_process : assemblerflow.generator.process.Process
Process that will be sending output.
- child_process : assemblerflow.generator.process.Process
Process that will receive output.
Adds the footer template to the master template string
-
_update_raw_input
(p, sink_channel=None, input_type=None)[source]¶ Given a process, this method updates the
main_raw_inputs
attribute with the corresponding raw input channel of that process. The input channel and input type can be overridden if the input_channel and input_type arguments are provided.
Parameters: - p : assemblerflow.generator.Process.Process
Process instance whose raw input will be modified
- sink_channel: str
Sets the channel where the raw input will fork into. It overrides the process’s input_channel attribute.
- input_type: str
Sets the type of the raw input. It overrides the process’s input_type attribute.
-
_update_secondary_inputs
(p)[source]¶ Given a process, this method updates the
secondary_inputs
attribute with the corresponding secondary inputs of that process.
Parameters: - p : assemblerflow.Process.Process
-
_update_extra_inputs
(p)[source]¶ Given a process, this method updates the
extra_inputs
attribute with the corresponding extra inputs of that process.
Parameters: - p : assemblerflow.Process.Process
-
_get_fork_tree
(lane)[source]¶ Returns a list with the parent lanes from the provided lane
Parameters: - lane : int
Target lane
Returns: - list
List of the lanes preceding the provided lane.
-
_update_secondary_channels
(p)[source]¶ Given a process, this method updates the
secondary_channels
attribute with the corresponding secondary channels of that process.
The rationale of the secondary channels is the following:
- Start storing any secondary emitting channels, by checking the link_start list attribute of each process. If there are channel names in the link start, it adds to the secondary channels dictionary.
- Check for secondary receiving channels, by checking the
link_end list attribute. If the link name starts with a
__ signature, it will create an implicit link with the last
process with an output type after the signature. Otherwise,
it will check if a corresponding link start already exists in
at least one process upstream in the pipeline and, if so,
it will update the
secondary_channels
attribute with the new link.
Parameters: - p : assemblerflow.Process.Process
-
_set_channels
()[source]¶ Sets the main channels for the pipeline
This method will parse the
processes
attribute and perform the following tasks for each process:- Sets the input/output channels and main input forks and adds
them to the process’s
assemblerflow.process.Process._context
attribute (Seeset_channels()
). - Automatically updates the main input channel of the first
process of each lane so that they fork from the user-provided
parameters (See
_update_raw_input()
). - Check for the presence of secondary inputs and adds them to the
secondary_inputs
attribute. - Check for the presence of secondary channels and adds them to the
secondary_channels
attribute.
Notes
On the secondary channel setup: With this approach, there can only be one secondary link start for each type of secondary link. For instance, if there are two processes that start a secondary channel for the
SIDE_max_len
channel, only the last one will be recorded, and all receiving processes will get the channel from the latest process. Secondary channels can only be linked if the source process is upstream of the sink process in its “forking” path.
-
_set_init_process
()[source]¶ Sets the main raw inputs and secondary inputs on the init process
This method will fetch the
assemblerflow.process.Init
process instance and sets the raw input (assemblerflow.process.Init.set_raw_inputs()
) and the secondary inputs (assemblerflow.process.Init.set_secondary_inputs()
) for that process. This will handle the connection of the user parameters with channels that are then consumed in the pipeline.
-
_set_secondary_channels
()[source]¶ Sets the secondary channels for the pipeline
This will iterate over the
NextflowGenerator.secondary_channels
dictionary that is populated when executing the _update_secondary_channels()
method.
-
_set_general_compilers
()[source]¶ Adds compiler channels to the
processes
attribute.This method will iterate over the pipeline’s processes and check if any process is feeding channels to a compiler process. If so, that compiler process is added to the pipeline and those channels are linked to the compiler via some operator.
-
static
_get_resources_string
(res_dict, pid)[source]¶ Returns the nextflow resources string from a dictionary object
If the dictionary has at least one of the resource directives, these will be compiled for each process in the dictionary and returned as a string ready for injection in the nextflow config file template.
This dictionary should be:
dict = {"processA": {"cpus": 1, "memory": "4GB"}, "processB": {"cpus": 2}}
Parameters: - res_dict : dict
Dictionary with the resources for processes.
- pid : int
Unique identifier of the process
Returns: - str
nextflow config string
-
static
_get_container_string
(cont_dict, pid)[source]¶ Returns the nextflow containers string from a dictionary object
If the dictionary has at least one of the container directives, these will be compiled for each process in the dictionary and returned as a string ready for injection in the nextflow config file template.
This dictionary should be:
dict = {"processA": {"container": "asd", "version": "1.0.0"}, "processB": {"container": "dsd"}}
Parameters: - cont_dict : dict
Dictionary with the containers for processes.
- pid : int
Unique identifier of the process
Returns: - str
nextflow config string
-
_get_params_string
()[source]¶ Returns the nextflow params string from a dictionary object.
The params dict should be a set of key:value pairs with the parameter name, and the default parameter value:
self.params = { "genomeSize": 2.1, "minCoverage": 15 }
The values are then added to the string as they are. For instance, a
2.1
float will appear as param = 2.1,
and a "'teste'" string will appear as param = 'teste' (note the quoted string).
Returns: - str
Nextflow params configuration string
-
_set_configurations
()[source]¶ This method will iterate over all processes in the pipeline and populate the nextflow configuration files with the directives of each process in the pipeline.
-
render_pipeline
()[source]¶ Write pipeline attributes to json
This function writes the pipeline and its attributes to a json file, which is intended to be read by resources/pipeline_graph.html to render a graphical output showing the DAG.
-
write_configs
(project_root)[source]¶ Wrapper method that writes all configuration files to the pipeline directory
-
build
()[source]¶ Main pipeline builder
This method is responsible for building the
NextflowGenerator.template
attribute that will contain the nextflow code of the pipeline.
First it builds the header, then sets the main channels, the secondary inputs, secondary channels and finally the status channels. When the pipeline is built, it writes the code to a nextflow file.
-
assemblerflow.generator.error_handling module¶
assemblerflow.generator.header_skeleton module¶
assemblerflow.generator.pipeline_parser module¶
-
assemblerflow.generator.pipeline_parser.
remove_inner_forks
(text)[source]¶ Recursively removes nested brackets
This function is used to remove nested brackets from fork strings using regular expressions
Parameters: - text: str
The string that contains brackets with inner forks to be removed
Returns: - text: str
the string with only the processes that are not in inner forks, thus the processes that belong to a given fork.
-
assemblerflow.generator.pipeline_parser.
empty_tasks
(p_string)[source]¶ Function to check if the pipeline string is empty or contains an empty string
Parameters: - p_string: str
- String with the definition of the pipeline, e.g.::
‘processA processB processC(ProcessD | ProcessE)’
-
assemblerflow.generator.pipeline_parser.
brackets_but_no_lanes
(p_string)[source]¶ Function to check if a LANE_TOKEN is provided but no fork is initiated.
Parameters: - p_string: str
- String with the definition of the pipeline, e.g.::
- ‘processA processB processC(ProcessD | ProcessE)’
-
assemblerflow.generator.pipeline_parser.
brackets_insanity_check
(p_string)[source]¶ This function performs a check for a different number of ‘(‘ and ‘)’ characters, which indicates that some forks are poorly constructed.
Parameters: - p_string: str
- String with the definition of the pipeline, e.g.::
‘processA processB processC(ProcessD | ProcessE)’
-
assemblerflow.generator.pipeline_parser.
lane_char_insanity_check
(p_string)[source]¶ This function performs a sanity check for multiple ‘|’ character between two processes.
Parameters: - p_string: str
- String with the definition of the pipeline, e.g.::
‘processA processB processC(ProcessD | ProcessE)’
-
assemblerflow.generator.pipeline_parser.
final_char_insanity_check
(p_string)[source]¶ This function checks if lane token is the last element of the pipeline string.
Parameters: - p_string: str
- String with the definition of the pipeline, e.g.::
‘processA processB processC(ProcessD | ProcessE)’
-
assemblerflow.generator.pipeline_parser.
fork_procs_insanity_check
(p_string)[source]¶ This function checks if the pipeline string contains a process between the fork start token or end token and the separator (lane) token. Checks for the absence of processes in one of the branches of the fork [‘|)' and '(|’] and for the existence of a process before starting a fork (in an inner fork) [‘|(‘].
Parameters: - p_string: str
- String with the definition of the pipeline, e.g.::
‘processA processB processC(ProcessD | ProcessE)’
-
assemblerflow.generator.pipeline_parser.
start_proc_insanity_check
(p_string)[source]¶ This function checks if there is a starting process after the beginning of each fork. It checks for duplicated start tokens [‘((‘].
Parameters: - p_string: str
- String with the definition of the pipeline, e.g.::
‘processA processB processC(ProcessD | ProcessE)’
-
assemblerflow.generator.pipeline_parser.
late_proc_insanity_check
(p_string)[source]¶ This function checks if there are processes after the close token. It searches for everything that isn’t “|” or “)” after a “)” token.
Parameters: - p_string: str
- String with the definition of the pipeline, e.g.::
‘processA processB processC(ProcessD | ProcessE)’
-
assemblerflow.generator.pipeline_parser.
inner_fork_insanity_checks
(pipeline_string)[source]¶ This function performs two sanity checks in the pipeline string. The first check assures that each fork contains a lane token ‘|’, while the second check looks for duplicated processes within the same fork.
Parameters: - pipeline_string: str
- String with the definition of the pipeline, e.g.::
‘processA processB processC(ProcessD | ProcessE)’
-
assemblerflow.generator.pipeline_parser.
insanity_checks
(pipeline_str)[source]¶ Wrapper that performs all sanity checks on the pipeline string
Parameters: - pipeline_str : str
String with the pipeline definition
-
assemblerflow.generator.pipeline_parser.
parse_pipeline
(pipeline_str)[source]¶ Parses a pipeline string into a dictionary with the connections between processes
Parameters: - pipeline_str : str
- String with the definition of the pipeline, e.g.::
‘processA processB processC(ProcessD | ProcessE)’
Returns: - pipeline_links : list
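A hedged usage sketch of the parser (the exact shape of the returned connection dictionaries follows the linear_connection and fork_connection helpers below):
# Hedged sketch: parsing a pipeline string into connection dictionaries
from assemblerflow.generator.pipeline_parser import parse_pipeline

links = parse_pipeline("trimmomatic fastqc (spades | skesa)")
for connection in links:
    # each entry describes an input/output process pair and their lanes
    print(connection)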
-
assemblerflow.generator.pipeline_parser.
get_source_lane
(fork_process, pipeline_list)[source]¶ Returns the lane of the last process that matches fork_process
Parameters: - fork_process : list
List of processes before the fork.
- pipeline_list : list
List with the pipeline connection dictionaries.
Returns: - int
Lane of the last process that matches fork_process
-
assemblerflow.generator.pipeline_parser.
get_lanes
(lanes_str)[source]¶ From a raw pipeline string, get a list of lanes from the start of the current fork.
When the pipeline is being parsed, it will be split at every fork position. The string at the right of the fork position will be provided to this function. Its job is to retrieve the lanes that result from that fork, ignoring any nested forks.
Parameters: - lanes_str : str
Pipeline string after a fork split
Returns: - lanes : list
List of lists, with the list of processes for each lane
-
assemblerflow.generator.pipeline_parser.
linear_connection
(plist, lane)[source]¶ Connects a linear list of processes into a list of dictionaries
Parameters: - plist : list
List with process names. This list should contain at least two entries.
- lane : int
Corresponding lane of the processes
Returns: - res : list
List of dictionaries with the links between processes
-
assemblerflow.generator.pipeline_parser.
fork_connection
(source, sink, source_lane, lane)[source]¶ Makes the connection between a process and the first processes in the lanes to which it forks.
The
lane
argument should correspond to the lane of the source process. For each lane insink
, the lane counter will increase.Parameters: - source : str
Name of the process that is forking
- sink : list
List of the processes where the source will fork to. Each element corresponds to the start of a lane.
- source_lane : int
Lane of the forking process
- lane : int
Lane of the source process
Returns: - res : list
List of dictionaries with the links between processes
-
assemblerflow.generator.pipeline_parser.
linear_lane_connection
(lane_list, lane)[source]¶ Parameters: - lane_list : list
Each element should correspond to a list of processes for a given lane
- lane : int
Lane counter before the fork start
Returns: - res : list
List of dictionaries with the links between processes
assemblerflow.generator.process module¶
-
class assemblerflow.generator.process.Process(template)[source]¶
Bases: object
Main interface for basic process functionality
The
Process
class is intended to be inherited by specific process classes (e.g., IntegrityCoverage) and provides the basic functionality to build the channels and links between processes.
Child classes are expected to inherit the
__init__
execution, which basically means that, at the very least, the child must be defined as:
class ChildProcess(Process):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
This ensures that when the
ChildProcess
class is instantiated, it automatically sets the attributes of the parent class.
This also means that child processes must be instantiated providing information on the process type and jinja2 template with the nextflow code.
Parameters: - template : str
Name of the jinja2 template with the nextflow code for that process. Templates are stored in
generator/templates
.
Attributes:
- template_str: Class property that returns a populated template string
Methods
- get_user_channel(input_channel[, input_type]): Returns the main raw channel for the process
- render(template, context): Wrapper to the jinja2 render method from a template file
- set_channels(**kwargs): General purpose method that sets the main channels
- set_main_channel_names(input_suffix, …): Sets the main channel names based on the provided input and output channel suffixes
- set_secondary_channel(source, channel_list): General purpose method for setting a secondary channel
- update_attributes(attr_dict): Updates the directives attribute from a dictionary object
- update_main_forks(sink): Updates the forks attribute with the sink channel destination
- update_main_input
-
RAW_MAPPING
= {'fasta': {'checks': 'if (params.{0} instanceof Boolean){{exit 1, "\'{0}\' must be a path pattern. Provide value:\'$params.{0}\'"}}\nif (!params.{0}){{ exit 1, "\'{0}\' parameter missing"}}', 'description': 'Path fasta files. (default: $params.fastq)', 'channel': 'IN_fasta_raw', 'channel_str': 'Channel.fromPath(params.{0}).map{{ it -> file(it).exists() ? [it.toString().tokenize(\'/\').last().tokenize(\'.\')[0..-2].join(\'.\'), it] : null }}.ifEmpty {{ exit 1, "No fasta files provided with pattern:\'${{params.{0}}}\'" }}', 'default_value': "'fasta/*.fasta'", 'params': 'fasta'}, 'accessions': {'checks': 'if (!params.{0}){{ exit 1, "\'{0}\' parameter missing" }}\n', 'description': 'Path file with accessions, one perline. (default: $params.fastq)', 'channel': 'IN_accessions_raw', 'channel_str': 'Channel.fromPath(params.{0}).ifEmpty {{ exit 1, "No accessions file provided with path:\'${{params.{0}}}\'" }}', 'default_value': 'null', 'params': 'accessions'}, 'fastq': {'checks': 'if (params.{0} instanceof Boolean){{exit 1, "\'{0}\' must be a path pattern. Provide value:\'$params.{0}\'"}}\nif (!params.{0}){{ exit 1, "\'{0}\' parameter missing"}}', 'description': 'Path expression to paired-end fastq files. (default: $params.fastq)', 'channel': 'IN_fastq_raw', 'channel_str': 'Channel.fromFilePairs(params.{0}).ifEmpty {{ exit 1, "No fastq files provided with pattern:\'${{params.{0}}}\'" }}', 'default_value': "'fastq/*_{1,2}.*'", 'params': 'fastq'}}¶ dict: Contains the mapping between the
Process.input_type
attribute and the corresponding nextflow parameter and main channel definition, e.g.: "fastq": {"params": "fastq", "channel": "<channel>"}
-
pid
= None¶ int: Process ID number that represents the order and position in the generated pipeline
-
template
= None¶ str: Template name for the current process. This string will be used to fetch the file containing the corresponding jinja2 template in the
_set_template()
method
-
_template_path
= None¶ str: Path to the file containing the jinja2 template file. It’s set in
_set_template()
.
-
input_type
= None¶ str: Type of expected input data. Used to verify the connection between two processes is viable.
-
output_type
= None¶ str: Type of output data. Used to verify the connection between two processes is viable.
-
ignore_type
= None¶ boolean: If True, this process will ignore the input/output type requirements. This attribute is set to True for terminal singleton forks in the pipeline.
-
ignore_pid
= None¶ boolean: If True, this process will not make the pid advance. This is used for terminal forks before the end of the pipeline.
-
dependencies
= None¶ list: Contains the dependencies of the current process in the form of the
Process.template
attribute (e.g., [fastqc
])
-
input_channel
= None¶ str: Placeholder of the main input channel for the current process. This attribute can change dynamically depending on the forks and secondary channels in the final pipeline.
-
output_channel
= None¶ str: Placeholder of the main output channel for the current process. This attribute can change dynamically depending on the forks and secondary channels in the final pipeline.
-
input_user_channel
= None¶ dict: Stores a dictionary of two key:value pairs containing the raw input channel for the process. This is automatically determined by the input_type attribute, and will fetch the information that is mapped in the RAW_MAPPING variable. It will only be used by the first process(es) defined in a pipeline.
-
link_start
= None¶ list: List of strings with the starting points for secondary channels. When building the pipeline, these strings will be matched with equal strings in the
link_end
attribute of other Processes.
-
link_end
= None¶ list: List of dictionaries containing a string with the ending point for a secondary channel. Each dictionary should contain at least two key/vals:
{"link": <link string>, "alias":<string for template>}
-
status_channels
= None¶ list: Name of the status channels produced by the process. By default, it sets a single status channel. If more than one status channels are required for the process, list each one in this attribute (e.g.,
FastQC.status_channels
)
-
status_strs
= None¶ str: Name of the status channel for the current process. These strings will be provided to the StatusCompiler process to collect and compile status reports
-
forks
= None¶ list: List of strings with the literal definition of the forks for the current process, ready to be added to the template string.
-
main_forks
= None¶ list: List of the channels onto which the main output should be forked into. They will be automatically added to the
main_forks
attribute when setting the secondary channels
-
secondary_inputs
= None¶ list: List of dictionaries with secondary input channels from nextflow parameters. This dictionary should contain two key:value pairs with the
params
key, containing the parameter name, and thechannel
key, containing the nextflow channel definition:{ "params": "pathoSpecies", "channel": "IN_pathoSpecies = Channel .value(params.pathoSpecies)" }
-
extra_input
= None¶ str: Name of the params entry that will be used to provide extra input into the process. This extra input will be mixed with the main input channel using nextflow’s
mix
operator. Its channel will be defined at the start of the pipeline, based on thechannel_str
key of theRAW_MAPPING
for the corresponding input type.
-
params
= None¶ dict: Maps the parameter names to the corresponding default values.
-
_context
= None¶ dict: Dictionary with the keyword placeholders for the string template of the current process.
-
directives
= None¶ dict: Specifies the directives (cpus, memory, container) for each nextflow process in the template. If specified, these directives will be added to the nextflow configuration file. Otherwise, the default values for cpus and memory will be used. If no container directive is specified, the process will not run inside any container.
- The current supported directives are:
- cpus
- memory
- container
- container tag/version
An example of directives for two processes is as follows:
self.directives = { "processA": {"cpus": 1, "memory": "1GB"}, "processB": {"memory": "5GB", "container": "my/image", "version": "0.5.0"} }
-
compiler
= None¶ dict: Specifies channels from the current process that are received by a compiler process. Each key in this dictionary should match a compiler process key in
compilers
. The value should be a list of the channels that will be fed to the compiler process:self.compiler["patlas_consensus"] = ["mashScreenOutputChannel"]
-
_set_template
(template)[source]¶ Sets the path to the appropriate jinja template file
When a Process instance is initialized, this method will fetch the location of the appropriate template file, based on the
template
argument. It will raise an exception if the template file is not found. Otherwise, it will set the Process.template_path
attribute.
-
set_main_channel_names
(input_suffix, output_suffix, lane)[source]¶ Sets the main channel names based on the provided input and output channel suffixes. This is performed when connecting processes.
Parameters: - input_suffix : str
Suffix added to the input channel. Should be based on the lane and an arbitrary unique id
- output_suffix : str
Suffix added to the output channel. Should be based on the lane and an arbitrary unique id
- lane : int
Sets the lane of the process.
-
get_user_channel
(input_channel, input_type=None)[source]¶ Returns the main raw channel for the process
Provided with at least a channel name, this method returns the raw channel name and specification (the nextflow string definition) for the process. By default, it will fork from the raw input of the process’
input_type
attribute. However, this behaviour can be overridden by providing theinput_type
argument.If the specified or inferred input type exists in the
RAW_MAPPING
dictionary, the channel info dictionary will be retrieved along with the specified input channel. Otherwise, it will return None.An example of the returned dictionary is:
{"input_channel": "myChannel", "params": "fastq", "channel": "IN_fastq_raw", "channel_str":"IN_fastq_raw = Channel.fromFilePairs(params.fastq)" }
Returns: - dict or None
Dictionary with the complete raw channel info. None if no channel is found.
-
static
render
(template, context)[source]¶ Wrapper to the jinja2 render method from a template file
Parameters: - template : str
Path to template file.
- context : dict
Dictionary with kwargs context to populate the template
-
template_str
¶ Class property that returns a populated template string
This property allows the template of a particular process to be dynamically generated and returned when doing
Process.template_str
.Returns: - x : str
String with the complete and populated process template
-
set_channels
(**kwargs)[source]¶ General purpose method that sets the main channels
This method will take a variable number of keyword arguments to set the
Process._context
attribute with the information on the main channels for the process. This is done by appending the process ID (Process.pid
) attribute to the input, output and status channel prefix strings. In the output channel, the process ID is incremented by 1 to allow the connection with the channel in the next process.The
**kwargs
system for setting theProcess._context
attribute also provides additional flexibility. In this way, individual processes can provide additional information not covered in this method, without changing it.Parameters: - kwargs : dict
Dictionary with the keyword arguments for setting up the template context
-
update_main_forks
(sink)[source]¶ Updates the forks attribute with the sink channel destination
Parameters: - sink : str
Channel onto which the main input will be forked to
-
set_secondary_channel
(source, channel_list)[source]¶ General purpose method for setting a secondary channel
This method allows a given source channel to be forked into one or more channels and sets those forks in the
Process.forks
attribute. Both the source and the channels in the
channel_list
argument must be the final channel strings, which means that this method should be called only after setting the main channels.
If the source is not a main channel, this will simply create a fork into every channel in the
channel_list
argument list:
SOURCE_CHANNEL_1.into{SINK_1;SINK_2}
If the source is a main channel, this will apply some changes to the output channel of the process, to avoid overlapping main output channels. For instance, forking the main output channel for process 2 would create a
MAIN_2.into{...}
. The issue here is that the
MAIN_2
channel is expected as the input of the next process, but is now being used to create the fork. To solve this issue, the output channel is modified into
_MAIN_2
, and the fork is set to the provided channels plus the
MAIN_2
channel:
_MAIN_2.into{MAIN_2;MAIN_5;...}
Parameters: - source : str
String with the name of the source channel
- channel_list : list
List of channels that will receive a fork of the secondary channel
-
update_attributes
(attr_dict)[source]¶ Updates the directives attribute from a dictionary object.
This will only update the directives for processes that have been defined in the subclass.
Parameters: - attr_dict : dict
Dictionary containing the attributes that will be used to update the process attributes and/or directives.
-
class
assemblerflow.generator.process.
Compiler
(**kwargs)[source]¶ Bases:
assemblerflow.generator.process.Process
Extends the Process methods to status-type processes
Attributes: template_str
Class property that returns a populated template string
Methods
get_user_channel
(input_channel[, input_type])Returns the main raw channel for the process render
(template, context)Wrapper to the jinja2 render method from a template file set_channels
(**kwargs)General purpose method that sets the main channels set_compiler_channels
(channel_list[, operator])General method for setting the input channels for the status process set_main_channel_names
(input_suffix, …)Sets the main channel names based on the provided input and output channel suffixes. set_secondary_channel
(source, channel_list)General purpose method for setting a secondary channel update_attributes
(attr_dict)Updates the directives attribute from a dictionary object. update_main_forks
(sink)Updates the forks attribute with the sink channel destination update_main_input -
set_compiler_channels
(channel_list, operator='mix')[source]¶ General method for setting the input channels for the status process
Given a list of status channels that are gathered during the pipeline construction, this method will automatically set the input channel for the status process. This makes use of the
mix
channel operator of nextflow for multiple channels:STATUS_1.mix(STATUS_2,STATUS_3,...)
This will set the
status_channels
key for the_context
attribute of the process.Parameters: - channel_list : list
List of strings with the final name of the status channels
- operator : str
Specifies the operator used to join the compiler channels. Available options are ‘mix’ and ‘join’.
-
class
assemblerflow.generator.process.
Init
(**kwargs)[source]¶ Bases:
assemblerflow.generator.process.Process
Attributes: template_str
Class property that returns a populated template string
Methods
get_user_channel
(input_channel[, input_type])Returns the main raw channel for the process render
(template, context)Wrapper to the jinja2 render method from a template file set_channels
(**kwargs)General purpose method that sets the main channels set_extra_inputs
(channel_dict)Sets the initial definition of the extra input channels. set_main_channel_names
(input_suffix, …)Sets the main channel names based on the provided input and output channel suffixes. set_raw_inputs
(raw_input)Sets the main input channels of the pipeline and their forks. set_secondary_channel
(source, channel_list)General purpose method for setting a secondary channel set_secondary_inputs
(channel_dict)Adds secondary inputs to the start of the pipeline. update_attributes
(attr_dict)Updates the directives attribute from a dictionary object. update_main_forks
(sink)Updates the forks attribute with the sink channel destination update_main_input -
set_raw_inputs
(raw_input)[source]¶ Sets the main input channels of the pipeline and their forks.
The
raw_input
dictionary should contain one entry for each input type (fastq, fasta, etc.). The corresponding value should be a dictionary/json with the following key: value pairs:
channel
: Name of the raw input channel (e.g.: channel1)
channel_str
: The nextflow definition of the channel and eventual checks (e.g.: channel1 = Channel.fromPath(param))
raw_forks
: A list of channels to which the channel will fork.
Each new type of input parameter is automatically added to the
params
attribute, so that they are automatically collected for the pipeline description and help.Parameters: - raw_input : dict
Contains an entry for each input type with the channel name, channel string and forks.
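For instance, a raw_input dictionary for a fastq input type could look like the following (the fork channel names are illustrative):

raw_input = {
    "fastq": {
        "channel": "IN_fastq_raw",
        "channel_str": "IN_fastq_raw = Channel.fromFilePairs(params.fastq)",
        "raw_forks": ["trimmomatic_1_1", "fastqc_1_2"]
    }
}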
-
set_secondary_inputs
(channel_dict)[source]¶ Adds secondary inputs to the start of the pipeline.
These channels are inserted into the pipeline file exactly as provided in the values of the argument.
Parameters: - channel_dict : dict
Each entry should be <parameter>: <channel string>.
-
set_extra_inputs
(channel_dict)[source]¶ Sets the initial definition of the extra input channels.
The
channel_dict
argument should contain the input type and destination channel of each parameter (which is the key):channel_dict = { "param1": { "input_type": "fasta" "channels": ["abricate_2_3", "chewbbaca_3_4"] } }
Parameters: - channel_dict : dict
Dictionary with the extra_input parameter as key, and a dictionary as a value with the input_type and destination channels
-
class
assemblerflow.generator.process.
StatusCompiler
(**kwargs)[source]¶ Bases:
assemblerflow.generator.process.Compiler
Status compiler process template interface
This special process receives the status channels from all processes in the generated pipeline.
Attributes: template_str
Class property that returns a populated template string
Methods
get_user_channel
(input_channel[, input_type])Returns the main raw channel for the process render
(template, context)Wrapper to the jinja2 render method from a template file set_channels
(**kwargs)General purpose method that sets the main channels set_compiler_channels
(channel_list[, operator])General method for setting the input channels for the status process set_main_channel_names
(input_suffix, …)Sets the main channel names based on the provided input and output channel suffixes. set_secondary_channel
(source, channel_list)General purpose method for setting a secondary channel update_attributes
(attr_dict)Updates the directives attribute from a dictionary object. update_main_forks
(sink)Updates the forks attribute with the sink channel destination update_main_input
-
class
assemblerflow.generator.process.
ReportCompiler
(**kwargs)[source]¶ Bases:
assemblerflow.generator.process.Compiler
Reports compiler process template interface
This special process receives the report channels from all processes in the generated pipeline.
Attributes: template_str
Class property that returns a populated template string
Methods
get_user_channel
(input_channel[, input_type])Returns the main raw channel for the process render
(template, context)Wrapper to the jinja2 render method from a template file set_channels
(**kwargs)General purpose method that sets the main channels set_compiler_channels
(channel_list[, operator])General method for setting the input channels for the status process set_main_channel_names
(input_suffix, …)Sets the main channel names based on the provided input and output channel suffixes. set_secondary_channel
(source, channel_list)General purpose method for setting a secondary channel update_attributes
(attr_dict)Updates the directives attribute from a dictionary object. update_main_forks
(sink)Updates the forks attribute with the sink channel destination update_main_input
-
class
assemblerflow.generator.process.
PatlasConsensus
(**kwargs)[source]¶ Bases:
assemblerflow.generator.process.Compiler
Patlas consensus compiler process template interface
This special process receives the channels associated with the
patlas_consensus
key.Attributes: template_str
Class property that returns a populated template string
Methods
get_user_channel
(input_channel[, input_type])Returns the main raw channel for the process render
(template, context)Wrapper to the jinja2 render method from a template file set_channels
(**kwargs)General purpose method that sets the main channels set_compiler_channels
(channel_list[, operator])General method for setting the input channels for the status process set_main_channel_names
(input_suffix, …)Sets the main channel names based on the provided input and output channel suffixes. set_secondary_channel
(source, channel_list)General purpose method for setting a secondary channel update_attributes
(attr_dict)Updates the directives attribute from a dictionary object. update_main_forks
(sink)Updates the forks attribute with the sink channel destination update_main_input
assemblerflow.generator.process_details module¶
-
assemblerflow.generator.process_details.
colored_print
(msg, color_label='white_bold')[source]¶ Prints a message with the given color. An end_char can also be passed to print, allowing several strings to be printed on the same line by successive calls.
Parameters: - color_label: str
The color code to pass to the function, which enables color change as well as background color change.
- msg: str
The actual text to be printed.
- end_char: str
The character with which each print should finish. By default it is a newline.
-
assemblerflow.generator.process_details.
procs_dict_parser
(procs_dict)[source]¶ This function handles the dictionary of attributes of each Process class to print to stdout.
Parameters: - procs_dict: dict
A dictionary with the class attributes used by the argument that prints the lists of processes, both for short_list and for detailed_list.
-
assemblerflow.generator.process_details.
proc_collector
(process_map, args, processes_list=None)[source]¶ Function that collects all processes available and stores a dictionary of the required arguments of each process class to be passed to procs_dict_parser
Parameters: - process_map: dict
The dictionary with the Processes currently available in assemblerflow and their corresponding classes as values
- args: argparse.Namespace
The arguments passed through argparse that will be accessed to check the type of list to be printed
- processes_list: list
List with all the available processes of a recipe. If no recipe is passed, this list is empty.
Module contents¶
Placeholder for Process creation docs
assemblerflow.templates package¶
Subpackages¶
Submodules¶
assemblerflow.templates.assembly_report module¶
This module is intended to provide a summary report for a given assembly in Fasta format.
The following variables are expected whether using NextFlow or the
main()
executor.
fastq_id
: Sample Identification string.
- e.g.: 'SampleA'
assembly
: Path to assembly file in Fasta format.
- e.g.: 'assembly.fasta'
${fastq_id}_assembly_report.csv
: CSV with summary information of the assembly.
- e.g.: 'SampleA_assembly_report.csv'
-
class
assemblerflow.templates.assembly_report.
Assembly
(assembly_file, sample_id)[source]¶ Bases:
object
Class that parses and filters an assembly file in Fasta format.
This class parses an assembly file, collects a number of summary statistics and metadata from the contigs, and reports them.
Parameters: - assembly_file : str
Path to assembly file.
- sample_id : str
Name of the sample for the current assembly.
Methods
get_coverage_sliding
(coverage_file[, window]) get_gc_sliding
([window])Calculates a sliding window of the GC content for the assembly get_summary_stats
([output_csv])Generates a CSV report with summary statistics about the assembly -
summary_info
= None¶ OrderedDict: Initialize summary information dictionary. Contains keys:
- ncontigs: Number of contigs
- avg_contig_size: Average size of contigs
- n50: N50 metric
- total_len: Total assembly length
- avg_gc: Average GC proportion
- missing_data: Count of missing data characters
-
contigs
= None¶ OrderedDict: Object that maps the contig headers to the corresponding sequence
-
contig_coverage
= None¶ OrderedDict: Object that maps the contig headers to the corresponding list of per-base coverage
-
sample
= None¶ str: Sample id
-
contig_boundaries
= None¶ dict: Maps the boundaries of each contig in the genome
-
get_summary_stats
(output_csv=None)[source]¶ Generates a CSV report with summary statistics about the assembly
The calculated statistics are:
- Number of contigs
- Average contig size
- N50
- Total assembly length
- Average GC content
- Amount of missing data
Parameters: - output_csv: str
Name of the output CSV file.
-
get_gc_sliding
(window=500)[source]¶ Calculates a sliding window of the GC content for the assembly
Returns: - gc_res : list
List of GC proportion floats for each data point in the sliding window
- labels: list
List of labels for each data point
- xbars : list
List of the ending position of each contig in the genome
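A minimal sketch of such a sliding-window GC calculation (simplified; the actual method also builds the labels and contig boundary lists described above):

def gc_sliding_sketch(sequence, window=500):
    # Yield the GC proportion for each non-overlapping window.
    for i in range(0, len(sequence), window):
        chunk = sequence[i:i + window].upper()
        gc = chunk.count("G") + chunk.count("C")
        yield gc / float(len(chunk))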
assemblerflow.templates.fastqc module¶
This module is intended to run FastQC on paired-end FastQ files.
The following variables are expected whether using NextFlow or the
main()
executor.
fastq_pair
: Pair of FastQ file paths
- e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
The generated output consists of files that contain an object, usually a string.
pair_{1,2}_data
: File containing the FastQC report at the nucleotide level for each pair
- e.g.: 'pair_1_data' and 'pair_2_data'
pair_{1,2}_summary
: File containing the FastQC report for each category and for each pair
- e.g.: 'pair_1_summary' and 'pair_2_summary'
-
assemblerflow.templates.fastqc.
convert_adatpers
(adapter_fasta)[source]¶ Generates an adapter file for FastQC from a fasta file.
The provided adapters file is assumed to be a simple fasta file with the adapter’s name as header and the corresponding sequence:
>TruSeq_Universal_Adapter
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>TruSeq_Adapter_Index 1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
Parameters: - adapter_fasta : str
Path to Fasta file with adapter sequences.
Returns: - adapter_out : str or None
The path to the reformatted adapter file. Returns
None
if the adapters file does not exist or the path is incorrect.
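A minimal sketch of this conversion, assuming the output is the tab-separated name<TAB>sequence list accepted by FastQC's --adapters option (the output file name is illustrative):

import os

def convert_adapters_sketch(adapter_fasta, adapter_out="fastqc_adapters.tab"):
    if not os.path.isfile(adapter_fasta):
        return None
    with open(adapter_fasta) as fh, open(adapter_out, "w") as out:
        name = None
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
            elif name and line:
                # One 'name<TAB>sequence' entry per adapter.
                out.write("{}\t{}\n".format(name, line))
    return adapter_out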
assemblerflow.templates.fastqc_report module¶
This module is intended to parse the results of FastQC for paired-end FastQ samples. It parses two reports:
- Categorical report
- Nucleotide level report.
The following variables are expected whether using NextFlow or the
main()
executor.
fastq_id
: Sample identification string
- e.g.: 'SampleA'
result_p1
: Path to both FastQC result files for pair 1
- e.g.: 'SampleA_1_data SampleA_1_summary'
result_p2
: Path to both FastQC result files for pair 2
- e.g.: 'SampleA_2_data SampleA_2_summary'
opts
: Specify additional arguments for executing fastqc_report. The arguments should be a string of command line arguments. The accepted arguments are:
'--ignore-tests' : Ignores test results from the FastQC categorical summary. This is used in the first run of FastQC.
The generated output consists of files that contain an object, usually a string.
fastqc_health
: Stores the health check for the current sample. If it passes all checks, it contains only the string 'pass'. Otherwise, it contains the summary categories and their respective results.
- e.g.: 'pass'
optimal_trim
: Stores a tuple with the optimal trimming positions for the 5’ and 3’ ends of the reads.
- e.g.: '15 151'
-
assemblerflow.templates.fastqc_report.
write_json_report
(data1, data2)[source]¶ Writes the report
Parameters: - data1
- data2
-
assemblerflow.templates.fastqc_report.
get_trim_index
(biased_list)[source]¶ Returns the trim index from a
bool
listProvided with a list of
bool
elements ([False, False, True, True]
), this function will assess the index of the list that minimizes the number of True elements (biased positions) at the extremities. To do so, it will iterate over the boolean list and find an index position where there are two consecutiveFalse
elements after aTrue
element. This will be considered as an optimal trim position. For example, in the following list:[True, True, False, True, True, False, False, False, False, ...]
The optimal trim index will be the 4th position, since it is the first occurrence of a
True
element with two False elements after it.If the provided
bool
list has noTrue
elements, then the 0 index is returned.Parameters: - biased_list: list
List of
bool
elements, whereTrue
means a biased site.
Returns: - x : index position of the biased list for the optimal trim.
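A minimal sketch of the search described above (the exact off-by-one convention of the real implementation is an assumption here):

def get_trim_index_sketch(biased_list):
    if True not in biased_list:
        return 0
    for i in range(len(biased_list) - 2):
        # A biased position followed by two consecutive unbiased ones
        # marks the optimal trim position.
        if biased_list[i] and not biased_list[i + 1] and not biased_list[i + 2]:
            return i
    return 0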
-
assemblerflow.templates.fastqc_report.
trim_range
(data_file)[source]¶ Assess the optimal trim range for a given FastQC data file.
This function will parse a single FastQC data file, namely the ‘Per base sequence content’ category. It will retrieve the A/T and G/C content for each nucleotide position in the reads, and check whether the G/C and A/T proportions are between 80% and 120% of each other. If they fall outside this range, that nucleotide position is marked as biased for future removal.
Parameters: - data_file: str
Path to FastQC data file.
Returns: - trim_nt: list
List containing the range with the best trimming positions for the corresponding FastQ file. The first element is the 5’ end trim index and the second element is the 3’ end trim index.
-
assemblerflow.templates.fastqc_report.
get_sample_trim
(p1_data, p2_data)[source]¶ Get the optimal read trim range from data files of paired FastQ reads.
Given the FastQC data report files for paired-end FastQ reads, this function will assess the optimal trim range for the 3’ and 5’ ends of the paired-end reads. This assessment will be based on the ‘Per sequence GC content’.
Parameters: - p1_data: str
Path to FastQC data report file from pair 1
- p2_data: str
Path to FastQC data report file from pair 2
Returns: - optimal_5trim: int
Optimal trim index for the 5’ end of the reads
- optimal_3trim: int
Optimal trim index for the 3’ end of the reads
-
assemblerflow.templates.fastqc_report.
get_summary
(summary_file)[source]¶ Parses a FastQC summary report file and returns it as a dictionary.
This function parses a typical FastQC summary report file, retrieving only the information on the first two columns. For instance, a line could be:
'PASS Basic Statistics SH10762A_1.fastq.gz'
This parser will build a dictionary with the string in the second column as a key and the QC result as the value. In this case, the returned
dict
would be something like:{"Basic Statistics": "PASS"}
Parameters: - summary_file: str
Path to FastQC summary report.
Returns: - summary_info: OrderedDict
Returns the information of the FastQC summary report as an ordered dictionary, with the categories as strings and the QC result as values.
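A minimal sketch of this parser, assuming the tab-separated layout of FastQC's summary file:

from collections import OrderedDict

def get_summary_sketch(summary_file):
    summary_info = OrderedDict()
    with open(summary_file) as fh:
        for line in fh:
            fields = line.strip().split("\t")
            if len(fields) >= 2:
                # First column is the QC result, second the category name.
                summary_info[fields[1]] = fields[0]
    return summary_info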
-
assemblerflow.templates.fastqc_report.
check_summary_health
(summary_file, **kwargs)[source]¶ Checks the health of a sample from the FastQC summary file.
Parses the FastQC summary file and tests whether the sample is good or not. There are four categories that cannot fail, and two that must pass in order for the sample to pass this check. If the sample fails the quality checks, a list with the failing categories is also returned.
Categories that cannot fail:
fail_sensitive = [ "Per base sequence quality", "Overrepresented sequences", "Sequence Length Distribution", "Per sequence GC content" ]
Categories that must pass:
must_pass = [ "Per base N content", "Adapter Content" ]
Parameters: - summary_file: str
Path to FastQC summary file.
Returns: - x : bool
Returns
True
if the sample passes all tests.False
if not.- summary_info : list
A list with the FastQC categories that failed the tests. It is empty if the sample passes all tests.
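A minimal sketch of the health check logic, built on the category lists above (how the real implementation treats WARN results is an assumption here):

def check_summary_health_sketch(summary_info):
    fail_sensitive = ["Per base sequence quality", "Overrepresented sequences",
                      "Sequence Length Distribution", "Per sequence GC content"]
    must_pass = ["Per base N content", "Adapter Content"]
    # Sensitive categories fail the sample only on an explicit FAIL;
    # must-pass categories fail it on anything other than PASS.
    failed = [c for c in fail_sensitive if summary_info.get(c) == "FAIL"]
    failed += [c for c in must_pass if summary_info.get(c) != "PASS"]
    return len(failed) == 0, failed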
assemblerflow.templates.integrity_coverage module¶
This module receives paired FastQ files, a genome size estimate and a minimum coverage threshold and has three purposes while iterating over the FastQ files:
- Checks the integrity of FastQ files (detects corrupted files).
- Guesses the encoding of FastQ files (this can be turned off in the opts argument).
- Estimates the coverage for each sample.
The following variables are expected whether using NextFlow or the
main()
executor.
fastq_id
: Sample Identification string
- e.g.: 'SampleA'
fastq_pair
: Pair of FastQ file paths
- e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
gsize
: Expected genome size
- e.g.: '2.5'
cov
: Minimum coverage threshold
- e.g.: '15'
opts
: Specify additional arguments for executing integrity_coverage. The arguments should be a string of command line arguments, such as '-e'. The accepted arguments are:
'-e' : Skip encoding guess.
The generated output consists of files that contain an object, usually a string.
(Values within ${} are substituted by the corresponding variable.)
${fastq_id}_encoding
: Stores the encoding for the sample FastQ. If no encoding could be guessed, 'None' is written to the file.
- e.g.: 'Illumina-1.8' or 'None'
${fastq_id}_phred
: Stores the phred value for the sample FastQ. If no phred could be guessed, 'None' is written to the file.
- e.g.: '33' or 'None'
${fastq_id}_coverage
: Stores the expected coverage of the sample, based on a given genome size.
- e.g.: '112' or 'fail'
${fastq_id}_report
: Stores the report on the expected coverage estimation. The string written in this file will appear in the coverage report.
- e.g.: '${fastq_id}, 112, PASS'
${fastq_id}_max_len
: Stores the maximum read length for the current sample.
- e.g.: '152'
In case of a corrupted sample, all expected output files should have 'corrupt' written.
-
assemblerflow.templates.integrity_coverage.
RANGES
= {'Illumina-1.3': [64, (64, 104)], 'Solexa': [64, (59, 104)], 'Sanger': [33, (33, 73)], 'Illumina-1.8': [33, (33, 74)], 'Illumina-1.5': [64, (66, 105)]}¶ dict: Dictionary containing the encoding values for several fastq formats. The key contains the format and the value contains a list with the corresponding phred score and a list with the range of encodings.
-
assemblerflow.templates.integrity_coverage.
MAGIC_DICT
= {b'\x1f\x8b\x08': 'gz', b'\x42\x5a\x68': 'bz2', b'\x50\x4b\x03\x04': 'zip'}¶ dict: Dictionary containing the binary signatures for three compression formats (gzip, bzip2 and zip).
-
assemblerflow.templates.integrity_coverage.
guess_file_compression
(file_path, magic_dict=None)[source]¶ Guesses the compression of an input file.
This function guesses the compression of a given file by checking for a binary signature at the beginning of the file. These signatures are stored in the
MAGIC_DICT
dictionary. The supported compression formats are gzip, bzip2 and zip. If none of the signatures in this dictionary are found at the beginning of the file, it returnsNone
.Parameters: - file_path : str
Path to input file.
- magic_dict : dict, optional
Dictionary containing the signatures of the compression types. The key should be the binary signature and the value should be the compression format. If left
None
, it falls back toMAGIC_DICT
.
Returns: - file_type : str or None
If a compression type is detected, returns a string with the format. If not, returns
None
.
-
assemblerflow.templates.integrity_coverage.
get_qual_range
(qual_str)[source]¶ Gets the Unicode code range for a given string of characters.
The encoding is determined from the result of the
ord()
built-in.Parameters: - qual_str : str
Arbitrary string.
Returns: - x : tuple
(Minimum Unicode code, Maximum Unicode code).
-
assemblerflow.templates.integrity_coverage.
get_encodings_in_range
(rmin, rmax)[source]¶ Returns the valid encodings for a given encoding range.
The encoding ranges are stored in the
RANGES
dictionary, with the encoding name as a string and a list as a value containing the phred score and a tuple with the encoding range. For a given encoding range provided via the two first arguments, this function will return all possible encodings and phred scores.Parameters: - rmin : int
Minimum Unicode code in range.
- rmax : int
Maximum Unicode code in range.
Returns: - valid_encodings : list
List of all possible encodings for the provided range.
- valid_phred : list
List of all possible phred scores.
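A minimal sketch of the range lookup, using the RANGES dictionary shown above:

RANGES = {"Sanger": [33, (33, 73)], "Illumina-1.8": [33, (33, 74)],
          "Solexa": [64, (59, 104)], "Illumina-1.3": [64, (64, 104)],
          "Illumina-1.5": [64, (66, 105)]}

def get_encodings_in_range_sketch(rmin, rmax):
    valid_encodings, valid_phred = [], []
    for encoding, (phred, (emin, emax)) in RANGES.items():
        # An encoding is compatible when the observed code range
        # fits entirely inside its declared range.
        if rmin >= emin and rmax <= emax:
            valid_encodings.append(encoding)
            valid_phred.append(phred)
    return valid_encodings, valid_phred

# Combined with get_qual_range, e.g.:
# rmin, rmax = min(map(ord, qual_str)), max(map(ord, qual_str))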
assemblerflow.templates.mapping2json module¶
This module is intended to generate a json output for mapping results that can be imported in pATLAS.
The following variables are expected whether using NextFlow or the
main()
executor.
depth_file
: String with the name of the samtools depth output file.
- e.g.: 'samtoolsDepthOutput_sampleA.txt'
json_dict
: The file that contains the dictionary with the accessions and their respective lengths.
- e.g.: 'reads_sample_result_length.json'
cutoff
: The cutoff used to trim the unwanted matches for the minimum coverage results from mapping. This value may range between 0 and 1.
- e.g.: 0.6
-
assemblerflow.templates.mapping2json.
depthfilereader
(depth_file, plasmid_length, cutoff)[source]¶ Parses a samtools depth file and creates the dictionaries used to build the outputs of this script: the tabular file and the json file that may be imported by pATLAS
Parameters: - depth_file: str
the path to depth file for each sample
- plasmid_length: dict
a dictionary that stores length of all plasmids in fasta given as input
- cutoff: str
The cutoff used to trim the unwanted matches for the minimum coverage results from mapping. It is converted into a float within this function in order to compare it with the value returned by perc_value_per_ref.
Returns: - percentage_basescovered: dict
stores the percentage of the total sequence of a reference/accession (plasmid) in a dictionary
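A minimal sketch of the parsing step, assuming the three-column reference/position/depth layout of samtools depth output (names are illustrative):

def depth_fraction_sketch(depth_file, plasmid_length, cutoff):
    covered = {}
    with open(depth_file) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 3:
                continue
            ref, depth = fields[0], int(fields[2])
            if depth > 0:
                covered[ref] = covered.get(ref, 0) + 1
    cutoff = float(cutoff)
    # Keep only references whose covered fraction reaches the cutoff.
    return {ref: count / float(plasmid_length[ref])
            for ref, count in covered.items()
            if count / float(plasmid_length[ref]) >= cutoff}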
assemblerflow.templates.mashdist2json module¶
This module is intended to generate a json output for mash dist results that can be imported in pATLAS.
The following variables are expected whether using NextFlow or the
main()
executor.
mash_output
: String with the name of the mash dist output file.
- e.g.: 'fastaFileA_mashdist.txt'
-
assemblerflow.templates.mashdist2json.
send_to_output
(master_dict, last_seq, mash_output)[source]¶ Sends a dictionary to an output json file. This function writes the master_dict dictionary to a json file if master_dict is populated with entries; otherwise, it won’t create the file.
Parameters: - master_dict: dict
dictionary that stores all entries for a specific query sequence in multi-fasta given to mash dist as input against patlas database
- last_seq: str
string that stores the last sequence that was parsed before writing to file and therefore after the change of query sequence between different rows on the input file
- mash_output: str
the name/path of input file to main function, i.e., the name/path of the mash dist output txt file.
assemblerflow.templates.mashscreen2json module¶
This module is intended to generate a json output for mash screen results that can be imported in pATLAS.
The following variables are expected whether using NextFlow or the
main()
executor.
mash_output
: String with the name of the mash screen output file.
- e.g.: 'sortedMashScreenResults_SampleA.txt'
assemblerflow.templates.pATLAS_consensus_json module¶
This module is intended to generate a json output from the consensus results from all the approaches available through options (mapping, assembly, mash screen)
The following variables are expected whether using NextFlow or the
main()
executor.
mapping_json
: String with the name of the json file with mapping results.
- e.g.: 'mapping_SampleA.json'
dist_json
: String with the name of the json file with mash dist results.
- e.g.: 'mash_dist_SampleA.json'
screen_json
: String with the name of the json file with mash screen results.
- e.g.: 'mash_screen_sampleA.json'
assemblerflow.templates.pipeline_status module¶
This module is intended to collect pipeline run statistics (such as time, CPU and RAM usage for each task) into a report JSON
trace_file
: Trace file generated by nextflow
-
assemblerflow.templates.pipeline_status.
main
(fastq_id, trace_file, workdir)[source]¶ Parses a nextflow trace file, searches for processes with a specific tag and sends a JSON report with the relevant information
The expected fields for the trace file are:
0. task_id
1. process
2. tag
3. status
4. exit code
5. start timestamp
6. container
7. cpus
8. duration
9. realtime
10. queue
11. cpu percentage
12. memory percentage
13. real memory size of the process
14. virtual memory size of the process
Parameters: - trace_file : str
Path to the nextflow trace file
assemblerflow.templates.process_abricate module¶
This module is intended to parse the results of Abricate for one or more samples.
The following variables are expected whether using NextFlow or the
main()
executor.
abricate_files
: Path to abricate output file.
- e.g.: 'abr_resfinder.tsv'
-
class
assemblerflow.templates.process_abricate.
Abricate
(fls)[source]¶ Bases:
object
Main parser for Abricate output files.
This class parses one or more output files from Abricate, usually from different databases. In addition to the parsing methods, it also provides a flexible method to filter and re-format the content of the abricate files.
Parameters: - fls : list
List of paths to Abricate output files.
Methods
get_filter
(*args, **kwargs)Wrapper of the iter_filter method that returns a list with results iter_filter
(filters[, databases, fields, …])General purpose filter iterator. parse_files
(fls)Public method for parsing abricate output files. -
storage
= None¶ dict: Main storage of Abricate’s file content. Each entry corresponds to a single line and contains the keys:
- infile: Input file of Abricate.
- reference: Reference of the query sequence.
- seq_range: Range of the query sequence in the database sequence.
- gene: AMR gene name.
- accession: The genomic source of the sequence.
- database: The database the sequence came from.
- coverage: Proportion of gene covered.
- identity: Proportion of exact nucleotide matches.
-
parse_files
(fls)[source]¶ Public method for parsing abricate output files.
This method is called at class instantiation for the provided output files. Additional abricate output files can be added using this method after the class instantiation.
Parameters: - fls : list
List of paths to Abricate files
-
iter_filter
(filters, databases=None, fields=None, filter_behavior='and')[source]¶ General purpose filter iterator.
This general filter iterator allows the filtering of entries based on one or more custom filters. These filters must contain an entry of the storage attribute, a comparison operator, and the test value. For example, to filter out entries with coverage below 80:
my_filter = ["coverage", ">=", 80]
Filters should always be provided as a list of lists:
iter_filter([["coverage", ">=", 80]])
# or
my_filters = [["coverage", ">=", 80], ["identity", ">=", 50]]
iter_filter(my_filters)
As a convenience, a list of the desired databases can be directly specified using the database argument, which will only report entries for the specified databases:
iter_filter(my_filters, databases=["plasmidfinder"])
By default, this method will yield the complete entry record. However, the returned fields can be specified using the fields option:
iter_filter(my_filters, fields=["reference", "coverage"])
Parameters: - filters : list
List of lists with the custom filter. Each list should have three elements. (1) the key from the entry to be compared; (2) the comparison operator; (3) the test value. Example:
[["identity", ">", 80]]
.- databases : list
List of databases that should be reported.
- fields : list
List of fields from each individual entry that are yielded.
- filter_behavior : str
options: 'and', 'or'
Sets the behaviour of the filters when multiple filters have been provided. By default it is set to 'and', which means that an entry has to pass all filters. It can be set to 'or', in which case only one of the filters has to pass.
Yields: - dic : dict
Dictionary object containing a
Abricate.storage
entry that passed the filters.
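One common way to resolve such ["key", "op", value] triplets is to map the operator strings onto Python's operator module; a sketch of that idea (not necessarily how iter_filter resolves them internally):

import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}

def entry_passes(entry, filters, filter_behavior="and"):
    # Evaluate every [key, op, value] filter against the entry dict.
    results = [OPS[op](entry[key], value) for key, op, value in filters]
    return all(results) if filter_behavior == "and" else any(results)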
-
class
assemblerflow.templates.process_abricate.
AbricateReport
(*args, **kwargs)[source]¶ Bases:
assemblerflow.templates.process_abricate.Abricate
Report generator for single Abricate output files
This class is intended to parse an Abricate output file from a single sample and database and generates a JSON report for the report webpage.
Parameters: - fls : list
List of paths to Abricate output files.
- database : (optional) str
Name of the database for the current report. If not provided, it will be inferred based on the first entry of the Abricate file.
Methods
get_filter
(*args, **kwargs)Wrapper of the iter_filter method that returns a list with results get_plot_data
()Generates the JSON report to plot the gene boxes get_table_data
()iter_filter
(filters[, databases, fields, …])General purpose filter iterator. parse_files
(fls)Public method for parsing abricate output files. write_report_data
()Writes the JSON report to a json file -
get_plot_data
()[source]¶ Generates the JSON report to plot the gene boxes
Following the convention of the reports platform, this method returns a list of JSON/dict objects with the information about each entry in the abricate file. The information contained in this JSON is:
{contig_id: <str>, seqRange: [<int>, <int>], gene: <str>, accession: <str>, coverage: <float>, identity: <float> }
Note that the seqRange entry contains the position in the corresponding contig, not the absolute position in the whole assembly.
Returns: - json_dic : list
List of JSON/dict objects with the report data.
assemblerflow.templates.process_assembly_mapping module¶
This module is intended to process the coverage report from the
assembly_mapping
process.
TODO: Better purpose
The following variables are expected whether using NextFlow or the
main()
executor.
fastq_id
: Sample Identification string.
- e.g.: 'SampleA'
assembly
: Fasta assembly file.
- e.g.: 'SH10761A.assembly.fasta'
coverage
: TSV file with the average coverage for each assembled contig.
- e.g.: 'coverage.tsv'
coverage_bp
: TSV file with the coverage for each assembled bp.
- e.g.: 'coverage.tsv'
bam_file
: BAM file with the alignment of reads to the genome.
- e.g.: 'sorted.bam'
opts
: List of options for processing assembly mapping output.
- Minimum coverage for assembled contigs. Can be 'auto'.
- e.g.: 'auto' or '10'
- Maximum number of contigs.
- e.g.: '100'
gsize
: Expected genome size.
- e.g.: '2.5'
${fastq_id}_filtered.assembly.fasta
: Filtered assembly file in Fasta format.
- e.g.: 'SampleA_filtered.assembly.fasta'
filtered.bam
: BAM file with the same filtering as the assembly file.
- e.g.: filtered.bam
-
assemblerflow.templates.process_assembly_mapping.
parse_coverage_table
(coverage_file)[source]¶ Parses a file with coverage information into objects.
This function parses a TSV file containing coverage results for all contigs in a given assembly and will build an
OrderedDict
with the information about their coverage and length. The length information is actually gathered from the contig header using a regular expression that assumes the usual header produced by Spades:contig_len = int(re.search("length_(.+?)_", line).group(1))
Parameters: - coverage_file : str
Path to TSV file containing the coverage results.
Returns: - coverage_dict : OrderedDict
Contains the coverage and length information for each contig.
- total_size : int
Total size of the assembly in base pairs.
- total_cov : int
Sum of coverage values across all contigs.
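A minimal sketch of this parser, assuming a two-column contig/coverage TSV and the Spades-style headers mentioned above:

import re
from collections import OrderedDict

def parse_coverage_table_sketch(coverage_file):
    coverage_dict = OrderedDict()
    total_size, total_cov = 0, 0
    with open(coverage_file) as fh:
        for line in fh:
            fields = line.strip().split()
            if len(fields) < 2:
                continue
            contig, cov = fields[0], int(float(fields[1]))
            # Contig length is recovered from the Spades-style header.
            length = int(re.search("length_(.+?)_", contig).group(1))
            coverage_dict[contig] = {"cov": cov, "len": length}
            total_size += length
            total_cov += cov
    return coverage_dict, total_size, total_cov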
-
assemblerflow.templates.process_assembly_mapping.
filter_assembly
(assembly_file, minimum_coverage, coverage_info, output_file)[source]¶ Generates a filtered assembly file.
This function generates a filtered assembly file based on an original assembly and a minimum coverage threshold.
Parameters: - assembly_file : str
Path to original assembly file.
- minimum_coverage : int or float
Minimum coverage required for a contig to pass the filter.
- coverage_info : OrderedDict or dict
Dictionary containing the coverage information for each contig.
- output_file : str
Path where the filtered assembly file will be generated.
-
assemblerflow.templates.process_assembly_mapping.
filter_bam
(coverage_info, bam_file, min_coverage, output_bam)[source]¶ Uses Samtools to filter a BAM file according to minimum coverage
Provided with a minimum coverage value, this function will use Samtools to filter a BAM file. This is performed to apply the same filter to the BAM file as the one applied to the assembly file in
filter_assembly()
.Parameters: - coverage_info : OrderedDict or dict
Dictionary containing the coverage information for each contig.
- bam_file : str
Path to the BAM file.
- min_coverage : int
Minimum coverage required for a contig to pass the filter.
- output_bam : str
Path to the generated filtered BAM file.
-
assemblerflow.templates.process_assembly_mapping.
check_filtered_assembly
(coverage_info, coverage_bp, minimum_coverage, genome_size, contig_size, max_contigs)[source]¶ Checks whether a filtered assembly passes a size threshold
Given a minimum coverage threshold, this function evaluates whether an assembly passes a minimum threshold of genome_size * 1e6 * 0.8 (80% of the expected genome size) and a maximum threshold of genome_size * 1e6 * 1.5 (150% of the expected genome size). It will issue a warning if either of these thresholds is crossed. If the assembly size falls below 80% of the expected genome size, it returns False.
Parameters: - coverage_info : OrderedDict or dict
Dictionary containing the coverage information for each contig.
- coverage_bp : dict
Dictionary containing the per base coverage information for each contig. Used to determine the total number of base pairs in the final assembly.
- minimum_coverage : int
Minimum coverage required for a contig to pass the filter.
- genome_size : int
Expected genome size.
- contig_size : dict
Dictionary with the length of each contig. Contig headers as keys and the corresponding length as values.
- max_contigs : int
Maximum threshold for contig number. A warning is issued if this threshold is crossed.
Returns: - x : bool
True if the filtered assembly size is higher than 80% of the expected genome size.
-
assemblerflow.templates.process_assembly_mapping.
get_coverage_from_file
(coverage_file)[source]¶ Parameters: - coverage_file
-
assemblerflow.templates.process_assembly_mapping.
evaluate_min_coverage
(coverage_opt, assembly_coverage, assembly_size)[source]¶ Evaluates the minimum coverage threshold from the value provided in the coverage_opt.
Parameters: - coverage_opt : str or int or float
If set to “auto”, it will automatically determine the minimum coverage as 1/3 of the average assembly coverage, with a minimum value of 10. If set to an int or float, the specified value will be used.
- assembly_coverage : int or float
The average assembly coverage for a genome assembly. This value is retrieved by the parse_coverage_table() function.
- assembly_size : int
The size of the genome assembly. This value is retrieved by the get_assembly_size() function.
Returns: - x: int
Minimum coverage threshold.
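Interpreting "auto" as one third of the average assembly coverage with a floor of 10 (an assumed reading of the description above), a sketch could be:

def evaluate_min_coverage_sketch(coverage_opt, assembly_coverage):
    if coverage_opt == "auto":
        # Assumed behaviour: a third of the average coverage, never below 10.
        return max(10, int(assembly_coverage / 3))
    return int(float(coverage_opt))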
-
assemblerflow.templates.process_assembly_mapping.
get_assembly_size
(assembly_file)[source]¶ Returns the number of nucleotides and the size per contig for the provided assembly file path
Parameters: - assembly_file : str
Path to assembly file.
Returns: - assembly_size : int
Size of the assembly in nucleotides
- contig_size : dict
Length of each contig (contig name as key and length as value)
assemblerflow.templates.process_spades module¶
assemblerflow.templates.spades module¶
This module is intended to execute Spades on paired-end FastQ files.
The following variables are expected whether using NextFlow or the
main()
executor.
fastq_id
: Sample Identification string.
- e.g.: 'SampleA'
fastq_pair
: Pair of FastQ file paths.
- e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
kmers
: Setting for Spades kmers. Can be either 'auto', 'default' or a user provided list.
- e.g.: 'auto' or 'default' or '55 77 99 113 127'
opts
: List of options for spades execution.
- The minimum number of reads to consider an edge in the de Bruijn graph during the assembly.
- e.g.: '5'
- Minimum contigs k-mer coverage.
- e.g.: ['2' '2']
contigs.fasta
: Main output of spades with the assembly
- e.g.: contigs.fasta
spades_status
: Stores the status of the spades run. If it was successfully executed, it stores 'pass'. Otherwise, it stores the STDERR message.
- e.g.: 'pass'
-
assemblerflow.templates.spades.
set_kmers
(kmer_opt, max_read_len)[source]¶ Returns a kmer list based on the provided kmer option and max read len.
Parameters: - kmer_opt : str
The k-mer option. Can be either 'auto', 'default' or a sequence of space separated integers, e.g. '23 45 67'.
- max_read_len : int
The maximum read length of the current sample.
Returns: - kmers : list
List of k-mer values that will be provided to Spades.
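A minimal sketch of this dispatch (the k-mer ladders chosen for 'auto' are illustrative, not the actual presets):

def set_kmers_sketch(kmer_opt, max_read_len):
    if kmer_opt == "auto":
        # Illustrative heuristic: a longer ladder for longer reads.
        if max_read_len >= 175:
            return [55, 77, 99, 113, 127]
        return [21, 33, 55, 67, 77]
    if kmer_opt == "default":
        # An empty list leaves the k-mer choice to Spades itself.
        return []
    # User-provided, space separated integers.
    return [int(k) for k in kmer_opt.split()]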
assemblerflow.templates.trimmomatic module¶
This module is intended to execute trimmomatic on paired-end FastQ files.
The following variables are expected whether using NextFlow or the
main()
executor.
fastq_id
: Sample Identification string.
- e.g.: 'SampleA'
fastq_pair
: Pair of FastQ file paths.
- e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
trim_range
: Crop range detected using FastQC.
- e.g.: '15 151'
opts
: List of options for trimmomatic, in the order '[trim_sliding_window, trim_leading, trim_trailing, trim_min_length]'.
- e.g.: '["5:20", "3", "3", "55"]'
phred
: List of guessed phred values for each sample
- e.g.: '[SampleA: 33, SampleB: 33]'
The generated output consists of files that contain an object, usually a string.
(Values within ${} are substituted by the corresponding variable.)
${fastq_id}_*P*
: Pair of paired FastQ files generated by Trimmomatic
- e.g.: 'SampleA_1_P.fastq.gz SampleA_2_P.fastq.gz'
trimmomatic_status
: Stores the status of the trimmomatic run. If it was successfully executed, it stores 'pass'. Otherwise, it stores the STDERR message.
- e.g.: 'pass'
-
assemblerflow.templates.trimmomatic.
parse_log
(log_file)[source]¶ Retrieves some statistics from a single Trimmomatic log file.
This function parses Trimmomatic’s log file and stores some trimming statistics in an
OrderedDict
object. This object contains the following keys:clean_len
: Total length after trimming.total_trim
: Total trimmed base pairs.total_trim_perc
: Total trimmed base pairs in percentage.5trim
: Total base pairs trimmed at 5’ end.3trim
: Total base pairs trimmed at 3’ end.
Parameters: - log_file : str
Path to trimmomatic log file.
Returns: - x :
OrderedDict
Object storing the trimming statistics.
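A minimal sketch of this accumulation, assuming the usual Trimmomatic trimlog columns (read name, surviving length, amount trimmed from the start, position of the last surviving base, amount trimmed from the end):

from collections import OrderedDict

def parse_log_sketch(log_file):
    stats = OrderedDict([("clean_len", 0), ("total_trim", 0),
                         ("total_trim_perc", 0), ("5trim", 0), ("3trim", 0)])
    original_len = 0
    with open(log_file) as fh:
        for line in fh:
            # Index from the end, since read names may contain spaces.
            fields = line.split()
            surviving = int(fields[-4])
            start_trim = int(fields[-3])
            end_trim = int(fields[-1])
            stats["clean_len"] += surviving
            stats["5trim"] += start_trim
            stats["3trim"] += end_trim
            original_len += surviving + start_trim + end_trim
    stats["total_trim"] = stats["5trim"] + stats["3trim"]
    if original_len:
        stats["total_trim_perc"] = round(
            stats["total_trim"] / float(original_len) * 100, 2)
    return stats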
-
assemblerflow.templates.trimmomatic.
write_report
(storage_dic, output_file)[source]¶ Writes a report from multiple samples.
Parameters: - storage_dic : dict or
OrderedDict
Storage containing the trimming statistics. See
parse_log()
for its generation.- output_file : str
Path where the output file will be generated.
assemblerflow.templates.trimmomatic_report module¶
This module is intended to parse the results of the Trimmomatic log for a set of one or more samples.
The following variables are expected whether using NextFlow or the
main()
executor.
log_files
: Trimmomatic log files.
- e.g.: 'Sample1_trimlog.txt Sample2_trimlog.txt'
trimmomatic_report.csv
: Summary report of the trimmomatic logs for all samples
-
assemblerflow.templates.trimmomatic_report.
parse_log
(log_file)[source]¶ Retrieves some statistics from a single Trimmomatic log file.
This function parses Trimmomatic’s log file and stores some trimming statistics in an
OrderedDict
object. This object contains the following keys:clean_len
: Total length after trimming.total_trim
: Total trimmed base pairs.total_trim_perc
: Total trimmed base pairs in percentage.5trim
: Total base pairs trimmed at 5’ end.3trim
: Total base pairs trimmed at 3’ end.
Parameters: - log_file : str
Path to trimmomatic log file.
Returns: - x :
OrderedDict
Object storing the trimming statistics.
-
assemblerflow.templates.trimmomatic_report.
write_report
(storage_dic, output_file)[source]¶ Writes a report from multiple samples.
Parameters: - storage_dic : dict or
OrderedDict
Storage containing the trimming statistics. See
parse_log()
for its generation.- output_file : str
Path where the output file will be generated.
Module contents¶
Placeholder for template generation docs