FlowCraft

_images/logo_large.png

A NextFlow pipeline assembler for genomics.

Overview

FlowCraft is an assembler of pipelines written in nextflow for the analysis of genomic data. The premise is simple:

Software tools are container blocks → Build your lego-like pipeline → Execute it (almost) anywhere.

What is Nextflow

If you do not know nextflow, be sure to check it out. It’s an awesome framework based on the dataflow programming model used for building parallelized, scalable and reproducible workflows using software containers. It provides an abstraction layer between the execution and the logic of the pipeline, which means that the same pipeline code can be executed on multiple platforms, from a local laptop to clusters managed with SLURM, SGE, etc. These are quite attractive features since genomic pipelines are increasingly executed on large computer clusters to handle large volumes of data and/or tasks. Moreover, portability and reproducibility are becoming central pillars in modern data science.

What FlowCraft does

FlowCraft is a python engine that automatically builds nextflow pipelines by assembling pre-made, ready-to-use components. These components are modular pieces of software or scripts, such as fastqc, trimmomatic, spades, etc., that are written for nextflow and have a set of attributes, such as input and output types, parameters, directives, etc. This modular nature allows them to be freely connected as long as they respect some basic rules, such as the requirement that the input type of a component match the output type of the preceding component. In this way, nextflow processes can be written only once, and FlowCraft is the magic glue that connects them, handling the linking and forking of channels automatically. Moreover, each component is associated with a docker image, which means that there is no need to install any dependencies at all and all software runs in a transparent and reliable box. To illustrate:

  • A linear genome assembly pipeline can be easily built using FlowCraft with the following pipeline string:

    trimmomatic fastqc spades
    

Which will generate all the necessary files to run the nextflow pipeline on any linux system that has nextflow and a container engine.

  • You can easily add more components to perform assembly polishing, in this case, pilon:

    trimmomatic fastqc spades pilon
    
  • If a new assembler comes along and you want to switch that component in the pipeline, it’s as easy as replacing spades (or any other component):

    trimmomatic fastqc skesa pilon
    
  • And you can also fork the output of a component into multiple components. For instance, we could annotate the resulting assemblies with multiple tools:

    trimmomatic fastqc spades pilon (abricate | prokka)
    
  • Or fork the execution of a pipeline early on to compare different software:

    trimmomatic fastqc (spades pilon | skesa pilon)
    

This will fork the output of fastqc into spades and skesa, and the pipeline will proceed independently in these two new ‘lanes’.

  • Directives for each process can be dynamically set when building the pipeline, such as the cpu/RAM usage or the software version:

    trimmomatic={'cpus':'4'} fastqc={'version':'0.11.5'} skesa={'memory':'10GB'} pilon (abricate | prokka)
    
  • And extra input can be directly inserted in any part of the pipeline. For example, it is possible to assemble genomes from both fastq files and SRR accessions (downloaded from public databases) in a single workflow:

    download_reads trimmomatic={'extra_input':'reads'} fastqc skesa pilon
    

This pipeline can be executed by providing a file with accession numbers (--accessions parameter by default) and fastq reads, using the --reads parameter defined with the extra_input directive.
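
For instance, a minimal end-to-end sketch of this workflow (the file names are illustrative):

flowcraft build -t "download_reads trimmomatic={'extra_input':'reads'} fastqc skesa pilon" -o hybrid_pipe.nf
nextflow run hybrid_pipe.nf --accessions accessions.txt --reads "fastq/*_{1,2}.*"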

Who is FlowCraft for

FlowCraft can be useful for bioinformaticians with varied levels of expertise who need to execute genomic pipelines often, potentially on different platforms. Building and executing pipelines requires no programming knowledge, but familiarity with nextflow is highly recommended to take full advantage of the generated pipelines.

At the moment, the available pre-made processes are mainly focused on bacterial genome assembly simply because that was how we started. However, our goal is to expand the library of existing components to other commonly used tools in the field of genomics and to widen the applicability and usefulness of FlowCraft pipelines.

Why not just write a Nextflow pipeline?

In many cases, building a static nextflow pipeline is sufficient for our goals. However, when building our own pipelines, we often felt the need to add dynamism to this process, particularly if we take into account how fast new tools arise and existing ones change. Our biological goals also change over time and we might need different pipelines to answer different questions. FlowCraft makes this very easy by having a set of pre-made and ready-to-use components that can be freely assembled. By using components (fastqc, trimmomatic) as its atomic elements, very complex pipelines that take full advantage of nextflow can be built with little effort. Moreover, these components have explicit and standardized input and output types, which means that the addition of new modules does not require any changes in the existing code base. They just need to take into account how data will be received by the process and how data may be emitted from the process, to ensure that they can link with other components.

However, why not both?

FlowCraft generates a complete Nextflow pipeline file, which can be used as a starting point for your customized processes!

Installation

User installation

FlowCraft is available as a bioconda package, which already comes with nextflow:

conda install flowcraft

Alternatively, you can install only FlowCraft, via pip:

pip install flowcraft

You will also need a container engine (see Container engine below).

Container engine

All components of FlowCraft are executed in docker containers, which means that you’ll need to have a container engine installed. The container engines available are the ones supported by Nextflow:

  • docker
  • singularity
  • shifter

If you already have any one of these installed, you are good to go. If not, you’ll need to install one. We recommend singularity because it does not require the processes to run on a separate root daemon.

Singularity

Singularity is available to download and install here. Make sure that you have singularity v2.5.x or higher. Note that singularity should be installed as root and available on the machine(s) that will be running the nextflow processes.

Important

Singularity is available as a bioconda package. However, conda installs singularity in user space without root privileges, which may prevent singularity images from being correctly downloaded. Therefore it is not recommended that you install singularity via bioconda.

Docker

Docker can be installed following the instructions on the website: https://www.docker.com/community-edition#/download. To run docker as a non-root user, you’ll need to follow the instructions on the website: https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user

Developer installation

If you are looking to contribute to FlowCraft or simply interested in tweaking it, clone the github repository and its submodule and then run setup.py:

git clone https://github.com/assemblerflow/flowcraft.git
cd flowcraft
git submodule update --init --recursive
python3 setup.py install

About

FlowCraft is developed by the Molecular Microbiology and Infection Unit (UMMI) at the Instituto de Medicina Molecular Joao Antunes.

This project is licensed under the GPLv3 license. The source code of FlowCraft is available at https://github.com/assemblerflow/flowcraft and the webservice is available at https://github.com/assemblerflow/flowcraft-webapp.

Basic Usage

FlowCraft currently has two execution modes, build and inspect, which are used to build and inspect nextflow pipelines, respectively. A report mode is also being developed.

Build

Assembling a pipeline

Pipelines are generated using the build mode of FlowCraft and the -t parameter to specify the components inside quotes:

flowcraft build -t "trimmomatic fastqc spades" -o my_pipe.nf

All components should be written inside quotes, separated by spaces. This command will generate a linear pipeline with three components in the current working directory (for more features and tips on how pipelines can be built, see the pipeline building section). A linear pipeline means that there are no bifurcations between components, and the input data will flow linearly.

The rationale of how the data flows across the pipeline is simple and intuitive. Data enters a component and is processed in some way, which may result in the creation of result files (stored in the results directory) and report files (stored in the reports directory) (see Results and reports below). If that component has an output_type, it will feed the processed data into the next component (or components), and this is repeated until the end of the pipeline.

If you are interested in checking the pipeline DAG tree, open the my_pipe.html file (same name as the pipeline, with the html extension) in any browser.

_images/fork_4.png

The integrity_coverage component is a dependency of trimmomatic, so it was automatically added to the pipeline.

Important

Not all pipeline configurations will work. You always need to ensure that the output type of a component matches the input type of the next component, otherwise FlowCraft will exit with an error.

Pipeline directory

In addition to the main nextflow pipeline file (my_pipe.nf), FlowCraft will write several auxiliary files that are necessary for the pipeline to run. The contents of the directory should look something like this:

$ ls
bin                lib           my_pipe.nf       params.config     templates
containers.config  my_pipe.html  nextflow.config  profiles.config   resources.config  user.config

You do not have to worry about most of these files. However, the *.config files can be modified to change several aspects of the pipeline run (see Pipeline configuration for more details). Briefly:

  • params.config: Contains all the available parameters of the pipeline (see Parameters below). These can be changed here, or provided directly on run-time (e.g.: nextflow run --fastq value).
  • resources.config: Contains the resource directives of the pipeline processes, such as cpus, allocated RAM and other nextflow process directives.
  • containers.config: Specifies the container and version tag of each process in the pipeline.
  • profiles.config: Contains a number of predefined profiles of executor and container engine.
  • user.config: Empty configuration file that is not over-written if you build another pipeline in the same directory. Used to set persistent configurations across different pipelines.

Parameters

The parameters of the pipeline can be viewed by running the pipeline file with nextflow and using the --help option:

$ nextflow run my_pipe.nf --help
N E X T F L O W  ~  version 0.30.1
Launching `my_pipe.nf` [kickass_mcclintock] - revision: 480b3455ba

============================================================
                F L O W C R A F T
============================================================
Built using flowcraft v1.2.1.dev1


Usage:
    nextflow run my_pipe.nf

       --fastq                     Path expression to paired-end fastq files. (default: 'fastq/*_{1,2}.*')

       Component 'INTEGRITY_COVERAGE_1_1'
       ----------------------------------
       --genomeSize_1_1            Genome size estimate for the samples in Mb. It is used to estimate the coverage and other assembly parameters and checks (default: 1)
       --minCoverage_1_1           Minimum coverage for a sample to proceed. By default it's set to 0 to allow any coverage (default: 0)

       Component 'TRIMMOMATIC_1_2'
       ---------------------------
       --adapters_1_2              Path to adapters files, if any. (default: 'None')
       --trimSlidingWindow_1_2     Perform sliding window trimming, cutting once the average quality within the window falls below a threshold (default: '5:20')
       --trimLeading_1_2           Cut bases off the start of a read, if below a threshold quality (default: 3)
       --trimTrailing_1_2          Cut bases of the end of a read, if below a threshold quality (default: 3)
       --trimMinLength_1_2         Drop the read if it is below a specified length  (default: 55)

       Component 'FASTQC_1_3'
       ----------------------
       --adapters_1_3              Path to adapters files, if any. (default: 'None')

       Component 'SPADES_1_4'
       ----------------------
       --spadesMinCoverage_1_4     The minimum number of reads to consider an edge in the de Bruijn graph during the assembly (default: 2)
       --spadesMinKmerCoverage_1_4 Minimum contigs K-mer coverage. After assembly only keep contigs with reported k-mer coverage equal or above this value (default: 2)
       --spadesKmers_1_4           If 'auto' the SPAdes k-mer lengths will be determined from the maximum read length of each assembly. If 'default', SPAdes will use the default k-mer lengths.  (default: 'auto')

All these parameters are specific to the components of the pipeline. However, the main input parameter (or parameters) of the pipeline is always available. In this case, since the pipeline started with fastq paired-end files as the main input, the --fastq parameter is available. If the pipeline started with any other input type or with more than one input type, the appropriate parameters will appear (more information in the raw input types section).

The parameters are composed of their name (adapters) followed by the ID of the process they refer to (_1_2). The IDs can be consulted in the DAG tree (see Assembling a pipeline). This is done to prevent issues when duplicating components and, as such, all parameters will be independent between different components. This behaviour can be changed when building the pipeline by using the --merge-params option (see Merge parameters).

Note

The --merge-params option of the build mode will merge all parameters with identical names (e.g.: --genomeSize_1_1 and --genomeSize_1_5 become simply --genomeSize). This is usually more appropriate and useful in linear pipelines without component duplication.

Providing/modifying parameters

These parameters can be provided on run-time:

nextflow run my_pipe.nf --genomeSize_1_1 5 --adapters_1_2 "/path/to/adapters"

or edited in the params.config file:

params {
    genomeSize_1_1 = 5
    adapters_1_2 = "path/to/adapters"
}

Most parameters in FlowCraft’s components already come with sensible defaults, which means that usually you’ll only need to provide a small number of arguments. In the example above, --fastq is the only required parameter. We have placed fastq files in the data directory:

$ ls data
sample_1.fastq.gz  sample_2.fastq.gz

We’ll need to provide the pattern to the fastq files. This pattern is perhaps a bit confusing at first, but it’s necessary for the correct inference of the read pairs:

--fastq "data/*_{1,2}.*"

In this case, the pairs are separated by the “_1.” or “_2.” substring, which leads to the pattern *_{1,2}.*. Another common nomenclature for paired fastq files is something like sample_R1_L001.fastq.gz. In this case, an acceptable pattern would be *_R{1,2}_*.

Important

Note the quotes around the fastq path pattern. These quotes are necessary to allow nextflow to resolve the pattern, otherwise your shell might try to resolve it and provide the wrong input to nextflow.

Execution

Once you build your pipeline with FlowCraft, you have a standard nextflow pipeline ready to run. Therefore, all you need to do is:

nextflow run my_pipe.nf --fastq "data/*_{1,2}.*"

Changing executor and container engine

By default, a FlowCraft pipeline is executed locally using the singularity container engine. In nextflow terms, this is equivalent to having executor = "local" and singularity.enabled = true. If you want to change these settings, you can modify the nextflow.config file, or use one of the available profiles in the profiles.config file. These profiles provide combinations of common <executor>_<container_engine> pairs that are supported by nextflow. Therefore, if you want to run the pipeline on a cluster with SLURM and shifter, you’ll just need to specify the slurm_shifter profile:

nextflow run my_pipe.nf --fastq "data/*_{1,2}.*" -profile slurm_shifter

Common executors include:

  • slurm
  • sge
  • lsf
  • pbs

Other container engines are:

  • docker
  • singularity
  • shifter

Docker images

All components of FlowCraft are executed in containers, which means that the first time they are executed on a machine, the corresponding image will have to be downloaded. In the case of docker, images are pulled and stored in /var/lib/docker by default. In the case of singularity, the nextflow.config generated by FlowCraft sets the cache dir for the images at $HOME/.singularity_cache. Note that when an image is downloading, nextflow does not display any informative message, except for singularity, where you’ll get something like:

Pulling Singularity image docker://ummidock/trimmomatic:0.36-2 [cache /home/diogosilva/.singularity_cache/ummidock-trimmomatic-0.36-2.img]

So, if a process seems to take too long to run the first time, it’s probably because the image is being downloaded.

Results and reports

As the pipeline runs, processes may write result and report files to the results and reports directories, respectively. For example, the reports of the pipeline above would look something like this:

reports
├── coverage_1_1
│   └── estimated_coverage_initial.csv
├── fastqc_1_3
│   ├── FastQC_2run_report.csv
│   ├── run_2
│   │   ├── sample_1_0_summary.txt
│   │   └── sample_1_1_summary.txt
│   ├── sample_1_1_trim_fastqc.html
│   └── sample_1_2_trim_fastqc.html
└── status
    ├── master_fail.csv
    ├── master_status.csv
    └── master_warning.csv

The estimated_coverage_initial.csv file contains a very rough coverage estimation for each sample, the fastqc* directory contains the html reports and summary files of FastQC for each sample, and the status directory contains a log of the status, warnings and fails of each process for each sample.

The actual results of each process that produces them are stored in the results directory:

results
├── assembly
│   └── spades_1_4
│       └── sample_1_trim_spades3111.fasta
└── trimmomatic_1_2
    ├── sample_1_1_trim.fastq.gz
    └── sample_1_2_trim.fastq.gz

If you are interested in checking the actual environment where the execution of a particular process occurred for any given sample, you can inspect the pipeline_stats.txt file in the root of the pipeline directory. This file contains rich information about the execution of each process, including the working directory:

task_id hash        process         tag         status      exit    start                   container                           cpus    duration    realtime    queue   %cpu    %mem    rss     vmem
5       7c/cae270   trimmomatic_1_2 sample_1    COMPLETED   0       2018-04-12 11:42:29.599 docker:ummidock/trimmomatic:0.36-2  2       1m 25s      1m 17s      -       329.3%  1.1%    1.5 GB  33.3 GB

The hash column contains the start of the current working directory of that process. For the example above, the directory would be:

work/7c/cae270*
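
For instance, a hypothetical shell session to locate and inspect that directory (the hidden .command.* scripts/logs and .exitcode file are standard nextflow task files):

# expand the truncated hash into the full work directory
ls -d work/7c/cae270*
# list the task files, including .command.sh and .command.log
ls -a work/7c/cae270*/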

Inspect

FlowCraft has two options (overview and broadcast) for inspecting the progress of a pipeline that is running locally, either on a personal computer or on a server machine. In both cases, the progress of the pipeline will be continuously updated in real-time.

In a terminal

To open inspect in the terminal, just run the following command in the folder where the pipeline is running:

flowcraft inspect
_images/flowcraft_inspect_terminal.png

overview is the default behavior of this module, but it can also be called like this:

flowcraft inspect -m overview

Note

To exit the inspection just type q or ctrl+c.

In a browser

It is also possible to track the pipeline progress in a browser on any device using the flowcraft web application. To do so, the following command should be run in the folder where the pipeline is running:

flowcraft inspect -m broadcast

This will output a URL to the terminal that can be opened in a browser. This is an example of the screen that is displayed once the URL is opened:

_images/flowcraft_inspect_broadcast.png

Important

This pipeline inspection will be available to anyone via the provided URL, which means that it can be shared with anyone and opened on any device with a browser. However, the inspection will only be available while the flowcraft inspect -m broadcast command is running. Once this command is cancelled, the data will be erased from the service and the URL will no longer be available.

Want to know more?

See Pipeline inspection for the full documentation of the inspect mode.

Reports

The reports of a FlowCraft pipeline are saved in a JSON file stored at pipeline_reports/pipeline_report.json. To visualize the reports, you’ll just need to execute the following command in the folder where the pipeline was executed:

flowcraft report

This will output a URL to the terminal that can be opened in a browser. This is an example of the screen that is displayed once the URL is opened:

_images/flowcraft_report.png

The actual layout and content of the reports will depend on the pipeline you build and it will only provide the information that is directly related to your pipeline components.

Important

This pipeline report will be available to anyone via the provided URL, which means that it can be shared with anyone and opened on any device with a browser. However, the report will only be available while the flowcraft report command is running. Once this command is cancelled, the data will be erased from the service and the URL will no longer be available.

Real time reports

The reports of any FlowCraft pipeline can be monitored in real-time using the --watch option:

flowcraft report --watch

This will output a URL exactly as in the previous section and will render the same reports page with one small addition: in the top right of the navigation bar, a new icon informs the user when new reports are available:

_images/flowcraft_report_watch.png

Local visualization

The FlowCraft report JSON file can also be visualized locally by dragging and dropping it into the FlowCraft web application page, currently hosted at http://www.flowcraft.live/reports

Offline visualization

The complete FlowCraft report is also available as a standalone HTML file that can be visualized offline. This HTML file, stored in pipeline_reports/pipeline_report.html, can be opened in any modern browser.

Pipeline building

FlowCraft offers a few extra features when building pipelines using the build execution mode.

Raw input types

The first component (or components) you place at the start of the pipeline determine the raw input type and the parameter for providing input data. The input type information is provided in the documentation page of each component. For instance, if the first component is FastQC, which has an input type of FastQ, the parameter for providing the raw input data will be --fastq. Here are the currently supported input types and their respective parameters:

  • FastQ: --fastq
  • Fasta: --fasta
  • Accessions: --accessions
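
For example, a pipeline starting with a fasta-input component, such as abricate, will expose --fasta instead of --fastq. A minimal sketch, with illustrative file names:

flowcraft build -t "abricate" -o amr.nf
nextflow run amr.nf --fasta "assemblies/*.fasta"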

Merge parameters

By default, parameters in a FlowCraft pipeline are unique and independent between different components, even if the parameters have the same name and/or the components are the same. This allows for the execution of the same software using different parameters in a single workflow. The params.config of these pipelines will look something like:

params {
    /*
    Component 'trimmomatic_1_2'
    --------------------------
    */
    adapters_1_2 = 'None'
    trimSlidingWindow_1_2 = '5:20'
    trimLeading_1_2 = 3
    trimTrailing_1_2 = 3
    trimMinLength_1_2 = 55

    /*
    Component 'fastqc_1_3'
    ---------------------
    */
    adapters_1_3 = 'None'
}

Notice that the adapters parameter occurs twice and can be independently set in each component.

If you want to override this behaviour, FlowCraft has a --merge-params option that merges all parameters with the same name into a single parameter, which is then equally applied to all components. So, if we generate the pipeline above with this option:

flowcraft build -t "trimmomatic fastqc" -o pipe.nf --merge-params

Then, the params.config will become:

params {
    adapters = 'None'
    trimSlidingWindow = '5:20'
    trimLeading = 3
    trimTrailing = 3
    trimMinLength = 55
}

Forks

The output of any component in a FlowCraft pipeline can be forked into two or more components, using the following fork syntax:

trimmomatic fastqc (spades | skesa)
_images/fork_1.png

In this example, the output of fastqc will be forked into two new lanes, which will proceed independently from each other. In this syntax, a fork is triggered by the ( symbol (and the corresponding closing )) and each lane is separated by a | symbol. There is no limit to the number of forks or lanes that a pipeline can have. For instance, we could add more components after the skesa module, including another fork:

trimmomatic fastqc (spades | skesa pilon (abricate | prokka | chewbbaca))
_images/fork_2.png

In this example, data will be forked after fastqc into two new lanes, processed by spades and skesa. In the skesa lane, data will continue to flow into the pilon component and its output will fork into three new lanes.

It is also possible to start a fork at the beginning of the pipeline, which basically means that the pipeline will have multiple starting points. If we want to provide the raw input to multiple processes, the fork syntax can start at the beginning of the pipeline:

(seq_typing | trimmomatic fastqc (spades | skesa))
_images/fork_3.png

In this case, since both initial components (seq_typing and integrity_coverage) receive fastq files as input, the data provided via the --fastq parameter will be forked and provided to both processes.

Note

Some components have dependencies which need to be included previously in the pipeline. For instance, trimmomatic requires integrity_coverage and pilon requires assembly_mapping. By default, FlowCraft will insert any missing dependencies right before the process, which is why these components appear in the figures above.

Warning

Pay special attention to the syntax of the pipeline string when using forks. When FlowCraft is unable to parse it, it will do its best to inform you where the parsing error occurred.

Directives

Several directives with information on cpu usage, RAM, version, etc. can be specified for each individual component when building the pipeline using the ={} notation. These directives are written to the resources.config and containers.config files that are generated in the pipeline directory. You can pass any of the directives already supported by nextflow (https://www.nextflow.io/docs/latest/process.html#directives), but the most commonly used include:

  • cpus
  • memory
  • queue

In addition, you can also pass the container and version directives which are parsed by FlowCraft to dynamically change the container and/or version tag of any process.

Here is an example where we specify cpu usage, allocated memory and container version in the pipeline string:

flowcraft build -t "fastqc={'version':'0.11.5'} \
                        trimmomatic={'cpus':'2'} \
                        spades={'memory':'\'10GB\''}" -o my_pipeline.nf

When a directive is not specified, it will assume the default value of the nextflow directive.

Warning

Take special care not to include any white space characters inside the directives field. Common mistakes occur when specifying directives like fastqc={'version': '0.11.5'} (note the space after the colon).

Note

The values specified in these directives are placed in the respective config files exactly as they are. For instance, spades={'memory':'10GB'} will appear in the config as spades.memory = 10GB, which will raise an error in nextflow because 10GB should be a string. Therefore, if you want a string you’ll need to add the escaped quotes as in this example: spades={'memory':'\'10GB\''}. The reason why these directives are not automatically converted is to allow the specification of dynamic computing resources, such as spades={'memory':'{10.GB*task.attempt}'}

Extra inputs

By default, only the first process (or processes) in a pipeline will receive the raw input data provided by the user. However, the extra_input special directive allows one or more processes to receive input from an additional parameter that is provided by the user:

reads_download integrity_coverage={'extra_input':'local'} trimmomatic spades

The default main input of this pipeline is a text file with accession numbers for the reads_download component. The extra_input creates a new parameter, named local in this example, that allows us to provide additional input data to the integrity_coverage component directly:

nextflow run pipe.nf --accessions accession_list.txt --local "fastq/*_{1,2}.*"

What will happen in this pipeline is that the fastq files provided to the integrity_coverage component will be mixed with the ones provided by the reads_download component. Therefore, if we provide 10 accessions and 10 fastq samples, we’ll end up with 20 samples being processed by the end of the pipeline.

It is important to note that the extra input parameter expects data compliant with the input type of the process. If files other than fastq files were provided in the pipeline above, this would result in a pipeline error.

If the extra_input directive is used on a component that has a different input type from the first component in the pipeline, it is possible to use the default value:

trimmomatic spades abricate={'extra_input':'default'}

In this case, the input type of the first component is fastq and the input type of abricate is fasta. The default value will make available the default parameter for fasta raw input, which is --fasta:

nextflow run pipe.nf --fastq "fastq/*_{1,2}.*" --fasta "fasta/*.fasta"

Pipeline file

Instead of providing the pipeline components via the command line, you can specify them in a text file:

# my_pipe.txt
trimmomatic fastqc spades

And then provide the pipeline file to the -t parameter:

flowcraft build -t my_pipe.txt -o my_pipe.nf

Pipeline files are usually more readable, particularly when they become more complex. Consider the following example:

integrity_coverage (
    spades={'memory':'\'50GB\''} |
    skesa={'memory':'\'40GB\'','cpus':'4'} |
    trimmomatic fastqc (
        spades pilon (abricate={'extra_input':'default'} | prokka) |
        skesa pilon (abricate | prokka)
    )
)

In addition to being more readable, it is also easier to edit, re-use and share.

Pipeline configuration

When a nextflow pipeline is built with FlowCraft, a number of configuration files are automatically generated in the same directory. They are all imported at the end of the nextflow.config file and are sorted by their configuration role. All configuration files are overwritten if you build another pipeline in the same directory, with the exception of the user.config file, which is meant to be a persistent configuration file.

Parameters

The params.config file includes all available parameters for the pipeline and their respective default values. Most of these parameters already contain sensible defaults.

Resources

The resources.config file includes the majority of the directives provided for each process, including cpus and memory. You’ll note that each process name has a suffix like _1_1, which is a unique process identifier composed of <lane>_<process_number>. This ensures that even when the same component is specified multiple times in a pipeline, you’ll still be able to set directives for each one individually.
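
As an illustration, overriding the resources of a single process could look like the following sketch, written with nextflow's withName selector (the generated file may use a different selector syntax):

process {
    withName: trimmomatic_1_2 {
        cpus = 4
        memory = '4GB'
    }
}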

Containers

The containers.config file includes the container directive for each process in the pipeline. These containers are retrieved from dockerhub if they do not exist locally yet. You can change the container string to any other value, but it should point to an image that exists on dockerhub or locally.
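
For example, pinning a process to a specific image could look like this sketch (the selector syntax in the generated file may differ; the tag shown appears elsewhere in this guide):

process {
    withName: trimmomatic_1_2 {
        container = "ummidock/trimmomatic:0.36-2"
    }
}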

Profiles

The profiles.config file includes a set of pre-made profiles with all possible combinations of executors and container engines. You can add new ones or modify existing ones.

User configurations

The user.config file is a configuration file that is not overwritten when a new pipeline is built in the same directory. It can contain any configuration that is supported by nextflow and will overwrite all other configuration files.
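
A minimal user.config sketch with illustrative settings:

// user.config: persists across pipeline builds in this directory
// and overrides the other configuration files
process {
    queue = 'long'
}
params {
    genomeSize_1_1 = 5
}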

Pipeline inspection

FlowCraft offers an inspect mode for tracking the progress of a nextflow pipeline either directly in a terminal (overview) or by broadcasting information to the flowcraft web application (broadcast).

Note

This mode was designed for nextflow pipelines generated by FlowCraft. It should be possible to inspect any nextflow pipeline, provided that the requirements below are met, but compatibility is not guaranteed.

How it works: Simply run flowcraft inspect -m <mode> in the directory where the pipeline is running. In either run mode, FlowCraft will keep running (until you cancel it) and continuously update the progress of a pipeline. If the pipeline is interrupted or fails for some reason, FlowCraft should be able to correctly reset the inspection automatically when resuming its execution.

Requirements for inspect

While the inspect mode is running, it will parse the information written into two files that are generated by nextflow:

  • .nextflow.log: The log file that is automatically generated by nextflow.
  • trace file: The trace file that is generated by nextflow when using the -with-trace option. By default, it searches for the pipeline_stats.txt file, but this can be changed using the -i option.
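
FlowCraft-generated pipelines already write pipeline_stats.txt. For other nextflow pipelines, tracing must be enabled explicitly before inspection (a sketch; the file names are illustrative):

nextflow run some_pipe.nf -with-trace my_trace.txt
flowcraft inspect -i my_trace.txt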

Trace fields

FlowCraft parses several fields of the trace file, but only a few are mandatory for its execution. If the trace file does not contain any of the optional fields, that information will simply not appear on the terminal or web app. Nevertheless, to take full advantage of the inspect mode, the following trace fields should be present:

  • Mandatory:
    • tag: The tag of the nextflow process. FlowCraft assumes that this is a string with only the sample name (e.g.: SampleA). While this is not strictly required, providing strings with other information (e.g.: Running bowtie for sampleA) may result in some inconsistencies in the inspection.
    • task_id: The task ID is used to skip entries that have already been parsed.
  • Optional:
    • hash: Used to get the work directory of the process execution.
    • cpus, %cpu, memory, rss, rchar and wchar: Used for statistics of computational resources.

Note

Any additional fields present in the trace file are ignored.

Usage

flowcraft inspect --help
usage: flowcraft inspect [-h] [-i TRACE_FILE] [-r REFRESH_RATE]
                         [-m {overview,broadcast}] [-u URL] [--pretty]

optional arguments:
  -h, --help            show this help message and exit
  -i TRACE_FILE         Specify the nextflow trace file.
  -r REFRESH_RATE       Set the refresh frequency for the continuous inspect
                        functions
  -m {overview,broadcast}, --mode {overview,broadcast}
                        Specify the inspection run mode.
  -u URL, --url URL     Specify the URL to where the data should be broadcast
  --pretty              Pretty inspection mode that removes usual reporting
                        processes.
  • -i: Used to specify the path to the trace file that should be parsed. By default, FlowCraft will try to parse the pipeline_stats.txt file in the current working directory.
  • -r: Sets the time interval in seconds between each parsing of the relevant nextflow files. By default it is set to 0.01.
  • -m: The inspection mode. overview is the terminal display while broadcast sends the data to FlowCraft’s web service.
  • -u: The URL of FlowCraft’s web service. By default it is already set to the main service and you do not need to specify it. It is only useful when the service is running on local host or in other custom instance.
  • --pretty: By default the inspection shows the progress of all processes in the pipeline. Using this option filters the processes to the most relevant ones of FlowCraft’s pipelines.
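
Putting a few of these options together (the values are illustrative):

flowcraft inspect -m broadcast -r 1 --pretty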

Pipeline reports

abricate

Table data

AMR table:
  • <abricate database>: Number of hits for a given database
_images/abricate_table.png

Plot data

  • Sliding window AMR annotation: Provides annotation of Abricate hits for each database along the genome. This report component is only available when the pilon component was used downstream of abricate.
_images/sliding_window_amr.png

assembly_mapping

Plot data

  • Data loss chart: Gives a trend of the data loss (in total number of base pairs) across components that may filter this data.
_images/sparkline.png

Warnings

Assembly table:
  • When the number of contigs exceeds the threshold of 100 contigs per 1.5Mb.

Fails

Assembly table:
  • When the assembly size is smaller than 80% or larger than 150% of the expected genome size.

check_coverage

Table data

Quality control table:
  • Coverage: Estimated coverage based on the number of base pairs and the expected genome size.
_images/quality_control_table.png

Warnings

Quality control table:
  • When the encoding and phred score cannot be guessed from the FastQ file(s).

Fails

Quality control table:
  • When the sample has lower estimated coverage than the provided coverage threshold.

chewbbaca

Table data

Chewbbaca table:
  • Table with the summary statistics of ChewBBACA allele calling, including the number of exact matches, inferred loci, loci not found, etc.
_images/chewbbaca_table.png

dengue_typing

Table data

Typing table:
  • seqtyping: The sequence typing result (serotype-genotype).
_images/typing_table_dengue.png

fastqc

Plot data

  • Base sequence quality: The average quality score across the read length.
_images/fastqc_base_sequence_quality.png
  • Sequence quality: Distribution of the mean sequence quality score.
_images/fastqc_per_base_sequence_quality.png
  • Base GC content: Distribution of the GC content of each sequence.
_images/fastqc_base_gc_content.png
  • Sequence length: Distribution of the read sequence length.
_images/fastqc_sequence_length.png
  • Missing data: Normalized count of missing data across the read length.
_images/fastqc_missing_data.png

Warnings

The following FastQC categories will issue a warning when they have a WARN flag:
  • Per base sequence quality.
  • Overrepresented sequences.
The following FastQC categories will issue a warning when they do not have a PASS flag:
  • Per base sequence content.

Fails

The following FastQC categories will issue a fail when they have a FAIL flag:
  • Per base sequence quality.
  • Overrepresented sequences.
  • Sequence length distribution.
  • Per sequence GC content.
The following FastQC categories will issue a fail when they do not have a PASS flag:
  • Per base N content.
  • Adapter content.

fastqc_trimmomatic

Table data

Quality control table:
  • Trimmed (%): Percentage of trimmed base pairs.
_images/quality_control_table.png

Plot data

  • Data loss chart: Gives a trend of the data loss (in total number of base pairs) across components that may filter this data.
_images/sparkline.png

integrity_coverage

Table data

Quality control table:
  • Raw BP: Number of raw base pairs from the FastQ file(s).
  • Reads: Number of reads in the FastQ file(s)
  • Coverage: Estimated coverage based on the number of base pairs and the expected genome size.
_images/quality_control_table.png

Plot data

  • Data loss chart: Gives a trend of the data loss (in total number of base pairs) across components that may filter this data.
_images/sparkline.png

Warnings

Quality control table:
  • When the encoding and phred score cannot be guessed from the FastQ file(s).

Fails

Quality control table:
  • When the sample has lower estimated coverage than the provided coverage threshold.

mash_dist

Table data

Plasmids table:
  • Mash Dist: Number of plasmid hits
_images/mash_dist_table.png

Plot data

  • Sliding window Plasmid annotation: Provides annotation of plasmid hits along the genome assembly. This report component is only available when the mash_dist component is used.
_images/sliding_window_mash_dist.png

mlst

Table data

Typing table:
  • MLST species: The inferred species name.
  • MLST ST: The inferred sequence type.
_images/typing_table.png

patho_typing

Table data

Typing table:
  • Patho_typing: The pathotyping result.
_images/typing_table.png

pilon

Table data

Quality control table:
  • Contigs: Number of assembled contigs.
  • Assembled BP: Total number of assembled base pairs.
_images/assembly_table_skesa.png

Plot data

  • Contig size distribution: Distribution of the size of each assembled contig.
_images/contig_size_distribution.png
  • Sliding window coverage and GC content: Provides coverage and GC content metrics along the genome using a sliding window approach and two synchronised charts.
_images/sliding_window_amr.png

Warnings

Quality control table:
  • When the encoding and phred score cannot be guessed from the FastQ file(s).

Fails

Quality control table:
  • When the sample has lower estimated coverage than the provided coverage threshold.

process_mapping

Table data

Read mapping table:
  • Reads: Number of reads in the FastQ file(s).
  • Unmapped: Number of unmapped reads.
  • Mapped 1x: Number of reads that aligned, concordantly or discordantly, exactly 1 time.
  • Mapped >1x: Number of reads that aligned, concordantly or discordantly, more than 1 time.
  • Overall alignment rate (%): Overall alignment rate
_images/read_mapping_remove_host.png

process_newick

Tree data

Phylogenetic reconstruction with bootstrap values for the provided tree.

_images/phylogenetic_tree.png

process_skesa

Table data

Quality control table:
  • Contigs (skesa): Number of assembled contigs.
  • Assembled BP: Total number of assembled base pairs.
_images/assembly_table_skesa.png

Warnings

Assembly table:
  • When the number of contigs exceeds the threshold of 100 contigs per 1.5Mb.

Fails

Assembly table:
  • When the assembly size is smaller than 80% or larger than 150% of the expected genome size.

process_spades

Table data

Quality control table:
  • Contigs (spades): Number of assembled contigs.
  • Assembled BP: Total number of assembled base pairs.
_images/assembly_table_spades.png

Warnings

Assembly table:
  • When the number of contigs exceeds the threshold of 100 contigs per 1.5Mb.

Fails

Assembly table:
  • When the assembly size is smaller than 80% or larger than 150% of the expected genome size.

process_viral_assembly

Table data

Quality control table:
  • Contigs (SPAdes): Number of assembled contigs.
  • Assembled BP (SPAdes): Total number of assembled base pairs.
  • ORFs: Number of complete ORFs in the assembly.
  • Contigs (MEGAHIT): Number of assembled contigs.
  • Assembled BP (MEGAHIT): Total number of assembled base pairs.
_images/assembly_table_viral_assembly.png

Fails

Assembly table:
  • When the assembly size is smaller than 80% or larger than 150% of the expected genome size.

seq_typing

Table data

Typing table:
  • seqtyping: The sequence typing result.
_images/typing_table.png

sistr

Table data

Typing table:
  • sistr: The sequence typing result.
_images/typing_table.png

trimmomatic

Table data

Quality control table:
  • Trimmed (%): Percentage of trimmed base pairs.
_images/quality_control_table.png

Plot data

  • Data loss chart: Gives a trend of the data loss (in total number of base pairs) across components that may filter this data.
_images/sparkline.png

true_coverage

Table data

Quality control table:
  • True Coverage: Estimated coverage based on read mapping on MLST genes.
_images/quality_control_table.png

Fails

Quality control table:
  • When the sample has lower estimated coverage than the provided coverage threshold.

Components

These are the currently available FlowCraft components, with a short description of their tasks. For more detailed information, follow the links of each component.

Download

  • reads_download: Downloads reads from the SRA/ENA public databases from a list of accessions.
  • fasterq_dump: Downloads reads from the SRA public databases from a list of accessions, using fasterq-dump.

Reads Quality Control

  • check_coverage: Estimates the coverage for each sample and filters FastQ files according to a specified minimum coverage threshold.
  • fastqc: Runs FastQC on paired-end FastQ files.
  • fastqc_trimmomatic: Runs Trimmomatic on paired-end FastQ files informed by the FastQC report.
  • filter_poly: Runs PrinSeq on paired-end FastQ files to remove low complexity sequences.
  • integrity_coverage: Tests the integrity of the provided FastQ files, provides the option to filter FastQ files based on the expected assembly coverage and provides information about the maximum read length and sequence encoding.
  • trimmomatic: Runs Trimmomatic on paired-end FastQ files.
  • downsample_fastq: Subsamples fastq files up to a target coverage depth.

Assembly

  • megahit: Assembles metagenomic paired-end FastQ files using megahit.
  • metaspades: Assembles metagenomic paired-end FastQ files using metaSPAdes.
  • skesa: Assembles paired-end FastQ files using skesa.
  • spades: Assembles paired-end FastQ files using SPAdes.

Post-assembly

  • pilon: Corrects and filters assemblies using Pilon.
  • process_skesa: Processes the assembly output from Skesa and performs filtering based on quality criteria of GC content, k-mer coverage and read length.
  • process_spades: Processes the assembly output from Spades and performs filtering based on quality criteria of GC content, k-mer coverage and read length.

Binning

  • maxbin2: An automatic tool for binning metagenomic sequences.

Annotation

  • abricate: Performs anti-microbial gene screening using abricate.
  • card_rgi: Performs anti-microbial resistance gene screening using CARD rgi (with contigs as input).
  • prokka: Performs assembly annotation using prokka.

Distance Estimation

  • mash_dist: Executes mash dist against a reference index plasmid database and generates a JSON for pATLAS. This component calculates pairwise distances between sequences (one from the database and the query sequence). However, if a different database is provided, it can use mash dist for other purposes.
  • mash_screen: Performs mash screen against a reference index plasmid database and generates a JSON input file for pATLAS. This component searches for containment of a given sequence in read sequencing data. However, if a different database is provided, it can use mash screen for other purposes.
  • fast_ani: Performs pairwise comparisons between fastas, given a multifasta as input for fastANI. It will split the multifasta into single fastas that will then be provided as a matrix. The output will be all pairwise comparisons that pass the minimum of 50 aligned sequences with a default length of 200 bp.

Mapping

  • assembly_mapping: Performs a mapping procedure of FastQ files onto their assembly and performs filtering based on quality criteria of read coverage and genome size.
  • bowtie: Aligns short paired-end sequencing reads to long reference sequences.
  • mapping_patlas: Performs read mapping and generates a JSON input file for pATLAS.
  • remove_host: Performs read mapping with bowtie2 against the target host genome (default hg19) and removes the mapped reads.
  • retrieve_mapped: Retrieves the mapped reads of a previous bowtie2 mapping process.

Taxonomic Profiling

  • kraken: Performs taxonomic identification with kraken on FastQ files (minikrakenDB2017 as default database)
  • kraken2: Performs taxonomic identification with kraken2 on FastQ files (minikraken2_v1_8GB as default database)
  • midas_species: Performs taxonomic identification on FastQ files at the species level with midas (requires database)

Typing

  • chewbbaca: Performs a core-genome/whole-genome Multilocus Sequence Typing analysis on an assembly using ChewBBACA.
  • metamlst: Checks the Sequence Type of metagenomic reads using Multilocus Sequence Typing.
  • mlst: Checks the Sequence Type of an assembly using Multilocus Sequence Typing.
  • patho_typing: In silico pathogenic typing from raw illumina reads.
  • seq_typing: Determines the type of a given sample from a set of reference sequences.
  • sistr: Serovar predictions from whole-genome sequence assemblies by determination of antigen gene and cgMLST gene alleles.
  • momps: Multi-locus sequence typing for Legionella pneumophila from assemblies and reads.

General orientation

Codebase structure

The most important elements of FlowCraft’s directory structure are:

  • generator:
    • components: Contains the Process classes for each component
    • templates: Contains the nextflow jinja template files for each component
    • engine.py: The engine of FlowCraft that builds the pipeline
    • process.py: Contains the abstract Process class that is inherited by all component classes
    • pipeline_parser.py: Functions that parse and check the pipeline string
    • recipe.py: Class responsible for creating recipes
  • templates: A git submodule of the templates repository that contain the template scripts for the components.

Code style

  • Style: the code base of flowcraft should adhere (the best it can) to the PEP8 style guidelines.
  • Docstrings: code should be generally well documented following the numpy docstring style.
  • Quality: there is also an integration with the codacy service to evaluate code quality, which is useful for detecting several coding issues that may appear.

Testing

Tests are performed using pytest and the source files are stored in the flowcraft/tests directory. Tests must be executed in the root directory of the repository.
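
For example (assuming pytest is installed):

git clone https://github.com/assemblerflow/flowcraft.git
cd flowcraft
git submodule update --init --recursive
python3 -m pytest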

Documentation

Documentation source files are stored in the docs directory. The general configuration file is found in docs/conf.py and the entry point to the documentation is docs/index.html.

Process creation guidelines

Basic process creation

The addition of a new process to FlowCraft requires two main steps:

  1. Create process template: Create a jinja2 template in flowcraft.generator.templates with the nextflow code.
  2. Create Process class: Create a Process subclass in one of the category modules of flowcraft.generator.components with information about the process (e.g., expected input/output, secondary inputs, etc.).

Create process template

First, create the nextflow template that will be integrated into the pipeline as a process. This file must be placed in flowcraft.generator.templates and have the .nf extension. In order to allow the template to be dynamically added to a pipeline file, we use the jinja2 template language to substitute key variables in the process, such as input/output channels.

An example created as a my_process.nf file is as follows:

some_channel_{{ pid }} = Channel.value(params.param1{{ param_id }})
other_channel_{{ pid }} = Channel.fromPath(params.param2{{ param_id }})

process myProcess_{{ pid }} {

    {% include "post.txt" ignore missing %}

    publishDir "results/myProcess_{{ pid }}", pattern: "*.tsv"

    input:
    set sample_id, <data> from {{ input_channel }}
    val x from some_channel_{{ pid }}
    file y from other_channel_{{ pid }}
    val direct_from_params from Channel.value(params.param3{{ param_id }})

    // The output is optional
    output:
    set sample_id, <data> into {{ output_channel }}
    {% with task_name="abricate" %}
    {%- include "compiler_channels.txt" ignore missing -%}
    {% endwith %}

    """
    <process code/commands>
    """
}

{{ forks }}

The fields surrounded by curly brackets are jinja placeholders that will be dynamically substituted when building the pipeline. They will ensure that the processes and potential forks correctly link with each other and that channels are unique and correctly linked. This example contains all placeholder variables that are currently supported by FlowCraft.

{{pid}}

Used as a unique process identifier that prevents issues from process and channel duplication in the pipeline. Therefore, it should be appended to each process and channel name as _{{ pid }} (note the underscore):

some_channel_{{ pid }}
process myProcess_{{ pid }}
{{param_id}}

Same as the {{ pid }}, but sets the identifier for nextflow params. It should be appended to each param as {{ param_id }}. This will allow parameters to be specific to each component in the pipeline:

Channel.value(params.param1{{ param_id }})

Note that the parameters used in the template should also be defined in the Process class params attribute (see Parameters).

{% include "post.txt" %}

Inserts beforeScript and afterScript statements to the process that sets environmental variables and a series of dotfiles for the process to log their status, warnings, fails and reports (see Dotfiles for more information). It also includes scripts for sending requests to REST APIs (only when certain pipeline parameters are used).

{{input_channel}}

All processes must include one and only one input channel. In most cases, this channel should be defined with a two element tuple that contains the sample ID and then the actual data file/stream. We suggest that the sample ID variable be named sample_id as a standard. If another variable name is specified and you include the compiler_channels.txt in the process, you’ll need to change the sample ID variable (see Sample ID variable).

{{output_channel}}

Terminal processes may skip the output channel entirely. However, if you want to link the main output of this process with subsequent ones, this placeholder must be used only once. Like in the input channel, this channel should be defined with a two element tuple with the sample ID and the data. The sample ID must match the one specified in the input_channel.

{% include "compiler_channels.txt" %}

This will include the special channels that will compile the status/logging of the processes throughout the pipeline. You must include the whole block (see Status channels):

{% with task_name="abricate" %}
{%- include "compiler_channels.txt" ignore missing -%}
{% endwith %}
{{forks}}

Inserts potential forks of the main output channel. It is mandatory if the output_channel is set.

Complete example

As an example of a complete process, this is the template of spades.nf:

IN_spades_opts_{{ pid }} = Channel.value([params.spadesMinCoverage{{ param_id }},params.spadesMinKmerCoverage{{ param_id }}])
IN_spades_kmers_{{pid}} = Channel.value(params.spadesKmers{{ param_id }})

process spades_{{ pid }} {

    // Send POST request to platform
    {% include "post.txt" ignore missing %}

    tag { fastq_id + " getStats" }
    publishDir 'results/assembly/spades/', pattern: '*_spades.assembly.fasta', mode: 'copy'

    input:
    set fastq_id, file(fastq_pair), max_len from {{ input_channel }}.join(SIDE_max_len_{{ pid }})
    val opts from IN_spades_opts_{{ pid }}
    val kmers from IN_spades_kmers_{{ pid }}

    output:
    set fastq_id, file('*_spades.assembly.fasta') optional true into {{ output_channel }}
    set fastq_id, val("spades"), file(".status"), file(".warning"), file(".fail") into STATUS_{{ pid }}
    file ".report.json"

    script:
    template "spades.py"
}

{{ forks }}

Create Process class

The process class will contain the information that FlowCraft uses to build the pipeline and assess potential conflicts/dependencies between processes. This class should be created in one of the category files in the flowcraft.generator.components module (e.g.: assembly.py). If the new component does not fit in any of the existing categories, create a new one that imports flowcraft.generator.process.Process and add your new class. This class should inherit from the Process base class:

class MyProcess(Process):

    def __init__(self, **kwargs):

        super().__init__(**kwargs)

        self.input_type = "fastq"
        self.output_type = "fasta"

This is the simplest working example of a process class, which basically needs to inherit the parent class attributes (the super part). Then we only need to define the expected input and output types of the process. There are no limitations to the input/output types. However, a pipeline will only build successfully when all processes correctly link the output with the input type.

Depending on the process, other attributes may be required:

  • Parameters: Parameters provided by the user to be used in the process.
  • Secondary inputs: Channels created from parameters provided by the user.
  • Secondary Link start and Link end: Secondary links that connect secondary information between two processes.
  • Dependencies: List of other processes that may be required for the current process.
  • Directives: Default information for RAM/CPU/Container directives and more.

Add to available components

Contrary to previous implementations (version <= 1.3.1), the available components are now retrieved automatically by FlowCraft and there is no need to add the process to any dictionary (the previous process_map). In order for the component to be accessible to flowcraft build, the process template name in snake_case must match the process class name in CamelCase. For instance, if the process template is named my_process.nf, the process class must be MyProcess; FlowCraft will then be able to automatically add it to the list of available components.

Note

Note that the template string does not include the .nf extension.

Process attributes

This section describes the main attributes of the Process class: what they do and how they impact the pipeline generation.

Input/Output types

The input_type and output_type attributes set the expected type of input and output of the process. There are no limitations to the type of input/output that are provided. However, processes will only link when the output of one process matches the input of the subsequent process (unless the ignore_type attribute is set to True). Otherwise, FlowCraft will raise an exception stating that two processes could not be linked.

Note

The input/output types that are currently used are fastq and fasta.

Parameters

The params attribute sets the parameters that can be used by the process. For each parameter, a default value and a description should be provided. The default value will be set in the params.config file in the pipeline directory and the description will be used to generate the custom help message of the pipeline:

self.params = {
    "genomeSize": {
        "default": 2.1,
        "description": "Expected genome size (default: params.genomeSiz)
    },
    "minCoverage": {
        "default": 15,
        "description": "Minimum coverage to proceed (default: params.minCoverage)"
    }
}

These parameters can be simple values that are not fed into any channel, or they can be automatically set to a secondary input channel via Secondary inputs (see below).

They can be specified when running the pipeline like any nextflow parameter (e.g.: --genomeSize 5) and used in the nextflow process as usual (e.g.: params.genomeSize).
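
For illustration, a hypothetical my_tool invocation inside a process could then use the parameter as follows:

script:
"my_tool --genome-size ${params.genomeSize}"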

Note

These pairs are then used to populate the params.config file that is generated in the pipeline directory. Note that the values are replaced literally in the config file. For instance, "genomeSize": 2.1 will appear as genomeSize = 2.1, whereas "adapters": "'None'" will appear as adapters = 'None'. If you want a value to appear as a string, both the double and single quotes are necessary.
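
For the params example above, the generated params.config would therefore contain a section along these lines (a sketch showing only the relevant entries):

params {
    genomeSize = 2.1
    minCoverage = 15
}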

Secondary inputs

Warning

The secondary_inputs attribute has been deprecated since v1.2.1. Instead, specify the secondary channels directly in the nextflow template files.

Any process can receive one or more input channels in addition to the main channel. These are particularly useful when the process needs to receive additional options from the parameters scope of nextflow. These additional inputs can be specified via the secondary_inputs attribute, which should store a list of dictionaries (one dictionary for each input). Each dictionary should contain key:value pairs with the name of the parameter (params) and the definition of the nextflow channel (channel). Consider the example below:

self.secondary_inputs = [
        {
            "params": "genomeSize",
            "channel": "IN_genome_size = Channel.value(params.genomeSize)"
        },
        {
            "params": "minCoverage",
            "channel": "IN_min_coverage = Channel.value(params.minCoverage)"
        }
    ]

This process will receive two secondary inputs that are given by the genomeSize and minCoverage parameters. These should also be specified in the params attribute (see Parameters above).

For each of these parameters, the dictionary also stores how the channel should be defined at the beginning of the pipeline file. Note that this channel definition mentions the parameters (e.g. params.genomeSize). An additional best practice for channel definition is to include one or more sanity checks to ensure that the provided arguments are correct. These checks can be added in the nextflow template file, or literally in the channel string:

self.secondary_inputs = [
    {
        "params": "genomeSize",
        "channel":
            "IN_genome_size = Channel.value(params.genomeSize)."
            "map{it -> it.toString().isNumber() ? it : exit(1, \"The genomeSize parameter must be a number or a float. Provided value: '${params.genomeSize}'\")}"
    }
]

Extra input

The extra_input attribute is mostly a user-specified directive that allows the injection of additional input data from a parameter into the main input channel of the process. When a pipeline is defined as:

process1 process2={'extra_input':'var'}

FlowCraft will expose a new var parameter, set up an extra input channel and mix it with the main input channel of process2. A more detailed explanation follows below.

First, FlowCraft will create a nextflow channel from the parameter name provided via the extra_input directive. The channel string will depend on the input type of the process (this string is fetched from the RAW_MAPPING attribute). For instance, if the input type of process2 is fastq, the new extra channel will be:

IN_var_extraInput = Channel.fromFilePairs(params.var)

Since the same extra input parameter may be used by more than one process, the IN_var_extraInput channel will be automatically forked into the final destination channels:

// When there is a single destination channel
IN_var_extraInput.set{ EXTRA_process2_1_2 }
// When there are multiple destination channels for the same parameter
IN_var_extraInput.into{ EXTRA_process2_1_2; EXTRA_process3_1_3 }

The destination channels are the ones that will be actually mixed with the main input channels:

process process2 {
    input:
    (...) main_channel.mix(EXTRA_process2_1_2)
}

In these cases, the processes that receive the extra input will process the data provided by the preceding channel AND by the parameter. The data provided via the extra input parameter does not have to wait for the main_channel, which means that both can run in parallel if there are enough resources.

Compiler

The compiler attribute allows one or more channels of the process to be fed into a compiler process (See Compiler processes). These are special processes that collect information from one or more processes to execute a given task. Therefore, this parameter can only be used when there is an appropriate compiler process available (the available compiler processes are set in the compilers dictionary). In order to provide one or more channels to a compiler process, simply add a key:value to the attribute, where the key is the id of the compiler process present in the compilers dictionary and the value is the list of channels:

self.compiler["patlas_consensus"] = ["mappingOutputChannel"]

Dependencies

If a process depends on the presence of one or more processes upstream in the pipeline, these can be specified via the dependencies attribute. When building the pipeline, if at least one of the dependencies is absent, FlowCraft will raise an exception informing of the missing dependency.
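
As a minimal sketch, assuming the attribute stores a list of component name strings (the pilon component, for instance, depends on assembly_mapping):

self.dependencies = ["assembly_mapping"]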

Directives

The directives attribute allows for information about cpu/RAM usage and container to be specified for each nextflow process in the template file. For instance, considering the case where a Process has a template with two nextflow processes:

process proc_A_{{ pid }} {
    // stuff
}

process proc_B_{{ pid }} {
    // stuff
}

Then, information about each process can be specified individually in the directives attribute:

class MyProcess(Process):
    (...)
    self.directives = {
        "proc_A": {
            "cpus": 1,
            "memory": "4GB"
        },
        "proc_B": {
            "cpus": 4,
            "container": "my/container",
            "version": "1.0.0"
        }
    }

The information in this attribute will then be used to build the resources.config (containing the information about cpu/RAM) and containers.config (containing the container images) files. Whenever a directive is missing, such as the container and version from proc_A and memory from proc_B, nothing about it will be written into the config files and the process will use the default pipeline values.

Ignore type

The ignore_type attribute controls whether a match between the input type of the current process and the output type of the previous one is enforced. When there are multiple terminal processes that fork from the main channel, there may be no need to enforce the type match, and in that case this attribute can be set to True.
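
For example, a terminal process that should accept any input type could simply set:

self.ignore_type = True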

Process ID

The process ID, set via the pid attribute, is an arbitrary, incremental number that is assigned to each process depending on its position in the pipeline. It is mainly used to ensure that there are no duplicated channels, even when the same process is used multiple times in the same pipeline.

Template

The template attribute is used to fetch the jinja2 template file that corresponds to the current process. The path to the template file is determined as follows:

join(<template directory>, template + ".nf")
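
For instance, assuming the spades component:

self.template = "spades"
# resolves to join(<template directory>, "spades.nf")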

Status channels

The status channels are special channels dedicated to passing information regarding the status, warnings, fails and logging from each process (see Dotfiles for more information). They are used only when the nextflow template file contains the appropriate jinja2 placeholder:

output:
{% with task_name="<nextflow_template_name>" %}
{%- include "compiler_channels.txt" ignore missing -%}
{% endwith %}

By default, every Process class contains a status_channels list attribute that contains the template string:

self.status_channels = ["STATUS_{}".format(template)]

If there is only one nextflow process in the template and the task_name variable in the template matches the template attribute, then it’s all automatically set up.

If the template file contains more than one nextflow process definition, multiple placeholders can be provided in the template:

process A {
    (...)
    output:
    {% with task_name="A" %}
    {%- include "compiler_channels.txt" ignore missing -%}
    {% endwith %}
}

process B {
    (...)
    output:
    {% with task_name="B" %}
    {%- include "compiler_channels.txt" ignore missing -%}
    {% endwith %}
}

In this case, the status_channels attribute would need to be changed to:

self.status_channels = ["A", "B"]

Sample ID variable

In case you change the standard nextflow variable that stores the sample ID in the input of the process (sample_id), you also need to change it for the compiler_channels placeholder:

process A {

input:
set other_id, data from {{ input_channel }}

output:
{% with task_name="B", sample_id="other_id" %}
{%- include "compiler_channels.txt" ignore missing -%}
{% endwith %}

}

Advanced use cases

Compiler processes

Compilers are special processes that collect data from one or more processes and perform a given task with that compiled data. They are automatically included in the pipeline when at least one of the source channels is present. When there are multiple source channels, they are merged according to the specified operator.

Creating a compiler process

The creation of the compiler process is simpler than that of a regular process but follows the same three steps.

  1. Create a nextflow template file in flowcraft.generator.templates:

    process fullConsensus {
    
        input:
        set id, file(infile_list) from {{ compile_channels }}
    
        output:
        <output channels>
    
        script:
        """
        <commands/code/template>
        """
    
    }
    

The only requirement is the inclusion of the compile_channels jinja placeholder in the main input channel.

  2. Create a Compiler class in the flowcraft.generator.process module:

    class PatlasConsensus(Compiler):
    
        def __init__(self, **kwargs):
    
            super().__init__(**kwargs)
    

This class must inherit from Compiler and does not require any more changes.

  3. Map the compiler template file to the class in the compilers attribute:

self.compilers = {
    "patlas_consensus": {
        "cls": pc.PatlasConsensus,
        "template": "patlas_consensus",
        "operator": "join"
    }
}

Each compiler should contain a key:value entry. The key is the compiler id that is then specified in the compiler attribute of the component classes. The value is a json/dict object that specifies the compiler class in the cls key, the template string in the template key, and the operator used to join the channels into the compiler in the operator key.

How a compiler process works

Consider the case where you have a compiler process named compiler_1 and two processes, process_1 and process_2, both of which feed a single channel to compiler_1. This means that the class definitions of these processes include:

class Process_1(Process):
    (...)
    self.compiler["compiler_1"] = ["channel1"]

class Process_2(Process):
    (...)
    self.compiler["compiler_1"] = ["channel2"]

If a pipeline is built with at least one of these processes, the compiler_1 process will be automatically included in the pipeline. If more than one channel is provided to the compiler, they will be merged with the specified operator:

process compiler_1 {

    input:
    set sample_id, file(infile_list) from channel2.join(channel1)

}

This allows the output of multiple separate processes to be processed by a single process in the pipeline, and it automatically adjusts according to the channels provided to the compiler.

Template creation guidelines

Though none of these guidelines are mandatory, their usage is highly recommended for several reasons:

  • Consistency in the outputs of the templates throughout the pipeline, particularly the status and report dotfiles (see Dotfiles section);
  • Debugging purposes;
  • Versioning;
  • Proper documentation of the template scripts.

Preface header

After the script shebang, a header with a brief description of the purpose and expected inputs and outputs should be provided. A complete example of such description can be viewed in flowcraft.templates.integrity_coverage.

Purpose

The Purpose section contains a brief description of the script’s objective. E.g.:

Purpose
-------

This module is intended to parse the results of FastQC for paired end FastQ \
samples.

Expected input

The Expected input section contains a description of the variables that are provided to the main function of the template script. These variables are defined in the input channels of the process in which the template is supposed to be executed. E.g.:

Expected input
--------------

The following variables are expected whether using NextFlow or the
:py:func:`main` executor.

- ``mash_output`` : String with the name of the mash screen output file.
    - e.g.: ``'sortedMashScreenResults_SampleA.txt'``

This means that the process that executes this template will have the input defined as:

input:
file(mash_output) from <channel>

Generated output

The Generated output section contains a description of the output files that the template script is intended to generate. E.g.:

Generated output
----------------

The generated output consists of output files that contain an object, usually a string.

- ``fastqc_health`` : Stores the health check for the current sample. If it
    passes all checks, it contains only the string 'pass'. Otherwise, contains
    the summary categories and their respective results

These can then be passed to the output channel(s) in the nextflow process:

output:
file(fastqc_health) into <channel>

Note

Since templates can be re-used by multiple processes, not all generated outputs need to be passed to output channels. Depending on the job of the nextflow process, it may catch none, some, or all of the output files generated by the template.

Versioning and logging

FlowCraft has a specific logger (get_logger()) and versioning system that can be imported from flowcraft.templates.flowcraft_utils:

# the module that imports the logger and the decorator class for versioning
# of the script itself and other software used in the script
from flowcraft_utils.flowcraft_base import get_logger, MainWrapper

Logger

A logger function is also required to add logs to the script. The logs are written to the .command.log file in the work directory of each process.

First, the logger must be called, for example, after the imports as follows:

logger = get_logger(__file__)

Then, it may be used at will, using the default logging levels. E.g.:

logger.debug("Information tha may be important for debugging")
logger.info("Information related to the normal execution steps")
logger.warning("Events that may require the attention of the developer")
logger.error("Module exited unexpectedly with error:\\n{}".format(
            traceback.format_exc()))

MainWrapper decorator

This MainWrapper class decorator allows the program to fetch information on the script version, build and template name. For example:

# This can also be declared after the imports
__version__ = "1.0.0"
__build__ = "15012018"
__template__ = "process_abricate-nf"

The MainWrapper should decorate the main function of the script. E.g.:

@MainWrapper
def main():
    #some awesome code
    ...

Besides searching for the script’s version, build and template name, this decorator will also search for a specific set of functions whose names start with the substring __get_version. For example:

def __get_version_fastqc():

    try:
        cli = ["fastqc", "--version"]
        p = subprocess.Popen(cli, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, _ = p.communicate()

        version = stdout.strip().split()[1][1:].decode("utf8")

    except Exception as e:
        logger.debug(e)
        version = "undefined"

    # Note that it returns a dictionary that will then be written to the
    # .versions dotfile
    return {
        "program": "FastQC",
        "version": version,
        # some programs may also include a "build" key
    }

These functions are used to fetch the version, name and other relevant information from third-party software. The only requirement is that they return a dictionary with at least two key:value pairs:

  • program: String with the name of the program.
  • version: String with the version of the program.

For more information, refer to the build_versions() method.

Nextflow .command.sh

When these templates are used as a Nextflow template, they are executed as a .command.sh file in the work directory of each process. In this case, we recommend the inclusion of an if statement to parse the arguments sent from nextflow to the python template. For example, imagine we have a path to a file name that should be passed as an argument from nextflow to the template:

# code check for nextflow execution
if __file__.endswith(".command.sh"):
    FILE_NAME = '$Nextflow_file_name'
    # logger output can also be included here, for example:
    logger.debug("Running {} with parameters:".format(
        os.path.basename(__file__)))
    logger.debug("FILE_NAME: {}".format(FILE_NAME))

Then, we could use this variable as the argument of a function, such as:

def main(FILE_NAME):
    #some awesome code
    ...

This way, we can use this function with nextflow arguments or without them, as is the case when the templates are used as standalone modules.
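
A common way to close such a template (a sketch; the exact entry point is an assumption based on the snippets above) is a standard guard that calls main with the variables set in the nextflow block:

if __name__ == "__main__":
    main(FILE_NAME)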

Use numpy docstrings

FlowCraft uses numpy docstrings to document code. See the numpydoc style guide for reference.

Recipe creation guidelines

Recipes are pre-made pipeline strings that may be associated with specific parameters and directives and are used to rapidly build a certain type of pipeline.

Instead of building a pipeline like:

-t "integrity_coverage fastqc_trimmomatic fastqc spades pilon"

The user can simply specify a recipe with that pipeline:

-r assembly

Recipe creation

The creation of new recipes is a very simple and straightforward process. You need to create a new file in the flowcraft/generator/recipes folder with any name and create a basic class with three attributes:

try:
    from generator.recipe import Recipe
except ImportError:
    from flowcraft.generator.recipe import Recipe


class Innuca(Recipe):

    def __init__(self):
        super().__init__()

        # Recipe name
        self.name = "innuca"

        # Recipe pipeline
        self.pipeline_str = <pipeline string>

        # Recipe parameters and directives
        self.directives = { <directives> }

And that’s it! There is now a new recipe available under the innuca name, and we can build this pipeline using the option -r innuca.

Name

This is the name of the recipe, which is used to make a match with the recipe name provided by the user via the -r option.

Pipeline_str

The pipeline string as if provided via the -t option.
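
For the assembly example shown earlier, this could be:

self.pipeline_str = "integrity_coverage fastqc_trimmomatic fastqc spades pilon"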

Directives

A dictionary containing the parameters and directives for each process in the pipeline string. Setting this attribute is optional and components that are not specified here will assume their default values. In general, each element in this dictionary should have the following format:

self.directives = {
    "component_name": {
        "params": {
            "paramA": "value"
        },
        "directives": {
            "directiveA": "value"
        }
    }
}

This will set the provided parameters and directives for the component, but it is also possible to provide only one of the two.

A more concrete example of a real component and directives follows:

self.pipeline_str = "integrity_coverage fastqc"

# Set parameters and directives only for integrity_coverage
# and leave fastqc with the defaults
self.directives = {
    "integrity_coverage": {
        "params": {
            "minCoverage": 20
        },
        "directives": {
            "memory": "1GB"
        }
    }
}

Duplicate components

In some cases, the same component may be present multiple times in the pipeline string of a recipe. In these cases, directives can be assigned to each individual component by adding a #<id> suffix to the component:

self.pipeline_str = "integrity_coverage ( trimmomatic spades#1 | spades#2)"

self.directives = {
    "spades#1": {
        "directives": {
            "memory": "10GB"
        }
    },
    "spades#2": {
        "directives": {
            "version": "3.7.0"
        }
    }
}

Docker containers guidelines

All FlowCraft components require a docker container in order to be executed. Thus, if a new component is added, a docker image should be added as well and uploaded to docker hub (https://hub.docker.com/) in order to be available to pull on other machines. Although this can be done in any personal repository, we recommend that these docker images are added to the already existing FlowCraft github repository (https://github.com/assemblerflow/docker-imgs), called here Official, so that docker builds can be automated with github integration. Also, the centralization of all images allows other contributors to easily access and edit these containers, instead of forking them from one place to another every time a container needs to be changed/updated.

Official FlowCraft Docker images

Writing docker images

Official FlowCraft Docker images are available in this github repository: https://github.com/assemblerflow/docker-imgs. If you want to add your image to this repository, please fork it and make a Pull Request (PR) with the requested new image, or create an issue asking to be added to the organization as a contributor.

Building docker images

Then, after the image has been added to the FlowCraft docker-imgs github repository (https://github.com/assemblerflow/docker-imgs), it can be built through the FlowCraft docker hub (https://hub.docker.com/u/flowcraft/dashboard/).

Tag naming

Each time a docker image is built using the automated build of docker hub, it should follow this nomenclature: version-patch. This is used to avoid overriding previous builds of the same image, allowing, for instance, users to run different versions of the same software through the same docker image with different tags.

  • Version: A string with three digits separated by dots, like this: 1.1.1. Versions should change every time new software is added to the container.
  • Patch: A number that follows a - after the version. Patches should change every time a modification does not affect the software inside the container. For example, updates to database-related files required by some of the software inside the container.
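
For example, under this scheme a hypothetical image could first be tagged 1.1.1-1; a later update that only refreshes its bundled database files would be tagged 1.1.1-2, whereas adding new software to the container would bump the version part itself.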

Unofficial FlowCraft Docker images

Although we strongly recommend that all images are stored in the FlowCraft docker-imgs github repo (https://github.com/assemblerflow/docker-imgs), it is not mandatory to do so. Images can be built from another github repo and use another docker hub repository. However, make sure that you define the container correctly in the directives of the process, as explained in Directives.

Dotfiles

Several dotfiles (files prefixed by a single ., as in .status) are created at the beginning of every nextflow process that has the following placeholder (see Create process template):

process myProcess {
    {% include "post.txt" ignore missing %}
    (...)
}

The actual script that creates the dotfiles is found in flowcraft/bin, is called set_dotfiles.sh and executes the following command:

touch .status .warning .fail .report.json .versions

Status

The .status file simply stores a string with the run status of the process. The supported statuses are:

  • pass: The process finished successfully.
  • fail: The process ran without unexpected issues but failed due to some quality control check.
  • error: The process exited with an unexpected error.

Warning

The .warning file stores any warnings that may occur during the execution of the process. There is no particular format for the warning messages, other than that each individual warning should be on a separate line.

Fail

The .fail file stores any fail messages that may occur during the execution of the process. When this occurs, the .status file must contain the fail string as well. As in the warning dotfile, there is no particular format for the fail messages.

Report JSON

Important

The general specification of the report JSON changed in version 1.2.2. See the issue tracker for details.

The .report.json file stores any information from a given process that is deemed worthy of being reported and displayed at the end of the pipeline. Any information can be stored in this file, as long as it is in JSON format, but there are a couple of recommendations that must be followed for the reports to be processed by a reporting web app (currently hosted at flowcraft-webapp). However, if data processing will be performed with custom scripts, feel free to specify your own format.

Information for tables

Information meant to be displayed in tables should be in the following format:

json_dic = {
    "tableRow": [{
        "sample": "A",
        "data": [{
            "header": "Raw BP",
            "value": 123,
            "table": "qc"
        }, {
            "header": "Coverage",
            "value": 32,
            "table": "qc"
        }]
    }, {
        "sample": "B",
        "data": [{
            "header": "Coverage",
            "value": 35,
            "table": "qc"
        }]
    }]
}

This provides table information for multiple samples in the same process. In this case, data for two samples is provided. For each sample, values for one or more headers can be provided. For instance, this report provides information about the Raw BP and Coverage for sample A and this information should go to the qc table. If any other information is relevant to build the table, feel free to add more elements to the JSON.

Information for plots

Information meant to be displayed in plots should be in the following format:

json_dic = {
    "plotData": [{
        "sample": "strainA",
        "data": {
            "sparkline": 23123,
            "otherplot": [1,2,3]
         }
    }],
}

As in the table JSON, plotData should be an array with an entry for each sample. The data for each sample should be another JSON where the keys are the plot signatures, so that we know to which plot the data belongs. The corresponding values are whatever data object you need.

Other information

Other than tables and plots, which have a somewhat predefined format, there is no particular format for other information. It will simply store the data of interest to report, and it will be the job of a downstream report app to process that data into an actual visual report.

Versions

The .versions dotfile should contain a list of JSON objects with the version information of the programs used in any given process. There are only two required key:value pairs:

  • program: String with the name of the software/script/template
  • version: String with the version of said software.

As an example:

version = {
    "program": "abricate",
    "version": "0.3.7"
}

Key:value pairs with other metadata can be included at will for downstream processing.

Pipeline reporting

This section describes how the reports of a FlowCraft pipeline are generated and collected at the end of a run. These reports can then be sent to the FlowCraft web application where the results are visualized.

Important

Note that if the nextflow process reports add new types of data, one or more React components need to be added to the web application for them to be rendered.

Data collection

The data for the pipeline reports is collected from three dotfiles in each nextflow process (they should be present in each work sub-directory):

  • .report.json: Contains report data (See Report JSON for more information).
  • .versions: Contains information about the versions of the software used (See Versions for more information).
  • .command.trace: Contains resource usage information.

The .command.trace file is generated by nextflow when the trace scope is active. The .report.json and .versions files are specific to FlowCraft pipelines.
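
The trace scope is enabled in the nextflow configuration; a minimal sketch of such a scope (the exact fields FlowCraft selects are an assumption) would be:

trace {
    enabled = true
    fields = 'task_id,name,status,realtime,%cpu,rss'
}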

Generation of dotfiles

Both .report.json and .versions empty dotfiles are automatically generated by the {% include "post.txt" ignore missing %} placeholder, specified in the Create process template section. Using this placeholder in your processes is all that is needed.

Collection of dotfiles

The .report.json, .versions and .command.trace files are automatically collected and sent to dedicated report channels in the pipeline by the {%- include "compiler_channels.txt" ignore missing -%} placeholder, specified in the process creation section. Placing this placeholder in your processes will generate the following line in the output channel specification:

set {{ sample_id|default("sample_id") }}, val("{{ task_name }}_{{ pid }}"), val("{{ pid }}"), file(".report.json"), file(".versions"), file(".command.trace") into REPORT_{{task_name}}_{{ pid }}

This line collects several pieces of metadata associated with the process, along with the three dotfiles.

Compilation of dotfiles

As mentioned in the previous section, the dotfiles and other relevant metadata are sent through special report channels to a FlowCraft component that is responsible for compiling all the information and generating a single report file at the end of each pipeline run.

This component is specified in flowcraft.generator.templates.report_compiler.nf and it consists of two nextflow processes:

  • First, the report process receives the data from each executed process that sends report data and runs the flowcraft/bin/prepare_reports.py script on that data. This script simply merges the metadata and dotfile information into a single JSON file. This file contains the following keys:

    • reportJson: The data in .report.json file.
    • versions: The data in .versions file.
    • trace: The data in .command.trace file.
    • processId: The process ID
    • pipelineId: The pipeline ID that defaults to one, unless specified in the parameters.
    • projectid: The project ID that defaults to one, unless specified in the parameters.
    • userId: The user ID that defaults to one, unless specified in the parameters.
    • username: The user name that defaults to user, unless specified in the parameters.
    • processName: The name of the flowcraft component.
    • workdir: The work directory where the process was executed.
  • Second, all JSON files created in the process above are merged and a single report JSON file is created. This file has the following structure:

    reportJSON = {
        "data": {
            "results": [<array of report JSONs>]
        }
    }
    

Reports

Report JSON specification

The report JSON is quite flexible in the information it can contain. Here are some guidelines to promote consistency in the reports generated by each component. In general, the reports file is an array of JSON objects that contain the relevant information for each executed process in the pipeline:

reportFile = [{<processA/tagA reports>}, {<processB/tagB reports>}, ... ]

Nextflow metadata

The nextflow metadata is automatically added to the reportFile as a single JSON entry with the nfMetadata key, which contains the following information:

"nfMetadata": {
    "scriptId": "${workflow.scriptId}",
    "scriptName": "${workflow.scriptId}",
    "profile": "${workflow.profile}",
    "container": "${workflow.container}",
    "containerEngine": "${workflow.containerEngine}",
    "commandLine": "${workflow.commandLine}",
    "runName": "${workflow.runName}",
    "sessionId": "${workflow.sessionId}",
    "projectDir": "${workflow.projectDir}",
    "launchDir": "${workflow.launchDir}",
    "start_time": "${workflow.start}"
}

Note

Unlike the remaining JSON entries in the report file, which are generated for each process execution, the nfMetadata entry is generated only once per project execution.

Root

The reports contained in the reports.json file for each process execution are added to the root object:

{
    "pipelineId": 1,
    "processId": pid,
    "processName": task_name,
    "projectid": RUN_NAME,
    "reportJson": reports,
    "runName": RUN_NAME,
    "scriptId": SCRIPT_ID,
    "versions": versions,
    "trace": trace,
    "userId": 1,
    "username": "user",
    "workdir": dirname(abspath(report_json))
}

The other key:values are added automatically when the reports are compiled for each process execution.

Versions

Inside the root, the signature key for software version information is versions:

"versions": [{
    "program": "progA",
    "version": "1.0.0",
    "build": "1"
}, {
    "program": "progB",
    "version": "2.1"
}]

Only the program and version keys are mandatory.

ReportJson

Table data

Inside reportJson, the signature key for table data is tableRow:

 "reportJson": {
     "tableRow": [{
         "sample": "strainA",
         "data": [{
             "header": "Raw BP",
             "value": 123,
             "table": "qc",
         }, {
             "header": "Coverage",
             "value": 32,
             "table": "qc"
         }],
         "sample": "strainB",
         "data": [{
             "header": "Raw BP",
             "value": 321,
             "table": "qc",
         }, {
             "header": "Coverage",
             "value": 22,
             "table": "qc"
         }]
     }]
}

tableRow should contain an array of JSON objects, one for each sample, each with two key:value pairs:

  • sample: Sample name
  • data: Table data (see below).

data should be an array of JSON objects with at least three key:value pairs:

  • header: Column header
  • value: The data value
  • table: Informs to which table this data should go.

Note

Available table keys: typing, qc, assembly, abricate, chewbbaca.

Plot data

Inside reportJson, the signature key for plot data is plotData:

"reportJson": {
    "plotData": [{
        "sample": "strainA",
        "data": {
            "sparkline": 23123,
            "otherplot": [1,2,3]
         }
    }],
}

plotData should contain an array of JSON objects, one for each sample, each with two key:value pairs:

  • sample: Sample name
  • data: Plot data (see below).

data should contain a JSON object with the plot signatures as keys, and the relevant plot data as value. This data can be any object (integer, float, array, JSON, etc). It will be up to the components in the flowcraft web application to parse this data and generate the appropriate chart.

Warnings and fails

Inside reportJson, the signature key for warnings is warnings and for failures is fail:

"reportJson": {
    "warnings": [{
        "sample": "strainA",
        "table": "qc",
        "value": ["message 1", "message 2"]
    }],
    "fail": [{
        "sample": "strainA",
        "table": "assembly",
        "value": ["message 1"]
    }]
}

warnings/fail should contain an array of JSON objects, one for each sample, with the following key:value pairs:

  • sample: Sample name
  • value: An array with one or more string messages.
  • table [optional]: If a table signature is provided, the warning/fail messages will appear in that table. Otherwise, they will appear as a general warning/error that is associated with the sample but not with any particular table.

flowcraft package

Subpackages

flowcraft.generator package

Subpackages
flowcraft.generator.components package
Submodules
flowcraft.generator.components.annotation module
class flowcraft.generator.components.annotation.Abricate(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Abricate mapping process template interface

This process is set with:

  • input_type: assembly
  • output_type: None
  • ptype: post_assembly

It contains one secondary channel link end:

  • MAIN_assembly (alias: MAIN_assembly): Receives the last assembly.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.annotation.CardRgi(**kwargs)[source]

Bases: flowcraft.generator.process.Process

card’s rgi process template interface

This process is set with:

  • input_type: fasta
  • output_type: txt
  • ptype: resistance gene detection (assembly)
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.annotation.Prokka(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Prokka mapping process template interface

This process is set with:

  • input_type: assembly
  • output_type: None
  • ptype: post_assembly

It contains one secondary channel link end:

  • MAIN_assembly (alias: MAIN_assembly): Receives the last assembly.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.annotation.Diamond(**kwargs)[source]

Bases: flowcraft.generator.process.Process

diamond process for protein database queries

This process is set with:

  • input_type: fasta
  • output_type: None
  • ptype: post_assembly
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.components.assembly module
class flowcraft.generator.components.assembly.Bcalm(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Bcalm process template interface

This process is set with:

  • input_type: fastq
  • output_type: assembly
  • ptype: assembly
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly.Spades(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Spades process template interface

This process is set with:

  • input_type: fastq
  • output_type: assembly
  • ptype: assembly

It contains one secondary channel link end:

  • SIDE_max_len (alias: SIDE_max_len): Receives max read length
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly.Skesa(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Skesa process template interface

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly.ViralAssembly(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Process to assemble viral genomes, based on SPAdes and megahit

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly.Abyss(**kwargs)[source]

Bases: flowcraft.generator.process.Process

ABySS process template interface

This process is set with:

  • input_type: fastq
  • output_type: assembly
  • ptype: assembly
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly.Unicycler(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Unicycler process template interface

This process is set with:

  • input_type: fastq
  • output_type: assembly
  • ptype: assembly
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.components.assembly_processing module
class flowcraft.generator.components.assembly_processing.ProcessSkesa(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly_processing.ProcessSpades(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Process spades process template interface

This process is set with:

  • input_type: assembly
  • output_type: assembly
  • ptype: post_assembly
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly_processing.AssemblyMapping(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Assembly mapping process template interface

This process is set with:

  • input_type: assembly
  • output_type: assembly
  • ptype: post_assembly

It contains one secondary channel link end:

  • MAIN_fq (alias: _MAIN_assembly): Receives the FastQ files from the last process with fastq output type.

It contains two status channels:

  • STATUS_am: Status for the assembly_mapping process
  • STATUS_amp: Status for the process_assembly_mapping process
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly_processing.Pilon(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Pilon mapping process template interface

This process is set with:

  • input_type: assembly
  • output_type: assembly
  • ptype: post_assembly

It contains one dependency process:

  • assembly_mapping: Requires the BAM file generated by the assembly mapping process.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly_processing.Bandage(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Visualize the assembly using Bandage

This process is set with:

  • input_type: assembly
  • output_type: none
  • ptype: post_assembly
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.assembly_processing.Quast(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Assess assembly quality using QUAST

This process is set with:

  • input_type: assembly
  • output_type: tsv
  • ptype: post_assembly
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.components.distance_estimation module
class flowcraft.generator.components.distance_estimation.MashDist(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.distance_estimation.MashScreen(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.distance_estimation.MashSketchFasta(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.distance_estimation.MashSketchFastq(**kwargs)[source]

Bases: flowcraft.generator.components.distance_estimation.MashSketchFasta

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.distance_estimation.FastAni(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.components.downloads module
class flowcraft.generator.components.downloads.ReadsDownload(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Process template interface for downloading reads from SRA and NCBI

This process is set with:

  • input_type: accessions
  • output_type: fastq
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.downloads.FasterqDump(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Process template for fasterq-dump

This process is set with:

  • input_type: accessions
  • output_type: fastq
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.components.metagenomics module
class flowcraft.generator.components.metagenomics.Kraken(**kwargs)[source]

Bases: flowcraft.generator.process.Process

kraken process template interface

This process is set with:

  • input_type: fastq
  • output_type: txt
  • ptype: taxonomic classification
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.metagenomics.Kraken2(**kwargs)[source]

Bases: flowcraft.generator.process.Process

kraken2 process template interface

This process is set with:

  • input_type: fastq
  • output_type: txt
  • ptype: taxonomic classification
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.metagenomics.Maxbin2(**kwargs)[source]

Bases: flowcraft.generator.process.Process

MaxBin2, a metagenomics binning software

This process is set with:

  • input_type: assembly
  • output_type: assembly
  • ptype: post_assembly

It contains one secondary channel link end:

  • MAIN_fq (alias: _MAIN_assembly): Receives the FastQ files from the last process with fastq output type.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.metagenomics.Megahit(**kwargs)[source]

Bases: flowcraft.generator.process.Process

megahit process template interface

This process is set with:

  • input_type: fastq
  • output_type: assembly
  • ptype: assembly

It contains one secondary channel link end:

  • SIDE_max_len (alias: SIDE_max_len): Receives max read length
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.metagenomics.Metaspades(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Metaspades process template interface

This process is set with:

  • input_type: fastq
  • output_type: assembly
  • ptype: assembly

It contains one secondary channel link end:

  • SIDE_max_len (alias: SIDE_max_len): Receives max read length
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.metagenomics.Midas_species(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Midas species process template interface

This process is set with:

  • input_type: fastq
  • output_type: txt
  • ptype: taxonomic classification (species)
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.metagenomics.RemoveHost(**kwargs)[source]

Bases: flowcraft.generator.process.Process

bowtie2 to remove host reads process template interface

This process is set with:

  • input_type: fastq
  • output_type: fastq
  • ptype: removal of host reads
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.metagenomics.Metaprob(**kwargs)[source]

Bases: flowcraft.generator.process.Process

MetaProb process template interface for binning metagenomic reads

This process is set with:

  • input_type: fastq
  • output_type: csv
  • ptype: binning of reads
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.metagenomics.SplitAssembly(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Component to filter metagenomic assemblies by contig size. If a contig is larger than $params.size, it is separated from the original assembly and continues through the processes downstream in the pipeline.

This process is set with:

  • input_type: fasta
  • output_type: fasta
  • ptype: assembly filter
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.components.mlst module
class flowcraft.generator.components.mlst.Mlst(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Mlst mapping process template interface

This process is set with:

  • input_type: assembly
  • output_type: None
  • ptype: post_assembly

It contains one secondary channel link end:

  • MAIN_assembly (alias: MAIN_assembly): Receives the last assembly.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.mlst.Chewbbaca(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Chewbbaca process template interface

This process is set with:

  • input_type: assembly
  • output_type: None
  • ptype: post_assembly

It contains one secondary channel link end:

  • MAIN_assembly (alias: MAIN_assembly): Receives the last assembly.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.mlst.Metamlst(**kwargs)[source]

Bases: flowcraft.generator.process.Process

MetaMlst mapping process template interface

This process is set with:

  • input_type: reads
  • output_type: None
  • ptype: pre_assembly
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.components.patlas_mapping module
class flowcraft.generator.components.patlas_mapping.MappingPatlas(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.components.reads_quality_control module
class flowcraft.generator.components.reads_quality_control.IntegrityCoverage(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Process template interface for first integrity_coverage process

This process is set with:

  • input_type: fastq
  • output_type: fastq
  • ptype: pre_assembly

It contains two secondary channel link starts:

  • SIDE_phred: Phred score of the FastQ files
  • SIDE_max_len: Maximum read length
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.reads_quality_control.CheckCoverage(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Process template interface for additional integrity_coverage process

This process is set with:

  • input_type: fastq
  • output_type: fastq
  • ptype: pre_assembly

It contains one secondary channel link start:

  • SIDE_max_len: Maximum read length
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.reads_quality_control.TrueCoverage(**kwargs)[source]

Bases: flowcraft.generator.process.Process

TrueCoverage process template interface

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.reads_quality_control.Fastqc(**kwargs)[source]

Bases: flowcraft.generator.process.Process

FastQC process template interface

This process is set with:

  • input_type: fastq
  • output_type: fastq
  • ptype: pre_assembly

It contains two status channels:

  • STATUS_fastqc: Status for the fastqc process
  • STATUS_report: Status for the fastqc_report process
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
status_channels = None

list: Setting status channels for FastQC execution and FastQC report

class flowcraft.generator.components.reads_quality_control.Trimmomatic(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Trimmomatic process template interface

This process is set with:

  • input_type: fastq
  • output_type: fastq
  • ptype: pre_assembly

It contains one secondary channel link end:

  • SIDE_phred (alias: SIDE_phred): Receives FastQ phred score
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.reads_quality_control.FastqcTrimmomatic(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Fastqc + Trimmomatic process template interface

This process executes FastQC only to inform the trim range for trimmomatic, not for QC checks.

This process is set with:

  • input_type: fastq
  • output_type: fastq
  • ptype: pre_assembly

It contains one secondary channel link end:

  • SIDE_phred (alias: SIDE_phred): Receives FastQ phred score

It contains three status channels:

  • STATUS_fastqc: Status for the fastqc process
  • STATUS_report: Status for the fastqc_report process
  • STATUS_trim: Status for the trimmomatic process
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.reads_quality_control.FilterPoly(**kwargs)[source]

Bases: flowcraft.generator.process.Process

PrinSeq process to filter non-informative sequences from reads

This process is set with:

  • input_type: fastq
  • output_type: fastq
  • ptype: pre_assembly
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.reads_quality_control.DownsampleFastq(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Downsamples FastQ file based on depth using seqtk

This process is set with:

  • input_type: fastq
  • output_type: fastq
Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.components.typing module
class flowcraft.generator.components.typing.SeqTyping(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.typing.PathoTyping(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.typing.Sistr(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.typing.Momps(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.components.typing.DengueTyping(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
Module contents
Submodules
flowcraft.generator.engine module
class flowcraft.generator.engine.NextflowGenerator(process_connections, nextflow_file, process_map, pipeline_name='flowcraft', ignore_dependencies=False, auto_dependency=True, merge_params=True, export_params=False)[source]

Bases: object

Methods

build() Main pipeline builder
dag_to_file(dict_viz[, output_file]) Writes dag to output file
export_directives() Export pipeline directives as a JSON to stdout
export_params() Export pipeline params as a JSON to stdout
fetch_docker_tags() Export all dockerhub tags associated with each component given by the -t flag.
render_pipeline() Write pipeline attributes to json
write_configs(project_root) Wrapper method that writes all configuration files to the pipeline directory
process_map = None

dict: Maps the nextflow template name to the corresponding Process class of the component.

processes = None

list: Stores the process interfaces in the specified order

lanes = None

int: Stores the number of lanes in the pipeline

export_parameters = None

bool: Determines whether the build mode is only for the export of parameters in JSON format. Setting it to True will disable some checks, such as component dependency requirements

nf_file = None

str: Path to file where the pipeline will be generated

pipeline_name = None

str: Name of the pipeline, for customization and help purposes.

template = None

str: String that will harbour the pipeline code

secondary_channels = None

dict: Stores secondary channel links

main_raw_inputs = None

list: Stores the main raw inputs from the user parameters into the first process(es).

merge_params = None

bool: Determines whether the params of the pipeline should be merged (i.e., the same param name in multiple components is merged into one) or if they should be unique and specific to each component.

extra_inputs = None
status_channels = None

list: Stores the status channels from each process

skip_class = None

list: Stores the Process classes that should be skipped when iterating over the processes list.

resources = None

str: Stores the resource directives string for each nextflow process. See NextflowGenerator._get_resources_string().

containers = None

str: Stores the container directives string for each nextflow process. See NextflowGenerator._get_container_string().

params = None

str: Stores the params directives string for the nextflow pipeline. See NextflowGenerator._get_params_string()

manifest = None

str: Stores the manifest directives string for the nextflow pipeline. See NextflowGenerator._get_manifest_string()

user_config = None

str: Stores the user configuration file placeholder. This is an empty configuration file that is added only the first time to a project directory. If the file already exists, it will not be overwritten.

compilers = None

dict: Stores the information about each available compiler process in flowcraft. The key of each entry is the name/signature of the compiler process. The value is a json/dict object that contains two key:value pairs:

  • cls: The reference to the compiler class object.
  • template: The nextflow template file of the process.
dag_to_file(dict_viz, output_file='.treeDag.json')[source]

Writes dag to output file

Parameters:
dict_viz: dict

Tree-like dictionary that is used to export the tree data of processes to an html file and, here, to the dotfile .treeDag.json

render_pipeline()[source]

Write pipeline attributes to json

This function writes the pipeline and its attributes to a json file, which is intended to be read by resources/pipeline_graph.html to render a graphical output showing the DAG.

write_configs(project_root)[source]

Wrapper method that writes all configuration files to the pipeline directory

export_params()[source]

Export pipeline params as a JSON to stdout

This run mode iterates over the pipeline processes and exports the params dictionary of each component as a JSON to stdout.

export_directives()[source]

Export pipeline directives as a JSON to stdout

fetch_docker_tags()[source]

Export all dockerhub tags associated with each component given by the -t flag.

build()[source]

Main pipeline builder

This method is responsible for building the NextflowGenerator.template attribute that will contain the nextflow code of the pipeline.

First it builds the header, then sets the main channels, the secondary inputs, secondary channels and finally the status channels. When the pipeline is built, it writes the code to a nextflow file.
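
A minimal sketch of how the generator might be driven programmatically; FlowCraft normally performs this wiring through its command line interface, and the process_map value below is a placeholder, so treat this as illustrative only:

from flowcraft.generator.engine import NextflowGenerator
from flowcraft.generator.pipeline_parser import insanity_checks, parse_pipeline

pipeline_str = "trimmomatic fastqc spades"
insanity_checks(pipeline_str)               # raises SanityError on malformed strings
connections = parse_pipeline(pipeline_str)  # list of process link dictionaries

process_map = {}  # placeholder: maps component names to Process classes (assumed to be built elsewhere)

generator = NextflowGenerator(process_connections=connections,
                              nextflow_file="pipeline.nf",
                              process_map=process_map)
generator.build()  # writes the nextflow code to pipeline.nf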

flowcraft.generator.error_handling module
exception flowcraft.generator.error_handling.ProcessError(value)[source]

Bases: Exception

exception flowcraft.generator.error_handling.SanityError(value)[source]

Bases: Exception

Class to raise a custom error for sanity checks

exception flowcraft.generator.error_handling.InspectionError(value)[source]

Bases: Exception

exception flowcraft.generator.error_handling.ReportError(value)[source]

Bases: Exception

exception flowcraft.generator.error_handling.RecipeError(value)[source]

Bases: Exception

exception flowcraft.generator.error_handling.LogError(value)[source]

Bases: Exception

flowcraft.generator.footer_skeleton module
flowcraft.generator.header_skeleton module
flowcraft.generator.inspect module
flowcraft.generator.pipeline_parser module
flowcraft.generator.pipeline_parser.guess_process(query_str, process_map)[source]

Function to guess processes based on strings that are not available in process_map. If the string has typos but is at least 50% similar to any process available in flowcraft, it will print info to the terminal suggesting the most similar available processes.

Parameters:
query_str: str

The string of the process with potential typos

process_map:

The dictionary that contains all the available processes
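
As a hedged illustration of this kind of 50% similarity matching, difflib from the standard library could be used; this is a sketch, not necessarily FlowCraft's actual implementation:

import difflib

def suggest_processes(query_str, process_map):
    # Collect process names that are at least 50% similar to the query,
    # mirroring the behaviour described above.
    matches = difflib.get_close_matches(query_str, process_map, n=5, cutoff=0.5)
    if matches:
        print("Process '{}' not found. Did you mean: {}?".format(
            query_str, ", ".join(matches)))

suggest_processes("trimomatic", {"trimmomatic": None, "fastqc": None})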

flowcraft.generator.pipeline_parser.remove_inner_forks(text)[source]

Recursively removes nested brackets

This function is used to remove nested brackets from fork strings using regular expressions

Parameters:
text: str

The string that contains brackets with inner forks to be removed

Returns:
text: str

The string with only the processes that are not in inner forks, i.e., the processes that belong to the given fork.
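
A minimal sketch of innermost-first bracket removal with a regular expression loop, under the assumption that stripping every innermost '( ... )' group is acceptable; the real function may differ in details:

import re

def remove_inner_forks_sketch(text):
    # Repeatedly strip the innermost "( ... )" groups, i.e. those that
    # contain no further parentheses, until no bracketed fork remains.
    while re.search(r"\([^()]*\)", text):
        text = re.sub(r"\([^()]*\)", "", text)
    return text

# The lanes of the current fork remain; the inner fork is removed.
print(remove_inner_forks_sketch("procB | procC (procD | procE)"))  # "procB | procC "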

flowcraft.generator.pipeline_parser.empty_tasks(p_string)[source]

Function to check if the pipeline string is empty or contains an empty string

Parameters:
p_string: str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

flowcraft.generator.pipeline_parser.brackets_but_no_lanes(p_string)[source]

Function to check if a LANE_TOKEN is provided but no fork is initiated.

Parameters:
p_string: str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

flowcraft.generator.pipeline_parser.brackets_insanity_check(p_string)[source]

This function performs a check for a mismatched number of ‘(’ and ‘)’ characters, which indicates that some forks are poorly constructed.

Parameters:
p_string: str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

flowcraft.generator.pipeline_parser.lane_char_insanity_check(p_string)[source]

This function performs a sanity check for multiple ‘|’ characters between two processes.

Parameters:
p_string: str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

flowcraft.generator.pipeline_parser.final_char_insanity_check(p_string)[source]

This function checks if the lane token is the last element of the pipeline string.

Parameters:
p_string: str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

flowcraft.generator.pipeline_parser.fork_procs_insanity_check(p_string)[source]

This function checks if the pipeline string contains a process between the fork start or end token and the separator (lane) token. It checks for the absence of processes in one of the branches of a fork (‘|)’ and ‘(|’) and for the existence of a process before starting a fork in an inner fork (‘|(’).

Parameters:
p_string: str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

flowcraft.generator.pipeline_parser.start_proc_insanity_check(p_string)[source]

This function checks if there is a starting process after the beginning of each fork. It checks for duplicated start tokens [‘((‘].

Parameters:
p_string: str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

flowcraft.generator.pipeline_parser.late_proc_insanity_check(p_string)[source]

This function checks if there are processes after the close token. It searches for everything that isn’t “|” or “)” after a “)” token.

Parameters:
p_string: str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

flowcraft.generator.pipeline_parser.inner_fork_insanity_checks(pipeline_string)[source]

This function performs two sanity checks on the pipeline string. The first check assures that each fork contains a lane token ‘|’, while the second looks for duplicated processes within the same fork.

Parameters:
pipeline_string: str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

flowcraft.generator.pipeline_parser.insanity_checks(pipeline_str)[source]

Wrapper that performs all sanity checks on the pipeline string

Parameters:
pipeline_str : str

String with the pipeline definition
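
To make the flavour of these checks concrete, here is a hedged sketch of two of them raising SanityError; the real functions live in this module and may differ in wording and detail:

from flowcraft.generator.error_handling import SanityError

def empty_tasks_sketch(p_string):
    # Reject empty pipeline strings (cf. empty_tasks above).
    if not p_string.strip():
        raise SanityError("Pipeline string is empty")

def brackets_insanity_check_sketch(p_string):
    # Unbalanced '(' and ')' counts indicate a poorly constructed fork
    # (cf. brackets_insanity_check above).
    if p_string.count("(") != p_string.count(")"):
        raise SanityError("Unbalanced parentheses in the pipeline string")

for check in (empty_tasks_sketch, brackets_insanity_check_sketch):
    check("processA processB (processC | processD)")  # passes both checks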

flowcraft.generator.pipeline_parser.parse_pipeline(pipeline_str)[source]

Parses a pipeline string into a list of dictionaries with the connections between processes

Parameters:
pipeline_str : str
String with the definition of the pipeline, e.g.::

‘processA processB processC(ProcessD | ProcessE)’

Returns:
pipeline_links : list
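
For example, a fork string parses into per-lane connection dictionaries. The exact shape of each dictionary shown in the comment below is an assumption for illustration:

from flowcraft.generator.pipeline_parser import parse_pipeline

links = parse_pipeline("processA (processB | processC)")
# Each element links a source process to a sink process on a given lane,
# along the lines of (assumed shape):
#   {"input": {"process": "processA", "lane": 1},
#    "output": {"process": "processB", "lane": 2}}
for link in links:
    print(link)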
flowcraft.generator.pipeline_parser.get_source_lane(fork_process, pipeline_list)[source]

Returns the lane of the last process that matches fork_process

Parameters:
fork_process : list

List of processes before the fork.

pipeline_list : list

List with the pipeline connection dictionaries.

Returns:
int

Lane of the last process that matches fork_process

flowcraft.generator.pipeline_parser.get_lanes(lanes_str)[source]

From a raw pipeline string, get a list of lanes from the start of the current fork.

When the pipeline is being parsed, it will be split at every fork position. The string at the right of the fork position will be provided to this function. Its job is to retrieve the lanes that result from that fork, ignoring any nested forks.

Parameters:
lanes_str : str

Pipeline string after a fork split

Returns:
lanes : list

List of lists, with the list of processes for each lane

flowcraft.generator.pipeline_parser.linear_connection(plist, lane)[source]

Connects a linear list of processes into a list of dictionaries

Parameters:
plist : list

List with process names. This list should contain at least two entries.

lane : int

Corresponding lane of the processes

Returns:
res : list

List of dictionaries with the links between processes
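
A sketch of what such a linear connection amounts to, pairing each process with its successor on the same lane; the dictionary shape is assumed, as in the parse_pipeline example above:

def linear_connection_sketch(plist, lane):
    # Pair every process with its successor on the same lane.
    res = []
    for source, sink in zip(plist, plist[1:]):
        res.append({"input": {"process": source, "lane": lane},
                    "output": {"process": sink, "lane": lane}})
    return res

print(linear_connection_sketch(["trimmomatic", "fastqc", "spades"], lane=1))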

flowcraft.generator.pipeline_parser.fork_connection(source, sink, source_lane, lane)[source]

Makes the connection between a process and the first processes in the lanes to which it forks.

The lane argument should correspond to the lane of the source process. For each lane in sink, the lane counter will increase.

Parameters:
source : str

Name of the process that is forking

sink : list

List of the processes where the source will fork to. Each element corresponds to the start of a lane.

source_lane : int

Lane of the forking process

lane : int

Lane of the source process

Returns:
res : list

List of dictionaries with the links between processes

flowcraft.generator.pipeline_parser.linear_lane_connection(lane_list, lane)[source]

Parameters:
lane_list : list

Each element should correspond to a list of processes for a given lane

lane : int

Lane counter before the fork start

Returns:
res : list

List of dictionaries with the links between processes

flowcraft.generator.pipeline_parser.add_unique_identifiers(pipeline_str)[source]

Returns the pipeline string with unique identifiers and a dictionary with references between the unique keys and the original values

Parameters:
pipeline_str : str

Pipeline string

Returns:
str

Pipeline string with unique identifiers

dict

Match between process unique values and original names

flowcraft.generator.pipeline_parser.remove_unique_identifiers(identifiers_to_tags, pipeline_links)[source]

Removes unique identifiers and adds the original process names to the already parsed pipelines

Parameters:
identifiers_to_tags : dict

Match between unique process identifiers and process names

pipeline_links: list

Parsed pipeline list with unique identifiers

Returns:
list

Pipeline list with original identifiers
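
These two functions are naturally used as a round trip around the parser, presumably so that duplicated process names (e.g., pilon appearing in two lanes) do not clash while parsing. A hedged usage sketch:

from flowcraft.generator.pipeline_parser import (add_unique_identifiers,
                                                 parse_pipeline,
                                                 remove_unique_identifiers)

pipeline_str = "fastqc (spades pilon | skesa pilon)"
uniq_str, id_map = add_unique_identifiers(pipeline_str)  # tag duplicate names
links = parse_pipeline(uniq_str)                         # parse without clashes
links = remove_unique_identifiers(id_map, links)         # restore original names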

flowcraft.generator.process module
class flowcraft.generator.process.Process(template)[source]

Bases: object

Main interface for basic process functionality

The Process class is intended to be inherited by specific process classes (e.g., IntegrityCoverage) and provides the basic functionality to build the channels and links between processes.

Child classes are expected to inherit the __init__ execution, which basically means that, at the very least, the child must be defined as:

class ChildProcess(Process):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

This ensures that when the ChildProcess class is instantiated, it automatically sets the attributes of the parent class.

This also means that child processes must be instantiated providing information on the process type and jinja2 template with the nextflow code.
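
Putting this together with the attributes documented below, a new component might look like the following sketch; the class name and attribute values are illustrative, not a real FlowCraft component:

from flowcraft.generator.process import Process

class MyAssembler(Process):
    """Hypothetical component: receives fastq, emits an assembly."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        # Input/output types used to verify connections between processes.
        self.input_type = "fastq"
        self.output_type = "assembly"

        # Default directives for the nextflow process (see the directives
        # attribute below).
        self.directives = {
            "my_assembler": {
                "cpus": 4,
                "memory": "4GB",
                "container": "my/assembler",
                "version": "1.0.0",
            }
        }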

Parameters:
template : str

Name of the jinja2 template with the nextflow code for that process. Templates are stored in generator/templates.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
RAW_MAPPING = {
    'accessions': {
        'channel': 'IN_accessions_raw',
        'channel_str': 'Channel.fromPath(params.{0}).ifEmpty {{ exit 1, "No accessions file provided with path:\'${{params.{0}}}\'" }}',
        'checks': 'if (!params.{0}){{ exit 1, "\'{0}\' parameter missing" }}\n',
        'default_value': 'null',
        'description': 'Path to file with accessions, one per line. (default: $params.accessions)',
        'params': 'accessions'},
    'fasta': {
        'channel': 'IN_fasta_raw',
        'channel_str': 'Channel.fromPath(params.{0}).map{{ it -> file(it).exists() ? [it.toString().tokenize(\'/\').last().tokenize(\'.\')[0..-2].join(\'.\'), it] : null }}.ifEmpty {{ exit 1, "No fasta files provided with pattern:\'${{params.{0}}}\'" }}',
        'checks': 'if (params.{0} instanceof Boolean){{exit 1, "\'{0}\' must be a path pattern. Provided value:\'$params.{0}\'"}}\nif (!params.{0}){{ exit 1, "\'{0}\' parameter missing"}}',
        'default_value': "'fasta/*.fasta'",
        'description': 'Path to fasta files. (default: $params.fasta)',
        'params': 'fasta'},
    'fastq': {
        'channel': 'IN_fastq_raw',
        'channel_str': 'Channel.fromFilePairs(params.{0}).ifEmpty {{ exit 1, "No fastq files provided with pattern:\'${{params.{0}}}\'" }}',
        'checks': 'if (params.{0} instanceof Boolean){{exit 1, "\'{0}\' must be a path pattern. Provided value:\'$params.{0}\'"}}\nif (!params.{0}){{ exit 1, "\'{0}\' parameter missing"}}',
        'default_value': "'fastq/*_{1,2}.*'",
        'description': 'Path expression to paired-end fastq files. (default: $params.fastq)',
        'params': 'fastq'}
}

dict: Contains the mapping between the Process.input_type attribute and the corresponding nextflow parameter and main channel definition, e.g.:

"fastq" : {
    "params": "fastq",
    "channel: "<channel>
}
pid = None

int: Process ID number that represents the order and position in the generated pipeline

template = None

str: Template name for the current process. This string will be used to fetch the file containing the corresponding jinja2 template in the _set_template() method

input_type = None

str: Type of expected input data. Used to verify the connection between two processes is viable.

output_type = None

str: Type of output data. Used to verify the connection between two processes is viable.

ignore_type = None

boolean: If True, this process will ignore the input/output type requirements. This attribute is set to True for terminal singleton forks in the pipeline.

ignore_pid = None

boolean: If True, this process will not make the pid advance. This is used for terminal forks before the end of the pipeline.

dependencies = None

list: Contains the dependencies of the current process in the form of the Process.template attribute (e.g., [fastqc])

input_channel = None

str: Placeholder of the main input channel for the current process. This attribute can change dynamically depending on the forks and secondary channels in the final pipeline.

output_channel = None

str: Placeholder of the main output channel for the current process. This attribute can change dynamically depending on the forks and secondary channels in the final pipeline.

input_user_channel = None

dict: Stores a dictionary of two key:value pairs containing the raw input channel for the process. This is automatically determined by the input_type attribute, and will fetch the information that is mapped in the RAW_MAPPING variable. It will only be used by the first process(es) defined in a pipeline.

link_start = None

list: List of strings with the starting points for secondary channels. When building the pipeline, these strings will be matched with equal strings in the link_end attribute of other Processes.

link_end = None

list: List of dictionaries containing a string with the ending point for a secondary channel. Each dictionary should contain at least two key/vals: {"link": <link string>, "alias": <string for template>}

status_channels = None

list: Names of the status channels produced by the process. By default, it sets a single status channel. If more than one status channel is required for the process, list each one in this attribute (e.g., FastQC.status_channels)

status_strs = None

str: Name of the status channel for the current process. These strings will be provided to the StatusCompiler process to collect and compile status reports

forks = None

list: List of strings with the literal definition of the forks for the current process, ready to be added to the template string.

main_forks = None

list: List of the channels into which the main output should be forked. They will be automatically added to the main_forks attribute when setting the secondary channels

secondary_inputs = None

list: List of dictionaries with secondary input channels from nextflow parameters. This dictionary should contain two key:value pairs with the params key, containing the parameter name, and the channel key, containing the nextflow channel definition:

{
    "params": "pathoSpecies",
    "channel": "IN_pathoSpecies = Channel
                                    .value(params.pathoSpecies)"
}
extra_input = None

str: Name of the parameter that will be used to provide extra input into the process. This extra input will be mixed with the main input channel using nextflow’s mix operator. Its channel will be defined at the start of the pipeline, based on the channel_str key of RAW_MAPPING for the corresponding input type.

params = None

dict: Maps the parameter names to the corresponding default values.

param_id = None

str: The parameter id suffix that will be added to each parameter. If it is empty, identical parameters in different components will be merged.

directives = None

dict: Specifies the directives (cpus, memory, container) for each nextflow process in the template. If specified, these directives will be added to the nextflow configuration file. Otherwise, the default values for cpus and memory will be used, and the process will not run inside any container.

The current supported directives are:
  • cpus
  • memory
  • container
  • container tag/version

An example of directives for two processes is as follows:

self.directives = {
    "processA": {"cpus": 1, "memory": "1GB"},
    "processB": {"memory": "5GB", "container": "my/image",
                 "version": "0.5.0"}
}
compiler = None

dict: Specifies channels from the current process that are received by a compiler process. Each key in this dictionary should match a compiler process key in compilers. The value should be a list of the channels that will be fed to the compiler process:

self.compiler["patlas_consensus"] = ["mashScreenOutputChannel"]
set_main_channel_names(input_suffix, output_suffix, lane)[source]

Sets the main channel names based on the provided input and output channel suffixes. This is performed when connecting processes.

Parameters:
input_suffix : str

Suffix added to the input channel. Should be based on the lane and an arbitrary unique id

output_suffix : str

Suffix added to the output channel. Should be based on the lane and an arbitrary unique id

lane : int

Sets the lane of the process.

set_param_id(param_id)[source]

Sets the param_id for the process, which will be used to render the template.

Parameters:
param_id : str

The param_id attribute of the process.

get_user_channel(input_channel, input_type=None)[source]

Returns the main raw channel for the process

Provided with at least a channel name, this method returns the raw channel name and specification (the nextflow string definition) for the process. By default, it will fork from the raw input of the process’ input_type attribute. However, this behaviour can be overridden by providing the input_type argument.

If the specified or inferred input type exists in the RAW_MAPPING dictionary, the channel info dictionary will be retrieved along with the specified input channel. Otherwise, it will return None.

An example of the returned dictionary is:

 {"input_channel": "myChannel",
 "params": "fastq",
 "channel": "IN_fastq_raw",
 "channel_str":"IN_fastq_raw = Channel.fromFilePairs(params.fastq)"
}
Returns:
dict or None

Dictionary with the complete raw channel info. None if no channel is found.

static render(template, context)[source]

Wrapper to the jinja2 render method from a template file

Parameters:
template : str

Path to template file.

context : dict

Dictionary with kwargs context to populate the template

template_str

Class property that returns a populated template string

This property allows the template of a particular process to be dynamically generated and returned when doing Process.template_str.

Returns:
x : str

String with the complete and populated process template

set_channels(**kwargs)[source]

General purpose method that sets the main channels

This method will take a variable number of keyword arguments to set the Process._context attribute with the information on the main channels for the process. This is done by appending the process ID (Process.pid) attribute to the input, output and status channel prefix strings. In the output channel, the process ID is incremented by 1 to allow the connection with the channel in the next process.

The **kwargs system for setting the Process._context attribute also provides additional flexibility. In this way, individual processes can provide additional information not covered in this method, without changing it.

Parameters:
kwargs : dict

Dictionary with the keyword arguments for setting up the template context
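As a rough sketch of the naming scheme described above (the MAIN and STATUS prefix strings below are placeholders, not the actual FlowCraft constants):

# Placeholder prefixes; the real prefix strings may differ.
pid = 2

input_channel = "MAIN_{}".format(pid)       # MAIN_2
output_channel = "MAIN_{}".format(pid + 1)  # MAIN_3, consumed by the next process
status_channel = "STATUS_{}".format(pid)    # STATUS_2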

update_main_input(input_str)[source]
update_main_forks(sink)[source]

Updates the forks attribute with the sink channel destination

Parameters:
sink : str

Channel to which the main input will be forked

set_secondary_channel(source, channel_list)[source]

General purpose method for setting a secondary channel

This method allows a given source channel to be forked into one or more channels and sets those forks in the Process.forks attribute. Both the source and the channels in the channel_list argument must be the final channel strings, which means that this method should be called only after setting the main channels.

If the source is not a main channel, this will simply set a fork for every channel in the channel_list argument:

SOURCE_CHANNEL_1.into{SINK_1;SINK_2}

If the source is a main channel, this will apply some changes to the output channel of the process, to avoid overlapping main output channels. For instance, forking the main output channel for process 2 would create a MAIN_2.into{...}. The issue here is that the MAIN_2 channel is expected as the input of the next process, but is now being used to create the fork. To solve this issue, the output channel is modified into _MAIN_2, and the fork is set to the provided channels plus the MAIN_2 channel:

_MAIN_2.into{MAIN_2;MAIN_5;...}
Parameters:
source : str

String with the name of the source channel

channel_list : list

List of channels that will receive a fork of the secondary channel

update_attributes(attr_dict)[source]

Updates the directives attribute from a dictionary object.

This will only update the directives for processes that have been defined in the subclass.

Parameters:
attr_dict : dict

Dictionary containing the attributes that will be used to update the process attributes and/or directives.

class flowcraft.generator.process.Compiler(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Extends the Process methods to status-type processes

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_compiler_channels(channel_list[, operator]) General method for setting the input channels for the status process
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
set_compiler_channels(channel_list, operator='mix')[source]

General method for setting the input channels for the status process

Given a list of status channels that are gathered during the pipeline construction, this method will automatically set the input channel for the status process. This makes use of the mix channel operator of nextflow for multiple channels:

STATUS_1.mix(STATUS_2,STATUS_3,...)

This will set the status_channels key for the _context attribute of the process.

Parameters:
channel_list : list

List of strings with the final name of the status channels

operator : str

Specifies the operator used to join the compiler channels. Available options are 'mix' and 'join'.
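A hypothetical usage sketch (assuming the constructor accepts attributes as keyword arguments; channel names are illustrative):

# Channel names are illustrative; the template kwarg is an assumption.
compiler = StatusCompiler(template="status_compiler")
compiler.set_compiler_channels(["STATUS_1", "STATUS_2", "STATUS_3"],
                               operator="mix")
# The rendered template would then join the channels as:
#   STATUS_1.mix(STATUS_2,STATUS_3)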

class flowcraft.generator.process.Init(**kwargs)[source]

Bases: flowcraft.generator.process.Process

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_extra_inputs(channel_dict) Sets the initial definition of the extra input channels.
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_raw_inputs(raw_input) Sets the main input channels of the pipeline and their forks.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
set_secondary_inputs(channel_dict) Adds secondary inputs to the start of the pipeline.
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
set_raw_inputs(raw_input)[source]

Sets the main input channels of the pipeline and their forks.

The raw_input dictionary should contain one entry for each input type (fastq, fasta, etc). The corresponding value should be a dictionary/json with the following key:value pairs (see the example after this list):

  • channel: Name of the raw input channel (e.g.: channel1)
  • channel_str: The nextflow definition of the channel and
    eventual checks (e.g.: channel1 = Channel.fromPath(param))
  • raw_forks: A list of channels into which the raw input channel will be forked.
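Putting these keys together, a hypothetical raw_input entry could look like this (the fork destination channel is illustrative):

raw_input = {
    "fastq": {
        "channel": "IN_fastq_raw",
        "channel_str": "IN_fastq_raw = Channel.fromFilePairs(params.fastq)",
        "raw_forks": ["trimmomatic_1_1"]  # illustrative destination channel
    }
}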

Each new type of input parameter is automatically added to the params attribute, so that they are automatically collected for the pipeline description and help.

Parameters:
raw_input : dict

Contains an entry for each input type with the channel name, channel string and forks.

set_secondary_inputs(channel_dict)[source]

Adds secondary inputs to the start of the pipeline.

These channels are inserted into the pipeline file exactly as provided in the values of the argument.

Parameters:
channel_dict : dict

Each entry should be <parameter>: <channel string>.

set_extra_inputs(channel_dict)[source]

Sets the initial definition of the extra input channels.

The channel_dict argument should contain the input type and destination channel of each parameter (which is the key):

channel_dict = {
    "param1": {
        "input_type": "fasta",
        "channels": ["abricate_2_3", "chewbbaca_3_4"]
    }
}
Parameters:
channel_dict : dict

Dictionary with the extra_input parameter as key, and a dictionary as a value with the input_type and destination channels

class flowcraft.generator.process.StatusCompiler(**kwargs)[source]

Bases: flowcraft.generator.process.Compiler

Status compiler process template interface

This special process receives the status channels from all processes in the generated pipeline.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_compiler_channels(channel_list[, operator]) General method for setting the input channels for the status process
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.process.ReportCompiler(**kwargs)[source]

Bases: flowcraft.generator.process.Compiler

Reports compiler process template interface

This special process receives the report channels from all processes in the generated pipeline.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_compiler_channels(channel_list[, operator]) General method for setting the input channels for the status process
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
class flowcraft.generator.process.PatlasConsensus(**kwargs)[source]

Bases: flowcraft.generator.process.Compiler

Patlas consensus compiler process template interface

This special process receives the channels associated with the patlas_consensus key.

Attributes:
template_str

Class property that returns a populated template string

Methods

get_user_channel(input_channel[, input_type]) Returns the main raw channel for the process
render(template, context) Wrapper to the jinja2 render method from a template file
set_channels(**kwargs) General purpose method that sets the main channels
set_compiler_channels(channel_list[, operator]) General method for setting the input channels for the status process
set_main_channel_names(input_suffix, …) Sets the main channel names based on the provided input and output channel suffixes.
set_param_id(param_id) Sets the param_id for the process, which will be used to render the template.
set_secondary_channel(source, channel_list) General purpose method for setting a secondary channel
update_attributes(attr_dict) Updates the directives attribute from a dictionary object.
update_main_forks(sink) Updates the forks attribute with the sink channel destination
update_main_input  
flowcraft.generator.process_details module
flowcraft.generator.process_details.colored_print(msg, color_label='white_bold')[source]
This function prints a message in a given color. It also allows an end_char to be passed to print, so that several strings can be printed on the same line by consecutive calls.
Parameters:
msg: str

The actual text to be printed.

color_label: str

The color code to pass to the function, which enables color change as well as background color change.

end_char: str

The character with which each print should finish. By default it is the newline character ("\n").
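A hypothetical call (the color label must be one of the labels known to the function; "green_bold" is assumed here):

# "green_bold" is an assumed label; the default is "white_bold".
colored_print("Pipeline built successfully", color_label="green_bold")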
flowcraft.generator.process_details.procs_dict_parser(procs_dict)[source]

This function handles the dictionary of attributes of each Process class, printing to stdout either a list of all the components or only the components specified by the user with the -t flag.

Parameters:
procs_dict: dict

A dictionary with the class attributes for all the components (or the components selected with the -t flag), used to create both the short_list and detailed_list. Dictionary example: {"abyss": {'input_type': 'fastq', 'output_type': 'fasta', 'dependencies': [], 'directives': {'abyss': {'cpus': 4, 'memory': '{ 5.GB * task.attempt }', 'container': 'flowcraft/abyss', 'version': '2.1.1', 'scratch': 'true'}}}}

flowcraft.generator.process_details.proc_collector(process_map, args, pipeline_string)[source]

Function that collects all processes available and stores a dictionary of the required arguments of each process class to be passed to procs_dict_parser

Parameters:
process_map: dict

The dictionary with the Processes currently available in flowcraft and their corresponding classes as values

args: argparse.Namespace

The arguments passed through argparser that will be accessed to check the type of list to be printed

pipeline_string: str

The pipeline string

flowcraft.generator.recipe module
class flowcraft.generator.recipe.InnuendoRecipe[source]

Bases: object

Methods

build_downstream(process_descriptions, task, …) Builds the downstream pipeline of the current process
build_pipeline_string(forks) Parses, filters and merges all possible pipeline forks into the final pipeline string
build_upstream(process_descriptions, task, …) Builds the upstream pipeline of the current process
define_pipeline_string(process_descriptions, …) Builds the possible forks and connections between the provided processes
run_auto_pipeline(tasks) Main method to run the automatic pipeline creation
validate_pipeline(pipeline_string) Validate pipeline string
count_forks = None

int : counts the total possible number of forks

forks = None

list : a list with all the possible forks

pipeline_string = None

str : the generated pipeline string

process_to_id = None

dict: key value between the process name and its identifier

static validate_pipeline(pipeline_string)[source]

Validate pipeline string

Validates the pipeline string by searching for forbidden characters

Parameters:
pipeline_string : str

String with the provided processes

build_upstream(process_descriptions, task, all_tasks, task_pipeline, count_forks, total_tasks, forks)[source]

Builds the upstream pipeline of the current process

Checks for the upstream processes to the current process and adds them to the current pipeline fragment if they were provided in the process list.

Parameters:
process_descriptions : dict

Information about each process's input, output and whether it is forkable

task : str

Current process

all_tasks : list

A list of all provided processes

task_pipeline : list

Current pipeline fragment

count_forks : int

Current number of forks

total_tasks : str

All space separated processes

forks : list

Current forks

Returns:
list : resulting pipeline fragment
build_downstream(process_descriptions, task, all_tasks, task_pipeline, count_forks, total_tasks, forks)[source]

Builds the downstream pipeline of the current process

Checks for the downstream processes to the current process and adds them to the current pipeline fragment.

Parameters:
process_descriptions : dict

Information about each process's input, output and whether it is forkable

task : str

Current process

all_tasks : list

A list of all provided processes

task_pipeline : list

Current pipeline fragment

count_forks : int

Current number of forks

total_tasks : str

All space separated processes

forks : list

Current forks

Returns:
list : resulting pipeline fragment
define_pipeline_string(process_descriptions, tasks, check_upstream, check_downstream, count_forks, total_tasks, forks)[source]

Builds the possible forks and connections between the provided processes

This method loops through all the provided tasks and builds the upstream and downstream pipeline if required. It then returns all possible forks that need to be merged a posteriori.

Parameters:
process_descriptions : dict

Information about each process's input, output and whether it is forkable

tasks : str

Space separated processes

check_upstream : bool

Whether to build the upstream pipeline of the current task

check_downstream : bool

Whether to build the downstream pipeline of the current task

count_forks : int

Number of current forks

total_tasks : str

All space separated processes

forks : list

Current forks

Returns:
list : List with all the possible pipeline forks
build_pipeline_string(forks)[source]

Parses, filters and merges all possible pipeline forks into the final pipeline string

This method checks for shared start and end sections between forks and merges them according to the shared processes:

[[spades, ...], [skesa, ...], [...,[spades, skesa]]]
    -> [..., [[spades, ...], [skesa, ...]]]

Then it defines the pipeline string by converting the array nesting levels into the flowcraft fork format:

[..., [[spades, ...], [skesa, ...]]]
    -> ( ... ( spades ... | skesa ... ) )
Parameters:
forks : list

List with all the possible pipeline forks.

Returns:
str : String with the pipeline definition used as input for parse_pipeline
run_auto_pipeline(tasks)[source]

Main method to run the automatic pipeline creation

This method aggregates the functions required to build the pipeline string that can be used as input for the workflow generator.

Parameters:
tasks : str

A string with the space separated tasks to be included in the pipeline

Returns:
str : String with the pipeline definition used as input for parse_pipeline
class flowcraft.generator.recipe.Innuendo(*args, **kwargs)[source]

Bases: flowcraft.generator.recipe.InnuendoRecipe

Recipe class for the INNUENDO Project. It includes all the processes available in the platform, for quick use in the scope of the project.

Methods

build_downstream(process_descriptions, task, …) Builds the downstream pipeline of the current process
build_pipeline_string(forks) Parses, filters and merges all possible pipeline forks into the final pipeline string
build_upstream(process_descriptions, task, …) Builds the upstream pipeline of the current process
define_pipeline_string(process_descriptions, …) Builds the possible forks and connections between the provided processes
run_auto_pipeline(tasks) Main method to run the automatic pipeline creation
validate_pipeline(pipeline_string) Validate pipeline string
flowcraft.generator.recipe.brew_innuendo(args)[source]

Brews a given list of processes according to the recipe

Parameters:
args : argparse.Namespace

The arguments passed through argparser that will be used to check the recipe and tasks, and to brew the process

Returns:
str

The final pipeline string, ready for the engine.

list

List of process strings.

class flowcraft.generator.recipe.Recipe[source]

Bases: object

Methods

brew  
pipeline_str = None

str: The raw pipeline string, with no attributes or directives, except for number indicators when there are duplicate components.

e.g.: "fastqc trimmomatic spades" or "fastqc trimmomatic (spades#1 | spades#2)"

directives = None

dict: Dictionary with the parameters and directives for each component in the pipeline_str attribute. Missing components will be left with the default parameters and directives.

brew()[source]
flowcraft.generator.recipe.brew_recipe(recipe_name)[source]

Returns a pipeline string from a recipe name.

Parameters:
recipe_name : str

Name of the recipe. Must match the name attribute in one of the classes defined in flowcraft.generator.recipes

Returns:
str

Pipeline string ready for parsing and processing by flowcraft engine

flowcraft.generator.recipe.list_recipes(full=False)[source]

Method that iterates over all available recipes and prints their information to the standard output

Parameters:
full : bool

If true, it will provide the pipeline string along with the recipe name

Module contents

Placeholder for Process creation docs

flowcraft.templates package

Subpackages
flowcraft.templates.flowcraft_utils package
Submodules
flowcraft.templates.flowcraft_utils.flowcraft_base module
flowcraft.templates.flowcraft_utils.flowcraft_base.get_logger(filepath, level=10)[source]
flowcraft.templates.flowcraft_utils.flowcraft_base.log_error()[source]

Nextflow specific function that logs an error upon an unexpected failure

class flowcraft.templates.flowcraft_utils.flowcraft_base.MainWrapper(f)[source]

Bases: object

Methods

__call__(*args, **kwargs) Call self as a function.
build_versions() Writes versions JSON for a template file
build_versions()[source]

Writes versions JSON for a template file

This method creates the JSON file .versions based on the metadata and specific functions that are present in a given template script.

It starts by fetching the template metadata, which can be specified via the __version__, __template__ and __build__ attributes. If all of these attributes exist, it starts to populate a JSON/dict array (Note that the absence of any one of them will prevent the version from being written).

Then, it will search the template scope for functions whose name starts with the substring __set_version (for example, def __set_version_fastqc()). These functions should gather the version of an arbitrary program and return a JSON/dict object with the following information:

{
    "program": <program_name>,
    "version": <version>,
    "build": <build>
}

This JSON/dict object is then written in the .versions file.
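A hedged sketch of what a template script could define so that build_versions picks it up (the fastqc command and version parsing below are illustrative, not the actual template code):

import subprocess

__version__ = "1.0.0"     # template metadata read by build_versions
__template__ = "fastqc"
__build__ = "1"


def __set_version_fastqc():
    # Illustrative version collector; the real command/parsing may differ.
    out = subprocess.check_output(["fastqc", "--version"]).decode().strip()
    version = out.split()[-1].lstrip("v")  # e.g. "FastQC v0.11.5" -> "0.11.5"
    return {
        "program": "fastqc",
        "version": version,
        "build": "1"
    }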

Module contents
Submodules
flowcraft.templates.assembly_report module
Purpose

This module is intended to provide a summary report for a given assembly in Fasta format.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample Identification string.
    • e.g.: 'SampleA'
  • assembly : Path to assembly file in Fasta format.
    • e.g.: 'assembly.fasta'
Generated output
  • ${sample_id}_assembly_report.csv : CSV with summary information of the assembly.
    • e.g.: 'SampleA_assembly_report.csv'
Code documentation
class flowcraft.templates.assembly_report.Assembly(assembly_file, sample_id)[source]

Bases: object

Class that parses and filters an assembly file in Fasta format.

This class parses an assembly file, collects a number of summary statistics and metadata from the contigs, and reports them.

Parameters:
assembly_file : str

Path to assembly file.

sample_id : str

Name of the sample for the current assembly.

Methods

get_coverage_sliding(coverage_file[, window])
Parameters:
get_gc_sliding([window]) Calculates a sliding window of the GC content for the assembly
get_summary_stats([output_csv]) Generates a CSV report with summary statistics about the assembly
summary_info = None

OrderedDict: Initialize summary information dictionary. Contains keys:

  • ncontigs: Number of contigs
  • avg_contig_size: Average size of contigs
  • n50: N50 metric
  • total_len: Total assembly length
  • avg_gc: Average GC proportion
  • missing_data: Count of missing data characters
contigs = None

OrderedDict: Object that maps the contig headers to the corresponding sequence

contig_coverage = None

OrderedDict: Object that maps the contig headers to the corresponding list of per-base coverage

sample = None

str: Sample id

contig_boundaries = None

dict: Maps the boundaries of each contig in the genome

get_summary_stats(output_csv=None)[source]

Generates a CSV report with summary statistics about the assembly

The calculated statistics are:

  • Number of contigs
  • Average contig size
  • N50 (see the sketch after this list)
  • Total assembly length
  • Average GC content
  • Amount of missing data
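For reference, a minimal sketch of the N50 computation over a list of contig lengths:

def n50(contig_lengths):
    # Length of the contig at which the cumulative sum of sorted
    # contig lengths reaches half of the total assembly length.
    sorted_lens = sorted(contig_lengths, reverse=True)
    half = sum(sorted_lens) / 2
    running = 0
    for length in sorted_lens:
        running += length
        if running >= half:
            return length

print(n50([100, 80, 50, 30, 20]))  # -> 80, since 100 + 80 >= 280 / 2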
Parameters:
output_csv: str

Name of the output CSV file.

get_gc_sliding(window=2000)[source]

Calculates a sliding window of the GC content for the assembly

Returns:
gc_res : list

List of GC proportion floats for each data point in the sliding window
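A minimal sketch of such a windowed GC calculation over a single sequence (non-overlapping windows are an assumption of this sketch):

def gc_sliding(sequence, window=2000):
    gc_res = []
    for i in range(0, len(sequence), window):
        chunk = sequence[i:i + window].upper()
        # GC proportion for the current window
        gc_res.append((chunk.count("G") + chunk.count("C")) / len(chunk))
    return gc_res

print(gc_sliding("ATGCGC" * 1000))  # ~0.67 for each 2000 bp window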

get_coverage_sliding(coverage_file, window=2000)[source]
Parameters:
coverage_file : str

Path to file containing the coverage info at the per-base level (as generated by samtools depth)

window : int

Size of sliding window

flowcraft.templates.fastqc module
Purpose

This module is intended to run FastQC on paired-end FastQ files.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • fastq_pair : Pair of FastQ file paths
    • e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
Generated output

The generated output consists of output files that contain an object, usually a string.

  • pair_{1,2}_data : File containing FastQC report at the nucleotide level for each pair
    • e.g.: 'pair_1_data' and 'pair_2_data'
  • pair_{1,2}_summary: File containing FastQC report for each category and for each pair
    • e.g.: 'pair_1_summary' and 'pair_2_summary'
Code documentation
flowcraft.templates.fastqc.convert_adatpers(adapter_fasta)[source]

Generates an adapter file for FastQC from a fasta file.

The provided adapters file is assumed to be a simple fasta file with the adapter’s name as header and the corresponding sequence:

>TruSeq_Universal_Adapter
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>TruSeq_Adapter_Index 1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
Parameters:
adapter_fasta : str

Path to Fasta file with adapter sequences.

Returns:
adapter_out : str or None

The path to the reformatted adapter file. Returns None if the adapters file does not exist or the path is incorrect.

flowcraft.templates.fastqc_report module
Purpose

This module is intended to parse the results of FastQC for paired end FastQ samples. It parses two reports:

  • Categorical report
  • Nucleotide level report.
Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample identification string
    • e.g.: 'SampleA'
  • result_p1 : Path to both FastQC result files for pair 1
    • e.g.: 'SampleA_1_data SampleA_1_summary'
  • result_p2 : Path to both FastQC result files for pair 2
    • e.g.: 'SampleA_2_data SampleA_2_summary'
  • opts : Specify additional arguments for executing fastqc_report. The arguments should be a string of command line arguments. The accepted arguments are:
    • '--ignore-tests' : Ignores test results from FastQC categorical summary. This is used in the first run of FastQC.
Generated output

The generated output consists of output files that contain an object, usually a string.

  • fastqc_health : Stores the health check for the current sample. If it passes all checks, it contains only the string 'pass'. Otherwise, it contains the summary categories and their respective results.
    • e.g.: 'pass'
  • optimal_trim : Stores a tuple with the optimal trimming positions for the 5' and 3' ends of the reads.
    • e.g.: '15 151'
Code documentation
flowcraft.templates.fastqc_report.write_json_report(sample_id, data1, data2)[source]

Writes the report

Parameters:
data1
data2
flowcraft.templates.fastqc_report.get_trim_index(biased_list)[source]

Returns the trim index from a bool list

Provided with a list of bool elements ([False, False, True, True]), this function will assess the index of the list that minimizes the number of True elements (biased positions) at the extremities. To do so, it will iterate over the boolean list and find an index position where there are two consecutive False elements after a True element. This will be considered as an optimal trim position. For example, in the following list:

[True, True, False, True, True, False, False, False, False, ...]

The optimal trim index will be the 4th position, since it is the first occurrence of a True element with two False elements after it.

If the provided bool list has no True elements, then the 0 index is returned.

Parameters:
biased_list: list

List of bool elements, where True means a biased site.

Returns:
x : index position of the biased list for the optimal trim.
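A minimal sketch of the logic described above (the exact off-by-one convention is an assumption of this sketch):

def get_trim_index(biased_list):
    # Index of the first biased (True) position followed by two
    # consecutive unbiased (False) positions; 0 if nothing qualifies.
    for i, biased in enumerate(biased_list):
        if biased and biased_list[i + 1:i + 3] == [False, False]:
            return i
    return 0

example = [True, True, False, True, True, False, False, False, False]
print(get_trim_index(example))  # -> 4, as in the example above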
flowcraft.templates.fastqc_report.trim_range(data_file)[source]

Assess the optimal trim range for a given FastQC data file.

This function will parse a single FastQC data file, namely the 'Per base sequence content' category. It will retrieve the A/T and G/C content for each nucleotide position in the reads, and check whether the G/C and A/T proportions are between 80% and 120%. If they are not, that nucleotide position is marked as biased for future removal.

Parameters:
data_file: str

Path to FastQC data file.

Returns:
trim_nt: list

List containing the range with the best trimming positions for the corresponding FastQ file. The first element is the 5’ end trim index and the second element is the 3’ end trim index.

flowcraft.templates.fastqc_report.get_sample_trim(p1_data, p2_data)[source]

Get the optimal read trim range from data files of paired FastQ reads.

Given the FastQC data report files for paired-end FastQ reads, this function will assess the optimal trim range for the 3’ and 5’ ends of the paired-end reads. This assessment will be based on the ‘Per sequence GC content’.

Parameters:
p1_data: str

Path to FastQC data report file from pair 1

p2_data: str

Path to FastQC data report file from pair 2

Returns:
optimal_5trim: int

Optimal trim index for the 5’ end of the reads

optimal_3trim: int

Optimal trim index for the 3’ end of the reads

See also

trim_range

flowcraft.templates.fastqc_report.get_summary(summary_file)[source]

Parses a FastQC summary report file and returns it as a dictionary.

This function parses a typical FastQC summary report file, retrieving only the information on the first two columns. For instance, a line could be:

'PASS   Basic Statistics        SH10762A_1.fastq.gz'

This parser will build a dictionary with the string in the second column as a key and the QC result as the value. In this case, the returned dict would be something like:

{"Basic Statistics": "PASS"}
Parameters:
summary_file: str

Path to FastQC summary report.

Returns:
summary_info: OrderedDict

Returns the information of the FastQC summary report as an ordered dictionary, with the categories as strings and the QC result as values.

flowcraft.templates.fastqc_report.check_summary_health(summary_file, **kwargs)[source]

Checks the health of a sample from the FastQC summary file.

Parses the FastQC summary file and tests whether the sample is good or not. There are four categories that cannot fail, and two that must pass in order for the sample to pass this check. If the sample fails the quality checks, a list with the failing categories is also returned.

Categories that cannot fail:

fail_sensitive = [
    "Per base sequence quality",
    "Overrepresented sequences",
    "Sequence Length Distribution",
    "Per sequence GC content"
]

Categories that must pass:

must_pass = [
    "Per base N content",
    "Adapter Content"
]
Parameters:
summary_file: str

Path to FastQC summary file.

Returns:
x : bool

Returns True if the sample passes all tests. False if not.

summary_info : list

A list with the FastQC categories that failed the tests. Is empty if the sample passes all tests.
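A minimal sketch of the rules above, building on get_summary (the real function accepts additional options via **kwargs, which are omitted here):

def check_summary_health(summary_file):
    fail_sensitive = ["Per base sequence quality",
                      "Overrepresented sequences",
                      "Sequence Length Distribution",
                      "Per sequence GC content"]
    must_pass = ["Per base N content", "Adapter Content"]

    summary_info = get_summary(summary_file)  # {category: PASS/WARN/FAIL}
    failed = [cat for cat, res in summary_info.items()
              if (cat in fail_sensitive and res == "FAIL")
              or (cat in must_pass and res != "PASS")]
    return not failed, failed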

flowcraft.templates.integrity_coverage module
Purpose

This module receives paired FastQ files, a genome size estimate and a minimum coverage threshold and has three purposes while iterating over the FastQ files:

  • Checks the integrity of FastQ files (corrupted files).
  • Guesses the encoding of FastQ files (this can be turned off in the opts argument).
  • Estimates the coverage for each sample.
Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample Identification string
    • e.g.: 'SampleA'
  • fastq_pair : Pair of FastQ file paths
    • e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
  • gsize : Expected genome size
    • e.g.: '2.5'
  • cov : Minimum coverage threshold
    • e.g.: '15'
  • opts : Specify additional arguments for executing integrity_coverage. The arguments should be a string of command line arguments, such as ‘-e’. The accepted arguments are:
    • '-e' : Skip encoding guess.
Generated output

The generated output consists of output files that contain an object, usually a string. (Values within ${} are substituted by the corresponding variable.)

  • ${sample_id}_encoding : Stores the encoding for the sample FastQ. If no encoding could be guessed, write ‘None’ to file.
    • e.g.: 'Illumina-1.8' or 'None'
  • ${sample_id}_phred : Stores the phred value for the sample FastQ. If no phred could be guessed, write ‘None’ to file.
    • '33' or 'None'
  • ${sample_id}_coverage : Stores the expected coverage of the samples, based on a given genome size.
    • '112' or 'fail'
  • ${sample_id}_report : Stores the report on the expected coverage estimation. The string written in this file will appear in the coverage report.
    • '${sample_id}, 112, PASS'
  • ${sample_id}_max_len : Stores the maximum read length for the current sample.
    • '152'
Notes

In case of a corrupted sample, all expected output files should have 'corrupt' written.

Code documentation
flowcraft.templates.integrity_coverage.RANGES = {'Illumina-1.3': [64, (64, 104)], 'Illumina-1.5': [64, (66, 105)], 'Illumina-1.8': [33, (33, 74)], 'Sanger': [33, (33, 73)], 'Solexa': [64, (59, 104)]}

dict: Dictionary containing the encoding values for several fastq formats. The key contains the format and the value contains a list with the corresponding phred score and a list with the range of encodings.

flowcraft.templates.integrity_coverage.MAGIC_DICT = {b'\x1f\x8b\x08': 'gz', b'\x42\x5a\x68': 'bz2', b'\x50\x4b\x03\x04': 'zip'}

dict: Dictionary containing the binary signatures for three compression formats (gzip, bzip2 and zip).

flowcraft.templates.integrity_coverage.guess_file_compression(file_path, magic_dict=None)[source]

Guesses the compression of an input file.

This function guesses the compression of a given file by checking for a binary signature at the beginning of the file. These signatures are stored in the MAGIC_DICT dictionary. The supported compression formats are gzip, bzip2 and zip. If none of the signatures in this dictionary are found at the beginning of the file, it returns None.

Parameters:
file_path : str

Path to input file.

magic_dict : dict, optional

Dictionary containing the signatures of the compression types. The key should be the binary signature and the value should be the compression format. If left None, it falls back to MAGIC_DICT.

Returns:
file_type : str or None

If a compression type is detected, returns a string with the format. If not, returns None.
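A minimal sketch of this signature check, using the MAGIC_DICT constant shown above:

def guess_file_compression(file_path, magic_dict=None):
    magic_dict = magic_dict or MAGIC_DICT
    max_len = max(len(sig) for sig in magic_dict)
    with open(file_path, "rb") as fh:
        header = fh.read(max_len)
    # Return the first compression format whose signature matches
    for signature, file_type in magic_dict.items():
        if header.startswith(signature):
            return file_type
    return None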

flowcraft.templates.integrity_coverage.get_qual_range(qual_str)[source]

Get the Unicode code range for a given string of characters.

The encoding is determined from the result of the ord() built-in.

Parameters:
qual_str : str

Arbitrary string.

Returns:
x : tuple

(Minimum Unicode code, Maximum Unicode code).
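In essence, a one-line sketch of this computation:

def get_qual_range(qual_str):
    codes = [ord(c) for c in qual_str]
    return min(codes), max(codes)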

flowcraft.templates.integrity_coverage.get_encodings_in_range(rmin, rmax)[source]

Returns the valid encodings for a given encoding range.

The encoding ranges are stored in the RANGES dictionary, with the encoding name as a string key and a list as a value containing the phred score and a tuple with the encoding range. For a given encoding range provided via the first two arguments, this function will return all possible encodings and phred scores.

Parameters:
rmin : int

Minimum Unicode code in range.

rmax : int

Maximum Unicode code in range.

Returns:
valid_encodings : list

List of all possible encodings for the provided range.

valid_phred : list

List of all possible phred scores.
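A minimal sketch against the RANGES constant shown earlier (only two entries reproduced here):

RANGES = {
    "Sanger": [33, (33, 73)],
    "Illumina-1.8": [33, (33, 74)]
    # ... remaining entries as in the RANGES constant above
}

def get_encodings_in_range(rmin, rmax, ranges=RANGES):
    valid_encodings, valid_phred = [], []
    for encoding, (phred, (emin, emax)) in ranges.items():
        # An encoding is valid if the observed range fits inside its range
        if rmin >= emin and rmax <= emax:
            valid_encodings.append(encoding)
            valid_phred.append(phred)
    return valid_encodings, valid_phred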

flowcraft.templates.mapping2json module
flowcraft.templates.mashdist2json module
Purpose

This module is intended to generate a json output for mash dist results that can be imported in pATLAS.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • mash_output : String with the name of the mash screen output file.
    • e.g.: 'fastaFileA_mashdist.txt'
Code documentation
flowcraft.templates.mashdist2json.send_to_output(master_dict, mash_output, sample_id, assembly_file)[source]

Send dictionary to output json file. This function writes the master_dict dictionary to a json file if master_dict is populated with entries; otherwise, the file is not created.

Parameters:
master_dict: dict

Dictionary that stores all entries for a specific query sequence in the multi-fasta given to mash dist as input against the pATLAS database

last_seq: str

String that stores the last sequence parsed before writing to file, i.e., after the query sequence changes between different rows of the input file

mash_output: str

The name/path of the input file to the main function, i.e., the name/path of the mash dist output txt file.

sample_id: str

The name of the sample being parsed into the .report.json file

flowcraft.templates.mashscreen2json module
Purpose

This module is intended to generate a json output for mash screen results that can be imported in pATLAS.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • mash_output : String with the name of the mash screen output file.
    • e.g.: 'sortedMashScreenResults_SampleA.txt'
Code documentation
flowcraft.templates.megahit module
Purpose

This module is intended to execute megahit on paired-end FastQ files.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample Identification string.
    • e.g.: 'SampleA'
  • fastq_pair : Pair of FastQ file paths.
    • e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
  • kmers : Setting for megahit kmers. Can be either 'auto', 'default' or a user provided list. All must be odd, in the range 15-255, increment <= 28
    • e.g.: 'auto' or 'default' or '55 77 99 113 127'
  • clear : If 'true', remove the input fastq files at the end of the component run, IF THE FILES ARE IN THE WORK DIRECTORY

Generated output
  • contigs.fa : Main output of megahit with the assembly
    • e.g.: contigs.fa
  • megahit_status : Stores the status of the megahit run. If it was successfully executed, it stores 'pass'. Otherwise, it stores the STDERR message.
    • e.g.: 'pass'
Code documentation
flowcraft.templates.megahit.is_odd(k_mer)[source]
flowcraft.templates.megahit.set_kmers(kmer_opt, max_read_len)[source]

Returns a kmer list based on the provided kmer option and max read len.

Parameters:
kmer_opt : str

The k-mer option. Can be either 'auto', 'default' or a sequence of space separated integers, '23 45 67'.

max_read_len : int

The maximum read length of the current sample.

Returns:
kmers : list

List of k-mer values that will be provided to megahit.
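A hedged sketch of this selection (the 'auto' ladder below is an assumption; megahit itself requires odd k-mers in the 15-255 range):

def set_kmers(kmer_opt, max_read_len):
    if kmer_opt == "default":
        return []  # let megahit use its internal defaults
    if kmer_opt == "auto":
        # Assumed heuristic: a fixed odd ladder capped by the read length
        return [k for k in (21, 29, 39, 59, 79, 99, 119, 141)
                if k <= max_read_len]
    # Otherwise, a user-provided list of space separated odd integers
    return [int(k) for k in kmer_opt.split()]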

flowcraft.templates.megahit.fix_contig_names(asseembly_path)[source]

Removes whitespace from the assembly contig names

Parameters:
asseembly_path : str

Path to the assembly file.
Returns:
str:

Path to new assembly file with fixed contig names

flowcraft.templates.megahit.clean_up(fastq)[source]

Cleans the temporary fastq files. If they are symlinks, the link source is removed

Parameters:
fastq : list

List of fastq files.

flowcraft.templates.metaspades module
Purpose

This module is intended to execute metaSpades on paired-end FastQ files.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample Identification string.
    • e.g.: 'SampleA'
  • fastq_pair : Pair of FastQ file paths.
    • e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
  • kmers : Setting for Spades kmers. Can be either 'auto', 'default' or a user provided list.
    • e.g.: 'auto' or 'default' or '55 77 99 113 127'
Generated output
  • contigs.fasta : Main output of spades with the assembly
    • e.g.: contigs.fasta
  • spades_status : Stores the status of the spades run. If it was successfully executed, it stores 'pass'. Otherwise, it stores the STDERR message.
    • e.g.: 'pass'
Code documentation
flowcraft.templates.metaspades.clean_up(fastq)[source]

Cleans the temporary fastq files. If they are symlinks, the link source is removed

Parameters:
fastq : list

List of fastq files.

flowcraft.templates.metaspades.set_kmers(kmer_opt, max_read_len)[source]

Returns a kmer list based on the provided kmer option and max read len.

Parameters:
kmer_opt : str

The k-mer option. Can be either 'auto', 'default' or a sequence of space separated integers, '23 45 67'.

max_read_len : int

The maximum read length of the current sample.

Returns:
kmers : list

List of k-mer values that will be provided to Spades.

flowcraft.templates.pATLAS_consensus_json module
Purpose

This module is intended to generate a json output from the consensus results from all the approaches available through options (mapping, assembly, mash screen)

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • mapping_json : String with the name of the json file with mapping results.
    • e.g.: 'mapping_SampleA.json'
  • dist_json : String with the name of the json file with mash dist results.
    • e.g.: 'mash_dist_SampleA.json'
  • screen_json : String with the name of the json file with mash screen results.
    • e.g.: 'mash_screen_sampleA.json'
Code documentation
flowcraft.templates.pipeline_status module
Purpose

This module is intended to collect pipeline run statistics (such as time, cpu, and RAM usage for each task) into a report JSON

Expected input
  • trace_file : Trace file generated by nextflow
Code documentation
flowcraft.templates.pipeline_status.get_json_info(fields, header)[source]
Parameters:
fields
flowcraft.templates.pipeline_status.get_previous_stats(stats_path)[source]
Parameters:
stats_path
flowcraft.templates.process_abricate module
Purpose

This module is intended to parse the results of Abricate for one or more samples.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • abricate_files : Path to abricate output file.
    • e.g.: 'abr_resfinder.tsv'
Generated output

None

Code documentation
class flowcraft.templates.process_abricate.Abricate(fls)[source]

Bases: object

Main parser for Abricate output files.

This class parses one or more output files from Abricate, usually from different databases. In addition to the parsing methods, it also provides a flexible method to filter and re-format the content of the abricate files.

Parameters:
fls : list

List of paths to Abricate output files.

Methods

get_filter(*args, **kwargs) Wrapper of the iter_filter method that returns a list with results
iter_filter(filters[, databases, fields, …]) General purpose filter iterator.
parse_files(fls) Public method for parsing abricate output files.
storage = None

dict: Main storage of Abricate's file content. Each entry corresponds to a single line and contains the keys:

- ``log_file``: Name of the summary log file containing abricate results.
- ``infile``: Input file of Abricate.
- ``reference``: Reference of the query sequence.
- ``seq_range``: Range of the query sequence in the database sequence.
- ``gene``: AMR gene name.
- ``accession``: The genomic source of the sequence.
- ``database``: The database the sequence came from.
- ``coverage``: Proportion of gene covered.
- ``identity``: Proportion of exact nucleotide matches.
parse_files(fls)[source]

Public method for parsing abricate output files.

This method is called at class instantiation for the provided output files. Additional abricate output files can be added using this method after class instantiation.

Parameters:
fls : list

List of paths to Abricate files

iter_filter(filters, databases=None, fields=None, filter_behavior='and')[source]

General purpose filter iterator.

This general filter iterator allows the filtering of entries based on one or more custom filters. These filters must contain an entry of the storage attribute, a comparison operator, and the test value. For example, to filter out entries with coverage below 80:

my_filter = ["coverage", ">=", 80]

Filters should always be provided as a list of lists:

iter_filter([["coverage", ">=", 80]])
# or
my_filters = [["coverage", ">=", 80],
              ["identity", ">=", 50]]

iter_filter(my_filters)

As a convenience, a list of the desired databases can be directly specified using the databases argument, which will only report entries for the specified databases:

iter_filter(my_filters, databases=["plasmidfinder"])

By default, this method will yield the complete entry record. However, the returned fields can be restricted using the fields option:

iter_filter(my_filters, fields=["reference", "coverage"])
Parameters:
filters : list

List of lists with the custom filter. Each list should have three elements. (1) the key from the entry to be compared; (2) the comparison operator; (3) the test value. Example:

[["identity", ">", 80]].

databases : list

List of databases that should be reported.

fields : list

List of fields from each individual entry that are yielded.

filter_behavior : str

Options: 'and' or 'or'. Sets the behaviour of the filters when multiple filters are provided. By default it is set to 'and', which means that an entry has to pass all filters. It can be set to 'or', in which case only one of the filters has to pass.

Yields:
dic : dict

Dictionary object containing a Abricate.storage entry that passed the filters.

get_filter(*args, **kwargs)[source]

Wrapper of the iter_filter method that returns a list with results

It should be called exactly like iter_filter.

Returns:
_ : list

List of dictionary entries that passed the filters in the iter_filter method.

See also

iter_filter

class flowcraft.templates.process_abricate.AbricateReport(*args, **kwargs)[source]

Bases: flowcraft.templates.process_abricate.Abricate

Report generator for single Abricate output files

This class is intended to parse an Abricate output file from a single sample and database, and to generate a JSON report for the report webpage.

Parameters:
fls : list

List of paths to Abricate output files.

database : str, optional

Name of the database for the current report. If not provided, it will be inferred based on the first entry of the Abricate file.

Methods

get_filter(*args, **kwargs) Wrapper of the iter_filter method that returns a list with results
get_plot_data() Generates the JSON report to plot the gene boxes
get_table_data()
iter_filter(filters[, databases, fields, …]) General purpose filter iterator.
parse_files(fls) Public method for parsing abricate output files.
write_report_data() Writes the JSON report to a json file
get_plot_data()[source]

Generates the JSON report to plot the gene boxes

Following the convention of the reports platform, this method returns a list of JSON/dict objects with the information about each entry in the abricate file. The information contained in this JSON is:

{contig_id: <str>,
 seqRange: [<int>, <int>],
 gene: <str>,
 accession: <str>,
 coverage: <float>,
 identity: <float>
 }

Note that the seqRange entry contains the position in the corresponding contig, not the absolute position in the whole assembly.

Returns:
json_dic : list

List of JSON/dict objects with the report data.

get_table_data()[source]
write_report_data()[source]

Writes the JSON report to a json file

flowcraft.templates.process_assembly module
Purpose

This module is intended to process the output of assemblies from a single sample from programs such as Spades or Skesa. The main input is an assembly file produced by an assembler, which will then be filtered according to user-specified parameters.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id: Sample Identification string.
    • e.g.: 'SampleA'
  • assembly: Fasta file with the assembly.
    • e.g.: 'contigs.fasta'
  • opts: List of options for processing spades assembly.
    1. Minimum contig length.
      • e.g.: '150'
    2. Minimum k-mer coverage.
      • e.g.: '2'
    3. Maximum number of contigs per 1.5Mb.
      • e.g.: '100'
  • assembler: The name of the assembler
    • e.g.: spades
Generated output

(Values within ${} are substituted by the corresponding variable.)

  • '${sample_id}.assembly.fasta' : Fasta file with the filtered assembly.
    • e.g.: 'Sample1.assembly.fasta'
  • ${sample_id}.report.csv : CSV file with the results of the filters for each contig.
    • e.g.: 'Sample1.report.csv'
Code documentation
class flowcraft.templates.process_assembly.Assembly(assembly_file, min_contig_len, min_kmer_cov, sample_id)[source]

Bases: object

Class that parses and filters a Fasta assembly file

This class parses an assembly fasta file, collects a number of summary statistics and metadata from the contigs, filters contigs based on user-defined metrics and writes filtered assemblies and reports.

Parameters:
assembly_file : str

Path to assembly file.

min_contig_len : int

Minimum contig length when applying the initial assembly filter.

min_kmer_cov : int

Minimum k-mer coverage when applying the initial assembly filter.

sample_id : str

Name of the sample for the current assembly.

Methods

filter_contigs(*comparisons) Filters the contigs of the assembly according to user provided comparisons.
get_assembly_length() Returns the length of the assembly, without the filtered contigs.
write_assembly(output_file[, filtered]) Writes the assembly to a new file.
write_report(output_file) Writes a report with the test results for the current assembly
contigs = None

dict: Dictionary storing data for each contig.

filtered_ids = None

list: List of filtered contig_ids.

min_gc = None

float: Sets the minimum GC content on a contig.

sample = None

str: The name of the sample for the assembly.

report = None

dict: Will contain the filtering results for each contig.

filters = None

list: Initial filters to check when parsing the assembly file. These can be changed later using the filter_contigs method.

filter_contigs(*comparisons)[source]

Filters the contigs of the assembly according to user provided comparisons.

The comparisons must be a list of three elements with the contigs key, operator and test value. For example, to filter contigs with a minimum length of 250, a comparison would be:

self.filter_contigs(["length", ">=", 250])

The filtered contig ids will be stored in the filtered_ids list.

The result of the test for all contigs will be stored in the report dictionary.

Parameters:
comparisons : list

List with contig key, operator and value to test.

get_assembly_length()[source]

Returns the length of the assembly, without the filtered contigs.

Returns:
x : int

Total length of the assembly.

write_assembly(output_file, filtered=True)[source]

Writes the assembly to a new file.

The filtered option controls whether the new assembly will be filtered or not.

Parameters:
output_file : str

Name of the output assembly file.

filtered : bool

If True, does not include filtered ids.

write_report(output_file)[source]

Writes a report with the test results for the current assembly

Parameters:
output_file : str

Name of the output assembly file.

flowcraft.templates.process_assembly_mapping module
Purpose

This module is intended to process the coverage report from the assembly_mapping process.

TODO: Better purpose

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample Identification string.
    • e.g.: 'SampleA'
  • assembly : Fasta assembly file.
    • e.g.: 'SH10761A.assembly.fasta'
  • coverage : TSV file with the average coverage for each assembled contig.
    • e.g.: 'coverage.tsv'
  • coverage_bp : TSV file with the coverage for each assembled bp.
    • e.g.: 'coverage.tsv'
  • bam_file : BAM file with the alignment of reads to the genome.
    • e.g.: 'sorted.bam'
  • opts : List of options for processing assembly mapping output.
    1. Minimum coverage for assembled contigs. Can be ``auto``.
      • e.g.: 'auto' or '10'
    2. Maximum number of contigs.
      • e.g.: '100'
  • gsize: Expected genome size.
    • e.g.: '2.5'
Generated output
  • ${sample_id}_filtered.assembly.fasta : Filtered assembly file in Fasta format.
    • e.g.: 'SampleA_filtered.assembly.fasta'
  • filtered.bam : BAM file with the same filtering as the assembly file.
    • e.g.: filtered.bam
Code documentation
flowcraft.templates.process_assembly_mapping.parse_coverage_table(coverage_file)[source]

Parses a file with coverage information into objects.

This function parses a TSV file containing coverage results for all contigs in a given assembly and will build an OrderedDict with the information about their coverage and length. The length information is actually gathered from the contig header using a regular expression that assumes the usual header produced by Spades:

contig_len = int(re.search("length_(.+?)_", line).group(1))
Parameters:
coverage_file : str

Path to TSV file containing the coverage results.

Returns:
coverage_dict : OrderedDict

Contains the coverage and length information for each contig.

total_size : int

Total size of the assembly in base pairs.

total_cov : int

Sum of coverage values across all contigs.

flowcraft.templates.process_assembly_mapping.filter_assembly(assembly_file, minimum_coverage, coverage_info, output_file)[source]

Generates a filtered assembly file.

This function generates a filtered assembly file based on an original assembly and a minimum coverage threshold.

Parameters:
assembly_file : str

Path to original assembly file.

minimum_coverage : int or float

Minimum coverage required for a contig to pass the filter.

coverage_info : OrderedDict or dict

Dictionary containing the coverage information for each contig.

output_file : str

Path where the filtered assembly file will be generated.

flowcraft.templates.process_assembly_mapping.filter_bam(coverage_info, bam_file, min_coverage, output_bam)[source]

Uses Samtools to filter a BAM file according to a minimum coverage threshold.

Provided with a minimum coverage value, this function will use Samtools to filter a BAM file. This is performed to apply the same filter to the BAM file as the one applied to the assembly file in filter_assembly().

Parameters:
coverage_info : OrderedDict or dict

Dictionary containing the coverage information for each contig.

bam_file : str

Path to the BAM file.

min_coverage : int

Minimum coverage required for a contig to pass the filter.

output_bam : str

Path to the generated filtered BAM file.
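
A hedged sketch of the Samtools call: contigs that pass the filter are passed as regions to samtools view, which requires an indexed BAM file (the exact flags used by the template may differ):

    import subprocess

    def filter_bam(coverage_info, bam_file, min_coverage, output_bam):
        # Contigs that survive the coverage filter
        keep = [c for c, info in coverage_info.items()
                if info["cov"] >= min_coverage]
        # 'samtools view <bam> <region...>' keeps only alignments on
        # the listed contigs
        subprocess.run(["samtools", "view", "-b", "-o", output_bam,
                        bam_file] + keep, check=True)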

flowcraft.templates.process_assembly_mapping.check_filtered_assembly(coverage_info, coverage_bp, minimum_coverage, genome_size, contig_size, max_contigs, sample_id)[source]

Checks whether a filtered assembly passes a size threshold.

Given a minimum coverage threshold, this function evaluates whether the filtered assembly passes a minimum size threshold of genome_size * 1e6 * 0.8 (80% of the expected genome size) and a maximum threshold of genome_size * 1e6 * 1.5 (150% of the expected genome size). It will issue a warning if either threshold is crossed. If the assembly size falls below 80% of the expected genome size, it returns False.

Parameters:
coverage_info : OrderedDict or dict

Dictionary containing the coverage information for each contig.

coverage_bp : dict

Dictionary containing the per base coverage information for each contig. Used to determine the total number of base pairs in the final assembly.

minimum_coverage : int

Minimum coverage required for a contig to pass the filter.

genome_size : int

Expected genome size.

contig_size : dict

Dictionary with the length of each contig. Contig headers as keys and the corresponding lengths as values.

max_contigs : int

Maximum threshold for contig number. A warning is issued if this threshold is crossed.

sample_id : str

Id or name of the current sample.

Returns:
x : bool

True if the filtered assembly size is higher than 80% of the expected genome size.
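
The size check alone can be sketched as follows (a simplified signature; the real function also inspects coverage and the contig number, and gsize is given in Mb):

    def check_assembly_size(assembly_len, genome_size, sample_id):
        low = genome_size * 1e6 * 0.8   # 80% of the expected size
        high = genome_size * 1e6 * 1.5  # 150% of the expected size
        if assembly_len > high:
            print("Warning: {} assembly above 150% of the expected "
                  "genome size".format(sample_id))
        if assembly_len < low:
            print("Warning: {} assembly below 80% of the expected "
                  "genome size".format(sample_id))
            return False
        return True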

flowcraft.templates.process_assembly_mapping.get_coverage_from_file(coverage_file)[source]

Parameters:
coverage_file : str

Path to the file containing the coverage information.
flowcraft.templates.process_assembly_mapping.evaluate_min_coverage(coverage_opt, assembly_coverage, assembly_size)[source]

Evaluates the minimum coverage threshold from the value provided in the coverage_opt.

Parameters:
coverage_opt : str or int or float

If set to 'auto', the minimum coverage threshold is automatically determined as one third of the average assembly coverage, with a minimum value of 10 (see the sketch below). If set to an int or float, the specified value is used.

assembly_coverage : int or float

The average assembly coverage for a genome assembly. This value is retrieved by the parse_coverage_table() function.

assembly_size : int

The size of the genome assembly. This value is retrieved by the get_assembly_size() function.

Returns:
x : int

Minimum coverage threshold.
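
A minimal sketch of this logic; the exact 'auto' formula is an assumption based on the description above:

    def evaluate_min_coverage(coverage_opt, assembly_coverage, assembly_size):
        if coverage_opt == "auto":
            # Assumed rule: one third of the average assembly coverage,
            # never below 10. assembly_size is kept only to match the
            # documented signature.
            return max(int(assembly_coverage / 3), 10)
        return int(float(coverage_opt))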

flowcraft.templates.process_assembly_mapping.get_assembly_size(assembly_file)[source]

Returns the total number of nucleotides and the size of each contig for the provided assembly file.

Parameters:
assembly_file : str

Path to assembly file.

Returns:
assembly_size : int

Size of the assembly in nucleotides

contig_size : dict

Length of each contig (contig name as key and length as value)
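
A minimal sketch of this scan over the fasta file:

    def get_assembly_size(assembly_file):
        contig_size = {}
        name = None
        with open(assembly_file) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    name = line[1:]
                    contig_size[name] = 0
                elif name:
                    contig_size[name] += len(line)
        return sum(contig_size.values()), contig_size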

flowcraft.templates.skesa module
Purpose

This module is intended to execute Skesa on paired-end FastQ files.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample Identification string.
    • e.g.: 'SampleA'
  • fastq_pair : Pair of FastQ file paths.
    • e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
  • clear : If ‘true’, remove the input fastq files at the end of the
    component run, IF THE FILES ARE IN THE WORK DIRECTORY
Generated output
  • ${sample_id}_*.assembly.fasta : Main output of skesa with the assembly
    • e.g.: sample_1_skesa.fasta
Code documentation
flowcraft.templates.skesa.clean_up(fastq)[source]

Cleans the temporary fastq files. If they are symlinks, the link source is removed.

Parameters:
fastq : list

List of fastq files.
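
A minimal sketch of this behaviour (the guard restricting removal to files inside the nextflow work directory is omitted here):

    import os

    def clean_up(fastq):
        for fpath in fastq:
            # Resolve symlinks so that the link source is removed too
            real = os.path.realpath(fpath)
            if os.path.islink(fpath):
                os.remove(fpath)  # the link itself
            if os.path.exists(real):
                os.remove(real)   # the source (or the file itself)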

flowcraft.templates.spades module
Purpose

This module is intended to execute Spades on paired-end FastQ files.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample Identification string.
    • e.g.: 'SampleA'
  • fastq_pair : Pair of FastQ file paths.
    • e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
  • kmers : Setting for Spades kmers. Can be either 'auto', 'default' or a user provided list.
    • e.g.: 'auto' or 'default' or '55 77 99 113 127'
  • opts : List of options for spades execution.
    1. The minimum number of reads to consider an edge in the de Bruijn graph during the assembly.
      • e.g.: '5'
    2. Minimum contigs k-mer coverage.
      • e.g.: ['2' '2']
  • clear : If ‘true’, remove the input fastq files at the end of the
    component run, IF THE FILES ARE IN THE WORK DIRECTORY
Generated output
  • contigs.fasta : Main output of spades with the assembly
    • e.g.: contigs.fasta
  • spades_status : Stores the status of the spades run. If it was successfully executed, it stores 'pass'. Otherwise, it stores the STDERR message.
    • e.g.: 'pass'
Code documentation
flowcraft.templates.spades.set_kmers(kmer_opt, max_read_len)[source]

Returns a k-mer list based on the provided k-mer option and the maximum read length.

Parameters:
kmer_opt : str

The k-mer option. Can be either 'auto', 'default' or a sequence of space separated integers, '23 45 67'.

max_read_len : int

The maximum read length of the current sample.

Returns:
kmers : list

List of k-mer values that will be provided to Spades.
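
A hedged sketch of this selection; the concrete 'auto' k-mer lists and the read-length cutoff are assumptions for illustration:

    def set_kmers(kmer_opt, max_read_len):
        if kmer_opt == "auto":
            # Assumed: larger k-mers are only useful for longer reads
            if max_read_len >= 175:
                return [55, 77, 99, 113, 127]
            return [21, 33, 55, 67, 77]
        if kmer_opt == "default":
            # An empty list lets Spades pick its own defaults
            return []
        # User-provided sequence of space separated integers
        return [int(k) for k in kmer_opt.split()]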

flowcraft.templates.spades.clean_up(fastq)[source]

Cleans the temporary fastq files. If they are symlinks, the link source is removed.

Parameters:
fastq : list

List of fastq files.

flowcraft.templates.trimmomatic module
Purpose

This module is intended to execute trimmomatic on paired-end FastQ files.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample Identification string.
    • e.g.: 'SampleA'
  • fastq_pair : Pair of FastQ file paths.
    • e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
  • trim_range : Crop range detected using FastQC.
    • e.g.: '15 151'
  • opts : List of options for trimmomatic
    • e.g.: '["5:20", "3", "3", "55"]'
    • e.g.: '[trim_sliding_window, trim_leading, trim_trailing, trim_min_length]'
  • phred : List of guessed phred values for each sample
    • e.g.: '[SampleA: 33, SampleB: 33]'
  • clear : If ‘true’, remove the input fastq files at the end of the
    component run, IF THE FILES ARE IN THE WORK DIRECTORY
Generated output

The generated output consists of output files that contain an object, usually a string. (Values within ${} are substituted by the corresponding variable.)

  • ${sample_id}_*P*: Pair of paired FastQ files generated by Trimmomatic
    • e.g.: 'SampleA_1_P.fastq.gz SampleA_2_P.fastq.gz'
  • trimmomatic_status: Stores the status of the trimmomatic run. If it was successfully executed, it stores ‘pass’. Otherwise, it stores the STDERR message.
    • e.g.: 'pass'
Code documentation
flowcraft.templates.trimmomatic.parse_log(log_file)[source]

Retrieves some statistics from a single Trimmomatic log file.

This function parses Trimmomatic’s log file and stores some trimming statistics in an OrderedDict object. This object contains the following keys:

  • clean_len: Total length after trimming.
  • total_trim: Total trimmed base pairs.
  • total_trim_perc: Total trimmed base pairs in percentage.
  • 5trim: Total base pairs trimmed at 5’ end.
  • 3trim: Total base pairs trimmed at 3’ end.
Parameters:
log_file : str

Path to trimmomatic log file.

Returns:
x : OrderedDict

Object storing the trimming statistics.
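
A minimal sketch of this parser, assuming Trimmomatic's trimlog line format of read name, surviving length, first surviving base, last surviving base and amount trimmed from the end:

    from collections import OrderedDict

    def parse_log(log_file):
        clean_len = trim_5 = trim_3 = total = 0
        with open(log_file) as fh:
            for line in fh:
                # Negative indexes tolerate spaces in the read name
                fields = line.split()
                surviving = int(fields[-4])
                clean_len += surviving
                trim_5 += int(fields[-3])
                trim_3 += int(fields[-1])
                total += surviving + int(fields[-3]) + int(fields[-1])
        total_trim = trim_5 + trim_3
        return OrderedDict([
            ("clean_len", clean_len),
            ("total_trim", total_trim),
            ("total_trim_perc",
             round(total_trim / total * 100, 2) if total else 0),
            ("5trim", trim_5),
            ("3trim", trim_3),
        ])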

flowcraft.templates.trimmomatic.write_report(storage_dic, output_file, sample_id)[source]

Writes a report from multiple samples.

Parameters:
storage_dic : dict or OrderedDict

Storage containing the trimming statistics. See parse_log() for its generation.

output_file : str

Path where the output file will be generated.

sample_id : str

Id or name of the current sample.

flowcraft.templates.trimmomatic.trimmomatic_log(log_file, sample_id)[source]
flowcraft.templates.trimmomatic.clean_up(fastq_pairs, clear)[source]

Cleans the working directory of unwanted temporary files.

flowcraft.templates.trimmomatic.merge_default_adapters()[source]

Merges the default adapter files in the trimmomatic adapters directory into a single file.

Returns:
str

Path to the merged adapters file.
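
A minimal sketch, assuming the adapters directory is passed explicitly (the template instead locates it from the Trimmomatic installation):

    import glob
    import os

    def merge_default_adapters(adapters_dir):
        merged = os.path.join(os.getcwd(), "merged_adapters.fa")
        with open(merged, "w") as out:
            for fasta in sorted(glob.glob(
                    os.path.join(adapters_dir, "*.fa"))):
                with open(fasta) as fh:
                    out.write(fh.read())
        return merged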

flowcraft.templates.trimmomatic_report module
Purpose

This module is intended to parse the results of the Trimmomatic log for a set of one or more samples.

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • log_files: Trimmomatic log files.
    • e.g.: 'Sample1_trimlog.txt Sample2_trimlog.txt'
Generated output
  • trimmomatic_report.csv : Summary report of the trimmomatic logs for all samples
Code documentation
flowcraft.templates.trimmomatic_report.parse_log(log_file)[source]

Retrieves some statistics from a single Trimmomatic log file.

This function parses Trimmomatic’s log file and stores some trimming statistics in an OrderedDict object. This object contains the following keys:

  • clean_len: Total length after trimming.
  • total_trim: Total trimmed base pairs.
  • total_trim_perc: Total trimmed base pairs in percentage.
  • 5trim: Total base pairs trimmed at 5’ end.
  • 3trim: Total base pairs trimmed at 3’ end.
Parameters:
log_file : str

Path to trimmomatic log file.

Returns:
x : OrderedDict

Object storing the trimming statistics.

flowcraft.templates.trimmomatic_report.write_report(storage_dic, output_file, sample_id)[source]

Writes a report from multiple samples.

Parameters:
storage_dic : dict or OrderedDict

Storage containing the trimming statistics. See parse_log() for its generation.

output_file : str

Path where the output file will be generated.

sample_id : str

Id or name of the current sample.
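
A minimal sketch of the report writing, assuming one CSV row per sample in the key order produced by parse_log() (the exact column layout is an assumption):

    def write_report(storage_dic, output_file, sample_id):
        # sample_id is kept only to match the documented signature
        with open(output_file, "w") as out:
            out.write("Sample,clean_len,total_trim,total_trim_perc,"
                      "5trim,3trim\n")
            for sample, stats in storage_dic.items():
                out.write("{},{}\n".format(
                    sample, ",".join(str(v) for v in stats.values())))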

Module contents

Placeholder for template generation docs

flowcraft.tests package

Submodules
flowcraft.tests.data_pipelines module
flowcraft.tests.test_assemblerflow module
flowcraft.tests.test_engine module
flowcraft.tests.test_pipeline_parser module
flowcraft.tests.test_pipeline_parser.test_get_lanes()[source]
flowcraft.tests.test_pipeline_parser.test_linear_connection()[source]
flowcraft.tests.test_pipeline_parser.test_two_fork_connection()[source]
flowcraft.tests.test_pipeline_parser.test_two_fork_connection_mismatch_lane()[source]
flowcraft.tests.test_pipeline_parser.test_multi_fork_connection()[source]
flowcraft.tests.test_pipeline_parser.test_linear_lane_connection()[source]
flowcraft.tests.test_pipeline_parser.test_linear_multi_lane_connection()[source]
flowcraft.tests.test_pipeline_parser.test_get_source_lane()[source]
flowcraft.tests.test_pipeline_parser.test_get_source_lane_2()[source]
flowcraft.tests.test_pipeline_parser.test_parse_pipeline()[source]
flowcraft.tests.test_pipeline_parser.test_parse_pipeline_file()[source]
flowcraft.tests.test_pipeline_parser.test_unique_id_len()[source]
flowcraft.tests.test_pipeline_parser.test_remove_id()[source]
flowcraft.tests.test_process_details module
flowcraft.tests.test_processes module
flowcraft.tests.test_sanity module
Module contents

Submodules

flowcraft.flowcraft module

Module contents