BIO Pype

A simple and lightweight python framework to organize bioinformatics analyses.

About:

Bioinformatics tasks require a long list of parameters: genome files, indexes, databases, results of previous analyses and many more. It is not simple to keep track of software and database versions and to guarantee the reproducibility of an analysis.

This framework facilitates the organization of programs and files (eg, genomes and databases) in structured YAML files, defined in the framework interface as profiles. In addition, it provides a way to organize and execute bioinformatics tasks: each base task is wrapped in a python script, called a snippet. A snippet takes advantage of the profile; as a result, the number of arguments required to execute the script is greatly diminished. A number of snippets can be chained in a YAML file to form a pipeline. A pipeline is a series of tasks that need to be executed in a given order, by a queuing system (eg, serial execution, parallel execution, torque/moab…). Snippets and pipelines take advantage of the argparse python library to provide command line documentation upon execution.

Quick Intro

In a nutshell:
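A profile-driven run of a single snippet, and of a whole pipeline, could look like this (the commands mirror the interfaces documented below; file names are illustrative):

```
$ pype -p hg19 snippets bwa_mem -h rg_header.txt \
      -1 reads_1.fq -2 reads_2.fq -o sample.bam
$ pype -p hg19 pipelines --queue none bwa_mem \
      --fq1 reads_1.fq --fq2 reads_2.fq --header rg_header.txt \
      --bam_out sample.bam --qc_out qc --tmp_dir tmp
```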

Available presets

Presets of snippets and pipelines are available. You can access the list of repositories and manage the installed modules with the repos subcommand:

$ pype repos
usage: pype repos [-r REPO_LIST] {list,install,clean,info} ...

positional arguments:
  {list,install,clean,info}
  list                List the available repositories
  install             Install modules from selected repository
  clean               Cleanup all module folders
  info                Print location of the modules currently in use

optional arguments:
  -r REPO_LIST, --repo REPO_LIST
                    Repository list. Default:
                    /usr/lib/python2.7/site-packages/pype/repos.yaml

Overview of the framework

The framework is heavily based on the python argparse module, aiming for a self-explanatory use of the main command line interface: pype.

The management of tool versions and sources (assembly/organisms) is organized in profile files.

$ pype profiles
usage: pype profiles {info,check}

positional arguments:
  {info,check}
    info        Retrieve details from available profiles
    check       Check if a profile is valid

$ pype profiles info --all
     default:          Default profile b37 from 1k genome project
     hg19:             Profile for the UCSC hg19 assembly
     hg38:             Profile for the hg38 assembly human genome

Each profile is defined by a YAML file, an example:

info:
    description: HG19 default profile
    date:        10/03/2016
genome_build: 'hg19'
files:
    genome_fa:    /path/to/b37/genome.fa
    genome_fa_gz: /path/to/b37/genome.fa.gz
    exons_list:   /path/to/exome_hg37_interval_list.txt
    db_snp:       /path/to/dbSNP/dbsnp_138.b37.vcf.gz
    cosmic:       /path/to/cosmic/CosmicCodingMuts.vcf.gz

programs:
    samtools_0:
        path: samtools
        version: 0.1.18
    samtools_1:
        path: samtools
        version: 1.2

Each tool is wrapped in a python function (a snippet), which is designed to load the programs specified in the selected profile, and usually executes the tool using the subprocess module.
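In essence (a minimal, hypothetical sketch: `run_program` and the dictionary layout only mirror the `programs` section of the profile YAML above, and are not part of the bio_pype API):

```python
import subprocess

# Hypothetical sketch of what a snippet does: look up the program path
# in the profile's "programs" section and execute it with subprocess.
# run_program is illustrative and not part of the bio_pype API.
def run_program(programs, name, args):
    cmd = [programs[name]['path']] + args
    return subprocess.check_output(cmd).decode()

# A "programs" fragment, as it would be parsed from the profile YAML
programs = {'echo_0': {'path': 'echo', 'version': '1.0'}}
print(run_program(programs, 'echo_0', ['done']))
```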

$ pype snippets
error: too few arguments
usage: pype snippets [--log LOG]
                {biobambam_merge,mutect,oncotator,bwa_mem,freebayes}
                ...

positional arguments:
{biobambam_merge,mutect,oncotator,bwa_mem,freebayes}
biobambam_merge     Mark duplicates and merge BAMs with biobambam
mutect              Call germline and somatic variants with mutect
oncotator           Annotate various file format with Oncotator
bwa_mem             Align fastQ file with the BWA mem algorithm
freebayes           A haplotype-based variant detector

The tools can be chained and combined with one another within YAML files (pipelines), which specify dependencies and input files. The profile is parsed to provide command line options via the argparse interface.

$ pype pipelines
error: too few arguments
usage: pype pipelines [--queue {msub,echo,none}] [--log LOG]
                 {bwa,freebayes,mutect,bwa_mem}
                 ...

positional arguments:
{bwa,freebayes,mutect,bwa_mem}
bwa                 Align with bwa mem, merge different lanes and
                    performs various QC stats
freebayes           Freebayes pipeline from Fastq to VCF
mutect              Mutect from fastq files
bwa_mem             Align with bwa mem and performs QC stats

optional arguments:
--queue {msub,echo,none}
                   Select the queuing system to run the pipeline
--log LOG          Path used to write the pipeline logs. Default
                   working directory.

A typical scenario that demonstrates the advantages offered by this package is the alignment of two DNA sequencing (NGS) samples, a normal and a tumor sample, followed by a comparison between the two samples to find somatic mutations. On the command line, this task requires, for each sample, a command for the alignment, a command for sorting and a command for indexing. After the two samples are processed, the results need to be compared by a mutation caller, usually specifying some databases such as COSMIC or dbSNP.

The easiest and fastest way to implement these tasks is to wrap the command lines into bash scripts and submit them to a queuing system, or run them directly in a shell. This imposes a self-managed organization of logs and scripts, and still possibly results in a lot of maintenance to keep the scripts up to date.

Other pipeline systems can solve the problem, but they may not be executable on the fly from the command line interface (eg, they need a configuration file prior to execution), or they may not allow changing the genome build or database version easily.

With the bio_pype framework each step can be launched separately. Example of usage of the alignment snippet:

pype snippets bwa_mem
usage:
pype snippets bwa_mem -h HEADER -1 F1 -2 F2 [-t TMP] [-o OUT]

optional arguments:
  -h HEADER          Header file containing the @RG groups,
                     or the comma separated header line
  -1 F1              First mate fastQ file
  -2 F2              Second mate fastQ file
  -t TMP, --tmp TMP  Temporary folder
  -o OUT, --out OUT  Output name for the bam file

or use the pipeline:

pype pipelines mutect
usage: pype pipelines mutect --tumor_bam TUMOR_BAM --out_dir OUT_DIR
                             --sample_name SAMPLE_NAME --normal_bam NORMAL_BAM
                             [--tmp_dir TMP_DIR] --tumor_bwa_list
                             TUMOR_BWA_LIST --qc_bam_tumor QC_BAM_TUMOR
                             --normal_bwa_list NORMAL_BWA_LIST --qc_bam_normal
                             QC_BAM_NORMAL

Required:
  Required pipeline arguments

  --tumor_bam TUMOR_BAM
                        Input Bam file of the tumor sample, type: str
  --out_dir OUT_DIR     Output file name prefix for the analysis, type: str
  --sample_name SAMPLE_NAME
                        Sample name or identifier of the run, type: str
  --normal_bam NORMAL_BAM
                        Input Bam file of the normal/control sample, type: str
  --tumor_bwa_list TUMOR_BWA_LIST
                        Batch file to run the bwa_mem pipeline on different
                        lanes of the tumor, type: str
  --qc_bam_tumor QC_BAM_TUMOR
                        Tumor BAM files QC directory output path, type: str
  --normal_bwa_list NORMAL_BWA_LIST
                        Batch file to run the bwa_mem pipeline on different
                        lanes of the normal/control, type: str
  --qc_bam_normal QC_BAM_NORMAL
                        Normal/Control BAM files QC directory output path,
                        type: str

Optional:
  Optional pipeline arguments

  --tmp_dir TMP_DIR     temporary folder, type: str. Default: /scatch

Python and Environment Modules:

Example of loading the hg37 profile in the python prompt:

from pype.modules.profiles import get_profiles

profiles = get_profiles({})
profile = profiles['hg37']
genome = profile.files['genome_fa_gz']

The framework also supports Environment Modules:

from pype.modules.profiles import get_profiles
from pype.env_modules import get_module_cmd, program_string

profiles = get_profiles({})
profile = profiles['hg37']
# get_module_cmd returns the interface to the "module" command
module = get_module_cmd()
module('add', program_string(profile.programs['samtools_1']))

Installation

Installation from git:

Note

It is strongly advised to use virtualenv to install the module locally.

git clone https://bitbucket.org/ffavero/bio_pype
cd bio_pype
python setup.py test
python setup.py install

Using virtualenv:

This code guides through the installation and use of virtualenv when the main python installation does not include the virtualenv module.

mkdir ~/venv_files
# Download virtualenv files (not installed in the python version
# used in computerome)
curl -L -o ~/venv_files/setuptools-20.2.2-py2.py3-none-any.whl \
   https://github.com/pypa/virtualenv/raw/develop/virtualenv_support/setuptools-20.2.2-py2.py3-none-any.whl
curl -L -o ~/venv_files/pip-8.1.0-py2.py3-none-any.whl \
   https://github.com/pypa/virtualenv/raw/develop/virtualenv_support/pip-8.1.0-py2.py3-none-any.whl
curl -L -o ~/venv_files/virtualenv.py \
   https://github.com/pypa/virtualenv/raw/develop/virtualenv.py

# Create a virtual environment
python ~/venv_files/virtualenv.py --extra-search-dir=~/venv_files \
   ~/venv_bio_pype
# Activate the environment
source ~/venv_bio_pype/bin/activate

# Install bio_pype as instructed above from git

# The pype command line utility will be available upon the "activation"
# of the environment or by specifying the full path, in this case:

# ~/venv_bio_pype/bin/pype

# To deactivate the virtual env:
deactivate

Usage

Basic usage

Access first level modules:

$ pype
error: too few arguments
usage: pype [-p PROFILE] {snippets,profiles} ...

Slim and simple framework to ease the managements of data, tools and
pipelines in the computerome HPC

positional arguments:
    snippets        Snippets
    profiles        List of profiles

optional arguments:
  -p PROFILE, --profile PROFILE
                    Select profile to set reference genome and tools
                    versions

This is version 0.0.1 - Francesco Favero - 7 March 2016

Access submodules in the selected level:

$ pype snippets
error: too few arguments
usage: pype snippets {test} ...

positional arguments:
  {test}
    test  test_snippets

Execute or access submodule:

$ pype snippets test
error: argument --test is required
usage: pype snippets test --test TEST [-o OPT]

optional arguments:
  --test TEST  Test metavar
  -o OPT       test option
$ pype snippets test --test World
Hello World
$ pype snippets test --test mate -o Cheers
Cheers mate

It is possible to pass local folders from which to pick custom snippets and profiles via environment variables:

$ PYPE_SNIPPETS="/path/to/custom_snippets" \
  PYPE_PROFILES="/path/to/custom_profiles" pype

Alternatively, it is possible to set custom PYPE_SNIPPETS and PYPE_PROFILES in the ~/.bashrc file (or an alternative shell configuration file).

API reference

Snippets

Writing new Snippets:

The snippets are located in a python module (mind the __init__.py in the folder containing the snippets). In order to function, each snippet needs to define 4 specific functions:

  1. requirements: a function returning a dictionary with the necessary resources to run the snippet (used to allocate resources in queuing systems)
  2. results: a function accepting a dictionary with the snippet arguments and returning a dictionary listing all the files produced by the execution of the snippet
  3. add_parser: a function that implements the argparse module and defines the command line arguments accepted by the snippet
  4. a function named after the snippet file name (without the .py extension), containing the code for the execution of the tool

A very minimalistic snippet structure:


def requirements():
    return({'ncpu': 1, 'time': '00:01:00'})


def results(argv):
    return([])


def add_parser(subparsers, module_name):
    return subparsers.add_parser(module_name, help='test_snippets', add_help=False)


def test_args(parser, subparsers, argv):
    parser.add_argument('--test', dest='test',
                        help='Test metavar', required=True)
    parser.add_argument('-o', dest='opt', type=str,
                        help='test option')
    return parser.parse_args(argv)


def test(parser, subparsers, module_name, argv, profile, log):
    args = test_args(add_parser(subparsers, module_name), subparsers, argv)
    if args.opt:
        greetings = args.opt
    else:
        greetings = 'Hello'
    print(" ".join(map(str, [greetings, args.test])))

An example of a snippet using the Environment Modules and the profile object:

Programmatic snippets access:

It is also possible to launch a snippet programmatically, from the python console or within other python programs, although the interface is not well documented yet.

import os
import argparse
os.environ['PYPE_SNIPPETS'] = 'test/data/snippets/'

from pype.modules import snippets

parser = argparse.ArgumentParser(prog='pype', description='Test')
subparsers = parser.add_subparsers(dest='modules')

snippets.snippets(None, subparsers, None, [
                  '--log', 'test/data/tmp', 'test',
                  '--test', 'World'], 'default')
snippets.snippets(None, subparsers, 'test', [
                  '--log', 'test/data/tmp', 'test',
                  '--test', 'World', '-o', 'Hello!'], 'default')

Pipelines

Writing new Pipelines:

The pipelines are YAML files located in a python module (mind the __init__.py in the folder containing the pipelines). The location of the module containing the pipelines can be controlled by the environment variable PYPE_PIPELINES.

The YAML file structure requires a header field called info and a main field items with the execution information.

The info field expects the following keys:

  1. description: a brief description of the pipeline
  2. date: a string for tracking recent modifications of the pipeline
  3. arguments: not a required key, but it can be used to add information and descriptions of the arguments required by the pipeline

The items key is constructed by combining multiple blocks in a hierarchical manner. These blocks are called PipelineItems.

  • A PipelineItem is a block of information linked to the execution of one or more tasks.
  • A PipelineItem can wrap a snippet, another pipeline, or the batch execution of snippets or pipelines.

The input of a batch snippet/pipeline is defined by the argument --input_batch. It consists of a list of predefined arguments or of a tab-separated file in which each line defines a run of the snippet/pipeline and each column refers to an argument of the snippet/pipeline (the column names define the specific arguments).
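For instance, a batch file driving several runs of an alignment snippet could look like the following tab-separated table, where each column header names a snippet argument and each row defines one run (the column and file names are illustrative):

```
fq1	fq2	out
lane1_R1.fq	lane1_R2.fq	lane1.bam
lane2_R1.fq	lane2_R2.fq	lane2.bam
```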

Possible attributes in a PipelineItem:

  • Required:
    • type: possible values: snippets, pipeline, batch_snippets or batch_pipeline
    • name: name of the item to look for in the specified type category
    • arguments: specific arguments for the selected type/name function
  • Optional:
    • mute: if True the PipelineItem will not return any results, hence it will not be listed as a dependency of parent processes
    • dependencies: a list of other PipelineItems
    • requirements: a dictionary to alter the default in-snippet requirements (applies only to snippets and batch_snippets)

An example of a pipeline using the Environment Modules and the profile object:
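A hedged sketch of what such a pipeline file could look like, matching the bwa_mem interface documented in this section (the bare-name argument placeholders and the `qc_bam` snippet name are illustrative assumptions, not the documented syntax):

```yaml
info:
    description: Align with bwa mem and perform QC stats
    date: 14/03/2016
    arguments:
        fq1: First mate of fastq pairs
        fq2: Second mate of fastq pairs
        header: Header with the @RG groups, comma separated
        bam_out: Resulting bam file
        qc_out: Path to store the QC output
        tmp_dir: Temporary directory
items:
    qc_stats:
        type: snippets
        name: qc_bam            # hypothetical QC snippet
        arguments:
            '-b': bam_out
            '-o': qc_out
        dependencies:
            align:
                type: snippets
                name: bwa_mem
                arguments:
                    '-h': header
                    '-1': fq1
                    '-2': fq2
                    '-t': tmp_dir
                    '-o': bam_out
                requirements:
                    ncpu: 4
```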

This results in the command line interface:

pype pipelines bwa_mem

error: argument --qc_out is required
usage: pype pipelines bwa_mem --qc_out QC_OUT --tmp_dir TMP_DIR --bam_out
                              BAM_OUT --fq1 FQ1 --fq2 FQ2 --header HEADER

optional arguments:
  --qc_out QC_OUT    Path to store the QC output, type: str
  --tmp_dir TMP_DIR  Temporary directory, type: str
  --bam_out BAM_OUT  Resulting bam file, type: str
  --fq1 FQ1          First mate of fastq pairs, type: str
  --fq2 FQ2          Second mate of fastq pairs, type: str
  --header HEADER    @RG group header, comma separated, type: str

Programmatic pipelines access:

It is also possible to launch a pipeline programmatically, from the python console or within other python programs. As for the snippets, the interface is not well documented yet.

import os
import argparse
os.environ['PYPE_SNIPPETS'] = 'test/data/snippets'
os.environ['PYPE_PIPELINES'] = 'test/data/pipelines'

from pype.modules import pipelines

parser = argparse.ArgumentParser(prog='pype', description='Test')
subparsers = parser.add_subparsers(dest='modules')

input_fa = 'test/data/files/input.fa'
rev_fa = 'test/data/tmp/rev.fa'
compl_fa = 'test/data/tmp/rev_comp.fa'
out_fa = 'test/data/tmp/rev_comp_low.fa'
pipelines.pipelines(None, subparsers, None, [
          '--queue', 'none', '--log', 'test/data/tmp',
          'rev_compl_low_fa', '--input_fa', input_fa,
          '--reverse_fa', rev_fa, '--complement_fa',
          compl_fa, '--output', out_fa], 'default')

Profiles

Extending profiles:

Similarly to the pipelines, the profiles are YAML files in a python module (it requires an __init__.py file in the folder containing the files). In order to be parsed correctly, the profile YAML file needs to have a specific structure:

info:
   description: Test Profile
   date:        14/03/2016

files:
   genome_fa:    /abs/path/to/fasta.fa
   genome_fa_gz: /abs/path/to/fasta.fa.gz

programs:
   samtools_0:
      path: samtools
      version: 0.1.19
   samtools_1:
      path: samtools
      version: 1.2
   bwa:
      path: bwa
      version: 0.7.10
   star:
      path: star
      version: 2.5.1b
