Public Health England SNP calling Pipeline (PHEnix)

This documentation provides an overview as well as a detailed API reference for Public Health England's single nucleotide polymorphism (SNP) calling pipeline. Currently the pipeline does not provide a stand-alone means of calling SNPs; instead it interfaces with other published tools, e.g. BWA and samtools.

Contents:

Introduction

Author:Public Health England
Date:28 June, 2016

Installation

From source:

git clone https://github.com/phe-bioinformatics/PHEnix.git
pip2 install -e PHEnix

Directly from github:

pip install git+https://github.com/phe-bioinformatics/PHEnix.git

Note

Installing from Pip - Coming Soon.

Overview

This code was designed to allow users to input fastq files and a reference sequence and perform:

  • Reference mapping
  • VCF generation
  • VCF filtering
  • FASTA sequence of SNPs

The process comprises three steps:

  • Reference sequence preparation (prepare_reference.py)
  • Mapping and filtered VCF generation (run_snp_pipeline.py)
  • FASTA generation from single or multiple VCFs (vcf2fasta.py)

Example:

Prepare a BWA and GATK reference using a fasta file (myref.fasta):

phenix.py prepare_reference \
--mapper bwa \
--variant gatk \
--reference myref.fasta

Map, call and filter variants on fastq files (my.R1.fastq, my.R2.fastq), filtering SNPs on minimum depth, mapping quality and AD ratio:

phenix.py run_snp_pipeline \
-r1 my.R1.fastq \
-r2 my.R2.fastq \
-r myref.fasta \
--sample-name mysample \
--mapper bwa \
--variant gatk \
--filters min_depth:5,mq_score:30,ad_ratio:0.9

Generate a FASTA file of SNPs using the filtered VCFs in the current directory:

phenix.py vcf2fasta -d ./ -o output.fasta --regex filtered

Requirements

A lot of functionality depends on the presence of existing 3rd-party tools:

Mappers:

  • BWA
  • Bowtie2

Variant Callers:

  • GATK
  • mpileup (samtools/bcftools)

In order for them to function properly, they need to already be in your PATH. For commands that run through Java archives, please set the appropriate environment variables (see below).

Python

  • Python >= 2.7
  • argparse
  • PyVCF
  • PyYAML
  • matplotlib (_optional_)
  • bintrees (_optional_) - If you are using vcf2fasta.py
  • numpy (_optional_)
  • matplotlib.venn (_optional_)
  • psycopg2 (_optional_)

3rd Party Requirements

Samtools

Samtools can be downloaded from https://github.com/samtools/samtools. It is used to filter and convert SAM/BAM files and in the mpileup variant caller.

BCFTools

BCFtools can be downloaded from https://github.com/samtools/bcftools. It is used for calling variants in mpileup.

BWA

Heng Li’s BWA mapper can be downloaded from https://github.com/lh3/bwa.

Bowtie2

The Bowtie2 mapper is available from https://github.com/BenLangmead/bowtie2.

GATK

Set GATK_JAR - the full path to the GATK Java archive.

Picard Tools

Picard is needed by GATK to create a dictionary of the reference fasta. Either set PICARD_TOOLS_PATH - the path to the directory where the different Picard jars are - or set PICARD_JAR - the path to picard.jar. Older Picard distributions have many different jars (use the first option), whereas newer versions have merged them all into one jar file.
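For example, the environment variables might be set like this (the jar paths below are hypothetical; adjust them to your installation):

```shell
# Hypothetical installation paths - adjust to where the jars live on your system.
export GATK_JAR=/opt/gatk/GenomeAnalysisTK.jar
export PICARD_JAR=/opt/picard/picard.jar
# For older Picard distributions with separate jars, use instead:
# export PICARD_TOOLS_PATH=/opt/picard-tools
```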

Prerequisites

The reference needs to be indexed in the appropriate way for the different mapper/variant caller combinations. This can be done manually for specific tools or using:

prepare_reference.py --mapper [bwa | bowtie2] \
--variant [gatk | mpileup] \
--reference <path_to_reference>

Internally, the pipeline does the indexing automatically, because the software assumes that a reference is present for each sample. For other use cases the indexing has to be done up front, because multiple processes trying to index the same file at once may lead to corrupted indices.

Note

The reference file can’t have the .fas extension, because Picard Tools throws an exception for it.

Filters

One of the key parts of the VCF processing is to filter calls by quality. To do this we have created a flexible interface with a variety of filters available:

  • qual_score - Filter records where the QUAL score is below the given threshold.
  • ad_ratio - Filter records, defined by gatk, where the ratio of the alt allele to the sum of all alleles is below the given fraction.
  • dp4_ratio - Similar to ad_ratio, but used with the mpileup variant caller.
  • mq_score - Filter records that fall below the specified MQ score (from the _INFO_ field).
  • mq0_ratio - Filter records, defined by gatk, with an MQ0 to DP ratio _above_ the given threshold (both values are from the _INFO_ field).
  • mq0f_ratio - Similar to mq0_ratio, but used with the mpileup variant caller.
  • gq_score - Filter records that fall below the specified GQ score (from the first sample).
  • min_depth - Filter records with mean depth below the specified threshold (DP from the sample field).
  • uncall_gt - Filter records with uncallable genotypes in the VCF (GT is ./.).

All filters are applied at each position, and positions that pass ALL filters are kept as quality calls. Positions failing filters are retained for future reference and for creating FASTA files, when needed. To specify the filters to be used, simply list them as key:threshold pairs separated by commas (,). For filters that don’t require a threshold, leave the part after ‘:’ blank.
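The key:threshold format can be sketched with a small parser (illustrative only; PHEnix provides its own str_to_filters for this, and the numeric-threshold assumption is ours):

```python
def parse_filter_string(filters):
    """Parse a 'key:threshold' comma-separated filter string into a dict.

    Filters that take no threshold are left blank after ':' and map to None.
    """
    parsed = {}
    for item in filters.split(","):
        key, _, value = item.partition(":")
        parsed[key.strip()] = float(value) if value else None
    return parsed

print(parse_filter_string("min_depth:5,mq_score:30,uncall_gt:"))
```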

Annotators

Individual VCFs can be annotated using custom annotators. Currently available annotators:

  • coverage - Annotates with information about _mean_ and _dev_ of the depth in the VCF (using DP from INFO).

The data can be accessed from the metadata field in the VCF in the following way:

import vcf

r = vcf.Reader(filename="/path/to/vcf")
print(r.metadata["coverageMetaData"][0]["mean"])
print(r.metadata["coverageMetaData"][0]["dev"])

Calling SNPs

If you have a sample and you want a one-stop-shop analysis, run the following:

$ phenix.py run_snp_pipeline \
-r1 <path to R1.fastq> \
-r2 <path to R2.fastq> \
-r <path to reference> \
--sample-name <NAME> \
--mapper bwa \
--variant gatk \
--filters min_depth:5,mq_score:30 \
-o <path to output directory>

This will map with BWA and call variants with GATK. Intermediate files are written into the specified output directory. The --sample-name option is very important: it determines what the output files will be called and the read group in the BAM file. If you omit it, test_sample will be used.

Config

If you have a pipeline, or want to run the same settings for different samples, this process can be simplified by using a single config file. The file is in YAML format and takes common options that can be shared across samples:

  • mapper - string
  • mapper-options - string
  • variant - string
  • variant-options - string
  • filters - dictionary
  • annotators - list

Below is an example of a config file:

mapper: bwa
mapper-options: -t 4
variant: gatk
variant-options: --sample_ploidy 2 --genotype_likelihoods_model SNP -rf BadCigar -out_mode EMIT_ALL_SITES -nt 1
filters:
  ad_ratio: 0.9
  min_depth: 5
  qual_score: 30
  mq_score: 30
annotators:
  - coverage
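A config like the one above can be read with PyYAML (which PHEnix already requires); the loading code below is an illustrative sketch, not the pipeline's own:

```python
import yaml

# Inline copy of a PHEnix-style config; in practice this would be a file.
config_text = """
mapper: bwa
mapper-options: -t 4
variant: gatk
filters:
  ad_ratio: 0.9
  min_depth: 5
annotators:
  - coverage
"""

config = yaml.safe_load(config_text)
print(config["mapper"])
print(config["filters"]["min_depth"])
```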

Converting to FASTA

A lot of downstream applications take FASTA formatted files as input, not VCFs. We have included a script for converting VCF data to FASTA format.

$ vcf2fasta.py -d <path to directory of VCFs> -o <path to output FASTA>

This tool can also filter out samples and columns that may be of bad quality. E.g. --sample-Ns specifies the maximum fraction of Ns allowed in a sample, --Ns specifies the maximum fraction of Ns allowed per column, and --with-mixtures specifies the fraction above which mixed positions are output. First samples are sifted out, then columns are checked.

--with-stats outputs the genomic positions of the called SNPs. Each line corresponds to a character position in the FASTA file, e.g. to get the position of the 23rd SNP, go to the 23rd line in the positions file. With this option, if numpy and matplotlib are installed, a plots directory will be created with a summary plot of called SNPs, Ns, mixtures, and bases with 0 depth. Providing --with-dist-matrix will also write a distance matrix to the specified path.
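The sample-then-column sifting described above can be sketched as follows (this is not PHEnix's implementation; filter_alignment and its default thresholds are hypothetical):

```python
def filter_alignment(seqs, max_sample_ns=0.5, max_column_ns=0.5):
    """seqs: dict of sample name -> aligned sequence (equal lengths)."""
    # First, sift out samples whose fraction of Ns is too high...
    kept = {name: seq for name, seq in seqs.items()
            if seq.upper().count("N") / float(len(seq)) <= max_sample_ns}
    if not kept:
        return {}
    # ...then drop columns with too many Ns among the remaining samples.
    length = len(next(iter(kept.values())))
    good = [i for i in range(length)
            if sum(seq[i].upper() == "N" for seq in kept.values())
               / float(len(kept)) <= max_column_ns]
    return {name: "".join(seq[i] for i in good) for name, seq in kept.items()}

result = filter_alignment({"s1": "ACGT", "s2": "ANNT", "s3": "NNNN"})
```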

API Reference

Filters

One of the key aims of this project is to provide a flexible and extendable interface for filtering VCFs. While there are a number of tools that will process a VCF file, we felt that none offered the in-depth understanding that we wanted. Instead, we have based this library on PyVCF and extended it with a variety of filters (see Filters).

If you feel that there is a filter that has not been implemented, or you want to try your hand at implementing object-oriented classes, then please see Implementing Filter.

Mappers

We found there was no single neat way of calling different mappers (there is BioPython, of course, but it wasn’t feasible to integrate it into our software stack in the way that we wanted). Instead, we wrote a simple interface that can be easily extended to add new mappers. See Implementing Mapper.

Variant Callers

Just like with mappers, there was no neat way of incorporating other variant callers into this software stack. To add your favourite variant caller, see Implementing Variant Caller.

phe

Subpackages
phe.annotations
Submodules
phe.annotations.CoverageAnnotator module
Date:10 Nov, 2015
Author:Alex Jironkin
class CoverageAnnotator(config=None)[source]

Bases: phe.annotations.Annotator

classdocs

Methods

annotate([vcf_path])
get_meta()
get_meta_values()
annotate(vcf_path=None)[source]
get_meta_values()[source]
name = 'coverage'
Module contents
Date:10 Nov, 2015
Author:Alex Jironkin
class Annotator(name)[source]

Bases: phe.metadata.PHEMetaData

Base class for Annotators

Attributes

name  

Methods

annotate([vcf_path])
get_meta()
get_meta_values()
annotate(vcf_path=None)[source]
get_meta()[source]
get_meta_values()[source]
name = None
available_annotators()[source]

Return list of available filters.

dynamic_annotator_loader()[source]

Fancy way of dynamically importing existing annotators.

Returns:

dict:

Available annotators dictionary. Keys are parameters that can be supplied to the filters.

make_annotators(config)[source]

Create a list of annotators from config.

Parameters:

config: dict, optional

Dictionary with name: value pairs. For each name, an appropriate Annotator will be found and instantiated.

Returns:

list:

List of Annotator annotators.

phe.mapping
Submodules
phe.mapping.BWAMapper module

Implementation of the Mapper class using BWA (Heng Li) mapper.

Date:17 Sep, 2015
Author:Alex Jironkin
class BWAMapper(cmd_options=None)[source]

Bases: phe.mapping.Mapper

BWA mapper developed by Heng Li.

Methods

create_aux_files(ref)
get_info([plain])
get_meta()
get_samtools_version() Get version of samtools used.
get_version()
make_bam(*args, **kwargs) Make indexed BAM from reference, and fastq files.
make_sam(*args, **kwargs)
validate() Validate itself, by checking appropriate commands can run.
create_aux_files(ref)[source]
get_info(plain=False)[source]
get_version()[source]
make_sam(*args, **kwargs)[source]
name = 'bwa'

Plain text name of the mapper.

phe.mapping.Bowtie2Mapper module

Implementation of the Mapper class using Bowtie2 mapper.

Date:17 Sep, 2015
Author:Alex Jironkin
class Bowtie2Mapper(cmd_options=None)[source]

Bases: phe.mapping.Mapper

Bowtie2 mapper

Methods

create_aux_files(ref)
get_info([plain])
get_meta()
get_samtools_version() Get version of samtools used.
get_version()
make_bam(*args, **kwargs) Make indexed BAM from reference, and fastq files.
make_sam(*args, **kwargs)
validate() Validate itself, by checking appropriate commands can run.
create_aux_files(ref)[source]
get_info(plain=False)[source]
get_version()[source]
make_sam(*args, **kwargs)[source]
name = 'bowtie2'

Plain text name of the mapper.

phe.mapping.mapping_factory module
Date:22 Sep, 2015
Author:Alex Jironkin
available_mappers()[source]

Return a list of available mappers.

dynamic_mapper_loader()[source]

Fancy way of dynamically importing existing mappers.

Returns:

dict:

Available mappers dictionary. Keys are names that can be used for existing mapper implementations.

factory(mapper=None, custom_options=None)[source]

Create an instance of a mapper from _config_.

Parameters:

mapper: str, optional

Name of the mapper to initialise.

custom_options: str, optional

Custom options to be passed to the mapper.

Returns:

phe.mapping.Mapper:

Instance of a phe.mapping.Mapper with provided name, or None if the mapper could not be found.

Module contents

Mapping related classes and functions.

Implementing Mapper

This project does not provide a novel way of mapping or calling variants. Instead it interfaces with available mappers through their command line interfaces. Hopefully, we have made it straightforward to add your favourite mapper for others to use.

To do this, you will need to implement the Mapper class. There are a number of methods that need to be implemented for your mapper to work:

Bear in mind that cmd_options will be passed on, if they are specified on the command line or in the config under mapper-options.

Once you have implemented this interface, save the file in the mapping directory and it should be automatically picked up. To verify run:

run_snp_pipeline.py --help

You should see your mapper in the list of available mappers. If the mapper works, it will also be included automatically in the documentation.

Date:22 Sep, 2015
Author:Alex Jironkin
class Mapper(cmd_options=None)[source]

Bases: phe.metadata.PHEMetaData

Abstract Mapper class that provides generic interface to the mapping implementation for a particular mapper.

Attributes

name Return name of the mapper (implemented by individual classes).

Methods

create_aux_files(ref) Create required auxiliary files for ref.
get_info([plain]) Get information about the mapper.
get_meta()
get_samtools_version() Get version of samtools used.
get_version() Get the version of the underlying command used.
make_bam(*args, **kwargs) Make indexed BAM from reference, and fastq files.
make_sam(*args, **kwargs) Make SAM from reference, and fastq files.
validate() Validate itself, by checking appropriate commands can run.
create_aux_files(ref)[source]

Create required auxiliary files for ref.

Parameters:

ref: str

Path to the reference file.

Returns:

bool:

True if auxiliary files were created, False otherwise.

get_info(plain=False)[source]

Get information about the mapper.

get_meta()[source]
get_samtools_version()[source]

Get version of samtools used. Returned as a triple (major, minor, patch).

get_version()[source]

Get the version of the underlying command used.

make_bam(*args, **kwargs)[source]

Make indexed BAM from reference, and fastq files.

Parameters:

ref: str

Path to the reference file used for mapping.

R1: str

Path to the R1/Forward reads file in FastQ format.

R2: str:

Path to the R2/Reverse reads file in FastQ format.

out_file: str

Name of the output file where BAM data is written.

sample_name: str

Name of the sample to be included in the read group (RG) field.

Returns:

bool:

True iff mapping, converting to BAM and indexing returns 0, False otherwise.

make_sam(*args, **kwargs)[source]

Make SAM from reference, and fastq files.

Parameters:

ref: str

Path to the reference file used for mapping.

R1: str

Path to the R1/Forward reads file in FastQ format.

R2: str:

Path to the R2/Reverse reads file in FastQ format.

out_file: str

Name of the output file where SAM data is written.

sample_name: str

Name of the sample to be included in the read group (RG) field.

make_aux: bool, optional

True/False to make auxilliary files (default: False).

Returns:

bool:

True iff mapping returns 0, False otherwise.

name

Return name of the mapper (implemented by individual classes).

validate()[source]

Validate itself, by checking appropriate commands can run.

phe.metadata
Module contents

Metadata related information.

Date:22 Sep, 2015
Author:Alex Jironkin
class PHEMetaData[source]

Bases: object

Abstract class to provide interface for meta-data creation.

Methods

get_meta() Get the metadata.
get_meta()[source]

Get the metadata.

phe.variant
Submodules
phe.variant.GATKVariantCaller module
Date:22 Sep, 2015
Author:Alex Jironkin
class GATKVariantCaller(cmd_options=None)[source]

Bases: phe.variant.VariantCaller

Implementation of the Broad Institute’s variant caller.

Methods

create_aux_files(ref) Create auxiliary files needed for this variant.
get_info([plain])
get_meta() Get the metadata about this variant caller.
get_version()
make_vcf(*args, **kwargs)
validate()
create_aux_files(ref)[source]

Create auxiliary files needed for this variant.

Tools needed: samtools and Picard tools. Picard tools is a Java library whose location can be defined using an environment variable: PICARD_JAR, specifying the path to picard.jar, or PICARD_TOOLS_PATH, specifying the path to the directory where the separate jars are (older versions, before the jars were merged into a single picard.jar).

Parameters:

ref: str

Path to the reference file.

Returns:

bool:

True if auxiliary files were created, False otherwise.

get_info(plain=False)[source]
get_version()[source]
make_vcf(*args, **kwargs)[source]
name = 'gatk'

Plain text name of the variant caller.

validate()[source]
phe.variant.MPileupVariantCaller module
Date:22 Sep, 2015
Author:Alex Jironkin
class MPileupVariantCaller(cmd_options=None)[source]

Bases: phe.variant.VariantCaller

Implementation of the samtools mpileup variant caller.

Methods

create_aux_files(ref) Index reference with faidx from samtools.
get_info([plain])
get_meta() Get the metadata about this variant caller.
get_version()
make_vcf(*args, **kwargs)
validate() Check if the command can run.
create_aux_files(ref)[source]

Index reference with faidx from samtools.

Parameters:

ref: str

Path to the reference file.

Returns:

bool:

True if auxiliary files were created, False otherwise.

get_info(plain=False)[source]
get_version()[source]
make_vcf(*args, **kwargs)[source]
name = 'mpileup'

Plain text name of the variant caller.

phe.variant.variant_factory

Classes and functions for working with variant callers.

Date:22 Sep, 2015
Author:Alex Jironkin
available_callers()[source]

Return list of available variant callers.

dynamic_caller_loader()[source]

Fancy way of dynamically importing existing variant callers.

Returns:

dict:

Available variant callers dictionary. Keys are parameters that can be used to call variants.

factory(variant=None, custom_options=None)[source]

Make an instance of a variant class.

Parameters:

variant: str, optional

Name of the variant class to instantiate.

custom_options: str, optional

Custom options to be passed directly to the implementing class.

Returns:

phe.variant.VariantCaller:

Instance of the phe.variant.VariantCaller for given variant name, or None if one couldn’t be found.

Module contents

Classes and methods to work with variants and such.

Implementing Variant Caller

Similarly to Implementing Mapper and Implementing Filter, adding your variant caller is easy and straightforward. Implement the VariantCaller class with the following attributes/methods:

Save your new variant caller in the variant directory and it will be automatically picked up by the system. To verify, run:

run_snp_pipeline.py --help

It should appear in the list of available callers.

Date:22 Sep, 2015
Modified:28 June, 2016
Author:Alex Jironkin
class VCFTemplate(vcf_reader)[source]

Bases: object

This is a small hack class for the Template used in generating a VCF file.

class VariantCaller(cmd_options=None)[source]

Bases: phe.metadata.PHEMetaData

Abstract class used for access to the implemented variant callers.

Methods

create_aux_files(ref) Create needed (if any) auxiliary files.
get_info([plain]) Get information about this variant caller.
get_meta() Get the metadata about this variant caller.
get_version() Get the version of the underlying command used.
make_vcf(*args, **kwargs) Create a VCF from BAM file.
validate() Check if the command can run.
create_aux_files(ref)[source]

Create needed (if any) auxiliary files. These files are required for proper functioning of the variant caller.

get_info(plain=False)[source]

Get information about this variant caller.

get_meta()[source]

Get the metadata about this variant caller.

get_version()[source]

Get the version of the underlying command used.

make_vcf(*args, **kwargs)[source]

Create a VCF from BAM file.

Parameters:

ref: str

Path to the reference file.

bam: str

Path to the indexed BAM file for calling variants.

vcf_file: str

path to the VCF file where data will be written to.

make_aux: bool, optional

True/False create auxilliary files (default: False).

Returns:

bool:

True if variant calling was successful, False otherwise.

validate()[source]

Check if the command can run.

class VariantSet(vcf_in, filters=None, reference=None)[source]

Bases: object

A convenient representation of a set of variants. TODO: Implement iterator and generator for the variant set.

Methods

add_metadata(info) Add metadata to the variant set.
filter_variants([keep_only_snps, out_vcf, ...]) Filter the VCF records.
variants([only_good]) Generator over the variant set.
write_variants(vcf_out[, only_snps, only_good]) Write variants to a VCF file.
add_metadata(info)[source]

Add metadata to the variant set.

Parameters:

info: dict

Dictionary of key value pairs to be inserted into metadata.

filter_variants(keep_only_snps=False, out_vcf=None, only_good=False)[source]

Filter the VCF records.

Parameters:

keep_only_snps: bool, optional

Retain only SNP variants (default: False).

out_vcf: str, optional

If specified, filtered records will be written to out_vcf instead of being retained in memory.

only_good: bool, optional

True/False whether only SNPs that PASS should be output.

Returns:

list or int:

If out_vcf is specified, then total number of written records is returned. Otherwise, list of records is returned.

variants(only_good=False)[source]

Generator over the variant set.

Parameters:

only_good: bool, optional

If True, only good variants are returned; otherwise both good and bad variants are returned (default: False).

write_variants(vcf_out, only_snps=False, only_good=False)[source]

Write variants to a VCF file.

Parameters:

vcf_out: str

Path to the file where VCF data is written.

only_snps: bool, optional

True if only SNPs are to be written to the output (default: False).

only_good: bool, optional

True if only those records that PASS all filters should be written (default: False).

Returns:

int:

Number of records written.

phe.variant_filters
Submodules
phe.variant_filters.ADFilter module

Filter VCFs on AD ratio.

Date:24 Sep, 2015
Author:Alex Jironkin
class ADFilter(args)[source]

Bases: phe.variant_filters.PHEFilterBase

Filter sites by AD ratio.

Methods

__call__(record) Filter a vcf.model._Record.
call_concensus(record)
customize_parser(parser)
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc()
classmethod customize_parser(parser)[source]
name = 'ADRatio'
parameter = 'ad_ratio'
short_desc()[source]
phe.variant_filters.DP4Filter module

Filter VCFs on DP4 ratio.

Date:24 Sep, 2015
Author:Alex Jironkin
class DP4Filter(args)[source]

Bases: phe.variant_filters.PHEFilterBase

Filter sites by DP4 ratio.

Methods

__call__(record) Filter a vcf.model._Record.
call_concensus(record)
customize_parser(parser)
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc()
classmethod customize_parser(parser)[source]
name = 'DP4'
parameter = 'dp4_ratio'
short_desc()[source]
phe.variant_filters.DepthFilter module

Filter VCF on depth of coverage.

Date:24 Sep, 2015
Author:Alex Jironkin
class DepthFilter(args)[source]

Bases: phe.variant_filters.PHEFilterBase

Filter sites by depth.

Methods

__call__(record) Filter a vcf.model._Record.
call_concensus(record)
customize_parser(parser)
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc()
classmethod customize_parser(parser)[source]
name = 'MinDepth'
parameter = 'min_depth'
short_desc()[source]
phe.variant_filters.GQFilter module

Filter VCF on GQ parameter.

Date:24 Sep, 2015
Author:Alex Jironkin
class GQFilter(args)[source]

Bases: phe.variant_filters.PHEFilterBase

Filter sites by GQ score.

Methods

__call__(record) Filter a vcf.model._Record.
call_concensus(record)
customize_parser(parser)
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc()
classmethod customize_parser(parser)[source]
name = 'MinGQ'
parameter = 'gq_score'
short_desc()[source]
phe.variant_filters.GTFilter module

Filter VCF on GT filter parameter.

Date:24 Sep, 2015
Author:Alex Jironkin
class UncallableGTFilter(args)[source]

Bases: phe.variant_filters.PHEFilterBase

Filter uncallable genotypes

Methods

__call__(record) Filter a vcf.model._Record.
call_concensus(record)
customize_parser(parser)
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc()
classmethod customize_parser(parser)[source]
is_gap()[source]
is_n()[source]
name = 'UncallGT'
parameter = 'uncall_gt'
short_desc()[source]
phe.variant_filters.MQ0FFilter module

Filter VCF on MQ0F ratio.

Date:24 Sep, 2015
Author:Alex Jironkin
class MQ0FFilter(args)[source]

Bases: phe.variant_filters.PHEFilterBase

Filter sites by MQ0F (Total Mapping Quality Zero Reads to DP) ratio.

Methods

__call__(record) Filter a vcf.model._Record.
call_concensus(record)
customize_parser(parser)
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc()
classmethod customize_parser(parser)[source]
name = 'MinMQ0F'
parameter = 'mq0f_ratio'
short_desc()[source]
phe.variant_filters.MQ0Filter module

Filter VCF on MQ0 ratio.

Date:24 Sep, 2015
Author:Alex Jironkin
class MQ0Filter(args)[source]

Bases: phe.variant_filters.PHEFilterBase

Filter sites by MQ0 (Total Mapping Quality Zero Reads) to DP ratio.

Methods

__call__(record) Filter a vcf.model._Record.
call_concensus(record)
customize_parser(parser)
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc()
classmethod customize_parser(parser)[source]
name = 'MinMQ0'
parameter = 'mq0_ratio'
short_desc()[source]
phe.variant_filters.MQFilter module

Filter VCF on MQ filter.

Date:24 Sep, 2015
Author:Alex Jironkin
class MQFilter(args)[source]

Bases: phe.variant_filters.PHEFilterBase

Filter sites by Mapping Quality (MQ) score.

Methods

__call__(record) Filter a vcf.model._Record.
call_concensus(record)
customize_parser(parser)
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc()
classmethod customize_parser(parser)[source]
name = 'MinMQ'
parameter = 'mq_score'
short_desc()[source]
phe.variant_filters.QualFilter module

Filter VCF on QUAL parameter.

Date:24 Sep, 2015
Author:Alex Jironkin
class QualFilter(args)[source]

Bases: phe.variant_filters.PHEFilterBase

Filter sites by QUAL score.

Methods

__call__(record) Filter a vcf.model._Record.
call_concensus(record)
customize_parser(parser)
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc()
classmethod customize_parser(parser)[source]
name = 'LowSNPQual'
parameter = 'qual_score'
short_desc()[source]
Module contents

Classes and functions for working with variant filters.

Implementing Filter

The PHEFilterBase class was designed to be implemented in order to provide new filters for processing. As such it should be easy to extend PHEFilterBase to achieve this.

If you are not familiar with Python and classes, please read a general introduction to the subject first.

Implementation of a filter closely follows the structure of PyVCF’s filters.

PHEFilterBase adds additional information that is also written to the VCF header:

  • PHEFilterBase.parameter - Attribute that allows the dynamic loader to pick up the filter. This is the name used in command line arguments and configs.
  • PHEFilterBase._default_threshold - Default threshold used for the filter when none is specified.
  • PHEFilterBase.short_desc() - Method for dynamically defining the short description.
  • PHEFilterBase.__call__() - Function that does the filtering (the same as PyVCF).

The most important of these is the __call__ function, as this is what does the filtering. It needs to return the following:

  • None if the record PASSES the filter.
  • A non-None value otherwise - non-None expressions are treated as failures.

Note

While it may not be obvious, Python always returns a value. When no return statement is specified, None is returned to the calling function.

False is returned when data can not be accessed by the filter.

Once you have implemented the filter and saved it in the variant_filters directory, it should be automatically picked up by the system and available to filter on. To verify this run:

run_snp_pipeline.py --help

Your filter should now be visible in the list of available filters.

Example
class MyFilter(PHEFilterBase):
    name = "CoolQUAL"
    _default_threshold = 40
    parameter = "cool_qual"

    def __init__(self, args):
        # Construct any specific details for the object.
        super(MyFilter, self).__init__(args)

    def __call__(self, record):
        if record.QUAL < self.threshold:
            return record.QUAL

    def short_desc(self):
        return "My cool QUAL filter"
Date:24 Sep, 2015
Author:Alex Jironkin
class PHEFilterBase(args)[source]

Bases: vcf.filters.Base

Base class for VCF filters.

Attributes

parameter Short name of parameter being filtered.

Methods

__call__() filter a site, return not None if the site should be filtered
call_concensus(record)
customize_parser(parser) hook to extend argparse parser with custom arguments
decode(filter_id) Decode name of filter.
filter_name() Create filter names by their parameter separated by magic.
get_config() This is used for reconstructing filter.
is_gap()
is_n()
is_uncallable(record) Filter a vcf.model._Record.
short_desc() Short description of the filter (included in VCF).
static call_concensus(record)[source]
static decode(filter_id)[source]

Decode name of filter.

decoder_pattern = <_sre.SRE_Pattern object>
filter_name()[source]

Create filter names from the parameter and threshold separated by the magic separator. E.g. if the filter parameter is ad_ratio and the threshold is 0.9 then ad_ratio:0.9 is the filter name.
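The scheme can be sketched with a pair of hypothetical helpers mirroring filter_name() and decode() (illustrative, not the actual implementation):

```python
# The separator matches the class's magic_sep attribute.
MAGIC_SEP = ":"

def encode_filter_name(parameter, threshold):
    # "<parameter><magic_sep><threshold>", e.g. "ad_ratio:0.9".
    return "%s%s%s" % (parameter, MAGIC_SEP, threshold)

def decode_filter_name(filter_id):
    # Split back into (parameter, threshold-as-string).
    parameter, _, threshold = filter_id.partition(MAGIC_SEP)
    return parameter, threshold
```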

get_config()[source]

This is used for reconstructing filter.

is_gap()[source]
is_n()[source]
is_uncallable(record)[source]

Filter a vcf.model._Record.

magic_sep = ':'
parameter

Short name of parameter being filtered.

short_desc()[source]

Short description of the filter (included in VCF).

available_filters()[source]

Return list of available filters.

dynamic_filter_loader()[source]

Fancy way of dynamically importing existing filters.

Returns:

dict:

Available filters dictionary. Keys are parameters that can be supplied to the filters.

make_filters(config)[source]

Create a list of filters from config.

str_to_filters(filters)[source]

Convert from filter string to array of filters. E.g. ad_ratio:0.9,min_depth:5

Parameters:

filters: str

String version of filters, separated by comma.

Returns:

list:

List of filters created from the string.
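A rough sketch of what str_to_filters has to parse (the real function returns filter objects; plain (parameter, threshold) pairs and the helper name stand in for illustration):

```python
def parse_filter_string(filters):
    """Parse 'ad_ratio:0.9,min_depth:5' into [('ad_ratio', 0.9), ('min_depth', 5.0)]."""
    parsed = []
    for spec in filters.split(","):
        # Split each key:threshold pair on the first ':'.
        parameter, _, threshold = spec.partition(":")
        parsed.append((parameter, float(threshold)))
    return parsed

print(parse_filter_string("ad_ratio:0.9,min_depth:5"))
```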

phe.utils
Submodules
phe.utils.reader
Date:3 June, 2016
Author:Public Health England
class ParallelVCFReader(vcfs)[source]

Bases: object

Read multiple VCFs in parallel - one position at the time.

Methods

get_records() Generator of records, one position at the time.
get_samples() Get the list of sample names from the VCFs.
update([ids]) Update records in the readers.
get_records()[source]

Generator of records, one position at the time. If a position has more than 1 record in any of the readers, a list of all records with that position is returned. The user can deal appropriately with the records list.

Returns:

chrom: str

Chromosome

pos: int

Position

records: dict

Records in dictionary of lists for each sample.

{
    "sample1": [record1, record2], 
    "sample2": [record3]
}
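The position-synchronised walk can be sketched with plain sorted lists standing in for VCF readers (sample names, positions and the function name below are made up for illustration; the real class works on PyVCF readers):

```python
def walk_together(samples):
    """Yield (pos, {sample: [records at pos]}), lowest position first.

    `samples` maps sample name -> list of (pos, record), sorted by pos.
    """
    cursors = {name: 0 for name in samples}
    while True:
        # Find the smallest position any reader is currently sitting on.
        heads = [recs[cursors[n]][0] for n, recs in samples.items() if cursors[n] < len(recs)]
        if not heads:
            return
        pos = min(heads)
        batch = {}
        for name, recs in samples.items():
            taken = []
            # Collect every record at this position (there may be several).
            while cursors[name] < len(recs) and recs[cursors[name]][0] == pos:
                taken.append(recs[cursors[name]][1])
                cursors[name] += 1
            if taken:
                batch[name] = taken
        yield pos, batch

data = {"sample1": [(10, "record1"), (10, "record2"), (20, "record4")],
        "sample2": [(10, "record3")]}
for pos, batch in walk_together(data):
    print(pos, batch)
```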
get_samples()[source]

Get the list of sample names from the VCFs.

update(ids=None)[source]

Update records in the readers. If ids is not specified, then all readers are updated. Otherwise, only those with the same id are updated.

Parameters:

ids: list, optional

Optional list of IDs to update. If None, all will be updated.

is_uncallable(record)[source]

Is the Record uncallable? Currently the record is uncallable iff:

  • GT field is ./.
  • LowQual is in the filter.
Returns:

uncall: bool

True if any of the above items are true, False otherwise.
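The two conditions can be checked directly on a record; DummyRecord and the field names used here are stand-ins for the relevant parts of vcf.model._Record, chosen for illustration:

```python
class DummyRecord(object):
    """Stand-in for vcf.model._Record; just the fields the check needs."""
    def __init__(self, gt, filters):
        self.gt = gt           # genotype string of the sample
        self.FILTER = filters  # list of filter names applied to the site

def is_uncallable(record):
    # Uncallable iff genotype is ./. or LowQual is among the filters.
    return record.gt == "./." or "LowQual" in (record.FILTER or [])

print(is_uncallable(DummyRecord("./.", [])))           # True
print(is_uncallable(DummyRecord("1/1", ["LowQual"])))  # True
print(is_uncallable(DummyRecord("1/1", ["PASS"])))     # False
```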

Module contents
calculate_memory_for_sort()[source]

Calculate available memory for samtools sort function. If there is enough memory, no temp files are created. Enough is defined as at least 1G per CPU.

Returns:

sort_memory: str or None

String to use directly with -m option in sort, or None.
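A sketch of the "at least 1G per CPU" rule described above (how the real function probes available memory is not shown; the helper below, whose name and signature are assumptions, just applies the rule to given numbers):

```python
def sort_memory_per_cpu(available_bytes, cpus, min_per_cpu=1024 ** 3):
    """Return a samtools sort -m string like '4G', or None if < 1G per CPU."""
    per_cpu = available_bytes // cpus
    if per_cpu < min_per_cpu:
        # Not enough memory per CPU; let samtools fall back to temp files.
        return None
    return "%dG" % (per_cpu // (1024 ** 3))

print(sort_memory_per_cpu(16 * 1024 ** 3, 4))  # 4G
print(sort_memory_per_cpu(2 * 1024 ** 3, 4))   # None
```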

Module contents

Scripts

Useful scripts for data analysis.

phenix

There is a single entry point to access all commands we have produced so far. Please refer to this documentation or --help on the command line.

usage: phenix [-h] [--debug] [--version]
              {run_snp_pipeline,filter_vcf,prepare_reference,vcf2fasta} ...
Options:
--debug=False More verbose logging (default: turned off).
--version show program’s version number and exit
Sub-commands:
run_snp_pipeline

Run SNP pipeline.

Run the snp pipeline with specified mapper, variant caller and some filters. Available mappers: [‘bwa’, ‘bowtie2’] Available variant callers: [‘mpileup’, ‘gatk’] Available filters: [‘gq_score’, ‘dp4_ratio’, ‘ad_ratio’, ‘min_depth’, ‘mq_score’, ‘mq0_ratio’, ‘uncall_gt’, ‘qual_score’, ‘mq0f_ratio’] Available annotators: [‘coverage’]

usage: phenix run_snp_pipeline [-h] [--workflow WORKFLOW] [--input INPUT]
                               [-r1 R1] [-r2 R2] [--reference REFERENCE]
                               [--sample-name SAMPLE_NAME] [--outdir OUTDIR]
                               [--config CONFIG] [--mapper MAPPER]
                               [--mapper-options MAPPER_OPTIONS] [--bam BAM]
                               [--variant VARIANT]
                               [--variant-options VARIANT_OPTIONS] [--vcf VCF]
                               [--filters FILTERS]
                               [--annotators ANNOTATORS [ANNOTATORS ...]]
                               [--keep-temp]
Options:
--workflow, -w Undocumented
--input, -i Undocumented
-r1 R1/Forward read in Fastq format.
-r2 R2/Reverse read in Fastq format.
--reference, -r
 Reference to use for mapping.
--sample-name=test_sample
 Name of the sample for mapper to include as read groups.
--outdir, -o Undocumented
--config, -c Undocumented
--mapper=bwa, -m=bwa
 Available mappers: [‘bwa’, ‘bowtie2’]
--mapper-options
 Custom mapper options (advanced)
--bam Undocumented
--variant=gatk, -v=gatk
 Available variant callers: [‘mpileup’, ‘gatk’]
--variant-options
 Custom variant options (advanced)
--vcf Undocumented
--filters Filters to be applied to the VCF in key:value pairs, separated by comma (,). Available_filters: [‘gq_score’, ‘dp4_ratio’, ‘ad_ratio’, ‘min_depth’, ‘mq_score’, ‘mq0_ratio’, ‘uncall_gt’, ‘qual_score’, ‘mq0f_ratio’]
--annotators List of annotators to run before filters. Available: [‘coverage’]
--keep-temp=False
 Keep intermediate files like BAMs and VCFs (default: False).
filter_vcf

Filter a VCF.

Filter the VCF using provided filters.

usage: phenix filter_vcf [-h] --vcf VCF [--filters FILTERS | --config CONFIG]
                         --output OUTPUT [--reference REFERENCE] [--only-good]
Options:
--vcf, -v VCF file to (re)filter.
--filters, -f Filter(s) to apply as key:threshold pairs, separated by comma.
--config, -c Config with filters in YAML format, e.g. a filters: list of key: value entries.
--output, -o Location for filtered VCF to be written.
--reference, -r
 mpileup versions <= 1.3 do not output all positions. This is required to fix the reference base in the VCF.
--only-good=False
 Write only variants that PASS all filters (default all variants are written).
prepare_reference

Create aux files for reference.

Prepare reference for SNP pipeline by generating required aux files.

usage: phenix prepare_reference [-h] --reference REFERENCE [--mapper MAPPER]
                                [--variant VARIANT]
Options:
--reference, -r
 Path to reference file to prepare.
--mapper Available mappers: [‘bwa’, ‘bowtie2’]
--variant Available variants: [‘mpileup’, ‘gatk’]
vcf2fasta

Convert VCFs to FASTA.

Combine multiple VCFs into a single FASTA file.

usage: phenix vcf2fasta [-h]
                        (--directory DIRECTORY | --input INPUT [INPUT ...])
                        [--regexp REGEXP] --out OUT
                        [--with-mixtures WITH_MIXTURES]
                        [--column-Ns COLUMN_NS] [--column-gaps COLUMN_GAPS]
                        [--sample-Ns SAMPLE_NS] [--sample-gaps SAMPLE_GAPS]
                        [--reference REFERENCE]
                        [--include INCLUDE | --exclude EXCLUDE]
                        [--with-stats WITH_STATS] [--tmp TMP]
Options:
--directory, -d
 Path to the directory with .vcf files.
--input, -i List of VCF files to process.
--regexp Regular expression for finding VCFs in a directory.
--out, -o Path to the output FASTA file.
--with-mixtures
 Specify this option with a threshold to output mixtures above this threshold.
--column-Ns Keeps columns with fraction of Ns below specified threshold.
--column-gaps Keeps columns with fraction of gaps below specified threshold.
--sample-Ns Keeps samples with fraction of Ns below specified threshold.
--sample-gaps Keeps samples with fraction of gaps below specified threshold.
--reference If path to reference specified (FASTA), then whole genome will be written.
--include Only include positions in BED file in the FASTA
--exclude Exclude any positions specified in the BED file.
--with-stats If a path is specified, then positions of the output SNPs are stored in this file. Requires numpy and matplotlib.
--tmp Location for writing temp files (default: /tmp).

Galaxy

This tool is available for installation to your local Galaxy [http://galaxyproject.org] from the Toolshed [https://toolshed.g2.bx.psu.edu/]. This section describes in some detail how this is done and how the Galaxy tool can be used to process data.

Primer

The Phenix Galaxy tool only enables you to filter VCF files and to convert VCF files into FASTA files. The mapping and the SNP calling component of Phenix need to be provided by other Galaxy tools (also available from the Toolshed.)

How to get your own Galaxy server:

If you already have a local Galaxy server on which you are an Administrator and on which you can install tools from the Toolshed you can skip this section and continue reading at “How to install Phenix to your Galaxy server”. If not you might want to get one, which is easy.

Prerequisites:

  • A workstation running a Linux operating system
  • Min. 4GB of RAM (better 8GB or more)
  • Root access to that machine.
  • git (optional)
  • Python 2.6 or 2.7
  • GNU Make, gcc to compile and install tool dependencies

The last three are standard on most contemporary Linux installations.

Get your own Galaxy:

Please go to [https://wiki.galaxyproject.org/Admin/GetGalaxy] and follow the instructions in the sections from “Get the Code” to “Become an Admin” (inclusive).

How to install Phenix to your Galaxy server:

This page (https://wiki.galaxyproject.org/Admin/Tools/AddToolFromToolShedTutorial) describes the general procedure of installing a tool from the toolshed. Here is a click-by-click guide to what you need to do to install Phenix.

Prerequisites:

There is a small number of Linux packages that need to be installed on your machine, so that Phenix can be installed and run properly. These are:

  • a C++ compiler
  • a Fortran compiler
  • the zlib development libraries
  • the samtools package

The exact names of these packages and the commands to install them depend on your Linux distribution. For Ubuntu, Mint and Debian Linux the required commands should be:

sudo apt-get install build-essential
sudo apt-get install gfortran
sudo apt-get install zlib1g-dev
sudo apt-get install samtools

On Fedora the commands should be:

sudo dnf install gcc-c++
sudo dnf install gcc-gfortran
sudo dnf install zlib-devel
sudo dnf install samtools

And on OpenSUSE the commands should be:

sudo zypper install gcc-c++
sudo zypper install gcc-fortran
sudo zypper install zlib-devel
sudo zypper install samtools

For more esoteric Linux distributions please refer to your local IT expert and/or Google. After you have successfully installed these packages, follow the instructions below.

  • Make sure that you can access ftp sites from the command line running your Galaxy server. This is normally enabled by default, but sometimes requires an additional proxy setting. Try this command
wget ftp://ftp.gnu.org/gnu/libtool/libtool-2.4.tar.gz

If that does not download a file named ‘libtool-2.4.tar.gz’ to your current folder, speak to your local systems’ administrator.

  • In your config/galaxy.ini files, set
tool_dependency_dir = tool_dependencies
  • Restart your Galaxy
  • In your Galaxy, click on ‘Admin’ in the main menu at the top.
  • Select ‘Search Tool Shed’ from the menu on the left hand side.
  • Click on the little black triangle (the context menu) next to ‘Galaxy Main Tool Shed’ and select ‘Browse valid repositories’.
  • Type ‘phephenix’ [sic] into the search box and press Enter.
  • You should see the “package_phephenix_1_0” and the “phephenix” repositories. Select ‘Preview and install’ from the context menu of the latter.
  • Click ‘Install to Galaxy’.
  • Type ‘PHE TOOLS’ into the ‘Add new tool panel section:’ textbox.
  • Click ‘Install’.
  • You will be presented with a long list of packages that need to be installed. This will take a while. Wait until everything is green. If nothing happens for a little while, try reloading the page.

In your admin tool panel the menu item “Manage installed tools” has been added. You can check the status of your installed packages there.

You need to install two more tools that are part of the Phenix workflow:

  • Browse the toolshed as before and look for ‘bwa’. Install the tool ‘bwa’ with synopsis “Wrapper for bwa mem, aln, sampe, and samse” owned by ‘devteam’. Put it into the ‘PHE TOOLS’ tool panel section.
  • Browse the toolshed as before and install the tool ‘phe_samtools_mpileup’. Put it into the ‘PHE TOOLS’ tool panel section.

How to use Phenix on Galaxy:

  • In the admin menu, go the ‘Manage installed tools’
  • Find the ‘phephenix’ tool and click on it, not on its context menu.
  • In the ‘Contents of this repository’ box at the bottom of the page, expand the workflows section using the little blue triangle if necessary.
  • Click on ‘Phenix workflow’ and behold an image representation of the workflow.
  • Click on ‘Repository actions’ -> ‘Import workflow to Galaxy’

The Phenix workflow is now ready to use. You need to upload your data to a Galaxy history to use it. There are multiple options depending on your local Galaxy configuration. If you have followed the instructions above under ‘Get your own Galaxy’ the only available option is uploading from your local harddrive. When doing this, please make sure your fastq files are in the ‘fastqsanger’ format and your reference genome is in ‘fasta’ format. The basename of your R1 fastq files will be used as the name for the resulting bam and vcf files with appropriate file endings and will also appear as the sequence header in your resulting SNP alignment. Once your data is ready to use, follow these instructions to run the workflow.

Remark: Processing a single sample with the Phenix workflow can use up to 1.5GB of RAM. It is recommended you do not run more samples than your total system memory divided by 1.5 (2 for 4GB, 5 for 8GB, ...), or only as many samples as you have processor cores, whichever is lower.
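The remark above amounts to taking the minimum of two limits; the helper below is an illustrative sketch of that arithmetic, not part of Phenix:

```python
def max_concurrent_samples(total_ram_gb, cpu_cores, ram_per_sample_gb=1.5):
    # Whichever is lower: the RAM-limited count or one sample per core.
    return min(int(total_ram_gb / ram_per_sample_gb), cpu_cores)

print(max_concurrent_samples(8, 4))  # 4 (RAM would allow 5, but only 4 cores)
print(max_concurrent_samples(4, 8))  # 2 (8 cores, but RAM allows only 2)
```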

  • Click on workflow in the top main menu and select ‘run’ from the ‘Phenix workflow’ context menu.
  • Select your reference genome in the ‘Reference fasta’ selection box.
  • Click both “paper stack” icons next to ‘R1 fastqs’ and next to ‘R2 fastqs’.
  • Select all R1.fastq files in the “R1 fastqs” selection box that appeared. An easy way to do this is to type ‘R1’ into the text box immediately under the selection box. The selection box will automatically be filtered for fastq file names that match what has been typed into the text box.
  • Do as above for the “R2 fastq” box.

Note

Make sure that the order of highlighted files in the R1 and R2 fastq selection boxes corresponds, i.e. the first highlighted file in the top box will be run together with the first highlighted file in the bottom box, the second with the second, and so on. Check that the highlighted files at the same rank in both boxes are always matching pairs from the same sample.

  • If you are happy with the above, click ‘Run workflow’.
  • Wait until three new history items per sample processed appear. These will be one bam file, one raw vcf, and one filtered vcf per sample.
  • Click on PHE TOOLS in the tool panel on the left hand side and select the “VCFs to fasta” tool. This tool will create a multi-fasta file from your filtered vcfs.
  • Select all files ending with filtered.vcf in the top file selection box under ‘Input VCF files(s)’, by holding down the Ctrl key.
  • Click ‘Execute’. Once everything is completed the “VCFs to fasta” tool will have produced your SNP alignment, which can now be used for further processing.

Indices and tables