Public Health England SNP calling Pipeline (PHEnix)¶
This documentation provides an overview as well as a detailed API reference for Public Health England’s single nucleotide polymorphism (SNP) calling pipeline. The pipeline does not currently provide a standalone means of calling SNPs; instead it interfaces with other published tools, e.g. BWA and samtools.
Contents:
Introduction¶
| Author: | Public Health England |
|---|---|
| Date: | 28 June, 2016 |
Installation¶
From source:
git clone https://github.com/phe-bioinformatics/PHEnix.git
pip2 install -e PHEnix
Directly from github:
pip install git+https://github.com/phe-bioinformatics/PHEnix.git
Note
Installing from Pip - Coming Soon.
Overview¶
This code was designed to allow users to input fastq files and a reference sequence and perform:
- Reference mapping
- VCF generation
- VCF filtering
- FASTA sequence of SNPs
The process is comprised of three steps:
- Reference sequence preparation (prepare_reference.py)
- Mapping and filtered VCF generation (run_snp_pipeline.py)
- FASTA generation from single or multiple VCFs (vcf2fasta.py)
Example:
Prepare a BWA and GATK reference using a FASTA file (myref.fasta):
phenix.py prepare_reference \
--mapper bwa \
--variant gatk \
--reference myref.fasta
Map, call and filter variants on fastq files (my.R1.fastq, my.R2.fastq), filtering SNPs on minimum depth, mapping quality and AD ratio:
phenix.py run_snp_pipeline \
-r1 my.R1.fastq \
-r2 my.R2.fastq \
-r myref.fasta \
--sample-name mysample \
--mapper bwa \
--variant gatk \
--filters min_depth:5,mq_score:30,ad_ratio:0.9
Generate a FASTA file of SNPs using filtered VCFs in the current directory:
phenix.py vcf2fasta -d ./ -o output.fasta --regexp filtered
Requirements¶
Much of the functionality depends on the presence of 3rd-party tools:
Mappers:
- BWA - Download from [https://github.com/lh3/bwa]
- Bowtie2 - Download from [https://github.com/BenLangmead/bowtie2]
Variant Callers:
- GATK - Download from [https://www.broadinstitute.org/gatk/download/]
- Picard (needed by GATK) - Download from [http://broadinstitute.github.io/picard/]
- MPileup - Download from [https://github.com/samtools/samtools]
In order for them to function properly, they need to be on your PATH already. For commands that run from Java archives, please set the appropriate environment variables (see below).
Python¶
- Python >= 2.7
- argparse
- PyVCF
- PyYAML
- matplotlib (optional)
- bintrees (optional) - if you are using vcf2fasta.py
- numpy (optional)
- matplotlib_venn (optional)
- psycopg2 (optional)
3rd Party Requirements¶
Samtools¶
Samtools can be downloaded from https://github.com/samtools/samtools. It is used to filter and convert SAM/BAM files and by the mpileup variant caller.
BCFTools¶
BCFtools can be downloaded from https://github.com/samtools/bcftools. It is used for calling variants in mpileup.
BWA¶
Heng Li’s BWA mapper can be downloaded from https://github.com/lh3/bwa.
Bowtie2¶
The Bowtie2 mapper is available from https://github.com/BenLangmead/bowtie2.
GATK¶
Set GATK_JAR - the full path to the GATK Java archive.
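For example, in bash (the path is illustrative):
export GATK_JAR=/path/to/GenomeAnalysisTK.jar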
Picard Tools¶
Picard is needed by GATK to create a dictionary of the reference fasta. Either set PICARD_TOOLS_PATH - the path to the directory where the different Picard jars are - or set PICARD_JAR - the path to picard.jar. Older Picard distributions ship many separate jars (use the first option), whereas newer versions have merged everything into a single jar file.
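For example (illustrative path):
export PICARD_JAR=/path/to/picard.jar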
Prerequisites¶
The reference needs to be indexed in the appropriate way for the different mapper/variant caller combinations. This can be done manually for specific tools or using:
prepare_reference.py --mapper [bwa | bowtie2] \
--variant [gatk | mpileup] \
--reference <path_to_reference>
Internally, the pipeline does the indexing automatically, because the software assumes a reference is present for each sample. For other use cases it has to be done beforehand, because multiple processes trying to index the same file may lead to corrupted indices.
Note
The reference file can’t have a .fas extension, because Picard Tools throws an exception.
Filters¶
One of the key parts of VCF processing is filtering for quality calls. To do this we have created a flexible interface with a variety of filters available:
- qual_score - Filter records where the QUAL score is below the given threshold.
- ad_ratio - Filter records (AD is defined by GATK) where the ratio of the alt allele to the sum of all alleles is below the given fraction.
- dp4_ratio - Similar to ad_ratio, but used with the mpileup variant caller.
- mq_score - Filter records that fall below the specified MQ score (from the _INFO_ field).
- mq0_ratio - Filter records (MQ0 is defined by GATK) with an MQ0 to DP ratio _above_ the given threshold (both values are from the _INFO_ field).
- mq0f_ratio - Similar to mq0_ratio, but used with the mpileup variant caller.
- gq_score - Filter records that fall below the specified GQ score (from the first sample).
- min_depth - Filter records with mean depth below the specified threshold (DP from the sample field).
- uncall_gt - Filter records with uncallable genotypes in the VCF (GT is ./.).
All filters are applied at each position, and only positions that pass ALL filters are kept as quality calls. Positions failing a filter are retained for future reference and for creating FASTA files when needed. To specify the filters to be used, simply list them as key:threshold pairs separated by commas (,). For filters that don’t require a threshold, leave the part after ‘:’ blank.
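For example, the following (illustrative) filter string combines two thresholded filters with uncall_gt, which takes no threshold and is therefore left blank after the colon:
--filters min_depth:5,mq_score:30,uncall_gt: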
Annotators¶
Individual VCFs can be annotated using custom annotators. Currently available annotators:
- coverage - Annotates with information about _mean_ and _dev_ of the depth in the VCF (using DP from INFO).
The data can be accessed from the metadata field in the VCF in the following way:
import vcf

r = vcf.Reader(filename="/path/to/vcf")
print r.metadata["coverageMetaData"][0]["mean"]
print r.metadata["coverageMetaData"][0]["dev"]
Calling SNPs¶
If you have a sample and you want a one-stop-shop analysis, run the following:
$ phenix.py run_snp_pipeline \
-r1 <path to R1.fastq> \
-r2 <path to R2.fastq> \
-r <path to reference> \
--sample-name <NAME> \
--mapper bwa \
--variant gatk \
--filters min_depth:5,mq_score:30 \
-o <path to output directory>
This will map with BWA and call variants with GATK. Intermediate files are written into the specified output directory. The --sample-name option is very important: it determines what the output files will be called and the read group in the BAM file. If you omit it, test_sample will be used.
Config¶
If you have a pipeline, or want to run the same settings for different samples, this process can be simplified by having a single config file. The file is in YAML format and takes the common options that may be used for all samples:
- mapper - string
- mapper-options - string
- variant - string
- variant-options - string
- filters - dictionary
- annotators - list
Below is an example of config file:
mapper: bwa
mapper-options: -t 4
variant: gatk
variant-options: --sample_ploidy 2 --genotype_likelihoods_model SNP -rf BadCigar -out_mode EMIT_ALL_SITES -nt 1
filters:
ad_ratio: 0.9
min_depth: 5
qual_score: 30
mq_score: 30
annotators:
- coverage
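Assuming the above is saved as config.yml (the file name is illustrative), it can be passed to the pipeline with the -c/--config option instead of repeating the options on the command line:
phenix.py run_snp_pipeline \
-c config.yml \
-r1 my.R1.fastq \
-r2 my.R2.fastq \
-r myref.fasta \
--sample-name mysample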
Converting to FASTA¶
A lot of downstream applications take FASTA-formatted files, not VCFs. We have included a script for converting VCF data to FASTA format.
$ phenix.py vcf2fasta -d <path to directory of VCFs> -o <path to output FASTA>
This tool can also filter out samples and columns that may be of bad quality. E.g. --sample-Ns specifies the maximum fraction of Ns allowed in a sample, --column-Ns specifies the maximum fraction of Ns allowed per column, and --with-mixtures specifies that mixed positions above a certain fraction should be output. First samples are sifted out, then columns are checked. --with-stats outputs the genomic positions of the called SNPs. Each line corresponds to a character position in the FASTA file; e.g. to get the position of the 23rd SNP, go to the 23rd line in the positions file. With this option, if numpy and matplotlib are installed, a plots directory will be created containing a summary plot of called SNPs, Ns, mixtures, and bases with 0 depth. Providing --with-dist-matrix will also write a distance matrix to the specified path.
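For example, a sketch of an invocation that drops poor-quality samples and columns (the 0.1 thresholds and file names are illustrative, not recommendations):
$ phenix.py vcf2fasta \
-d <path to directory of VCFs> \
-o output.fasta \
--sample-Ns 0.1 \
--column-Ns 0.1 \
--with-stats positions.txt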
API Reference¶
Filters¶
One of the key aims of this project is to provide a flexible and extendable interface for filtering VCFs. While there are a number of tools that will process a VCF file, we felt that none offered the in-depth control that we wanted. Instead we have based this library on PyVCF and extended it with a variety of filters (see Filters).
If you feel that there is a filter that has not been implemented, or you want to try your hand at implementing object oriented classes, then please see Implementing Filter.
Mappers¶
We found there was no single neat way of calling different mappers (there is BioPython, of course, but it wasn’t feasible to integrate it into our software stack in the way that we wanted). Instead, we wrote a simple interface that can be easily extended to add new mappers. See Implementing Mapper.
Variant Callers¶
Just like with mappers, there was no neat way of incorporating other interfaces into this software stack. To add your favourite variant caller see Implementing Variant Caller.
phe¶
Subpackages¶
phe.annotations¶
| Date: | 10 Nov, 2015 |
|---|---|
| Author: | Alex Jironkin |
class CoverageAnnotator(config=None)[source]¶
Bases: phe.annotations.Annotator
Methods:
- annotate([vcf_path])
- get_meta()
- get_meta_values()
name = 'coverage'¶
| Date: | 10 Nov, 2015 |
|---|---|
| Author: | Alex Jironkin |
class Annotator(name)[source]¶
Bases: phe.metadata.PHEMetaData
Base class for Annotators.
Attributes:
- name
Methods:
- annotate([vcf_path])
- get_meta()
- get_meta_values()
name = None¶
phe.mapping¶
Implementation of the Mapper class using BWA (Heng Li) mapper.
| Date: | 17 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class BWAMapper(cmd_options=None)[source]¶
Bases: phe.mapping.Mapper
BWA mapper developed by Heng Li.
Methods:
- create_aux_files(ref)
- get_info([plain])
- get_meta()
- get_samtools_version() - Get version of samtools used.
- get_version()
- make_bam(*args, **kwargs) - Make indexed BAM from reference, and fastq files.
- make_sam(*args, **kwargs)
- validate() - Validate itself, by checking appropriate commands can run.
name = 'bwa'¶
Plain text name of the mapper.
Implementation of the Mapper class using Bowtie2 mapper.
| Date: | 17 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class Bowtie2Mapper(cmd_options=None)[source]¶
Bases: phe.mapping.Mapper
Bowtie2 mapper.
Methods:
- create_aux_files(ref)
- get_info([plain])
- get_meta()
- get_samtools_version() - Get version of samtools used.
- get_version()
- make_bam(*args, **kwargs) - Make indexed BAM from reference, and fastq files.
- make_sam(*args, **kwargs)
- validate() - Validate itself, by checking appropriate commands can run.
name = 'bowtie2'¶
Plain text name of the mapper.
| Date: | 22 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
dynamic_mapper_loader()[source]¶
Fancy way of dynamically importing existing mappers.
Returns: dict
Available mappers dictionary. Keys are names that can be used for existing mapper implementations.
factory(mapper=None, custom_options=None)[source]¶
Create an instance of a mapper from _config_.
Parameters: mapper: str, optional
Name of the mapper to initialise.
custom_options: str, optional
Custom options to be passed to the mapper.
Returns: Instance of a phe.mapping.Mapper with the provided name, or None if the mapper could not be found.
Mapping related classes and functions.
This project does not provide a novel way of mapping or calling variants. Instead it interfaces with available mappers through their command line interfaces. Hopefully, we have made it straightforward to add your favourite mapper for others to use.
To do this, you will need to implement the Mapper class. There are a number of methods that need to be implemented for your mapper to work:
- Mapper.name - Attribute that specifies the name of the mapper (used to dynamically load it).
- Mapper.make_sam() - Create a SAM file in out_file from R1/Forward, R2/Reverse and ref. See Mapper.make_sam() for all input options.
- Mapper.create_aux_files() - Create auxiliary files needed for the mapper. This is also called by phenix.
- Mapper.get_version() - Get the version of the mapper.
- Mapper.get_info() - Get the metadata information for the mapper (included in the VCF header).
- Mapper.make_bam() - If you want, you can also implement this method, but by default samtools is used to convert from SAM -> BAM.
A minimal sketch is shown below.
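The sketch below shows the expected shape of such a class; the class name is hypothetical and the method bodies are placeholders, not a working mapper:

from phe.mapping import Mapper

class MyMapper(Mapper):
    # Hypothetical mapper skeleton - bodies are placeholders.
    name = "mymapper"

    def make_sam(self, *args, **kwargs):
        # Run the external mapper with kwargs["ref"], kwargs["R1"],
        # kwargs["R2"] and write SAM data to kwargs["out_file"].
        return True

    def create_aux_files(self, ref):
        # Build any index files the mapper needs for ref.
        return True

    def get_version(self):
        return "0.0.1"

    def get_info(self, plain=False):
        # Metadata included in the VCF header.
        return {}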
Bear in mind that cmd_options will be passed on if they are specified on the command line or in the config under mapper-options.
Once you have implemented this interface, save the file in the mapping directory and it should be automatically picked up. To verify run:
run_snp_pipeline.py --help
You should see your mapper in the list of available mappers. If the mapper works, it will also be included automatically in the documentation.
| Date: | 22 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class Mapper(cmd_options=None)[source]¶
Bases: phe.metadata.PHEMetaData
Abstract Mapper class that provides a generic interface to the mapping implementation for a particular mapper.
Attributes:
- name - Return name of the mapper (implemented by individual classes).
Methods:
- create_aux_files(ref) - Create required auxiliary files for ref.
- get_info([plain]) - Get information about the mapper.
- get_meta()
- get_samtools_version() - Get version of samtools used.
- get_version() - Get the version of the underlying command used.
- make_bam(*args, **kwargs) - Make indexed BAM from reference, and fastq files.
- make_sam(*args, **kwargs) - Make SAM from reference, and fastq files.
- validate() - Validate itself, by checking appropriate commands can run.
create_aux_files(ref)[source]¶ Create required auxiliary files for ref.
Parameters: ref: str
Path to the reference file.
Returns: bool:
True if auxiliary files were created, False otherwise.
-
get_samtools_version()[source]¶ Get version of samtools used. Returned as a triple (major, minor, patch).
-
make_bam(*args, **kwargs)[source]¶ Make indexed BAM from reference, and fastq files.
Parameters: ref: str
Path to the reference file used for mapping.
R1: str
Path to the R1/Forward reads file in FastQ format.
R2: str
Path to the R2/Reverse reads file in FastQ format.
out_file: str
Name of the output file where BAM data is written.
sample_name: str
Name of the sample to be included in the read group (RG) field.
Returns: bool:
True iff mapping, converting to BAM and indexing returns 0, False otherwise.
-
make_sam(*args, **kwargs)[source]¶ Make SAM from reference, and fastq files.
Parameters: ref: str
Path to the reference file used for mapping.
R1: str
Path to the R1/Forward reads file in FastQ format.
R2: str
Path to the R2/Reverse reads file in FastQ format.
out_file: str
Name of the output file where SAM data is written.
sample_name: str
Name of the sample to be included in the read group (RG) field.
make_aux: bool, optional
True/False to make auxilliary files (default: False).
Returns: bool:
True iff mapping returns 0, False otherwise.
-
name¶ Return name of the mapper (implemented by individual classes).
-
phe.metadata¶
phe.variant¶
| Date: | 22 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class GATKVariantCaller(cmd_options=None)[source]¶
Bases: phe.variant.VariantCaller
Implementation of the Broad Institute’s variant caller.
Methods:
- create_aux_files(ref) - Create auxiliary files needed for this variant caller.
- get_info([plain])
- get_meta() - Get the metadata about this variant caller.
- get_version()
- make_vcf(*args, **kwargs)
- validate()
create_aux_files(ref)[source]¶ Create auxiliary files needed for this variant.
Tools needed: samtools and Picard tools. Picard is a Java library whose location can be set using environment variables: PICARD_JAR, specifying the path to picard.jar, or PICARD_TOOLS_PATH, specifying the path to the directory where the separate jars are (older versions, before the jars were merged into a single picard.jar).
Parameters: ref: str
Path to the reference file.
Returns: bool:
True if auxiliary files were created, False otherwise.
-
name= 'gatk'¶ Plain text name of the variant caller.
-
| Date: | 22 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class MPileupVariantCaller(cmd_options=None)[source]¶
Bases: phe.variant.VariantCaller
Implementation of the samtools mpileup variant caller.
Methods:
- create_aux_files(ref) - Index reference with faidx from samtools.
- get_info([plain])
- get_meta() - Get the metadata about this variant caller.
- get_version()
- make_vcf(*args, **kwargs)
- validate() - Check if the command can run.
create_aux_files(ref)[source]¶ Index reference with faidx from samtools.
Parameters: ref: str
Path to the reference file.
Returns: bool:
True if auxiliary files were created, False otherwise.
-
name= 'mpileup'¶ Plain text name of the variant caller.
-
Classes and functions for working with variant callers.
| Date: | 22 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
dynamic_caller_loader()[source]¶
Fancy way of dynamically importing existing variant callers.
Returns: dict
Available variant callers dictionary. Keys are parameters that can be used to call variants.
factory(variant=None, custom_options=None)[source]¶
Make an instance of a variant caller class.
Parameters: variant: str, optional
Name of the variant caller class to instantiate.
custom_options: str, optional
Custom options to be passed directly to the implementing class.
Returns: Instance of the phe.variant.VariantCaller for the given variant name, or None if one couldn’t be found.
Classes and methods to work with variants and such.
Similarly to Implementing Mapper and Implementing Filter, adding your variant caller is easy and straightforward. Implement the VariantCaller class with the following attributes/methods:
- VariantCaller.name - Name of your variant caller, used to dynamically load it.
- VariantCaller.make_vcf() - Create a VCF from an input BAM. See VariantCaller.make_vcf() for the input list.
- VariantCaller.get_version() - Return the version of this variant caller.
- VariantCaller.get_info() - Return information about this caller, included in the VCF header.
- VariantCaller.create_aux_files() - Create auxiliary files needed by the caller. Also used in phenix.
A minimal sketch is shown below.
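The sketch below shows the expected shape; the class name is hypothetical and the method bodies are placeholders, not a working caller:

from phe.variant import VariantCaller

class MyCaller(VariantCaller):
    # Hypothetical variant caller skeleton - bodies are placeholders.
    name = "mycaller"

    def make_vcf(self, *args, **kwargs):
        # Call variants from kwargs["bam"] against kwargs["ref"] and
        # write the result to kwargs["vcf_file"].
        return True

    def create_aux_files(self, ref):
        # E.g. index the reference, if the caller requires it.
        return True

    def get_version(self):
        return "0.0.1"

    def get_info(self, plain=False):
        # Information included in the VCF header.
        return {}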
Save your new variant caller in the variant directory and it will be automatically picked up by the system. To verify run:
run_snp_pipeline.py --help
It should appear in the list of available callers.
| Date: | 22 Sep, 2015 |
|---|---|
| Modified: | 28 June, 2016 |
| Author: | Alex Jironkin |
class VCFTemplate(vcf_reader)[source]¶
Bases: object
This is a small hack class for the Template used in generating a VCF file.
class VariantCaller(cmd_options=None)[source]¶
Bases: phe.metadata.PHEMetaData
Abstract class used for access to the implemented variant callers.
Methods:
- create_aux_files(ref) - Create needed (if any) auxiliary files.
- get_info([plain]) - Get information about this variant caller.
- get_meta() - Get the metadata about this variant caller.
- get_version() - Get the version of the underlying command used.
- make_vcf(*args, **kwargs) - Create a VCF from a BAM file.
- validate() - Check if the command can run.
create_aux_files(ref)[source]¶ Create needed (if any) auxiliary files. These files are required for proper functioning of the variant caller.
-
make_vcf(*args, **kwargs)[source]¶ Create a VCF from BAM file.
Parameters: ref: str
Path to the reference file.
bam: str
Path to the indexed BAM file for calling variants.
vcf_file: str
Path to the VCF file where the data will be written.
make_aux: bool, optional
True/False to create auxiliary files (default: False).
Returns: bool:
True if variant calling was successful, False otherwise.
-
class VariantSet(vcf_in, filters=None, reference=None)[source]¶
Bases: object
A convenient representation of a set of variants. TODO: Implement iterator and generator for the variant set.
Methods:
- add_metadata(info) - Add metadata to the variant set.
- filter_variants([keep_only_snps, out_vcf, ...]) - Filter the VCF records.
- variants([only_good]) - Generator over the variant set.
- write_variants(vcf_out[, only_snps, only_good]) - Write variants to a VCF file.
add_metadata(info)[source]¶ Add metadata to the variant set.
Parameters: info: dict
Dictionary of key value pairs to be inserted into metadata.
-
filter_variants(keep_only_snps=False, out_vcf=None, only_good=False)[source]¶ Filter the VCF records.
Parameters: keep_only_snps: bool, optional
Retain only SNP variants (default: False).
out_vcf: str, optional
If specified, filtered records will be written to out_vcf instead of being retained in memory.
only_good: bool, optional
True/False if only SNPs that PASS all filters should be output.
Returns: list or int:
If out_vcf is specified, then total number of written records is returned. Otherwise, list of records is returned.
-
variants(only_good=False)[source]¶ Generator over the variant set.
Parameters: only_good: bool, optional
Iff True, only variants that are good (PASS all filters) are returned; otherwise both good and bad variants are returned (default: False).
-
write_variants(vcf_out, only_snps=False, only_good=False)[source]¶ Write variants to a VCF file.
Parameters: vcf_out: str
Path to the file where VCF data is written.
only_snps: bool, optional
True if only SNPs are to be written to the output (default: False).
only_good: bool, optional
True if only those records that PASS all filters should be written (default: False).
Returns:
int:
Number of records written.
-
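As an illustration, filtering an existing VCF through this class might look like the sketch below; the paths are placeholders and the filters dictionary is an assumption based on the signatures above:

from phe.variant import VariantSet

# Illustrative inputs; filters mirrors the config-style dictionary.
var_set = VariantSet("sample.vcf", filters={"min_depth": 5})
var_set.filter_variants()
var_set.write_variants("sample.filtered.vcf", only_good=True)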
phe.variant_filters¶
Filter VCFs on AD ratio.
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class ADFilter(args)[source]¶
Bases: phe.variant_filters.PHEFilterBase
Filter sites by AD ratio.
Methods:
- __call__(record) - Filter a vcf.model._Record.
- call_concensus(record)
- customize_parser(parser)
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc()
name = 'ADRatio'¶
parameter = 'ad_ratio'¶
Filter VCFs on DP4 ratio.
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class DP4Filter(args)[source]¶
Bases: phe.variant_filters.PHEFilterBase
Filter sites by DP4 ratio.
Methods:
- __call__(record) - Filter a vcf.model._Record.
- call_concensus(record)
- customize_parser(parser)
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc()
name = 'DP4'¶
parameter = 'dp4_ratio'¶
Filter VCF on depth of coverage.
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class DepthFilter(args)[source]¶
Bases: phe.variant_filters.PHEFilterBase
Filter sites by depth.
Methods:
- __call__(record) - Filter a vcf.model._Record.
- call_concensus(record)
- customize_parser(parser)
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc()
name = 'MinDepth'¶
parameter = 'min_depth'¶
Filter VCF on GQ parameter.
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class GQFilter(args)[source]¶
Bases: phe.variant_filters.PHEFilterBase
Filter sites by GQ score.
Methods:
- __call__(record) - Filter a vcf.model._Record.
- call_concensus(record)
- customize_parser(parser)
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc()
name = 'MinGQ'¶
parameter = 'gq_score'¶
Filter VCF on GT filter parameter.
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class UncallableGTFilter(args)[source]¶
Bases: phe.variant_filters.PHEFilterBase
Filter uncallable genotypes.
Methods:
- __call__(record) - Filter a vcf.model._Record.
- call_concensus(record)
- customize_parser(parser)
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc()
name = 'UncallGT'¶
parameter = 'uncall_gt'¶
Filter VCF on MQ0F ratio.
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class MQ0FFilter(args)[source]¶
Bases: phe.variant_filters.PHEFilterBase
Filter sites by MQ0F (Total Mapping Quality Zero Reads to DP) ratio.
Methods:
- __call__(record) - Filter a vcf.model._Record.
- call_concensus(record)
- customize_parser(parser)
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc()
name = 'MinMQ0F'¶
parameter = 'mq0f_ratio'¶
Filter VCF on MQ0 ratio.
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class MQ0Filter(args)[source]¶
Bases: phe.variant_filters.PHEFilterBase
Filter sites by MQ0 (Total Mapping Quality Zero Reads) to DP ratio.
Methods:
- __call__(record) - Filter a vcf.model._Record.
- call_concensus(record)
- customize_parser(parser)
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc()
name = 'MinMQ0'¶
parameter = 'mq0_ratio'¶
Filter VCF on MQ score.
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class MQFilter(args)[source]¶
Bases: phe.variant_filters.PHEFilterBase
Filter sites by Mapping Quality (MQ) score.
Methods:
- __call__(record) - Filter a vcf.model._Record.
- call_concensus(record)
- customize_parser(parser)
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc()
name = 'MinMQ'¶
parameter = 'mq_score'¶
Filter VCF on QUAL score.
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class QualFilter(args)[source]¶
Bases: phe.variant_filters.PHEFilterBase
Filter sites by QUAL score.
Methods:
- __call__(record) - Filter a vcf.model._Record.
- call_concensus(record)
- customize_parser(parser)
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc()
name = 'LowSNPQual'¶
parameter = 'qual_score'¶
Classes and functions for working with variant filters.
The PHEFilterBase class was designed to be implemented in order
to provide new filters for processing. As such it should be easy to extend
PHEFilterBase to achieve this.
If you are not familiar with Python and classes please read this. It is a general introduction to the subject.
Implementation of a filter closely follows the structure used by PyVCF.
PHEFilterBase adds additional information that is also added to the VCF header:
- PHEFilterBase.parameter - Attribute that allows the dynamic loader to pick up the filter. This is the name used in command line arguments and configs.
- PHEFilterBase._default_threshold - Default threshold used for the filter when none is specified.
- PHEFilterBase.short_desc() - Method for defining the short description dynamically.
- PHEFilterBase.__call__() - Function that does the filtering (the same as PyVCF).
The most important of these is the __call__ function, as this is what does the filtering. It needs to return the following:
- None if the record PASSES the filter.
- Non-None values are treated as failures.
Note
While it may not be obvious, Python always returns a value. When no return statement is specified, None is returned to the calling function.
False is returned when data cannot be accessed by the filter.
Once you have implemented the filter and saved it in the variant_filters directory, it should be automatically picked up by the system and available to filter on. To verify this run:
run_snp_pipeline.py --help
Your filter should now be visible in the list of available filters. For example:
class MyFilter(PHEFilterBase):
    name = "CoolQUAL"
    _default_threshold = 40
    parameter = "cool_qual"

    def __init__(self, args):
        # Construct any specific details for the object.
        super(MyFilter, self).__init__(args)
        self.threshold = self._default_threshold

    def __call__(self, record):
        # Return not None to fail the record; passing records
        # implicitly return None.
        if record.QUAL < self.threshold:
            return record.QUAL

    def short_desc(self):
        return "My cool QUAL filter"
| Date: | 24 Sep, 2015 |
|---|---|
| Author: | Alex Jironkin |
class PHEFilterBase(args)[source]¶
Bases: vcf.filters.Base
Base class for VCF filters.
Attributes:
- parameter - Short name of parameter being filtered.
Methods:
- __call__() - Filter a site, return not None if the site should be filtered.
- call_concensus(record)
- customize_parser(parser) - Hook to extend argparse parser with custom arguments.
- decode(filter_id) - Decode name of filter.
- filter_name() - Create filter names by their parameter separated by magic.
- get_config() - This is used for reconstructing filter.
- is_gap()
- is_n()
- is_uncallable(record) - Filter a vcf.model._Record.
- short_desc() - Short description of the filter (included in VCF).
decoder_pattern= <_sre.SRE_Pattern object>¶
-
filter_name()[source]¶ Create filter names from their parameter and threshold, separated by magic. E.g. if the filter parameter is ad_ratio and the threshold is 0.9 then ad_ratio:0.9 is the filter name.
-
magic_sep= ':'¶
-
parameter¶ Short name of parameter being filtered.
-
dynamic_filter_loader()[source]¶
Fancy way of dynamically importing existing filters.
Returns: dict
Available filters dictionary. Keys are parameters that can be supplied to the filters.
str_to_filters(filters)[source]¶ Convert a filter string to an array of filters. E.g. ad_ratio:0.9,min_depth:5
Parameters: filters: str
String version of filters, separated by comma.
Returns: list
List of phe.variant_filters.PHEFilterBase instances.
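For example, a sketch of building filter objects from the same string format used on the command line (the filter string is illustrative):

from phe.variant_filters import str_to_filters

# Parse the comma-separated key:threshold pairs into filter instances.
filters = str_to_filters("ad_ratio:0.9,min_depth:5")
for f in filters:
    print f.filter_name()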
phe.utils¶
| Date: | 3 June, 2016 |
|---|---|
| Author: | Public Health England |
class ParallelVCFReader(vcfs)[source]¶
Bases: object
Read multiple VCFs in parallel - one position at a time.
Methods:
- get_records() - Generator of records, one position at a time.
- get_samples() - Get the list of sample names from the VCFs.
- update([ids]) - Update records in the readers.
get_records()[source]¶ Generator of records, one position at a time. If a position has more than one record in any of the readers, a list of all records with that position is returned. The user can deal appropriately with the records list.
Returns: chrom: str
Chromosome
pos: int
Position
records: dict
Records in dictionary of lists for each sample.
{ "sample1": [record1, record2], "sample2": [record3] }
-
Module contents¶
Scripts¶
Useful scripts for data analysis.
phenix¶
There is a single entry point to access all the commands we have produced so far. Please refer to this documentation or --help on the command line.
usage: phenix [-h] [--debug] [--version]
{run_snp_pipeline,filter_vcf,prepare_reference,vcf2fasta} ...
- Options:
--debug=False - More verbose logging (default: turned off).
--version - Show program’s version number and exit.
- Sub-commands:
- run_snp_pipeline
Run SNP pipeline.
Run the snp pipeline with specified mapper, variant caller and some filters. Available mappers: [‘bwa’, ‘bowtie2’] Available variant callers: [‘mpileup’, ‘gatk’] Available filters: [‘gq_score’, ‘dp4_ratio’, ‘ad_ratio’, ‘min_depth’, ‘mq_score’, ‘mq0_ratio’, ‘uncall_gt’, ‘qual_score’, ‘mq0f_ratio’] Available annotators: [‘coverage’]
usage: phenix run_snp_pipeline [-h] [--workflow WORKFLOW] [--input INPUT] [-r1 R1] [-r2 R2] [--reference REFERENCE] [--sample-name SAMPLE_NAME] [--outdir OUTDIR] [--config CONFIG] [--mapper MAPPER] [--mapper-options MAPPER_OPTIONS] [--bam BAM] [--variant VARIANT] [--variant-options VARIANT_OPTIONS] [--vcf VCF] [--filters FILTERS] [--annotators ANNOTATORS [ANNOTATORS ...]] [--keep-temp]
- Options:
--workflow, -w - Undocumented
--input, -i - Undocumented
-r1 - R1/Forward read in Fastq format.
-r2 - R2/Reverse read in Fastq format.
--reference, -r - Reference to use for mapping.
--sample-name=test_sample - Name of the sample for mapper to include as read groups.
--outdir, -o - Undocumented
--config, -c - Undocumented
--mapper=bwa, -m=bwa - Available mappers: [‘bwa’, ‘bowtie2’]
--mapper-options - Custom mapper options (advanced)
--bam - Undocumented
--variant=gatk, -v=gatk - Available variant callers: [‘mpileup’, ‘gatk’]
--variant-options - Custom variant options (advanced)
--vcf - Undocumented
--filters - Filters to be applied to the VCF in key:value pairs, separated by comma (,). Available filters: [‘gq_score’, ‘dp4_ratio’, ‘ad_ratio’, ‘min_depth’, ‘mq_score’, ‘mq0_ratio’, ‘uncall_gt’, ‘qual_score’, ‘mq0f_ratio’]
--annotators - List of annotators to run before filters. Available: [‘coverage’]
--keep-temp=False - Keep intermediate files like BAMs and VCFs (default: False).
- filter_vcf
Filter a VCF.
Filter the VCF using provided filters.
usage: phenix filter_vcf [-h] --vcf VCF [--filters FILTERS | --config CONFIG] --output OUTPUT [--reference REFERENCE] [--only-good]
- Options:
--vcf, -v - VCF file to (re)filter.
--filters, -f - Filter(s) to apply as key:threshold pairs, separated by comma.
--config, -c - Config with filters in YAML format (a filters: mapping of key: value pairs).
--output, -o - Location for filtered VCF to be written.
--reference, -r - mpileup versions <= 1.3 do not output all positions; this is required to fix the reference base in the VCF.
--only-good=False - Write only variants that PASS all filters (default: all variants are written).
- prepare_reference
Create aux files for reference.
Prepare reference for SNP pipeline by generating required aux files.
usage: phenix prepare_reference [-h] --reference REFERENCE [--mapper MAPPER] [--variant VARIANT]
- Options:
--reference, -r - Path to reference file to prepare.
--mapper - Available mappers: [‘bwa’, ‘bowtie2’]
--variant - Available variants: [‘mpileup’, ‘gatk’]
- vcf2fasta
Convert VCFs to FASTA.
Combine multiple VCFs into a single FASTA file.
usage: phenix vcf2fasta [-h] (--directory DIRECTORY | --input INPUT [INPUT ...]) [--regexp REGEXP] --out OUT [--with-mixtures WITH_MIXTURES] [--column-Ns COLUMN_NS] [--column-gaps COLUMN_GAPS] [--sample-Ns SAMPLE_NS] [--sample-gaps SAMPLE_GAPS] [--reference REFERENCE] [--include INCLUDE | --exclude EXCLUDE] [--with-stats WITH_STATS] [--tmp TMP]
- Options:
--directory, -d - Path to the directory with .vcf files.
--input, -i - List of VCF files to process.
--regexp - Regular expression for finding VCFs in a directory.
--out, -o - Path to the output FASTA file.
--with-mixtures - Specify this option with a threshold to output mixtures above this threshold.
--column-Ns - Keep columns with a fraction of Ns below the specified threshold.
--column-gaps - Keep columns with a fraction of gaps below the specified threshold.
--sample-Ns - Keep samples with a fraction of Ns below the specified threshold.
--sample-gaps - Keep samples with a fraction of gaps below the specified threshold.
--reference - If a path to a reference (FASTA) is specified, then the whole genome will be written.
--include - Only include positions in the BED file in the FASTA.
--exclude - Exclude any positions specified in the BED file.
--with-stats - If a path is specified, then the positions of the output SNPs are stored in this file. Requires numpy and matplotlib.
--tmp - Location for writing temp files (default: /tmp).
Galaxy¶
This tool is available for installation to your local Galaxy [http://galaxyproject.org] from the Toolshed [https://toolshed.g2.bx.psu.edu/]. This section describes in some detail how this is done and how the Galaxy tool can be used to process data.
Primer¶
The Phenix Galaxy tool only enables you to filter VCF files and to convert VCF files into FASTA files. The mapping and SNP calling components of Phenix need to be provided by other Galaxy tools (also available from the Toolshed).
How to get your own Galaxy server:¶
If you already have a local Galaxy server on which you are an Administrator and on which you can install tools from the Toolshed, you can skip this section and continue reading at “How to install Phenix to your Galaxy server”. If not, you might want to get one, which is easy.
Prerequisites:
- A workstation running a Linux operating system
- Min. 4GB of RAM (better 8GB or more)
- Root access to that machine.
- git (optional)
- Python 2.6 or 2.7
- GNU Make, gcc to compile and install tool dependencies
The last three are standard on most contemporary Linux installations.
Get your own Galaxy:
Please go to [https://wiki.galaxyproject.org/Admin/GetGalaxy] and follow the instructions in the sections from “Get the Code” to “Become an Admin” (including).
How to install Phenix to your Galaxy server:¶
This page (https://wiki.galaxyproject.org/Admin/Tools/AddToolFromToolShedTutorial) describes the general procedure of installing a tool from the toolshed. Here is a click-by-click guide to what you need to do to install Phenix.
Prerequisites:
There is a small number of Linux packages that need to be installed on your machine so that Phenix can be installed and run properly. These are:
- a C++ compiler
- a Fortran compiler
- the zlib development libraries
- the samtools package
The exact names of these packages and the commands to install them depend on your Linux distribution. For Ubuntu, Mint and Debian Linux the required commands should be:
sudo apt-get install build-essential
sudo apt-get install gfortran
sudo apt-get install zlib1g-dev
sudo apt-get install samtools
On Fedora the commands should be:
sudo dnf install gcc-c++
sudo dnf install gcc-gfortran
sudo dnf install zlib-devel
sudo dnf install samtools
And on OpenSUSE the commands should be:
sudo zypper install gcc-c++
sudo zypper install gcc-fortran
sudo zypper install zlib-devel
sudo zypper install samtools
For more esoteric Linux distributions please refer to your local IT expert and/or Google. After you have successfully installed these packages, follow the instructions below.
- Make sure that you can access FTP sites from the command line on the machine running your Galaxy server. This is normally enabled by default, but sometimes requires an additional proxy setting. Try this command:
wget ftp://ftp.gnu.org/gnu/libtool/libtool-2.4.tar.gz
If that does not download a file named ‘libtool-2.4.tar.gz’ to your current folder, speak to your local systems administrator.
- In your config/galaxy.ini file, set
tool_dependency_dir = tool_dependencies
- Restart your Galaxy
- In your Galaxy, click on ‘Admin’ in the main menu at the top.
- Select ‘Search Tool Shed’ from the menu on the left hand side.
- Click on the little black triangle (the context menu) next to ‘Galaxy Main Tool Shed’ and select ‘Browse valid repositories’.
- Type ‘phephenix’ [sic] into the search box and press Enter.
- You should see the “package_phephenix_1_0” and the “phephenix” repositories. Select ‘Preview and install’ from the context menu of the latter.
- Click ‘Install to Galaxy’.
- Type ‘PHE TOOLS’ into the ‘Add new tool panel section:’ textbox.
- Click ‘Install’.
- You will be presented with a long list of packages that need to be installed. This will take a while. Wait until everything is green. If nothing happens for a little while, try reloading the page.
In your admin tool panel, the menu item “Manage installed tools” has been added. You can check the status of your installed packages there.
You need to install two more tools that are part of the Phenix workflow:
- Browse the toolshed as before and look for ‘bwa’. Install the tool ‘bwa’ with synopsis “Wrapper for bwa mem, aln, sampe, and samse” owned by ‘devteam’. Put it into the ‘PHE TOOLS’ tool panel section.
- Browse the toolshed as before and install the tool ‘phe_samtools_mpileup’. Put it into the ‘PHE TOOLS’ tool panel section.
How to use Phenix on Galaxy:¶
- In the admin menu, go the ‘Manage installed tools’
- Find the ‘phephenix’ tool and click on it, not on its context menu.
- In the ‘Contents of this repository’ box at the bottom of the page, expand the workflows section using the little blue triangle if necessary.
- Click on ‘Phenix workflow’ and behold an image representation of the workflow.
- Click on ‘Repository actions’ -> ‘Import workflow to Galaxy’
The Phenix workflow is now ready to use. You need to upload your data to a Galaxy history to use it. There are multiple options for this, depending on your local Galaxy configuration. If you have followed the instructions above under ‘Get your own Galaxy’, the only available option is uploading from your local hard drive. When doing this, please make sure your fastq files are in the ‘fastqsanger’ format and your reference genome is in ‘fasta’ format. The basename of your R1 fastq files will be used as the name for the resulting bam and vcf files, with appropriate file endings, and will also appear as the sequence header in your resulting SNP alignment. Once your data is ready to use, follow these instructions to run the workflow.
Remark: Processing a single sample with the Phenix workflow can use up to 1.5GB of RAM. It is recommended that you do not run more samples than your total system memory divided by 1.5 (2 for 4GB, 5 for 8GB, ...), or only as many samples as you have processor cores, whichever is lower.
- Click on workflow in the top main menu and select ‘run’ from the ‘Phenix workflow’ context menu.
- Select your reference genome in the ‘Reference fasta’ selection box.
- Click both “paper stack” icons next to ‘R1 fastqs’ and next to ‘R2 fastqs’.
- Select all R1.fastq files in the “R1 fastqs” selection box that appeared. An easy way to do this is to type ‘R1’ into the text box immediately under the selection box. The selection box will automatically be filtered for fastq file names that match what has been typed into the text box.
- Do as above for the “R2 fastq” box.
Note
Make sure that the order of highlighted files in the R1 and R2 fastq selection boxes correspond, i.e. the first highlighted file in the top box will be run together with the first highlighted file in the bottom box, the second with the second, and so on. Check that the highlighted files at the same rank in both boxes are always matching pairs from the same sample.
- If you are happy with the above, click ‘Run workflow’.
- Wait until three new history items per processed sample appear. These will be one bam file, one raw vcf, and one filtered vcf per sample.
- Click on PHE TOOLS in the tool panel on the left hand side and select the “VCFs to fasta” tool. This tool will create a multi-fasta file from your filtered vcfs.
- Select all files ending with filtered.vcf in the top file selection box under ‘Input VCF files(s)’, by holding down the Ctrl key.
- Click ‘Execute’. Once everything is completed, the “VCFs to fasta” tool will have produced your SNP alignment, which can now be used for further processing.