ProDuSe: Process Duplex Sequence data

Authors:
Marco Albuquerque
Christopher Rushton
Contact:ckrushto@sfu.ca.
Source Code:Github
License:GNU General Public License v3.0

ProDuSe is a Python command-line variant caller designed for use on Illumina sequencing data obtained from libraries constructing using barcoded adapters. Using these adapters, read duplicates are identified and collapsed into a single consensus sequence, correcting PCR and sequencing errors. In samples with a high duplication rate, this allows extremely rare variants be be called confidently.

ProDuSe Commands

Config

Configures output directories, input and output file paths for each step of the ProDuSe pipeline, and configuration files for each script.

produse_config.py performs the following tasks:

  • Creates a main output directory (produse_analysis_direcetory) in the specified directory
  • Creates a log file listing the supplied arguments in the main output directory
  • Creates a Makefile in the main output directory (Depreciated, has no purpose)
  • (If necessary) Creates normal and bwa indexes for the designated reference genome

For each sample supplied:

  • Creates a sample-specific directory and subdirectories in the main output directory using the sample’s name
  • Symbolically Links input files into this sample-specific directory
  • Creates a config file for each step in the ProDuSe pipeline, containing sample-specific parameters

Run Using

produse configure_produse

or

python /path/to/ProDuSe/ProDuSe/configure_produse.py

Parameters

-f –fastqs:
If running a single sample, the fastq files for the sample to be analyzed.
Two arguments must be supplied.
A single sample directory names ‘Sample’ will be created.
Mutually exclusive with -sc –sample_config.
-sc –sample_config:
 
A configuration file listing sample names, fastq locations, and sample-specific parameters.
A directory will be created for each sample, using the name supplied.
Mutually exclusive with -f –fastqs.
-d –output_directory:
 The directory to output ProDuSe results. Default is the current working directory.
-r –reference:Reference genome build, in fasta format.
-c –config:A configuration file listing parameters to be supplied to each stage of the ProDuSe pipeline.

Additional Considerations

While indexes will be automatically generated if none are present in the reference genome directory, this make take a significant amount of time.

Trim

Purpose

Trims adapter sequence from barcoded reads, and saves this information in the read name. Reads which do not fall within barcode range and mismatch threshold are discarded.

Run Using

produse trim

or

python /path/to/ProDuSe/ProDuSe/trim.py

Parameters

-c, –config:A configuration file which can provide any of the following arguments
-i –input:A fastq file to be trimmed. This argument must be specified twice. This file may be gzipped.
-o –output:
An output file for writing trimmed reads. This argument must be specified exactly twice.
These files can automatically be gzipped by appending ‘.gz’ to the output file name.
The output file order corresponds with the input file order.
-as –adapter_sequence:
 The barcode sequence range, described using IUPAC bases.
-ap –adapter_position:
 
The positions in the adapter sequence to consider when comparing reference (i.e. -as) and actual adapter sequences
0 = Do not consider this position
1 = Consider this position
Note that the entire adapter sequence will be trimmed, even if a position is labeled 0.
–mm –max_mismatch:
 The maximum number of mismatches allowed between the reference and actual adapter sequences.
-v:
Instead of discarding reads with mismatching adapter sequences and saving all other read, do the opposite (save mismatching reads, discard all others).
Useful for debugging.
-u:Do not trim the adapter sequence.

Helpful tips

If the read discard rate is extremely high, check the supplied adapter sequence.

Additional Notes

If the supplied fastqs are multiplexed, a single sample can be extracted if no other samples use barcoded adapters, or if the barcoded adapter sequences between are distinct enough (i.e. the difference between the two barcode sequences exceeds the maximum mismatch threshold). Note that there may be some spillover if non-barcoded reads start with a sequence that falls within the barcode sequence range by chance, or if the differences between barcoded sequences is only slightly higher than the maximum mismatch threshold.

bwa

Maps supplied reads to a reference genome using the Burrows-Wheeler aligner. Sorts and converts the resulting SAM file into a sorted BAM file. Note that, for all intents and purposes, this script is simply a python wrapper around bwa, with some functionality restrictions.

Run Using

produse align

or

python /path/to/ProDuSe/ProDuSe/bwa.py

Parameters

-c –config:A configuration file which can supply any of the parameters described below
-i –input:A fastq file containing the reads to be mapped. This option must be provided exactly twice. These files may be gzipped.
-o –output:Path and name of the output BAM file.
-t –threads:Number of threads to use while running BWA.
-r –reference:Reference genome, in fasta format. BWA indexes should exist in the same parent directory. If these do not exist, use bwa index to generate them.
-R –readgroup:Read group information to add to the BAM file header. Useful for running some exterior tools, such as Picard.

Additional Considerations

To keep the terminal as clean as possible, many bwa output messages are suppressed. This can inadvertently include some error messages. If bwa.py exits abruptly, try running bwa directly to determine the cause of the failure.

Collapse

Identifies the start position, adapter sequence, and mapping strand for each read in the supplied BAM file. If two or more reads share the same start position, mapping strand, and adapter sequence (within mismatch tolerance), they are merged into a single consensus sequence. If there is a mismatch at a given position, the most common base is used as a consensus. The quality of each base set to the highest quality base at that position. If an individual read contains too many mismatches, it is discarded prior to collapsing.

Run Using

produse collapse

or

python /path/to/ProDuSe/ProDuSe/collapse.py

Parameters

-c –config:A configuration file which can provide any of the parameters below
-i –input:Input BAM file. The name of each read must contain the adapter sequence.
-o –output:Path and name of output collapsed fastq files. This parameter must be specified exactly twice. The output fastqs can be gzipped automatically by appending ”.gz” to the output name.
-sp –strand_position:
 The positions in the adapter sequence to use when comparison adapter sequences for reads of the same type (i.e. between forward reads, or between reverse reads). 1=Use this position, 0=Do not use this position.
-dp –duplex_position:
 The positions in the adapter sequence to use when comparing adapter sequences for reads of opposing types (i.e. forward vs reverse reads). 1=Use this position, 0=Do not use this position.
-amm –adapter_max_mismatch:
 The maximum number of mismatches allowed between the expected and actual adapter sequences when comparing reads of the same type (See -sp).
-dmm –duplex_max_mismatch:
 The maximum number of mismatches allowed between the expected and actual adapter sequences when comparing molecules of different types (See -dp).
-smm –sequence_max_mismatch:
 The maximum number of mismatches allowed in an individual read before it is discarded. This threshold should be adjusted based upon read length.
-oo –original_output:
 Path and name of fastq files to write original (i.e. pre-collapse) reads. Reads exceeding mismatch thresholds will still be discarded. This option must be be specified exactly twice, or not at all. These fastqs can be gzipped automatically by appending ”.gz” to the output name.
-sf –stats_file:
 Path and name of a text file to store collapsing statistics.

Additional Considerations

The runtime of this script depends not only on the absolute number of reads, but the proportion of reads which are duplicates. BAM files with high duplicate rates will take significantly longer than BAM files with a lower duplicate rate.

SplitMerge

Processes each read in the Stitched BAM file. Unstitched reads are simply saved to the output BAM file. If a read was stitched, the bases unique to the original forward and reverse read are identified and assigned to a replacement forward and reverse read. Bases shared between the two sequences are assigned to only one of the paired reads, to remove overlap. In addition, if Stitcher misaligned a portion of the Stitched sequence, the correct start position of each read is determined, and an INDEL is added as necessary.

Run Using

produse split_merge

or

python /path/to/ProDuSe/ProDuSe/ProDuSe/SplitMerge.py

Parameters

-c –config:An optional configuration file which can provide any of the arguments described below.
-i –input:A BAM file containing reads merged (stitched) by Stitcher. This file must be sorted by read name.
-u –unstitched_input:
 A BAM file containing reads immediately before they were merged by stitcher. If running the ProDuSe pipeline, this will be the collapsed BAM file. This file must be sorted by read name.
-o –output:Path and name of the output BAM file.

Note

Input files must be sorted by read name, not position.

Additional Considerations

While the vast majority (>99.99%) of stitched reads will be successfully separated, some reads with very complicated stitching will be discarded. Note that while this step will identify some additional INDELs that bwa did not identify due to their size, ProDuSe does not currently call INDELs.

SNV

Identifies the nucleotides that overlap each position in the BAM file. All reads formed by a consensus less than the min_read_per_uid threshold are ignored. The type of each nucleotide at a position are assigned as follows:

  1. If a read is in duplex, its’ bases are assigned as duplex supported bases (D). Otherwise, they are assigned as singletons (S)
  2. For each stand (+ and -), if a consensus read was generated by at least strong_supported_base_threshold number of original reads, that base is assigned as a strong base (P and N). Otherwise, they are assigned as weak bases (p and n).

Run Using

produse snv

or

python /path/to/ProDuSe/ProDuSe/snv.py

Parameters

-c –config:An optional configuration file which can specify any of the arguments described below
-i –input:An input BAM containing collapsed reads. This file should be sorted by read position.
-o –output:Path and name of an output VCF file containing candidate variants.
-s –molecule_stats:
 Path and file name of an output tsv file containing position depth and molecule abundance information.
-r –reference:Reference genome, in FASTA format. An index should be present in the same directory
-tb –target_bed:
 A tab-delineated BED3 file in which to restrict variant calling.
-vaft –variant_allele_fraction_threshold:
 Minimum variant allele fraction to call a variant on each strand.
-mo –min_molecules:
 Total molecules (i.e. actual depth) required to call a variant at this position. This threshold should be adjusted if you are restriction to sites expected to be mutated.
-mum –mutant_molecules:
 Minimum number of supporting molecules on both strands required to call a variant.
-mrpu –min_reads_per_uid:
 Each consensus read must have been formed from at least this number of original reads to use this read in variant calling.
-ssbt –strong_supported_base_threshold:
 Bases from a consensus read collapsed from at this number of original reads will be classified as a strong supporting base.
-eds –enforce_dual_strand:
 Only variants with support on both the positive and negative strand will be called.
-mq –min_qual:Bases from weak supporting molecules with less than this quality threshold will be treated as an “N”.

Filter

Filters candidate variants called by ProDuSe based upon the following characteristics:

  • Minimum molecule counts: The number of molecules supporting a variant at a given position must deviate from the number expected from simple noise in order to call a variant.
  • Molecule type: Stronger (duplex, strong bases) molecules will be examined for support (or lack of support) for a variant first. If too many strong molecule do not support a variant when support is expected, the variant will be discarded.
  • (If designated) Dual strand support: Both the positive and negative strand must support a variant confidently before it is called.

Run Using

produse filter

or

python /path/to/ProDuSe/ProDuSe/filter_produse.py

Parameters

-c –config:A configuration file which can supply any of the parameters described below.
-i –input:Input VCF file, listing candidate variants and molecule counts.
-m –molecule_stats:
 A tab-delineated file listing total depth and different molecule abundances across the entire capture space. Generated by snv.py.
-o –output:Path and name to use for the output VCF file.
-sb –strand_bias_threshold:
 Strand bias p-value threshold. If the strand bias of a variant is below this value, the variant will be discarded.
-st –strong_base_threshold:
 Minimum number of strong bases required to call a variant. Note that this threshold will increase depending on locus and overall library characteristics for each molecule type.
-wt –weak_base_threshold:
 Minimum number of weak bases required to call a variant. Note that this threshold will increase depending on locus and capture space characteristics for each molecule type.

Additional Considerations

If the duplicate rate of a library is extremely low, variants will be called based upon weak bases. Thus, a large number of false positive calls may be generated.

Other Commands

Adapter Predict

Predicts the barcoded adapter sequence range used in a given file.

Run Using

produse adapter_predict

or

python /path/to/ProDuSe/ProDuSe/adapter_predict.py

Parameters

-i –input:Paired fastq files. Two files must be specified in this argument.
-m –max_adapter_length:
 Limit the adapter sequence prediction to this length

Additional Considerations

Warning

Ensure your samples are de-multiplexed prior to running Adapter Predict. The adapter sequence is estimated from ALL reads in the fastq files.