ProDuSe: Process Duplex Sequence data

Authors:
Christopher Rushton
Marco Albuquerque
Contact:ckrushto@sfu.ca.
Source Code:Github
License:GNU General Public License v3.0

ProDuSe is a variant caller designed for use with libraries prepared and sequenced using semi-degenerate barcoded adapters. These barcodes allow PCR and optical duplicates which were derived from the same starting molecule to be flagged and merged into a single consensus sequence. The addition of a strand-specific tag also allows molecules which are derived from different parental strands to be flagged, allowing variants at extremely low frequency (VAF~=0.1%) to be called confidently.

Main Commands

Run ProDuSe

Purpose

Wrapper script which runs each step of the ProDuSe Pipeline sequentially on each sample. Can process multiple samples simultaneously.

Run Using

produse run_produse

or

/path/to/ProDuSe/ProdusePipeline.py

Parameters

General Parameters for running analysis

-c, –config:An INI configuration file, which can specify any of the following arguments. Any arguments passed at the command line override those specified in the configuration file, which override defaults.
–fastqs:A pair of filepaths to a set of paired-end FASTQ files. Mutually exclusive with -sc–sample_config.
-sc, –sample_config:
 An INI configuration file, specifying one or more samples to analyze. This file can also provide sample-specific arguments.
-d, –outdir:Base output directory for intermediate files and results. –directory_name will be created inside this directory. Default is the current working directory.
-r, –reference:A filepath to a reference genome FASTA file. Ideally, BWA and .fai indexes should be present in the same directory. If they are not, they will be generated automatically.
-j, –jobs:Number of samples to analyze in parallel. Use 0 or a negative number to process as many samples as possible simultaneously. Default is 1.

Additional Analysis Parameters

–bwa:A filepath to a bwa executable. Default is $(which bwa).
–samtools:A filepath to a samtools executable. Default is $(which samtools)
–directory_name:
 Name of the directory to create inside -d/–outdir to store intermediate files and results. Default is “produse_analysis_directory”.
–append_to_directory:
 If –directory_name already exists inside -d/–outdir, place the intermediate files and results for this analysis inside this directory. If any samples have the same name as those inside –directory_name, they will not be analyzed.
–cleanup:Following analysis, remove all files present in the “tmp” directory of each sample

Barcode Trimming Parameters

-b, –barcode_sequence:
 The sequence of the degenerate barcode, specified in IUPAC bases.
-p, –barcode_position:
 Positions in the -b/–barcode sequence to use when comparing expected and actual barcode sequences (1=Yes, 0=No). The entire barcode will still be trimmed, regardless of which positions are flagged with 0.
-mm, –max_mismatch:
 Maximum number of positions (specified using -p/–barcode_position) which can fall outside the expected degeneracy range before the read pair is discarded. This is the total between both the forward and reverse read.
–trim_other_end:
 Examine the end of the read for the presence of a barcode (for example, in the case of read-through). Will not remove partial barcodes

Family Collapsing Parameters

-fm, –family_mask:
 Barcode positions to consider when determining if two reads are members of the same family (1=Yes, 0=No).
-fmm, –family_mismatch:
 Maximum number of mismatched positions (specified using -fm/–family_mask) permitted before two reads are considered members of different families.
-dm, –duplex_mask:
 Barcode positions to consider when determining if two families are in duplex (1=Yes, 0=No).
-dmm, –duplex_max_mismatch:
 Maximum number of mismatched positions (specified using -dm/–duplex_mask) permitted before two families are not considered a duplex
-t, –targets:A filepath to a BED file listing the capture regions of interest. Read pairs that do not overlap these positions will be discarded
–tag_family_members:
 Store the name of each read incorporated into a family in the read tag “Zm”

Filtering Parameters

-f, –filter:A filepath to a pickled Random Forest Classifier, used to filter variants. Can be generated using train

Note

If no -b/–barcode is specified for one or more samples, adapter_predict will be run on those samples automatically

Pipeline Stages

This command will run the following stages of the ProDuSe Pipeline sequentially on each sample:

  • adapter_predict (Estimates the degenerate barcode sequence, only run if no barcode is specified)
  • Trim (Trim Barcodes)
  • Align reads to reference (Burrows-Wheeler Aligner)
  • Collapse (Identify and merge duplicate reads into a consensus)
  • ClipOverlap (Identifies positions which overlap between read pairs, and generates a consensus)
  • Call (Flag and filter variants)

Configuration Files

In lieu of specifying all arguments at the command line, arguments can be specified in either the sample configuration file, or the main produse configuration file. This approach is recommended, as it improves reproducibility between runs.

If a single argument is specified multiple times, the following orders of precedence apply:

  1. -sc/–sample_config file
  2. Command line
  3. -c/–config file

Arguments specified in the -sc/–sample_config file only apply to the specified sample. More information on configuration files can be found here.

Directory Layout

When run_produse is called, it creates a directory structure for each sample inside of –directory_name, as follows:

directory_name
        Sample_Name
                config
                tmp
                results
        ProDuSe_Task.log

The contents of each folder are as follows:

config:Stores configuration files for each step of the pipeline, and files indicating when a pipeline stage is complete
tmp:Stores intermediate files (ex. Trimmed FASTQ files, raw BAM files)
results:Stores the final BAM file and variant calls

All parameters used to run a given instance, as well as software versions, are specified in ProDuSe_Task.log

Adapter Predict

Predicts the barcoded adapter sequence range used in a given file.

Run Using

produse adapter_predict

or

python /path/to/ProDuSe/ProDuSe/adapter_predict.py

Parameters

-i –input:Paired fastq files. Two files must be specified.
-m –max_adapter_length:
 Limit the adapter sequence prediction to this length

Additional Considerations

Warning

Ensure your samples are de-multiplexed prior to running Adapter Predict. The adapter sequence is predicted from ALL reads in the fastq files.

Trim

Purpose

Trims the barcode sequence from each read, and stores the barcode in a FASTQ tag Reads where the barcode does not fall within specified barcode range and mismatch threshold are discarded.

Run Using

produse trim

or

python /path/to/ProDuSe/ProDuSe/Trim.py

Parameters

-c, –config:A configuration file which can provide any of the following arguments. See the config page for more details.
-i –input:Two FASTQ files to be trimmed. These files may be gzipped.
-o –output:
Two output FASTQ files
These files can automatically be gzipped by appending ‘.gz’ to the output file name.
The output file order corresponds with the input file order (i.e -i read1.fastq read2.fastq -> -o read1.out.fastq read2.out.fastq)
-b –barcode_sequence:
 The barcode sequence range, described using IUPAC bases. Can be determined using adapter_predict
-p –barcode_position:
 
The positions in the adapter sequence to consider when comparing reference (i.e. -b) and actual adapter sequences
0 = Do not consider this position
1 = Consider this position
Note that the entire barcode sequence will be trimmed, even if a position is labeled 0.
–mm –max_mismatch:
 The maximum number of mismatches allowed between the reference and actual adapter sequences before a read pair is discarded
—reverse:
Instead of discarding reads with mismatching barcode sequences, do the opposite (save mismatching reads, discard all others).
Useful for debugging.
–no_trim:Do not trim the adapter sequence. Only store the adapter sequence a read tag.
–trim_other_end:
 
Examine the other end of the read for barcode sequences as well.
Will not remove partial barcodes
Useful when read lengths are significantly longer than the median fragment length

Helpful Tips

If the read discard rate is extremely high, check the supplied barcode sequence.

Additional Notes

If the supplied fastqs are multiplexed, a single sample can be extracted if no other samples use barcoded adapters, or if the barcoded adapter sequences between are distinct enough (i.e. the difference between the two barcode sequences exceeds the maximum mismatch threshold). Note that there may be some spillover if non-barcoded reads start with a sequence that falls within the barcode sequence range by chance, or if the differences between barcoded sequences is only slightly higher than the maximum mismatch threshold.

Collapse

Identifies the start position, barcode sequence, and mapping strand for each read pair in the supplied BAM file. If both reads in one or more reads share the same start position, mapping strand, and barcode sequence (within mismatch tolerance), they are flagged as a “family”, and merged into a single consensus sequence. If family members disagree at a given position, the most common base is used as a consensus. In the case of a tie, the base with the highest aggregated quality score across all family members is used. The quality of each base set to the highest quality base at that position.

Run Using

produse collapse

or

python /path/to/ProDuSe/ProDuSe/Collapse.py

Parameters

-c –config:A configuration file which can supply any of the arguments below. See the config page for more details.
-i –input:Input SAM/CRAM/BAM file. Each read must contain a read tag which stores adapter sequences
-o –output:Output SAM/CRAM/BAM file containing collapsed reads (Use “-” for stdout). The output file format will be chosen based upon the supplied file extension, or if “-” is used, will be the same as the input file format. Will be unsorted.
-fm –family_mask:
 
Positions in the barcode sequence to use when comparing barcode sequences between reads which originate from the same parental strand.
1=Use this position, 0=Do not use this position.
-dm –duplex_mask:
 
The positions in the adapter sequence to use when comparing adapter sequences for reads of opposing types (i.e. forward vs reverse reads).
1=Use this position, 0=Do not use this position.
-fmm –family_max_mismatch:
 The maximum number of mismatches allowed between the barcode sequence of two read pairs before two read pairs are considered members of different families (See -fm).
-dmm –duplex_max_mismatch:
 The maximum number of mismatches allowed between the barcode sequence of two families before they are considered as not in duplex (See -dm).
-r –reference:Reference genome, in FASTA format. Must be the same genome version that the reads were aligned against.
-t –targets:A BED3 file or better listing regions of interest. Any read pairs which fall entirely outside these regions will be discarded
–tag_family_members:
 Store the original name of all reads which were incorporated into a family in the read tag “Zm”

Additional Considerations

The runtime of Collapse depends not only on the absolute number of reads, but the proportion of reads which are duplicates. BAM files with high duplicate rates will take significantly longer than BAM files with a lower duplicate rate.

Currently, this version of Collapse does not perform local realignment of soft-clipped regions.

ClipOverlap

Description

Identifies all positions which overlap inside of a given read pair. If any bases overlap, a consensus is generated and that consensus is assigned to only one read in the read pair.

Run Using

produse clip

or

/path/to/ProDuSe/Clone/ProDuSe/ClipOverlap.py

Parameters

-c –config:A configuration file which can provide any of the following arguments. See the config page for more info.
-i –input:An input SAM/BAM file (use “-” to read from standard in). Does not need to be sorted.
-o –output:An output SAM/BAM file in which to write clipped reads (use “-” to write to stdout). Will be UNSORTED. The file type is determined from the file extension, or the input file type if stdout is specified.
–tag_origin:Add a read tag indicating which read a consensus base originated. S=Both reads agree.

Warning

The output BAM file will be unsorted, even if the input BAM file is sorted.

Helpful Tips

The output BAM file can be sorted easily by pipeline the output to “samtools sort”. For example:

produse clip -i inFile.bam -o - | samtools sort > outFile.bam

Additional Info

Any overlap between read pairs is determined solely using the alignments in the BAM file: No realignment is performed. If the bases at a position disagrees between a read pair, the base with the highest mapping quality is used. In the case of a tie, the base from the read which starts later is used.

When obtaining a consensus between INDELs, the read with the lowest number of INDELs in the overlapping region is used as the consensus. No realignment is performed.

The consensus overlap is assigned to either read 1 or read 2. For the other read, the positions coresponding to the consensus are soft clipped.

Warning

Overlap between read pairs is removed via soft-clipping. This may cause problems for some structural variant callers which examine soft-clipped bases

Call

Identifies all possible positions which support an alternate allele (no matter how weakly) across the entire capture space. These variants are then filtered using a random forest filter, based upon numerous characteristics

Run Using

produse call

or

python /path/to/ProDuSe/ProDuSe/Call.py

Parameters

-c –config:An optional configuration file which can specify any of the arguments described below
-i –input:An input BAM containing collapsed (and ideally clipped) reads. This file must be sorted by position.
-o –output:Output VCF file containing filtered variants
-u, –unfilted:Output VCF file containing ALL possible variants
-r –reference:Reference genome, in FASTA format. A reference index should be present in the same directory
-t –target_bed:A BED3 file specifying a capture space to restrict variant calling
-f, –filter:A piclke containing a tranined Random Forest Classifier

Additional Commands

Resume ProDuSe

Description

Restarts a previously terminated instance of run_produse

Run Using

produse resume_produse

or

/path/to/ProDuSe/Clone/ProDuSe/ResumePipeline.py

Parameters

-d –produse_dir:
 Path to the base ProDuSe analysis directory (usually named produse_analysis_directory)
-j –jobs:Number of samples to process in parallel

Additional Information

run_produse automatically generates <task>_Complete file when each pipeline component is completed for each sample. These files are placed in the “config” directory of the coresponding sample. This script identifies samples in which not all <task>_Complete files have been generated, and resumes the analysis from there.

Note

If you wish to re-run a stage of the pipeline, simply remove the coresponding <task>_Complete file

Update Config

Description

Coverts an older produse config file to a format compatible with this version of ProDuSe

Run Using

produse update_config

or

python /path/to/ProDuSe/ProDuSe/UpdateConfig.py

Parameters

-i/–input:The configuration file to be reformatted
-o/–output:Path to the new, reformatted configuration file

Additional Info

Duplicate parameters are not permitted in config files. If two parameters specify the same argument, one will be removed automatically. If they specify different arguments, they will both be retained. One of the arguments must be removed manually.

Any parameters with no arguments are removed automatically

Train

Description

Trains ProDuSe’s variant calling filter using one or more sets of validated variants. A random forest classifier will be trained using this data

Run Using

produse train

or

/path/to/ProDuSe/Clone/ProDuSe/Train.py

Parameters

-c –config:A configuration file which can provide any of the following arguments. See the config page for more details.
-b –bam:One or more post-clipoverlap BAM files, which were used for variant filtering and validations. Must be specified in an order that coresponds to the order specified by -v/–validations
-v –validations:
 One or more VCF files listing validated variants. The number and order of file must corespond to files specified in -b/–bam.
-o –output:Output file which will store the random forest classifier
-r –reference:Reference genome, in FASTA format. An index should also be present in the same directory.
-t –targets:Optional. One or more BED files specifying regions in which to restrict candidate variant identification. Must be specified in an order coresponding to the order specified in -b/–bam.

Additional Information

For each sample provided, ProDuSe will identify all candidate variants in a manner identical to Call. If a candidate variant is flagged in the validation VCF file with VALIDATED=TRUE, it will be considered a true variant, while variants flagged with VALIDATED=UNKNOWN will be ignored completely. All remaining variants will be flaged as artifacts. False variants will be randomly subset so there are the same number of false and true variants prior to training the variant filter.