SVanalyzer
SVbenchmark
SVbenchmark compares a set of “test” structural variants in VCF format to a known truth set (also in VCF format) and reports estimates of recall (sensitivity) and precision.
Usage
svanalyzer benchmark --ref <reference FASTA file> --test <VCF-formatted file of variants to test> --truth <VCF-formatted file of true variants>
Options
| Option | Description |
| --- | --- |
| --help, --manual | Display documentation. |
| --ref | The reference FASTA file for the supplied VCF file or files (required). |
| --test | A VCF-formatted file of structural variants to test (required). |
| --truth | A VCF-formatted file of variants to compare against (required). |
| --maxdist | Disallow matches if positions of two variants are more than maxdist bases from each other (default: 100,000). |
| --normshift | Disallow matches if alignments between alternate alleles have normalized shift greater than normshift (default: 0.2). |
| --normsizediff | Disallow matches if alternate alleles have normalized size difference greater than normsizediff (default: 0.2). |
| --normdist | Disallow matches if alternate alleles have normalized edit distance greater than normdist (default: 0.2). |
| --minsize | Only include true variants of size >= minsize for recall calculation and test variants of size >= minsize for precision calculation (default: 0). |
| --prefix | Prefix for output file names (default: “benchmark”). |
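The four matching thresholds in the table combine as a simple conjunction: a pair is rejected as soon as any one measure exceeds its limit. The sketch below illustrates that reading only. The `MatchThresholds` class and `passes_thresholds` function are hypothetical names, the four measures are assumed to be precomputed non-negative values, and how SVbenchmark itself normalizes them is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchThresholds:
    """Defaults copied from the SVbenchmark option table."""
    maxdist: int = 100_000
    normshift: float = 0.2
    normsizediff: float = 0.2
    normdist: float = 0.2


def passes_thresholds(pos_dist, norm_shift, norm_size_diff, norm_edit_dist,
                      t=MatchThresholds()):
    """Return True unless any measure is greater than its threshold.

    All four inputs are assumed to be precomputed, non-negative values;
    this sketch does not compute them from VCF records.
    """
    return (pos_dist <= t.maxdist
            and norm_shift <= t.normshift
            and norm_size_diff <= t.normsizediff
            and norm_edit_dist <= t.normdist)


# Example: a pair 500 bases apart with small normalized differences.
print(passes_thresholds(500, 0.10, 0.05, 0.20))
```

Because the table says matches are disallowed only when a value is greater than its threshold, a value exactly equal to the threshold is treated as passing in this sketch.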
Description
For sequence-specified test and truth structural variants in VCF files (i.e., files with ATGC sequences in the REF and ALT fields), SVbenchmark considers each test/truth variant pair separated by no more than the distance given by the --maxdist option, and aligns the constructed alternate haplotypes of the pair to determine whether the two variants are equivalent.
In the false positive output VCF file, the program reports all test variants that are not equivalent to any true variant. In the false negative output VCF file, the program reports all true variants that are not equivalent to any test variant. The recall rate is reported in the report file as the percentage of true variants that are not false negatives, and the precision is reported as the percentage of test variants that are not false positives.
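The recall and precision definitions above reduce to two ratios over the variant counts. The following is a minimal sketch of that arithmetic, not SVbenchmark's own code. The function name and the example counts are made up for illustration.

```python
def recall_and_precision(n_true, n_false_neg, n_test, n_false_pos):
    """Return (recall, precision) as percentages.

    recall    = percentage of true variants that are not false negatives
    precision = percentage of test variants that are not false positives
    An empty set yields NaN rather than a division error.
    """
    nan = float("nan")
    recall = 100.0 * (n_true - n_false_neg) / n_true if n_true else nan
    precision = 100.0 * (n_test - n_false_pos) / n_test if n_test else nan
    return recall, precision


# Hypothetical counts: 200 true variants with 20 missed,
# 150 test variants with 15 unmatched.
recall, precision = recall_and_precision(200, 20, 150, 15)
print(f"recall={recall:.1f}% precision={precision:.1f}%")
```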
As of SVanalyzer v0.33, SVbenchmark will include non-sequence-specified deletions in its comparisons so long as the ALT field values of the VCF deletion records are “<DEL>” and an END value is included in the INFO field (e.g., END=5289355).
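To check ahead of time which symbolic records in a VCF file meet this condition, a small filter along the following lines may help. It is a sketch under the assumption that records are tab-delimited VCF data lines with at least eight columns. The function names are hypothetical, and the example record is invented apart from the END value quoted above.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_comparable_symbolic_deletion(vcf_line):
    """True if ALT is exactly <DEL> and INFO carries an integer END value."""
    cols = vcf_line.rstrip("\n").split("\t")
    if len(cols) < 8:
        return False
    alt, info = cols[4], parse_info(cols[7])
    end = info.get("END")
    return alt == "<DEL>" and isinstance(end, str) and end.isdigit()


record = "chr1\t5289000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=5289355"
print(is_comparable_symbolic_deletion(record))
```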
SVmerge
SVmerge groups structural variants from a VCF file by calculating a distance matrix, then finding connected components of a graph in which the nodes are the variants and edges exist when the distances are below the specified maximum values.
The program steps through a set of structural variants, calculating distances to other nearby variants by comparing their alternate haplotypes. The program then reports clusters of variants, and prints a VCF file of “unique” variants, where the variant reported in the VCF record is a randomly-chosen representative from the largest cluster (or a randomly selected largest cluster, in the case of a tie among cluster sizes) of exactly matching variants.
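The clustering and representative selection described above can be sketched with a union-find structure. This is an illustration of the general technique rather than SVmerge's implementation. It assumes the edges are the variant pairs whose distances already fell below the thresholds, and that “exactly matching” variants share an identical key (for example a position/REF/ALT tuple). All names here are hypothetical.

```python
import random
from collections import defaultdict


def connected_components(n, edges):
    """Group nodes 0..n-1 into connected components using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


def pick_representative(cluster, keys, rng=random):
    """Pick a random member of the largest exact-match group in a cluster.

    keys[i] identifies variant i exactly. Ties between equally large
    groups are broken by a random choice of group.
    """
    by_key = defaultdict(list)
    for i in cluster:
        by_key[keys[i]].append(i)
    largest = max(len(members) for members in by_key.values())
    candidates = [m for m in by_key.values() if len(m) == largest]
    return rng.choice(rng.choice(candidates))


# Six variants; pairs (0,1), (1,2) and (3,4) passed the distance thresholds.
clusters = connected_components(6, [(0, 1), (1, 2), (3, 4)])
keys = ["a", "a", "b", "c", "d", "e"]
print(clusters, [pick_representative(c, keys) for c in clusters])
```

Note that connected components are transitive: variants 0 and 2 end up in the same cluster even though only the pairs (0,1) and (1,2) were within the thresholds.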
Alternatively, a file of previously calculated distances can be provided with the --distance_file option, and the clustering can be skipped with the option --skip_clusters.
NOTE: SVmerge only clusters and merges sequence-specific variants, i.e., structural variants with ATGCN sequences for their REF and ALT alleles, or deletions with a valid “END” INFO tag. All other variants are printed as singletons unless the --seqspecific option is specified (see below).
Usage
svanalyzer merge --ref <reference FASTA file> --variants <VCF-formatted variant file> --prefix <prefix for output files>
svanalyzer merge --ref <reference FASTA file> --fof <file of paths to VCF-formatted variant files> --prefix <prefix for output files>
Options
| Option | Description |
| --- | --- |
| --help, --manual | Display documentation. |
| --ref | The reference FASTA file for the supplied VCF file or files. |
| --variants | A VCF-formatted file of (possibly equivalent) variants to merge. |
| --fof | A file of paths to VCF-formatted files to merge. |
| --prefix | Prefix for output file names (default: “merged”). |
| --maxdist | Maximum distance between pairs of variants to perform comparison for potential merging (default: 2000). |
| --reldist | Maximum allowable edit distance, normalized by the mean length of the larger allele for the two variants, in an alignment used to merge two variants (default: 0.2). |
| --relsizediff | Maximum allowable alt allele size difference, normalized by the mean length of the larger allele for the two variants, to merge two variants (default: 0.2). |
| --relshift | Maximum allowable shift, normalized by the mean length of the larger allele for the two variants, in an alignment used to merge two variants (default: 0.2). |
| --seqspecific | With this option, SVmerge does not print any SV that lacks an ATGCN sequence for REF and ALT in the input VCF files. |
SVcomp
SVcomp calculates “distances” between pairs of structural variants in VCF format by constructing their alternate haplotypes and aligning them to each other.
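The idea of comparing variants through their alternate haplotypes can be shown on a toy reference. The sketch below builds each haplotype by substituting ALT for REF in a reference string and then computes a plain Levenshtein distance in pure Python. This substitutes a textbook dynamic-programming routine for the edlib aligner that SVanalyzer depends on, and the function names and toy sequence are assumptions made for illustration.

```python
def alt_haplotype(reference, pos, ref_allele, alt_allele):
    """Apply one variant to a reference string.

    pos is the 1-based VCF POS of the first base of ref_allele.
    """
    start = pos - 1
    if reference[start:start + len(ref_allele)] != ref_allele:
        raise ValueError("REF allele does not match the reference")
    return reference[:start] + alt_allele + reference[start + len(ref_allele):]


def edit_distance(a, b):
    """Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


reference = "GGCACACACATT"
# Two deletions of "CA" reported at different positions in a CA repeat.
hap1 = alt_haplotype(reference, 2, "GCA", "G")
hap2 = alt_haplotype(reference, 8, "ACA", "A")
print(hap1, hap2, edit_distance(hap1, hap2))
```

The two records differ in POS, REF, and ALT, yet they produce the same haplotype, which is why a comparison at the haplotype level can recognize them as equivalent where a field-by-field comparison would not.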
Usage
svanalyzer comp --ref reference.fasta --first <first VCF-formatted file> --second <second VCF-formatted file>
Options
| Option | Description |
| --- | --- |
| --help, --manual | Display documentation. |
| --ref | The reference FASTA file for the supplied VCF file or files. |
| --first | A VCF-formatted file of variants to compare. |
| --second | A second VCF-formatted file of variants to compare; it must have the same number of variants as the first file. |
SVwiden
SVwiden reads a VCF file and uses MUMmer to determine widened coordinates for structural variants, adding custom tags to the VCF record.
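One common reason an SV has “wider” coordinates than the VCF record suggests is that it lies in repetitive sequence, where the same change can be placed at several positions. The sketch below illustrates that notion for a simple deletion by sliding it left and right while the resulting haplotype stays unchanged. This is an assumption about what widening means, offered only as an illustration. SVwiden itself relies on MUMmer, and the custom tags it writes are not reproduced here. The function name and toy sequence are hypothetical.

```python
def widen_deletion(reference, start, end):
    """Return the half-open interval (left, right) spanned by all
    equivalent placements of the deletion reference[start:end].

    Coordinates are 0-based. The deletion can slide one base left when the
    base entering the window equals the base leaving it, and likewise right.
    """
    left, e = start, end
    while left > 0 and reference[left - 1] == reference[e - 1]:
        left -= 1
        e -= 1

    s, right = start, end
    while right < len(reference) and reference[s] == reference[right]:
        s += 1
        right += 1

    return left, right


reference = "GGCACACACATT"
# A 2-base deletion of "CA" reported in the middle of the CA repeat.
print(widen_deletion(reference, 4, 6))
```

For a deletion outside any repeat, the loops do not advance and the function returns the original interval.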
Usage
svanalyzer widen --ref <reference FASTA file> --variants <VCF-formatted variant file> --prefix <prefix for output files>
Options
| Option | Description |
| --- | --- |
| --help, --manual | Display documentation. |
| --ref | The reference FASTA file for the supplied VCF file or files. |
| --variants | A VCF-formatted file of structural variants to widen. |
| --fof | A file of paths to VCF-formatted files to widen. |
| --prefix | Prefix for output file names (default: “widened”). |
SVrefine
SVrefine reads a delta-formatted file of MUMmer alignments of an assembly to the reference to call structural variants (or refine variants in chosen genomic regions) and print them out in VCF format.
Usage
SVrefine.pl --delta <path to delta file of alignments> --regions <path to BED-formatted file of regions> --ref_fasta <path to reference multi-FASTA file> --query_fasta <path to query multi-FASTA file> --outvcf <path to output VCF file> --svregions <path to output BED file of SV regions> --outref <path to bed file of homozygous reference regions> --nocov <path to bed file of regions with no coverage>
Options
| Option | Description |
| --- | --- |
| --help, --manual | Display documentation. |
| --delta | Path to a delta file produced by MUMmer with alignments to be used for retrieving SVs. |
| --regions | Path to a BED file of regions to be investigated for structural variants in the assembly (optional). |
| --ref_fasta | Path to a multi-FASTA file containing the sequences used as a reference in the MUMmer alignment (optional). |
| --query_fasta | Path to a multi-FASTA file containing the sequences used as a query in the MUMmer alignment (optional). |
| --outvcf | Path to which to write a new VCF-formatted file of structural variants. |
| --refname | String to include as the reference name in the VCF header. |
| --samplename | String to include as the sample name in the output VCF file. |
| --maxsize | Specify an integer for the maximum size of SV to report. |
| --noheader | Flag option to suppress printout of the VCF header. |
| --nocov | Path to write a BED file with “no coverage” regions (only used when the --regions option is specified). |
SVanalyzer is a software package for the analysis of large insertions, deletions, and inversions in DNA. SVanalyzer tools use repeat-aware methods to refine, compare, and cluster different structural variant calls.
Install
Software dependencies
SVanalyzer tools require samtools (http://www.htslib.org), the edlib aligner (https://github.com/Martinsos/edlib), MUMmer (https://github.com/mummer4/mummer), and bedtools (http://bedtools.readthedocs.io/en/latest/) to perform their structural variant comparisons.
Using conda
SVanalyzer can be installed using the conda package manager with the bioconda channel. For details on setting up conda/bioconda, see the Bioconda user docs.
conda create -n svanalyzer
conda activate svanalyzer
conda install svanalyzer
With a release tarball/GitHub clone
SVanalyzer can also be installed by downloading a release tarball or cloning the GitHub repository:
git clone https://github.com/nhansen/SVanalyzer.git
After unpacking the tarball or cloning the repository, build SVanalyzer:
cd SVanalyzer
perl Build.PL
./Build
./Build test
./Build install
To install SVanalyzer to an alternate location (e.g., if you do not have root permissions), call “perl Build.PL --install_base $HOME”.
Command documentation
- SVbenchmark - Compare a set of “test” structural variants in VCF format to a known truth set and report recall (sensitivity) and precision
- SVmerge - Merge similar sequence-resolved SVs in VCF format
- SVcomp - Compare sequence-resolved SVs to each other
- SVwiden - Add tags to a VCF file of sequence-resolved SVs detailing surrounding repetitive genomic context
- SVrefine - Call sequence-resolved structural variants (SVs) from assembly consensus