Primary output files¶
- Compressed binary archive (GGR)
07/21/14: v1.0
Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes. Harvest includes Parsnp, a fast core-genome multi-aligner, harvest-tools, a binary archive format and format conversion tool, and Gingr, a dynamic visual platform. Combined they provide rapid core-genome alignments, variant calls, recombination detection, and phylogenetic trees.
Contents:
fixme
fixme
Gingr is an interactive tool for exploring large-scale phylogenies in tandem with their corresponding multi-alignments. Gingr can display informative overviews for hundreds or thousands of genomes, while allowing researchers to move quickly to more detailed views of specific subclades and genomic regions, even down to the nucleotide level of their multi-alignments. Additionally, its dynamic display of variants allows interactive selection of various filters, such as indels, poorly aligned regions and suspected sites of recombination. Gingr works chiefly in tandem with Parsnp, an efficient tool for core-genome multi-alignment and phylogenetic reconstruction. It is also applicable, however, to other analytical tools, accepting standard file formats such as multi-Fasta, XMFA, Newick and VCF.
Contents:
fixme
fixme
fixme
fixme
fixme
fixme
harvest-tools primarily serves as a reference-based compression tool for large genomic datasets. In addition, it serves as a proxy between several alignment and variant & annotated feature formats. While it was designed for use with both Parsnp & Gingr, it can also be used as standalone tool to efficiently compress multiple alignments, variants, phylogenies and annotations to a single convenient binary archive.
Contents:
harvest-tools is distributed as a precompiled binary. The three steps below represent the fastest way to start using the software:
- wget ftp://ftp.cbcb.umd.edu/pub/software/harvest/harvesttools-OSX64.gz
- tar -xvf harvesttools-OSX64.tar.gz
- wget ftp://ftp.cbcb.umd.edu/pub/software/harvest/harvesttools-Linux64.gz
- tar -xvf harvesttools-Linux64.tar.gz
From command-line:
harvest-tools –x <input xmfa> -f <input reference fasta> -g <reference genbank formatted annotations> -n <newick formatted tree>
harvest-tools quick start for three example scenarios.
With reference & genbank file as input:
harvest-tools -g <reference_genbank_file1,reference_genbank_file2,..> -r <reference fasta file> -x <XMFA file> -o hvt.ggr
With harvest-tools file as input, XMFA output:
harvest-tools -i input.ggr -X output.xmfa
With harvest-tools file as input, fasta formatted SNP file as output:
harvest-tools -i input.ggr -S output.snps
- -b: <bed filter intervals>,<filter name>,”<description>”
- -B: <output backbone intervals>
- -f: <reference fasta>
- -F: <reference fasta out>
- -g: <reference genbank>
- -h: (show this help)
- -i: <harvest input>
- -m: <multi-fasta alignment input>
- -n: <Newick tree input>
- -N: <Newick tree output>
- –midpoint-reroot
- -o: <hvt output>
- -q: (quiet mode)
- -S: <output for multi-fasta SNPs>
- -v: <VCF input>
- -V: <VCF output>
- -x: <xmfa alignment file>
- -X: <output xmfa alignment file>
fixme
- Autoconf ( http://www.gnu.org/software/autoconf/ )
- Protocol Buffers ( https://code.google.com/p/protobuf/ )
- Zlib ( http://www.zlib.net/ )
For your convenience, precompiled binaries provides at:
http://github.com/marbl/harvest
otherwise, to install from source:
git clone https://github.com/marbl/harvest-tools.git hvt_src
cd hvt_src
./bootsrap.sh
./configure [--prefix=...] [--with-protobuf=...]
make
[sudo] make install
command line tool ( <prefix>/bin/harvest )
static library ( <prefix>/lib/libharvest.a )
- includes
- Harvest, Protocol Buffer message class ( <prefix>/include/harvest.pb.h )
- HarvestIO, Harvest wrapper with file IO ( <prefix>/include/HarvestIO.h )
When running ./configure, use –prefix to install somewhere other than /usr/local. Use –with-protobuf if the Protocol Buffer libraries are not in their default location (/usr/local). Zlib should be installed in a standard system location. Sudo will be necessary for ‘make install’ if write permission
fixme
fixme
fixme
fixme
Parsnp was designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to few hours. Input can be both draft assemblies and finished genomes, and output includes variant (SNP) calls, core genome phylogeny and multi-alignments. Parsnp leverages contextual information provided by multi-alignments surrounding SNP sites for filtration/cleaning, in addition to existing tools for recombination detection/filtration and phylogenetic reconstruction.
Contents:
Parsnp is distributed as a precompiled binary that should be devoid of external dependencies (all included in dist). The three steps below represent the fastest way to start using the software:
- wget ftp://ftp.cbcb.umd.edu/pub/software/harvest/parsnp-OSX64.gz
- tar -xvf parsnp-OSX64.tar.gz
- wget ftp://ftp.cbcb.umd.edu/pub/software/harvest/parsnp-Linux64.gz
- tar -xvf parsnp-Linux64.tar.gz
From command-line:
parsnp –p <threads> –d <directory of genomes> –r <ref genome>
Parsnp quick start for three example scenarios.
With reference & genbank file:
parsnp -g <reference_genbank_file1,reference_genbank_file2,..> -d <genome_dir> -p <threads>
With reference but without genbank file:
parsnp -r <reference_genome> -d <genome_dir> -p <threads>
Autorecruit reference to a draft assembly:
parsnp -q <draft_assembly> -d <genome_db> -p <threads>
Input/output:
-c = <flag>: (c)urated genome directory, use all genomes in dir and ignore MUMi? (default = NO)
-d = <path>: (d)irectory containing genomes/contigs/scaffolds
-g = <string>: Gen(b)ank file(s) (gbk), comma separated list (default = None)
-o = <string>: output directory? default [./P_CURRDATE_CURRTIME]
-q = <path>: (optional) specify (assembled) query genome to use, in addition to genomes found in genome dir (default = NONE)
-r = <path>: (r)eference genome (set to ! to pick random one from genome dir)
MUMi:
-M = <flag>: calculate MUMi and exit? overrides all other choices! (default: NO)
-U = <float>: max (M)UMi distance (default: autocutoff based on distribution of MUMi values)
MUM search:
-a = <int>: min (a)NCHOR length (default = 1.1*Log(S))
-C = <int>: maximal cluster D value? (default=100)
-z = <path>: min LCB si(z)e? (default = 25)
LCB alignment:
-D = <float>: maximal diagonal difference? Either percentage (e.g. 0.2) or bp (e.g. 100bp) (default = 0.12)
-e = <flag> greedily extend LCBs? experimental! (default = NO)
-n = <string>: alignment program (default: libMUSCLE)
SNP filters:
-R = <flag>: disable (R)epeat filtering?
-x = <flag>: enable recombination filtering? (default: NO)
Misc:
-h = <flag>: (h)elp: print this message
-P = <int>: max partition size? limits memory usage (default= 15000000)
-p = <int>: number of threads to use? (default= 1)
-v = <flag>: (v)erbose output? (default = NO)
fixme
For your convenience, precompiled binaries provides at:
http://github.com/marbl/harvest
otherwise, to install from source:
git clone https://github.com/marbl/parsnp.git parsnp_src
cd parsnp_src
First (important!), build libMUSCLE:
cd muscle
./autogen.sh
./configure --prefix=`pwd`
make install
Then, build Parsnp:
cd ..
./autogen.sh
./configure
make install
Once both installed (to cwd install by default):
export PARSNPDIR=/path/to/parsnp/install
First (important!), build libMUSCLE:
cd muscle
./autogen.sh
./configure --prefix=/usr/local/
sudo make install
Then, build Parsnp:
cd ..
./autogen.sh
./configure --prefix=/usr/local/
sudo make install
- Parsnp takes both draft and finished genomes of closely related strains as input, performs conservative core genome alignment and as output returns multi-alignments (XMFA), variants (VCF), core genome phylogeny (Newick) and Gingr input format (GGR).
- The main advantages of Parsnp over alternative approaches is robust filtration of variant (SNP) calls, multiple alignments as output and superior speed. Parsnp can align 200-300 bacterial strains in <30 minutes on a 16-core server and ~1000 in a couple of hours.
- If you are interested in pan genome/whole genome alignment, existing tools for the job that perform well include Mauve, Mugsy, among others. In addition, Parsnp is tailored for intraspecific genome analysis (outbreak analysis of a pathogen, etc). One main limitation of Parsnp is that it cannot handle subsets (core genome only) and is not as sensitive as existing methods.
- Gingr (http://github.com/marbl/gingr) can open Parsnp output and provide an interactive display of multi-alignments, variants and the phylogenetic tree estimated from the core genome alignment.
- Parsnp is a conservative core genome alignment method that necessarily requires that all genomes are present in each aligned regions. The focus is on aligning 1000s of closely related bacterial strains quickly while maintaining sensitivity comparable to existing WGA methods. In additon, the core genome has been shown to contain as few as 30-40% of the gene content (even in very closely-related clades) due to reductive genome evolution and/or a large accessory genome (with plenty of IS/phage elements). However, for increased sensitivity w.r.t aligned regions, and alignments containing subsets, both Mugsy and Mauve are terrific tools for the job.
- Since the core necessarily includes all genomes, the choice of reference does not matter (with a couple important exceptions, continue reading). Feel free to use the parameter ‘-r !’ to randomly select a reference if you are feeling indecisive or if all of the genomes contained in the genome directory are of similar quality. Typically, finished/closed genomes are used a the reference strain to ensure they are high-quality and do not contain assembly artifacts or contaminant.
- By default, parsnp calculates the MUMi distance between the reference and each of the genomes in the genome directory. All genomes with MUMi distance <= 0.01 are included, all others are discarded. To force all genomes present in the genome dir to be included simply include ‘-c’ as a command-line parameter.
- The goal of parsnp is to capture all informative signals found in the core genome of the specified clade of interest. Any SNPs in regions not shared by all genomes will not reported. Additionally, any SNP found in a likely poorly aligned region would also be discarded. Finally, parsnp does not perform LCB extension and therefore may miss SNPs appearing at the end of locally conserved blocks or clusters.
To further demonstrate the functionality of Parsnp we have prepared two small tutorial datasets. The first dataset is a MERS coronavirus outbreak dataset involving 42 isolates. The second dataset is a selected set of 31 Streptococcus pneumoniae genomes. Both of these datasets should run on modestly equipped laptops in a few minutes.
42 MERS Coronavirus genomes
Download genomes:
Run parsnp
./parsnp -r ./mers42/England-Qatar_2012.fna -d ./mers42Command-line output
![]()
- Visualize with Gingr:
![]()
Inspect Output:
31 Streptococcus pneumoniae genomes
Download genomes:
Run parsnp
./parsnp -r ./strep31/NC_011900.fna -d ./strep31 -p <num threads>Example output:
![]()
Force inclusion of all genomes (-c)
./parsnp -r ./strep31/NC_011900.fna -d ./strep31 -p <num threads> -c
Command-line output:
![]()
Visualize with Gingr
![]()
Enable recombination detection/filter (-x)
./parsnp -r ./strep31/NC_011900.fna -d ./strep31 -p <num threads> -c -xRe-visualize with Gingr
- Bootstrap values have improved after running recombination filter; columns with filtered SNPs are displayed in image:
![]()
fixme
fixme
fixme
fixme