Welcome to SymTyper’s documentation!

Contents:

Web-based SymTyper

Brief Overview

To run SymTyper from the web, follow these instructions:

First, invoke symtyper main submission page (http://www.symtyper.com). You should be presented with the following page in SymTyper’s Main Screen.

SymTyper Main Screen

SymTyper’s Main Screen

To submit a new analysis, browse and select your input fasta file and a valid ids file (Fasta Input Format) and then click submit (See New SymTyper Analysis Screen).

SymTyper Main Screen

New SymTyper Analysis Screen

The next screen will provide you with the URL where the output can be accessed. Depending on the input size, the processing can take between few minutes to hours (See Processing Screen and Job URL). Please copy the URL for future access. Job will be hosted on the SymTyper site for 15 days.

SymTyper Main Screen

Processing Screen and Job URL

If the anlysis completed successfully, you will be presented with the a summary table where the various componenents of the analysis can be accessed (SymTyper Results Main Screen). The results are gouped by section: Clades, Subtypes, Multiples, Trees, Breakdown. These sections are explained below.

SymTyper Results Main Screen

SymTyper Results Main Screen

Section Description
Clades Shows the breakdown of clades per sample. The results can be viewed or download as a matrix or show as a piechart per sample
Subtypes Shows the breakdowns of subtypes per sample. The results can be viewed independently for the Perfect, Unique and ShortNew subtypes
Multiples The graphs shows the distribution of sequences for each clade contaning multiple hits. The definition of a Multiple hits is described in the Multiples section
Trees Describes the breakdown of the number of sequences assigned to the internal nodes, and the clades per sample. The tree representation show the combined count for all the samples
Breakdown Shows a Sunburst representation of the Clades and subtypes by sample

Clades View

The Clades View shows a table view of the distribution of HITS, NOHITS, LOW and AMBIGUOUS hits per sample. Clicking the View Chart provides access to the clades distribution for each sample. The complete results and disbtribution of clades per sample can be downloaded from the results main page (see Pie Chart Distribution of Clade per Sample).

Pie chart view of clade distribution

Pie Chart Distribution of Clade per Sample

Subtypes View

The Subtypes Views shows the breakdown of subtypes per sample. The results can be viewed independently for the Perfect, Unique and the ShortNew subtypes. The subtypes are assigned based on the blast results of the query sequences to the clade specific references (See Subtypes Distribution per Clade).

Subtypes view

Subtypes Distribution per Clade

Section Description
Perfect A query sequence that aligns perfectly or with very high similarity to a unique symbiont reference in the database (e.g., 100% similarity to 100% of the length of the target)
Unique A query sequence that aligns unambiguously to symbiont reference in the database. (e.g., \(>=\) user defined % similarity to 100% target length and the bit score for the best hit is at least 3 orders or magnitude larger than than that for the second hit);
ShortNew A query sequence shorter than the average sequence in the reference database but aligns with high similarity to a unique reference according to the dynamic similarity threshold (See Dynamic Similarity)

Multiples View

The Multiples View is a graphical representation of the corrected subtypes count to which ambiguous sequences map. The algortihm used to resolved multiple hits is described in the Ambiguous Hit Correction and detailed in the manuscript (See Subtypes Distribution for the Corrected Ambiguous Hits).

Subtypes view

Subtypes Distribution for the Corrected Ambiguous Hits

The breakedown of subtypes for Resolved under the “Resolved tab”

Trees View

For each clade phylogeny, this view compiles the number of times a Lowest Common Ancestor was identified for an ambiguous sequence (after the Ambiguous Hit Correction stage). The tree can be downloaded in the Newick NHX format and viewed or parsed in phylogeny applications (See Newick NHX Format). A matrix file comparing results across samples can be be found in output archive available for download from the main page.

Breakdown View

Using user-friendly graphical Sunburst representation, this view summarizes the intricate structure of Symbiodinium clades and subtypes in a single or multi-sample view. Highlighting a level of the Sunburst charts display its structure and the percentage of sample reads assigned to it (See Subtype Breakdown Vizualization).

Subtypes view

Subtype Breakdown Vizualization

Command Line Symtyper

Installing Symtyper’s requirements

The most recent version of Symtyper is a self-contained Python script that can be run without being explicitly installed on the system. However, Symtyper depends on other applciation for its exectution. These applications are

Application Version Notes
HMMER >=3.0 http://selab.janelia.org/software/hmmer3/
Blast >=2.2.25 . Currently, only legacy Blast is supported
cd-hit >= 4.5.6  
biopython >= 1.61  
ete2 >= 2.2  
xvfb   Required for executing Symtyper on a remote server via ssh

Running Symtyper sub-program

Symtyper is comprised of 5 subprograms that each carry out a specific funciton. These programs are: clade, subtype, resolveMultipleHits, builPlacementTree and makeTSV. The details for each program is described below.

clade

usage

usage: symTyper.py clade [-h] -s SAMPLESFILE -i INFILE [-e EVALUE]

Input

REQUIRED

Param Description
-i, –inFile File containing the sequencing reads in fasta fomat. Note that this files requires the ids to be fomatted using the following Fasta Input Format
-s, –samplesFile The Samples File
Ouptut

fasta/: Directory cotaining a collection of fasta sequences representing the intput fasta file split by sample hmmer_output/: Directory containing HMMER output files, broken down by sample hmmer_parsedOutput/: Directory containing listing of AMBIGUOUS OUTPUT, HITS OUTPUT, NOHITS OUTPUT and LOWOUT for each of the input samples hmmer_hits/: Directory containing fasta files, split by clade, of sequences having hits against the clade database.

subtype

usage: symTyper.py subtype [-h] -s SAMPLESFILE -H HITSDIR -b BLASTOUTDIR -r BLASTRESULTS -f FASTAFILESDIR Directory contains HMMER output files, broken down by sample

Input

The input to “clade” is expected to be in the same format as that produced by the clade subprogram

Param Description
-f, –fastaFilesDir Directory cotaining sequences from the input fasta file, split by clade (fasta directory)
-s, –samplesFile The Samples File
-H, –hitsDir HMMER fasta hits output directory produced by clade (hmmer_hits directory)
-b, –blastOutDir Blast output directory
-r, –blastResults Parsed blast results directory
Output

blast_output/: Directory containing Blast output files, broken down by sample blastResults/: Directory containing informaiton on the Perfect, Unique, New, ShortNew and Short

The output formats for the files in blastResults/ can be found here:

PERFECT OUTPUT UNIQUE OUTPUT NEWOUT SHORTNEW OUTPUT SHORT OUTPUT

resolveMultipleHits

Input

usage: symTyper.py resolveMultipleHits [-h] -s SAMPLESFILE -m MULTIPLEFASTADIR -c CLUSTERSDIR

The input to resolveMultipleHits is expected to be in the same format as that produced by the subtype subprogram

Param Description
-s, –samplesFile The Samples File
-m, –multipleFastaDir Directory containing sequences with multiple hits, split by clade (x directory)
-c, –clustersDir Directory that will contain cluster information
Output

resolveMultiples/Reps: Representatives from each cluster resolveMultiples/clusters: Clusters produced for each sample resolveMultiples/correctedMultiplesHits: Contains output files from clustering and multiple hit resolution

The resolveMultiples/correctedMultiplesHits directory contains the following files and directory:

builPlacementTree

usage: symTyper.py builPlacementTree [-h] -c CORRECTEDRESULTSDIR -n NEWICKFILESDIR -o OUTPUTDIR

Input

The input to builPlacementTree is expected to be in the same format as that produced by the resolveMultipleHits subprogram

Param Description
-c, –correctedResultsDir Directory containing corrected Clade placements (the correctedMultiplesHits/corrected directory from resolveMultipleHits)
-n, –newickFilesDir Newick directory for input calde phylogenies in Newick format
-o, –outputDir Dir that will contain the newick and interenal nodes information
Output

The output directory containing the placement information for, broken down by sample

makeTSV

Input

Files and directories produced by the builPlacementTree subprogram

Output

SymTyper’s Concepts

Definitions

HIT

This is a clade-relevant definition. To be a HIT against a clade reference sequence, a query needs to unambiguously align with a defined similarity over a defined percentage of its length. Furthermore, the e-value of the first hit needs to be at least K orders of magnitude larger than that of an alternative clade.

NOHIT

This is a clade-relevant definition. A Sequence is considered a NOHIT if it does not have any satisfactory alignments against a clade.

AMBIGUOUS

This is a clade-relevant definition. An ambiguous sequence is one that has more than one satisfctory clade hit.

Perfect

This is a subtype-relevant definition. Perfect refers to a query sequence that aligns unambiguously to one sequence in the reference database (e.g., 100% similarity to 100% of the length of the target) for which the best hit’s raw bit score is at least 3 orders of magnitude larger than the raw bit score for the second hit.

Unique

This is a subtype-relevant definition. Unique refers to a query sequence that aligns to a single reference in the database with a user-defined (e.g., \(>=\) user defined % similarity to 100% target length) for which the best hit’s raw bit score is at least 3 orders of magnitude larger than the raw bit score for the second hit.

New

This is a subtype-relevant definition. A New subtype applies to a sequnence with no significant hit to any of the subtype database sequences.

ShortNew

This is a subtype-relevant definition. ShortNew refers to a query sequence that aligns with high similarity to a unique reference sequence according to the dynamic similarity threshold (Equation 1: Dynamic Similarity) below.

Multiples

This is a subtype-relevant definition. A query sequence of type multiple is a sequence that aligns with equal similarity to multiple subtypes sequences.

Short

This is a subtype-relevant definition. A query of type short, is one that does not meet the minimum similarity and length requirements (e.g., \(<\) 90% similarity to \(<\) 90% of the length of the target).

Dynamic Similarity

The dynamic similarity threshold is computed to allow query sequences that are shorter than the database references to be considered as potential hits. However, the shorter the sequnces, the higher the required stringency. The dynamic similarity threshold is computed as:

\(required\_similarity = 100 - \frac{C - min_c}{1-min_c} * (100 - min_s)\)

where:

C is the coverage fraction of the query over the hit sequences
\(min_c\) is the minimum accepted coverage fraction of the query and the hit sequences
\(min_s\) is the minimum similarity threshold between the query and the hit sequences

Ambiguous Hit Correction

An ambiguous hit occurs when a sequences aligns with multiple subtypes. To try to infer the correct subtype of the sequence, we employ a strategy similar to the wisdom of the crowd, and allow similar sequences to help contribute information about the closest subtype of the sequence. To do so, ambiguous sequences are clustered using high stringency and a subtype distribution (or spectrum) is computed for each cluster.

Suppose a cluster has a distribution: 88 C1.1, 45 C1.18, 6 C1.21 and 2 C1.28. This means that at least 88 sequences in the cluster were subtyped as C1.1. and only 1 was subtyped as C1.28.

Clusters’ distributions are usually highly skewed with few high frequency subtypes and a greater number of low frequency types. Since there distributions are subsequently used to infer the Lowest Common Ancestor (LCA) sequence as a proxy, it is very improtant to rid the data of unlikely subtype that can bias the computation of the LCA. For the previous distribution, the wisdom of the crowd tells us that this cluster of sequences is closest to C1.1. and unlikely to be C1.28 and therfore drops it for the C1.28. The same can be said about C1.21 since only 6 sequences have been aligned to it. The corrected distribution is thus likely 88 C1.1, 45 C1.18. This distribution will be subsequently used to map the reads to the common ancestor in the phylogeny.

The algoirthm used to correct the subtypes distribution uses a similar approach by formalizing which subtypes to drop for the distribution using a strigency parameter p. To do so, we iteratively drop the the subtypes that have counts within the \(p^{th}\) percentile of the distribution and stop when no subtypes can be dropped.

Resolved

An ambiguous read is said to be resolved if its filtered distribution after the Ambiguous Hit Correction contains a single subtype.

Lowest Common Ancestor

In a phylogenetic tree, an internal node, \(N\), is the lowest common ancestor (or most recent common ancestor) of a set of leaves \(L\), if \(N\) is the first common parent of all the leaves of in \(L\)

Placement Tree

A phylogeny of the subtypes in each clade where an internal node can be labeled using the number of seqeuencing reads for which is considered to be the most recent ancestor

TSV Format

A file with tab delimited columns

Samples File

A file cotaining the samples – one per line – in the dataset.

Input File Formats

Fasta Input Format

Sequence ids in the fasta file are required to have the following format.

Sample_ID::Seq_Number

  • Sample_ID: refers to the sample to which the sequence belongs. The sampleID should be present in the Samples File
  • Seq_Number: is a unique identifier for a the sequence.

Note that the two colons (::) are used to separate the Sample_ID and the Seq_Number.

Clade Output Format

HITS OUTPUT

  • Query sequence id
  • Hit start in query
  • Hit end in query
  • First hit id
  • Second hit id
  • First hit e-value
  • Second hit e-value

NOHITS OUTPUT

  • Query sequence id

AMBIGUOUS OUTPUT

  • Query sequence id
  • First hit id
  • Second hit id
  • First hit e-value
  • Second hit e-value

LOWOUT

  • Query sequence id
  • First hit id
  • Hit e-value

MULTIPLE OUTPUT

  • Query sequence id
  • List of hits ids

Subtype Output Formats

NEWOUT

  • Query sequence id

PERFECT OUTPUT

  • Query sequence id
  • Best hit id
  • Query length / Hit length
  • Percent identity

SHORT OUTPUT

  • Query sequence id
  • Query length
  • Best hit id
  • Best hit lenght

SHORTNEW OUTPUT

  • Query sequence id
  • Best hit id
  • Query length / Hit length
  • Percent identity

UNIQUE OUTPUT

  • Query sequence id
  • Best hit id

ResolveMultipleHits Output Formats

Corrected Output All Clade

Tab separated fields and colon separated values. Ex.

Cluster: CL_415 numSeq: 6       clade: C        breakDown:180:4 175M:2  subtypes: C3.24_HE579012: 6, C3k_AY589737: 6, C3.23_HE579011: 6

The previous line tell us that CL_145 representes 6 Sequences, 2 form sample 175M and 4 from sample 180. These sequences are in Clade C and have the subtype distribution listed in subtype list.

Resolved Output All Clades

  • Cluster ID
  • Number of sequences in the cluster
  • Clade
  • Subtype of sequences in the cluster

Corrected Output Per Clade

This file format is similar to that in Corrected Output All Clade except that the subtype list represents the corrected (or effective), rather than initial, subtypes.

Newick NHX Format

NHX is based on the New Hampshire (NH) standard (also called “Newick tree format”). Files in this format can be view using any application that supports it, such as the online treeview program (http://etetoolkit.org/treeview/).

For more details on the NHX format, see: http://www.genetics.wustl.edu/eddy/forester/NHX.html