EDGE: Empowering the Development of Genomics Expertise

EDGE ABCs

A quick introduction to EDGE, an overview of the bioinformatic workflows, and the computational environment

About EDGE Bioinformatics

EDGE bioinformatics was developed to help biologists process Next Generation Sequencing data (in the form of raw FASTQ files), even if they have little to no bioinformatics expertise. EDGE is a highly integrated and interactive web-based platform that is capable of running many of the standard analyses that biologists require for viral, bacterial/archaeal, and metagenomic samples. EDGE provides the following analytical workflows: pre-processing, assembly and annotation, reference-based analysis, taxonomy classification, phylogenetic analysis, and PCR analysis. EDGE provides an intuitive web-based interface for user input, allows users to visualize and interact with selected results (e.g. JBrowse genome browser), and generates a final detailed PDF report. Results in the form of tables, text files, graphic files, and PDFs can be downloaded. A user management system allows tracking of an individual’s EDGE runs, along with the ability to share, post publicly, delete, or archive their results.

While EDGE was intentionally designed to be as simple as possible for the user, there is still no single ‘tool’ or algorithm that fits all use-cases in the bioinformatics field. Our intent is to provide a detailed panoramic view of your sample from various analytical standpoints, but users are encouraged to have some knowledge of how each tool/algorithm workflow functions, and some insight into how the results should best be interpreted.

Bioinformatics overview

Inputs:

The input to the EDGE workflows begins with one or more Illumina FASTQ files for a single sample. (There is currently limited capability of incorporating PacBio and Oxford Nanopore data into the Assembly module.) The user can also enter SRA/ENA accessions to allow processing of publicly available datasets. Comparison among samples is not yet supported, but development is underway to accommodate such a function for assembly and taxonomy profile comparisons.

Workflows:

Pre-Processing

Quality control is performed by FaQCs. The host removal step requires the input of one or more reference genomes as FASTA. Several common references are available for selection. Trimmed and host-screened FASTQ files are used as input to the other workflows.

Assembly and Annotation

We provide the IDBA, SPAdes, and MEGAHIT (in the development version) assembly tools to accommodate a range of sample types and data sizes. When the user chooses to perform an assembly, all subsequent workflows can run their analyses with the reads, the contigs, or both (default).

Reference-Based Analysis

For comparative reference-based analysis with reads and/or contigs, users must input one or more references (as FASTA, or multi-FASTA if there is more than one replicon) and/or select from a drop-down list of RefSeq complete genomes. Results include lists of missing regions (gaps), inserted regions (with input contigs if assembly was performed), SNPs (and coding sequence changes), as well as genome coverage plots and interactive access via JBrowse.

Taxonomy Classification

For taxonomy classification with reads, multiple tools are used and the results are summarized in heat map and radar plots. Individual tool results are also presented with taxonomy dendrograms and Krona plots. Contig classification occurs by assigning taxonomies to all possible portions of contigs. For each contig, the longest and best match (using BWA-MEM) is kept for any region within the contig, and the region covered is assigned to the taxonomy of the hit. The next best match to a region of the contig not covered by prior hits is then assigned to that taxonomy. The contig results can be viewed by length of assembly coverage per taxon or by number of contigs per taxon.
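The contig classification described above follows a greedy pattern, which can be sketched in shell. This is a simplified illustration under stated assumptions, not EDGE's implementation: the five-column hit table (contig, start, end, score, taxon) and the function name `assign_contig_regions` are hypothetical, hits are ranked by score only, and a hit is dropped whenever it overlaps a region already claimed on the same contig rather than being trimmed to the uncovered portion.

```shell
#!/bin/sh
# Greedy, non-overlapping region assignment (illustrative sketch only).
# stdin : tab-separated hits -> contig, start, end, score, taxon (hypothetical format)
# stdout: tab-separated accepted regions -> contig, start, end, taxon
assign_contig_regions() {
    # Rank hits from best to worst score; ties are broken by start position.
    sort -t "$(printf '\t')" -k4,4nr -k2,2n |
    awk -F '\t' '
    {
        hit_start = $2 + 0; hit_end = $3 + 0
        n = kept[$1] + 0
        # Skip this hit if it overlaps any region already assigned on this contig.
        for (i = 1; i <= n; i++)
            if (hit_start <= region_end[$1, i] && hit_end >= region_start[$1, i]) next
        n++
        kept[$1] = n
        region_start[$1, n] = hit_start
        region_end[$1, n] = hit_end
        print $1 "\t" hit_start "\t" hit_end "\t" $5
    }'
}

# Example: three candidate hits on contig c1 and one on contig c2.
printf 'c1\t1\t100\t200\tTaxA\nc1\t50\t150\t180\tTaxB\nc1\t101\t200\t150\tTaxC\nc2\t1\t50\t90\tTaxA\n' |
    assign_contig_regions
```

In the example, the TaxB hit is discarded because it overlaps the higher-scoring TaxA region, while the TaxC hit covers a part of c1 not yet claimed and is kept.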

Phylogenetic Analysis

For phylogenetic analysis, the user must select datasets from near neighbor isolates for which the user desires a phylogeny. A minimum of three additional datasets are required to draw a tree. At least one dataset must be an assembly or complete genome. RefSeq genomes (Bacteria, Archaea, Viruses) are available from a dropdown menu, SRA and FASTA entries are allowed, and previously built databases for some select groups of bacteria are provided. This workflow (see PhaME) is a whole genome SNP-based analysis that uses one reference assembly to which both reads and contigs are mapped. Because this analysis is based on read alignments and/or contig alignments to the reference genome(s), we strongly recommend only selecting genomes that can be adequately aligned at the nucleotide level (i.e. ~90% identity or better). The number of ‘core’ nucleotides able to be aligned among all genomes, and the number of SNPs within the core, are what determine the resolution of the phylogenetic tree. Output phylogenies are presented along with text files outlining the SNPs discovered.

Primer Analysis

For primer analysis, if the user would like to validate known PCR primers in silico, a FASTA file of primer sequences must be input. New primers can be generated from an assembly as well.

All commands and tool parameters are recorded in log files to make sure the results are repeatable and traceable. The main output is an integrated interactive web page that summarizes all the workflows run and features tables, graphical plots, links to genome browsers (for the assembly, or for a selected reference), and links to unprocessed results and log files. Most of these summaries, including plots and tables, are included in a final PDF report.

Limitations

Pre-processing

For host removal/screening, not all genomes are available from a drop-down list, however

Assembly and Taxonomy Classification

EDGE has been primarily designed to analyze microbial (bacterial, archaeal, viral) isolates or (shotgun) metagenome samples. Due to the complexity and computational resources required for eukaryotic genome assembly, and the fact that the current taxonomy classification tools do not support eukaryotic classification, EDGE does not fully support eukaryotic samples. The combination of large NGS data files and complex metagenomes may also run into computational memory constraints.

Reference-based analysis

We recommend aligning only against a limited number of the most closely related genome(s). If the closest relatives are unknown, the Taxonomy Classification module is recommended as an alternative. Selecting too many references may increase runtimes or require more computational resources than are available on the user’s system.

Phylogenetic Analysis

Because this pipeline provides SNP-based trees derived from whole genome (and contig) alignments or read mapping, we recommend selecting genomes within the same species or at least within the same genus.

Computational Environment

EDGE source code, images, and webservers

EDGE was designed to be installed and implemented from within any institute that provides sequencing services or that produces or hosts NGS data. When installed locally, EDGE can access the raw FASTQ files from within the institute, thereby providing immediate access by the biologist for analysis. EDGE is available in a variety of packages to fit various institute needs. EDGE source code can be obtained via our GitHub page. To simplify installation, a VM in OVF or a Docker image can also be obtained. A demonstration version of EDGE is currently available at https://bioedge.lanl.gov with example data sets available to the public to view and/or re-run. This webserver has 24 cores and 512GB RAM, runs Ubuntu 14.04.3 LTS, and also allows EDGE runs of SRA/ENA data. This webserver does not currently support upload of data (due in part to LANL security regulations); however, local installations are meant to be fully functional.

Introduction

What is EDGE?

EDGE is a highly adaptable bioinformatics platform that allows laboratories to quickly analyze and interpret genomic sequence data. The bioinformatics platform allows users to address a wide range of use cases including assay validation and the characterization of novel biological threats, clinical samples, and complex environmental samples. EDGE is designed to:

  • Align to real world use cases
  • Make use of open source (free) software tools
  • Run analyses on small, relatively inexpensive hardware
  • Provide remote assistance from bioinformatics specialists
_images/useCases.png

Four common Use Cases guided initial EDGE Bioinformatic Software development.

Why create EDGE?

EDGE bioinformatics was developed to help biologists process Next Generation Sequencing data (in the form of raw FASTQ files), even if they have little to no bioinformatics expertise. EDGE is a highly integrated and interactive web-based platform that is capable of running many of the standard analyses that biologists require for viral, bacterial/archaeal, and metagenomic samples. EDGE provides the following analytical workflows: quality trimming and host removal, assembly and annotation, comparisons against known references, taxonomy classification of reads and contigs, whole genome SNP-based phylogenetic analysis, and PCR analysis. EDGE provides an intuitive web-based interface for user input, allows users to visualize and interact with selected results (e.g. JBrowse genome browser), and generates a final detailed PDF report. Results in the form of tables, text files, graphic files, and PDFs can be downloaded. A user management system allows tracking of an individual’s EDGE runs, along with the ability to share, post publicly, delete, or archive their results.

While EDGE was intentionally designed to be as simple as possible for the user, there is still no single ‘tool’ or algorithm that fits all use-cases in the bioinformatics field. Our intent is to provide a detailed panoramic view of your sample from various analytical standpoints, but users are encouraged to have some insight into how each tool or workflow functions, and how the results should best be interpreted.

System requirements

NOTE: The web-based online version of EDGE, found at https://bioedge.lanl.gov/edge_ui/, runs on our own internal servers and is our recommended way to use EDGE. It requires no particular hardware or software other than a web browser. This section and the Installation section apply only if you want to run EDGE through Python or Apache 2, or through the CLI.

The current version of the EDGE pipeline has been extensively tested on Linux servers running Ubuntu 14.04 and CentOS 6.5 and 7.0, and will work in 64-bit Linux environments. Perl v5.8 or above and Python 2.7 are required. Because several steps are memory- and time-intensive, EDGE requires at least 16GB of memory and at least 8 CPUs. A higher specification is recommended: 128GB of memory and 16 CPUs.
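Before installing, the minimum figures above can be checked on the target host. The following is a minimal sketch, assuming a Linux system that exposes /proc/meminfo and provides `nproc`; the function name `meets_edge_minimum` is illustrative and not part of EDGE.

```shell
#!/bin/sh
# Succeeds when the given memory (GB) and CPU count meet the stated
# minimums of 16GB and 8 CPUs. The function name is illustrative.
meets_edge_minimum() {
    mem_gb="$1"; cpus="$2"
    [ "$mem_gb" -ge 16 ] && [ "$cpus" -ge 8 ]
}

# Read the values for this host (Linux only).
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    # Round to the nearest GB, since the kernel reports slightly less
    # than the installed RAM.
    host_mem_gb=$(( (mem_kb + 524288) / 1048576 ))
    host_cpus=$(nproc 2>/dev/null || echo 1)
    if meets_edge_minimum "$host_mem_gb" "$host_cpus"; then
        echo "OK: ${host_mem_gb}GB memory, ${host_cpus} CPUs"
    else
        echo "Below the EDGE minimum: ${host_mem_gb}GB memory, ${host_cpus} CPUs"
    fi
fi
```

The thresholds are kept inside one function so they can be raised to the recommended 128GB and 16 CPUs without touching the detection logic.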

Please ensure that your system has the essential software building packages installed properly before running the installation script.

The following must be installed by a system administrator.

Note

If your operating system is not Ubuntu 14.04 or CentOS 6.5 or 7.0, package and library names may differ, and the newer compiler (gcc5) on newer systems (e.g. Ubuntu 16.04) may fail to compile some of the third-party bioinformatics tools. In that case we suggest using the EDGE VMware image or Docker container.

Ubuntu 14.04

  1. Install build essential libraries and dependencies:

    sudo apt-get install build-essential
    sudo apt-get install libreadline-gplv2-dev
    sudo apt-get install libx11-dev
    sudo apt-get install libxt-dev libgsl0-dev
    sudo apt-get install libncurses5-dev
    sudo apt-get install gfortran
    sudo apt-get install inkscape
    sudo apt-get install libwww-perl libxml-libxml-perl libperlio-gzip-perl
    sudo apt-get install zlib1g-dev zip unzip libjson-perl
    sudo apt-get install libpng-dev
    sudo apt-get install cpanminus
    sudo apt-get install default-jre
    sudo apt-get install firefox
    sudo apt-get install wget curl csh
    
  2. Install Python packages for MetaPhlAn (taxonomy assignment software):

    sudo apt-get install python-numpy python-matplotlib python-scipy libpython2.7-stdlib
    sudo apt-get install python-pip python-pandas python-sympy python-nose
    
  3. Install BioPerl:

    sudo apt-get install bioperl
        or
    sudo cpan -i -f CJFIELDS/BioPerl-1.6.923.tar.gz
    
  4. Install packages for user management system:

    sudo apt-get install sendmail mysql-client mysql-server phpMyAdmin tomcat7
    

CentOS 6.7

  1. Install dependencies using yum:

    # add EPEL repository
    sudo yum -y install epel-release
    su -c 'yum localinstall -y --nogpgcheck http://download1.rpmfusion.org/free/el/updates/6/i386/rpmfusion-free-release-6-1.noarch.rpm http://download1.rpmfusion.org/nonfree/el/updates/6/i386/rpmfusion-nonfree-release-6-1.noarch.rpm'
    sudo yum -y update
    
    sudo yum -y install\
     csh gcc gcc-c++ make curl binutils gd gsl-devel\
     libX11-devel readline-devel libXt-devel ncurses-devel inkscape\
     freetype freetype-devel zlib zlib-devel git\
     blas-devel atlas-devel lapack-devel libpng libpng-devel\
     expat expat-devel graphviz java-1.7.0-openjdk\
     perl-Archive-Zip perl-Archive-Tar perl-CGI perl-CGI-Session \
     perl-DBI perl-GD perl-JSON perl-Module-Build perl-CPAN-Meta-YAML\
     perl-XML-LibXML perl-XML-Parser perl-XML-SAX perl-XML-SAX-Writer\
     perl-XML-Simple perl-XML-Twig perl-XML-Writer perl-YAML\
     perl-Test-Most perl-PerlIO-gzip perl-SOAP-Lite perl-GraphViz
    
  2. Install perl cpanm:

    curl -L http://cpanmin.us | perl - App::cpanminus
    
  3. Install perl modules by cpanm:

    cpanm Graph Time::Piece Data::Dumper IO::Compress::Gzip Data::Stag IO::String
    cpanm Algorithm::Munkres Array::Compare Clone Convert::Binary::C XML::Parser::PerlSAX
    cpanm HTML::Template HTML::TableExtract List::MoreUtils PostScript::TextBlock
    cpanm SVG SVG::Graph Set::Scalar Sort::Naturally Spreadsheet::ParseExcel
    cpanm -f Bio::Perl
    
  4. Install dependent packages for Python:

EDGE requires several packages (NumPy, Matplotlib, SciPy, IPython, Pandas, SymPy and Nose) to work properly. These packages can be downloaded and installed individually from PyPI (https://pypi.python.org/pypi). Alternatively, you can install a Python distribution that bundles them; we suggest the Anaconda Python distribution. You can download the installers and find more information at their website (https://store.continuum.io/cshop/anaconda/). The installation is interactive. Type in /opt/apps/anaconda when the script asks for the location to install Python:

bash Anaconda-2.x.x-Linux-x86.sh
ln -s /opt/apps/anaconda/bin/python /path/to/edge_v1.x/bin/

The second command symlinks the Anaconda Python into the EDGE bin directory, so that EDGE uses it instead of the system Python.

  5. Install packages for user management system:

    sudo yum -y install sendmail mysql mysql-server phpmyadmin tomcat
    

CentOS 7

  1. Install libraries and dependencies by yum:

    # add EPEL repository
    sudo yum -y install epel-release
    
    sudo yum install -y libX11-devel readline-devel libXt-devel ncurses-devel inkscape\
        scipy expat expat-devel freetype freetype-devel zlib zlib-devel perl-App-cpanminus\
        perl-Test-Most python-pip blas-devel atlas-devel lapack-devel numpy numpy-f2py\
        libpng12 libpng12-devel perl-XML-Simple perl-JSON csh gcc gcc-c++ make binutils\
        gd gsl-devel git graphviz java-1.7.0-openjdk perl-Archive-Zip perl-CGI\
        perl-CGI-Session perl-CPAN-Meta-YAML perl-DBI perl-Data-Dumper perl-GD perl-IO-Compress\
        perl-Module-Build perl-XML-LibXML perl-XML-Parser perl-XML-SAX perl-XML-SAX-Writer\
        perl-XML-Twig perl-XML-Writer perl-YAML perl-PerlIO-gzip python-matplotlib python-six
    
  2. Update existing python and perl tools:

    sudo pip install --upgrade six scipy matplotlib
    sudo cpanm App::cpanoutdated
    sudo su -
    cpan-outdated -p | cpanm
    exit
    
  3. Install perl modules by cpanm:

    cpanm Graph Time::Piece Bio::Perl
    cpanm Algorithm::Munkres Archive::Tar Array::Compare Clone Convert::Binary::C
    cpanm HTML::Template HTML::TableExtract List::MoreUtils PostScript::TextBlock
    cpanm SOAP::Lite SVG SVG::Graph Set::Scalar Sort::Naturally Spreadsheet::ParseExcel
    cpanm CGI CGI::Simple GD Graph GraphViz XML::Parser::PerlSAX XML::SAX XML::SAX::Writer XML::Simple XML::Twig XML::Writer
    
  4. Install packages for user management system:

    sudo yum -y install sendmail mariadb-server mariadb phpMyAdmin tomcat
    
  5. Configure firewall for ssh, http, https, and smtp:

    sudo firewall-cmd --permanent --add-service=ssh
    sudo firewall-cmd --permanent --add-service=http
    sudo firewall-cmd --permanent --add-service=https
    sudo firewall-cmd --permanent --add-service=smtp
    # reload so the permanent rules take effect in the running firewall
    sudo firewall-cmd --reload
    

Note

You may need to set SELinux to permissive mode.

sudo setenforce 0

Installation

EDGE Installation

Note

A base install is ~8GB for the code base and ~177GB for the databases.
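Given the sizes in the note above, it can help to confirm that the target filesystem has enough free space before downloading anything. This is a minimal sketch: the function names and the 185GB threshold (8GB plus 177GB) are assumptions made for illustration, and the figure does not count the space taken by the downloaded archives themselves.

```shell
#!/bin/sh
# Free space, in whole GB, on the filesystem that holds the given directory.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# 8GB code base + 177GB databases, per the installation note; this excludes
# the compressed archives, which need additional space until they are removed.
EDGE_REQUIRED_GB=185

# Succeeds when the given free space (GB) covers the base install.
has_space_for_edge() {
    [ "$1" -ge "$EDGE_REQUIRED_GB" ]
}

target_dir="${1:-.}"
avail=$(free_gb "$target_dir")
if has_space_for_edge "$avail"; then
    echo "OK: ${avail}GB free in $target_dir"
else
    echo "Only ${avail}GB free in $target_dir; about ${EDGE_REQUIRED_GB}GB is needed"
fi
```

Run it from, or pass it, the directory where the EDGE archives will be unpacked.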

  1. Please ensure that your system has the essential software building packages installed properly before proceeding with the following installation.

  2. Download the codebase, databases and third party tools.

    ## Codebase is ~68Mb and contains all the scripts and HTML needed to make EDGE run
    wget -c https://edge-dl.lanl.gov/EDGE/1.1/edge_main_v1.1.1.tgz
    
    ## Third party tools is ~1.9Gb and contains the underlying programs needed to do the analysis
    wget -c https://edge-dl.lanl.gov/EDGE/1.1/edge_v1.1_thirdParty_softwares.tgz
    
    ## Pipeline database is ~7.9Gb and contains the other databases needed for EDGE
    wget -c https://edge-dl.lanl.gov/EDGE/1.1/edge_pipeline_v1.1.databases.tgz
    
    ## GOTTCHA database is ~14Gb and contains the custom databases for the GOTTCHA taxonomic identification pipeline
    wget -c https://edge-dl.lanl.gov/EDGE/1.1/GOTTCHA_db_for_edge_v1.1.tgz
    
    ## BWA index is ~41Gb and contains the databases for bwa taxonomic identification pipeline
    wget -c https://edge-dl.lanl.gov/EDGE/1.1/bwa_index1.1.tgz
    
    ## NCBI Genomes is ~8Gb and contains the full genomes for prokaryotes and some viruses
    wget -c https://edge-dl.lanl.gov/EDGE/1.1/NCBI_genomes_for_edge_v1.1.tar.gz
    

Warning

Be patient; the database files are huge.

  3. Unpack main archive:

    tar -xvzf edge_main_v1.1.1.tgz
    

Note

The main directory, edge_v1.1.1, will be created.

  4. Move the database and third party archives into main directory (edge_v1.1.1):

    mv edge_v1.1_thirdParty_softwares.tgz edge_v1.1.1/
    mv edge_pipeline_v1.1.databases.tgz edge_v1.1.1/
    mv GOTTCHA_db_for_edge_v1.1.tgz edge_v1.1.1/
    mv bwa_index1.1.tgz edge_v1.1.1/
    mv NCBI_genomes_for_edge_v1.1.tar.gz edge_v1.1.1/
    
  5. Change directory to the main directory and unpack the databases and third party tools archives:

    cd edge_v1.1.1
    
    # unpack third party tools
    tar -xvzf edge_v1.1_thirdParty_softwares.tgz
    
    # unpack databases
    tar -xvzf edge_pipeline_v1.1.databases.tgz
    tar -xvzf GOTTCHA_db_for_edge_v1.1.tgz
    tar -xzvf bwa_index1.1.tgz
    tar -xvzf NCBI_genomes_for_edge_v1.1.tar.gz
    

Note

At this point, you should see a database directory and a thirdParty directory in the main directory.

  6. Install the pipeline:

    ./INSTALL.sh
    

This will install the following dependent tools.

  • Assembly
    • idba
    • spades
  • Annotation
    • prokka
    • RATT
    • tRNAscan
    • barrnap
    • BLAST+
    • blastall
    • phageFinder
    • glimmer
    • aragorn
    • prodigal
    • tbl2asn
  • Alignment
    • hmmer
    • infernal
    • bowtie2
    • bwa
    • mummer
  • Taxonomy
    • kraken
    • metaphlan
    • kronatools
    • gottcha
  • Phylogeny
    • FastTree
    • RAxML
  • Utility
    • bedtools
    • R
    • GNU_parallel
    • tabix
    • JBrowse
    • primer3
    • samtools
    • sratoolkit
  • Perl_Modules
    • perl_parallel_forkmanager
    • perl_excel_writer
    • perl_archive_zip
    • perl_string_approx
    • perl_pdf_api2
    • perl_html_template
    • perl_html_parser
    • perl_JSON
    • perl_bio_phylo
    • perl_xml_twig
    • perl_cgi_session
  7. Restart the terminal session to allow $EDGE_HOME to be exported.

Note

After INSTALL.sh runs successfully, the binaries and related scripts will be stored in the ./bin and ./scripts directories. The installer also writes the EDGE_HOME environment variable into .bashrc or .bash_profile.

Testing the EDGE Installation

After installing the packages above, it is highly recommended to test the installation:

> cd $EDGE_HOME/testData
> ./runAllTest.sh
_images/testResult.png

There are 15 module/unit tests, which took around 44 minutes in our testing environment (24 cores at 2.60GHz, 512GB RAM, Ubuntu 14.04.3 LTS). Test output on the terminal indicates successes and failures. Some tests may fail due to missing external applications/modules/packages or a failed installation. These failures are noted separately in $EDGE_HOME/testData/runXXXXTest/TestOutput/error.log or in the log files of each module. If they relate to features of EDGE that you are not using, this is acceptable. Otherwise, you will want to make sure EDGE is installed correctly. If the output does not indicate any failures, you are ready to use EDGE through the command line. To take advantage of the user-friendly GUI, follow the section below to configure the EDGE web server.
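To find which tests logged errors without opening each directory by hand, the error.log files can be collected in one pass. The sketch below assumes the TestOutput/error.log layout quoted above and treats any non-empty error.log as worth inspecting, which may over-report if a test writes warnings there; `list_test_errors` is an illustrative name, not an EDGE script.

```shell
#!/bin/sh
# List non-empty error.log files left behind by the test suite.
# $1: test data directory (for example "$EDGE_HOME/testData")
list_test_errors() {
    find "$1" -type f -path '*/TestOutput/error.log' -size +0c 2>/dev/null | sort
}

test_dir="${EDGE_HOME:-.}/testData"
if [ -d "$test_dir" ]; then
    failed=$(list_test_errors "$test_dir")
    if [ -n "$failed" ]; then
        echo "Tests with logged errors:"
        echo "$failed"
    else
        echo "No non-empty error.log files found under $test_dir"
    fi
fi
```

Each path printed identifies the test directory whose log should be read before deciding whether the failure affects a feature you plan to use.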

Apache Web Server Configuration

  1. Install apache2

    For Ubuntu
    
    > sudo apt-get install apache2
    
    For CentOS
    
    > sudo yum -y install httpd
    
  2. Enable apache cgid, proxy, headers modules:

    For Ubuntu
    
    > sudo a2enmod cgid proxy proxy_http headers
    
  3. Modify/Check sample apache configuration file:

    Double check that the alias directories in $EDGE_HOME/edge_ui/apache_conf/edge_apache.conf match the EDGE
    installation path at lines 2, 3, 13, 14, 26, and 51.
    The default is configured as http://localhost/edge_ui/ or http://www.yourdomain.com/edge_ui/
    
  4. (Optional) If users are behind a corporate proxy for internet:

    Please add proxy info into $EDGE_HOME/edge_ui/apache_conf/edge_apache.conf or $EDGE_HOME/edge_ui/apache_conf/edge_httpd.conf
    
    # Add following proxy env
    SetEnv http_proxy http://yourproxy:port
    SetEnv https_proxy http://yourproxy:port
    SetEnv ftp_proxy http://yourproxy:port
    
  5. Copy the modified edge_apache.conf into the Apache configuration directory, or insert its content into httpd.conf

    For Ubuntu
    
    > cp $EDGE_HOME/edge_ui/apache_conf/edge_apache.conf /etc/apache2/conf-available/
    > ln -s /etc/apache2/conf-available/edge_apache.conf /etc/apache2/conf-enabled/
    
    For CentOS
    
    > cp $EDGE_HOME/edge_ui/apache_conf/edge_apache.conf /etc/httpd/conf.d/
    
  6. Modify permissions on the installed directory to match the Apache user

    For Ubuntu 14, the user can be edited in /etc/apache2/envvars; the variables are APACHE_RUN_USER and APACHE_RUN_GROUP.
    
    For CentOS, the user can be edited in /etc/httpd/conf/httpd.conf; the variables are User and Group.
    
    > chown -R xxxxx $EDGE_HOME/edge_ui  $EDGE_HOME/edge_ui/JBrowse/data  #(xxxxx is the APACHE_RUN_USER value)
    
    > chgrp -R xxxxx $EDGE_HOME/edge_ui  $EDGE_HOME/edge_ui/JBrowse/data  #(xxxxx is the APACHE_RUN_GROUP value)
    
  7. Restart Apache to activate the new configuration

    For Ubuntu
    
    > sudo service apache2 restart
    
    For CentOS
    
    > sudo httpd -k restart
    

User Management system installation

  1. Create database: userManagement:

    > cd $EDGE_HOME/userManagement
    > mysql -p -u root
    mysql> create database userManagement;
    mysql> use userManagement;
    

Note

Make sure MySQL is running. If not, run “sudo service mysqld start”.

For CentOS 7: “sudo systemctl start mariadb.service && sudo systemctl enable mariadb.service”

  2. Load userManagement_schema.sql:

    mysql> source userManagement_schema.sql;
    
  3. Load userManagement_constrains.sql:

    mysql> source userManagement_constrains.sql;
    
  4. Create a user account

       username: yourDBUsername
       password: yourDBPassword
       (also modify the username/password in userManagementWS.xml file)
    and grant all privileges on database "userManagement" to user yourDBUsername
    
    mysql> CREATE USER 'yourDBUsername'@'localhost' IDENTIFIED BY 'yourDBPassword';
    
    mysql> GRANT ALL PRIVILEGES ON userManagement.* to 'yourDBUsername'@'localhost';
    
    mysql> exit;
    
  5. Configure tomcat:

    * Copy mysql-connector-java-5.1.34-bin.jar to /usr/share/tomcat/lib/
    
        For Ubuntu and CentOS6
        > cp mysql-connector-java-5.1.34-bin.jar /usr/share/tomcat7/lib/
        For CentOS7
        > cp mariadb-java-client-1.2.0.jar /usr/share/tomcat/lib/
    
    * Configure tomcat basic auth to secure /user/admin/register web service
      add lines below to /var/lib/tomcat7/conf/tomcat-users.xml of Ubuntu or /etc/tomcat/tomcat-users.xml of CentOS
    
        <role rolename="admin"/>
        <user username="yourAdminName" password="yourAdminPassword" roles="admin"/>
    
        (also modify the username and password in createAdminAccount.pl file)
    
    * Set the inactivity timeout in /var/lib/tomcat7/conf/web.xml or /etc/tomcat/web.xml (default is 30 mins)
    
        <!--  <session-config>
            <session-timeout>30</session-timeout>
        </session-config> -->
    
    * Add the line below to /usr/share/tomcat7/bin/catalina.sh on Ubuntu or /etc/tomcat/tomcat.conf on CentOS to increase PermSize:
    
        JAVA_OPTS=" -Xms256M -Xmx1024M -XX:PermSize=256m -XX:MaxPermSize=512m"
    
    * Restart tomcat server
    
        for Ubuntu
        > sudo service tomcat7 restart
        for CentOS6
        > sudo service tomcat restart
        for CentOS7
        > sudo systemctl restart tomcat.service
    
    * Deploy userManagementWS to tomcat server
    
        for Ubuntu
        > cp userManagementWS.war /var/lib/tomcat7/webapps/
        > cp userManagementWS.xml /var/lib/tomcat7/conf/Catalina/localhost/
        for CentOS
        > cp userManagementWS.war /var/lib/tomcat/webapps/
        > cp userManagementWS.xml /etc/tomcat/Catalina/localhost/
    
        (For CentOS 7, modify the SQL connector in userManagementWS.xml so that driverClassName="org.mariadb.jdbc.Driver".)
    
    * Deploy userManagement to tomcat server
    
        for Ubuntu
        > cp userManagement.war /var/lib/tomcat7/webapps
        for CentOS
        > cp userManagement.war /var/lib/tomcat/webapps
    
    * Change settings in /var/lib/tomcat7/webapps/userManagement/WEB-INF/classes/sys.properties of Ubuntu.
                        /var/lib/tomcat/webapps/userManagement/WEB-INF/classes/sys.properties of CentOS.
    
        host_url=http://www.yourdomain.com:8080/userManagement
        email_sender=admin@yourdomain.com
        email_host=mail.yourdomain.com
    

Note

Tomcat files are in /var/lib/tomcat7 and /usr/share/tomcat7 on Ubuntu, and in /var/lib/tomcat, /usr/share/tomcat, and /etc/tomcat on CentOS.

The Tomcat server will automatically decompress userManagementWS.war and userManagement.war.

  6. Set up admin user:

    * run script createAdminAccount.pl to add admin account with encrypted password to database
    
        > perl createAdminAccount.pl -e admin@my.com -p admin -fn <first name> -ln <last name>
    
  7. Configure EDGE to use the user management system

    • edit $EDGE_HOME/edge_ui/cgi-bin/edge_config.tmpl and set user_management=1

Note

If the user management system is not in the same domain as EDGE (e.g. http://www.someother.com/userManagement), set the parameter edge_user_management_url=http://www.someother.com/userManagement.

  8. Enable the social (Facebook, Google, Windows Live, LinkedIn) login function

    • edit $EDGE_HOME/edge_ui/cgi-bin/edge_config.tmpl and set user_social_login=1
    • modify the admin_email and password at lines 108/109 of $EDGE_HOME/edge_ui/cgi-bin/edge_user_management.cgi according to #6 above.
    • modify $EDGE_HOME/edge_ui/javascript/social.js, changing the app IDs to those you created on each social media platform.

Note

You need to register your EDGE domain with each social media platform to get an app ID. For example, a Facebook app needs to be created and configured for the domain and website set up by EDGE; see https://developers.facebook.com/. The same registration is needed for Google+, Windows, and LinkedIn.

  9. Optional: configure sendmail to use SMTP to email out of local domain:

    * edit /etc/mail/sendmail.cf and edit this line:
    
        # "Smart" relay host (may be null)
        DS
    
    * and append the correct server right next to DS (no spaces);
    
        # "Smart" relay host (may be null)
        DSmail.yourdomain.com
    
    * Then, restart the sendmail service
    
        > sudo service sendmail restart
    

EDGE Docker image

EDGE has a lot of dependencies and can (but doesn’t have to) be very challenging to install. The EDGE Docker image gets around the difficulty of installation by providing a functioning full install of EDGE on top of the official Ubuntu 14.04.3 LTS image. You can find the image and usage instructions on Docker Hub.

EDGE VMware/OVF Image

You can start using EDGE by launching a local instance of the EDGE VM. The image was built with VMware Fusion v8.0. The pre-built EDGE VM is provided in Open Virtualization Format (OVA/OVF), which is supported by major virtualization products such as VMware, VirtualBox, and Red Hat Enterprise Virtualization. Unfortunately, this may not always work perfectly, as each VM technology seems to use a slightly different OVA/OVF implementation and these are not entirely compatible. For example, the auto-deploy feature and the path of auto-mounted shared folders between host and guest, both used in the EDGE VMware image, may not be compatible with other VM technologies (or may need advanced tweaks). Therefore, we highly recommend using VMware Workstation Player, which is free for non-commercial, personal and home use. The EDGE databases are not included in the image. You will need to download and mount the databases, input and output directories after you launch the VM. Below are instructions for running the EDGE VM on your local server:

  1. Install VMware Workstation Player.
  2. Download the VM image (EDGE_vm_RC1.ova) from the LANL FTP site.
  3. Download the EDGE databases and follow the instructions to unpack them.
  4. Configure your VM
  • Allocate at least 10GB memory to the VM
  • Share the database, input and output directories with the “database”, “EDGE_input” and “EDGE_output” directories in the VM guest OS. If you use VMware, the “Sharing settings” should look like:
_images/VMware_Sharing_settings.png
  5. Start EDGE VM.
  6. Access EDGE VM using host browser (http://<IP_OF_VM>/edge_ui/).

Note that the IP address will also be provided when the instance starts up.

_images/VMware_login.png
  7. Control EDGE VM with default credentials.

Graphic User Interface (GUI)

The user interface was mainly implemented in jQuery Mobile, CSS, JavaScript, and Perl CGI. It is an HTML5-based user interface system designed to make responsive web sites and apps that are accessible on all smartphone, tablet and desktop devices.

See GUI page

User Login

A user management system has been implemented to provide a level of privacy/security for a user’s submitted projects. When this system is activated, any user can view projects that have been made public, but other projects can only be accessed by logging into the system using a registered local EDGE account or via an existing social media account (Facebook, Google+, Windows, or LinkedIn). Users can then run new jobs and view their own previously run projects or those that have been shared with them. Clicking the user icon in the upper right opens a user login window.

_images/login.jpg

Upload Files

Because of LANL security policy, this function is not available at https://bioedge.lanl.gov/edge_ui/.

EDGE supports input from the NCBI Sequence Read Archive (SRA) and from files selected on the EDGE server. To analyze their own data, users can upload FASTQ, FASTA, and GenBank files (which can be gzip-compressed) and text (txt) files. The maximum file size is 5 GB, and files will be kept for 7 days. Choose “Upload files” from the navigation bar on the left side of the screen. Add files by clicking the “Add Files” button or by dragging files to the upload window. Then click the “Start Upload” button to upload the files to the EDGE server.

_images/upload.jpg

Initiating an analysis job

Choose “Run EDGE” from the navigation bar on the left side of the screen.

_images/initiating.jpg

This will cause a section to appear called “Input Raw Reads.” Here, you may browse the EDGE Input Directory and select FASTQ files containing the reads to be analyzed. EDGE supports gzip compressed fastq files. At minimum, EDGE will accept two FASTQ files containing paired reads and/or one FASTQ file containing single reads as initial input. Alternatively, rather than providing files through the EDGE Input Directory, you may decide to use as input reads from the Sequence Read Archive (SRA). In this case, select the “yes” option next to “Input from NCBI Sequence Reads Archive” and a field will appear where you can type in an SRA accession number.

_images/input.jpg

In addition to the input read files, you have to specify a project name. The project name is restricted to alphanumeric characters and underscores and requires a minimum of three characters. For example, a project name of “E. coli. Project” is not acceptable, but a project name of “E_coli_project” could be used instead. In the “Description” field you may enter free text that describes your project. If you would like, you may use more read files as input than the minimum of two paired read files or one file of single reads. To do so, click “additional options” to expose more fields, including two buttons for “Add Paired-end Input” and “Add Single-end Input”.
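
The project name rule above can be checked before submission. The following is a minimal sketch, assuming the rule means ASCII letters, digits, and underscores only; the pattern and the function name `is_valid_project_name` are illustrative and are not taken from the EDGE code.

```python
import re

# Assumed reading of the rule: ASCII letters, digits and underscores only,
# with a minimum length of three characters.
PROJECT_NAME_RE = re.compile(r"[A-Za-z0-9_]{3,}")


def is_valid_project_name(name: str) -> bool:
    """Return True if the whole name satisfies the rule described above."""
    return PROJECT_NAME_RE.fullmatch(name) is not None


print(is_valid_project_name("E. coli. Project"))  # False: spaces and periods
print(is_valid_project_name("E_coli_project"))    # True
```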

_images/input_additional.jpg

In the “additional options”, there are several more options, for output path, number of CPUs, and config file. In most cases, you can ignore these options, but they are described briefly below.

Output path

You may specify an output path if you would like your results written to a specific location. In most cases, you can leave this field blank and the results will be automatically written to the standard location, $EDGE_HOME/edge_ui/EDGE_output.

Number of CPUs

Additionally, you may specify the number of CPUs to be used. The default and minimum value is one-fourth of the total number of server CPUs. You may adjust this value if you wish. Assuming your hardware has 64 CPUs, the default is 16 and the maximum you should choose is 62. If the jobs currently in progress already use the maximum number of CPUs, a newly submitted job will be queued (and colored grey; for the color-coding, see Checking the status of an analysis job). For instance, if you have only one job running, you may choose 62 CPUs. However, if you are planning to run 6 different jobs simultaneously, you should divide the computing resources among them (in this case, 10 CPUs per job, totaling 60 CPUs for 6 jobs).
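
The arithmetic in the 64-CPU example can be written out as a small helper. This is only a sketch for planning purposes: the function names are hypothetical, and the figure of 2 CPUs left free is an assumption inferred from the 64 to 62 example above, not a documented EDGE setting.

```python
def default_cpus(total_cpus: int) -> int:
    """Default described above: one-fourth of the server's CPUs."""
    return max(1, total_cpus // 4)


def cpus_per_job(total_cpus: int, n_jobs: int, reserved: int = 2) -> int:
    """Split the usable CPUs evenly across jobs meant to run simultaneously.

    `reserved` is the number of CPUs left free (assumed to be 2, inferred
    from the 64 -> 62 example in the text).
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    usable = total_cpus - reserved
    return max(1, usable // n_jobs)


print(default_cpus(64))     # 16
print(cpus_per_job(64, 1))  # 62
print(cpus_per_job(64, 6))  # 10
```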

Config file

Below the “Use # of CPUs” field is a field where you may select a configuration file. A configuration file is automatically generated for each job when you click “Submit.” This field could be used if you wanted to restart a job that hadn’t finished for some reason (e.g. due to power interruption, etc.). This option ensures that your submission will be run exactly the same way as previously, with all the same options.

Batch project submission

The “Batch project submission” section is toggled off by default. Clicking on it will open it and toggle off the “Input Sequence” section at the same time. When you have many samples in the “EDGE Input Directory” and would like to run them with the same configuration, instead of submitting several times, you can compile a text file with the project names, FASTQ inputs, and optional project descriptions (upload or paste it) and submit it through the “Batch project submission” section.

_images/batchsubmit.jpg

Choosing processes/analyses

Once you have selected the input files and assigned a project name and description, you may either click “Submit” to submit an analysis job using the default parameters, or you may change various parameters prior to submitting the job. The default settings include quality filter and trimming, assembly, annotation, and community profiling. Therefore, if you choose to use default parameters, the analysis will provide an assessment of what organism(s) your sample is composed of, but will not include host removal, primer design, etc. Below the “Input Your Sample” section is a section called “Choose Processes / Analyses”. It is in this section that you may modify parameters if you would like to use settings other than the default settings for your analysis (discussed in detail below).

_images/modules.jpg

Pre-processing

Pre-processing is by default on, but can be turned off via the toggle switch on the right hand side. The default parameters should be sufficient for most cases. However, if your experiment involves specialized adapter sequences that need to be trimmed, you may do so in the Quality Trim and Filter subsection. There are two options for adapter trimming. You may either supply a FASTA file containing the adapter sequences to be trimmed, or you may specify N number of bases to be trimmed from either end of each read.

_images/qc.jpg

Note

Trim Quality Level can be used to trim reads from both ends with defined quality. “N” base cutoff can be used to filter reads which have more than this number of continuous base “N”. Low complexity is defined by the fraction of mono-/di-nucleotide sequence. Ref: FaQCs.
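
To make the “N” base cutoff concrete, here is a minimal sketch of that one filter: a read is kept only if its longest run of consecutive “N” bases does not exceed the cutoff. It illustrates the rule as worded in the note and is not the FaQCs implementation; the function names are hypothetical.

```python
import re


def longest_n_run(seq: str) -> int:
    """Length of the longest stretch of consecutive 'N' bases in a read."""
    runs = re.findall(r"N+", seq.upper())
    return max((len(run) for run in runs), default=0)


def passes_n_cutoff(seq: str, n_cutoff: int) -> bool:
    """Keep the read unless it has more than n_cutoff continuous 'N' bases."""
    return longest_n_run(seq) <= n_cutoff


print(passes_n_cutoff("ACGTNACGT", 1))  # True: longest run is 1
print(passes_n_cutoff("ACGNNTACG", 1))  # False: longest run is 2
```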

The host removal subsection allows you to subtract host-derived reads from your dataset, which can be useful for metagenomic (complex) samples such as clinical samples (blood, tissue) or environmental samples like insects. To enable host removal, within the “Host Removal” subsection of the “Choose Processes / Analyses” section, switch the toggle box to “On” and select either from the pre-built host list (Human, Invertebrate Vectors of Human Pathogens, PhiX, RefSeq Bacteria, and RefSeq Viruses) or the appropriate host FASTA file for your experiment from the navigation field. The Similarity (%) can be varied if desired, but the default is 90 and we would not recommend using a value less than 90.

Assembly And Annotation

The Assembly option is turned on by default and can be turned off via the toggle button. EDGE performs iterative k-mer de novo assembly with IDBA-UD. It performs well on isolates as well as metagenomes, but it may not work well on very large genomes. By default, it starts at k=31 and increases k in steps of 20 up to a maximum of k=121. When the maximum k value is larger than the average input read length, EDGE automatically adjusts the maximum to the average read length minus 1. The user can set a minimum length cutoff for the final contigs; by default, all contigs shorter than 200 bp are filtered out.
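
The k-mer settings described above can be sketched as follows. The function names are hypothetical, and the sketch only enumerates the values reached by stepping from the minimum; whether the assembler also runs a final pass at the maximum k itself is not stated here and is not modeled.

```python
def effective_max_k(avg_read_length: int, max_k: int = 121) -> int:
    """Cap the maximum k at the average read length minus 1."""
    return min(max_k, avg_read_length - 1)


def kmer_schedule(avg_read_length: int, min_k: int = 31,
                  step: int = 20, max_k: int = 121) -> list:
    """k values reached by stepping from min_k without exceeding the cap."""
    top = effective_max_k(avg_read_length, max_k)
    return list(range(min_k, top + 1, step))


print(effective_max_k(150))  # 121 (reads are long enough, no adjustment)
print(effective_max_k(100))  # 99  (adjusted to read length minus 1)
print(kmer_schedule(100))    # [31, 51, 71, 91]
```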

_images/assembly.jpg

The Annotation module will be performed only if the assembly option is turned on and reads were successfully assembled. EDGE has the option of using Prokka or RATT to do genome annotation. In most cases, Prokka is the appropriate tool to use; however, if your input is a viral genome with an attached reference annotation (GenBank file), RATT is the preferred method. If for some reason the assembly fails (e.g. it runs out of memory), EDGE will bypass any modules requiring a contigs file, including the annotation analysis.

Reference-based Analysis

The reference-based analysis section allows you to map reads/contigs to the provided references, which can be useful for known isolated species such as cultured samples, to get coverage information and validate the assembled contigs. To enable reference-based analysis, switch the toggle box to “On” and select either from the pre-built reference list (Ebola virus genomes, E.coli 55989, E.coli O104H4, E.coli O127H6, and E.coli K12 MG1655) or the appropriate FASTA/GenBank file for your experiment from the navigation field.

_images/analysis.jpg

Given a reference genome fasta file, EDGE will turn on the analysis of the reads/contigs mapping to reference and JBrowse reference track generation. If a GenBank file is provided, EDGE will also turn on variant analysis.

Taxonomy Classification

Taxonomic profiling is performed via the “Taxonomy Classification” feature. This is a useful feature not only for complex samples, but also for purified microbial samples (to detect contamination). In the “Community profiling” subsection of the “Choose Processes / Analyses” section, community profiling can be turned on or off via the toggle button.

_images/classification.jpg

There is an option to “Always use all reads” or not. If “Always use all reads” is not selected, then only those reads that do not map to the user-supplied reference will be shown in downstream analyses (i.e. the results will only include what is different from the reference). Additionally, the user can choose among different profiling tools with a checkbox selection menu. EDGE uses multiple tools for taxonomy classification, including GOTTCHA (bacterial & viral databases), MetaPhlAn, Kraken, and read mapping to NCBI RefSeq using BWA.

Turning on the “Contig-Based Taxonomy Classification” section will initiate mapping contigs against NCBI databases for taxonomy and functional annotations.

Phylogenomic Analysis

EDGE supports 5 pre-computed pathogen databases (E.coli, Yersinia, Francisella, Brucella, Bacillus) for SNP phylogeny analysis. You can also choose to build your own database by first selecting a build method (either FastTree or RAxML), then selecting a pathogen from the “Search Genomes” search function. You can also add FASTA files or SRA Accessions.

_images/phylogeny.jpg

PCR Primer Tools

EDGE includes PCR-related tools for use by those who want to use PCR data for their projects.

_images/pcr.jpg
  • Primer Validation

    The “Primer Validation” tool can be used to verify whether and where given primer sequences would align to the genome of the sequenced organism. Prior to initiating the analysis, primer sequences in FASTA format must be deposited in the folder on the desktop in the directory entitled “EDGE Input Directory.”

    In order to initiate primer validation, within the “Primer Validation” subsection switch the “Run Primer Validation” toggle button to “On”. Then, within the “Primer FASTA Sequences” navigation field, select your file containing the primer sequences of interest. Next, in the “Maximum Mismatch” field, choose the maximum number of mismatches you wish to allow per primer sequence. The available options are 0, 1, 2, 3, or 4.

  • Primer Design

    If you would like to design new primers that will differentiate a sequenced microorganism from all other bacteria and viruses in NCBI, you can do so using the “Primer Design” tool. To initiate primer design switch the “Run Primer Design” toggle button to “On”. There are default settings supplied for Melting Temperature, Primer Length, Tm Differential, and Number of Primer Pairs, but you can change these settings if desired.
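
To illustrate what the “Maximum Mismatch” setting in Primer Validation means, the sketch below scans a template for positions where a primer matches with at most a given number of mismatches. It is a simplified, ungapped, forward-strand-only comparison written for illustration; it is not the alignment method EDGE uses, and the function names are hypothetical.

```python
def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_primer_sites(primer: str, template: str, max_mismatch: int) -> list:
    """Return (start, mismatches) for every window within the mismatch limit.

    Ungapped comparison on the forward strand only (illustrative assumption).
    """
    primer, template = primer.upper(), template.upper()
    hits = []
    for start in range(len(template) - len(primer) + 1):
        window = template[start:start + len(primer)]
        mismatches = count_mismatches(primer, window)
        if mismatches <= max_mismatch:
            hits.append((start, mismatches))
    return hits


print(find_primer_sites("ACGTAC", "AAACGTACGTTT", 0))  # [(2, 0)]
print(find_primer_sites("ACGTAC", "AAACGTACGTTT", 2))  # [(2, 0), (6, 2)]
```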

Submission of a job

When you have selected the appropriate input files and desired analysis options, and you are ready to submit the analysis job, click on the “Submit” button at the bottom of the page. Immediately you will see indicators of successful job submission and job status below the submit button, in green. If there is something wrong with the input, it will stop the submission and show the message in red, highlighting the sections with issues.

_images/submission.jpg

Checking the status of an analysis job

Once an analysis job has been submitted, it will become visible in the left navigation bar. A grey, red, orange, and green color-coding system indicates job status as follows:

Color     Status
Grey      Not yet begun
Red       Error
Orange    In progress (running)
Green     Completed

While the job is in progress, clicking on the project in the left navigation bar will allow you to see which individual steps have been completed or are in progress, and results that have already been produced. Clicking the job progress widget at top right opens up a more concise view of progress.

_images/status.jpg _images/status2.jpg

Monitoring the Resource Usage

In the job project sidebar, you can see there is an “EDGE Server Usage” widget that dynamically monitors the server resource usage for %CPU, %MEMORY and %DISK space. If there is not enough available disk space, you may consider deleting or archiving the submitted job with the Action tool described below.

_images/resource.jpg

Management of Jobs

Below the resource monitor is the “Action” tool, used for managing jobs in progress or existing projects.

_images/action.jpg

The available actions are:

  • View live log: A terminal-like screen showing all the command lines and progress log information. This is useful for troubleshooting or if you want to repeat certain functions through the command line on the EDGE server.
  • Force to rerun this project: Rerun a project with the same inputs and configuration. No additional input is needed.
  • Interrupt running project: Immediately stop a running project.
  • Delete entire project: Delete the entire output directory of the project.
  • Remove from project list: Keep the output but remove the project name from the project list.
  • Empty project outputs: Clean all the results but keep the config file. The user can use this function to do a clean rerun.
  • Move to an archive directory: For performance reasons, the output directory is kept in local storage. The user can use this function to move projects from local storage to a slower but larger network storage, which is configured when the EDGE server is installed.
  • Share Project: Allow guests and other users to view the project.
  • Make project Private: Restrict access to viewing the project to only yourself.

Other Methods of Accessing EDGE

Internal Python Web Server

EDGE includes a simple web server for single-user applications or other testing. It is not robust enough for production usage, but it is simple enough that it can be run on practically any system.

To run the GUI, type:

$EDGE_HOME/start_edge_ui.sh

This starts a web server on localhost and opens the GUI HTML page in your default browser.

Apache Web Server

The preferred installation of EDGE uses Apache 2 (See Apache Web Server Configuration), and serves the application as a proper system service. A sample httpd.conf (or apache2.conf, depending on your operating system) is provided in the root directory of your installation. If this configuration is used, EDGE will be available on any IP or hostname registered to the machine, on ports 80 and 8080.

You can access EDGE by opening either the desktop link (below), or your browser, and entering http://localhost:80 in the address bar.

Note

If the desktop environment is available, after installation, a “Start EDGE UI” icon should be on the desktop. Click on the green icon and choose “Run in Terminal.” Results should be the same as those obtained by the above method to start the GUI.

_images/edge_desktop_icon.png _images/start_ui_in_terminal.png

The URL address is 127.0.0.1:8080/index.html. This server may not be as powerful as one hosted by the Apache HTTP Server, but it works. With system administrator help, the Apache HTTP Server is the suggested method for hosting the GUI.

Note

You may need to configure the edge_wwwroot and input and output in the edge_ui/edge_config.tmpl file while configuring the Apache HTTP Server and link to external drive or network drive if needed.

A Terminal window will display messages and errors as you run EDGE. Under normal operating conditions you can minimize this window. Should an error/problem arise, you may maximize this window to view the error.

_images/Terminal_log.png

Warning

IMPORTANT: Do not close this window!

The Browser window is the window in which you will interact with EDGE.

Command Line Interface (CLI)

The command line usage is as follows:

Usage: perl runPipeline.pl [options] -c config.txt -p 'reads1.fastq reads2.fastq' -o out_directory
Version 1.1
Input File:
        -u            Unpaired reads, Single end reads in fastq

        -p            Paired reads in two fastq files and separate by space in quote

        -c            Config File
Output:
        -o            Output directory.

Options:
        -ref          Reference genome file in fasta

        -primer       A pair of Primers sequences in strict fasta format

        -cpu          number of CPUs (default: 8)

        -version      print verison

When running from the command line, a config file (see the example in the section below; the Graphic User Interface (GUI) generates the config file automatically), read files in FASTQ format, and an output directory are required. Based on the configuration file, if all modules are turned on, EDGE will run the following steps. Each step consists of at least one command-line script or program.

  1. Data QC
  2. Host Removal QC
  3. De novo Assembling
  4. Reads Mapping To Contig
  5. Reads Mapping To Reference Genomes
  6. Taxonomy Classification on All Reads or unMapped to Reference Reads
  7. Map Contigs To Reference Genomes
  8. Variant Analysis
  9. Contigs Taxonomy Classification
  10. Contigs Annotation
  11. ProPhage detection
  12. PCR Assay Validation
  13. PCR Assay Adjudication
  14. Phylogenetic Analysis
  15. Generate JBrowse Tracks
  16. HTML report
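
For scripted runs, the usage line above can be assembled from Python. This is a minimal sketch that only builds the argument list from the flags shown in the usage text (-c, -o, -p, -u, -ref, -cpu); the helper name `build_pipeline_command` is hypothetical, and the location of runPipeline.pl on your system is an assumption you will need to adjust.

```python
import shlex


def build_pipeline_command(config, out_dir, paired=None, unpaired=None,
                           ref=None, cpu=None, script="runPipeline.pl"):
    """Build the argument list for a runPipeline.pl call (nothing is executed)."""
    cmd = ["perl", script, "-c", config, "-o", out_dir]
    if paired:
        # Paired files are passed as one space-separated argument.
        cmd += ["-p", " ".join(paired)]
    if unpaired:
        cmd += ["-u", unpaired]
    if ref:
        cmd += ["-ref", ref]
    if cpu is not None:
        cmd += ["-cpu", str(cpu)]
    return cmd


cmd = build_pipeline_command("config.txt", "out_directory",
                             paired=("reads1.fastq", "reads2.fastq"), cpu=8)
# Prints: perl runPipeline.pl -c config.txt -o out_directory -p 'reads1.fastq reads2.fastq' -cpu 8
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` directly, which avoids shell quoting problems with the space-separated paired-read argument.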

Configuration File

The config file is a text file with the following information. If you are going to do host removal, you need to build a host index for it and change the FASTA file path in the config file.

[Count Fastq]
DoCountFastq=auto

[Quality Trim and Filter]
## boolean, 1=yes, 0=no
DoQC=1
##Targets quality level for trimming
q=5
##Trimmed sequence length will have at least minimum length
min_L=50
##Average quality cutoff
avg_q=0
##"N" base cutoff.  Trimmed read has more than this number of continuous base "N" will be discarded.
n=1
##Low complexity filter ratio, Maximum fraction of mono-/di-nucleotide sequence
lc=0.85
## Trim reads with adapters or contamination sequences
adapter=/PATH/adapter.fasta
## phiX filter, boolean, 1=yes, 0=no
phiX=0
## Cut # bp from 5 end before quality trimming/filtering
5end=0
## Cut # bp from 3 end before quality trimming/filtering
3end=0

[Host Removal]
## boolean, 1=yes, 0=no
DoHostRemoval=1
## Use more Host=  to remove multiple host reads
Host=/PATH/all_chromosome.fasta
similarity=90

[Assembly]
## boolean, 1=yes, 0=no
DoAssembly=1
##Bypass assembly and use pre-assembled contigs
assembledContigs=
minContigSize=200
## spades or idba_ud
assembler=idba_ud
idbaOptions="--pre_correction  --mink 31"
## for spades
singleCellMode=
pacbioFile=
nanoporeFile=

[Reads Mapping To Contigs]
# Reads mapping to contigs
DoReadsMappingContigs=auto

[Reads Mapping To Reference]
# Reads mapping to reference
DoReadsMappingReference=0
bowtieOptions=
# reference genbank or fasta file
reference=
MapUnmappedReads=0

[Reads Taxonomy Classification]
## boolean, 1=yes, 0=no
DoReadsTaxonomy=1
## If reference genome exists, only use unmapped reads to do Taxonomy Classification. Turn on AllReads=1 will use all reads instead.
AllReads=0
enabledTools=gottcha-genDB-b,gottcha-speDB-b,gottcha-strDB-b,gottcha-genDB-v,gottcha-speDB-v,gottcha-strDB-v,metaphlan,bwa,kraken_mini

[Contigs Mapping To Reference]
# Contig mapping to reference
DoContigMapping=auto
## identity cutoff
identity=85
MapUnmappedContigs=0

[Variant Analysis]
DoVariantAnalysis=auto

[Contigs Taxonomy Classification]
DoContigsTaxonomy=1

[Contigs Annotation]
## boolean, 1=yes, 0=no
DoAnnotation=1
# kingdom: Archaea Bacteria Mitochondria Viruses
kingdom=Bacteria
contig_size_cut_for_annotation=700
## support tools: Prokka or RATT
annotateProgram=Prokka
annotateSourceGBK=

[ProPhage Detection]
DoProPhageDetection=1

[Phylogenetic Analysis]
DoSNPtree=1
## Available choices are Ecoli, Yersinia, Francisella, Brucella, Bacillus
SNPdbName=Ecoli
## FastTree or RAxML
treeMaker=FastTree
## SRA accessions ByRun, ByExp, BySample, ByStudy
SNP_SRA_ids=

[Primer Validation]
DoPrimerValidation=1
maxMismatch=1
primer=

[Primer Adjudication]
## boolean, 1=yes, 0=no
DoPrimerDesign=0
## desired primer tm
tm_opt=59
tm_min=57
tm_max=63
## desired primer length
len_opt=18
len_min=20
len_max=27
## reject primer having Tm < tm_diff difference with background Tm
tm_diff=5
## display # top results for each target
top=5

[Generate JBrowse Tracks]
DoJBrowse=1

[HTML Report]
DoHTMLReport=1
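
Because the config file uses an INI-style layout, it can be inspected with Python's standard `configparser` module. The sketch below reads a short excerpt of the file shown above and reports the value of each module's `Do...` switch. It illustrates one way to read the file and says nothing about how EDGE itself parses it; the function name `module_switches` is hypothetical.

```python
import configparser

# Short excerpt of the config shown above, embedded so the sketch is runnable.
EXCERPT = """
[Quality Trim and Filter]
## boolean, 1=yes, 0=no
DoQC=1
q=5
min_L=50

[Host Removal]
DoHostRemoval=1
Host=/PATH/all_chromosome.fasta
similarity=90

[Reads Mapping To Reference]
DoReadsMappingReference=0
reference=

[Variant Analysis]
DoVariantAnalysis=auto
"""


def module_switches(text: str) -> dict:
    """Map each section name to the value of its Do... switch."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original key capitalization
    parser.read_string(text)
    switches = {}
    for section in parser.sections():
        for key, value in parser[section].items():
            if key.startswith("Do"):
                switches[section] = value
    return switches


for section, value in module_switches(EXCERPT).items():
    print(f"{section}: {value}")
```

Lines starting with `#` are treated as comments by `configparser`, so the `##` annotations are skipped. Note that its default strict mode rejects a key repeated within one section, so a file with several `Host=` lines would need different handling.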

Test Run

EDGE provides an example data set, an E. coli MiSeq dataset that has been subsampled to ~10x read coverage.

In the EDGE home directory,

cd testData
sh runTest.sh

Snapshot from the terminal.

See Output

Descriptions of each module

Each module comes with default parameters, and the user can see the optional parameters by entering the program name with the -h or -help flag and no other arguments.

  1. Data QC
  • Required step? No

  • Command example

    perl $EDGE_HOME/scripts/illumina_fastq_QC.pl  -p 'Ecoli_10x.1.fastq Ecoli_10x.2.fastq'  -q 5 -min_L 50 -avg_q 5 -n 0 -lc 0.85 -d QcReads -t 10
    
  • What it does

    • Quality control
    • Read filtering
    • Read trimming
  • Expected input

    • Paired-end/Single-end reads in FASTQ format
  • Expected output

    • QC.1.trimmed.fastq
    • QC.2.trimmed.fastq
    • QC.unpaired.trimmed.fastq
    • QC.stats.txt
    • QC_qc_report.pdf
  2. Host Removal QC
  • Required step? No

  • Command example

    perl $EDGE_HOME/scripts/host_reads_removal_by_mapping.pl  -p 'QC.1.trimmed.fastq QC.2.trimmed.fastq' -u QC.unpaired.trimmed.fastq -ref human_chromosomes.fasta -o QcReads -cpu 10
    
  • What it does

    • Read filtering
  • Expected input

    • Paired-end/Single-end reads in FASTQ format
  • Expected output

    • host_clean.1.fastq
    • host_clean.2.fastq
    • host_clean.mapping.log
    • host_clean.unpaired.fastq
    • host_clean.stats.txt
  3. IDBA Assembling
  • Required step? No

  • Command example

    fq2fa --merge host_clean.1.fastq  host_clean.2.fastq  pairedForAssembly.fasta
    idba_ud  --num_threads 10 -o AssemblyBasedAnalysis/idba --pre_correction pairedForAssembly.fasta
    
  • What it does

    • Iterative kmers de novo Assembly, it performs well on isolates as well as metagenomes. It may not work well on very large genomes.
  • Expected input

    • Paired-end/Single-end reads in FASTA format
  • Expected output

    • contig.fa
    • scaffold.fa (input paired end)
  4. Reads Mapping To Contig
  • Required step? No

  • Command example

    perl $EDGE_HOME/scripts/runReadsToContig.pl  -p 'host_clean.1.fastq host_clean.2.fastq' -d AssemblyBasedAnalysis/readsMappingToContig -pre readsToContigs  -ref AssemblyBasedAnalysis/contigs.fa
    
  • What it does

    • Mapping reads to assembled contigs
  • Expected input

    • Paired-end/Single-end reads in FASTQ format
    • Assembled Contigs in Fasta format
    • Output Directory
    • Output prefix
  • Expected output

    • readsToContigs.alnstats.txt
    • readsToContigs_coverage.table
    • readsToContigs_plots.pdf
    • readsToContigs.sort.bam
    • readsToContigs.sort.bam.bai
  5. Reads Mapping To Reference Genomes
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/runReadsToGenome.pl  -p 'host_clean.1.fastq host_clean.2.fastq'  -d ReadsBasedAnalysis -pre readsToRef -ref Reference.fna
    
  • What it does

    • Mapping reads to reference genomes
    • SNPs/Indels calling
  • Expected input

    • Paired-end/Single-end reads in FASTQ format
    • Reference genomes in Fasta format
    • Output Directory
    • Output prefix
  • Expected output

    • readsToRef.alnstats.txt
    • readsToRef_plots.pdf
    • readsToRef_refID.coverage
    • readsToRef_refID.gap.coords
    • readsToRef_refID.window_size_coverage
    • readsToRef.ref_windows_gc.txt
    • readsToRef.raw.bcf
    • readsToRef.sort.bam
    • readsToRef.sort.bam.bai
    • readsToRef.vcf
  6. Taxonomy Classification on All Reads or unMapped to Reference Reads
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/microbial_profiling/microbial_profiling_configure.pl $EDGE_HOME/scripts/microbial_profiling/microbial_profiling.settings.tmpl gottcha-speDB-b > microbial_profiling.settings.ini
    perl $EDGE_HOME/scripts/microbial_profiling/microbial_profiling.pl -o  Taxonomy -s microbial_profiling.settings.ini -c 10 UnmappedReads.fastq
    
  • What it does

    • Taxonomy Classification using multiple tools, including BWA mapping to NCBI Refseq, metaphlan, kraken, GOTTCHA.
    • Unify the various output formats and generate reports
  • Expected input

    • Reads in FASTQ format
    • Configuration text file (generated by microbial_profiling_configure.pl)
  • Expected output

    • Summary EXCEL and text files.
    • Heatmaps tools comparison
    • Radarchart tools comparison
    • Krona and tree-style plots for each tool.
  7. Map Contigs To Reference Genomes
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/nucmer_genome_coverage.pl  -e 1 -i 85 -p contigsToRef Reference.fna contigs.fa
    
  • What it does

    • Mapping assembled contigs to reference genomes
    • SNPs/Indels calling
  • Expected input

    • Reference genome in Fasta Format
    • Assembled contigs in Fasta Format
    • Output prefix
  • Expected output

    • contigsToRef_avg_coverage.table
    • contigsToRef.delta
    • contigsToRef_query_unUsed.fasta
    • contigsToRef.snps
    • contigsToRef.coords
    • contigsToRef.log
    • contigsToRef_query_novel_region_coord.txt
    • contigsToRef_ref_zero_cov_coord.txt
  8. Variant Analysis
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/SNP_analysis.pl -genbank Reference.gbk -SNP contigsToRef.snps -format nucmer
    perl $EDGE_HOME/scripts/gap_analysis.pl -genbank Reference.gbk -gap  contigsToRef_ref_zero_cov_coord.txt
    
  • What it does

    • Analyze variant and gap regions using the annotation file.
  • Expected input

    • Reference in GenBank format
    • SNPs/INDELs/Gaps files from “Map Contigs To Reference Genomes“
  • Expected output

    • contigsToRef.SNPs_report.txt
    • contigsToRef.Indels_report.txt
    • GapVSReference.report.txt
  9. Contigs Taxonomy Classification
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/contig_classifier_by_bwa/contig_classifier_by_bwa.pl --db $EDGE_HOME/database/bwa_index/NCBI-Bacteria-Virus.fna --threads 10 --prefix OuputCT --input contigs.fa
    
  • What it does

    • Taxonomy Classification on contigs using BWA mapping to NCBI Refseq
  • Expected input

    • Contigs in Fasta format
    • NCBI Refseq genomes bwa index
    • Output prefix
  • Expected output

    • prefix.assembly_class.csv
    • prefix.assembly_class.top.csv
    • prefix.ctg_class.csv
    • prefix.ctg_class.LCA.csv
    • prefix.ctg_class.top.csv
    • prefix.unclassified.fasta
  10. Contig Annotation
  • Required step? No

  • Command example:

    prokka --force --prefix PROKKA --outdir Annotation contigs.fa
    
  • What it does

    • The rapid annotation of prokaryotic genomes.
  • Expected input

    • Assembled Contigs in Fasta format
    • Output Directory
    • Output prefix
  • Expected output

    • It produces GFF3, GBK, and SQN files that are ready for editing in Sequin and ultimately for submission to GenBank/DDBJ/ENA.
  11. ProPhage detection
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/phageFinder_prepare.pl -o Prophage -p Assembly Annotation/PROKKA.gff Annotation/PROKKA.fna
    $EDGE_HOME/thirdParty/phage_finder_v2.1/bin/phage_finder_v2.1.sh Assembly
    
  • What it does

    • Identify and classify prophages within prokaryotic genomes.
  • Expected input

    • Annotated Contigs GenBank file
    • Output Directory
    • Output prefix
  • Expected output

    • phageFinder_summary.txt
  12. PCR Assay Validation
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/pcrValidation/validate_primers.pl -ref contigs.fa -primer primers.fa -mismatch 1 -output AssayCheck
    
  • What it does

    • In silico PCR primer validation by sequence alignment.
  • Expected input

    • Assembled Contigs/Reference in Fasta format
    • Output Directory
    • Output prefix
  • Expected output

    • pcrContigValidation.log
    • pcrContigValidation.bam
  13. PCR Assay Adjudication
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/pcrAdjudication/pcrUniquePrimer.pl --input contigs.fa  --gff3 PCR.Adjudication.primers.gff3
    
  • What it does

    • Design unique primer pairs for input contigs.
  • Expected input

    • Assembled Contigs in Fasta format
    • Output gff3 file name
  • Expected output

    • PCR.Adjudication.primers.gff3
    • PCR.Adjudication.primers.txt
  14. Phylogenetic Analysis
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/prepare_SNP_phylogeny.pl -o output/SNP_Phylogeny/Ecoli -tree FastTree -db Ecoli -n output -cpu 10 -p QC.1.trimmed.fastq QC.2.trimmed.fastq -c contigs.fa -s QC.unpaired.trimmed.fastq
    perl $EDGE_HOME/scripts/SNPphy/runSNPphylogeny.pl output/SNP_Phylogeny/Ecoli/SNPphy.ctrl
    
  • What it does

    • Perform SNP identification against selected pre-built SNPdb or selected genomes
    • Build SNP based multiple sequence alignment for all and CDS regions
    • Generate Tree file in newick/PhyloXML format
  • Expected input

    • SNPdb path or genomesList
    • Fastq reads files
    • Contig files
  • Expected output

    • SNP based phylogenetic multiple sequence alignment
    • SNP based phylogenetic tree in newick/PhyloXML format.
    • SNP information table
  15. Generate JBrowse Tracks
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/edge2jbrowse_converter.pl --in-ref-fa Reference.fna --in-ref-gff3 Reference.gff --proj_outdir EDGE_project_dir
    
  • What it does

    • Convert several EDGE outputs into JBrowse tracks for visualization for contigs and reference, respectively.
  • Expected input

    • EDGE project output Directory
  • Expected output

    • EDGE post-processed files for JBrowse tracks in the JBrowse directory.
    • Tracks configuration files in the JBrowse directory.
  16. HTML Report
  • Required step? No

  • Command example:

    perl $EDGE_HOME/scripts/munger/outputMunger_w_temp.pl EDGE_project_dir
    
  • What it does

    • Generate statistical numbers and plots in an interactive html report page.
  • Expected input

    • EDGE project output Directory
  • Expected output

    • report.html

Other command-line utility scripts

  1. To extract certain taxa fasta from contig classification result:

    cd /home/edge_install/edge_ui/EDGE_output/41/AssemblyBasedAnalysis/Taxonomy
    perl /home/edge_install/scripts/contig_classifier_by_bwa/extract_fasta_by_taxa.pl -fasta ../contigs.fa -csv ProjectName.ctg_class.top.csv -taxa "Enterobacter cloacae" > Ecloacae.contigs.fa
    
  2. To extract unmapped/mapped reads fastq from the bam file:

    cd /home/edge_install/edge_ui/EDGE_output/41/AssemblyBasedAnalysis/readsMappingToContig
    # extract unmapped reads
    perl /home/edge_install/scripts/bam_to_fastq.pl -unmapped readsToContigs.sort.bam
    # extract mapped reads
    perl /home/edge_install/scripts/bam_to_fastq.pl -mapped readsToContigs.sort.bam
    
  3. To extract mapped reads fastq of a specific contig/reference from the bam file:

    cd /home/edge_install/edge_ui/EDGE_output/41/AssemblyBasedAnalysis/readsMappingToContig
    perl /home/edge_install/scripts/bam_to_fastq.pl -id ProjectName_00001 -mapped readsToContigs.sort.bam
    

Output

The output directory contains ten major sub-directories when all modules are turned on. In addition to these directories, EDGE writes a final report in Portable Document Format (PDF), a process log, and an error log to the main project directory.

  • AssayCheck
  • AssemblyBasedAnalysis
  • HostRemoval
  • HTML_Report
  • JBrowse
  • QcReads
  • ReadsBasedAnalysis
  • ReferenceBasedAnalysis
  • Reference
  • SNP_Phylogeny
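
To see at a glance which of the sub-directories listed above a finished run actually produced, a small shell helper can walk the list. This is a minimal sketch: the function name check_edge_output is hypothetical and the project path is a placeholder. A directory reported as missing may simply belong to a module that was turned off for that run.

```shell
# Report which of the ten documented EDGE sub-directories exist in a project directory.
# check_edge_output is a hypothetical helper name, not part of EDGE.
check_edge_output() {
    local project_dir=$1
    local sub
    for sub in AssayCheck AssemblyBasedAnalysis HostRemoval HTML_Report JBrowse \
               QcReads ReadsBasedAnalysis ReferenceBasedAnalysis Reference SNP_Phylogeny; do
        if [ -d "$project_dir/$sub" ]; then
            printf 'present\t%s\n' "$sub"
        else
            printf 'missing\t%s\n' "$sub"
        fi
    done
}
```

For example, check_edge_output EDGE_project_dir prints one tab-separated line per sub-directory.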

In the graphical user interface, EDGE generates an interactive output web page that includes summary statistics, taxonomic information, and more. The easiest way to interact with the results is through the web interface. When a project run is finished, the user can click the project ID in the menu and EDGE will generate the interactive HTML report on the fly. From there, the user can browse the output directory structure by clicking the project link, visualize results through the JBrowse links, download the PDF files, and so on. If a project was run through the command line, the user can open the report HTML file in the HTML_Report subdirectory offline.

_images/output.png

Example Output

See http://lanl-bioinformatics.github.io/EDGE/example_output/report.html

Note

The link above is only an example of the graphical output. JBrowse and the other links are not accessible from the example page.

Databases

EDGE provided databases

MvirDB

A Microbial database of protein toxins, virulence factors and antibiotic resistance genes for bio-defense applications

NCBI Refseq

EDGE provides a prebuilt BLAST database and bwa_index of NCBI RefSeq genomes.

See $EDGE_HOME/database/bwa_index/id_mapping.txt for the lookup table that maps every gi/accession to its genome name.
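
A lookup in that table can be done with grep. The sketch below is an assumption-laden example: the column layout of id_mapping.txt is not described here, so the helper (a hypothetical function named lookup_genome_name) just prints every line that contains the query as a whole word.

```shell
# Print every line of the lookup table that contains the given gi or accession.
# lookup_genome_name is a hypothetical helper; the table layout is not assumed,
# so the whole matching line is printed.
lookup_genome_name() {
    local query=$1
    local table=${2:-$EDGE_HOME/database/bwa_index/id_mapping.txt}
    # -F: treat the query as a fixed string; -w: match whole words only
    grep -F -w -- "$query" "$table"
}
```

The second argument is optional and defaults to the path given above. The function returns a non-zero status when nothing matches.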

Krona taxonomy

Update Krona taxonomy db

Download these files from ftp://ftp.ncbi.nih.gov/pub/taxonomy:

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Transfer the files to the taxonomy folder in the standalone KronaTools installation and run:

$EDGE_HOME/thirdParty/KronaTools-2.4/updateTaxonomy.sh --local

Metaphlan database

MetaPhlAn relies on unique clade-specific marker genes identified from 3,000 reference genomes.

Human Genome

The bwa index is prebuilt in EDGE. It was built from the human hs_ref_GRCh38 sequences from the NCBI FTP site.

MiniKraken DB

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. MiniKraken is a pre-built 4 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq (as of Mar. 30, 2014).

GOTTCHA DB

A novel, annotation-independent and signature-based metagenomic taxonomic profiling tool. (manuscript in submission)

SNPdb

SNP databases based on whole-genome comparison. The currently available databases are Ecoli, Yersinia, Francisella, Brucella, and Bacillus.

Invertebrate Vectors of Human Pathogens

The bwa index is prebuilt in EDGE.

Version: 2014 July 24

Other optional databases

These databases are not included in EDGE, but you can download them.

Building bwa index

The human genome is used here as an example.

  1. Download the human hs_ref_GRCh38 sequences from the NCBI FTP site.

    Go to ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/ or use the Perl script provided in $EDGE_HOME/scripts/:

    perl $EDGE_HOME/scripts/download_human_refseq_genome.pl output_dir
    
  2. Gunzip the downloaded FASTA files and concatenate them into one human genome multi-FASTA file:

    gunzip hs_ref_GRCh38.*.fa.gz
    cat hs_ref_GRCh38.*.fa > human_ref_GRCh38.all.fasta
    
  3. Use the installed bwa to build the index:

    $EDGE_HOME/bin/bwa index human_ref_GRCh38.all.fasta
    
Now you can set “host=/path/human_ref_GRCh38.all.fasta” in the config file for the host removal step.
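
Before pointing the config file at the new FASTA, it can be worth confirming that the index build finished. The sketch below checks for the five files that bwa index writes next to the FASTA (.amb, .ann, .bwt, .pac, .sa). The function name bwa_index_complete is hypothetical, and a non-empty file is only a rough sign of a complete build.

```shell
# Succeed only if the FASTA and the five files written by "bwa index" exist and are non-empty.
# bwa_index_complete is a hypothetical helper name, not part of EDGE or bwa.
bwa_index_complete() {
    local fasta=$1
    local ext
    [ -s "$fasta" ] || { echo "missing or empty: $fasta" >&2; return 1; }
    for ext in amb ann bwt pac sa; do
        [ -s "$fasta.$ext" ] || { echo "missing or empty: $fasta.$ext" >&2; return 1; }
    done
}
```

For example: bwa_index_complete /path/human_ref_GRCh38.all.fasta && echo "index looks complete"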

SNP database genomes

The SNP databases were pre-built from the genomes listed below.

Ecoli Genomes

Name Description URL
Ecoli_042 Escherichia coli 042, complete genome http://www.ncbi.nlm.nih.gov/nuccore/387605479
Ecoli_11128 Escherichia coli O111:H- str. 11128, complete genome http://www.ncbi.nlm.nih.gov/nuccore/260866153
Ecoli_11368 Escherichia coli O26:H11 str. 11368 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/260853213
Ecoli_12009 Escherichia coli O103:H2 str. 12009, complete genome http://www.ncbi.nlm.nih.gov/nuccore/260842239
Ecoli_2009EL2050 Escherichia coli O104:H4 str. 2009EL-2050 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/410480139
Ecoli_2009EL2071 Escherichia coli O104:H4 str. 2009EL-2071 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/407466711
Ecoli_2011C3493 Escherichia coli O104:H4 str. 2011C-3493 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/407479587
Ecoli_536 Escherichia coli 536, complete genome http://www.ncbi.nlm.nih.gov/nuccore/110640213
Ecoli_55989 Escherichia coli 55989 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/218693476
Ecoli_ABU_83972 Escherichia coli ABU 83972 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386637352
Ecoli_APEC_O1 Escherichia coli APEC O1 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/117622295
Ecoli_ATCC_8739 Escherichia coli ATCC 8739 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/170018061
Ecoli_BL21_DE3 Escherichia coli BL21(DE3) chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/387825439
Ecoli_BW2952 Escherichia coli BW2952 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/238899406
Ecoli_CB9615 Escherichia coli O55:H7 str. CB9615 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/291280824
Ecoli_CE10 Escherichia coli O7:K1 str. CE10 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386622414
Ecoli_CFT073 Escherichia coli CFT073 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/26245917
Ecoli_DH1 Escherichia coli DH1, complete genome http://www.ncbi.nlm.nih.gov/nuccore/387619774
Ecoli_Di14 Escherichia coli str. ‘clone D i14’ chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386632422
Ecoli_Di2 Escherichia coli str. ‘clone D i2’ chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386627502
Ecoli_E2348_69 Escherichia coli O127:H6 str. E2348/69 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/215485161
Ecoli_E24377A Escherichia coli E24377A chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/157154711
Ecoli_EC4115 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/209395693
Ecoli_ED1a Escherichia coli ED1a chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/218687878
Ecoli_EDL933 Escherichia coli O157:H7 str. EDL933 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/16445223
Ecoli_ETEC_H10407 Escherichia coli ETEC H10407, complete genome http://www.ncbi.nlm.nih.gov/nuccore/387610477
Ecoli_HS Escherichia coli HS, complete genome http://www.ncbi.nlm.nih.gov/nuccore/157159467
Ecoli_IAI1 Escherichia coli IAI1 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/218552585
Ecoli_IAI39 Escherichia coli IAI39 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/218698419
Ecoli_IHE3034 Escherichia coli IHE3034 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386597751
Ecoli_K12_DH10B Escherichia coli str. K-12 substr. DH10B chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/170079663
Ecoli_K12_MG1655 Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/49175990
Ecoli_K12_W3110 Escherichia coli str. K-12 substr. W3110, complete genome http://www.ncbi.nlm.nih.gov/nuccore/388476123
Ecoli_KO11FL Escherichia coli KO11FL chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386698504
Ecoli_LF82 Escherichia coli LF82, complete genome http://www.ncbi.nlm.nih.gov/nuccore/222154829
Ecoli_NA114 Escherichia coli NA114 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386617516
Ecoli_NRG_857C Escherichia coli O83:H1 str. NRG 857C chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/387615344
Ecoli_P12b Escherichia coli P12b chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386703215
Ecoli_REL606 Escherichia coli B str. REL606 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/254160123
Ecoli_RM12579 Escherichia coli O55:H7 str. RM12579 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/387504934
Ecoli_S88 Escherichia coli S88 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/218556939
Ecoli_SE11 Escherichia coli SE11 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/209917191
Ecoli_SE15 Escherichia coli SE15, complete genome http://www.ncbi.nlm.nih.gov/nuccore/387828053
Ecoli_SMS35 Escherichia coli SMS-3-5 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/170679574
Ecoli_Sakai Escherichia coli O157:H7 str. Sakai chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/15829254
Ecoli_TW14359 Escherichia coli O157:H7 str. TW14359 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/254791136
Ecoli_UM146 Escherichia coli UM146 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386602643
Ecoli_UMN026 Escherichia coli UMN026 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/218703261
Ecoli_UMNK88 Escherichia coli UMNK88 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386612163
Ecoli_UTI89 Escherichia coli UTI89 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/91209055
Ecoli_W Escherichia coli W chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386707734
Ecoli_Xuzhou21 Escherichia coli Xuzhou21 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/387880559
Sboydii_CDC_3083_94 Shigella boydii CDC 3083-94 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/187730020
Sboydii_Sb227 Shigella boydii Sb227 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/82542618
Sdysenteriae_Sd197 Shigella dysenteriae Sd197, complete genome http://www.ncbi.nlm.nih.gov/nuccore/82775382
Sflexneri_2002017 Shigella flexneri 2002017 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/384541581
Sflexneri_2a_2457T Shigella flexneri 2a str. 2457T, complete genome http://www.ncbi.nlm.nih.gov/nuccore/30061571
Sflexneri_2a_301 Shigella flexneri 2a str. 301 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/344915202
Sflexneri_5_8401 Shigella flexneri 5 str. 8401 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/110804074
Ssonnei_53G Shigella sonnei 53G, complete genome http://www.ncbi.nlm.nih.gov/nuccore/377520096
Ssonnei_Ss046 Shigella sonnei Ss046 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/74310614

Yersinia Genomes

Name Description URL
Ypestis_A1122 Yersinia pestis A1122 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/384137007
Ypestis_Angola Yersinia pestis Angola chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/162418099
Ypestis_Antiqua Yersinia pestis Antiqua chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/108805998
Ypestis_CO92 Yersinia pestis CO92 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/16120353
Ypestis_D106004 Yersinia pestis D106004 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/384120592
Ypestis_D182038 Yersinia pestis D182038 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/384124469
Ypestis_KIM_10 Yersinia pestis KIM 10 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/22123922
Ypestis_Medievalis_Harbin_35 Yersinia pestis biovar Medievalis str. Harbin 35 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/384412706
Ypestis_Microtus_91001 Yersinia pestis biovar Microtus str. 91001 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/45439865
Ypestis_Nepal516 Yersinia pestis Nepal516 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/108810166
Ypestis_Pestoides_F Yersinia pestis Pestoides F chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/145597324
Ypestis_Z176003 Yersinia pestis Z176003 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/294502110
Ypseudotuberculosis_IP_31758 Yersinia pseudotuberculosis IP 31758 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/153946813
Ypseudotuberculosis_IP_32953 Yersinia pseudotuberculosis IP 32953 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/51594359
Ypseudotuberculosis_PB1 Yersinia pseudotuberculosis PB1/+ chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/186893344
Ypseudotuberculosis_YPIII Yersinia pseudotuberculosis YPIII chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/170022262

Francisella Genomes

Name Description URL
Fnovicida_U112 Francisella novicida U112 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/118496615
Ftularensis_holarctica_F92 Francisella tularensis subsp. holarctica F92 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/423049750
Ftularensis_holarctica_FSC200 Francisella tularensis subsp. holarctica FSC200 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/422937995
Ftularensis_holarctica_FTNF00200 Francisella tularensis subsp. holarctica FTNF002-00 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/156501369
Ftularensis_holarctica_LVS Francisella tularensis subsp. holarctica LVS chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/89255449
Ftularensis_holarctica_OSU18 Francisella tularensis subsp. holarctica OSU18 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/115313981
Ftularensis_mediasiatica_FSC147 Francisella tularensis subsp. mediasiatica FSC147 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/187930913
Ftularensis_TIGB03 Francisella tularensis TIGB03 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/379716390
Ftularensis_tularensis_FSC198 Francisella tularensis subsp. tularensis FSC198 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/110669657
Ftularensis_tularensis_NE061598 Francisella tularensis subsp. tularensis NE061598 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/385793751
Ftularensis_tularensis_SCHU_S4 Francisella tularensis subsp. tularensis SCHU S4 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/255961454
Ftularensis_tularensis_TI0902 Francisella tularensis subsp. tularensis TI0902 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/379725073
Ftularensis_tularensis_WY963418 Francisella tularensis subsp. tularensis WY96-3418 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/134301169

Brucella Genomes

Name Description URL
Babortus_1_9941 Brucella abortus bv. 1 str. 9-941 http://www.ncbi.nlm.nih.gov/bioproject/58019
Babortus_A13334 Brucella abortus A13334 http://www.ncbi.nlm.nih.gov/bioproject/83615
Babortus_S19 Brucella abortus S19 http://www.ncbi.nlm.nih.gov/bioproject/58873
Bcanis_ATCC_23365 Brucella canis ATCC 23365 http://www.ncbi.nlm.nih.gov/bioproject/59009
Bcanis_HSK_A52141 Brucella canis HSK A52141 http://www.ncbi.nlm.nih.gov/bioproject/83613
Bceti_TE10759_12 Brucella ceti TE10759-12 http://www.ncbi.nlm.nih.gov/bioproject/229880
Bceti_TE28753_12 Brucella ceti TE28753-12 http://www.ncbi.nlm.nih.gov/bioproject/229879
Bmelitensis_1_16M Brucella melitensis bv. 1 str. 16M http://www.ncbi.nlm.nih.gov/bioproject/200008
Bmelitensis_Abortus_2308 Brucella melitensis biovar Abortus 2308 http://www.ncbi.nlm.nih.gov/bioproject/16203
Bmelitensis_ATCC_23457 Brucella melitensis ATCC 23457 http://www.ncbi.nlm.nih.gov/bioproject/59241
Bmelitensis_M28 Brucella melitensis M28 http://www.ncbi.nlm.nih.gov/bioproject/158857
Bmelitensis_M590 Brucella melitensis M5-90 http://www.ncbi.nlm.nih.gov/bioproject/158855
Bmelitensis_NI Brucella melitensis NI http://www.ncbi.nlm.nih.gov/bioproject/158853
Bmicroti_CCM_4915 Brucella microti CCM 4915 http://www.ncbi.nlm.nih.gov/bioproject/59319
Bovis_ATCC_25840 Brucella ovis ATCC 25840 http://www.ncbi.nlm.nih.gov/bioproject/58113
Bpinnipedialis_B2_94 Brucella pinnipedialis B2/94 http://www.ncbi.nlm.nih.gov/bioproject/71133
Bsuis_1330 Brucella suis 1330 http://www.ncbi.nlm.nih.gov/bioproject/159871
Bsuis_ATCC_23445 Brucella suis ATCC 23445 http://www.ncbi.nlm.nih.gov/bioproject/59015
Bsuis_VBI22 Brucella suis VBI22 http://www.ncbi.nlm.nih.gov/bioproject/83617

Bacillus Genomes

Name Description URL
Banthracis_A0248 Bacillus anthracis str. A0248, complete genome http://www.ncbi.nlm.nih.gov/nuccore/229599883
Banthracis_Ames Bacillus anthracis str. Ames chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/30260195
Banthracis_Ames_Ancestor Bacillus anthracis str. ‘Ames Ancestor’ chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/50196905
Banthracis_CDC_684 Bacillus anthracis str. CDC 684 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/227812678
Banthracis_H9401 Bacillus anthracis str. H9401 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/386733873
Banthracis_Sterne Bacillus anthracis str. Sterne chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/49183039
Bcereus_03BB102 Bacillus cereus 03BB102, complete genome http://www.ncbi.nlm.nih.gov/nuccore/225862057
Bcereus_AH187 Bacillus cereus AH187 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/217957581
Bcereus_AH820 Bacillus cereus AH820 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/218901206
Bcereus_anthracis_CI Bacillus cereus biovar anthracis str. CI chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/301051741
Bcereus_ATCC_10987 Bacillus cereus ATCC 10987 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/42779081
Bcereus_ATCC_14579 Bacillus cereus ATCC 14579, complete genome http://www.ncbi.nlm.nih.gov/nuccore/30018278
Bcereus_B4264 Bacillus cereus B4264 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/218230750
Bcereus_E33L Bacillus cereus E33L chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/52140164
Bcereus_F837_76 Bacillus cereus F837/76 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/376264031
Bcereus_G9842 Bacillus cereus G9842 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/218895141
Bcereus_NC7401 Bacillus cereus NC7401, complete genome http://www.ncbi.nlm.nih.gov/nuccore/375282101
Bcereus_Q1 Bacillus cereus Q1 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/222093774
Bthuringiensis_AlHakam Bacillus thuringiensis str. Al Hakam chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/118475778
Bthuringiensis_BMB171 Bacillus thuringiensis BMB171 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/296500838
Bthuringiensis_Bt407 Bacillus thuringiensis Bt407 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/409187965
Bthuringiensis_chinensis_CT43 Bacillus thuringiensis serovar chinensis CT-43 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/384184088
Bthuringiensis_finitimus_YBT020 Bacillus thuringiensis serovar finitimus YBT-020 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/384177910
Bthuringiensis_konkukian_9727 Bacillus thuringiensis serovar konkukian str. 97-27 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/49476684
Bthuringiensis_MC28 Bacillus thuringiensis MC28 chromosome, complete genome http://www.ncbi.nlm.nih.gov/nuccore/407703236

Ebola Reference Genomes

Accession Description URL
NC_014372 Tai Forest ebolavirus isolate Tai Forest virus H.sapiens-tc/CIV/1994/Pauleoula-CI, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/NC_014372
FJ217162 Cote d’Ivoire ebolavirus, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/FJ217162
FJ968794 Sudan ebolavirus strain Boniface, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/FJ968794
NC_006432 Sudan ebolavirus isolate Sudan virus H.sapiens-tc/UGA/2000/Gulu-808892, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/NC_006432
KJ660348 Zaire ebolavirus isolate H.sapiens-wt/GIN/2014/Gueckedou-C05, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KJ660348
KJ660347 Zaire ebolavirus isolate H.sapiens-wt/GIN/2014/Gueckedou-C07, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KJ660347
KJ660346 Zaire ebolavirus isolate H.sapiens-wt/GIN/2014/Kissidougou-C15, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KJ660346
JN638998 Sudan ebolavirus - Nakisamata, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/JN638998
AY354458 Zaire ebolavirus strain Zaire 1995, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/AY354458
AY729654 Sudan ebolavirus strain Gulu, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/AY729654
EU338380 Sudan ebolavirus isolate EBOV-S-2004 from Sudan, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/EU338380
KM655246 Zaire ebolavirus isolate H.sapiens-tc/COD/1976/Yambuku-Ecran, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KM655246
KC242801 Zaire ebolavirus isolate EBOV/H.sapiens-tc/COD/1976/deRoover, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KC242801
KC242800 Zaire ebolavirus isolate EBOV/H.sapiens-tc/GAB/2002/Ilembe, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KC242800
KC242799 Zaire ebolavirus isolate EBOV/H.sapiens-tc/COD/1995/13709 Kikwit, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KC242799
KC242798 Zaire ebolavirus isolate EBOV/H.sapiens-tc/GAB/1996/1Ikot, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KC242798
KC242797 Zaire ebolavirus isolate EBOV/H.sapiens-tc/GAB/1996/1Oba, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KC242797
KC242796 Zaire ebolavirus isolate EBOV/H.sapiens-tc/COD/1995/13625 Kikwit, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KC242796
KC242795 Zaire ebolavirus isolate EBOV/H.sapiens-tc/GAB/1996/1Mbie, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KC242795
KC242794 Zaire ebolavirus isolate EBOV/H.sapiens-tc/GAB/1996/2Nza, complete genome. http://www.ncbi.nlm.nih.gov/nuccore/KC242794

Third Party Tools

Annotation

Warning

tbl2asn must have been compiled within the past year to function. We attempt to recompile it every 6 months or so. The most recent compilation was on 26 Feb 2015.

Alignment

Taxonomy Classification

Phylogeny

Visualization and Graphic User Interface

Utility

FAQs and Troubleshooting

FAQs

  • Can I speed up the process?

    You may increase the number of CPUs used from the “additional options” of the input section. The default and minimum value is one-eighth of the total number of server CPUs.

  • There is not enough disk space for storing project data. What should I do?

    There is an archive project action that moves the whole project directory to the directory path configured in $EDGE_HOME/sys.properties. We also recommend a symbolic link for the $EDGE_HOME/edge_ui/EDGE_input directory that points to the location where the user’s (or sequencing center’s) raw data are stored. This avoids unnecessary data transfer over the web and saves local storage.

  • How do I decide on the various QC parameters?

    The default parameters should be sufficient for most cases. However, if your sequencing data have very deep coverage, you may increase the trim quality level and the average quality cutoff so that only high-quality data are used.

  • How do I set the K-mer size for IDBA_UD assembly?

    By default, it starts at kmer=31 and iterates in steps of 20 up to a maximum of kmer=121. Larger K-mers have a higher rate of uniqueness in the genome and make the graph simpler, but they require deeper sequencing and longer reads to guarantee overlap at any genomic location, and they are much more sensitive to sequencing errors and heterozygosity. Professor Titus Brown has a good blog post with a general discussion of k-mer size.

  • How many reference genomes can be used for Reference-Based Analysis and Phylogenetic Analysis from the EDGE GUI?

    The default maximum is 20, and Phylogenetic Analysis requires a minimum of 3 genomes. The maximum can be configured when installing EDGE.
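
Two of the answers above reduce to simple arithmetic: the default CPU count and the default k-mer series. The sketch below works them out in shell. The function names default_cpus and kmer_series are hypothetical, and integer division is an assumed rounding rule. The text also does not say whether the assembler runs an extra pass at exactly the maximum k-mer size when the steps do not land on it, so the series here simply stops at the last value that does not exceed the maximum.

```shell
# One-eighth of the CPU count, never less than 1.
# Integer division is an assumption; EDGE's exact rounding is not documented here.
default_cpus() {
    local total=${1:-$(nproc)}
    local n=$(( total / 8 ))
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}

# K-mer sizes from a minimum up to (at most) a maximum in fixed steps.
# Defaults mirror the FAQ values: start 31, step 20, maximum 121.
kmer_series() {
    local k=${1:-31} step=${2:-20} max=${3:-121}
    local out=""
    while [ "$k" -le "$max" ]; do
        out="$out${out:+ }$k"
        k=$(( k + step ))
    done
    echo "$out"
}
```

With these definitions, default_cpus 64 prints 8 and kmer_series prints 31 51 71 91 111.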

Troubleshooting

  • In the GUI, if you are trying to enter information into a specific field and it is grayed out or won’t let you, try refreshing the page by clicking the icon in the top right of the browser window.
  • The process.log and error.log files may help with troubleshooting.

Coverage Issues

  • The Average Fold Coverage reported in the HTML output and in the output tables generated in {output directory}/AssemblyBasedAnalysis/readsMappingToContig/ is calculated with mpileup using the default options for metagenomes. These settings discount reads that are unpaired within a contig or whose insert size is outside the expected bounds. As a result, the reported average fold coverage is lower than a value computed directly from the generated BAM file, but the team feels it is more accurate given the intended use of this environment.
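
To compare the reported value with an unfiltered one, the mean depth per contig can be computed from a per-position depth table. This is a minimal sketch, not EDGE's own calculation: it assumes a tab-separated table of contig, position, and depth (the layout that samtools depth prints), and the function name mean_depth_per_contig is hypothetical. Because the read filtering differs, the numbers are expected to be higher than the mpileup-based figures EDGE reports.

```shell
# Mean depth per contig from a three-column table: contig <TAB> position <TAB> depth.
# Positions absent from the table are not counted, so include zero-depth rows
# if the average should cover the full contig length.
mean_depth_per_contig() {
    awk -F '\t' '
        { sum[$1] += $3; n[$1]++ }
        END { for (c in sum) printf "%s\t%.2f\n", c, sum[c] / n[c] }
    ' "$1" | sort
}
```

The output is one line per contig, sorted by contig name, with the mean rounded to two decimals.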

Data Migration

  • The preferred method of transferring data to the EDGE appliance is via SFTP. Using an SFTP client such as FileZilla, connect to port 22 using your system’s username and password.
  • In the case of very large transfers, you may wish to use a USB hard drive or thumb drive.
  • If the data is being transferred from another Linux machine, the server will recognize partitions that use the FAT, ext2, ext3, or ext4 filesystems.
  • If the data is being transferred from a Windows machine, the partition may use the NTFS filesystem. If this is the case, the drive will not be recognized until you follow these instructions:
    • Open the command line interface by clicking the Applications menu in the top left corner (or use SSH to connect to the system).
    • Enter the command: sudo yum install ntfs-3g ntfs-3g-devel -y
    • Enter your password if required.
  • After a reboot, you should be able to connect your Windows hard drive to the system, and it will mount like a normal disk.

Discussions / Bugs Reporting

  • We have created a mailing list for EDGE users. If you would like to receive notifications about updates and join the discussion, please join the mailing list by becoming a member of the edge-users group.

  • We appreciate any feedback or concerns you may have about EDGE. If you encounter any bugs, you can report them to our GitHub issue tracker.

  • Any other questions? You are welcome to Contact Us

Contact Us

Questions? Concerns? Please feel free to email our Google group at edge-users@googlegroups.com or contact one of the dev team members listed below.

Name Email
Patrick Chain pchain@lanl.gov
Chien-Chi Lo chienchi@lanl.gov
Paul Li po-e@lanl.gov
Karen Davenport kwdavenport@lanl.gov
Joe Anderson joseph.j.anderson2.civ@mail.mil
Kim Bishop-Lilly kimberly.a.bishop-lilly.ctr@mail.mil

Citation

Enabling the democratization of the genomics revolution with a fully integrated web-based bioinformatics platform

Po-E Li; Chien-Chi Lo; Joseph J. Anderson; Karen W. Davenport; Kimberly A. Bishop-Lilly; Yan Xu; Sanaa Ahmed; Shihai Feng; Vishwesh P. Mokashi; Patrick S.G. Chain

Nucleic Acids Research 2016;

doi: 10.1093/nar/gkw1027