itero: guided contig assembly for target enrichment data¶
Release v1.1. (Changelog)
Author: Brant C. Faircloth
Date: 10 October 2018 18:46 UTC (+0000)
Copyright: This documentation is available under a Creative Commons (CC-BY) license.
itero is a “guided-assembly” workflow for target enrichment data that uses bwa, samtools, bedtools, and spades to provide high-quality assemblies of raw reads from Illumina instruments.
Contributions¶
itero is open-source (see License) and we welcome contributions from anyone who is interested. Please make a pull request on github. The issue tracker for itero is also available on github.
Issues¶
If you have an issue, please ensure that you are experiencing this issue on a supported OS (see Installation) using the conda installation of itero. If possible, please submit a test case demonstrating the issue and indicate which platform, git checkout, and itero version you are using.
Guide¶
Purpose¶
itero is a program that wraps a workflow for guided- or reference-based assembly of target enrichment data. This approach to assembly is also “iterative”, meaning that the assembly proceeds through several iterations (hence “itero”). I wrote itero for a variety of reasons:
- “traditional” DNA assembly programs performed poorly with target enrichment data (from UCE loci)
- existing DNA assembly approaches had relatively high assembly error
- existing guided assembly programs were hard to install and run
- some existing guided assembly programs were slow
itero attempts to fix some of these problems. At its heart, itero uses an input file of “seeds”, against which it will try to assemble raw-read data from Illumina instruments. Alignment of reads-to-seeds uses bwa, the BAM file is split with samtools and bedtools, and locus-specific reads are then assembled using spades (with error correction turned on during the final round). Then, the entire process repeats itself.
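The control flow described above can be sketched roughly as follows. This is an illustration only, not itero's code: `iterate_assembly`, `toy_align`, and `toy_assemble` are hypothetical names, the toy functions stand in for bwa, samtools/bedtools, and spades, and the idea that contigs from one round serve as the targets for the next round is an assumption about what "repeats itself" means.

```python
from collections import defaultdict


def iterate_assembly(seeds, reads, iterations, align, assemble):
    """Run `iterations` rounds of align -> split by locus -> assemble.

    `align` and `assemble` are caller-supplied stand-ins for the external
    tools; they are hypothetical hooks, not itero functions.
    """
    contigs = dict(seeds)  # round 1 aligns against the original seeds
    for round_number in range(1, iterations + 1):
        final_round = round_number == iterations
        # Stand-in for the bwa step: yields (locus, read) pairs.
        hits = align(contigs, reads)
        # Stand-in for the samtools/bedtools split: group reads by locus.
        reads_by_locus = defaultdict(list)
        for locus, read in hits:
            reads_by_locus[locus].append(read)
        # Stand-in for spades; error correction only in the final round.
        for locus, locus_reads in reads_by_locus.items():
            contigs[locus] = assemble(locus_reads, error_correction=final_round)
    return contigs


# Toy stand-ins so the sketch runs without any external programs.
def toy_align(contigs, reads):
    """Keep reads whose (pre-labelled) locus is among the current targets."""
    return [(locus, seq) for locus, seq in reads if locus in contigs]


correction_flags = []


def toy_assemble(locus_reads, error_correction):
    """Record the error-correction flag and return the longest read."""
    correction_flags.append(error_correction)
    return max(locus_reads, key=len)


toy_seeds = {"locus-1": "ACGT", "locus-2": "TTGA"}
toy_reads = [("locus-1", "ACGTAC"), ("locus-1", "ACG"), ("locus-2", "TTGACC")]
toy_contigs = iterate_assembly(toy_seeds, toy_reads, 3, toy_align, toy_assemble)
```

The only point of the sketch is the shape of the loop: every round aligns, splits, and assembles per locus, and the error-correction flag is switched on only for the last round.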
To increase assembly speed, itero uses Python multiprocessing to take advantage of multiple cores on a single node and MPI on HPC systems, using the excellent schwimmbad library.
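One way to picture this design is code written against a generic "map" interface, so that the same per-locus function can run serially, across local cores, or across MPI ranks. The sketch below uses only the standard library; `assemble_locus`, `assemble_all`, and `run_on_local_cores` are hypothetical names and the per-locus work is a placeholder. As far as I know, schwimmbad pools expose a compatible `map` method, but how itero actually wires them in is not shown here.

```python
import multiprocessing


def assemble_locus(locus):
    """Placeholder for one locus-specific assembly (hypothetical)."""
    name, reads = locus
    return name, len(reads)


def assemble_all(loci, map_function=map):
    """Apply assemble_locus to every locus using any map-like callable."""
    return dict(map_function(assemble_locus, loci))


def run_on_local_cores(loci, cores):
    """Same work, distributed over `cores` worker processes on one node."""
    with multiprocessing.Pool(processes=cores) as pool:
        return assemble_all(loci, pool.map)


toy_loci = [("locus-1", ["ACGT", "CGTA"]), ("locus-2", ["TTGA"])]
serial_result = assemble_all(toy_loci)  # built-in map, no parallelism
```

In a script, `run_on_local_cores(toy_loci, 2)` should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Swapping `pool.map` for the `map` method of an MPI-backed pool is what would move the same function from one node to many.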
Who wrote this?¶
This documentation was written primarily by Brant Faircloth (http://faircloth-lab.org). Brant is also responsible for the development of most of the itero code. Bugs within the code are usually his.
How do I report bugs?¶
To report a bug, please post an issue to https://github.com/faircloth-lab/itero/issues. Please also ensure that you are using one of the “supported” platforms:
- Apple OSX 10.9.x
- CentOS 7.x
and that you have installed itero and dependencies using conda, as described in the Installation section.
Installation¶
itero uses a number of Python tools that allow it to assemble raw reads into contigs. itero also wraps a number of third-party programs. These include:
3rd-party programs¶
To ensure that these dependencies are easy to install, we have created a conda package for itero that is available as part of bioconda. This is the easiest way to get itero up and running on your system. itero can also be run outside of conda, and we include some installation suggestions for these types of systems, below. However, because many HPC systems are configured differently, we cannot provide extensive support for itero on HPC platforms.
Note
We build and test the binaries available through conda using 64-bit operating systems that include:
- Apple OSX 10.9.x
- CentOS 7.x
Why conda?¶
It may seem odd to impose a particular distribution on users, and we largely agree. However, conda makes it very easy for us to distribute both Python and non-Python packages, set up identical environments across very heterogeneous platforms (linux, osx), make sure all the $PATHs are correct, and have things run largely as expected. Using conda has several other benefits, including environment separation similar to virtualenv.
In short, using conda gets us as close to a “one-click” install as we will probably ever get.
Install Process (using conda/bioconda)¶
Attention
We do not support itero on Windows.
Note
We build and test the binaries available through conda using 64-bit operating systems that include:
- Apple OSX 10.9.x
- CentOS 7.x
The installation process has three steps. You need to:
- install anaconda or miniconda
- configure conda to use the bioconda channel
- install itero
Installing itero using conda will install all of the required binaries, libraries, and Python dependencies.
Install Anaconda or miniconda¶
You first need to install anaconda or miniconda. Which one you choose is up to you, your needs, how much disk space you have, and if you are on a fast/slow connection.
Attention
You can easily install anaconda or miniconda in your $HOME, although you should be aware that this setup can cause problems in some HPC setups.
Tip
Do I want anaconda or miniconda?
The major difference between the two python distributions is that anaconda comes with many, many packages pre-installed, while miniconda comes with almost zero packages pre-installed. As such, the beginning anaconda distribution is roughly 500 MB in size while the beginning miniconda distribution is 15-30 MB in size.
anaconda¶
Follow the instructions here for your platform: http://docs.continuum.io/anaconda/install.html
miniconda¶
Find the correct miniconda-x.x.x file for your platform from http://repo.continuum.io/miniconda/ and download that file. Be sure you do not get one of the packages that has a name starting with miniconda3-. When that has completed, run one of the following:
bash Miniconda-x.x.x-Linux-x86_64.sh [linux]
bash Miniconda-x.x.x-MacOSX-x86_64.sh [osx]
Note
Once you have installed Miniconda, we will refer to it as anaconda throughout the remainder of this documentation.
Checking your $PATH¶
Regardless of whether you install anaconda or miniconda, you need to check that you’ve installed the package correctly. To ensure that the correct location for anaconda or miniconda is added to your $PATH (this occurs automatically in the bash shell), run the following:
$ python -V
The output should look similar to (x will be replaced by a version):
Python 2.7.x :: Anaconda x.x.x (x86_64)
Notice that the output shows we’re using the Anaconda x.x.x version of Python. If you do not see the expected output (or something similar), then you likely need to edit your $PATH variable to add anaconda or miniconda.
The easiest way to edit your path, if needed, is to open ~/.bashrc with a text editor (if you are using ZSH, this will be ~/.zshrc) and add, as the last line:
export PATH=$HOME/path/to/conda/bin:$PATH
where $HOME/path/to/conda/bin is the location of anaconda/miniconda on your system (usually $HOME/anaconda/bin or $HOME/miniconda/bin).
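The reason the conda directory is placed at the front of $PATH is that the shell uses the first matching executable it finds. A minimal sketch of that lookup, using a hypothetical helper named `first_python_dir` (not part of itero or conda), looks like this:

```python
import os


def first_python_dir(path_value, exists=os.path.exists):
    """Return the first PATH directory that contains a `python` file.

    `exists` can be swapped out for testing; by default the real
    filesystem is checked.
    """
    for directory in path_value.split(os.pathsep):
        if directory and exists(os.path.join(directory, "python")):
            return directory
    return None


# Directory that would supply `python` in the current environment
# (None if no PATH entry contains a file named `python`).
active_python_dir = first_python_dir(os.environ.get("PATH", ""))
```

If `active_python_dir` is not your anaconda or miniconda `bin` directory, another Python installation is taking precedence and the `export PATH=...` line above is probably missing or comes too early in your shell configuration.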
Configure Bioconda¶
Once you have installed anaconda (or miniconda), you need to configure conda to use the bioconda channel. More information on this process can be found on the bioconda website, but the gist of the process is that you need to run:
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
Install Process (Alternative / HPC)¶
On some systems (particularly HPC systems), conda can cause problems. You can install itero the “old” way by downloading the package tarball (https://github.com/faircloth-lab/itero/releases) and running:
python setup.py install
in the main directory. This should install all of the Python dependencies, but you still need to install and configure the 3rd-party dependencies.
Attention
You will need to install 3rd-party dependencies on your own if you are using the python setup.py install method of installing itero.
You can build and install these dependencies where you like. To configure itero to use the dependencies you have built and installed, you need to create a $HOME/.itero.conf that gives the paths to each program and looks like:
[executables]
bedtools:/path/to/bin/bedtools
bwa:/path/to/bin/bwa
gawk:/path/to/bin/gawk
samtools:/path/to/bin/samtools
spades:/path/to/bin/spades.py
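Before running itero with a hand-built configuration, it can help to confirm that every listed path points at an executable file. The following is a small sketch of such a check using Python's standard `configparser`; `read_executables` and `missing_executables` are hypothetical helper names, and whether itero itself reads the file this way is an assumption.

```python
import configparser
import os

EXAMPLE_CONF = """\
[executables]
bedtools:/path/to/bin/bedtools
bwa:/path/to/bin/bwa
gawk:/path/to/bin/gawk
samtools:/path/to/bin/samtools
spades:/path/to/bin/spades.py
"""


def read_executables(text):
    """Return {program: path} from the [executables] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser.items("executables"))


def is_executable_file(path):
    """True if `path` is an existing file with the execute bit set."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def missing_executables(executables, is_executable=is_executable_file):
    """Return the sorted names of programs whose path is not executable."""
    return sorted(
        name for name, path in executables.items() if not is_executable(path)
    )


executables = read_executables(EXAMPLE_CONF)
```

To check a real file, read `$HOME/.itero.conf` into a string, pass it to `read_executables`, and print the result of `missing_executables`; an empty list means every path resolved to an executable file.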
Running itero¶
itero has both a local mode and an MPI mode. The local mode is for execution on a single node, while the MPI mode executes individual locus assemblies in parallel using an MPI-enabled HPC cluster. To run the program, you must first create a configuration file denoting the samples you wish to assemble. That file has the following format:
[reference]
/path/to/the/locus/seeds.fasta
[individuals]
taxon-one:/path/to/fastq/R1/and/R2/files/for/taxon/1/
taxon-two:/path/to/fastq/R1/and/R2/files/for/taxon/2/
taxon-three:/path/to/fastq/R1/and/R2/files/for/taxon/3/
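A quick way to sanity-check a configuration file in this format is to parse it before starting a long run. The sketch below uses the standard `configparser` with `allow_no_value=True` so that the bare path under `[reference]` is accepted; `parse_assembly_config` is a hypothetical helper, not an itero function, and itero's own parsing may differ.

```python
import configparser

EXAMPLE_CONFIG = """\
[reference]
/path/to/the/locus/seeds.fasta
[individuals]
taxon-one:/path/to/fastq/R1/and/R2/files/for/taxon/1/
taxon-two:/path/to/fastq/R1/and/R2/files/for/taxon/2/
taxon-three:/path/to/fastq/R1/and/R2/files/for/taxon/3/
"""


def parse_assembly_config(text):
    """Return (seed_fasta_path, {taxon: fastq_directory})."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.optionxform = str  # keep the case of paths and taxon names
    parser.read_string(text)
    references = parser.options("reference")
    if len(references) != 1:
        raise ValueError("expected exactly one path under [reference]")
    individuals = dict(parser.items("individuals"))
    if not individuals:
        raise ValueError("no entries found under [individuals]")
    return references[0], individuals


seed_path, individuals = parse_assembly_config(EXAMPLE_CONFIG)
```

One limitation of this sketch: because `configparser` splits on the first `:` or `=`, a seed path that contains either character would be misread.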
itero on a single node¶
You then run the local version using a command similar to:
itero assemble local --config ndna-test.conf \
--output local \
--local-cores 16 \
--iterations 6
This will run itero on a single node and will first use 16 cores to perform bwa alignments. The code will then distribute locus-specific assemblies across all cores on the node (1 assembly per core; 16 in parallel).
itero across MPI nodes¶
You run the MPI version using a command similar to:
mpirun -hostfile hostfile -n 96 itero assemble mpi --config ndna-test.conf \
--output mpi \
--local-cores 16 \
--iterations 6
If each of your nodes has 16 cores, this will first use 16 cores for the needed bwa alignments of reads to seeds. The code will then distribute locus-specific assemblies across all 96 cores in your cluster (1 assembly per core; 96 in parallel).
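If you launch itero from a script, both invocations can be assembled from the same parameters. The sketch below builds the argument list using only the options shown in the commands above; `build_itero_command` is a hypothetical helper, and the `mpirun` options are simply copied from the example, so they may need adjusting for your MPI installation.

```python
def build_itero_command(mode, config, output, local_cores, iterations,
                        mpi_processes=None, hostfile=None):
    """Return the argument list for `itero assemble local` or `... mpi`."""
    if mode not in ("local", "mpi"):
        raise ValueError("mode must be 'local' or 'mpi'")
    command = [
        "itero", "assemble", mode,
        "--config", config,
        "--output", output,
        "--local-cores", str(local_cores),
        "--iterations", str(iterations),
    ]
    if mode == "mpi":
        if mpi_processes is None or hostfile is None:
            raise ValueError("mpi mode needs mpi_processes and hostfile")
        # Prefix with the launcher, as in the MPI example above.
        command = ["mpirun", "-hostfile", hostfile,
                   "-n", str(mpi_processes)] + command
    return command


local_command = build_itero_command("local", "ndna-test.conf", "local", 16, 6)
mpi_command = build_itero_command("mpi", "ndna-test.conf", "mpi", 16, 6,
                                  mpi_processes=96, hostfile="hostfile")
```

Either list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a single string avoids shell quoting problems when paths contain spaces.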
Project info¶
License¶
Documentation¶
The documentation for itero is available under a CC-BY (2.0) license. This license gives you permission to copy, distribute, and transmit the work as well as to adapt the work or use this work for commercial purposes, under the condition that you must attribute the work to the author(s).
If you use this documentation or the itero software for your own research, please cite both the software and (Faircloth et al. 2012). See the Citing section for more detail.
Software¶
Copyright (c) 2018, Brant C. Faircloth All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the University of California, Los Angeles nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Changelog¶
v1.0.x (April 2018)¶
- initial version with MPI and multiprocessing capability
v1.1.0 (May 2018)¶
- fix error in contig checking code that could cause MPI operations to hang
- refactor BAM splitting code for hopefully faster operation
- add RAM limits on spades
- add configuration parameters to itero.conf for spades
- create unique log file for each run
v1.1.1 (June 2018)¶
- fix an error where too many fastq files would cause MPI to hang
Funding¶
Primary Sources¶
The National Science Foundation (NSF) has supported a large portion of our work. The specific programs and proposal identifiers are below:
- NSF DEB-1655624
- NSF IOS-1754417