MGLEX documentation

Welcome to the MGLEX documentation! MGLEX (MetaGenome Likelihood Extractor) is a probabilistic model implementation in Python 3 to extract genomes from metagenome assemblies using various features.

Note: This documentation is a stub; it will be extended with upcoming versions of MGLEX.

Contents

Installing MGLEX

Dependencies

MGLEX is a Python 3 package; it does not run with Python 2. It depends on

  • NumPy (for data types and math operations)
  • SciPy (for a few math functions)
  • docopt (for command line parsing)

Installation

We show how to install MGLEX under Debian and Ubuntu; installation on other platforms is similar.

You can simply install the requirements as system packages.

sudo apt install python3 python3-numpy python3-scipy

We recommend creating a Python virtual environment for MGLEX. To do so, install the venv package for your Python version (e.g. the Debian package python3.4-venv) if it is not already included, or use virtualenv. The following commands make use of the installed system packages.

python3 -m venv --system-site-packages my-mglex-environment
source my-mglex-environment/bin/activate

MGLEX is available on the Python Package Index, and we recommend installing it via pip.

python3 -m pip install mglex

Quickstart guide

This guide gives a rough overview of how MGLEX handles data and how its operations are called.

Feature files

For each submodel in MGLEX, a feature file must be provided. These are simple text files where each line corresponds to a contig. Thus, all input files must have the same length and ordering. Features are the data which represent the contigs. They must be derived from the nucleotide sequences; the sequences themselves are not directly needed when working with MGLEX.

Contig assignment matrices

Contig assignments to genomes are both the input to the training step and the output of the classification step. These assignments are represented as a weight or probability matrix in TAB-separated format. Each line corresponds to a contig, as for the feature files, and each column represents a genome or genome bin. All values in this matrix represent weights between zero and one (e.g. likelihoods) but are written and parsed using the negative natural logarithm. Using weights, contigs can be partially assigned to multiple bins, both for model training and in the output. We also refer to such a matrix as a responsibility matrix.
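
For illustration, a (hypothetical) matrix for two contigs and two genome bins could look as follows, where 0.6931 encodes a weight of 0.5 (-ln 0.5), and 0.1054 and 2.3026 encode weights of 0.9 and 0.1, respectively; a weight of one would be written as 0.0.

0.6931	0.6931
0.1054	2.3026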

Data handling

Most data files used with MGLEX, in particular the assignment matrices, compress very well. Therefore, we recommend compressing and decompressing these files on the fly using, for instance, gzip. This reduces the IO and storage requirements.
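
For example, assuming an uncompressed matrix file contigs.mat, you can compress it in place and then inspect or stream it without creating a temporary file:

> gzip contigs.mat
> zcat contigs.mat.gz | head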

MGLEX classification

The typical workflow when using MGLEX as a classifier is to

  1. get all features in the right text format
  2. build an assignment matrix for training
  3. train and save a model to a file
  4. classify the same or different contigs using this model

MGLEX command line interface

After installation, you can call the command line interface

> mglex-cli --help

This is the mglex command line interface which executes the sub-commands.

Usage:
 mglex-cli [--version] [--help] <command> [<args>...]

 -h, --help         Show this screen
 -v, --version      Show version

 Here are the commands to run:
  train             Train a model (set maximum-likelihood parameters)
  classify          Calculate likelihood of data under a model
  buildmatrix       Construct a responsibility matrix for grouped data
  evaluate          Evaluate classifications using a reference (true) responsibility matrix
  significance      Give a null model log-likelihood distribution and calculate p-values for unseen data
  bincompare        Compare bins by likelihood values

 See 'mglex-cli <command> --help' for more information on a specific command.

Formatting the data

Please refer to the description in the data section to format the feature files. We assume that you have the following files in the correct format. Usually, these will be

  • a file with contig names, one per line: “contigs.seqname”
  • a file with contig lengths, one per line: “contigs.seqlen”
  • one file for each submodel, one contig per line

It is important that the n-th line in all files corresponds to the same contig. If features are missing for a contig, this is reflected by an empty line.
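
A quick way to verify this correspondence is to check that all files have the same number of lines; here with the filenames from above and a k-mer count file contigs.kmc as an example feature file:

> wc -l contigs.seqname contigs.seqlen contigs.kmc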

Building a responsibility matrix

For each genome we would like to model, we require training data. The canonical way to tell the training command which contigs to use for which genome is to pass a responsibility matrix. In theory, a contig could thereby be used for more than one genome in the training step, e.g. when modeling similar strains. However, this matrix will usually assign each contig to either zero or one genome. MGLEX provides a simple command to construct such a responsibility matrix from a text file which lists the contigs believed to belong to each genome. In this text file, here called seeds, each line lists the contig names for one genome, separated by spaces. Thus, when you want to model 20 genomes, the seeds file must contain 20 lines (see the example below). In the following command, we give MGLEX the contig names file (see above) via standard input and let it construct the responsibility matrix from the seeds file. Because the matrix is very sparse, we compress it on the fly.

> mglex-cli buildmatrix --seeds contigs.seeds < contigs.seqname | gzip > contigs.training.mat.gz
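
The seeds file itself, here for two genomes with hypothetical contig names, could look like this:

> cat contigs.seeds
contig_001 contig_017 contig_023
contig_005 contig_042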

Training a model

We now use the features and the responsibility matrix to create a model. We also need to provide the contig lengths so that MGLEX can weight the contigs according to their size. We decompress the responsibility file and pass it via standard input. Suppose we want to construct a model based on the k-mer counts in the file contigs.kmc.

> zcat contigs.training.mat.gz | mglex-cli train --weight contigs.seqlen --composition contigs.kmc --outmodel contigs.model

Depending on the model type and data, this might take a while because the features first need to be parsed and then the model parameters must be calculated.

Classifying contigs

Now suppose we have seeded the model with some contigs but want to classify all the others: we can call the classify command to (re-)classify the contigs. Although the seeding contigs were used to construct the model, they can still end up being assigned to a genome other than the original one. The output is again a responsibility matrix, but this time it will be less sparse. It contains a normalized likelihood for each contig and each genome. As this file can be large, we again compress it using gzip.

> mglex-cli classify --model contigs.model --composition contigs.kmc | gzip > contigs.classify.mat.gz

The output matrix can be inspected using zless, for instance.

> zless -S contigs.classify.mat.gz

If you want to find the most likely genome for a contig, simply find the lowest score in the corresponding row (because the entries are negative log-likelihoods).
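
As a sketch, the following pipeline prints each contig name together with the (1-based) column index of its lowest score; treating possible inf entries as a very large number is an assumption about the matrix content and may need adjusting for your awk implementation.

> zcat contigs.classify.mat.gz |
  awk 'BEGIN{FS="\t"}
  {best = 1; bv = ($1 == "inf") ? 1e308 : $1 + 0
   for(i = 2; i <= NF; i++){v = ($i == "inf") ? 1e308 : $i + 0
    if(v < bv){best = i; bv = v}}
   print best}' |
  paste contigs.seqname - > contigs.bestbin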

Advanced usage

This quickstart guide shows the simplest way to use MGLEX. There are more commands, and some of them are described in the paper. For instance, to get meaningful soft assignments, it is best to re-normalize the output by fixing the beta parameter.

MGLEX feature files

Each submodel in MGLEX defines its own data. The convention for feature files is to use human-readable text with one line per contig and spaces as the major delimiter. These files are usually easy to produce with other programs and with the help of command line tools like awk, sed, cut, tr etc.

Read coverage

For each line, the first number gives the average read coverage in the first sample, the second in the second sample, etc. These numbers are separated by spaces, so each line should have the same number of entries, with a zero where no read could be mapped. Typically, read coverage is calculated by aligning reads to contigs using a read mapper like BWA or Bowtie2. Then, we can parse the resulting BAM files, for instance using BEDtools, to extract the number of mapped reads at each position. Here is an example using Bowtie2, samtools and BEDtools.

First, map the reads (do this for each sample and use different output files).

> bowtie2-build contigs.fna contigs.bowtie2
> bowtie2 -x contigs.bowtie2 -1 forward.fq.gz -2 reverse.fq.gz |
  samtools view -@ 5 -b - | samtools sort -@ 5 -o out.sorted.bam -

Then get the read count for each contig position and average it per contig.

> genomeCoverageBed -ibam out.sorted.bam -d -split |
  awk 'BEGIN{FS=OFS="\t"}
  {if($1 == last){s += $3; c += 1}
   else{if(last != ""){print last, s/c}; s = $3; c = 1; last = $1}}
  END{if(last != ""){print last, s/c}}' > out.twocol.cov

The awk code averages the per-position counts and makes sure that positions with zero reads are included. Afterwards, you must still fill in missing contigs with zero counts, which are not reported here, and make sure that this is done in the right order. Finally, you can merge the samples using paste.
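
As a sketch for two samples, assuming files out1.twocol.cov and out2.twocol.cov (produced as above) that already list all contigs in the same order as contigs.seqname, the coverage columns can be merged into a space-separated feature file like this:

> paste out1.twocol.cov out2.twocol.cov | cut -f 2,4 | tr '\t' ' ' > contigs.cov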

The read coverage feature file works for both the absolute and the relative count submodels.

Nucleotide composition

Again, each line lists the feature counts (k-mers) separated by spaces. Each line should have the same number of entries. We can use any k-mer counting program. Here is an example for fasta2kmersS using 5-mers.

> zcat contigs.fna.gz |
  fasta2kmersS -i <(cat) -f >(cat) -j 5 -k 5 -s 0 -h 0 -n 0 |
  tr '\t' ' ' > contigs.kmc

Taxonomic annotation

The tree-shaped annotation can be generated by any taxonomic assignment program. On each line, taxonomic paths are specified with a corresponding weight, for instance an alignment score, the number of matching positions etc. The model will cope with incorrect annotation, so it is beneficial to generate annotation down to the highest-scoring reference taxon, up to the species level. A path consists of strings separated by dots, followed by a colon and the numeric weight, for example 1.2.3:100 or Bacteria.Proteobacteria:5. It makes sense to shorten the strings to something numerical in the feature file to save space on disk and in memory when running MGLEX, although this is not required.
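
For example, a feature line for one contig with two weighted paths (hypothetical taxon names and weights) might read

Bacteria.Proteobacteria:100 Bacteria.Firmicutes:20

and the same line with shortened numeric labels might read

1.2:100 1.5:20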

Contact

Feel free to contact the author if you have trouble using this package.
