https://anaconda.org/bioconda/metagenome-atlas/badges/version.svg https://img.shields.io/conda/dn/bioconda/metagenome-atlas.svg?label=Bioconda https://img.shields.io/twitter/follow/SilasKieser.svg?style=social&label=Follow

Metagenome-Atlas

Metagenome-atlas logo

Metagenome-Atlas is an easy-to-use metagenomic pipeline based on Snakemake. It handles all steps from QC, assembly, and binning to annotation.

You can start using atlas with three commands:

mamba install -c bioconda -c conda-forge metagenome-atlas={latest_version}
atlas init --db-dir databases path/to/fastq/files
atlas run

where {latest_version} should be replaced by the current version shown on the Bioconda badge: https://anaconda.org/bioconda/metagenome-atlas/badges/version.svg

Publication

ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data. Kieser, S., Brown, J., Zdobnov, E. M., Trajkovski, M. & McCue, L. A. BMC Bioinformatics 21, 257 (2020). doi: 10.1186/s12859-020-03585-4

Getting Started

Setup

Conda package manager

Atlas has one dependency: conda. All databases and other dependencies are installed on the fly. Atlas is based on Snakemake, which allows steps of the workflow to be run in parallel on a cluster.

If you want to try atlas and have a Linux computer (OSX may also work), you can use our example data for testing.

For real metagenomic data, atlas should be run on a Linux system with enough memory (minimum ~50 GB, but assembly usually requires 250 GB).

You need to install anaconda or miniconda. If you haven’t done so already, you need to configure conda with the bioconda and conda-forge channels. These are sources for packages beyond the default one. Setting a strict channel priority can prevent quite some annoyances.

The order is important by the way.
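For reference, the channel setup from the Bioconda documentation looks like this (run the commands in exactly this order):

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict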

Install mamba

Conda can be a bit slow because there are so many packages. A good way around this is to use mamba (another snake):

conda install mamba

From now on you can replace conda install with mamba install and see how much faster this snake is.

Install metagenome-atlas

We recommend installing metagenome-atlas into a conda environment, e.g. named atlasenv. We also recommend specifying the latest version of metagenome-atlas.

mamba create -y -n atlasenv metagenome-atlas={latest_version}
source activate atlasenv

where {latest_version} should be replaced by the current version shown on the Bioconda badge: https://anaconda.org/bioconda/metagenome-atlas/badges/version.svg

Install metagenome-atlas from GitHub

Alternatively, you can install metagenome-atlas directly from GitHub. This allows you to access versions that are not yet released on conda, e.g. versions that are still in development.

git clone https://github.com/metagenome-atlas/atlas.git
cd atlas

# optional change to different branch
# git checkout branchname

# create dependencies for atlas
mamba env create -n atlas-dev --file atlasenv.yml
conda activate atlas-dev

# install atlas in editable mode. Changes in these files are directly available in the atlas dev version
pip install --editable .
cd ..

Example Data

If you want to test atlas on small example data, here is a two-sample, three-genome minimal metagenome dataset. Even though atlas runs faster on the test data, it still downloads all the databases and requirements needed for a complete run, which can take a certain amount of time and, especially, disk space (>100 GB).

The database dir of the test run should be the same as for the later atlas executions.

The example data can be downloaded as follows:

wget https://zenodo.org/record/3992790/files/test_reads.tar.gz
tar -xzf test_reads.tar.gz

Usage

Start a new project

Let’s apply atlas on your data or on our example data:

atlas init --db-dir databases path/to/fastq_files

This command parses the folder for fastq files (extension .fastq(.gz) or .fq(.gz), gzipped or not). The fastq files can be arranged in subfolders, in which case the subfolder name is used as the sample name. If you have paired-end reads, the files are usually distinguishable by _R1/_R2 or simply _1/_2 in the file names. Atlas searches for these patterns and lists the paired-end files for each sample.
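For illustration (folder and file names are hypothetical), a layout like the following would be parsed into two paired-end samples, sample1 and sample2:

path/to/fastq_files/
  sample1/
    sample1_R1.fastq.gz
    sample1_R2.fastq.gz
  sample2/
    sample2_R1.fastq.gz
    sample2_R2.fastq.gz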

The command creates a samples.tsv and a config.yaml in the working directory.

Have a look at them with a normal text editor and check that the sample names were inferred correctly. The sample names are used for naming contigs, genes, and genomes. Therefore, they should consist only of digits and letters and start with a letter (even though one - is allowed). Atlas tries to simplify the file names to obtain unique sample names; if it doesn’t succeed, it simply uses S1, S2, … as sample names.

See the example sample table

The BinGroup parameter is used during the genomic binning. In short: if you have between 5 and 150 samples, the default (putting everything in one group) is fine. If you have fewer than 5 samples, put every sample in an individual BinGroup and use metabat as the final binner. If you have more samples, see the co-binning section for more details.
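For illustration only (the exact column names are written by atlas init and may differ between atlas versions, so check the generated file), a samples.tsv for two paired-end samples in a single BinGroup looks roughly like this:

          Reads_raw_R1                  Reads_raw_R2                  BinGroup
sample1   path/to/sample1_R1.fastq.gz   path/to/sample1_R2.fastq.gz   All
sample2   path/to/sample2_R1.fastq.gz   path/to/sample2_R2.fastq.gz   All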

Note

If you want to use long reads for a hybrid assembly, you can also specify them in the sample table.

You should also check the config.yaml file, especially:

  • You may want to add host genomes to be removed.
  • You may want to change the resources configuration, depending on the system you run atlas on.

Details about the parameters can be found in the section Configuration

Keep in mind that all databases are installed in the directory specified with --db-dir so choose it wisely.

Usage: atlas init [OPTIONS] PATH_TO_FASTQ

  Write the file CONFIG and complete the sample names and paths for all
  FASTQ files in PATH.

  PATH is traversed recursively and adds any file with '.fastq' or '.fq' in
  the file name with the file name minus extension as the sample ID.

Options:
  -d, --db-dir PATH               location to store databases (need ~50GB)
                                  [default: /Users/silas/Documents/GitHub/atla
                                  s/databases]

  -w, --working-dir PATH          location to run atlas
  --assembler [megahit|spades]    assembler  [default: spades]
  --data-type [metagenome|metatranscriptome]
                                  sample data type  [default: metagenome]
  --interleaved-fastq             fastq files are paired-end in one files
                                  (interleaved)

  --threads INTEGER               number of threads to use per multi-threaded
                                  job

  --skip-qc                       Skip QC, if reads are already pre-processed
  -h, --help                      Show this message and exit.
Start a new project with public data

Since v2.9, atlas can start a new project from public data stored in the Sequence Read Archive (SRA).

You can run atlas init-public <SRA_IDs> and specify any IDs, e.g. BioProjects or other SRA IDs.

Atlas does the following steps:

  1. Search SRA for the corresponding sequences (runs) and save them in the file SRA/RunInfo_original.tsv. For example, if you specify a BioProject, it fetches the information for all runs of this project.
  2. Atlas filters the runs to contain only valid metagenome sequences, e.g. it excludes singleton reads and 16S sequences. The output is saved in RunInfo.tsv.
  3. Sometimes the same sample is sequenced on different lanes, which results in multiple runs from the same sample. Atlas merges runs from the same BioSample.
  4. Prepare a sample table and a config.yaml, similar to the atlas init command.

If you are not happy with the filtering atlas performs, you can go back to SRA/RunInfo_original.tsv and create a new RunInfo.tsv. If you then rerun atlas init-public continue, it will continue from your modified RunInfo.tsv and do steps 3 and 4 above.

Limitations: For now, atlas cannot handle a mixture of paired-end and single-end reads, so we focus primarily on paired-end data. If you have long reads for your project, you need to specify them yourself in the sample.tsv.

During the run, the reads are downloaded from SRA, likely in the most efficient way, using prefetch and parallel fastq.gz generation. The download step has checkpoints, so if the pipeline gets interrupted you can restart where you left off. Using the command line arguments --restart-times 3 and --keep-going, you can even ask atlas to do multiple restarts before stopping.
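For example (the genomes target is just an illustration), a run that should tolerate transient download failures can be started with:

atlas run genomes --restart-times 3 --keep-going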

The downloaded reads are directly processed. However, if you only want to download the reads, you can use:

atlas run None download_sra
Example: Downloading reads from the Human Microbiome Project 2 (HMP2)
atlas init-public --working-dir HMP2 PRJNA398089

Gives the output:

[Atlas] INFO: Downloading runinfo from SRA
[Atlas] INFO: Start with 2979 runs from 2979 samples
[Atlas] INFO: Runs have the folowing values for LibrarySource: METAGENOMIC, METATRANSCRIPTOMIC
        Select only runs LibrarySource == METAGENOMIC, Filtered out 762 runs
[Atlas] INFO: Runs have the folowing values for LibrarySelection: PCR, RT-PCR, RANDOM
        Select only runs LibrarySelection == RANDOM, Filtered out 879 runs
[Atlas] INFO: Selected 1338 runs from 1338 samples
[Atlas] INFO: Write filtered runinfo to HMP2/RunInfo.tsv
[Atlas] INFO: Prepared sample table with 1338 samples
[Atlas] INFO: Configuration file written to HMP2/config.yaml
        You may want to edit it using any text editor.
Run atlas
atlas run genomes

atlas run needs to know the working directory, which must contain a samples.tsv.

Take note of the --dryrun parameter; see the section Useful command line options for other handy snakemake arguments.
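For example, to see which jobs would be executed without running anything:

atlas run genomes --dryrun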

We recommend using atlas on a cluster execution system, which can be set up with a few more commands.

Usage: atlas run [OPTIONS] [qc|assembly|binning|genomes|genecatalog|None|all]
                 [SNAKEMAKE_ARGS]...

  Runs the ATLAS pipline

  By default all steps are executed but a sub-workflow can be specified.
  Needs a config-file and expects to find a sample table in the working-
  directory. Both can be generated with 'atlas init'

  Most snakemake arguments can be appended to the command for more info see
  'snakemake --help'

  For more details, see: https://metagenome-atlas.readthedocs.io

Options:
  -w, --working-dir PATH  location to run atlas.
  -c, --config-file PATH  config-file generated with 'atlas init'
  -j, --jobs INTEGER      use at most this many jobs in parallel (see cluster
                          submission for mor details).

  --profile TEXT          snakemake profile e.g. for cluster execution.
  -n, --dryrun            Test execution.  [default: False]
  -h, --help              Show this message and exit.

Execute Atlas

Cluster execution

Automatic submitting to cluster systems

Thanks to the underlying Snakemake system, atlas can submit parts of the pipeline to cluster and cloud systems. Instead of running all steps of the pipeline in one cluster job, atlas automatically submits each step to your cluster system, specifying the necessary threads, memory, and runtime based on the values in the config file. As soon as one job has finished, the next one is launched, which lets you use the full capacity of your cluster system; you may even need to pay attention not to spam the other users of the cluster. Atlas periodically checks the status of each cluster job and can re-run failed jobs or continue with other jobs.

See atlas scheduling jobs on a cluster in action https://asciinema.org/a/337467.

If you have a common cluster system (Slurm, LSF, PBS …) we have an easy set up (see below). Otherwise, if you have a different cluster system, file a GitHub issue (feature request) so we can help you bring the magic of atlas to your cluster system. For more information about cluster- and cloud submission, have a look at the snakemake cluster docs.

Set up of cluster execution

You need cookiecutter to be installed; it comes with atlas.

Then run:

cookiecutter --output-dir ~/.config/snakemake https://github.com/metagenome-atlas/clusterprofile.git

This opens an interactive dialog and asks you for the name of the profile and your cluster system. We recommend keeping the default name cluster. The profile has been tested on Slurm, LSF, and PBS.

The resources (threads, memory, and time) are defined in the atlas config file (time in hours, memory in GB).
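As a rough, illustrative sketch only (the exact key names and default values are those in your generated config.yaml / template_config.yaml and may differ between atlas versions), the relevant part of the config looks something like:

threads: 8
mem: 60            # GB, for normal jobs
large_mem: 250     # GB, e.g. for assembly
runtime:           # in hours
  default: 5
  long: 24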

Specify queues and accounts

If you have different queues/partitions on your cluster system you should tell atlas about them so it can automatically choose the best queue. Adapt the template for the queues.tsv:

cp ~/.config/snakemake/cluster/queues.tsv.example ~/.config/snakemake/cluster/queues.tsv

Now enter the information about the queues/partitions on your particular system.

If you need to specify accounts or other options, you can do so for all rules or for specific rules in ~/.config/snakemake/cluster/cluster_config.yaml. In addition, this file lets you overwrite the resources defined in the config file.

Example cluster_config.yaml:

__default__:
# default parameter for all rules
  account: project_1345
  nodes: 1
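To send a single resource-hungry rule to a specific partition, you can additionally add a per-rule entry. The sketch below is only an illustration: the rule name run_spades and the queue name bigmem are placeholders, and the keys that are actually understood depend on the profile you generated.

run_spades:
  queue: bigmem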

Now, you can run atlas on a cluster with:

atlas run <options> --profile cluster

As the whole pipeline can take several days, I usually run atlas itself on a cluster in a long running queue.

If a job fails, you will find the “external jobid” in the error message. You can investigate the job via this ID.

The atlas argument --jobs now becomes the number of jobs simultaneously submitted to the cluster system. You can set this as high as 99 if your colleagues don’t mind you over-using the cluster system.

Single machine execution

If you don't want to use automatic scheduling, you can run atlas on a single machine (local execution), ideally one with a lot of memory and many threads. In this case I recommend the following options. The same applies if you submit atlas as a single job to a cluster.

Atlas detects how many CPUs and how much memory are available on your system and schedules as many jobs in parallel as possible. If fewer resources are available than specified in the config file, the jobs are scaled down.

By default atlas uses all CPUs and 95% of the available memory. If you are not happy with that, or if you need to specify an exact amount of memory or CPUs, you can use the command line arguments --jobs and --max-mem.
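For example, to limit a local run to 16 cores and roughly 120 GB of memory (the numbers are arbitrary, and this assumes --max-mem is given in GB like the memory values in the config file):

atlas run genomes --jobs 16 --max-mem 120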

Cloud execution

Atlas, like any other snakemake pipeline, can also easily be submitted to cloud systems. I suggest looking at the snakemake documentation. Keep in mind that any snakemake command line argument can simply be appended to the atlas command.

Useful command line options

Atlas builds on snakemake. We designed the command line interface in a way that additional snakemake arguments can be added to an atlas run call.

For instance the --profile used for cluster execution. Other handy snakemake command line arguments include:

--keep-going, which allows atlas in the case of a failed job to continue with independent steps.

--report, which allows atlas to generate a user-friendly run report (e.g., by specifying --report report.html). This report includes the steps used in the analysis workflow and the versions of software tools used at each step. See discussions #523 and #514.
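For example (using the genomes target just for illustration):

# continue with independent jobs even if one job fails
atlas run genomes --keep-going

# after the run, generate a report of the executed steps
atlas run genomes --report report.html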

For a full list of snakemake arguments see the snakemake doc.

Expected output

Atlas is a workflow for assembly and binning of metagenomic reads.

There are two main workflows implemented in atlas: A. Genomes and B. Genecatalog. The first aims at producing metagenome-assembled genomes (MAGs), whereas the latter produces a gene catalog. The quality control and assembly steps are shared by both workflows.

Note

Have a look at the example output at https://github.com/metagenome-atlas/Tutorial/Example.

Quality control

atlas run qc
# or
atlas run genomes
# or
atlas run genecatalog

Runs quality control of single or paired end reads and summarizes the main QC stats in reports/QC_report.html.

Per sample it generates:

  • {sample}/sequence_quality_control/{sample}_QC_{fraction}.fastq.gz
  • Various quality stats in {sample}/sequence_quality_control/read_stats
Fractions:

When the input is paired-end, the reads are output in three fractions: R1, R2, and se. The se fraction contains the paired-end reads that lost their mate during filtering.

The se reads are no longer used as they usually represent an insignificant number of reads.

Assembly

atlas run assembly
#or
atlas run genomes
# or
atlas run genecatalog

Besides the reports/assembly_report.html this rule outputs the following files per sample:

  • {sample}/{sample}_contigs.fasta
  • {sample}/sequence_alignment/{sample}.bam
  • {sample}/assembly/contig_stats/final_contig_stats.txt

Binning

atlas run binning
#or
atlas run genomes

When you use different binners (e.g. metabat, maxbin) and a bin-reconciliator (e.g. DAS Tool), then Atlas will produce for each binner and sample:

  • {sample}/binning/{binner}/cluster_attribution.tsv

which shows the attribution of contigs to bins. For the final_binner it produces the

  • reports/bin_report_{binner}.html

See this example for a summary of the quality of all bins.

See also

In version 2.8 the new binners vamb and SemiBin were added. First experience shows that they outperform the default binners (metabat, maxbin + DASTool). They use a new co-binning approach, which makes use of the co-abundance across different samples. For more information see the detailed explanation here on page 14.

Note

Keep in mind that maxbin, DASTool, and SemiBin are biased toward prokaryotes. If you want to try to bin (small) eukaryotes, use metabat or vamb. For more information about eukaryotes, see the discussion here.

Genomes

atlas run genomes

Binning can predict the same genome several times from different samples. To remove this redundancy, we use dRep to filter and de-replicate the genomes. By default the threshold is set to 97.5% ANI, which corresponds roughly to the sub-species level. The best-quality genome of each cluster is chosen as the representative. The representative MAGs are then renamed and used for annotation and quantification.

The fasta sequences of the dereplicated and renamed genomes can be found in genomes/genomes, and their quality estimates are in genomes/checkm/completeness.tsv.

Quantification

The quantification of the genomes can be found in:

  • genomes/counts/median_coverage_genomes.tsv
  • genomes/counts/raw_counts_genomes.tsv

See also

See the Atlas example for how to analyze these abundances.

Annotations

The annotation can be turned off and on in the config file:

annotations:
  - genes
  - gtdb_tree
  - gtdb_taxonomy
  - kegg_modules
  - dram

The genes option produces predicted genes and translated protein sequences which are stored in genomes/annotations/genes.

Taxonomic annotation

A taxonomy for the genomes is proposed by the Genome Taxonomy Database (GTDB). The results can be found in genomes/taxonomy. The genomes are placed in a phylogenetic tree, separately for bacteria and archaea, using the GTDB markers.

In addition, a tree for bacteria and archaea can be generated based on the checkm markers. All trees are rooted at the midpoint. The files can be found in genomes/tree.

Functional annotation

Since version 2.8, we use DRAM to annotate the genomes with functional annotations, e.g. KEGG and CAZy, and to infer pathways, or more specifically KEGG modules.

The functional annotations for each genome can be found in genomes/annotations/dram/ and contain the following files:

  • kegg_modules.tsv Table of all Kegg modules
  • annotations.tsv Table of all annotations
  • distil/metabolism_summary.xlsx Excel of the summary of all annotations

The tool also produces a nice report in distil/product.html.

Gene Catalog

atlas run all
# or
atlas run genecatalog

The gene catalog takes all genes predicted from the contigs and clusters them according to the configuration. It quantifies them by simply mapping reads to the genes (cds sequences) and annotates them using EggNOG mapper.

This rule produces the following output files for the whole dataset:

  • Genecatalog/gene_catalog.fna
  • Genecatalog/gene_catalog.faa
  • Genecatalog/annotations/eggNog.tsv.gz
  • Genecatalog/counts/
Since version 2.15, the output of the quantification is stored in two HDF5 files in the folder Genecatalog/counts/:

  • median_coverage.h5
  • Nmapped_reads.h5

together with statistics per gene and per sample:

  • gene_coverage_stats.parquet
  • sample_coverage_stats.tsv

The HDF5 file only contains a matrix of abundances or counts under the name data. The sample names are stored as an attribute. The gene names (e.g. Gene00001) correspond simply to the row numbers.

You can open the HDF5 file in Python or R. In Python:

import h5py

filename = "path/to/atlas_dir/Genecatalog/counts/median_coverage.h5"

with h5py.File(filename, 'r') as hdf_file:

    data_matrix = hdf_file['data'][:]
    sample_names = hdf_file['data'].attrs['sample_names'].astype(str)

In R:

library(rhdf5)

filename = "path/to/atlas_dir/Genecatalog/counts/median_coverage.h5"

data <- h5read(filename, "data")

attributes= h5readAttributes(filename, "data")

colnames(data) <- attributes$sample_names

You don’t need to load the full data. You can select only a subset of genes, e.g. the genes with annotations, or genes that are not singletons. To find out which genes are singletons, you can use the file gene_coverage_stats.parquet.

library(rhdf5)
library(dplyr)
library(tibble)

# read only a subset of the data
atlas_dir <- "path/to/atlas_dir"  # path to the atlas working directory
indexes_of_genes_to_load = c(2,5,100,150) # e.g. genes with annotations
abundance_file <- file.path(atlas_dir,"Genecatalog/counts/median_coverage.h5")


# get dimension of data

h5overview=h5ls(abundance_file)
dim= h5overview[1,"dim"] %>% stringr::str_split(" x ",simplify=T) %>% as.numeric
cat("Load ",length(indexes_of_genes_to_load), " out of ", dim[1] , " genes\n")


data <- h5read(file = abundance_file, name = "data",
                index = list(indexes_of_genes_to_load, NULL))

# add sample names
attributes= h5readAttributes(abundance_file, "data")
colnames(data) <- attributes$sample_names


# add gene names (e.g. Gene00001) as rownames
gene_names = paste0("Gene", formatC(format="d",indexes_of_genes_to_load,flag="0",width=ceiling(log10(max(dim[1])))))
rownames(data) <- gene_names


data[1:5,1:5]

If you do this you can use the information in the file Genecatalog/counts/sample_coverage_stats.tsv to normalize the counts.

Here is the R code to calculate the gene copies per million (analogous to transcripts per million) for the subset of genes.

# Load gene stats per sample
gene_stats_file = file.path(atlas_dir,"Genecatalog/counts/sample_coverage_stats.tsv")

gene_stats <- read.table(gene_stats_file,sep='\t',header=T,row.names=1)

gene_stats <- t(gene_stats) # might be transposed, sample names should be index

head(gene_stats)

# calculate copies per million
total_covarage <- gene_stats[colnames(data)  ,"Sum_coverage"]

# gives wrong results
#gene_gcpm<- data / total_covarage *1e6

gene_gcpm<- data %*% diag(1/total_covarage) *1e6
colnames(gene_gcpm) <- colnames(data)

gene_gcpm[1:5,1:5]

See also

See in Atlas Tutorial

Before version 2.15, the counts were stored in a parquet file. The parquet file can be opened easily with pandas.read_parquet or arrow::read_parquet. However, you need to load the full data into memory.

parquet_file <- file.path(atlas_dir,"Genecatalog/counts/median_coverage.parquet")
gene_abundances<- arrow::read_parquet(parquet_file)

# transform tibble to a matrix
gene_matrix= as.matrix(gene_abundances[,-1])
rownames(gene_matrix) <- gene_abundances$GeneNr


# calculate copies per million: divide each column by its total coverage
# (dividing the matrix directly by colSums() recycles the vector row-wise and
#  gives wrong results, as noted above)
gene_gcpm <- sweep(gene_matrix, 2, colSums(gene_matrix), "/") * 1e6


gene_gcpm[1:5,1:5]

All

The option atlas run all runs both the Genecatalog and Genomes workflows and creates mapping tables between them. However, in the future the two workflows are expected to diverge more and more, to better fulfill their respective aims.

If you want to run both workflows together you can do this by:

atlas run genomes genecatalog

If you are interested in mapping the genes to the genomes see the discussion at https://github.com/metagenome-atlas/atlas/issues/413

Configure Atlas

Remove reads from Host

One of the most important steps in quality control is removing reads from the host genome. You can add any number of host genomes to be removed.

We recommend using genomes in which repetitive sequences are masked. See here for more details on the masked human genome.

Co-abundance Binning

While binning each sample individually is faster, using co-abundance for binning is recommended. Quantifying the coverage of contigs across multiple samples provides valuable insights about contig co-variation.

There are two primary strategies for co-abundance binning:

  1. Cross mapping: Map the reads from multiple samples to each sample’s contigs.
  2. Co-binning: Concatenate contigs from multiple samples and map all the reads to these combined contigs.

The final_binner determines the strategy: metabat2 is used for cross-mapping, while vamb or SemiBin is used for co-binning.

The samples to be binned together are specified using the BinGroup in the sample.tsv file. The size of the BinGroup should be selected based on the binner and the co-binning strategy in use.

Cross mapping complexity scales quadratically with the size of the BinGroup since each sample’s reads are mapped to each other. This might yield better results for complex metagenomes, although no definitive benchmark is known. On the other hand, co-binning is more efficient, as it maps a sample’s reads only once to a potentially large assembly.

Default Behavior

Starting with version 2.18, Atlas places every sample in a single BinGroup and defaults to vamb as the binner unless there are very few samples. For fewer than 8 samples, metabat is the default binner.

Note

This represents a departure from previous versions, where each sample had its own BinGroup. Running vamb in those versions would consider all samples, regardless of their BinGroup. This change might cause errors if using a sample.tsv file from an older Atlas version. Typically, you can resolve this by assigning a unique BinGroup to each sample.

The mapping threshold has been adjusted to 95% identity (single sample binning is 97%) to allow reads from different strains — but not other species — to map to contigs from a different sample.

If you’re co-binning more than 150-200 samples or cross-mapping more than 50 samples, Atlas will issue a warning regarding excessive samples in a BinGroup. Although VAMB’s official publication suggests it can handle up to 1000 samples, this demands substantial resources.

Therefore, splitting your samples into multiple BinGroups is recommended. Ideally, related samples, or those where the same species are anticipated, should belong to the same BinGroup.

Single-sample Binning

To employ single-sample binning, simply assign each sample to its own BinGroup and select metabat or DASTool as the final_binner.

Although it’s not recommended, it’s feasible to use DASTool and feed it inputs from metabat and other co-abundance-based binners.

Add the following lines to your config.yaml:

final_binner: DASTool

binner:
  - metabat
  - maxbin
  - vamb

Long reads

Limitation: Hybrid assembly of long and short reads is supported with spades and metaSpades. However metaSpades needs a paired-end short-read library.

The path of the (preprocessed) long reads should be added manually to the sample table under a new column heading 'longreads'.

In addition, the type of the long reads should be defined in the config file with longread_type, one of ["pacbio", "nanopore", "sanger", "trusted-contigs", "untrusted-contigs"].
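For illustration (the path is a placeholder): add a longreads column to samples.tsv pointing to e.g. path/to/sample1_nanopore.fastq.gz for each sample, and set in config.yaml:

longread_type: nanopore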

Example config file

The full template can be found in the atlas repository at config/template_config.yaml.

Detailed configuration

For detailed configuration of the quality control and assembly steps, see the advanced configuration pages of the documentation.

# Change log

## 2.17

### Skani

The tool skani claims to be better and faster than the combination of Mash + FastANI used by dRep, so I implemented skani for the species clustering. The species clustering is now done in the atlas run binning step, so you get information about the number of dereplicated species in the binning report. This allows you to run different binners before choosing the one to use for the genome annotation. Also, the file storage was improved: all important files are in Binning/{binner}/.

My custom species clustering does the following steps:

  1. Pre-cluster genomes with single-linkage at 92.5% ANI.
  2. Re-calibrate checkm2 results:
  • If a minority of genomes from a pre-cluster use a different translation table, they are removed.
  • If some genomes of a pre-cluster don’t use the specialized completeness model, we re-calibrate the completeness to the minimum value. This ensures that a bad genome evaluated on the general model is not preferred over a better genome evaluated on the specific model. See also https://silask.github.io/post/better_genomes/ Section 2.
  • Genomes that don’t correspond to the filter criteria after re-calibration are dropped.
  3. Cluster genomes with an ANI threshold of 95% (default).
  4. Select the best genome of each cluster as representative, based on the quality score Completeness - 5x Contamination.

### New Contributors * @jotech made their first contribution in https://github.com/metagenome-atlas/atlas/pull/667

## 2.16

  • gtdb08

## 2.15

## 2.14

Thank you @trickovicmatija for your help.

Full Changelog: https://github.com/metagenome-atlas/atlas/compare/v2.13.1...v2.14.0

## 2.13

The filter function is defined in the config file: ` genome_filter_criteria: "(Completeness-5*Contamination >50 ) & (Length_scaffolds >=50000) & (Ambigious_bases <1e6) & (N50 > 5*1e3) & (N_scaffolds < 1e3)" ` The genome filtering is similar to that of other publications in the field, e.g. GTDB. What is maybe a bit different is that genomes with completeness around 50% and contamination around 10% are excluded, whereas dRep with default parameters would include them.

We saw better performance using dRep; this now also scales to ~1K samples. * Use new DRAM version 1.4 in https://github.com/metagenome-atlas/atlas/pull/564

Full Changelog: https://github.com/metagenome-atlas/atlas/compare/v2.12.0...v2.13.0

## 2.12

  • GTDB-tk requires rule extract_gtdb to run first by @Waschina in https://github.com/metagenome-atlas/atlas/pull/551
  • use Galah instead of Drep
  • use bbsplit for mapping to genomes (maybe move to minimap in future)
  • faster gene catalog quantification using minimap.
  • Compatible with snakemake v7.15

### New Contributors * @Waschina made their first contribution in https://github.com/metagenome-atlas/atlas/pull/551

Full Changelog: https://github.com/metagenome-atlas/atlas/compare/v2.11.1...v2.12.0

## 2.11 * Make atlas handle large gene catalogs using parquet and pyfastx (Fix #515)

parquet files can be opened in python with

```
import pandas as pd
coverage = pd.read_parquet("working_dir/Genecatalog/counts/median_coverage.parquet")
coverage.set_index("GeneNr", inplace=True)
```

and in R it should be something like:

```
arrow::read_parquet("working_dir/Genecatalog/counts/median_coverage.parquet")
```

Full Changelog: https://github.com/metagenome-atlas/atlas/compare/v2.10.0...v2.11.0

## [2.10](https://github.com/metagenome-atlas/atlas/compare/v2.9.1...v2.10.0)

### Features * GTDB version 207 * Low memory taxonomic annotation

## [2.9](https://github.com/metagenome-atlas/atlas/compare/v2.8.2...v2.9.0)

### Features * ✨ Start an atlas project from public data in SRA [Docs](https://metagenome-atlas.readthedocs.io/en/latest/usage/getting_started.html#start-a-new-project-with-public-data) * Make atlas ready for python 3.10 https://github.com/metagenome-atlas/atlas/pull/498 * Add strain profiling using inStrain You can run atlas run genomes strains

### New Contributors * @alienzj made their first contribution to fix config when run DRAM annotate in https://github.com/metagenome-atlas/atlas/pull/495

## 2.8 This is a major update of metagenome-atlas. It was developed for the [3-day course in Finland](https://silask.github.io/talk/3-day-course-on-metagenome-atlas/), which is also why it has a Finnish release name.

### New binners It integrates the bleeding-edge binners Vamb and SemiBin, which use co-binning based on co-abundance. Thank you @yanhui09 and @psj1997 for helping with this. First results show that these binners outperform the default ones.

[See more](https://metagenome-atlas.readthedocs.io/en/v2.8.0/usage/output.html#binning)

### Pathway annotations The command atlas run genomes produces genome-level functional annotations, KEGG pathways, and their respective modules. It uses DRAM from @shafferm with a hack to produce all available KEGG modules.

[See more](https://metagenome-atlas.readthedocs.io/en/v2.8.0/usage/output.html#annotations)

### Genecatalog The command atlas run genecatalog now directly produces the abundance of the different genes. See more in #276.

> In future this part of the pipeline will include protein assembly to better tackle complicated metagenomes.

### Minor updates

#### Reports are back See for example the [QC report](https://metagenome-atlas.readthedocs.io/en/v2.8.0/_static/QC_report.html)

#### Update of all underlying tools All tools used in atlas are now up to date, from the assembler to GTDB. The one exception is BBmap, which contains a [bug](https://sourceforge.net/p/bbmap/tickets/48/) and ignores the minidentity parameter.

#### Atlas init Atlas init correctly parses fastq files even if they are in subfolders and if paired-end files are named simply Sample_1/Sample_2. @Sofie8 will be happy about this. The atlas log now uses nice colors.

#### Default clustering of Subspecies

The default ANI threshold for genome-dereplication was set to 97.5% to include more sub-species diversity.

[See more](https://metagenome-atlas.readthedocs.io/en/v2.8.0/usage/output.html#genomes)