Welcome to DaisySuite’s documentation!

DaisySuite is a mapping-based workflow for analyzing horizontal gene transfer (HGT) events in bacterial data. The Next Generation Sequencing (NGS) input is processed in two major steps.
First, DaisyGPS identifies possible acceptor and donor candidates. Second, Daisy determines the exact HGT regions for the acceptor/donor pairs.

Getting started

You can install DaisySuite either with Conda or with git.

Tools used

Contact

Issue Tracker
Enrico Seiler
Kathrin Trappe

Install with Conda

Installing DaisySuite

We recommend the usage of Conda to install DaisySuite.
If you prefer to not use Conda, you can find instructions in the Install with git chapter.

To install Conda under Linux 64bit, run

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh

and follow the instructions.
Please visit the Miniconda homepage for further information.

Next, add the bioconda channel to your conda installation.

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Once Conda is available, DaisySuite can be easily installed via

conda install daisysuite

Note that tools already present on your system, e.g. bwa, may be replaced by older or newer versions according to the requirements specified for DaisySuite in the Conda package.
To install DaisySuite in a new environment, thereby leaving your existing installations untouched, use

conda create -n daisysuite_env daisysuite

where -n daisysuite_env specifies the environment name and can be chosen freely.
Use source activate daisysuite_env to activate your new environment and source deactivate to exit the environment.
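A minimal session inside the new environment could look as follows (assuming the environment name daisysuite_env from above):

### Work inside the DaisySuite environment
source activate daisysuite_env
DaisySuite -h
source deactivate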

Additional dependency for Laser

Laser uses bwa for structural variation analysis and requires the additional bwa perl script xa2multi.pl, which is usually not installed with bwa but is available in the bwa github.
xa2multi.pl needs to be executable and located in a directory listed in your $PATH in order to use Laser, for example:

mkdir -p ~/bin
cd ~/bin
wget https://raw.githubusercontent.com/lh3/bwa/master/xa2multi.pl
chmod +x xa2multi.pl
export PATH=~/bin:$PATH

Setting DaisySuite up

You can automatically download and create all required data by running DaisySuite_setup <dir>. This will put the NCBI database and corresponding indices into the directory <dir>.
Alternatively, the requirements are explained in the Database requirements section. Yara needs up to 1 TB of temporary disk space to create its index. Currently, the setup script will abort if any of its steps fails. In this case, consider running the individual steps one by one (see the Database creation section).
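
For example, to place the database under ~/daisysuite_db (a placeholder path chosen purely for illustration):

DaisySuite_setup ~/daisysuite_db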

Install with git

Installing DaisySuite

If you prefer to use Conda, you can find instructions in the Install with Conda chapter.

To use DaisySuite, the following dependencies need to be satisfied and globally available:

To install DaisySuite, run

git clone https://gitlab.com/eseiler/DaisySuite.git
cd DaisySuite
chmod +x DaisySuite*

For easy access, you might want to add the DaisySuite directory to your PATH variable, e.g.

export PATH=~/DaisySuite/:$PATH

Additional dependency for Laser

Laser uses bwa for structural variation analysis and requires the additional bwa perl script xa2multi.pl, which is usually not installed with bwa but is available in the bwa github.
xa2multi.pl needs to be executable and located in a directory listed in your $PATH in order to use Laser, for example:

mkdir -p ~/bin
cd ~/bin
wget https://raw.githubusercontent.com/lh3/bwa/master/xa2multi.pl
chmod +x xa2multi.pl
export PATH=~/bin:$PATH

Setting DaisySuite up

You can automatically download and create all required data by running DaisySuite_setup <dir>. This will put the NCBI database and corresponding indices into the directory <dir>.
Alternatively, the requirements are explained in the Database requirements section.

Database requirements

You can automatically download and create all required data by running DaisySuite_setup <dir>. This will put the NCBI database and corresponding indices into the directory <dir>.
Alternatively, the individual requirements are explained in this section.

DaisySuite requires following data:

  • NCBI RefSeq
  • bwa index of the NCBI RefSeq
  • yara index of the NCBI RefSeq
  • MicrobeGPS taxonomy files

We provide example code snippets for each of these requirements.

NCBI RefSeq

First, download all NCBI entries that are tagged as “complete genomes” into a directory of your choice, e.g.

### Download NCBI database
mkdir ncbi
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
wget --directory-prefix ncbi -q ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/367/745/GCF_000367745.1_ASM36774v1/GCF_000367745.1_ASM36774v1_genomic.fna.gz
cat assembly_summary.txt | \
awk 'BEGIN {FS="\t"} !/^#/ && $12 ~ "Complete Genome" {print $20}' | \
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/[0-9]*/[0-9]*/[0-9]*/)(GCF_.+)|\1\2/\2_genomic.fna.gz|' | \
xargs -n 1 -P 12 wget --directory-prefix ncbi -q
rm assembly_summary.txt

Note that we download an additional file with the second wget call. It contains the acceptor reference of one of our datasets and is incorrectly labeled as not complete in the assembly summary, hence the manual download.

Preprocessing NCBI

Next, we want to

  • extract the downloaded references
  • split references containing multiple sequences
  • remove plasmid references
  • rename the fasta files to <accession.version>.fasta
  • get a list of all accessions for future use
  • merge the fasta files (while keeping the originals)
  • gzip all fasta files

First, we need to extract the references.

### Extract NCBI database
gzip -d ncbi/*_genomic.fna.gz

The pipeline requires each fasta file to contain only a single sequence. Multi-fasta files can be split, for example, with the python script split_fasta.py in the data/scripts/src directory of your DaisySuite installation.

### Split fasta files
for f in ncbi/*.fna;
do
    data/scripts/src/split_fasta.py $f
done
For reference, the content of split_fasta.py:

#!/usr/bin/python3
import os
import sys

def split_multifasta(fname):
    """Split a multi-fasta file into one file per sequence and remove the original."""
    count = 0
    g = None
    comp = fname.split('.')
    with open(fname, 'r') as f:
        for line in f:
            if line.startswith('>'):
                # Each header starts a new output file named <original>_split<n>.<ext>
                if g is not None:
                    g.close()
                g = open('{}_split{}.{}'.format('.'.join(comp[:-1]), count, comp[-1]), 'w')
                count += 1
            g.write(line)
    if g is not None:
        g.close()
    os.remove(fname)

if __name__ == '__main__':
    split_multifasta(os.path.realpath(sys.argv[1]))

The DaisyGPS part of DaisySuite currently cannot detect plasmids as potential HGT donors. The plasmids need to be excluded from the database, and the remaining references should be named <accession.version>.fasta.

### Rename fasta files and exclude plasmids
for f in ncbi/*.fna;
do
    if [[ $(head -n 1 $f) != *"Plasmid"* ]] && [[ $(head -n 1 $f) != *"plasmid"* ]]
    then
        accession=$(sed 's/>\([A-Za-z0-9\._]*\) .*/\1/' <(head -n 1 $f))
        new=$(dirname $f)/$accession.fasta
        mv $f $new
    else
        rm -f $f
    fi
done

For the creation of a taxonomic file for MicrobeGPS, we need a list of all references included in our database:

### Get accession list
for i in ncbi/*.fasta*;
do
    basename $i | sed 's/\(.*\)\.fasta.*/\1/' >> acc.txt
done

Most mapper indexers accept only a single fasta file containing all references, so we merge our sequences. We also compress the resulting fasta to save space.

### Merge fasta files
cat ncbi/*.fasta > ncbi/reference.fasta

### Gzip fasta files
gzip ncbi/*.fasta

Taxonomy

MicrobeGPS is used in DaisyGPS and leverages taxonomic information for the profiling of the sample. Therefore, several taxonomic files are needed:

  • NCBI’s nodes.dmp
  • NCBI’s names.dmp
  • a catalog (bact.catalog) containing accession.taxid, taxid and name of each reference present in the NCBI database

The nodes.dmp, names.dmp and bact.catalog must be saved in the data/microbeGPS/data folder of your DaisySuite installation.

### Get names.dmp and nodes.dmp
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xfz taxdump.tar.gz names.dmp nodes.dmp
mv names.dmp data/microbeGPS/data/names.dmp
mv nodes.dmp data/microbeGPS/data/nodes.dmp
rm taxdump.tar.gz

To create the catalog file, you can use the createMinimalMGPSCatalog.py in the data/scripts/src directory of your DaisySuite installation.
It requires the NCBI nucl_gb.accession2taxid file, names.dmp and the list of all accessions in the NCBI database generated in the preprocessing step.

### Get catalog
wget -T 1800 -qO- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz | gzip -dc > nucl_gb.accession2taxid
data/scripts/src/createMinimalMGPSCatalog.py nucl_gb.accession2taxid data/microbeGPS/data/names.dmp data/microbeGPS/data/bact.catalog -ref acc.txt
rm nucl_gb.accession2taxid

The accession list is now no longer needed and can be deleted.

### Remove accession list
rm acc.txt

Indices

As a last step, we need to generate the yara or bwa index (one is sufficient to run DaisySuite). During development we found Yara to be more sensitive regarding the mapping, but it needs up to 1 TB of temporary disk space to build its index.

### Create bwa index
mkdir -p bwa
bwa index -p bwa/bwa_index ncbi/reference.fasta.gz

### Create yara index
mkdir -p yara
yara_indexer ncbi/reference.fasta.gz -o yara/yara_index

After this, we can delete the merged reference file.

### Remove merged fasta file
rm ncbi/reference.fasta.gz

DaisySuite configuration

The configuration of DaisySuite is done via a yaml file. You can generate a template containing all options with explanations by using

DaisySuite_template .

to create a template in the current working directory. Additionally, the options are explained on this page.

General configuration

There are three parameters that determine which steps of the pipeline are run (read simulation, DaisyGPS, Daisy). Possible values are true and false. To run, for example, DaisyGPS and Daisy on an existing read dataset, use:

sim: false
daisygps: true
daisy: true

Also, an output directory must be defined via the outputdir variable:

outputdir: 'path/to/output/directory'

If sim is set to true, this directory can be empty; otherwise the reads are expected to be in the reads directory of the output directory. The reads must be named <sample>.1.fq and <sample>.2.fq, see also Using DaisySuite.

Simulation configuration

The simulation needs the NCBI folder created by the setup (see Using DaisySuite), e.g.:

ncbidir: 'path/to/DaisySuite/data/ncbi'

You can further specify how many simulations are to be run, the coverage and the read length:

simulates: 1
coverage: 100
readlength: 150

This will create one simulated read dataset in which the reference is covered about 100-fold by reads with an average length of 150 bp.

You can also choose not to incorporate any HGT event, e.g. to test for specificity:

negative: true

Furthermore, there are various parameters to configure the read simulation done by mason, e.g. SNP rate:

# Acceptor: Mason variator SNP rate
a_snp: 0.01
# Acceptor: Mason variator small indel rate
a_sir: 0.001
# Acceptor: Mason variator sv indel rate
a_svir: 0.00001
# Donor: Mason variator SNP rate
d_snp: 0.001
# Donor: Mason variator small indel rate
d_sir: 0.001
# Donor: Mason variator max small indel size
d_msis: 4

DaisyGPS configuration

You can select a mapper (bwa or yara) and the number of threads used by the mapper:

bwa: false
threads: 20

Those settings are shared with Daisy.

According to the choice of your mapper, you need to specify the location of the index. The indices are generated during the setup (see Using DaisySuite):

yara_index: 'path/to/yara_index'
bwa_index: 'path/to/bwa_index'

Note that you can specify both indices; the pipeline will pick the index according to your chosen mapper.

You also need to specify the NCBI directory as described in the Simulation section.

You can exclude taxa per sample from the candidate selection. You can blacklist single taxa, a whole species, or all children of a taxon:

taxon_blacklist:
  - sample1:
    - 672612
  - sample2:
    - 726312

species_blacklist:
  - sample2:
    - 1270

parent_blacklist:
  - sample2:
    - 3173

You can also opt to only report candidates that are classified more specifically than species level, i.e. if a taxon has the same taxonomic ID as the corresponding species taxonomic ID, the taxon is not reported. If the resulting candidate list would be empty with this filter activated, the filter is ignored.

filter_species: true

Furthermore, you can set how many acceptors, donors and acceptor-like donors are reported:

number_acceptors: 2
number_donors: 3
number_accdons: 2

Daisy configuration

The mapper and threads choices are shared with DaisyGPS.

You can either use a sensitive mode (using Stellar and Gustaf) or a less sensitive mode (using Laser).

sensitivemode: true

In case you are using the sensitive mode, the reads are mapped and filtered for not properly aligned pairs. There is an option to only filter for reads where at least one mate is unmapped if the filtering for not properly aligned pairs would yield more than 750000 entries. This will decrease the runtime at the cost of decreased sensitivity.

samflag_filter: true

Another option for the sensitive mode is to define the number of reads that must support a breakpoint for Gustaf. The default is 2, and increasing it will lead to a stricter search for breakpoints. This option is set per sample.

gustaf_st:
  - sample1:
    - 4

Daisy can also search against a phage database. An empty entry will result in no search against the phage DB. If you wish to search against a database, include the full path to the fasta file.

phage: ''

Furthermore, you can set parameters regarding the HGT detection, i.e. the minimum/maximum HGT size and the required support by sampling.

hgt_min: 100
hgt_max: 55000
hgt_sens: 90

Command line help

DaisySuite -h provides a short command overview

DaisySuite Pipeline (powered by Snakemake)

Usage: DaisySuite --configfile FILE [Snakemake options]

 Useful Snakemake parameters:
   -j, --cores            number of cores
   -k, --keep-going       go on with independent jobs if a job fails
   -n, --dryrun           do not execute anything
   -p, --printshellcmds   print out the shell commands that will be executed

 Full list of parameters:
   --help                 show Snakemake help (or snakemake -h)

Using DaisySuite

Preparation

If you want to simulate reads and then analyze them, you only need to specify the outputdir and your database directories in the config.
If you already have (gzipped) fastq reads, these files need to be present within the `reads` folder in the outputdir.
Furthermore, the reads need to follow a specific naming scheme:

  • {sample}.1.fq for single-end reads
  • {sample}.1.fq and {sample}.2.fq for paired-end reads

{sample} is an arbitrary name for the read dataset.
The reads can be either in plain fastq format or gzipped (fq.gz).
Please note that Laser cannot be used with single-end reads.
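
As an illustration, placing an existing paired-end dataset into the expected layout could look like the following sketch (the input file names mydata_R1.fastq.gz / mydata_R2.fastq.gz and the sample name sample1 are hypothetical):

### Copy existing reads into the expected naming scheme
mkdir -p outputdir/reads
cp mydata_R1.fastq.gz outputdir/reads/sample1.1.fq.gz
cp mydata_R2.fastq.gz outputdir/reads/sample1.2.fq.gz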

The read names (the read names within the fastq file) cannot contain a slash (‘/’) symbol.
For example, simulated.1/1 will not work. Such suffixes can be removed, e.g., by using `sed -i 's/\(@simulated\.[0-9]*\)\/[0-9]*/\1/' <fq-file>`.

Running with known acceptors and donors

If you have a read dataset and know possible acceptor and donor organisms, you can skip DaisyGPS and directly use Daisy to detect HGT events.
To do this, you need to set `sim` and `daisygps` in the config to `false` and `daisy` to `true`.
Now create the following directory structure:

.
├──  outputdir
│    ├──  reads
│        └──  {sample}.1.fq.gz          read file 1
│        └──  {sample}.2.fq.gz          read file 2
│    ├──  candidates
│        └──  {sample}_acceptors.fa     Sequences for all acceptor candidates
│        └──  {sample}_acceptors.txt    Accession.Version for all acceptor candidates
│        └──  {sample}_donors.fa        Sequences for all donor candidates
│        └──  {sample}_donors.txt       Accession.Version for all donor candidates

The `reads` folder contains your read dataset (paired or single end), with files prefixed with the sample name.
`candidates/{sample}_acceptors.fa` is the concatenation of all the sequences of the acceptor candidates (a (multi-)FASTA file).
`candidates/{sample}_acceptors.txt` lists all the accessions that appear in `candidates/{sample}_acceptors.fa` (one accession per line).
Analogously, you have to provide `{sample}_donors.fa` and `{sample}_donors.txt`.
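
For illustration, such a structure could be assembled as follows; the accessions are taken from the example chapter of this documentation, and the input fasta file names are hypothetical:

### Provide known candidates for sample1
mkdir -p outputdir/candidates
cat NC_010473.1.fasta > outputdir/candidates/sample1_acceptors.fa
echo 'NC_010473.1' > outputdir/candidates/sample1_acceptors.txt
cat NZ_AP014710.1.fasta > outputdir/candidates/sample1_donors.fa
echo 'NZ_AP014710.1' > outputdir/candidates/sample1_donors.txt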

Example directory

.
├──  outputdir
│    ├──  reads
│        └──  {sample1}.1.fq.gz
│        └──  {sample1}.2.fq.gz
│        └──  {sample2}.1.fq.gz
│        └──  {sample2}.2.fq.gz
│        └──  {sample3}.1.fq.gz
All samples inside the outputdir will be run.

Config file

All necessary parameters are set in a configfile. A template file can be generated by running DaisySuite_template .
For further information, visit the Configuration section.

Running a job

This pipeline uses Snakemake and therefore supports all parameters implemented by Snakemake.

DaisySuite --configfile config/example.yaml

See DaisySuite -h or the Command line help section for additional options.

Output

The results of DaisyGPS are in the candidates directory.
The results of Daisy are in the hgt_eval directory.

.
├──  outputdir
│    └──  log-{timestamp}.txt           logfile containing information about the run
│    ├──  reads
│        └──  {sample1}.1.fq.gz         read file 1
│        └──  {sample1}.2.fq.gz         read file 2
│    ├──  candidates
│        └──  {sample1}_acceptors.fa    Sequences for all acceptor candidates
│        └──  {sample1}_acceptors.txt   Accession.Version for all acceptor candidates
│        └──  {sample1}_candidates.tsv  All reported candidates including several metrics
│        └──  {sample1}_complete.tsv    All mapped references including several metrics
│        └──  {sample1}_donors.fa       Sequences for all donor candidates
│        └──  {sample1}_donors.txt      Accession.Version for all donor candidates
│    ├──  hgt_eval
│        └──  {sample1}.fasta           Sequences of filtered horizontally transferred regions
│        └──  {sample1}.tsv             All HGT regions
│        └──  {sample1}.vcf             All filtered HGT regions
Additionally, there are many intermediate directories, which can be kept with `--nt`.
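
For example, reusing the config file from the Running a job section (`--nt` is Snakemake's shorthand for `--notemp`, which skips the deletion of files marked as temporary):

DaisySuite --configfile config/example.yaml --nt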

Directory          Description
bwa/yara           [Daisy] contains index for acceptor/donor candidates
candidate_mapping  [Daisy] contains mapping of candidates against ncbi
fasta_npa          [Daisy] contains not properly aligned reads in fasta format
fastq_npa          [Daisy] contains not properly aligned reads in fastq format
gustaf             [Daisy] contains results of gustaf
hgt_candidates     [Daisy] contains hgt candidates generated in the hgt evaluation step
hgt_eval_pair      [Daisy] single result files, merged files are found in hgt_eval
joined             [Daisy] contains joined sequences of acceptor and donor candidates
mapping            [DaisyGPS] contains mapping of reads against ncbi
mgps               [DaisyGPS] contains results of MicrobeGPS
npa_joined         [Daisy] contains merged fasta_npa files
process            [DaisyGPS] contains intermediate results of candidate detection
sort_mapping       [Daisy] contains sorted alignment file from candidate_mapping
sort_npa           [Daisy] contains sorted fasta file from npa_joined
stellar            [Daisy] contains results of stellar

Explanation

The objective of the DaisyGPS workflow is to identify possible acceptor (organism that receives genetic information) and donor (organism that the information is transferred from) candidates given reads of a potential HGT organism. The HGT organism’s genome consists mainly of the acceptor genome. When the reads of the HGT organism are mapped against the acceptor reference, most reads should map properly. Therefore a high and continuous mapping coverage pattern of the acceptor genome can be expected.

In contrast to that, only a small part of the donor genome is present within the HGT organism’s genome; hence, only a small fraction of the reads should map against the donor reference, and then only within a confined part (i.e. the part that has been transferred). This results in a discontinuous mapping coverage pattern where only a small part of the reference shows a high mapping coverage.

Given only the reads of the HGT organism, the acceptor and donor candidate identification problem is similar to what is done in metagenomic profiling. We hence use the metagenomic profiling tool MicrobeGPS to gather a profile of our given HGT organism and mapping coverage metrics. MicrobeGPS fits our requirements as it uses metrics that can be adapted to represent acceptor and donor attributes. The candidates are ranked by a score derived from these metrics, and a list with putative acceptor and donor candidates is generated. These acceptor and donor candidates can then be further analysed with tools like Daisy.

MicrobeGPS

MicrobeGPS is a metagenomic profiling tool. Given a set of references and sequencing reads from a sample, it uses the references to describe the potential (new) organisms in the sample. That is, it uses available references to model the composition of the organisms present in the sample instead of directly assigning reference organisms to characterize the sample. To do this, MicrobeGPS introduces metrics to compare the genomes in the sample with the reference genomes. In doing so, MicrobeGPS computes a genome coverage and a metric that corresponds to whether a mapping pattern is continuous or not. Therefore, it is possible to use MicrobeGPS to identify not only the acceptor but also the donor by leveraging those metrics. We generate the alignment file required for MicrobeGPS by mapping the HGT organism’s reads against the NCBI RefSeq using bwa or yara.

To describe the similarity between an organism found in the sample and the corresponding reference genome based on read evidence, MicrobeGPS uses a genomic distance measure, namely the Genome Dataset Validity (GDV, validity for short). It describes the fraction of the reference genome for which there is read evidence. A taxonomic position can then be estimated by using the GDV as a measure for the distance between an organism found in the sample and the closest reference genome. To account for possible biases arising from the use of short NGS reads, the read mapping is modelled as a mixture distribution:

\(f(x|\overline{\alpha},\lambda) = \alpha_{1}\cdot z(x)+\alpha_{2}\cdot P(x|\lambda)+\alpha_{3}\cdot T_{\alpha}(x)\)

where \(z(x)\) (zero distribution) and \(P(x|\lambda)\) (Poisson distribution) form a zero-inflated Poisson distribution that accounts for zero-coverage regions. The tail distribution \(T_{\alpha}(x)\) accounts for differences between reads and reference genome. \(\alpha_{1}\), \(\alpha_{2}\) and \(\alpha_{3}\) are the mixture coefficients of the respective distributions. Since \(z(x)\) represents the zero-coverage regions, the validity can be easily calculated as \(1-\alpha_{1}\).
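
As a worked instance with a purely hypothetical mixture fit: if the fit assigned \(\alpha_{1}=0.75\) to the zero distribution, the validity would be

\(validity = 1-\alpha_{1} = 1-0.75 = 0.25\)

i.e. read evidence would support roughly a quarter of the reference genome.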

Secondly, another key point of interest is to determine how evenly the reads are distributed. In MicrobeGPS, a Kolmogorov-Smirnov (KS) test is used to compare the actual read distribution with a theoretical read distribution that is generated by placing the reads uniformly across the genome, i.e. after having seen \(x\%\) of the genome, \(x\%\) of the reads should also have been seen. The heterogeneity (note: called homogeneity in the original work) is then the KS statistic:

\(D_{n}=\sup_{x}\mid F_{n}(x) - F(x)\mid\)

\(D_{n}\) is the KS statistic, \(F_{n}(x)\) is the empirical distribution function (what we have), and \(F(x)\) the given cumulative distribution function (what we expect).

\(F_{n}(x) = \frac{1}{n}\sum_{i=1}^{n}I_{[-\infty,x]}(X_{i})\), where \(I_{[-\infty,x]}(X_{i})=\begin{cases}1 & X_{i} \leq x\\0 & \text{else}\end{cases}\)

Validity and Heterogeneity describe how much of the reference is covered and how evenly the reads are distributed, respectively. Therefore, they can be used to classify acceptor and donor candidates for an HGT event.

Acceptor and donor classification

For an acceptor, the mapping coverage is relatively high and homogeneous. Hence, we expect a high validity score and a low heterogeneity score. In contrast to that, we expect a low validity and high heterogeneity for donors. These assumptions can be used to define a property score that reflects how well the attributes of a given organism match the properties of an acceptor or donor:

\(property = validity - heterogeneity\)
with \(property \in (-1,1]\)

Therefore, the value for a completely covered acceptor with uniform read distribution would approach \(+1\). Likewise, the value for a donor that is only covered in a small region would approach \(-1\). We tested weighted scores that include additional information such as the number of unique reads or k-mer counts (taken, e.g., directly from Kraken), but they did not improve the classification of suitable candidates. For both mentioned criteria, a high count did not correlate with the candidate being an acceptor or donor (data not shown). For acceptors, however, sheer read numbers do provide strong evidence, which is used to weight the score:

\(property score = \#mapped\_reads/\#total\_reads * property\)

where \(\#mapped\_reads/\#total\_reads\) is the fraction of all mapped reads that mapped to the specific acceptor candidate.

For the donor, however, the size of the transferred region is not known in advance; hence, we do not expect a specific read number evidence and therefore omit the weighting, which yields \(property\) as the metric. We select the acceptors with the highest property scores (approaching \(1\)) and the donors with the lowest property (approaching \(-1\)) as candidates.
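
As a concrete check, take the Helicobacter pylori donor row from the example results later in this documentation (validity \(\approx 0.0178\), heterogeneity \(\approx 0.8000\)):

\(property = 0.0178 - 0.8000 = -0.7822\)

This matches the Property column of that row and clearly marks the organism as a donor.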

In case acceptor and donor are very similar, the donor might not express the attributes we are looking for. In particular, the donor might have a significant read number evidence arising from acceptor reads also mapping to the donor.

As a result, the property score will likely indicate an acceptor instead of a donor (a value slightly above \(0\)), while still being relatively low compared to the property of other acceptors.

To account for this, we also classify candidates with a \(property score > 0\) as possible acceptor-like donors. The entries with the lowest property among them are selected as acceptor-like donors.

For more information and publications, see Citations.

DaisySuite example

To run DaisySuite on an example dataset, first copy the example into a directory of your choice by running

DaisySuite_example .

which copies the folder example into the current working directory.

Next, you need to edit the following parameters in example/example.yaml:

  • outputdir (full path to example/output/)
  • ncbidir
  • bwa (in case you are using bwa)
  • yara_index or bwa_index

Finally, you can run DaisySuite:

DaisySuite --configfile example/example.yaml

You can also use multiple threads by adding --cores <thread_number>, e.g. --cores 10, to the command.
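
For example, the full call with 10 threads would be:

DaisySuite --configfile example/example.yaml --cores 10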

DaisyGPS results

You will find the acceptor Escherichia coli str. K-12 substr. DH10B [NC_010473.1] and the donor Helicobacter pylori [NZ_AP014710.1] in the example/output/candidates/sim1HP_candidates.tsv file.

DaisyGPS Results
Type | Name | Accession.Version | TaxID | Parent TaxID | Species TaxID | Abundance | Num. Reads | Unique Reads | Coverage | Validity | Homogeneity | Mapping Error | Property Score | Property
Acceptor | Escherichia coli str. K-12 substr. DH10B | NC_010473.1 | 316385 | 83333 | 562 | 0.946692320209823 | 197800 | 136 | 36.8658152545 | 0.25427675964 | 0.08146934189377475 | 0.021496966633 | 0.0030079101341754125 | 0.17280741774622527
Acceptor | Escherichia coli K-12 | NZ_CP010445.1 | 83333 | 562 | 562 | 0.8952416506332022 | 187050 | 0 | 35.7282033189 | 0.237248626581 | 0.07496815501320375 | 0.0214397576406 | 0.0026711615990942022 | 0.16228047156779624
Donor | [Haemophilus] ducreyi | NZ_CP015434.1 | 730 | 724 | 730 | 0.0015411270328997118 | 322 | 0 | 24.7946611905 | 0.00109715387698 | 0.9254876430438622 | 0.0265838509317 | -2.619313788987036e-05 | -0.9243904891668824
Donor | Salmonella enterica subsp. enterica serovar Anatum str. USDA-ARS-USMARC-1676 | NZ_CP014620.1 | 1454587 | 58712 | 28901 | 0.0006030497085259743 | 126 | 0 | 0.0645783951687 | 0.00108528243123 | 0.9193443687114572 | 0.013492063492100002 | -1.0181504759172115e-05 | -0.9182590862802272
Donor | Klebsiella oxytoca KONIH1 | NZ_CP008788.1 | 1333852 | 571 | 571 | 0.008571920856904919 | 1791 | 0 | 41.2926529358 | 0.00105750960227 | 0.7946631781949077 | 0.0263539921831 | -0.00012507673507004732 | -0.7936056685926377
Donor | Helicobacter pylori | NZ_AP014710.1 | 210 | 209 | 210 | 0.043812039935291806 | 9154 | 9091 | 47.3515414856 | 0.0177922034096 | 0.7999996890279752 | 0.00920544752749 | -0.0006300993983310352 | -0.7822074856183753
Acceptor-like Donor | Escherichia coli | NZ_CP016182.1 | 562 | 561 | 562 | 0.35694799414180284 | 74580 | 0 | 6.99537943225 | 0.0939148463399 | 0.0879420602326042 | 0.0211231786896 | 3.919904897022354e-05 | 0.00597278610729579

Daisy results

Furthermore, you will find the base pair positions of the transfer in example/output/hgt_eval/sim1HP.vcf. Bases 1322000 to 1350000 of the donor have been inserted at base 1120262 of the acceptor. This is indicated by two breakpoints in the vcf, one representing the beginning of the insert (acceptor 1120261, donor 1322000) and one representing the end of the insert (acceptor 1120263, donor 1350000).

The example/output/hgt_eval/sim1HP.tsv also provides a more intuitive representation of putative transferred regions, but please note that those candidates have not been filtered by the sampling values.

Daisy TSV header
#AN: Acceptor name
#DN: Donor name
#AS: Acceptor start position
#AE: Acceptor end position
#DS: Donor start position
#DE: Donor end position
#MC: Mean coverage in region
#Split: Total number of split reads per region (including duplicates!)
#PS-S: Pairs spanning HGT boundaries
#PS-W: Pairs within HGT boundaries
#Phage: PS-S and PS-W reads mapping to phage database
#BS:MC/PS-S/PS-W: Percent of bootstrapped random regions with MC/PS-S/PS-W smaller than candidate
Daisy Results TSV
AN | DN | AS | AE | MC | BS:MC | DS | DE | MC | Split | PS-S | PS-W | Phage | BS:MC | BS:PS-S | BS:PS-W
NZ_CP010445.1 | NZ_AP014710.1 | 1880235 | 1880237 | 44.00 | 7 | 1322002 | 1350000 | 94.62 | 152 | 182 | 8712 | 0.0000 | 100 | 100 | 100
NZ_CP010445.1 | NZ_CP015434.1 | 3904873 | 3904886 | 40.54 | 3 | 114928 | 126957 | 30.41 | 871 | 156 | 884 | 0.0000 | 100 | 100 | 100
NZ_CP010445.1 | NZ_CP015434.1 | 3904873 | 3916007 | 97.63 | 20 | 125626 | 126957 | 129.68 | 1571 | 108 | 258 | 0.0000 | 100 | 100 | 100
NZ_CP010445.1 | NZ_CP015434.1 | 3904885 | 3916007 | 97.69 | 18 | 114927 | 125626 | 18.06 | 253 | 43 | 279 | 0.0000 | 100 | 99 | 100
NC_010473.1 | NZ_AP014710.1 | 1120261 | 1120263 | 43.00 | 3 | 1322002 | 1350000 | 94.62 | 154 | 182 | 8712 | 0.0000 | 100 | 100 | 100
Daisy VCF header
##fileformat=VCFv4.2
##source=DAISY
##INFO=<ID=EVENT,Number=1,Type=String,Description="Event identifier for breakends.">
##contig=<ID=NC_010473.1>
##contig=<ID=NZ_CP010445.1>
##contig=<ID=NZ_CP015434.1>
##contig=<ID=NZ_CP014620.1>
##contig=<ID=NZ_CP008788.1>
##contig=<ID=NZ_AP014710.1>
##contig=<ID=NZ_CP016182.1>
Daisy Results VCF
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
NZ_CP010445.1 1880235 BND_1_1 A A[NZ_AP014710.1:1322002[ PASS SVTYPE=BND;EVENT=HGT1 . 1
NZ_CP010445.1 1880237 BND_1_2 G ]NZ_AP014710.1:1350000]G PASS SVTYPE=BND;EVENT=HGT1 . 1
NZ_CP010445.1 3904873 BND_1_1 T T[NZ_CP015434.1:114928[ PASS SVTYPE=BND;EVENT=HGT1 . 1
NZ_CP010445.1 3904886 BND_1_2 C ]NZ_CP015434.1:126957]C PASS SVTYPE=BND;EVENT=HGT1 . 1
NC_010473.1 1120261 BND_1_1 A A[NZ_AP014710.1:1322002[ PASS SVTYPE=BND;EVENT=HGT1 . 1
NC_010473.1 1120263 BND_1_2 G ]NZ_AP014710.1:1350000]G PASS SVTYPE=BND;EVENT=HGT1 . 1