BiasAway Documentation¶
Welcome to BiasAway - an open-source command-line tool and web-server that provide four approaches to generate nucleotide composition-matched DNA sequences.
Introduction¶
The BiasAway software tool is introduced to generate nucleotide composition-matched DNA sequences. It is available as open source code from bitbucket.
The tool provides users with four approaches to generate synthetic or genomic background sequences matching mono- or k-mer composition of user-provided foreground sequences:
- synthetic k-mer shuffled sequences
- synthetic k-mer shuffled sequences in a sliding window
- genomic mononucleotide distribution matched sequences
- genomic mononucleotide distribution within a sliding window matched sequences
The 1st approach shuffles each user-provided sequence independently while preserving the k-mer composition of the input sequences. The 2nd approach applies the same method as the 1st approach but within a sliding window along the user-provided sequences. For the 3rd and 4th approaches, the background sequences are selected from a pool of provided genomic sequences to match the mononucleotide distribution of each target sequence. The 4th approach considers the mean and standard deviation of %GC computed within the sliding window along the user-provided sequences to match the distribution of each user-provided sequence as closely as possible.
The sliding-window approaches were included because evolutionary changes such as insertion of repetitive sequences, local rearrangements, or biochemical missteps can leave the target sequences with sub-regions of distinct nucleotide composition.
Installation¶
BiasAway is available on PyPI, through Bioconda, and the source code is available on bitbucket. BiasAway takes care of the installation of all the required python modules. If you already have a working installation of python, the easiest way to install the required python modules is by installing biasaway using pip.
If you are setting up Python for the first time, we recommend installing it using the Conda or Miniconda Python distribution. This comes with several helpful scientific and data processing libraries and is available for platforms including Windows, macOS, and Linux.
You can use one of the following ways to install BiasAway.
Quick installation¶
Prerequisites¶
BiasAway requires the following Python modules:
- biopython: https://biopython.org
- numpy: https://numpy.org
- matplotlib: https://matplotlib.org/
- seaborn: https://seaborn.pydata.org/
Install biopython, numpy, matplotlib, and seaborn¶
BiasAway uses biopython, numpy, matplotlib, and seaborn; you can install them using pip or conda.
Note
If you install BiasAway using pip or bioconda, the prerequisites will be installed as well.
Install BiasAway using conda¶
BiasAway is available on Bioconda for installation via conda.
conda install -c bioconda biasaway
Install BiasAway using pip¶
BiasAway is available on PyPI for installation via pip.
pip install biasaway
Install BiasAway from source¶
You can install the development version by using git from our bitbucket repository at https://bitbucket.org/CBGR/biasaway.
Install development version from Bitbucket¶
If you have git installed, use this:
git clone https://bitbucket.org/CBGR/biasaway.git
cd biasaway
python setup.py sdist install
How to use BiasAway¶
Once you have installed BiasAway, you can type:
biasaway --help
It will print the main help, which lists the four subcommands/modules: k, w, g, and c.
usage: biasaway <subcommand> [options]
positional arguments <subcommand>: {k,w,g,c}
List of subcommands
k k-mer shuffling
w k-mer shuffling within a sliding window
g mononucleotide distribution matched
c mononucleotide distribution within a sliding window matched
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
To view the help for the individual subcommands, please type:
Note
Please check BiasAway modules to see a detailed summary of available options.
To view the k module help, type
biasaway k --help
To view the w module help, type
biasaway w --help
To view the g module help, type
biasaway g --help
To view the c module help, type
biasaway c --help
BiasAway modules¶
The BiasAway software tool is introduced to generate nucleotide composition-matched DNA sequences. It is available as open source code from bitbucket.
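The tool provides users with four approaches, each available as a module (subcommand), to generate synthetic or genomic background sequences matching mono- or k-mer composition of user-provided foreground sequences:
- k: k-mer shuffling
- w: k-mer shuffling within a sliding window
- g: genomic mononucleotide distribution matched
- c: genomic mononucleotide distribution within a sliding window matched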
Note
BiasAway can generate distribution plots for QC. Plots provide information about distribution of %GC, dinucleotides, and lengths for the input sequences and generated sequences. Moreover, BiasAway provides the following QC metrics for comparing these distributions whenever possible: mean absolute error and goodness of fit computed as Pearson’s chi-squared statistic, log-likelihood ratio test (G-test), and the Cressie-Read power divergence.
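As an illustration of the QC metrics named above, here is a minimal pure-Python sketch of the mean absolute error and of the Cressie-Read power divergence family, which contains Pearson's chi-squared statistic and the G-test as special cases. The function names and the example counts are hypothetical, and this is not necessarily how BiasAway computes its own statistics.

```python
import math

def mean_absolute_error(p, q):
    """Mean absolute difference between two equally long frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def power_divergence(observed, expected, lambda_):
    """Cressie-Read power divergence statistic for two count vectors.

    Assumes both vectors have the same total and no zero expected counts.
    lambda_ = 1 gives Pearson's chi-squared statistic, lambda_ = 0 the
    G-test statistic, and lambda_ = 2/3 the Cressie-Read statistic.
    """
    if lambda_ == 0:
        # Limit case: log-likelihood ratio (G) statistic.
        return 2.0 * sum(o * math.log(o / e)
                         for o, e in zip(observed, expected) if o > 0)
    total = sum(o * ((o / e) ** lambda_ - 1.0)
                for o, e in zip(observed, expected))
    return 2.0 * total / (lambda_ * (lambda_ + 1.0))

# Hypothetical counts for four categories in foreground and background.
fg_counts = [30, 20, 25, 25]
bg_counts = [28, 22, 26, 24]

pearson = power_divergence(fg_counts, bg_counts, 1)
g_test = power_divergence(fg_counts, bg_counts, 0)
cressie_read = power_divergence(fg_counts, bg_counts, 2 / 3)
```

Smaller values of all three statistics indicate that the two distributions are closer to each other.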
Note
BiasAway also comes with a Web App available at http://biasaway.uio.no.
K-mer shuffling¶
Each user-provided sequence will be shuffled to keep its k-mer composition. This module can be used for any k, for instance use -k 1 for conserving the mononucleotide composition of the input sequences.
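To make "keeping the k-mer composition" concrete, the sketch below counts overlapping k-mers and shuffles a sequence for the simplest case, k = 1. The helper names are hypothetical and this is not BiasAway's implementation; preserving k-mers for k > 1 needs a more careful algorithm that is not shown here.

```python
import random
from collections import Counter

def kmer_counts(seq, k):
    """Count the overlapping k-mers of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shuffle_mononucleotides(seq, seed=None):
    """Return a random permutation of seq (k = 1 case only)."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle reproducible
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

seq = "ACGTTGCAACGGTTAA"
shuffled = shuffle_mononucleotides(seq, seed=42)

# The shuffled sequence has the same 1-mer composition as the input.
print(kmer_counts(seq, 1) == kmer_counts(shuffled, 1))  # True
```

The same kmer_counts helper can be used with k = 2 to check that the output of a dinucleotide shuffle has the same dinucleotide counts as its input.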
Usage:
biasaway k [options]
Note
Please scroll down to see a detailed summary of available options.
Help:
biasaway k --help
Example:
biasaway k -f path/to/FASTA/file/my_fasta_file.fa
It will output the generated sequences on stdout, keeping the dinucleotide composition of the input sequence by default (k-mer with k=2 is the default). If you wish to save the sequences in a specific file, you can type:
biasaway k -f path/to/FASTA/file/my_fasta_file.fa > path/to/output/FASTA/file/my_fasta_output.fa
Summary of options
Option | Description |
---|---|
-h, --help | To show the help message and exit |
-f, --foreground | Foreground file in fasta format. |
-k, --kmer | K-mer to be used for shuffling (default: 2 for dinucleotide shuffling) |
-n, --nfold | How many background sequences per each foreground sequence will be generated (default: 1) |
-e, --seed | Seed number to initialize the random number generator for reproducibility (default: integer from the current time) |
-p, --plot-filename | Base filename for all the plots and related statistics looking at %GC, dinucleotide, and lengths distributions (default: not activated so no plot and statistics produced) |
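When calling the module from a script, the options in the table above can be assembled programmatically. The following is a small sketch that only uses the flags documented in the table; the helper name biasaway_k_command and the file name are hypothetical.

```python
import shlex

def biasaway_k_command(fasta, kmer=2, nfold=1, seed=None):
    """Build the argument list for the `biasaway k` subcommand."""
    cmd = ["biasaway", "k", "-f", fasta, "-k", str(kmer), "-n", str(nfold)]
    if seed is not None:
        # Passing an explicit seed makes the run reproducible.
        cmd += ["-e", str(seed)]
    return cmd

cmd = biasaway_k_command("my_fasta_file.fa", kmer=2, nfold=3, seed=1234)
print(shlex.join(cmd))  # biasaway k -f my_fasta_file.fa -k 2 -n 3 -e 1234
```

The resulting list can be passed to subprocess.run with stdout redirected to a file to save the generated sequences.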
K-mer shuffling within a sliding window¶
For each user-provided sequence, a window will slide along to shuffle the nucleotides within the window, keeping the local k-mer composition. As such, the generated sequences will preserve the local k-mer composition of the input sequences along them.
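The idea can be sketched for the mononucleotide case (k = 1) as follows. This is one plausible reading of the description above, with a hypothetical function name; how BiasAway treats overlapping windows and k > 1 may differ.

```python
import random

def shuffle_in_windows(seq, winlen=100, step=50, seed=None):
    """Shuffle the nucleotides inside each window as it slides along seq."""
    rng = random.Random(seed)
    letters = list(seq)
    for start in range(0, len(letters), step):
        # The last windows may be shorter than winlen at the end of the sequence.
        window = letters[start:start + winlen]
        rng.shuffle(window)
        letters[start:start + winlen] = window
    return "".join(letters)

# A sequence with an A-only half followed by a C-only half.
seq = "A" * 100 + "C" * 100
local = shuffle_in_windows(seq, winlen=20, step=20, seed=1)
```

Because every nucleotide only moves inside a window, the overall composition is unchanged and, with non-overlapping windows as in this example, the A-rich and C-rich halves stay where they were.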
Usage:
biasaway w [options]
Note
Please scroll down to see a detailed summary of available options.
Help:
biasaway w --help
Example:
biasaway w -f path/to/FASTA/file/my_fasta_file.fa
It will output the generated sequences on stdout, keeping the local dinucleotide composition of the input sequences (k=2 for dinucleotide shuffling is used as default). If you wish to save the sequences in a specific file, you can type:
biasaway w -f path/to/FASTA/file/my_fasta_file.fa > path/to/output/FASTA/file/my_fasta_output.fa
Summary of options
Option | Description |
---|---|
-h, --help | To show the help message and exit |
-f, --foreground | Foreground file in fasta format. |
-k, --kmer | K-mer to be used for shuffling (default: 2 for dinucleotide shuffling) |
-n, --nfold | How many background sequences per each foreground sequence will be generated (default: 1) |
-w, --winlen | Window length (default: 100) |
-s, --step | Sliding step (default: 50) |
-e, --seed | Seed number to initialize the random number generator for reproducibility (default: integer from the current time) |
-p, --plot-filename | Base filename for all the plots and related statistics looking at %GC, dinucleotide, and lengths distributions (default: not activated so no plot and statistics produced) |
Genomic mononucleotide distribution matched¶
Given a set of available background sequences (pre-computed or provided by the user), each user-provided foreground sequence will be matched to a background sequence having the same mononucleotide composition.
The first time you run this module, you need to provide a set of potential background sequences using the --background argument. The --bgdirectory argument is required and names the directory that will contain the decomposition of the background sequences in dedicated files per %GC content.
If you already have such a pre-computed background directory, you can use the --bgdirectory argument alone to speed up the process.
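The matching principle can be sketched in memory as follows: background sequences are grouped by %GC, and each foreground sequence draws a background sequence from the group with the same %GC. The function names, the integer rounding of %GC, and sampling with replacement are all assumptions made for illustration; BiasAway's on-disk background directory and selection rules may differ.

```python
import random
from collections import defaultdict

def gc_percent(seq):
    """Percentage of G and C nucleotides, rounded to an integer (seq must be non-empty)."""
    seq = seq.upper()
    return round(100.0 * (seq.count("G") + seq.count("C")) / len(seq))

def bin_by_gc(background):
    """Group background sequences by their rounded %GC."""
    bins = defaultdict(list)
    for seq in background:
        bins[gc_percent(seq)].append(seq)
    return bins

def match_background(foreground, bins, seed=None):
    """Pick, for each foreground sequence, a background sequence with the same %GC.

    Returns None for a foreground sequence whose %GC bin is empty.
    """
    rng = random.Random(seed)
    matches = []
    for seq in foreground:
        candidates = bins.get(gc_percent(seq), [])
        matches.append(rng.choice(candidates) if candidates else None)
    return matches

# Hypothetical toy sequences.
bins = bin_by_gc(["GGCC", "ATAT", "GCAT", "AAAA"])
matches = match_background(["CCGG", "TTAA", "ACGT", "GGGA"], bins, seed=0)
```

Grouping the background once and reusing the groups mirrors why a pre-computed background directory speeds up later runs.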
Usage:
biasaway g [options]
Note
Please scroll down to see a detailed summary of available options.
Help:
biasaway g --help
Example:
biasaway g -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory
It will output the generated sequences on stdout. If you wish to save the sequences in a specific file, you can type:
biasaway g -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory > path/to/output/FASTA/file/my_fasta_output.fa
Summary of options
Option | Description |
---|---|
-h, --help | To show the help message and exit |
-f, --foreground | Foreground file in fasta format. |
-n, --nfold | How many background sequences per each foreground sequence will be generated (default: 1) |
-r, --bgdirectory | Background directory (must be empty if --background is used). See documentation for details. |
-b, --background | Background file in fasta format. Not necessary if a background directory has already been computed previously. |
-l, --length | Try to match the length as closely as possible (not set by default) |
-e, --seed | Seed number to initialize the random number generator for reproducibility (default: integer from the current time) |
-p, --plot-filename | Base filename for all the plots and related statistics looking at %GC, dinucleotide, and lengths distributions (default: not activated so no plot and statistics produced) |
Genomic mononucleotide distribution within a sliding window matched¶
Given a set of available background sequences (pre-computed or provided by the user), each user-provided foreground sequence will be matched to a background sequence having a close local mononucleotide composition. Specifically, the distributions of %GC composition in a sliding window are computed for foreground and background sequences; a foreground sequence with a mean m_f and standard deviation sdev_f of %GC in the sliding window is matched to a background sequence if its mean %GC m_b is such that:
m_f - N * sdev_f <= m_b <= m_f + N * sdev_f
with N equal to 2.6 by default (it can be changed with the -d, --deviation option).
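The criterion above can be checked for one foreground/background pair with the short sketch below. The function names are hypothetical, and the use of the population standard deviation and the handling of sequences shorter than the window are assumptions, so results may not match BiasAway exactly.

```python
from statistics import mean, pstdev

def window_gc(seq, winlen=100, step=50):
    """%GC of each window sliding along seq (a single value if seq is shorter than winlen)."""
    seq = seq.upper()
    last_start = max(len(seq) - winlen, 0)
    values = []
    for start in range(0, last_start + 1, step):
        window = seq[start:start + winlen]
        values.append(100.0 * (window.count("G") + window.count("C")) / len(window))
    return values

def is_match(fg_seq, bg_seq, deviation=2.6, winlen=100, step=50):
    """Check m_f - N * sdev_f <= m_b <= m_f + N * sdev_f for one sequence pair."""
    fg_values = window_gc(fg_seq, winlen, step)
    m_f, sdev_f = mean(fg_values), pstdev(fg_values)
    m_b = mean(window_gc(bg_seq, winlen, step))
    return m_f - deviation * sdev_f <= m_b <= m_f + deviation * sdev_f

# Hypothetical foreground: a GC-only half followed by an AT-only half.
fg = "GC" * 50 + "AT" * 50
print(window_gc(fg))  # [100.0, 50.0, 0.0]
```

A foreground sequence with strongly varying local %GC, like the one above, has a large sdev_f and therefore accepts a wide range of background means; a homogeneous foreground sequence accepts only background sequences with a very similar mean.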
The first time you run this module, you need to provide a set of potential background sequences using the --background argument. The --bgdirectory argument is required and names the directory that will contain the decomposition of the background sequences in dedicated files per %GC content.
If you already have such a pre-computed background directory, you can use the --bgdirectory argument alone to speed up the process.
Usage:
biasaway c [options]
Note
Please scroll down to see a detailed summary of available options.
Help:
biasaway c --help
Example:
biasaway c -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory
It will output the generated sequences on stdout. If you wish to save the sequences in a specific file, you can type:
biasaway c -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory > path/to/output/FASTA/file/my_fasta_output.fa
Summary of options
Option | Description |
---|---|
-h, --help | To show the help message and exit |
-f, --foreground | Foreground file in fasta format. |
-n, --nfold | How many background sequences per each foreground sequence will be generated (default: 1) |
-r, --bgdirectory | Background directory (must be empty if --background is used). See documentation for details. |
-b, --background | Background file in fasta format. Not necessary if a background directory has already been computed previously. |
-l, --length | Try to match the length as closely as possible (not set by default) |
-w, --winlen | Window length (default: 100) |
-s, --step | Sliding step (default: 50) |
-d, --deviation | Deviation from the mean (default: 2.6 for a threshold of mean + 2.6 * stdev) |
-e, --seed | Seed number to initialize the random number generator for reproducibility (default: integer from the current time) |
-p, --plot-filename | Base filename for all the plots and related statistics looking at %GC, dinucleotide, and lengths distributions (default: not activated so no plot and statistics produced) |
BiasAway web-server¶
Introduction¶
The BiasAway web-server provides an interactive and easy to use interface for users to upload FASTA files and to generate background sequences. It comes with precomputed genomic partitions of 100, 250, 500, 750, and 1000 bp bins for the genome of nine species (Arabidopsis thaliana; Caenorhabditis elegans; Danio rerio; Drosophila melanogaster; Homo sapiens; Mus musculus; Rattus norvegicus; Saccharomyces cerevisiae; and Schizosaccharomyces pombe). These background sequences are provided through Zenodo at 10.5281/zenodo.3923866. These background sequences were generated using the script at https://bitbucket.org/CBGR/biasaway_background_construction, which can be used by users to generate their own background sequences. The result page provides information about mononucleotide, dinucleotide, and length distributions for the provided and generated sequences for comparison.
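BiasAway has four modules: k-mer shuffling, k-mer shuffling within a sliding window, genomic mononucleotide distribution matched, and genomic mononucleotide distribution within a sliding window matched.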

Note
The BiasAway web-application automatically generates distribution plots for QC. Plots provide information about distribution of %GC, dinucleotides, and lengths for the input sequences and generated sequences. Moreover, BiasAway provides the following QC metrics for comparing these distributions whenever possible: mean absolute error and goodness of fit computed as Pearson’s chi-squared statistic, log-likelihood ratio test (G-test), and the Cressie-Read power divergence.
Below are screenshots for individual modules.
K-mer shuffling¶
This module should be run when the user aims at preserving the global k-mer nucleotide frequencies of input sequences.

K-mer shuffling within a sliding window¶
This module should be run when the user aims at preserving the local k-mer nucleotide frequencies of input sequences.

Genomic mononucleotide distribution matched¶
This module should be run when the user aims at selecting genuine genomic background sequences from a pool of provided genomic sequences to match the mononucleotide distribution of each target sequence.

Genomic mononucleotide distribution within a sliding window matched¶
This module should be run when the user aims at selecting genuine genomic background sequences from a pool of provided genomic sequences to match the local mononucleotide distribution of each target sequence.

Example result page and QC plots¶
BiasAway provides quality control (QC) plots and metrics to assess the similarity of the mono- and dinucleotide, and length distributions for the foreground and background sequences. Specifically, four plots are provided to visualize how similar the foreground and background sequences are when considering (1) their distributions of %GC content using density plots, (2) their dinucleotide contents considering all IUPAC nucleotides using a heatmap, (3) their dinucleotide contents considering adenine, cytosine, guanine, and thymine nucleotides using a heatmap, and (4) their distributions of lengths.

Generation of background repositories¶
Modules g and c of BiasAway require the generation of a background repository for the genome of interest. This can be created with the script located at our Bitbucket repository.
Our BiasAway Web-Server contains precomputed background repositories for 9 species. The genome fasta files used to create these can be found below:
- Homo sapiens: GRCh38/hg38
- Mus musculus: mm10
- Rattus norvegicus: Rnor 6.0
- Arabidopsis thaliana: TAIR10
- Danio rerio: GRCz11
- Drosophila melanogaster: dm6
- Caenorhabditis elegans: WBcel235
- Saccharomyces cerevisiae
- Schizosaccharomyces pombe: ASM294v2
Please note that some genome fasta files are separated by chromosomes in their original repositories. In that case, please make sure to concatenate all chromosome fasta files in one single genome fasta file.
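As one way to do this, the sketch below concatenates per-chromosome FASTA files into a single genome FASTA file. The function name and the file names in the usage note are hypothetical.

```python
def concatenate_fasta(chromosome_files, genome_file):
    """Concatenate per-chromosome FASTA files into one genome FASTA file."""
    with open(genome_file, "w") as out:
        for path in chromosome_files:
            with open(path) as src:
                last = "\n"
                for line in src:
                    out.write(line)
                    last = line
                # Add a newline if the file did not end with one, so the next
                # header does not get glued to the previous sequence line.
                if not last.endswith("\n"):
                    out.write("\n")
```

For example, concatenate_fasta(["chr1.fa", "chr2.fa"], "genome.fa") writes the records of both files, in that order, to genome.fa.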
We also provide a collection of precomputed background repositories for the nine organisms mentioned above using k-mers of size 100, 250, 500, 750, and 1000 base pairs. They can be found as individual compressed files in our Zenodo repository.
Support¶
If you have questions, or find any bug in the program, please write to us at anthony.mathelier[at]ncmm.uio.no and azizk[at]stanford.edu.
You can also report issues to our bitbucket repo.
Citation¶
If you used BiasAway, please cite:
- A. Khan, R. Riudavets Puig, P. Boddie, and A. Mathelier. BiasAway: command-line and web server to generate nucleotide composition-matched DNA background sequences, 2020.
- R. Worsley-Hunt et al. Improving analysis of transcription factor binding sites within ChIP-Seq data based on topological motif enrichment, BMC Genomics 2014; 10.1186/1471-2164-15-472