Gene Regulation¶
This git repository holds shared code for the analysis of Next Generation Sequencing data related to gene regulation: ChIP-seq, RNA-seq, and related technologies.
It was developed as part of the France Genomique Workpackage 2.6: Gene expression regulation.
One of the key goals is to ensure portability and re-usability of the code.
We have chosen to use the Snakemake workflow management system[1] in order to build reproducible and flexible NGS analysis pipelines.
Contact¶
- Claire Rioualen claire.rioualen@inserm.fr
- Jacques van Helden Jacques.van-helden@univ-amu.fr
References¶
- Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.
User guide and reference¶
Getting started¶
The commands and tutorials below are designed for a Unix environment, and were tested under the OS Linux Mint Debian Edition (LMDE).
They assume that the gene-regulation library is downloaded or cloned into your home directory. You can specify any other preferred directory.
Download Gene-regulation¶
cd ${HOME}
wget --no-clobber https://github.com/rioualen/gene-regulation/archive/4.0.tar.gz
tar xvzf 4.0.tar.gz
GENE_REG_PATH=${HOME}/gene-regulation-4.0
Clone Gene-regulation¶
cd ${HOME}
git clone https://github.com/rioualen/gene-regulation.git
Run Gene-regulation¶
Please refer to the Tutorials section.
Gene-regulation library¶
The library contains a variety of files, including scripts and configuration files.
You can find a description of these hereafter.
Snakemake files (snakefiles)¶
Snakefiles are based on the scripting language Python 3, and use a specific syntax.
For organizational purposes, we distinguish two types of Snakefiles:
- Rules are typically “bricks” to build workflows with. Each rule corresponds to a specific operation.
- Workflows are combinations of rules that serve a specific purpose: quality check of sequencing data, ChIP-seq peaks analysis...
Rules (.rules)¶
todo
Python scripts (.py)¶
todo
R scripts¶
todo
Configuration files (yaml)¶
todo
R markdown files (.Rmd)¶
todo
Dependencies¶
These manuals aim to help you install the programs and dependencies used in the Gene-regulation library.
Some of them are mandatory, and some are optional, depending on the Snakemake workflows you need to run.
They were tested under Ubuntu 14.04.
Manual installation¶
This manual is organized in sections, so you can cherry-pick the programs you want to install manually. For "all-inclusive" solutions, please refer to the following sections.
General¶
Generic tools¶
Rsync is an open source utility that provides fast incremental file transfer.
sudo apt-get install rsync
Git is a version control system (VCS) for tracking changes in computer files and coordinating work on those files among multiple people.
sudo apt-get install git
Optional:
- Create an account on GitHub.
- Add your ssh public key to your GitHub account settings (account > settings > SSH keys > add SSH key).
less ~/.ssh/id_rsa.pub
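If you do not have an SSH key pair yet, you can generate one first (a minimal sketch; accepting the default location creates ~/.ssh/id_rsa.pub):
ssh-keygen -t rsa -b 4096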
Zlib is a compression library required by several tools, including Sickle and Bamtools.
sudo apt-get install libz-dev
Java is required by several tools using GUIs, such as FastQC or IGV.
Java 9 seems to cause issues with IGV, so we use Java 8 here.
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java8-installer
Check installation:
java -version
Create bin/ and app_sources/ (optional)¶
While some programs will be installed completely automatically, others will not. Here we create a directory that will be used for manual installations.
mkdir $HOME/bin
mkdir $HOME/app_sources
You might then have to edit your $PATH manually (see next section).
Edit $PATH¶
In order to use manually installed programs and make them executable, you may have to update your $PATH environment variable. You can do so by editing the ~/.profile file.
nano ~/.profile
Find the following paragraph and add the path to your manually installed executables:
# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
PATH="$HOME/bin:$PATH"
fi
Source the file to apply the change.
source ~/.profile
Python¶
Snakemake requires Python 3.3+. You can check your version by issuing the following commands in a terminal:
python --version
python3 --version
If you don't have Python 3, you should install it.
sudo apt-get install python3
Install the Python package managers and development libraries.
apt-get install python-dev
apt-get install python3.4-dev
sudo apt-get install python-pip
sudo apt-get install python3-pip
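You can check that the package managers were properly installed:
pip --version
pip3 --version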
Pandas library¶
Python Data Analysis Library is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
This library is used to read the tab-delimited files used by the workflows (see files samples.tab and design.tab).
pip3 install pandas
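You can check that the library is importable (this simply prints the installed version):
python3 -c "import pandas; print(pandas.__version__)"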
R¶
You can fetch a CRAN mirror here.
sudo sh -c "echo 'deb <your mirror> trusty/' >> /etc/apt/sources.list" ## Repository for Ubuntu 14.04 Trusty Tahr
#sudo sh -c "echo 'deb http://ftp.igh.cnrs.fr/pub/CRAN/ trusty/' >> /etc/apt/sources.list" ## Mirror in Montpellier, France
sudo apt-get -y update
sudo apt-get -y install r-base r-base-dev libcurl4-openssl-dev libxml2-dev
echo "r <- getOption('repos'); r['CRAN'] <- 'http://cran.us.r-project.org'; options(repos = r);" >> ~/.Rprofile
Check installation:
R --version
Snakemake¶
“Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to compute clusters without modifying the workflow. It is the first system to support the use of automatically inferred multiple named wildcards (or variables) in input and output filenames.”
(Köster and Rahmann, 2012)
- Documentation
- FAQ
- Forum
- See also Snakemake section for tutorials.
NB: Python 3 and pip3 are required (see this section).
pip3 install snakemake
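Check installation:
snakemake --version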
You can check that snakemake works properly with this basic script.
"""Snakefile to test basic functions of snakemake.
"""
rule all:
input: expand("bye.txt")
rule hello:
"""Write HELLO in a text file named hello.txt.
"""
output: "hello.txt"
message: "Generating {output} file."
shell: "echo HELLO > {output}"
rule bye:
"""Write BYE in a text file named bye.txt.
"""
input: "hello.txt"
output: "bye.txt"
message: "Generating {output} file."
shell: "echo BYE > {output}"
touch $HOME/hello.py
nano $HOME/hello.py ## copy/paste script above and save
Execute the workflow; two files should be created: hello.txt and bye.txt.
cd ; snakemake -s hello.py
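You can also preview the jobs without executing them (dry run, option -n), then check the content of the resulting files:
snakemake -s hello.py -np
cat hello.txt bye.txt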
In case you need to upgrade snakemake:
pip3 install snakemake --upgrade
If you want to use Snakemake reports function (optional):
pip3 install docutils
Quality control¶
FastQC¶
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
The main functions of FastQC are:
- Import of data from BAM, SAM or FastQ files (any variant)
- Providing a quick overview to tell you in which areas there may be problems
- Summary graphs and tables to quickly assess your data
- Export of results to an HTML based permanent report
- Offline operation to allow automated generation of reports without running the interactive application
FastQC is available from linux repositories:
sudo apt-get install fastqc
However, the repository version is older and can cause dependency problems.
We recommend installing it manually:
cd $HOME/app_sources
wget --no-clobber http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip -o fastqc_v0.11.5.zip
chmod +x FastQC/fastqc
ln -s -f $HOME/app_sources/FastQC/fastqc $HOME/bin/fastqc
NB: FastQC requires Java (even for command-line use). See the dedicated section to install it.
Check installation:
fastqc --version
MultiQC¶
MultiQC searches a given directory for analysis logs and compiles an HTML report. It's a general-use tool, perfect for summarising the output from numerous bioinformatics tools.
sudo pip install multiqc
NB: a bug can appear depending on versions:
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/matplotlib Storing debug log for failure in /home/gr/.pip/pip.log
If so, it can be avoided by installing the Ubuntu dependencies, then reinstalling MultiQC:
sudo apt-get install libfreetype6-dev python-matplotlib
sudo pip install multiqc
Check installation:
multiqc --version
Trimming¶
The quality of the reads generated by high-throughput sequencing technologies tends to decrease toward their ends. Trimming consists of cutting off these low-quality ends, thus improving the quality of the reads before mapping.
Sickle¶
Sickle is a trimming tool which improves the quality of NGS reads.
Sickle uses a sliding window to compute sequencing quality along the reads. When the quality falls below a chosen threshold, the read is cut. If the remaining read is too short, it is removed entirely. Sickle handles three types of quality encoding: Illumina, Solexa and Sanger.
- Pre-requisite: install zlib (see the dedicated section).
- Clone the git repository into your bin (see the dedicated section) and run make.
cd $HOME/app_sources
git clone https://github.com/najoshi/sickle.git
cd sickle
make
cp sickle $HOME/bin
Check installation:
sickle --version
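For instance, single-end reads could be trimmed with a command along these lines (a hedged sketch; file names and thresholds are placeholders to adapt to your data):
sickle se -f reads.fastq -t sanger -q 20 -l 20 -o reads_trimmed.fastq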
Cutadapt¶
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
pip install --user --upgrade cutadapt
mv /root/.local/bin/cutadapt $HOME/bin
Check installation:
cutadapt --version
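For instance, an adapter could be removed from single-end reads with a command like the following (a hedged sketch; the adapter sequence and file names are placeholders):
cutadapt -a AGATCGGAAGAGC -q 20 -m 20 -o reads_trimmed.fastq reads.fastq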
TrimGalore¶
In our workflows we use TrimGalore, a wrapper around Cutadapt and FastQC. It should be installed if you want to run the Cutadapt trimming step of the workflows.
cutadapt --version # Check that cutadapt is installed
fastqc -v # Check that FastQC is installed
cd $HOME/app_sources
curl -fsSL https://github.com/FelixKrueger/TrimGalore/archive/0.4.3.tar.gz -o trim_galore.tar.gz
tar xvzf trim_galore.tar.gz
mv TrimGalore-0.4.3/trim_galore $HOME/bin
Check installation:
trim_galore --version
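A typical single-end run could look like this (a hedged sketch; file and directory names are placeholders):
trim_galore --quality 20 --output_dir trimmed/ reads.fastq.gz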
BBDuk¶
cd $HOME/app_sources
wget https://sourceforge.net/projects/bbmap/files/BBMap_37.31.tar.gz
tar xvzf BBMap_37.31.tar.gz
cp `find bbmap/ -maxdepth 1 -executable -type f` $HOME/bin
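BBDuk can then be used for adapter and quality trimming, for instance (a hedged sketch; the file names, including the adapter reference, are placeholders):
bbduk.sh in=reads.fastq out=reads_clean.fastq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 qtrim=r trimq=20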
Alignment/mapping¶
The point of mapping is to place the reads obtained from the sequencing step onto a reference genome. When a read is long enough, it can be mapped onto the genome with good confidence, by tolerating a certain number of so-called mismatches. However, genomes can contain repeated regions that are harder to deal with.
We call "sequencing depth" the average number of reads mapped at each position of the genome. The greater the sequencing depth, the better the quality of the alignment, and the better the downstream analyses.
BWA¶
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It is designed for short-read alignment.
Li H. and Durbin R. (2009). Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60.
sudo apt-get install bwa
Check installation:
bwa
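A typical single-end run consists of indexing the genome once, then aligning the reads (a hedged sketch; file names are placeholders):
bwa index genome.fa
bwa mem genome.fa reads.fastq > reads_bwa.sam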
Bowtie¶
Bowtie performs ungapped alignment, and is therefore not suitable for certain types of data, like RNA-seq data.
cd $HOME/app_sources
wget --no-clobber http://downloads.sourceforge.net/project/bowtie-bio/bowtie/1.1.1/bowtie-1.1.1-linux-x86_64.zip
unzip bowtie-1.1.1-linux-x86_64.zip
cp `find bowtie-1.1.1/ -maxdepth 1 -executable -type f` $HOME/bin
Check installation:
bowtie --help
Bowtie2¶
“Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index (based on the Burrows-Wheeler Transform or BWT) to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 gigabytes of RAM. Bowtie 2 supports gapped, local, and paired-end alignment modes. Multiple processors can be used simultaneously to achieve greater alignment speed. Bowtie 2 outputs alignments in SAM format, enabling interoperation with a large number of other tools (e.g. SAMtools, GATK) that use SAM. Bowtie 2 is distributed under the GPLv3 license, and it runs on the command line under Windows, Mac OS X and Linux.”
Reference:
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10:R25. DOI: 10.1186/gb-2009-10-3-r25
cd $HOME/app_sources
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.6/bowtie2-2.2.6-linux-x86_64.zip
unzip bowtie2-2.2.6-linux-x86_64.zip
cp `find bowtie2-2.2.6/ -maxdepth 1 -executable -type f` $HOME/bin
Check installation:
bowtie2 --version
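A typical single-end run consists of building the genome index once, then aligning the reads (a hedged sketch; file names are placeholders):
bowtie2-build genome.fa genome_index
bowtie2 -x genome_index -U reads.fastq -S reads_bowtie2.sam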
Subread-align¶
The Subread package comprises a suite of software programs for processing next-gen sequencing read data including:
- Subread: a general-purpose read aligner which can align both genomic DNA-seq and RNA-seq reads. It can also be used to discover genomic mutations including short indels and structural variants.
- Subjunc: a read aligner developed for aligning RNA-seq reads and for the detection of exon-exon junctions. Gene fusion events can be detected as well.
- featureCounts: a software program developed for counting reads to genomic features such as genes, exons, promoters and genomic bins.
- exactSNP: a SNP caller that discovers SNPs by testing signals against local background noises.
Reference:
Liao Y, Smyth GK and Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108, 2013
cd $HOME/app_sources
wget -nc https://sourceforge.net/projects/subread/files/subread-1.5.2/subread-1.5.2-source.tar.gz
tar zxvf subread-1.5.2-source.tar.gz
cd subread-1.5.2-source/src
make -f Makefile.Linux
cd ../bin
cp `find * -executable -type f` $HOME/bin
Check installation:
subread-align --version
Tophat¶
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
cd $HOME/app_sources
wget --no-clobber https://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.14.Linux_x86_64.tar.gz
tar xvfz tophat-2.0.14.Linux_x86_64.tar.gz
cd tophat-2.0.14.Linux_x86_64
rm -Rf AUTHORS LICENSE README intervaltree/ sortedcontainers/
mv ./* $HOME/bin
cd ..; rm -Rf tophat-2.0.14.Linux_x86_64*
Check installation:
tophat --version
Peak-calling¶
The following tools can be used to perform ChIP-seq peak-calling.
Homer¶
Required in order to run the tutorials.
mkdir $HOME/app_sources/homer
cd $HOME/app_sources/homer
wget "http://homer.salk.edu/homer/configureHomer.pl"
perl configureHomer.pl -install homer
cp `find $HOME/app_sources/homer/bin -maxdepth 1 -executable -type f` $HOME/bin
The basic Homer installation does not contain any sequence data. To download sequences for use with HOMER, use the configureHomer.pl script. To get a list of available packages:
perl $HOME/app_sources/homer/configureHomer.pl -list
To install packages, simply use the -install option and the name(s) of the package(s).
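For example, a genome package could be installed like this (a hedged sketch; sacCer3 is only an example name, to be picked from the list above):
perl $HOME/app_sources/homer/configureHomer.pl -install sacCer3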
However, Homer can also work with custom genomes in FASTA format and gene annotations in GTF format. Thus the Gene-regulation workflows don't require installing any genome.
Check installation:
findMotifs.pl
MACS 1.4¶
Required in order to run the demo workflow “ChIP-seq” on dataset GSE20870 (in the tutorials section).
cd $HOME/app_sources
wget "https://github.com/downloads/taoliu/MACS/MACS-1.4.2.tar.gz"
tar -xvzf MACS-1.4.2.tar.gz
cd MACS-1.4.2
sudo python setup.py install
Check installation:
macs14 --version
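A typical run compares a ChIP sample against its input control (a hedged sketch; the file names and the effective genome size -g are placeholders to adapt to your organism):
macs14 -t chip.bam -c input.bam -f BAM -g 1.2e7 -n my_experiment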
MACS 2¶
Required in order to run the tutorials.
sudo apt-get install python-numpy
sudo pip install MACS2
Check installation:
macs2 --version
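The usage is similar to MACS 1.4 (a hedged sketch; the file names and the effective genome size are placeholders):
macs2 callpeak -t chip.bam -c input.bam -f BAM -g 1.2e7 -n my_experiment --outdir macs2_results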
bPeaks¶
A peak-caller developed specifically for yeast; it is only useful for processing small genomes.
It is currently not used in demo workflows, and is therefore not mandatory to run the tutorials.
Available as an R package.
install.packages("bPeaks")
library(bPeaks)
SPP¶
This peak-caller is used in the ChIP-seq study case GSE20870.
- Method 1: git
See github page.
Commands in R:
require(devtools)
devtools::install_github('hms-dbmi/spp', build_vignettes = FALSE)
- Method 2: Bioconductor [deprecated]
source("http://bioconductor.org/biocLite.R")
biocLite("spp")
install.packages("caTools")
install.packages("spp")
- Method 3: commandline [deprecated]
apt-get install libboost-all-dev
cd $HOME/app_sources
wget -nc http://compbio.med.harvard.edu/Supplements/ChIP-seq/spp_1.11.tar.gz
sudo R CMD INSTALL spp_1.11.tar.gz
- Method 4: the ultimate protocol for Ubuntu 14.04
In unix shell:
# unix libraries
sudo apt-get update
sudo apt-get -y install r-base
sudo apt-get -y install libboost-dev zlibc zlib1g-dev
In R shell:
# Rsamtools
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
In unix shell:
# spp
cd $HOME/app_sources
wget http://compbio.med.harvard.edu/Supplements/ChIP-seq/spp_1.11.tar.gz
sudo R CMD INSTALL spp_1.11.tar.gz
Check installation in R:
library(spp)
Mosaics¶
This peak-caller is used in the ChIP-seq study case GSE20870.
Installation in R from Bioconductor:
source("https://bioconductor.org/biocLite.R")
biocLite("mosaics")
SWEMBL¶
This peak-caller is used in the ChIP-seq study case GSE20870.
git clone https://github.com/stevenwilder/SWEMBL.git
cd SWEMBL
make
cp SWEMBL $HOME/bin
Deprecated method
cd $HOME/app_sources
wget "http://www.ebi.ac.uk/~swilder/SWEMBL/SWEMBL.3.3.1.tar.bz2" && \
bunzip2 -f SWEMBL.3.3.1.tar.bz2 && \
tar xvf SWEMBL.3.3.1.tar && \
rm SWEMBL.3.3.1.tar && \
chown -R ubuntu-user SWEMBL.3.3.1 && \
cd SWEMBL.3.3.1 && \
make
# This method requires a manual fix of the C flags in the makefile
# gcc main.c IO.c calc.c stack.c summit.c refcalc.c wiggle.c overlap.c -o SWEMBL -lz -lm
Motif discovery, motif analysis¶
These programs can be useful for the analysis of ChIP-seq peaks.
Regulatory Sequence Analysis Tools (RSAT)¶
See the dedicated section.
RSAT is a software suite specialized in the analysis of cis-regulatory motifs, developed by the teams of Morgane Thomas-Chollier (ENS, Paris) and Jacques van Helden (TAGC, Marseille). It includes tools dedicated to the analysis of ChIP-seq data.
RNA-seq¶
featureCounts¶
Liao Y, Smyth GK and Shi W. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30, 2014
Miscellaneous¶
SRA toolkit¶
This toolkit includes a number of programs allowing the conversion of *.sra files. fastq-dump translates *.sra files to *.fastq files.
You can download the latest version here, or issue the following commands:
cd $HOME/app_sources
wget -nc http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.5.2/sratoolkit.2.5.2-ubuntu64.tar.gz
tar xzf sratoolkit.2.5.2-ubuntu64.tar.gz
cp `find sratoolkit.2.5.2-ubuntu64/bin -maxdepth 1 -executable -type l` $HOME/bin
You can also install the SRA toolkit simply by issuing this command, but it will likely not be the most recent release:
sudo apt-get install sra-toolkit
fastq-dump --version
fastq-dump : 2.1.7
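For example, an .sra archive can be converted to fastq like this (a hedged sketch; the input file and output directory are placeholders):
fastq-dump --outdir fastq/ sample.sra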
Samtools¶
SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
SAMtools provides several tools to process such files.
cd $HOME/app_sources
wget --no-clobber http://sourceforge.net/projects/samtools/files/samtools/1.3/samtools-1.3.tar.bz2
bunzip2 -f samtools-1.3.tar.bz2
tar xvf samtools-1.3.tar
cd samtools-1.3
make
sudo make install
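Typical operations include converting SAM to BAM, then sorting and indexing the alignments (a hedged sketch; file names are placeholders):
samtools view -b reads.sam > reads.bam
samtools sort reads.bam -o reads_sorted.bam
samtools index reads_sorted.bam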
Bedtools¶
The bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.
sudo apt-get install bedtools
or get the latest version:
cd $HOME/app_sources
wget --no-clobber https://github.com/arq5x/bedtools2/releases/download/v2.24.0/bedtools-2.24.0.tar.gz
tar xvfz bedtools-2.24.0.tar.gz
cd bedtools2
make
sudo make install
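For example, peaks could be intersected with gene annotations like this (a hedged sketch; file names are placeholders):
bedtools intersect -a peaks.bed -b genes.bed > peaks_in_genes.bed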
Bedops¶
cd $HOME/app_sources
wget -nc https://github.com/bedops/bedops/releases/download/v2.4.19/bedops_linux_x86_64-v2.4.19.tar.bz2
tar jxvf bedops_linux_x86_64-v2.4.19.tar.bz2
mkdir bedops
mv bin bedops
cp bedops/bin/* $HOME/bin
Deeptools¶
cd $HOME/app_sources
git clone https://github.com/fidelram/deepTools
cd deepTools
python setup.py install
Picard tools¶
todo
Other¶
- MICSA: peak-calling & motifs discovery (publication).
- ChIPMunk: deep and wide digging for binding motifs in ChIP-Seq data (publication).
- HMCan: a method for detecting chromatin modifications in cancer samples using ChIP-seq data (publication).
- seqMINER
- Crunch project
- CSDeconv
- ...
Makefile¶
The Gene-regulation library comprises a makefile that can install most of the dependencies described in the previous section. It is recommended when you're setting up virtual environments, as described in these tutorials.
If you want to run the workflows on your personal computer or on a server, you should follow the manual installation, or contact a sysadmin.
The makefile currently allows running the following workflows:
- import_from_sra.wf
- quality_control.wf
- ChIP-seq.wf
It does not yet handle all the RNA-seq dependencies.
# it is assumed that you have defined a global variable with the path to the Gene-regulation library
cd ${GENE_REG_PATH}
make -f scripts/makefiles/install_tools_and_libs.mk all
source ~/.bashrc
Conda¶
A number of dependencies of Gene-regulation can be installed through a Conda environment. This list is not exhaustive.
conda install -c bioconda sickle=0.5
conda install -c bioconda bowtie=1.2.0
conda install -c bioconda bowtie2=2.3.0
conda install -c bioconda subread=1.5.0.post3
conda install -c bioconda tophat=2.1.1
conda install -c bioconda bwa=0.7.15
conda install -c bioconda fastqc=0.11.5
conda install -c bioconda macs2=2.1.1.20160309
conda install -c bioconda homer=4.8.3
conda install -c bioconda bedtools=2.26.0
conda install -c bioconda samtools=1.3.1
conda install -c bioconda bamtools=2.4.0
Tutorials¶
Demo: ChIP-seq analysis¶
Case: ChIP-seq study of Tbf1 in S. cerevisiae¶
Reference
Preti M, Ribeyre C, Pascali C, Bosio MC et al. The telomere-binding protein Tbf1 demarcates snoRNA gene promoters in Saccharomyces cerevisiae. Mol Cell 2010 May 28;38(4):614-20. PMID: 20513435
Abstract
Small nucleolar RNAs (snoRNAs) play a key role in ribosomal RNA biogenesis, yet factors controlling their expression are unknown. We found that the majority of Saccharomyces snoRNA promoters display an aRCCCTaa sequence motif at the upstream border of a TATA-containing nucleosome-free region. Genome-wide ChIP-seq analysis showed that these motifs are bound by Tbf1, a telomere-binding protein known to recognize mammalian-like T(2)AG(3) repeats at subtelomeric regions. Tbf1 has over 100 additional promoter targets, including several other genes involved in ribosome biogenesis and the TBF1 gene itself. Tbf1 is required for full snoRNA expression, yet it does not influence nucleosome positioning at snoRNA promoters. In contrast, Tbf1 contributes to nucleosome exclusion at non-snoRNA promoters, where it selectively colocalizes with the Tbf1-interacting zinc-finger proteins Vid22 and Ygr071c. Our data show that, besides the ribosomal protein gene regulator Rap1, a second telomere-binding protein also functions as a transcriptional regulator linked to yeast ribosome biogenesis.
Access link
- GEO series: GSE20870
Setup analysis environment¶
Here, we create a directory that will contain the raw data, the Gene-regulation library, the reference genome data and the results of the workflow(s) used.
We are going to use a global variable: $ANALYSIS_DIR.
ANALYSIS_DIR=$HOME/ChIP-seq_GSE20870
mkdir -p ${ANALYSIS_DIR}
cd ${ANALYSIS_DIR}
Download Gene-regulation¶
We are going to download the Gene-regulation library in the analysis directory. Another possibility would be to download Gene-regulation in a fixed place, and create a symlink to the analysis directory.
wget --no-clobber https://github.com/rioualen/gene-regulation/archive/4.0.tar.gz
tar xvzf 4.0.tar.gz
ln -s gene-regulation-4.0 gene-regulation
Download reference genome & annotations¶
Here, we are going to download the genome sequence and annotation files into the analysis directory. It is also possible to define a fixed location to store genomes and then create a symlink to it.
It can be useful to store all the genomes in one place, in order to avoid duplication of big files. Also, most mapping algorithms need to index the genome before proceeding with the alignment. This indexing only needs to be done once, but it takes time and storage space, so it's better to avoid duplicating it.
mkdir ${ANALYSIS_DIR}/genome
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.30.dna.genome.fa.gz -P ${ANALYSIS_DIR}/genome
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.30.gff3.gz -P ${ANALYSIS_DIR}/genome
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.30.gtf.gz -P ${ANALYSIS_DIR}/genome
gunzip ${ANALYSIS_DIR}/genome/*.gz
Download raw data¶
mkdir -p ${ANALYSIS_DIR}/data/GSM521934 ${ANALYSIS_DIR}/data/GSM521935
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX%2FSRX021%2FSRX021358/SRR051929/SRR051929.sra -P ${ANALYSIS_DIR}/data/GSM521934
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX%2FSRX021%2FSRX021359/SRR051930/SRR051930.sra -P ${ANALYSIS_DIR}/data/GSM521935
Workflow ‘import_from_sra’¶
The purpose of this workflow is to convert .sra files to .fastq files. The .sra format (Short Read Archive) is used by the GEO database, but downstream analyses require fastq-formatted files. You can check out the glossary to find out more about file formats.
Workflow execution¶
If you have followed the previous steps, you have a file organization that looks like this:

You should then be able to run the following command:
cd ${ANALYSIS_DIR}
snakemake -s gene-regulation/scripts/snakefiles/workflows/import_from_sra.wf -p --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml

Workflow ‘quality_control’¶
This workflow can be run after the workflow 'import_from_sra', or directly on properly-organized fastq files (see this section if you have your own data).
The purpose of this workflow is to perform a quality check with FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
Optionally, trimming can be performed using the tools Sickle or Cutadapt.
Workflow execution¶
cd ${ANALYSIS_DIR}
snakemake -s gene-regulation/scripts/snakefiles/workflows/quality_control.wf -p --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml

Workflow ‘ChIP-seq’¶
This workflow performs:
- mapping with various algorithms
- genome coverage computation in different formats (check out our glossary: http://gene-regulation.readthedocs.io/en/latest/wiki.html#glossary)
- peak-calling with various algorithms
- motif search using the RSAT suite
You must have run at least the workflow “import_from_sra”, and optionally the workflow “quality_control”.
Workflow execution¶
cd ${ANALYSIS_DIR}
snakemake -s gene-regulation/scripts/snakefiles/workflows/ChIP-seq.wf -p --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml

Demo: ChIP-seq and RNA-seq integration¶
Case: Genomic analysis of the scc2-4 mutant in budding yeast¶
Reference
Genomic analysis of the scc2-4 mutant in budding yeast
Musinu Zakari
GEO series
Setup workdir¶
ANALYSIS_DIR=$HOME/GSE55358_Integrated_analysis
mkdir ${ANALYSIS_DIR}
cd ${ANALYSIS_DIR}
Download the Gene-regulation library¶
wget --no-clobber https://github.com/rioualen/gene-regulation/archive/4.0.tar.gz
tar xvzf 4.0.tar.gz
ln -s gene-regulation-4.0 gene-regulation
Download reference genome & annotations¶
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.30.dna.genome.fa.gz -P ${ANALYSIS_DIR}/genome
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.30.gff3.gz -P ${ANALYSIS_DIR}/genome
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.30.gtf.gz -P ${ANALYSIS_DIR}/genome
gunzip ${ANALYSIS_DIR}/genome/*.gz
Workflow ‘ChIP-seq’¶
Download ChIP-seq data¶
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX476/SRX476133/SRR1176905/SRR1176905.sra -P ${ANALYSIS_DIR}/ChIP-seq_GSE55357/data/GSM1334674
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX476/SRX476135/SRR1176907/SRR1176907.sra -P ${ANALYSIS_DIR}/ChIP-seq_GSE55357/data/GSM1334676
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX476/SRX476136/SRR1176908/SRR1176908.sra -P ${ANALYSIS_DIR}/ChIP-seq_GSE55357/data/GSM1334679
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX476/SRX476138/SRR1176910/SRR1176910.sra -P ${ANALYSIS_DIR}/ChIP-seq_GSE55357/data/GSM1334677
Workflow execution¶
Your directory should now look like this:


And you should be able to execute it like this:
cd ${ANALYSIS_DIR}
snakemake -s gene-regulation/scripts/snakefiles/workflows/import_from_sra.wf -p --configfile gene-regulation/examples/ChIP-seq_GSE55357/config.yml
snakemake -s gene-regulation/scripts/snakefiles/workflows/quality_control.wf -p --configfile gene-regulation/examples/ChIP-seq_GSE55357/config.yml
snakemake -s gene-regulation/scripts/snakefiles/workflows/ChIP-seq.wf -p --configfile gene-regulation/examples/ChIP-seq_GSE55357/config.yml
Workflow ‘RNA-seq’ for differential expression analysis¶
Download RNA-seq data¶
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX476/SRX476122/SRR1176894/SRR1176894.sra -P ${ANALYSIS_DIR}/RNA-seq_GSE55316/data/GSM1334027
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX476/SRX476124/SRR1176896/SRR1176896.sra -P ${ANALYSIS_DIR}/RNA-seq_GSE55316/data/GSM1334029
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX476/SRX476128/SRR1176900/SRR1176900.sra -P ${ANALYSIS_DIR}/RNA-seq_GSE55316/data/GSM1334033
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX476/SRX476129/SRR1176901/SRR1176901.sra -P ${ANALYSIS_DIR}/RNA-seq_GSE55316/data/GSM1334034
Workflow execution¶
Your directory should now look like this:

And you should be able to execute it like this:
cd ${ANALYSIS_DIR}
snakemake -s gene-regulation/scripts/snakefiles/workflows/import_from_sra.wf -p --configfile gene-regulation/examples/RNA-seq_GSE55316/config.yml
snakemake -s gene-regulation/scripts/snakefiles/workflows/quality_control.wf -p --configfile gene-regulation/examples/RNA-seq_GSE55316/config.yml
snakemake -s gene-regulation/scripts/snakefiles/workflows/RNA-seq.wf -p --configfile gene-regulation/examples/RNA-seq_GSE55316/config.yml
Workflow ‘integration_ChIP_RNA’¶
coming soon
Running Gene-regulation workflows on your own data¶
Gene-regulation library & genome data¶
Hereafter is a suggestion for the organization of your files.
ANALYSIS_DIR=$HOME/my_analysis
mkdir -p ${ANALYSIS_DIR}
cd ${ANALYSIS_DIR}
# Download the Gene-regulation library
wget --no-clobber https://github.com/rioualen/gene-regulation/archive/4.0.tar.gz
tar xvzf 4.0.tar.gz
ln -s gene-regulation-4.0 gene-regulation
# Download genome data
wget -nc <URL_to_my_genome.fa.gz> -P ${ANALYSIS_DIR}/genome
wget -nc <URL_to_my_genome.gff3.gz> -P ${ANALYSIS_DIR}/genome
wget -nc <URL_to_my_genome.gtf.gz> -P ${ANALYSIS_DIR}/genome
gunzip ${ANALYSIS_DIR}/genome/*.gz
Your directory should look like this:

Fastq files organization¶
This tutorial assumes you have your own fastq files. We recommend that you organize your samples in separate folders, and name both the fastq files and their parent directories accordingly.

If you have paired-end samples, both files should be in the same directory and distinguished using a suffix of some sort.

Metadata¶
Running the workflows provided by the Gene-regulation library requires the use of three configuration files.
samples.tab¶
This file should contain at least one column named "ID", containing sample names that match those defined in the previous section. In the case of an RNA-seq analysis, it should also contain a column "Condition", which defines the groups of comparison (see the design file in the section below).
All the samples will be processed in the same manner. You can prevent certain samples from being processed by commenting out the corresponding lines with a ";" at the beginning of the line.
RNA-seq sample groups should contain at least 2 samples.
You can add any other relevant information related to samples in other tab-separated columns.
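A minimal samples.tab could thus look like this (a hedged, hypothetical sketch; the sample IDs and conditions are placeholders, the columns are tab-separated, and the "Condition" column is only needed for RNA-seq):
ID	Condition
sample1	WT
sample2	mutant
;sample3	excluded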


design.tab¶
The purpose of this file is to determine which samples should be processed together. In a ChIP-seq analysis, it will be used to define which ChIP samples should be compared with which inputs. In an RNA-seq experiment, it defines the conditions to be compared against each other.
Column names should be respected.


config.yml¶
You can find examples of configuration files in the examples section of the gene-regulation directory.
Directories should be defined relative to the working directory defined at the beginning: genome, gene-regulation, fastq, etc. The same goes for the configuration files.
Genome filenames should be mentioned as they appear in the defined genome directory.
Genome size should be filled in, as well as the sequencing type: "se" for single-end data, and "pe" for paired-end data. In the case of paired-end data, the suffixes (parameter "strands") should be mentioned and should match the filenames (minus the "_").
A minimal configuration should look like this:

All the parameters related to the tools used are optional, and the default parameters of each program will be used when they’re not set in the configfile.

Running a workflow¶
If your directory now looks like this, you should be ready to run a workflow!

You can verify it by doing dry runs:
cd ${ANALYSIS_DIR}
# Run the quality check
snakemake -s gene-regulation/scripts/snakefiles/workflows/quality_control.wf --configfile metadata/config.yml -p -n
# Run the ChIP-seq workflow
snakemake -s gene-regulation/scripts/snakefiles/workflows/ChIP-seq.wf --configfile metadata/config.yml -p -n
# Run the RNA-seq workflow
snakemake -s gene-regulation/scripts/snakefiles/workflows/RNA-seq.wf --configfile metadata/config.yml -p -n
Just remove the -n option to actually run them.
Virtual environments¶
Tutorials on how to run Gene-regulation workflows in virtual environments or virtual machines (VM).
These protocols were developed on a Unix computer, with the OS LMDE and a 64-bit architecture. The virtual machines are developed with the Ubuntu 14.04 OS.
Last update: 17/05/09
In computing, a virtual machine (VM) is an emulation of a computer system. Virtual machines are based on computer architectures and provide functionality of a physical computer. Their implementations may involve specialized hardware, software, or a combination. Source: Wikipedia
Diagram of a physical machine vs a virtual machine:

IFB cloud¶
IFB cloud utilities¶

The French Bioinformatics Institute (IFB) cloud provides users with a number of bioinformatics facilities, under the form of ready-to-use appliances. A cloud appliance is a template or a virtual machine (VM) built with a bundle of scientific or utility software components that are already configured. Several appliances are dedicated to special fields of bioinformatics, such as proteomics, genomics... Some of them come with an HTML interface, such as Galaxy or RSAT.
The cloud also provides "basic" Ubuntu or CentOS appliances. Provided you hold a developer account, it allows you to instantiate a virtual machine, set up your own tools, and register it as a new appliance to be used again later on and even shared with other cloud users.
The official website is still under development.
The first parts of this tutorial explain how to use the IFB cloud for general purposes.
For a specific use of the Gene-regulation appliance, you should refer to this section.
User account creation & configuration¶
- Using the IFB cloud facilities requires a user account. Register here.
- Once your account has been validated, you can login.
- In order to be able to access your instances through SSH, you should register your SSH public key in your account settings, through the dashboard.

Virtual disk creation¶
Appliances usually have a limited amount of disk space (up to 10 or 20 GB). If the instance to be run requires more disk space, you have to create a virtual disk (vDisk) prior to launching it. Depending on the type of account that you have, you'll have a certain amount of storage space available. This space can be divided into as many vDisks as you want.
When instantiating an appliance, you can choose to attach one of these vDisks to the virtual machine.
- Click New vDisk button.
- Enter a size (a whole number equal to the number of GB needed).
- Name it.


Creation of an instance¶
Creating an instance consists of instantiating an existing appliance. It creates a virtual environment that has the same contents as the chosen appliance, and that is accessible through ssh from your local host.
- Click New Instance button.
- Choose an appliance in the drop-down menu. You may use the filter menu in order to look for a specific tool.
- Name your virtual machine.
- Choose the amount of CPU and RAM to grant the VM.
- Attach the vDisk.
- Click Run.

- After a few seconds, you may refresh the page until the newly created instance shows up on the dashboard. Clicking on the ssh mention in the Access column will give you the commands to access your virtual machine.

- If the appliance has an HTTP interface, a link will also be provided in the Access column.
- Connect to your VM by commandline.
# Replace XXX by the IP of your instance
ssh -A -p 22 root@192.54.201.XXX
Creation of an appliance¶
Creating your own appliance can be as simple as instantiating an existing one. You just need to choose a "base" to build it on.
- Click New Instance button.
- Choose the appliance Ubuntu 14.04 IFB (16-12).
- Name your instance.
- Check Create appliance.
- Choose the amount of CPU and RAM to grant the VM.
- Attach the vDisk.
- Click Run.

- Refresh the page. Your appliance should appear in orange because of the creation mode you selected. You can now click on the ssh column to see the ssh parameters. It should look like this:

Connect to your VM by commandline.
# Replace XXX by the IP of your instance
ssh -A -p 22 root@192.54.201.XXX
Once you’re connected to your appliance, you can install all the programs that you want. You can check out this section for a manual on how to install NGS tools. Beware that the amount of disk space of the appliance itself is limited!
Later on, you can ask an admin from the IFB cloud to register the appliance, in order for it to become available to other cloud users.
First connection to the instance¶
Data management¶
By default, if a vDisk has been attached to the VM, it is mounted under /root/mydisk.
You can transfer data from your local computer to the VM using commands provided under Access > ssh:
# Replace XXX by the IP of your instance
scp -P 22 ${localfile} root@192.54.201.XXX:
sftp -oPort=22 root@192.54.201.XXX
Another way is to use rsync:
# Replace XXX by the IP of your instance
rsync -ruptvl ${localfile} root@192.54.201.XXX:/root/mydisk/
Software installation¶
Once you're connected to the VM through ssh, you can install any program just the way you would do it locally (see the tutorials in this directory, for instance).
Configuration¶
nano ~/.bashrc
Find the following paragraph and uncomment the line force_color_prompt.
# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
force_color_prompt=yes
source ~/.bashrc
Using the Gene-regulation appliance¶
Requirements¶
User account creation & configuration
- Using the IFB cloud facilities requires a user account. Register here.
- Once your account has been validated, you can login.
- In order to be able to access your instances through SSH, you should register your SSH public key in your account settings, through the dashboard.
Virtual disk creation¶
Appliances usually have a limited amount of disk space (up to 10 or 20 GB). If the instance to be run requires more disk space, you have to create a virtual disk (vDisk) prior to launching it.
Check out this section for details.
- Click New vDisk button.
- Enter a size (a whole number equal to the number of GB needed).
- Name it (e.g. GSE20870-10Gb, the ID of the Gene Expression Omnibus series that will be stored on the virtual drive).

Creation of an instance¶
- Click New Instance button.
- Choose appliance Gene regulation 4.0 in the drop-down menu.
- Name your VM.
- Choose the amount of CPU and RAM to grant the VM.
- Attach the vDisk.
- Click Run.
- After a few seconds, you may refresh the page until the newly created instance shows up on the dashboard. Clicking on the ssh mention in the Access column will give you the commands to access your virtual machine.

Connection to the device¶
Open a terminal on your host computer and type in:
# Replace XXX by the IP of your instance
ssh -A -p 22 root@192.54.201.XXX
Download source data¶
On the IFB cloud VM, the vDisk is automatically attached and mounted by default under /root/mydisk, or ~/mydisk.
Here we create a folder to store the source data files and the files that will result from the execution of our workflow.
We also create a link to the gene-regulation library.
ANALYSIS_DIR=${HOME}/mydisk/ChIP-seq_SE_GSE20870
mkdir -p ${ANALYSIS_DIR}
cd ${ANALYSIS_DIR}
ln -s ${HOME}/gene-regulation-4.0 gene-regulation
The following commands will download the raw files from the GEO database, and create the folders to organize them properly.
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX%2FSRX021%2FSRX021358/SRR051929/SRR051929.sra -P ${ANALYSIS_DIR}/data/GSM521934
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX%2FSRX021%2FSRX021359/SRR051930/SRR051930.sra -P ${ANALYSIS_DIR}/data/GSM521935
The following commands will download the required genome files in a specific directory:
- fasta file of the reference genome
- gff3 annotation file
- gtf annotation file
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.30.dna.genome.fa.gz -P ${ANALYSIS_DIR}/genome
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.30.gff3.gz -P ${ANALYSIS_DIR}/genome
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.30.gtf.gz -P ${ANALYSIS_DIR}/genome
gunzip ${ANALYSIS_DIR}/genome/*.gz
Your file organization should now look like this:

Run the workflow¶
You can use the option -n to make a dry run.
cd ${ANALYSIS_DIR}
snakemake -p -s gene-regulation/scripts/snakefiles/workflows/import_from_sra.wf --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml -n
If there is no error, you can proceed with the analysis:
# This workflow extracts .fastq files from the .sra archives
snakemake -p -s gene-regulation/scripts/snakefiles/workflows/import_from_sra.wf --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml
# This workflow performs quality check and trimming on the raw data
snakemake -p -s gene-regulation/scripts/snakefiles/workflows/quality_control.wf --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml
# This workflow performs a classic ChIP-seq analysis, including mapping, peak-calling and motif search
snakemake -p -s gene-regulation/scripts/snakefiles/workflows/ChIP-seq.wf --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml
Using 4 CPUs and 8 GB of RAM, the workflow should take about 12 minutes to complete.
Congratulations! You just executed these wonderful workflows:



Visualizing results¶
The Virtual Machine created on the IFB cloud doesn't have a graphical interface, but it contains the X2GO software. We will use it to create a remote desktop in order to visualize the results from the host machine.
- Install the x2go client and launch it from your local computer.
sudo apt-get install x2goclient
x2goclient
- Create a new session using the Mate desktop.

- The session now appears on the right panel. Just click it to launch it!

- You should now be on the virtual desktop!

Note: you may need to change your keyboard settings
- Go to System > Preferences > Keyboard
- Click on tab Layouts
- Add and/or remove desired keyboards
The result files should be organized like this:

FastQC
You can visualize the FastQC results using Firefox or any other browser. Open the html files located in the sample directories.
Before trimming:
firefox /root/mydisk/ChIP-seq_SE_GSE20870/fastq/GSM521934/GSM521934_fastqc/GSM521934_fastqc.html
firefox /root/mydisk/ChIP-seq_SE_GSE20870/fastq/GSM521935/GSM521935_fastqc/GSM521935_fastqc.html
After trimming:
firefox /root/mydisk/ChIP-seq_SE_GSE20870/fastq/GSM521934/GSM521934_sickle-se-q20_fastqc/GSM521934_sickle-se-q20_fastqc.html
firefox /root/mydisk/ChIP-seq_SE_GSE20870/fastq/GSM521935/GSM521935_sickle-se-q20_fastqc/GSM521935_sickle-se-q20_fastqc.html

IGV
You can visualize the peaks by running IGV from the terminal.
igv
- Click "File" > "Open session..." and choose the file /root/mydisk/ChIP-seq_SE_GSE20870/reports/peaks/igv_session.xml.
- You may need to adjust the panel sizes.

Create your own Gene-regulation appliance¶
Creating a new appliance from scratch is very similar to using one. You have to satisfy the requirements described here.
If you want to manipulate data, you should also create a vDisk following these instructions.
Creation of an appliance¶
When creating a new instance, choose a 10 GB Ubuntu appliance and check the Create appliance option:
- Click New Instance button.
- Choose appliance Ubuntu 14.04 IFB-X2GO-10GB in the drop-down menu.
- Name your VM.
- Choose the amount of CPU and RAM to grant the VM.
- Check the box Create appliance.
- Attach the vDisk.
- Click Run.

The new instance should appear in bold orange font on the dashboard.

You can connect to the instance through ssh as shown in previous sections.
Get the gene-regulation repository¶
wget -nc https://github.com/rioualen/gene-regulation/archive/4.0.tar.gz
tar zvxf 4.0.tar.gz
Run makefile to install the dependencies¶
The Gene-regulation library contains a makefile that installs most of the dependencies required to execute the snakemake workflows. You can also install tools manually, following these instructions.
The execution of the makefile may take a while (30 minutes to 1 hour), mostly because of the Python libraries required by several NGS tools.
Then you should source the .bashrc in order to update the $PATH accordingly.
make -f gene-regulation-4.0/scripts/makefiles/install_tools_and_libs.mk all
source ~/.bashrc
If you want to install the x2go server on the VM for visualization purposes, as shown here, you can also execute this rule:
make -f gene-regulation-4.0/scripts/makefiles/install_tools_and_libs.mk add_repos desktop_and_x2go
You should now be able to execute the example workflow by following instructions from here.
In order for your appliance to remain persistent and be available to other users on the IFB cloud, you should contact an admin.
Docker¶
Docker is an open-source project that automates the deployment of applications inside software containers. Source: Wikipedia.
Diagram of Docker containers compared to physical and virtual machines.

Get started with Docker!¶
Install Docker on your local host¶
Instructions for a Linux install can be found here, along with Mac and Windows instructions. A useful script is available here for a Debian install.
You can also install it on Ubuntu 14.04 (64 bits) by typing the following:
#sudo apt-get update
sudo apt-get -y install docker.io
sudo usermod -aG docker <username>
You should now log out and in again from your Ubuntu session. This short procedure was tested in a virtual machine under VirtualBox (see corresponding tutorial).
You can test whether docker works properly:
docker run hello-world

NB: the keyboard layout may switch back to QWERTY after the Docker install. Switch back to AZERTY:
setxkbmap fr
Gene-regulation with Docker¶
Execute the pipeline¶
snakemake -p -s gene-regulation/scripts/snakefiles/workflows/factor_workflow.py --configfile gene-regulation/examples/GSE20870/GSE20870.yml
On Mac OSX¶
- Install docker
https://docs.docker.com/engine/installation/mac/
- Open the application Docker Quickstart Terminal. This opens a new terminal window and launches the docker daemon.
- Get the gene-regulation docker
docker pull rioualen/gene-regulation:0.3
- Check the list of docker images available locally
docker images
- Start the gene-regulation image. The option -it specifies interactive mode, which is necessary in order to use this VM interactively.
docker run -it rioualen/gene-regulation:0.3 /bin/bash
You are now in a bash session of a gene-regulation docker. In this session, you are the "root" user, i.e. you have all administration rights. You can check this easily:
whoami
- Check the disks available on this docker
df -h
Currently, your docker can only access its local disk, which comes with the VM. Beware: any data stored on this local disk will be lost when you shut down the gene-regulation docker.
- Exit and get back to your gene-regulation container
If you exit your shell session, the container will still be running.
exit
You are now back to the host terminal.
Check the currently active docker containers (processes).
docker ps -a
Note that you can run several containers of the same image. Each active container has a unique identifier which appears in the first column when you run docker ps (e.g. faff5298ef95). You can re-open a running container with the command
docker attach [CONTAINER_ID]
where [CONTAINER_ID] must be replaced by the actual ID of the running docker container (e.g. faff5298ef95).
- Shutting down the container
We will now shut down this image, and start a new one which will enable you to store persistent data.
docker stop [CONTAINER_ID]
- Starting a docker container with a shared folder.
docker pull rioualen/gene-regulation:0.3
mkdir -p ~/gene-regulation_data/GSE20870/GSM521934 ~/gene-regulation_data/GSE20870/GSM521935
cd ~/gene-regulation_data/GSE20870/GSM521934
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX%2FSRX021%2FSRX021358/SRR051929/SRR051929.sra
cd ~/gene-regulation_data/GSE20870/GSM521935
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX%2FSRX021%2FSRX021359/SRR051930/SRR051930.sra
mkdir -p ~/gene-regulation_data/results/GSE20870
docker run -v ~/gene-regulation_data:/data -it rioualen/gene-regulation:0.3 /bin/bash
- Running the snakemake demo workflow on the docker container
ls /data
ls /data/GSE20870/
ls /data/GSE20870/GSM521934/
source ~/bin/ngs_bashrc
snakemake -s scripts/snakefiles/workflows/factor_workflow.py -np
Questions¶
- When logging into the gene-regulation VM, one lands in a basic shell (not bash). Is it possible to configure the docker so that bash is entered automatically?
1. Set the entry point to /bin/bash. 2. The bashrc should be added to the /etc of the docker.
VirtualBox¶
Creating a virtual machine (VM)¶
Creating a VM under VirtualBox software¶
Virtualbox software
We used VirtualBox 5.0.2, downloadable from https://www.virtualbox.org/ or to be installed manually:
sudo apt-get install virtualbox-5.0
The VirtualBox extension pack may be required (e.g. for handling USB 2.0; see the 'errors' section).
wget http://download.virtualbox.org/virtualbox/5.0.2/Oracle_VM_VirtualBox_Extension_Pack-5.0.2.vbox-extpack
Ubuntu image
In this tutorial we used Ubuntu 14.04.5, the latest long-term support version at the time.
wget http://releases.ubuntu.com/14.04/ubuntu-14.04.5-desktop-amd64.iso
Before configuring the virtual machine, we need to tell VirtualBox how it will enable your local virtual machines to interact with their host (the operating system of the machine on which the VM is running).
- Open VirtualBox > File > Preferences...
- Open the tab Network > Host-only Networks
- click on the “+” icon
- this creates a network vboxnet0. Select this network, click on the screwdriver icon (edit host-only network), and set the following options:
- Adapter tab
- IPv4 Address: 192.168.56.1
- IPv4 Network Mask: 255.255.255.0
- IPv6 Address: blank
- IPv6 Network Mask Length: 0
- DHCP Server tab
- Check Enable Server
- Server Address: 192.168.56.100
- Server Mask: 255.255.255.0
- Lower Address Bound: 192.168.56.101
- Upper Address Bound: 192.168.56.254



- Open VirtualBox
- Click on the New button.
- Parameters
- Name and operating system
- Name: gene-regulation
- Type: Linux
- Version: Ubuntu (64 bits)
- Memory size: 2048 Mb (this can be modified afterwards).
- Hard drive: Create a virtual hard drive now.
- Hard drive file type: VDI (VirtualBox Disk Image).
- Storage on physical hard drive
- Select Dynamically allocated
- File location and size
- max size of virtual hard drive: 30GB
- click on Create button
Note: you should adapt the virtual hard drive size to your needs. Be aware that it’s difficult to extend later on, so you should aim larger than expected. Since the size is dynamically allocated, it won’t take up too much space until you fill it.
At this stage, the VM has been created and needs to be configured before installing the operating system.
In the VirtualBox main window, select the newly created virtual machine, and click on the Settings button.
General
For the desktop version of Ubuntu, it is convenient to enable copy-paste between the guest and the host.
- Select the tab Advanced
- Set Shared clipboard to Bidirectional
- Set Drag’n Drop to Bidirectional
Storage
Click on the Empty disc icon in the storage tree. Select the disc icon on the right and fetch the downloaded .iso image (see Requirements). Click on OK.
Network
VirtualBox offers many alternative ways to configure network communications between the virtual machine, the host machine, and the external network.
To get more information about network settings:
- VirtualBox manual page
- An excellent tutorial
We present here one possible way to configure your virtual machine, but this should be adapted to the particular security/flexibility requirements of the network where the machine has to run.
In the VM settings, select the Network tab. VirtualBox enables you to specify several adapters, each corresponding to one separate network access (e.g. using an ethernet card + wi-fi connection).
- click on the tab Adapter 1,
- check Enable Network Adapter
- Attached to: Host-only Adapter
- Name: vboxnet0 (this network must have been created beforehand, see above)
- click on the tab Adapter 2,
- check Enable Network Adapter
- Attached to : NAT
- click on the tab Adapter 3,
- check Enable Network Adapter
- Attached to : Bridged Adapter
- Name: choose an option corresponding to the actual internet connection of the host machine (e.g. ethernet cable, Wi-Fi, ...).
You can now start the VM.
- Welcome
- check the language settings and click on Install Ubuntu.
- Preparing to install Ubuntu
- leave all default parameters and click Continue.
- Installation type
- (leave the default) Erase disk and install Ubuntu, click Install Now.
- Where are you (automatic)
- Paris
- Keyboard layout
- French - French
- Who are you?
- Your name: gene-regulation
- Your computer’s name: gene-regulation-virtual
- Pick a username: gr
- Choose a password: genereg
- (Activate the option Log in automatically)
Restart once installation is completed.
Once on the desktop, go to the VM menu: select Devices, then Install Guest Additions CD image. Run it.
The VirtualBox Guest Additions will provide closer integration between host and guest and improve the interactive performance of guest systems. Reboot again to see the new display.
Installing programs and dependencies¶
Once in the virtual machine, you can install the required programs from a terminal.
Get the gene-regulation repository¶
cd
wget --no-clobber https://github.com/rioualen/gene-regulation/archive/4.0.tar.gz
tar zvxf 4.0.tar.gz
Run makefile to install all required dependencies¶
This may take a while (30 min to 1 h). Then source the .bashrc, which has been updated with the $PATH for the newly installed applications.
cd
ln -s gene-regulation-4.0 gene-regulation
make -f gene-regulation/scripts/makefiles/install_tools_and_libs.mk all
source ~/.bashrc
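To check that the newly installed tools are now on your PATH, you can query a couple of them, for example (assuming FastQC is among the tools installed by the makefile):
snakemake --version
fastqc --version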
Executing snakemake workflow example¶
## Create a base directory for the analysis
export ANALYSIS_DIR="${HOME}/ChIP-seq_SE_GSE20870"
mkdir ${ANALYSIS_DIR}
## Download source data
mkdir -p ${ANALYSIS_DIR}/data/GSM521934 ${ANALYSIS_DIR}/data/GSM521935
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX021/SRX021358/SRR051929/SRR051929.sra -P ${ANALYSIS_DIR}/data/GSM521934
wget --no-clobber ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX021/SRX021359/SRR051930/SRR051930.sra -P ${ANALYSIS_DIR}/data/GSM521935
## Download reference genome & annotations
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.30.dna.genome.fa.gz -P ${ANALYSIS_DIR}/genome
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.30.gff3.gz -P ${ANALYSIS_DIR}/genome
wget -nc ftp://ftp.ensemblgenomes.org/pub/fungi/release-30/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.30.gtf.gz -P ${ANALYSIS_DIR}/genome
gunzip ${ANALYSIS_DIR}/genome/*.gz
## Execute workflow
cd ${ANALYSIS_DIR}
ln -s ${HOME}/gene-regulation
snakemake -p --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml -s gene-regulation/scripts/snakefiles/workflows/import_from_sra.wf
snakemake -p --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml -s gene-regulation/scripts/snakefiles/workflows/quality_control.wf
snakemake -p --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml -s gene-regulation/scripts/snakefiles/workflows/ChIP-seq.wf
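If you want to preview the jobs a workflow will schedule without running them, you can add Snakemake's dry-run option, for example:
## -n performs a dry run, -p prints the shell commands
snakemake -n -p --configfile gene-regulation/examples/ChIP-seq_SE_GSE20870/config.yml -s gene-regulation/scripts/snakefiles/workflows/quality_control.wf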
Congratulations! You just executed three complete workflows: SRA import, quality control, and ChIP-seq analysis.
Visualizing results¶
FastQC¶
You can visualize the FastQC results using Firefox or any other web browser. Open the html files located in the sample directories.
Before trimming:
firefox ~/ChIP-seq_SE_GSE20870/fastq/GSM521934/GSM521934_fastqc/GSM521934_fastqc.html
firefox ~/ChIP-seq_SE_GSE20870/fastq/GSM521935/GSM521935_fastqc/GSM521935_fastqc.html
After trimming:
firefox ~/ChIP-seq_SE_GSE20870/fastq/GSM521934/GSM521934_sickle-se-q20_fastqc/GSM521934_sickle-se-q20_fastqc.html
firefox ~/ChIP-seq_SE_GSE20870/fastq/GSM521935/GSM521935_sickle-se-q20_fastqc/GSM521935_sickle-se-q20_fastqc.html

IGV¶
You can visualize the peaks by running IGV from the terminal.
igv
- Click “File” > “Open session...” and choose the file ~/ChIP-seq_SE_GSE20870/results/peaks/igv_session.xml.
- You may need to adjust the panel sizes.

Export appliance (todo)¶
The virtual machine created with VirtualBox can be exported and saved as an appliance.
- Shut down the VM.
- In VirtualBox, open File -> Export Appliance ...
- Select the VM gene-regulation
- Next >
- Save as: gene-regulation-[YYMMDD].ova
- Format: OVF 1.0
- Write Manifest File: check
- Next >
- Appliance Settings
- Name: gene-regulation-[YYMMDD]
- Product: Regulatory Genomics Pipeline
- Product-URL: -
- Vendor: Claire Rioualen, Jacques van Helden
- Version: YYYY-MM-DD
- Description: Regulatory Genomics Pipeline using Snakemake, installed on an Ubuntu 14.04 Virtual Machine.
- License: Free of use for academic users, non-commercial and non-military usage.
- Export
The saved appliance can later be re-imported, on another computer if needed.
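The export can in principle also be performed from the host terminal with VBoxManage; the sketch below reproduces the settings chosen above (OVF 1.0 format with a manifest file):
VBoxManage export gene-regulation --output gene-regulation-[YYMMDD].ova --ovf10 --manifest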
Import appliance (todo)¶
In VirtualBox, open the menu File > Import Appliance and fetch the OVA file.
Note: there is apparently a bug with the export of VMs under VirtualBox 5.0. If you get this error when launching the imported file:
A new node couldn’t be inserted because one with the same name exists. (VERR_CFGM_NODE_EXISTS).
There is a workaround: go to the imported VM settings, to the USB tab, and untick “enable USB Controller”. You should now be able to start the VM.
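Assuming the VM kept the name given at export time, the same workaround can be applied from the command line:
VBoxManage modifyvm gene-regulation-[YYMMDD] --usb off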
Conda¶
TODO
Snakemake¶
This tutorial was developed assuming a Unix-like environment (Ubuntu 14.04).
Introduction¶
Snakemake concepts¶
- Inspired by GNU Make: system of rules & targets
- A rule is the recipe for a target
- Rules are combined by matching their inputs and outputs
Installation¶
sudo apt-get -y install python3-pip
sudo pip3 install snakemake
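You can check the installation with:
snakemake --version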
Downloads for practical exercises¶
Ubuntu libraries¶
sudo apt-get -y install zlib1g-dev # samtools (1-6)
sudo apt-get -y install libncurses5-dev libncursesw5-dev # samtools (1-6)
sudo apt-get -y install r-base-core # Rsamtools (4-6)
sudo pip3 install "rpy2<2.5.6" # Rsamtools (4-6)
sudo pip3 install pyyaml # Config management (5-6)
Tutorial material¶
wget https://github.com/rioualen/gene-regulation/archive/1.0.tar.gz
tar xvzf 1.0.tar.gz
cd gene-regulation-1.0/doc/snakemake_tutorial
Samtools¶
wget -nc http://sourceforge.net/projects/samtools/files/samtools/1.3/samtools-1.3.tar.bz2
bunzip2 -f samtools-1.3.tar.bz2
tar xvf samtools-1.3.tar
cd samtools-1.3
make
sudo make install
cd ..
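You can verify that the build and installation succeeded with:
samtools --version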
Demo workflows¶
Workflow 1: Rules and targets¶
- Only the first rule is executed by default
- Rule all defines the target
- Rule sam_to_bam automatically produces the target
# file: workflow1.py
rule all:
    input: "GSM521934.bam"

rule sam_to_bam:
    input: "GSM521934.sam"
    output: "GSM521934.bam"
    shell: "samtools view -b {input} > {output}"
In the terminal:
snakemake -s workflow1/workflow1.py
Workflow 2: Introducing wildcards¶
- Wildcards can replace variables
- Workflow applies to list of files or samples
- Use of the expand function
# file: workflow2.py
SAMPLES = ["GSM521934", "GSM521935"]

rule all:
    input: expand("{sample}.bam", sample = SAMPLES)

rule sam_to_bam:
    input: "{file}.sam"
    output: "{file}.bam"
    shell: "samtools view -b {input} > {output}"
In the terminal:
snakemake -s workflow2/workflow2.py
Workflow 3: Keywords¶
- Rules can use a variety of keywords
- An exhaustive list can be found here
# file: workflow3.py
SAMPLES = ["GSM521934", "GSM521935"]

rule all:
    input: expand("{sample}.bam", sample = SAMPLES)

rule sam_to_bam:
    input: "{file}.sam"
    output: "{file}.bam"
    params: threads = 2
    log: "{file}.log"
    benchmark: "{file}.json"
    shell: "(samtools view -bS --threads {params.threads} {input} > {output}) 2> {log}"
In the terminal:
snakemake -s workflow3/workflow3.py
Workflow 4: Combining rules¶
- Dependencies are handled implicitly, by matching filenames
- Commands can be executed by the keywords run or shell
- Several languages: R, bash, python
# file: workflow4.py
from snakemake.utils import R

SAMPLES = ["GSM521934", "GSM521935"]

rule all:
    input: expand("{sample}_sorted.bam", sample = SAMPLES)

rule sam_to_bam:
    input: "{file}.sam"
    output: "{file}.bam"
    params: threads = 2
    log: "{file}.log"
    benchmark: "{file}.json"
    shell: "(samtools view -bS --threads {params.threads} {input} > {output}) 2> {log}"

rule bam_sorted:
    input: "{file}.bam"
    output: "{file}_sorted.bam"
    run:
        R("""
        library(Rsamtools)
        sortBam("{input}", "{output}")
        """)
In the terminal:
snakemake -s workflow4/workflow4.py
Workflow 5: Configuration file¶
- Can be in json or in yml format
- Accessible through the global variable config
# file: workflow5.py
from snakemake.utils import R

configfile: "config.yml"

SAMPLES = config["samples"].split()
OUTDIR = config["outdir"]

rule all:
    input: expand(OUTDIR + "{sample}_sorted.bam", sample = SAMPLES)

rule sam_to_bam:
    input: "{file}.sam"
    output: "{file}.bam"
    params: threads = config["samtools"]["threads"]
    log: "{file}.log"
    benchmark: "{file}.json"
    shell: "(samtools view -bS --threads {params.threads} {input} > {output}) 2> {log}"

rule bam_sorted:
    input: "{file}.bam"
    output: "{file}_sorted.bam"
    run:
        R("""
        library(Rsamtools)
        sortBam("{input}", "{output}")
        """)
# file: config.yml
samples: "GSM521934 GSM521935"
outdir: "gene-regulation-1.0/doc/snakemake_tutorial/results/"
samtools:
    threads: "2"
In the terminal:
snakemake -s workflow5/workflow5.py
Workflow 6: Separated files¶
- The keyword include is used to import rules
# file: workflow6.py
from snakemake.utils import R

configfile: "config.yml"

SAMPLES = config["samples"].split()
OUTDIR = config["outdir"]

include: "sam_to_bam.rules"
include: "bam_sorted.rules"

rule all:
    input: expand(OUTDIR + "{sample}_sorted.bam", sample = SAMPLES)

# file: sam_to_bam.rules
rule sam_to_bam:
    input: "{file}.sam"
    output: "{file}.bam"
    params: threads = config["samtools"]["threads"]
    log: "{file}.log"
    benchmark: "{file}.json"
    shell: "(samtools view -bS --threads {params.threads} {input} > {output}) 2> {log}"

# file: bam_sorted.rules
rule bam_sorted:
    input: "{file}.bam"
    output: "{file}_sorted.bam"
    run:
        R("""
        library(Rsamtools)
        sortBam("{input}", "{output}")
        """)
In the terminal:
snakemake -s workflow6/workflow6.py
Workflow 7: The keyword ruleorder (todo)¶
Workflow 8: Combining wildcards with zip¶
Workflow 9: Combining wildcards selectively¶
Workflow 10: Using regular expression in wildcards¶
Other¶
- temp()
- touch()
- target/all
Bonus: generating flowcharts¶
snakemake -s workflow6/workflow6.py --dag | dot -Tpng -o d.png
snakemake -s workflow6/workflow6.py --rulegraph | dot -Tpng -o r.png
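These commands rely on the dot program from Graphviz; if it is not already available, it can be installed with:
sudo apt-get -y install graphviz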
Wiki NGS & Bioinformatics¶
To be completed
Glossary¶
We define hereafter a series of abbreviations, terms and concepts which appear recurrently in the literature about NGS analysis. This document aims at providing support for the interpretation of analysis reports.
Other resources:
A¶
B¶
C¶
- ChIP-exo:
- ChIP-seq:
- cigar: string describing how a read aligns to the reference (matches, insertions, deletions, clipping).
- Cloud:
- Copy number variation:
- Core:
D¶
- DEG: Differentially Expressed Gene.
E¶
- e-value (E): indicates the number of false positives expected by chance, for a given threshold of p-value. It is a number that can exceed 1; it is thus not a probability, and thus not a p-value.
\(E = \langle FP \rangle = P \cdot m\)
Where \(m\) is the number of tests (e.g. genes), \(FP\) the number of false positives, the notation \(\langle \rangle\) denotes the random expectation, and \(P\) is the nominal p-value of the considered gene.
Note that the e-value is a positive number ranging from \(0\) to \(m\) (number of tests). It is thus not a p-value, since probabilities are by definition comprised between 0 and 1.
F¶
- Family-wise error rate (FWER): indicates the probability of observing at least one false positive among the multiple tests.
\(FWER = P(FP \geq 1)\)
- fastq (file format): raw sequences + quality (more).
- False discovery rate (FDR): indicates the expected proportion of false positives among the cases declared positive. For example, if a differential analysis reports 200 differentially expressed genes with an FDR threshold of 0.05, we should expect about \(0.05 \cdot 200 = 10\) false positives among them.
G¶
- genome (file format):
- genomic input:
- gff (file format): genome feature file - annotations (more). See also gtf.
- gtf (file format): variant of GFF, with two fields for annotation (more).
- gtf2 (file format): gene annotation (more).
- GSM: Gene Expression Omnibus Sample identifier.
- GSE: Gene Expression Omnibus Series identifier (a collection of samples related to the same publication or topic).
H¶
I¶
- input: For peak-calling, the word “input” is used in a very specific sense, to designate a set of sequences used to estimate the read densities expected by chance as a function of the genomic position. The typical methods are the genomic input (currently the most widely used) and the mock.
J¶
K¶
L¶
- Library: A term used somewhat ambiguously depending on the context. Biologists refer to a DNA library to designate ... (to be defined). Bioinformaticians speak of a sequence library to designate the set of read fragments obtained by sequencing a given sample. Computer scientists call a “library” a code module grouping a series of functions and procedures.
M¶
- m: number of tests in a multiple-testing schema (e.g. number of genes in differential expression analysis).
- Mapped read:
- Mapping: Identifying genomic positions for the raw reads of a sequence library.
- mock: a type of control for peak-calling in ChIP-seq. It is an input obtained by using a non-specific antibody (e.g. anti-GFP) for the immunoprecipitation, in order to estimate the rate of non-specific sequencing for each genomic region. The advantage of the mock is that it constitutes a control performed under the same conditions as the specific ChIP-seq. Its weakness is that the library sizes are sometimes so small that the background estimate is not very robust.
- motif:
- Multiple testing: the multiple testing problem arises from the application of a given statistical test to a large number of cases. For example, in differential expression analysis, each gene/transcript is submitted to a test of equality between two conditions. A single analysis thus typically involves several tens of thousands of tests. The general problem of multiple testing is that the risk of a false positive indicated by the nominal p-value applies to each element individually, so false positives accumulate over the whole series of tests. Various types of corrections for multiple testing have been defined (Bonferroni, e-value, FWER, FDR).
N¶
- Negative control:
- NGS: Next Generation Sequencing.
O¶
P¶
- p-value (P): the nominal p-value is the p-value attached to one particular element in a series of multiple tests. For example, in differential analysis, one nominal p-value is computed for each gene. This p-value indicates the risk of obtaining an effect at least as important as the observed one under the null hypothesis, i.e. in the absence of regulation.
- padj (abbr.): adjusted p-value. A statistic derived from the nominal p-value in order to correct for the effects of multiple testing (see Bonferroni correction, e-value).
The most usual correction is the FDR, which can be estimated in various ways.
- Paired end:
- Peak:
- Peak-calling:
- pileup (file format): base-pair information at each chromosomal position (more).
Q¶
- q-value:
R¶
- RAM:
- Raw read: non-aligned read.
- Read: short sequence (typically 25-75bp) obtained by high-throughput sequencing.
- Region-calling:
- Replicate: ... distinguish technical replicates from biological replicates
- RNA-seq:
S¶
- sam (file format): aligned reads (more).
- Single end:
- Single nucleotide polymorphism:
- SRA: Sequence Read Archive. Database maintained by the NCBI.
- SRX: Short Read Experiment. See documentation.
- SRR: Short Read Run. See documentation.
T¶
U¶
X¶
Y¶
Z¶
Notes on multiple testing corrections¶
The problem with multiple tests¶
The multiple testing problem arises from the application of a given statistical test to a large number of cases. For example, in differential expression analysis, each gene/transcript is submitted to a test of equality between two conditions. A single analysis thus typically involves several tens of thousands of tests.
The general problem of multiple testing is that the risk of a false positive indicated by the nominal p-value applies to each element individually, so false positives accumulate over the whole series of tests.
P-value and derived multiple testing corrections¶
P-value (nominal p-value)¶
The nominal p-value is the p-value attached to one particular element in a series of multiple tests. For example, in differential analysis, one nominal p-value is computed for each gene. This p-value indicates the risk of obtaining an effect at least as important as the observed one under the null hypothesis, i.e. in the absence of regulation.
Bonferroni correction¶
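The original Bonferroni correction controls the risk of false positives by adapting the significance threshold to the number of tests: with \(m\) tests, an element is declared positive only if its nominal p-value satisfies \(P \leq \alpha / m\). An equivalent adjusted p-value is sometimes reported as \(P_{adj} = \min(1, P \cdot m)\).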
E-value¶
The e-value indicates the number of false positives expected by chance, for a given threshold of p-value.
\(E = \langle FP \rangle = P \cdot m\)
Where \(m\) is the number of tests (e.g. genes), \(FP\) the number of false positives, the notation \(\langle \rangle\) denotes the random expectation, and \(P\) is the nominal p-value of the considered gene.
Note that the e-value is a positive number ranging from \(0\) to \(m\) (number of tests). It is thus not a p-value, since probabilities are by definition comprised between 0 and 1.
Family-wise error rate (FWER)¶
The Family-Wise Error Rate (FWER) indicates the probability of observing at least one false positive among the multiple tests.
\(FWER = P(FP \geq 1)\)
False Discovery Rate (FDR)¶
The False Discovery Rate (FDR) indicates the expected proportion of false positives among the cases declared positive. For example, if a differential analysis reports 200 differentially expressed genes with an FDR threshold of 0.05, we should expect about \(0.05 \cdot 200 = 10\) false positives among them.
What is an adjusted p-value?¶
An adjusted p-value is a statistic derived from the nominal p-value in order to correct for the effects of multiple testing.
Various types of corrections for multiple testing have been defined (Bonferroni, e-value, FWER, FDR). Note that some of these corrections are not actual “adjusted p-values”.
- the original Bonferroni correction consists in adapting the \(\alpha\) threshold rather than correcting the p-value.
- the e-value is a number that can exceed 1, it is thus not a probability, and thus, not a p-value.
The most usual correction is the FDR, which can be estimated in various ways.
Useful links¶
Versioning, code sharing¶
Miscellaneous¶
- QC Fail Sequencing
- FastQC results interpretation
- A Wikipedia list of sequence alignment software
- Genome sizes for common organisms
- A list of formats maintained by the UCSC
- The IFB cloud and its documentation
- A catalogue of NGS-related tools: Sequencing (OmicTools)
- Elixir’s Tools and Data Services Registry.
- Wikipedia list of biological databases
Bibliography¶
ChIP-seq guidelines¶
- Bailey et al., 2013. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data.
- ENCODE & modENCODE consortia, 2012. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.
Tutorials¶
French¶
- Thomas-Chollier et al. 2012. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs.
- TODO add JvH & MTC tutos
- TODO Roscoff bioinformatics school: link
- RNA-seq tutorial