The Cpipe ChIP-seq Analysis Pipeline¶
Cpipe is a comprehensive and user friendly ChIP-seq Analysis Pipeline written in Python.
See Simpest config for a quick view of how to use this pipeline.
Manual¶
About the project¶
Purpose¶
- ChIP-seq experiment has been a mature and wide-spread technique for detecting the TF and histone modification distribution from the genome scale.
- Along with the popularity of the technique and the increasingly huge number of highthroughput datasets, it may be confusing for biologists to get a quick and easy access to understand their biological meaning, and the same important thing is the unbiased judgement of the data quality.So, it’s necessary for us establishing a universal and easy-to-use ChIP-seq data analysis pipeline for biologists.
- For bioinformaticians, you may find this may be the most extensible and flexible ChIP-seq integrated python packages ever. It provides various genomic data format support and high-rated analysis toolbox. Any bug report is welcome.
Goal¶
There are two layers for pipeline goal.This is part of the Cistrome project
1.command line
The goal is to simply input the sequence data files up to input format
support and fill in the customized table of experiment descriptions,
output the desired format of DC and QC report
2.web server
The ChIP-seq pipeline could be incorporated into the Cistrome
Features¶
This page is intended for:
Developers: in order to be sure they are developing the right project that fulfills requirements provided in this document.
Users: in order to get familiar with the idea of the project and suggest other features that would make it even more functional.
Cpipe should include the following features.
Supported formats¶
Cpipe should support the following format of Raw Data
as input:
Format | Type | Steps |
---|---|---|
FASTQ | Seq | |
BED | Mapped | Skip mapping |
BAM | Mapped | Skip mapping |
Cpipe may not support the following format in current version:
Format | Type | Solution |
---|---|---|
SRA | Seq | Use SRA Toolkit to convert to FASTQ format |
BED | Summit | |
BED | Peak | |
wig | Profile | |
Bigwig | Profile |
Data preprocession¶
Convert the raw sequencing data into intervals and profiles.
- Use Bowtie for tag alignment (mapping)
- Use MACS2 for peak calling
Correlation¶
Focus on the visulization of similarity between replicates.
- Draw the venn diagram for peaks if there’re less than 3 replicates (treatment or control)
Association Study¶
Focus on association between intervals (result of peak calling) and traits like genome annotation.
- CEAS: Annotate the given intervals and scores with genome features
- Conservation Plot: Calculates the PhastCons scores in several intervals sets
Motif¶
analysis the motif of the binding sites.
Quality control¶
Based on Chip-seq pipeline and Cistrome DC database, QC program will generate a comprehensive quality control report about a particular dataset as well as the relative result compared to the whole DC database.
- Basic information: Species, Cell Type, Tissue Origin, Cell line, Factor, Experiment, Platform, Treatment and Control.
- Reads Genomic Mapping QC measurement: QC of raw sequence data with FastQC, FastQC score distribution, Basic mapping QC statistics, Mappable reads ratio, Mappable Redundant rate.
- Peak calling QC measurement: Peak calling summary, High confident Peak, Peaks overlapped with DHS(Dnase Hypersensitivity sites), Velcro ratio(human only), Profile correlation within union peak regions, Peaks overlap between Replicates.
- Functional Genomic QC measurement: Peak Height distribution, Meta Gene distribution, Peak conservation score, Motif QCmeasurement analysis.
This page is intended for:
Developers: in order to make sure they’re using the right format of data and right version of tool to test
Users: in order to know where they should go to download these data and tools
Data¶
Built-in Data¶
The Cpipe package includes all the build-in data for hg19 and mm9. For other species, you may need to download these data from data source or custom it yourself.
Data Name | Used by | Data Source | Format |
---|---|---|---|
Chromesome length | samtools | UCSC table browser | 2-column |
Chromesome length | CEAS | – | – |
Genome backgroud annotation | CEAS | CEAS site | sqlite3 |
DHS region | bedtools | Custom | BED |
Velcro region | bedtools | Custom | BED |
Motif database | MDSeqPos | MDSeqPos site | xml |
FastQC result database | QCreport | Custom | bed |
Data summary database | QCreport | Custom | bed |
External Data¶
Some data are too large to be included by the pipeline package, so you need to download these data from data source.
Data Name | Used by | Data Source | Format |
---|---|---|---|
Bowtie pre-built index | Bowtie | Bowtie site | ebwt |
Conservation profile | Conservation Plot | Cistrome site | Bigwig |
Tools¶
Built-in Tools¶
Built-in tools are the scripts that can be run from command-line independently when you have installed the Cpipe package.
Tool Name | Modified from |
---|---|
Venn Diagram | |
Conservation Plot | |
Correlation plot | bigwig_correlation |
bamtofastq | |
wigTobigwiggle | |
RegPotential | |
sample_contamination |
External Tools¶
External Tools are the tools invoked by Cpipe by their path.
Tool Name | Download source | Version |
---|---|---|
FastQC | ||
R | ||
Cython | ||
MACS2 | MACS site | 2.0.10 20120605 |
CEAS | CEAS site | 0.9.9.7 |
bedtools | bedtools site | v2.16.2 |
pybedtools | ||
samtools | SAMtools site | 0.1.17 |
Bowtie | Bowtie site | 0.12.8 |
bedGraphToBigWig | UCSC utilities | v4 |
FastQC | FastQC site | v0.10.1 |
pdfTeX | pdfTex site | v1.40.10 |
IGV |
Config file¶
Synopsis¶
-
[meta]
¶ Lists all the meta-data of current workflow.
Consist of the following options:
-
dataset.ID
¶ The name for the dataset, which will be the value of
${DatasetID}
Limit: a string (1) consist of
numbers
,alphabets
or'_'
(2) shorter than 20 characters
-
species
¶ The name of species, written to the QCreport and log
Limit: a string (1) consist of
numbers
,alphabets
or'_'
(2) shorter than 20 characters
-
assembly
¶ The assembly version, written to the QCreport and log
Limit: a string (1) consist of
numbers
,alphabets
(2) shorter than 10 characters
-
treatment
¶ The paths of treatment files
Limit: absolute or relative
path
of files in supported formats
-
control
¶ The paths of treatment files
Limit: absolute or relative
path
of files in supported formats
-
-
[ext]
¶ The external data and external tools to use. Read External Data and External Tools for a full explanation.
-
[steps]
¶ meta
Simpest config¶
Here is one of the simpest Cpipe workflow you can make.
1 2 3 4 5 6 7 8 9 | [meta]
dataset.ID = mydata_ctcf
species = human
assembly = hg19
treatment = .../GSM489301.fastq
[ext]
directory = .../Cpipe/data
|
Use your own path of Raw Data to replace the Line 6. And use the path of the directory used to store External Data to replace Line 9.
When saved to hello_cpipe.conf
, this config file can construct a powerful pipeline via:
$ Cpipe hello_cpipe.conf
When it finished about 2 hours later, you will get Processed Data and a Final PDF Report.
Examples about replicates¶
For example, there are five raw files. Three of them are replicates for treatment
and two of two for control
.
The file names may look like:
demo_treat1.fastq
demo_treat2.fastq
demo_treat3.fastq
demo_control1.fastq
demo_control2.fastq
Then you can write the [meta]
section like this:
1 2 3 4 5 6 7 8 9 | [meta]
dataset.ID = demo_replicate
# species = human
# assembly = hg19
treatment.1 = demo_treat1.fastq
treatment.2 = demo_treat2.fastq
treatment.3 = demo_treat3.fastq
control.1 = demo_control1.fastq
control.2 = demo_control2.fastq
|
Replace the commented in Line 2, Line 3 and Line 4 and complete other sections. Then load it with Cpipe.
For the notation of output files, the ${DatasetID}
will be demo_replicate
. The ${treat_rep}
will be 1
, 2
and 3
. The ${control_rep}
will be 1
and 2
.
Output & Temporary files¶
This page is intended for:
Developers: in order to make sure they have a consistent naming convention and can find each file easily
Users: in order to know what each output file represents
Through the pipeline, several temporary files will be generated, some of them are only used for settings and transitions, others for continuing the next step, the rest for publishing and interpreting a biological story.Below is three sections of tables for universal name rules.
Note
For data which has not been published on GEO use factor name plus your favorate number to replace the GSMID below.
Notation¶
Temporary files¶
Name | Content | Tool used |
---|---|---|
${DatasetID}_treat_rep${treat_rep}.sam | mapping result | External Tools |
${DatasetID}_control_rep${control_rep}.sam | mapping result | External Tools |
${DatasetID}.conf | configuration | Main program |
${DatasetID}_bedtools_dhs.txt | DHS peaks intersection | External Tools |
${DatasetID}_cor.R | correlation plot code | Buit-in tools |
${DatasetID}_seqpos.zip | Motif analysis | External Tools |
${DatasetID}_QC.tex | QC report code | pdftex |
${DatasetID}_mappable_ratio.pdf | Mapping QC result | R |
${DatasetID}_fastqc_score_distribution.pdf | Raw data QC | R |
${DatasetID}_peak_distribution.pdf | Peak calling QC | R |
${DatasetID}_velcro_ratio.pdf | Peak calling QC | R |
${DatasetID}_peak_overlap_DHS.pdf | Peak calling QC | R |
Output result¶
Name | Content | Tool used |
---|---|---|
Folder | containing all results | Main Program |
GSMIDlog | log | |
GSMID_control_repnumber.bam | mapping result | External Tools |
GSMID_treat_repnumber.bam | mapping result | |
GSMID_repnumber_peaks.bed | Peak calling | MACS2 |
GSMID_bedtools_dhs.txt | DHS peaks intersection | BEDtools |
GSMID_cor.R | correlation plot code | Built-in tools |
GSMID_seqpos.zip | Motif analysis | MDSeqpos |
GSMID_QC.tex | QC report code | pdftex_ |
GSMID_ceas.xls | CEAS | CEAS_ |
GSMID_conserv.png | Phascon score plot | Built-in tools |
GSMID_conserv.R | Phascon score | Built-in tools |
Final PDF Report¶
Provide the overall report of the whole pipeline for viewing general result.
Name | Content | Tool used |
---|---|---|
GSMID_ceas_combined.pdf | Cistron annotation | CEAS |
GSMID_QC.pdf | All quality control measurements | Main program |
API Documentation¶
Documentation for every chlin API.
API Documentation¶
Controllers of QC¶
-
class
chilin.qc.
QC_Controller
(template='')¶ All the class in the module derives from this class
-
check
()¶ Check whether the quality of the dataset is ok.
-
render
(template=None)¶ Generate the latex code for current section.
-
run
()¶ Run some QC tools or do some time-costing statistics
-
-
class
chilin.qc.
RawQC
¶
-
class
chilin.qc.
MappingQC
¶
-
class
chilin.qc.
PeakcallingQC
¶
-
class
chilin.qc.
AnnotationQC
¶