BIOINFORMATICS AND APPLICATIONS BIO 4033/ BIO 5033¶

Multiple Sequence Alignments¶
Getting started with MEGA¶
Note
There is an excellent tutorial on the MEGA site and this is excerpt of the tutorial for the exercise.
How to make an alignment using MEGA¶
Step 1: | Open MEGA software and you will see a screen like in the following figure: |
---|

Step 2: | Click on the small arrow on the “Align” tab |
---|---|
Step 3: | Click on ‘Edit/Build alignment’ |

Step 4: | Select a new alignment. |
---|

Step 4: | Select protein |
---|

Step 5: | From file open, select “seq_align2.fasta” and open file. |
---|

Step 6: | From Edit, select all sequences. To do an alignment using Muscle, click on Muscle tab. |
---|

Step 7: | Use default options and perform an alignment. To learn more about the options, go to the MEGA manual. |
---|
Edit the alignments¶
Step 1: | You can insert gaps, delete blocks, and delete residues. |
---|
Be very careful when you edit a sequence alignment. It should be biologically meaningful.

Exporting MSA¶
Mega allows to export the MSA in different formats.

Steps of building a tree¶
Make multiple sequence alignment for Globin gene family¶
Step 1: | Download globin.fasta from Blackboard and perform a MSA using MUSCLE (follow the steps we discussed last week). |
---|---|
Step 2: | Examine the alignment to make sure it is correct and no additional editing is needed. |
Step 2: | Export the alignment as a fasta format file on your Desktop. Name it as globin_align |
Find informative sites for Parsimony¶
Step 1: | Open the alignment file you just created by going to using Open file/Session under Data |
---|

Step 2: | Go to Explore Active Data under Data |
---|---|
Step 3: | Click on Pi button and this show site that are informative for Parsimony |

Building Phylogenetic trees¶
Step 1: | Click on Phylogeny |
---|

Step 2: | Make Neighbor-Joining tree with Bootstrap 500 replicates
|
---|

Step 3: | Save the tree as a pdf file by clicking on Image button |
||||||
---|---|---|---|---|---|---|---|
Step 4: | Build a tree using Parsimony method with 50 Bootstrap replicates (500 will be very slow).
|

Steps of building a tree (Part II)¶
Make multiple sequence alignment for Globin gene family¶
Step 1: | Download globin.fasta from Blackboard and perform a MSA using MUSCLE (follow the steps we discussed last week). |
---|---|
Step 2: | Examine the alignment to make sure it is correct and no additional editing is needed. |
Step 3: | Export the alignment as a fasta format file on your Desktop. Name it as globin_align |
Find the best substitution model¶
Step 1: | Calculate the distance using different substitution models :a: Select Distance and then Compute Pairwise distance :b: Calculate distance using the following methods
|
---|

Step 2: | Use the same alignment file and build three NJ trees using different substitution models:
|
---|

Step 3: | Best model based on ProtTest |
---|
Building Phylogenetic trees¶
Step 1: | Click on Phylogeny |
---|

Step 2: | Make Neighbor-Joining tree with Bootstrap 500 replicates
|
---|

Step 3: | Save the tree as a pdf file by clicking on Image button |
---|
Step 4: | Build a tree using Parsimony method with 50 Bootstrap replicates (500 will be very slow).
|
---|
Step 5: | Bayesian inference of phylogeny |
---|
Follow this link to MrBayes online server
A: Use the same alignment file B: In MrBayes select Poisson amino acid model with equal rates of substitution.
C: Select prior parameters (e.g. equal, fixed frequencies for the states; equal probability for all topologies; unconstrained branch lengths). D: Run 1,000,000 trials for Monte Carlo Markov Chain estimation of the posterior distribution. E: Obtain phylogram F: Export tree files G: View in MEGA software



FastQC analysis using Cyverse Discovery Environment (DE)¶
Data we are using for this analysis came from Loraine et al, 2015 study. In the original study, there are 10 samples (Five Controls and heat treated). Here we are using only 3 samples for each group (3 control and 3 heat treated). These files were downloaded from NCBI’s Short Read Archive (SRA) using SRA toolkit.
First step of the data analysis is to check the quality of the sequences. For this purpose, we are using the FastQC tool on Cyverse DE.
Step 1: Login into Cyverse DE¶
First login to your Cyverse account using your name and password.

Then, go to your DE account.

Step 2: Getting data into Cyverse Discovery Environment¶
Click on “Data” button
Click on “File” and then “New Folder”
Create a folder called “Data” and click “OK”. Create another folder called “Analysis”.
Click on the “Data” folder to enter into it. Click on “Upload” and then “Import from URL”
- I have create public links for fastq files. Copy and paste URLs in the box (one for each box). You will need to do this for all 12 URLs. Then click on “Import from URL”
URLs¶
http://de.cyverse.org/dl/d/5B50EFE6-D0BA-4833-980E-E81E5B63C15E/Control1_1.fastq
http://de.cyverse.org/dl/d/BBFB60AC-8855-40AC-9634-7C62F5B9B02D/Control1_2.fastq
http://de.cyverse.org/dl/d/2AB5824F-73BA-4C6B-8530-457609F632BA/Control2_1.fastq
http://de.cyverse.org/dl/d/46E690E4-C4A0-495F-9B11-F12AD9A25EE3/Control2_2.fastq
http://de.cyverse.org/dl/d/7FEE6359-24AE-478D-A0B1-C6D2CA09E45E/Control3_1.fastq
http://de.cyverse.org/dl/d/8FBB264D-F0CA-4F2C-821A-DB1C709315B2/Control3_2.fastq
http://de.cyverse.org/dl/d/E7AD135C-F2BC-445C-BBC2-695B1D76B010/Heat1_1.fastq
http://de.cyverse.org/dl/d/46093383-493A-4D4E-A607-D3E56916DF59/Heat1_2.fastq
http://de.cyverse.org/dl/d/9668B243-7009-4AD3-BBDA-350D6A60119D/Heat2_1.fastq
http://de.cyverse.org/dl/d/FE1C3CC3-9133-4244-BCBB-816B8D2D5F97/Heat2_2.fastq
http://de.cyverse.org/dl/d/D635B6EE-BE26-4BC4-A058-3E51B1AA69C4/Heat3_1.fastq
http://de.cyverse.org/dl/d/F88561AF-CFF2-4FC8-B6B4-D8623779BB24/Heat3_2.fastq
Step 3: Performing FastQC analysis:¶
Click on “Apps” button.
Type “fastqc” in the search window and select the app shown in red arrow.
- Follow the direction as in the figure to select the folder where your results will be saved.
Then, click on the small downward arrow (black circle).
Click on “+” sign to select the fastq files.
- Go to the folder where you have your fastq files and select them as indicated in the
figure below. Then launch the analysis. Once the analysis is complete, you will be notified via email.
Reference:¶
Loraine AE, Blakley IC, Jagadeesan S, Harper J, Miller G, Firon N. Analysis and Visualization of RNA-Seq Expression Data Using RStudio, Bioconductor, and Integrated Genome Browser. Methods Mol Biol. 2015;1284:481-501. doi: 10.1007/978-1-4939-2444-8_24. PubMed PMID: 25757788.

Relaunching a stalled analysis in Cyverse Discovery Environment¶
If your analysis is appeared to be stalled, you could try restarting it.

Adapter and quality trimming using trim-galore¶
We are going to use Trim-galore to trim adapters, and poor quality bases. This tool has several advantages. It allows selection multiple files. You can also select both forward and reverse reads. If you want to read more about Trim-galore, please visit their website. Also, Trim-galore is a wrapper for Cutadapt , which is the actual tool that performs the trimming.
Please follow the tutorial carefully.
Step 1: Launching Trim-galore¶
- Click on App.
- In the finder window type “trim-galore”
Select “trim-galore-0.4.1”.
Step 2: Selecting output folder¶
As indicated in the figure: 1. Name your analysis as you want
- Select the output folder where your analysis is going to be
Click on “Paired end Input fastq files”
Step 3: Selecting input files¶
Click on the Green “+” sign.
2. Navigate to the folder where your samples are located. Select only the first read files. Click “OK”.
You should all your first read files selected like this.
Scroll down and click on the “+” below “Fastq file(s) (Read 2 of paired end reads):”
Select the read two files as above. You will see them in the box as in the figure below.
6. Scroll down and check box beside “Paired (Select this option for paired-end files)” to indicate these are paired end reads.
very important
- Click on “Parameters” as indicated in the above figure.
- Set the parameters as indicated in the figure:
- Use Fred 20 as quality trimming cut off (this is the default).
- Copy and paste the following adapter sequence for in the box below “Adapter sequence to be trimmed:”
AATGATACGGCGA
- Copy and paste the following adapter sequence for in the box below “Adapter2”
CAAGCAGAAGACGG
Set the stringency to 6.
e. Scroll down and set the length as 40. Any sequence become shorter than this length during the trimming will be discarded. | f. Launch the analysis.

Mapping short reads¶
If you are using genome as the reference for RNAseq reads, you will need to use a splice-aware aligner like Tophat2. If you are using cDNA as the reference, you can use a general purpose aligner like Bowtie2.
You need to do only one of the procedures based on what your group have been assigned to.
Step 1: Mapping with Tophat2¶
- Click on App.
- In the finder window type “Tophat”
Select “Tophat2-PE”.
- As indicated in the figure, Name your analysis as you want.
- Select the output folder where your analysis is going to be.
Click on “Input data”
- Click on the Green “+” sign.
8. Navigate to the folder where your samples are located. Select only the first read files. Click “OK”. You can select all three of your first read files.
- Scroll down and click on the “+” below “Fastq file(s) (Read 2 of paired end reads):”
Select “Reference Genome” and select the tomato genome sequence as input.
Make sure quality is Sanger and leave rest of the default values as they are. Launch the analysis.
Step 2: Mapping with Bowtie2¶
- Click on App.
- In the finder window type “Bowtie”.
Select Bowtie app indicated in the figure.
- As indicated in the figure, Name your analysis as you want.
Select the output folder where your analysis is going to be.
- Click on “Input”
8. Navigate to the folder where your samples are located. Select first and second read files. You can only input one sample at a time.
You need to name your output file carefully. For e.g., if it is heat1 sample, name the output as heat1.sam.
Select “Reference Index” and select the tomato cDNA sequence as input.
Select options. Set “Minimum fragment length” as 100 and “Maximum fragment length” as 600. Launch the analysis.

Counting mapped reads¶
To get the number of reads mapped to a reference sequences (in this case, predicted tomato cDNA sequences), we can use Samtools. Bowtie2 output is in sam format and first, we need to convert the output files into sorted bam files.
- Type Samtools in app finding window.

- Select “SAM to sorted BAM”

- Select Bowtie2 output files (SAM format).

4. Above will create sorted bam file. You will need to use this as the input for the Samtools Flagstat, which will count the number of mapped reads.


** You can get the flagstat for all six files from following link.** | https://github.com/ajwije/2017_spring_Bioinfo_class/blob/master/Files/flagstat.txt
I have used the following bash command to count mapped reads in case you are interested in it doing programmatically.
# navigate to where your sam files are and execute the following commands.
for i in *.sam
do
#extract the file name without extension and print it to the screen
echo ${i%.sam}
#covert sam to sorted bam
samtools view -bS $i | samtools sort - -o ${i%.sam}_sorted.bam
#getting flagstat and write it an output file for each bam file.
samtools flagstat ${fname%.sam}_sorted.bam >> flagstat.txt
done

Differential gene expression analysis¶
Link for Bowtie mapped counts http://de.cyverse.org/dl/d/E9B4C299-D6CB-4656-A4F6-FF67240AEA49/170407_bowtie_counts.txt
Target file for bowtie mapped reads: http://de.cyverse.org/dl/d/BECB62C3-A369-4084-9BC9-2BFD9E6E9600/bowtie_target.txt
DESeq tutorial:¶
Steps to perform DEseq analysis¶
From Apps select “DEseq (Multifactorial Comparison)
Name your analysis and select a folder where your results need to be saved.
Select the correct target file and the count file.
4. Give a name to the project. Reference biological condition should be “control” samples. Variable of interest is “group” (Column header of the third column of the target file).
Set the significant threshold to 0.05 and launch the analysis.
DE gene list¶
I have used the following R code to merge the DE genes list and the functions.
library(reshape2)
library(readr)
# Used the terminal command to grep the fasta headers and wrote it to a file called "ITAG3_10_cDN_names.txt"
#imported this file to Rstudio
# Removed the ">" sign
ITAG3_10_formated_names <- as.data.frame(sapply(ITAG3_10_cDN_names, gsub, pattern = ">", replacement = "" ))
#Seperate gene ids and description using space as delimiter
ITAG3_10_formated_names <- data.frame(colsplit(ITAG3_10_formated_names$X1, " ", c("Id", "Description")))
#imported up regulated genes to Rstudio and merge with the above file using gene ids.
heatvscontrol_up_func <- merge(heatvscontrol_up, ITAG3_10_formated_names,
by.x = "Id",
by.y = "Id")
#write output
write.table(x = heatvscontrol_up_func, file = "heatvscontrol_up_func.txt", quote = FALSE, sep = "\t", row.names = FALSE)
#imported down regulated genes to Rstudio and merge with the above file using gene ids
heatvscontrol_down_func <- merge(heatvscontrol_down, ITAG3_10_formated_names,
by.x = "Id",
by.y = "Id")
#write output
write.table(x = heatvscontrol_down_func, file = "heatvscontrol_down_func.txt", quote = FALSE, sep = "\t", row.names = FALSE)
Up-regulated gene list: http://de.cyverse.org/dl/d/E641698E-8688-4C20-B829-0B12BABC8ABB/heatvscontrol_up_func.txt
Down-regulated gene list: http://de.cyverse.org/dl/d/3C45B913-612F-4B97-8F44-8021470AE527/heatvscontrol_down_func.txt

Secondary Structure Prediction¶
- We will use one of the differentially expressed in tomato pollen transcriptome under head stress.
I have retrive the amino acid sequences for Solyc06g050510 from SolGen website.
MKRHIHYNAHPIDPHPFEAFWYGSWQAVERLRINMGTITTHVLVDGEVIEENIPVTNLRMRSRKATLSDCAC FLRPGLEVCVLSIPYQGENSGDEKDVKPVWIDGKIRSIERKPHELTCTCKFHVSVYVTQGPPPILKKTLSKE IKMLPIDQIAVLQKLEPKPCENKRYRWSSSEDCNSLQTFKLFIGKFSSDLTWLMTASVLKEATFDVRSIHNQ IVYEIVDDDLVRKETNSNQHSYSVNFKLEGGVQTTTVIQFNRDIPDINSTSDLSESGPLVLYDLMGPRRSKR RFVQPERYYGCDDDMAEFDVEMTRLVGGRRKVEYEELPLALSIQADHAYRTGEIEEISSSYKRELFGGNIRS HEKRSSESSSGWRNALKSDVNKLADKKSVTADRQHQLAIVPLHPPSGTGLTVHEQVPLDVDVPEHLSAEIGE IVSRYIHFNSSSTSHDRKASKMNFTKPEARRWGQVKISKLKFMGLDRRGGTLGSHKKYKRNTTKKDSIYDIR SFKKGSVAANVYKELIRRCMANIDATLNKEQPPIIDQWKEFQSTKSSQRESGDHLAMNRDEEVSEIDMLWKE MELALASCYLLDDSEDSHAQYASNVRIGAEIRGEVCRHDYRLNEEIGIICRLCGFVSTEIKDVPPPFMPSSN HNSSKEQRTEEATDHKQDDDGLDTLSIPVSSRAPSSSGGGEGNVWALIPDLGNKLRVHQKRAFEFLWKNIAG SIVPAEMQPESKERGGCVISHTPGAGKTLLIISFLVSYLKLFPGSRPLVLAPKTTLYTWYKEVLKWKIPVPV YQIHGGQTFKGEVLREKVKLCPGLPRNQDVMHVLDCLEKMQMWLSQPSVLLMGYTSFLTLTREDSPYAHRKY MAQVLRQCGLLILDEGHNPRSTKSRLRKGLMKVNTRLRILLSGTLFQNNFGEYFNTLTLARPTFVDEVLKEL DPKYKNKNKGASRFSLENRARKMFIDKISTVIDSDIPKKRKEGLNILKKLTGGFIDVHDGGTSDNLPGLQCY TLMMKSTTLQQEILVKLQNQRPIYKGFPLELELLITLGAIHPWLIRTTACSSQYFKEEELEALQKFKFDLKL GSKVKFVMSLIPRCLLRREKVLIFCHNIAPINLFLEIFERFYGWRKGIEVLVLQGDIELFQRGRIMDLFEEP GGPSKVMLASITTCAEGISLTAASRVILLDSEWNPSKSKQAIARAFRPGQDKVVYVYQLLATGTLEEEKYKR TTWKEWVSSMIFSEDLVEDPSHWQAPKIEDELLREIVEEDRATLFHAIMKNEKASNMGSLQE
Copy and paste the amino acid sequence in the box and label the sequence.
- We will use the previous submitted results:
http://bioinf.cs.ucl.ac.uk/psipred/result/e3f48c8e-28ff-11e7-879a-00163e110593
Tertiary Structure Prediction¶
Frist find a structure similar to above sequence in PBD. We will use DELTA BLAST to search PBD.
2. Click on the first significant hit to access the PDB. In case, you don’t have BLAST results, use the following link to access the previous results. Link
- RCSB provides curated content of PDB and use PDB ID: 1Z3I to visualize the protein in RCSB.
4. Perform a multiple sequence alignment to find conserved sequences. |
:a:. Retrive sequence from databank
![]()
https://github.com/ajwije/2017_spring_Bioinfo_class/blob/master/rad54.fasta
:c:. Use Tcoffee server and align the sequences using structural information:
You can download crystal structure information from PDB in Cn3 format.
Download Cn3D software from NCBI and install it on your computer.
Open above Cn3 file using the Cn3D software.
Go to sequence viewer
Under view, select find pattern:
Copy a conserved region from multiple sequence alignment in the search window and click OK:
You will see conserved region displayed on the crystal structure.

The Delta-Delta Ct Method¶
Delta-Delta Ct method or Livak method is the most preferred method for qPCR data analysis. However, it can only be used when certain criteria are met. Please refer the lecture notes to make sure that these criteria are fulfilled. If not, more generalized method is called Pfaffl method. Please read the additional reading material to get more information about this method.
Here are the steps for Livak method:
The Excel file with all the calculation are in the qPCR analysis folder on Blackboard.
You have raw Ct (number of cycles that takes to reach threshold) for normal and tumor cells (3 replicates for each).
Normalization¶
First, you will need calculate relative difference between the gene of interest (p53) and the house keeping gene (GAPDH).
∆Ct = Ct (gene of interest) – Ct (housekeeping gene)
Average of the control samples (normal cells)¶
As we compare our tumor (treatment) to control (normal cells), first we need to average the ∆Ct for the 3 control (normal) samples.
Calculate the ∆∆Ct relative to the average of ∆Ct normal cells¶
∆∆Ct = ∆Ct (Tumor sample) – ∆Ct (normal average)
You can do this normal samples as well. Use $ signs infront of column number and raw letter (arrows) to fix the cell.
Fold gene expression for each sample¶
Make sure you raise the negative ∆∆Ct to power of two.
Fold gene expression = 2^-(∆∆Ct)
Overall fold change¶
You can calculate average fold change for both tumor and normal samples. Ratio between these two the fold change between tumor and normal samples.
Log transformation¶
To perform parametric statistical tests such as T-test, it advised to transform the final gene expression results to log values (any log base). This would make data distribution symmetric.
Here we have change the 2^-(∆∆Ct) to log 10.
T-test¶
Need to be careful when using parametric tests if data is not normally distributed, it would lead to erroneous conclusions.
Select log 10 of 2^-(∆∆Ct) values for Normal and tumor samples as indicated. Use two tail test (number 2) and assuming unequal variance (3).
Resulting P value is less than 0.05 and therefore, we reject the null hypothesis and two sample means are significantly different at 0.05 level.
. module:: Getting Data into Galaxy
synopsis:

Getting data into Galaxy¶
Step 1: Login into Galaxy¶
Click on the following link to go to the ASU Galaxy site:
https://orpheus.cs.astate.edu/
Use your ASU username and password to login to Galaxy.
Step 2: Getting data¶
Data we are using for this analysis came from Loraine et al, 2015 study. In the original study, there are 10 samples (Five Controls and heat treated). Here we are using only 3 samples for each group (3 control and 3 heat treated). These files were downloaded from NCBI’s Short Read Archive (SRA) using SRA toolkit.
Use the following links to get data. Each link is one data file.
https://de.cyverse.org/dl/d/2AB5824F-73BA-4C6B-8530-457609F632BA/Control2_1.fastq
https://de.cyverse.org/dl/d/46E690E4-C4A0-495F-9B11-F12AD9A25EE3/Control2_2.fastq
https://de.cyverse.org/dl/d/7FEE6359-24AE-478D-A0B1-C6D2CA09E45E/Control3_1.fastq
https://de.cyverse.org/dl/d/8FBB264D-F0CA-4F2C-821A-DB1C709315B2/Control3_2.fastq
https://de.cyverse.org/dl/d/9A45E994-2CDC-4643-AC4C-45C9625138F6/Heat1_1.fastq
https://de.cyverse.org/dl/d/46093383-493A-4D4E-A607-D3E56916DF59/Heat1_2.fastq
https://de.cyverse.org/dl/d/9668B243-7009-4AD3-BBDA-350D6A60119D/Heat2_1.fastq
https://de.cyverse.org/dl/d/FE1C3CC3-9133-4244-BCBB-816B8D2D5F97/Heat2_2.fastq
https://de.cyverse.org/dl/d/D635B6EE-BE26-4BC4-A058-3E51B1AA69C4/Heat3_1.fastq
https://de.cyverse.org/dl/d/F88561AF-CFF2-4FC8-B6B4-D8623779BB24/Heat3_2.fastq
Step 2: Copy one of the links above. Click on the Paste/Fetch icon and paste link in the box. Click on start.¶
Step 3: One the data is uploaded, they will appear in the right hand panel. You can use the pencil icon to change the name.¶

FastQC analysis using Galaxy¶
Step 1: Login into Galaxy¶
First login to your Cyverse account using your name and password.

Then, go to your DE account.


Adapter and quality trimming using Cutadapt¶
We are going to use Trim-galore to trim adapters, and poor quality bases. This tool has several advantages. It allows selection multiple files. You can also select both forward and reverse reads. If you want to read more about Trim-galore, please visit their website. Also, Trim-galore is a wrapper for Cutadapt , which is the actual tool that performs the trimming.
Please follow the tutorial carefully.
Step 1: Launching Cutadapt and performing the analysis¶
- Type Trim-galore in the search box on top left hand corner.
Select paired-end and select the two paired-end files are shown below. Use Illumina universal adapter to trim.
Click on the advance settings. Set the parameters as indicated in blue arrows.
- Launch the analysis.

Adapter and quality trimming using trim-galore¶
We are going to use Trim-galore to trim adapters, and poor quality bases. This tool has several advantages. It allows selection multiple files. You can also select both forward and reverse reads. If you want to read more about Trim-galore, please visit their website. Also, Trim-galore is a wrapper for Cutadapt , which is the actual tool that performs the trimming.
Please follow the tutorial carefully.
Step 2: Selecting input files¶
As indicated in the figure: 1. Is this library paired- or single-end? “Paired-end”
Adapter sequence to be trimmed: “Select Illumina universal”
Step 3: Advance settings¶
From “Trim Galore! advanced settings” select “Full parameter list”
Set “Overlap with adapter sequence required to trim a sequence” to 6.
Run the analysis.

Use Splice aware aligner, Tophat2 to align short reads¶
We are using Paired end reads and set the “Is this library mate-paired?” pulldown to “Paired-end”, then a second pulldown will appear to specify the 2nd FASTQ.
“Mean Inner Distance between Mate Pairs” value for this parameter should obtained from the person incharge of the sequencing.
Mean Inner Distance between Mate Pairs = length of the Fragments used for sequencing – (Length of Illumina adapters (often 120bp) + part sequenced (76+76))
Genome should be obtained from the SolGenome.net (ftp://ftp.solgenomics.net/tomato_genome/assembly/build_3.00/) and select it from the history.
This library has been prepared to preserve the strandedness of the RNAs.
Minimum and maximum intron lengths should be changed according to genome used.
Change the intron lengths for split reads as well.
Output files:¶
- accepted_hits (BAM, BAI)
- Two binary files: .BAM (data) and .BAI (index)
3. These are the actual paired reads mapped to their position on the genome, and split across exon junctions. This can be visualized in IGV, IGB or UCSC, but you must download both .BAM and .BAI files to the same directory. splice_junctions (BED)
- BED file (list of genomic locations, no sequence) listing all the places TopHat had to split a read into two pieces to span an exon junction. This can be visualized at UCSC or in IGV, etc.
- deletions (BED) (if indel search is on)
- insertions (BED) (if indel search is on)

Use Htseq to counts reads mapped to features¶
Use Htseq to counts the reads aligned to exons on the genes. Change the parameters as indicated in red arrows.
Change the intron lengths for split reads as well.

Use Kellisto to map reads to cDNA and count¶
Kellisto is an ultrafast alignment-free quantification tool.
- Change parameters as indicated.
2. cDNA file can be obtained from the Solgenome.net(ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.0_release/ITAG3.0_cDNA.fasta) |
- Once you get the results, click on the “eye” icon on the history pane. Then click on the “disk” icon on the left bottom left-hand corner to download the data into your computer.
- Open the download file with Excel.
5. In Excel, click on “Data” and then click on “Text to Column”. Separate column using tab to separate data into columns. Sum the numbers in “est_counts” using Auto Sum function.

Setup instructions (This is from Data Carpentry (http://www.datacarpentry.org/R-genomics/))¶
R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. You need to install R before you install RStudio. After installing both programs, you will need to install the tidyverse package from within RStudio. Follow the instructions below for your operating system, and then follow the instructions to install tidyverse and RSQLite.
Windows¶
If you already have R and RStudio installed¶
Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio. To check which version of R you are using, start RStudio and the first thing that appears in the console indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it. You can check here for more information on how to remove old versions from your system if you wish to do so.
If you don’t have R and RStudio installed¶
Download R from the CRAN website. Run the .exe file that was just downloaded Go to the RStudio download page Under Installers select RStudio x.yy.zzz - Windows XP/Vista/7/8 (where x, y, and z represent version numbers) Double click the file to install it Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
macOS¶
If you already have R and RStudio installed¶
Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio. To check the version of R you are using, start RStudio and the first thing that appears on the terminal indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it.
If you don’t have R and RStudio installed¶
Download R from the CRAN website. Select the .pkg file for the latest R version Double click on the downloaded file to install R It is also a good idea to install XQuartz (needed by some packages) Go to the RStudio download page Under Installers select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers) Double click the file to install RStudio Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
Using DEseq and EdgeR to find differentially expressed genes¶
The first step is to merge all count data files we got from the Htseq. Use the Join two data sets side-by-side on Galaxy and select output from Control2 and Control3 samples. Use Column one to join the data sets. After this is complete, take the resulting file, and combine with Temperate1 output. Repeat this for the next two data sets.
The resulting file would like this. Your actual number may be different, but should have 10 columns.
Download this file onto your computer and move it to folder called “Counts”. Rename file “counts.tabular”.
Open Rstudio and go to Session and select “Set Working Directory” and chose the folder that you just created.
In the console, you will see the following message and the part underline in red is the path to your directory.
Replace your path in this portion of the following code “/Users/aselawijeratne/Desktop”. Execute this to read file into R.
d1=read.delim('/Users/aselawijeratne/Desktop/Counts/counts.tabular', header = FALSE)
Format data using R.
#select only column with data
d1 <- d1[-c(3, 5, 7, 9)]
#Name the columns
colnames(d1) <- c("gene_names", "C2", "C3", "T1", "T2", "T3")
#get rid of the mRNA part infront of the gene name
row_names <- gsub("mRNA:", "", d1$gene_names)
#remove the last trailing ".1" from gene names
row_names <- gsub('.{2}$', '', row_names)
#assign row_name vector to the row names of the data.
row.names(d1) <- row_name
#remove unformated gene names.
d1$gene_names <- NULL
Import necessary libraries.
library(edgeR)
library(DESeq2)
Filter data.
#Filter data with rowsum < 10
d1$rowsum <- rowSums(d1)
#Low count filtered
d1_filterd <- d1[d1$rowsum > 10, ]
Create a group file and normalize data using EdgeR.
conds <- c(rep("C", 2), rep("T", 3))
y <- DGEList(counts=d1_filterd, group=conds, remove.zeros=TRUE) # Constructs DGEList object
dge=calcNormFactors(y)
A multi-dimensional scaling (MDS) plot to see the similarity among samples.
# color for controls
cn.color='blue'
# color for treatments
tr.color='brown'
# define a title for the plot
main='MDS Plot for Count Data'
#par(las=1) # makes y axis labels horizontal not vertical
colors=c(rep(cn.color,2),rep(tr.color,3))
plotMDS(dge,main=main,labels=colnames(dge$counts),
col=colors,las=1)
Hierarchical clustering can also be used to check how different samples are. As you can see, sample T2 is very different from the rest.
>normalized.counts=cpm(dge)
>transposed=t(normalized.counts) # transposes the counts matrix
>distance=dist(transposed) # calculates distance
>clusters=hclust(distance) # does hierarchical clustering
>plot(clusters) # plots the clusters as a dendrogram
Differential expression analysis
y <- estimateCommonDisp(y) # Estimates common dispersion
y <- estimateTagwiseDisp(y) # Estimates tagwise dispersion
et <- exactTest(y, pair=c("C", "T")) # Computes exact test for the negative binomial distribution.
topTags(et, n=4)
Convert data into a dataframe and use dfr and fold change to select genes.
edge <- as.data.frame(topTags(et, n=50000))
edge2fold <- edge[edge$logFC >= 1 | edge$logFC <= -1,]
edge2foldpadj <- edge2fold[edge2fold$FDR <= 0.01, ]
DEseq analysis¶
Create matrix for DESeq2 and prepare data for DEseq
d1_matrix <- data.matrix(d1_filterd) #Create matrix for DESeq2
genotype <- data.frame(Genotype=conds) # create a dataframe for groups
dse <- DESeqDataSetFromMatrix(countData = d1_matrix,
colData = genotype,
design = ~ Genotype)
colData(dse)$Genotype <- factor(colData(dse)$Genotype,
levels=c("C","T"))
Differential expression analysis
dse <- DESeq(dse)
dse <- DESeq(dse)
ddsLocal <- estimateDispersions(dse, fitType="local", maxit =500)
ddsLocal <- nbinomWaldTest(ddsLocal)
res <- results(ddsLocal)
Order the data using p-adjusted value.
res <- res[order(res$padj),]
head(res)
res <- na.omit(res)
Use dfr and fold change to select genes.
deseq_2fold <- res[res$log2FoldChange >= 1 | res$log2FoldChange <= -1,]
deseq_2foldpadj <- data.frame(deseq_2fold[deseq_2fold$padj <= 0.01, ])
bothDF <- merge(deseq_2foldpadj, edge2foldpadj, by="row.names", all.x=TRUE)
Writing results files. You can open “edgr_deseq2.txt” file in Excel if you want to look at it.
write.table(as.data.frame(bothDF), file="edgr_deseq2.txt", sep="\t", quote = F, row.names=F, col.names=TRUE)
Combine DESeq and EdgR to make Venn diagram¶
Count how many gene from each analysis and make a Venn diagram. You need to have overLapper.R file (down loaded from Bb) in Counts folder.
source("overLapper.R")
setlist <- list(edgeR=rownames(edge2foldpadj), DESeq=rownames(deseq_2foldpadj))
OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets")
counts <- sapply(OLlist$Venn_List, length)
vennPlot(counts=counts)

GOseq analysis¶
1. Ge the up and down regulated gene list. “bothDF” is the dataframe that contains both up and down-regulated genes from both EdgeR and DEseq2.
bothDF_down <- bothDF[bothDF$log2FoldChange <= -1,]
bothDF_up <- bothDF[bothDF$log2FoldChange >= 1,]
Convert to these dataframes into table with True or False values. Write the table to local directory.
DE_list_boolup <- as.data.frame(row.names(d) %in% bothDF_up$Row.names,
row.names = row.names(d))
DE_list_booldown <- as.data.frame(row.names(d) %in% bothDF_down$Row.names,
row.names = row.names(d))
write.table(DE_list_boolup,file="DE_goseq_up.txt",row.names=T,sep='\t',quote=F,
col.names = F)
write.table(DE_list_booldown,file="DE_goseq_down.txt",row.names=T,sep='\t',quote=F,
col.names = F)
Perform the GOseq analysis in Galaxy. You will need to perform the analysis for up and down regulated genes separately.
- Combine gene descriptions with up and down regulated genes. You can get the S_lycopersicum_Feb_2014.bed file from the Dropbox link on Bb.
annots_file <- 'S_lycopersicum_Feb_2014.bed'
# keep gene id and gene description columns
annots <- read.delim(annots_file,sep='\t',header=F)[,13:14]
# name the columns
names(annots) <- c('gene','description')
# combine gene expression and annotations
bothDF_genedesc <- merge(bothDF,annots, by.x = "Row.names",by.y='gene')
- To order your data using FDR, you use the following command in R.
bothDF_genedesc <- bothDF_genedesc[order(bothDF_genedesc$FDR), ]
