BIOINFORMATICS AND APPLICATIONS BIO 4033/ BIO 5033

_images/UnivLogo_Stack_2C_Dark.png

Multiple Sequence Alignments

Getting started with MEGA

Note

There is an excellent tutorial on the MEGA site and this is excerpt of the tutorial for the exercise.

How to make an alignment using MEGA

Step 1:Open MEGA software and you will see a screen like in the following figure:
_images/mega_1.png
Step 2:Click on the small arrow on the “Align” tab
Step 3:Click on ‘Edit/Build alignment’
_images/mega_2.png
Step 4:Select a new alignment.
_images/mega_3.png
Step 4:Select protein
_images/mega_4.png
Step 5:From file open, select “seq_align2.fasta” and open file.
_images/mega_5.png
Step 6:From Edit, select all sequences. To do an alignment using Muscle, click on Muscle tab.
_images/mega_7b.png
Step 7:Use default options and perform an alignment. To learn more about the options, go to the MEGA manual.

Edit the alignments

Step 1:You can insert gaps, delete blocks, and delete residues.

Be very careful when you edit a sequence alignment. It should be biologically meaningful.

_images/mega_7.png

Exporting MSA

Mega allows to export the MSA in different formats.

_images/UnivLogo_Stack_2C_Dark.png

Steps of building a tree

Make multiple sequence alignment for Globin gene family

Step 1:Download globin.fasta from Blackboard and perform a MSA using MUSCLE (follow the steps we discussed last week).
Step 2:Examine the alignment to make sure it is correct and no additional editing is needed.
Step 2:Export the alignment as a fasta format file on your Desktop. Name it as globin_align

Find informative sites for Parsimony

Step 1:Open the alignment file you just created by going to using Open file/Session under Data
_images/mega_8.png
Step 2:Go to Explore Active Data under Data
Step 3:Click on Pi button and this show site that are informative for Parsimony
_images/mega_9.png

Building Phylogenetic trees

Step 1:Click on Phylogeny
_images/mega_10.png
Step 2:

Make Neighbor-Joining tree with Bootstrap 500 replicates

A:What relationships can you see in the tree?
B:What can you say about the statistical support for each relationship?
C:What should be the out-group?
_images/mega_11.png
Step 3:

Save the tree as a pdf file by clicking on Image button

Step 4:

Build a tree using Parsimony method with 50 Bootstrap replicates (500 will be very slow).

A:What relationships can you see in the tree?
B:What can you say about the statistical support for each relationship?
C:Do you see the same relationships that you saw with NJ tree?
_images/UnivLogo_Stack_2C_Dark.png

Steps of building a tree (Part II)

Make multiple sequence alignment for Globin gene family

Step 1:Download globin.fasta from Blackboard and perform a MSA using MUSCLE (follow the steps we discussed last week).
Step 2:Examine the alignment to make sure it is correct and no additional editing is needed.
Step 3:Export the alignment as a fasta format file on your Desktop. Name it as globin_align



Find the best substitution model

Step 1:

Calculate the distance using different substitution models :a: Select Distance and then Compute Pairwise distance :b: Calculate distance using the following methods

i:No. of Differences
ii:p-distances
iii:Poisson model


_images/mega_13.png


Step 2:

Use the same alignment file and build three NJ trees using different substitution models:

a:No. of Differences
b:p-distances
c:Poisson model


_images/mega_12.png


Step 3:Best model based on ProtTest

Building Phylogenetic trees

Step 1:Click on Phylogeny

_images/mega_10.png

Step 2:

Make Neighbor-Joining tree with Bootstrap 500 replicates

A:What relationships can you see in the tree?
B:What can you say about the statistical support for each relationship?
C:What should be the out-group?


_images/mega_11.png


Step 3:Save the tree as a pdf file by clicking on Image button


Step 4:

Build a tree using Parsimony method with 50 Bootstrap replicates (500 will be very slow).

A:What relationships can you see in the tree?
B:What can you say about the statistical support for each relationship?
C:Do you see the same relationships that you saw with NJ tree?


Step 5:Bayesian inference of phylogeny

Follow this link to MrBayes online server

A:Use the same alignment file
B:In MrBayes select Poisson amino acid model with equal rates of substitution.
C:Select prior parameters (e.g. equal, fixed frequencies for the states; equal probability for all topologies; unconstrained branch lengths).
D:Run 1,000,000 trials for Monte Carlo Markov Chain estimation of the posterior distribution.
E:Obtain phylogram
F:Export tree files
G:View in MEGA software


_images/mega_15.png


_images/mega_14.png
_images/UnivLogo_Stack_2C_Dark.png

FastQC analysis using Cyverse Discovery Environment (DE)

Data we are using for this analysis came from Loraine et al, 2015 study. In the original study, there are 10 samples (Five Controls and heat treated). Here we are using only 3 samples for each group (3 control and 3 heat treated). These files were downloaded from NCBI’s Short Read Archive (SRA) using SRA toolkit.

First step of the data analysis is to check the quality of the sequences. For this purpose, we are using the FastQC tool on Cyverse DE.

Step 1: Login into Cyverse DE

First login to your Cyverse account using your name and password.

_images/iplant_1.png

Then, go to your DE account.

_images/iplant2.png

Step 3: Performing FastQC analysis:


  1. Click on “Apps” button.

    _images/fastqc_9.png
  2. Type “fastqc” in the search window and select the app shown in red arrow.

    _images/fastqc_10.png

  1. Follow the direction as in the figure to select the folder where your results will be saved.

    Then, click on the small downward arrow (black circle).

    _images/fastqc_11.png
  2. Click on “+” sign to select the fastq files.

    _images/fastqc_12.png

  1. Go to the folder where you have your fastq files and select them as indicated in the

    figure below. Then launch the analysis. Once the analysis is complete, you will be notified via email.

    _images/fastqc_13.png


Reference:

Loraine AE, Blakley IC, Jagadeesan S, Harper J, Miller G, Firon N. Analysis and Visualization of RNA-Seq Expression Data Using RStudio, Bioconductor, and Integrated Genome Browser. Methods Mol Biol. 2015;1284:481-501. doi: 10.1007/978-1-4939-2444-8_24. PubMed PMID: 25757788.

_images/UnivLogo_Stack_2C_Dark.png

Relaunching a stalled analysis in Cyverse Discovery Environment

If your analysis is appeared to be stalled, you could try restarting it.

Step 1: Click on the message icon

_images/restart_1.png

Step 2: Click on the analysis that appears to be stalled

_images/restart_3.png

Step 3: Check the small box and click on analysis

_images/restart_4.png

Step 4: Click on the relaunch button

_images/restart_5.png

Once the analysis window appears, launch the analysis.


_images/UnivLogo_Stack_2C_Dark.png

Adapter and quality trimming using trim-galore

We are going to use Trim-galore to trim adapters, and poor quality bases. This tool has several advantages. It allows selection multiple files. You can also select both forward and reverse reads. If you want to read more about Trim-galore, please visit their website. Also, Trim-galore is a wrapper for Cutadapt , which is the actual tool that performs the trimming.

Please follow the tutorial carefully.

Step 1: Launching Trim-galore

  1. Click on App.

  1. In the finder window type “trim-galore”

  1. Select “trim-galore-0.4.1”.

    _images/trim_1.png

Step 2: Selecting output folder

As indicated in the figure: 1. Name your analysis as you want


  1. Select the output folder where your analysis is going to be

  1. Click on “Paired end Input fastq files”

    _images/trim_2.png

Step 3: Selecting input files

  1. Click on the Green “+” sign.

    _images/trim_3.png

2. Navigate to the folder where your samples are located. Select only the first read files. Click “OK”.

_images/trim_4.png

  1. You should all your first read files selected like this.

    _images/trim_5.png

  1. Scroll down and click on the “+” below “Fastq file(s) (Read 2 of paired end reads):”

    _images/trim_6.png

  1. Select the read two files as above. You will see them in the box as in the figure below.

    _images/trim_7.png

6. Scroll down and check box beside “Paired (Select this option for paired-end files)” to indicate these are paired end reads.

very important

_images/trim_8.png

  1. Click on “Parameters” as indicated in the above figure.

  1. Set the parameters as indicated in the figure:

  1. Use Fred 20 as quality trimming cut off (this is the default).

  1. Copy and paste the following adapter sequence for in the box below “Adapter sequence to be trimmed:”

AATGATACGGCGA

  1. Copy and paste the following adapter sequence for in the box below “Adapter2”

CAAGCAGAAGACGG

  1. Set the stringency to 6.

    _images/trim_9.png

e. Scroll down and set the length as 40. Any sequence become shorter than this length during the trimming will be discarded. | f. Launch the analysis.

_images/trim_10.png
_images/UnivLogo_Stack_2C_Dark.png

Mapping short reads

If you are using genome as the reference for RNAseq reads, you will need to use a splice-aware aligner like Tophat2. If you are using cDNA as the reference, you can use a general purpose aligner like Bowtie2.

You need to do only one of the procedures based on what your group have been assigned to.

Step 1: Mapping with Tophat2

  1. Click on App.

  1. In the finder window type “Tophat”

  1. Select “Tophat2-PE”.

    _images/tophat_1.png

  1. As indicated in the figure, Name your analysis as you want.

  1. Select the output folder where your analysis is going to be.

  1. Click on “Input data”

    _images/tophat_2.png

  1. Click on the Green “+” sign.

8. Navigate to the folder where your samples are located. Select only the first read files. Click “OK”. You can select all three of your first read files.

_images/tophat_3.png

  1. Scroll down and click on the “+” below “Fastq file(s) (Read 2 of paired end reads):”

  1. Select “Reference Genome” and select the tomato genome sequence as input.

    _images/tophat_4.png

  1. Make sure quality is Sanger and leave rest of the default values as they are. Launch the analysis.

    _images/tophat_5.png


Step 2: Mapping with Bowtie2

  1. Click on App.

  1. In the finder window type “Bowtie”.

  1. Select Bowtie app indicated in the figure.

    _images/bowtie_1.png

  1. As indicated in the figure, Name your analysis as you want.

  1. Select the output folder where your analysis is going to be.

    _images/bowtie_2.png

  1. Click on “Input”

8. Navigate to the folder where your samples are located. Select first and second read files. You can only input one sample at a time.


  1. You need to name your output file carefully. For e.g., if it is heat1 sample, name the output as heat1.sam.

    _images/bowtie_4.png

  1. Select “Reference Index” and select the tomato cDNA sequence as input.

    img/bowtie_3.png

  1. Select options. Set “Minimum fragment length” as 100 and “Maximum fragment length” as 600. Launch the analysis.

    _images/bowtie_5.png
_images/UnivLogo_Stack_2C_Dark.png

Counting mapped reads

To get the number of reads mapped to a reference sequences (in this case, predicted tomato cDNA sequences), we can use Samtools. Bowtie2 output is in sam format and first, we need to convert the output files into sorted bam files.

  1. Type Samtools in app finding window.
_images/samtools_1.png
  1. Select “SAM to sorted BAM”
_images/samtools_2.png
  1. Select Bowtie2 output files (SAM format).
_images/samtools_3.png

4. Above will create sorted bam file. You will need to use this as the input for the Samtools Flagstat, which will count the number of mapped reads.

_images/samtools_4.png

_images/samtools_5.png

** You can get the flagstat for all six files from following link.** | https://github.com/ajwije/2017_spring_Bioinfo_class/blob/master/Files/flagstat.txt

I have used the following bash command to count mapped reads in case you are interested in it doing programmatically.

# navigate to where your sam files are and execute the following commands.

for i in *.sam

do
#extract the file name without extension and print it to the screen


echo ${i%.sam}

#covert sam to sorted bam
samtools view -bS $i | samtools sort - -o ${i%.sam}_sorted.bam

#getting flagstat and write it an output file for each bam file.
samtools flagstat ${fname%.sam}_sorted.bam >> flagstat.txt

done
_images/UnivLogo_Stack_2C_Dark.png

Differential gene expression analysis

Link for Bowtie mapped counts http://de.cyverse.org/dl/d/E9B4C299-D6CB-4656-A4F6-FF67240AEA49/170407_bowtie_counts.txt

Target file for bowtie mapped reads: http://de.cyverse.org/dl/d/BECB62C3-A369-4084-9BC9-2BFD9E6E9600/bowtie_target.txt

DESeq tutorial:

Tutorial link

Steps to perform DEseq analysis

  1. From Apps select “DEseq (Multifactorial Comparison)

    _images/DEseq_1.png

  1. Name your analysis and select a folder where your results need to be saved.

    _images/DEseq_2.png

  1. Select the correct target file and the count file.

    _images/DEseq_3.png

4. Give a name to the project. Reference biological condition should be “control” samples. Variable of interest is “group” (Column header of the third column of the target file).

_images/DEseq_4.png

  1. Set the significant threshold to 0.05 and launch the analysis.

    _images/DEseq_5.png

DE gene list

I have used the following R code to merge the DE genes list and the functions.

library(reshape2)
library(readr)

# Used the terminal command to grep the fasta headers and wrote it to a file called "ITAG3_10_cDN_names.txt"
#imported this file to Rstudio

# Removed the ">" sign
ITAG3_10_formated_names <- as.data.frame(sapply(ITAG3_10_cDN_names, gsub, pattern = ">", replacement = "" ))

#Seperate gene ids and description using space as delimiter
ITAG3_10_formated_names <- data.frame(colsplit(ITAG3_10_formated_names$X1, " ", c("Id", "Description")))

#imported up regulated genes to Rstudio and merge with the above file using gene ids.
heatvscontrol_up_func <- merge(heatvscontrol_up, ITAG3_10_formated_names,
                       by.x = "Id",
                       by.y = "Id")

#write output
write.table(x = heatvscontrol_up_func, file = "heatvscontrol_up_func.txt", quote = FALSE, sep = "\t", row.names = FALSE)


#imported down regulated genes to Rstudio and merge with the above file using gene ids
heatvscontrol_down_func <- merge(heatvscontrol_down, ITAG3_10_formated_names,
                       by.x = "Id",
                       by.y = "Id")
#write output
write.table(x = heatvscontrol_down_func, file = "heatvscontrol_down_func.txt", quote = FALSE, sep = "\t", row.names = FALSE)

Up-regulated gene list: http://de.cyverse.org/dl/d/E641698E-8688-4C20-B829-0B12BABC8ABB/heatvscontrol_up_func.txt

Down-regulated gene list: http://de.cyverse.org/dl/d/3C45B913-612F-4B97-8F44-8021470AE527/heatvscontrol_down_func.txt

_images/UnivLogo_Stack_2C_Dark.png

Secondary Structure Prediction

  1. We will use one of the differentially expressed in tomato pollen transcriptome under head stress.

I have retrive the amino acid sequences for Solyc06g050510 from SolGen website.


MKRHIHYNAHPIDPHPFEAFWYGSWQAVERLRINMGTITTHVLVDGEVIEENIPVTNLRMRSRKATLSDCAC FLRPGLEVCVLSIPYQGENSGDEKDVKPVWIDGKIRSIERKPHELTCTCKFHVSVYVTQGPPPILKKTLSKE IKMLPIDQIAVLQKLEPKPCENKRYRWSSSEDCNSLQTFKLFIGKFSSDLTWLMTASVLKEATFDVRSIHNQ IVYEIVDDDLVRKETNSNQHSYSVNFKLEGGVQTTTVIQFNRDIPDINSTSDLSESGPLVLYDLMGPRRSKR RFVQPERYYGCDDDMAEFDVEMTRLVGGRRKVEYEELPLALSIQADHAYRTGEIEEISSSYKRELFGGNIRS HEKRSSESSSGWRNALKSDVNKLADKKSVTADRQHQLAIVPLHPPSGTGLTVHEQVPLDVDVPEHLSAEIGE IVSRYIHFNSSSTSHDRKASKMNFTKPEARRWGQVKISKLKFMGLDRRGGTLGSHKKYKRNTTKKDSIYDIR SFKKGSVAANVYKELIRRCMANIDATLNKEQPPIIDQWKEFQSTKSSQRESGDHLAMNRDEEVSEIDMLWKE MELALASCYLLDDSEDSHAQYASNVRIGAEIRGEVCRHDYRLNEEIGIICRLCGFVSTEIKDVPPPFMPSSN HNSSKEQRTEEATDHKQDDDGLDTLSIPVSSRAPSSSGGGEGNVWALIPDLGNKLRVHQKRAFEFLWKNIAG SIVPAEMQPESKERGGCVISHTPGAGKTLLIISFLVSYLKLFPGSRPLVLAPKTTLYTWYKEVLKWKIPVPV YQIHGGQTFKGEVLREKVKLCPGLPRNQDVMHVLDCLEKMQMWLSQPSVLLMGYTSFLTLTREDSPYAHRKY MAQVLRQCGLLILDEGHNPRSTKSRLRKGLMKVNTRLRILLSGTLFQNNFGEYFNTLTLARPTFVDEVLKEL DPKYKNKNKGASRFSLENRARKMFIDKISTVIDSDIPKKRKEGLNILKKLTGGFIDVHDGGTSDNLPGLQCY TLMMKSTTLQQEILVKLQNQRPIYKGFPLELELLITLGAIHPWLIRTTACSSQYFKEEELEALQKFKFDLKL GSKVKFVMSLIPRCLLRREKVLIFCHNIAPINLFLEIFERFYGWRKGIEVLVLQGDIELFQRGRIMDLFEEP GGPSKVMLASITTCAEGISLTAASRVILLDSEWNPSKSKQAIARAFRPGQDKVVYVYQLLATGTLEEEKYKR TTWKEWVSSMIFSEDLVEDPSHWQAPKIEDELLREIVEEDRATLFHAIMKNEKASNMGSLQE


  1. Point your browser to.

  1. Copy and paste the amino acid sequence in the box and label the sequence.

    _images/protein_struct_1.png

  1. We will use the previous submitted results:

http://bioinf.cs.ucl.ac.uk/psipred/result/e3f48c8e-28ff-11e7-879a-00163e110593

Tertiary Structure Prediction

  1. Frist find a structure similar to above sequence in PBD. We will use DELTA BLAST to search PBD.

    _images/protein_struct_2.png

2. Click on the first significant hit to access the PDB. In case, you don’t have BLAST results, use the following link to access the previous results. Link

  1. RCSB provides curated content of PDB and use PDB ID: 1Z3I to visualize the protein in RCSB.

4. Perform a multiple sequence alignment to find conserved sequences. |

:a:. Retrive sequence from databank

_images/protein_struct_3.png
:b:. Selected sequences are in the following fasta file.

https://github.com/ajwije/2017_spring_Bioinfo_class/blob/master/rad54.fasta

:c:. Use Tcoffee server and align the sequences using structural information:

  1. You can download crystal structure information from PDB in Cn3 format.

    _images/protein_struct_4.png
  2. Download Cn3D software from NCBI and install it on your computer.

  3. Open above Cn3 file using the Cn3D software.

  4. Go to sequence viewer

    _images/protein_struct_5.png
  5. Under view, select find pattern:

    _images/protein_struct_6.png
  6. Copy a conserved region from multiple sequence alignment in the search window and click OK:

    _images/protein_struct_7.png
  7. You will see conserved region displayed on the crystal structure.

    _images/protein_struct_8.png
_images/UnivLogo_Stack_2C_Dark.png

The Delta-Delta Ct Method

Delta-Delta Ct method or Livak method is the most preferred method for qPCR data analysis. However, it can only be used when certain criteria are met. Please refer the lecture notes to make sure that these criteria are fulfilled. If not, more generalized method is called Pfaffl method. Please read the additional reading material to get more information about this method.

Here are the steps for Livak method:

The Excel file with all the calculation are in the qPCR analysis folder on Blackboard.

You have raw Ct (number of cycles that takes to reach threshold) for normal and tumor cells (3 replicates for each).

_images/qPCR_1.png

Normalization

First, you will need calculate relative difference between the gene of interest (p53) and the house keeping gene (GAPDH).

∆Ct = Ct (gene of interest) – Ct (housekeeping gene)

_images/qPCR_2.png

Average of the control samples (normal cells)

As we compare our tumor (treatment) to control (normal cells), first we need to average the ∆Ct for the 3 control (normal) samples.

_images/qPCR_3.png

Calculate the ∆∆Ct relative to the average of ∆Ct normal cells

∆∆Ct = ∆Ct (Tumor sample) – ∆Ct (normal average)

You can do this normal samples as well. Use $ signs infront of column number and raw letter (arrows) to fix the cell.

_images/qPCR_4.png

Fold gene expression for each sample

Make sure you raise the negative ∆∆Ct to power of two.

Fold gene expression = 2^-(∆∆Ct)

_images/qPCR_5.png

Overall fold change

You can calculate average fold change for both tumor and normal samples. Ratio between these two the fold change between tumor and normal samples.

_images/qPCR_6.png
_images/qPCR_7.png
_images/qPCR_8.png

Log transformation

To perform parametric statistical tests such as T-test, it advised to transform the final gene expression results to log values (any log base). This would make data distribution symmetric.

Here we have change the 2^-(∆∆Ct) to log 10.

_images/qPCR_9.png

T-test

Need to be careful when using parametric tests if data is not normally distributed, it would lead to erroneous conclusions.

_images/qPCR_9.png

Select log 10 of 2^-(∆∆Ct) values for Normal and tumor samples as indicated. Use two tail test (number 2) and assuming unequal variance (3).

Resulting P value is less than 0.05 and therefore, we reject the null hypothesis and two sample means are significantly different at 0.05 level.

_images/qPCR_11.png

. module:: Getting Data into Galaxy

synopsis:
_images/UnivLogo_Stack_2C_Dark.png

Getting data into Galaxy

Step 1: Login into Galaxy

Click on the following link to go to the ASU Galaxy site:

https://orpheus.cs.astate.edu/

Use your ASU username and password to login to Galaxy.

Step 1: Click on the upload icon on upper left hand corner

_images/getting_data_1.png

Step 3: One the data is uploaded, they will appear in the right hand panel. You can use the pencil icon to change the name.

_images/getting_data_3.png

|Reference:

Loraine AE, Blakley IC, Jagadeesan S, Harper J, Miller G, Firon N. Analysis and Visualization of RNA-Seq Expression Data Using RStudio, Bioconductor, and Integrated Genome Browser. Methods Mol Biol. 2015;1284:481-501. doi: 10.1007/978-1-4939-2444-8_24. PubMed PMID: 25757788.

_images/UnivLogo_Stack_2C_Dark.png

FastQC analysis using Galaxy

Step 1: Login into Galaxy

First login to your Cyverse account using your name and password.

_images/iplant_1.png

Then, go to your DE account.

_images/iplant2.png

Step 3: Performing FastQC analysis:


First step of the data analysis is to check the quality of the sequences. For this purpose, we are using the FastQC tool on Galaxy. a. Type FastQC in the search box on top left hand corner.

_images/galaxy_fastqc_1.png

  1. Select a fastq file and execute the analysis.

    _images/galaxy_fastqc_2.png

_images/UnivLogo_Stack_2C_Dark.png

Adapter and quality trimming using Cutadapt

We are going to use Trim-galore to trim adapters, and poor quality bases. This tool has several advantages. It allows selection multiple files. You can also select both forward and reverse reads. If you want to read more about Trim-galore, please visit their website. Also, Trim-galore is a wrapper for Cutadapt , which is the actual tool that performs the trimming.

Please follow the tutorial carefully.

Step 1: Launching Cutadapt and performing the analysis

  1. Type Trim-galore in the search box on top left hand corner.

  1. Select paired-end and select the two paired-end files are shown below. Use Illumina universal adapter to trim.

    _images/cutadapt_1.png
  2. Click on the advance settings. Set the parameters as indicated in blue arrows.

    _images/cutadapt_2.png

_images/cutadapt_3.png

_images/cutadapt_4.png

_images/cuadapt_5.png
  1. Launch the analysis.
_images/UnivLogo_Stack_2C_Dark.png

Adapter and quality trimming using trim-galore

We are going to use Trim-galore to trim adapters, and poor quality bases. This tool has several advantages. It allows selection multiple files. You can also select both forward and reverse reads. If you want to read more about Trim-galore, please visit their website. Also, Trim-galore is a wrapper for Cutadapt , which is the actual tool that performs the trimming.

Please follow the tutorial carefully.

Step 1: Launching Trim-galore

  1. In the finder window type “trim-galore”

  1. Select “Trim Galore”.

    _images/trim_galore_1.jpg

Step 2: Selecting input files

As indicated in the figure: 1. Is this library paired- or single-end? “Paired-end”


  1. Adapter sequence to be trimmed: “Select Illumina universal”

    _images/trim_galore_2.jpg

Step 3: Advance settings

  1. From “Trim Galore! advanced settings” select “Full parameter list”

    _images/trim_galore_3.jpg
  2. Set “Overlap with adapter sequence required to trim a sequence” to 6.

    _images/trim_galore_4.jpg
  3. Run the analysis.

_images/UnivLogo_Stack_2C_Dark.png

Use Splice aware aligner, Tophat2 to align short reads

  1. We are using Paired end reads and set the “Is this library mate-paired?” pulldown to “Paired-end”, then a second pulldown will appear to specify the 2nd FASTQ.

  2. “Mean Inner Distance between Mate Pairs” value for this parameter should obtained from the person incharge of the sequencing.

  3. Mean Inner Distance between Mate Pairs = length of the Fragments used for sequencing – (Length of Illumina adapters (often 120bp) + part sequenced (76+76))

  4. Genome should be obtained from the SolGenome.net (ftp://ftp.solgenomics.net/tomato_genome/assembly/build_3.00/) and select it from the history.

    _images/galaxy_tophat_1.png
  5. This library has been prepared to preserve the strandedness of the RNAs.

    _images/galaxy_tophat_2.png

  1. Minimum and maximum intron lengths should be changed according to genome used.

    _images/galaxy_tophat_3.png

  1. Change the intron lengths for split reads as well.

    _images/galaxy_tophat_4.png

Output files:

  1. accepted_hits (BAM, BAI)

  1. Two binary files: .BAM (data) and .BAI (index)

3. These are the actual paired reads mapped to their position on the genome, and split across exon junctions. This can be visualized in IGV, IGB or UCSC, but you must download both .BAM and .BAI files to the same directory. splice_junctions (BED)


  1. BED file (list of genomic locations, no sequence) listing all the places TopHat had to split a read into two pieces to span an exon junction. This can be visualized at UCSC or in IGV, etc.

  1. deletions (BED) (if indel search is on)

  1. insertions (BED) (if indel search is on)
_images/UnivLogo_Stack_2C_Dark.png

Use Htseq to counts reads mapped to features

Use Htseq to counts the reads aligned to exons on the genes. Change the parameters as indicated in red arrows.


  1. Change the intron lengths for split reads as well.

    _images/htseq_1.png
_images/UnivLogo_Stack_2C_Dark.png

Use Kellisto to map reads to cDNA and count

Kellisto is an ultrafast alignment-free quantification tool.


  1. Change parameters as indicated.

2. cDNA file can be obtained from the Solgenome.net(ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.0_release/ITAG3.0_cDNA.fasta) |

_images/kellisto_1.png
  1. Once you get the results, click on the “eye” icon on the history pane. Then click on the “disk” icon on the left bottom left-hand corner to download the data into your computer.
  2. Open the download file with Excel.

5. In Excel, click on “Data” and then click on “Text to Column”. Separate column using tab to separate data into columns. Sum the numbers in “est_counts” using Auto Sum function.

_images/UnivLogo_Stack_2C_Dark.png

Setup instructions (This is from Data Carpentry (http://www.datacarpentry.org/R-genomics/))

R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. You need to install R before you install RStudio. After installing both programs, you will need to install the tidyverse package from within RStudio. Follow the instructions below for your operating system, and then follow the instructions to install tidyverse and RSQLite.

Windows

If you already have R and RStudio installed

Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio. To check which version of R you are using, start RStudio and the first thing that appears in the console indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it. You can check here for more information on how to remove old versions from your system if you wish to do so.

If you don’t have R and RStudio installed

Download R from the CRAN website. Run the .exe file that was just downloaded Go to the RStudio download page Under Installers select RStudio x.yy.zzz - Windows XP/Vista/7/8 (where x, y, and z represent version numbers) Double click the file to install it Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

macOS

If you already have R and RStudio installed

Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio. To check the version of R you are using, start RStudio and the first thing that appears on the terminal indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it.

If you don’t have R and RStudio installed

Download R from the CRAN website. Select the .pkg file for the latest R version Double click on the downloaded file to install R It is also a good idea to install XQuartz (needed by some packages) Go to the RStudio download page Under Installers select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers) Double click the file to install RStudio Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

Using DEseq and EdgeR to find differentially expressed genes

The first step is to merge all count data files we got from the Htseq. Use the Join two data sets side-by-side on Galaxy and select output from Control2 and Control3 samples. Use Column one to join the data sets. After this is complete, take the resulting file, and combine with Temperate1 output. Repeat this for the next two data sets.

_images/DE_1.png

The resulting file would like this. Your actual number may be different, but should have 10 columns.

_images/DE_2.png

Download this file onto your computer and move it to folder called “Counts”. Rename file “counts.tabular”.

_images/DE_3.png

Open Rstudio and go to Session and select “Set Working Directory” and chose the folder that you just created.

_images/DE_4.png

In the console, you will see the following message and the part underline in red is the path to your directory.

Replace your path in this portion of the following code “/Users/aselawijeratne/Desktop”. Execute this to read file into R.

d1=read.delim('/Users/aselawijeratne/Desktop/Counts/counts.tabular', header = FALSE)

Format data using R.

#select only column with data
d1 <- d1[-c(3, 5, 7, 9)]

#Name the columns
colnames(d1) <- c("gene_names", "C2", "C3", "T1", "T2", "T3")

#get rid of the mRNA part infront of the gene name
row_names <- gsub("mRNA:", "", d1$gene_names)
#remove the last trailing ".1" from gene names
row_names <- gsub('.{2}$', '', row_names)

#assign row_name vector to the row names of the data.
row.names(d1) <- row_name

#remove unformated gene names.
d1$gene_names <- NULL

Import necessary libraries.

library(edgeR)
library(DESeq2)

Filter data.

#Filter data with rowsum < 10
d1$rowsum <- rowSums(d1)
#Low count filtered
d1_filterd <- d1[d1$rowsum > 10, ]

Create a group file and normalize data using EdgeR.

conds <- c(rep("C", 2), rep("T", 3))
y <- DGEList(counts=d1_filterd, group=conds, remove.zeros=TRUE) # Constructs DGEList object

dge=calcNormFactors(y)

A multi-dimensional scaling (MDS) plot to see the similarity among samples.

# color for controls
cn.color='blue'
# color for treatments
tr.color='brown'
# define a title for the plot
main='MDS Plot for Count Data'
#par(las=1) # makes y axis labels horizontal not vertical
colors=c(rep(cn.color,2),rep(tr.color,3))
plotMDS(dge,main=main,labels=colnames(dge$counts),
col=colors,las=1)

Hierarchical clustering can also be used to check how different samples are. As you can see, sample T2 is very different from the rest.

>normalized.counts=cpm(dge)
>transposed=t(normalized.counts) # transposes the counts matrix
>distance=dist(transposed) # calculates distance
>clusters=hclust(distance) # does hierarchical clustering
>plot(clusters) # plots the clusters as a dendrogram

|image1|

Differential expression analysis

y <- estimateCommonDisp(y) # Estimates common dispersion
y <- estimateTagwiseDisp(y) # Estimates tagwise dispersion
et <- exactTest(y, pair=c("C", "T")) # Computes exact test for the negative binomial distribution.
topTags(et, n=4)

Convert data into a dataframe and use dfr and fold change to select genes.

edge <- as.data.frame(topTags(et, n=50000))
edge2fold <- edge[edge$logFC >= 1 | edge$logFC <= -1,]
edge2foldpadj <- edge2fold[edge2fold$FDR <= 0.01, ]

DEseq analysis

Create matrix for DESeq2 and prepare data for DEseq

d1_matrix <- data.matrix(d1_filterd) #Create matrix for DESeq2

genotype <- data.frame(Genotype=conds) # create a dataframe for groups



dse <- DESeqDataSetFromMatrix(countData = d1_matrix,
                      colData = genotype,
                      design = ~ Genotype)

colData(dse)$Genotype <- factor(colData(dse)$Genotype,
                        levels=c("C","T"))

Differential expression analysis

dse <- DESeq(dse)
dse <- DESeq(dse)

ddsLocal <- estimateDispersions(dse, fitType="local", maxit =500)
ddsLocal <- nbinomWaldTest(ddsLocal)
res <- results(ddsLocal)

Order the data using p-adjusted value.

res <- res[order(res$padj),]
head(res)
res <- na.omit(res)

Use dfr and fold change to select genes.

deseq_2fold <- res[res$log2FoldChange >= 1 | res$log2FoldChange <= -1,]
deseq_2foldpadj <- data.frame(deseq_2fold[deseq_2fold$padj <= 0.01, ])
bothDF <- merge(deseq_2foldpadj, edge2foldpadj, by="row.names", all.x=TRUE)

Writing results files. You can open “edgr_deseq2.txt” file in Excel if you want to look at it.

write.table(as.data.frame(bothDF), file="edgr_deseq2.txt", sep="\t", quote = F, row.names=F, col.names=TRUE)

Combine DESeq and EdgR to make Venn diagram

Count how many gene from each analysis and make a Venn diagram. You need to have overLapper.R file (down loaded from Bb) in Counts folder.

source("overLapper.R")

setlist <- list(edgeR=rownames(edge2foldpadj), DESeq=rownames(deseq_2foldpadj))
OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets")
counts <- sapply(OLlist$Venn_List, length)
vennPlot(counts=counts)
_images/UnivLogo_Stack_2C_Dark.png

GOseq analysis

1. Ge the up and down regulated gene list. “bothDF” is the dataframe that contains both up and down-regulated genes from both EdgeR and DEseq2.

bothDF_down <- bothDF[bothDF$log2FoldChange <= -1,]
bothDF_up <- bothDF[bothDF$log2FoldChange >= 1,]

Convert to these dataframes into table with True or False values. Write the table to local directory.

DE_list_boolup <- as.data.frame(row.names(d) %in% bothDF_up$Row.names,
      row.names = row.names(d))
DE_list_booldown <- as.data.frame(row.names(d) %in% bothDF_down$Row.names,
                        row.names = row.names(d))

write.table(DE_list_boolup,file="DE_goseq_up.txt",row.names=T,sep='\t',quote=F,
    col.names = F)

write.table(DE_list_booldown,file="DE_goseq_down.txt",row.names=T,sep='\t',quote=F,
col.names = F)
  1. Perform the GOseq analysis in Galaxy. You will need to perform the analysis for up and down regulated genes separately.

    _images/GOseq.png

  1. Combine gene descriptions with up and down regulated genes. You can get the S_lycopersicum_Feb_2014.bed file from the Dropbox link on Bb.
annots_file <- 'S_lycopersicum_Feb_2014.bed'
# keep gene id and gene description columns
annots <- read.delim(annots_file,sep='\t',header=F)[,13:14]
# name the columns
names(annots) <- c('gene','description')
# combine gene expression and annotations
bothDF_genedesc  <- merge(bothDF,annots, by.x = "Row.names",by.y='gene')
  1. To order your data using FDR, you use the following command in R.
bothDF_genedesc  <-  bothDF_genedesc[order(bothDF_genedesc$FDR), ]
_images/UnivLogo_Stack_2C_Dark.png

Run RNAseq analysis as a workflow

_images/workflow_1.png
_images/workflow_2.png
_images/workflow_3.png

Indices and tables