March 2015 Feng Lab meeting

The Data Scientist’s Toolbox for ChIP-Seq and Beyond

Feng Lab Group meeting by Wayne in March.

Slides link.

A Google Doc for sharing today

http://we.tl/TYWUi1aFRf

3:40pm-5:30pm, March 19th, 2015. (2nd session; see first session info here; next session hands-on here)

Getting started

Using this Web site

Reload to get the latest links!

Preparation

Read

Interactive notebooks: Sharing the code by Helen Shen. Nature. 2014 Nov 6;515(7525):151-2. doi: 10.1038/515151a. PMID: 25373681

Programming tools: Adventures with R by Sylvia Tippmann. Nature. 2015 Jan 1;517(7532):109-10. doi: 10.1038/517109a. PMID: 25557714

Tech prep

Be sure you have a modern, updated browser on your system. Preferably Chrome or Firefox.

Register and do the follow-up activation at SourceLair.

Register for Sagemath Cloud.

Be sure to have a good text editor on your computer. It sounds like you may have been using Aquamacs in the past, so this shouldn’t be a problem. I highly recommend Sublime Text. However, for what we’ll be doing Thursday, even TextWrangler on a Mac will be sufficient. For those not on a Mac, I’d recommend Sublime Text, Notepad++, or jEdit.

Intro to technology

As a group, we’ll use two technologies today.

The idea behind using cloud-based tools is to make it easier to start coding right away; you can then change what you use as you develop your own coding workflow preferences. (Sorry for needing two, but finding a good interface that has all the desired features and works on the Upstate network is not easy.)

For today

ChIP-seq

Background on ChIP-seq in preparation for running through an analysis workflow next session.

I’ll sprinkle in some real-world examples of using some available tools, with at least one possibly being hands-on for those who wish to participate.

Slides link.

A Google Doc for sharing today, if needed

Examples from the Wild 1: Regular Expressions

NGS Analysis of ChIP-seq data with NUCwave

ChIP-Seq example at NUCwave site

S. cerevisiae reference genome was downloaded from SGD and FASTA headers for chromosome names were replaced with chrI-chrXVI.

Of course, there are only sixteen chromosomes in yeast, plus the mitochondrial genome, so this is not overly difficult to do by hand. But it is tedious, and it offers a good place to use regular expressions.
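The same renaming can also be scripted rather than done in an editor. Here is a minimal sketch in Python, assuming each chromosome header carries its Roman numeral in a `[chromosome=...]` tag. That layout is an assumption for illustration, so inspect the headers in your own download first. The names `HEADER_PATTERN`, `rename_header` and `rename_fasta_headers` are made up for this sketch, and headers that lack the tag (for example, the mitochondrial record) are left untouched.

```python
import re

# Assumed header layout (hypothetical example; inspect your own FASTA file):
#   >ref|NC_001133| [org=Saccharomyces cerevisiae] [chromosome=I]
# The capture group grabs the Roman numeral inside the [chromosome=...] tag.
HEADER_PATTERN = re.compile(r"^>.*\[chromosome=([IVX]+)\].*$")


def rename_header(line):
    """Return '>chrN' for a matching header line; leave other lines as-is."""
    match = HEADER_PATTERN.match(line)
    if match:
        return ">chr" + match.group(1)
    return line


def rename_fasta_headers(lines):
    """Apply rename_header to every line of a FASTA file (a list of strings)."""
    return [rename_header(line.rstrip("\n")) for line in lines]


# Tiny made-up FASTA content, just to exercise the functions.
example = [
    ">ref|NC_001133| [org=Saccharomyces cerevisiae] [chromosome=I]",
    "CCACACCACACCCACACACC",
    ">ref|NC_001148| [org=Saccharomyces cerevisiae] [chromosome=XVI]",
    "AAATAGCCCTCATGTACGTC",
]
renamed = rename_fasta_headers(example)
print("\n".join(renamed))
```

Sequence lines never start with `>`, so the pattern cannot touch them; only header lines are rewritten.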

I highly recommend the following combination for learning regular expressions (often called regex or regexp):

First I’ll demonstrate doing this with Sublime Text using the process I already worked out.

So what are Regular Expressions? See Exploring with Regular Expressions 101.

I’ll demo wildcards, character sets, quantifiers, and capturing.

Finally, we’ll use Regular Expressions 101 to really follow what was going on in this example.
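For reference after the session, the four ideas named in the demo (wildcards, character sets, quantifiers and capturing) can each be tried in a few lines of Python with the standard `re` module. This is only a small sketch on a made-up string of chromosome-like names, not the exact patterns used in the demo.

```python
import re

# Made-up test string of chromosome-like names.
text = "chrI chrII chrXVI chr2 chrM"

# Wildcard: '.' matches any single character (except a newline).
wildcard = re.findall(r"chr.", text)

# Character set: [IVX] matches exactly one of the listed characters.
char_set = re.findall(r"chr[IVX]", text)

# Quantifier: '+' means "one or more" of the preceding item.
quantifier = re.findall(r"chr[IVX]+", text)

# Capturing: parentheses save the matched part so it can be reused;
# with one group, findall returns only the captured text.
captured = re.findall(r"chr([IVX]+)", text)

print(wildcard)    # ['chrI', 'chrI', 'chrX', 'chr2', 'chrM']
print(char_set)    # ['chrI', 'chrI', 'chrX']
print(quantifier)  # ['chrI', 'chrII', 'chrXVI']
print(captured)    # ['I', 'II', 'XVI']
```

Note how the wildcard and the bare character set stop after one character, while adding the quantifier picks up the whole Roman numeral.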

Examples from the Wild 2: IPython Notebooks

NGS Analysis of ChIP-seq data using IPython Notebooks to Explore

Determining Average ChIP-seq signal over promoters with Metaseq

"This example demonstrates the use of metaseq for performing a common task when analyzing ChIP-seq data: what is the average signal over transcription start sites (TSS) throughout the genome?"

That looks interesting but what framework is being used to make and host this?

So what are IPython Notebooks?

They allow you to code interactively in your browser and take advantage of HTML and other web features, including sharing online.

They are especially useful for exploring data and for developing code or approaches to analyzing your data.

Titus Brown’s screencast and associated notebook illustrate much of this.

I’ll also show two notebooks I have made and run them interactively.

Q: You’ve run your notebook and populated the cells. Now how can you share it with colleagues?

A: If you follow these steps, you or anyone you share it with can see your notebook on the web.

  • Upload your notebook somewhere online. GitHub, or even simply a Gist, will work fine.
  • Place the URL here and click ‘Go!’

Note that notebooks shared this way will not be interactive. You can, however, download them and run them locally.

The future ...

The IPython Notebook concept goes beyond Python, and a language-agnostic version of the Notebook is now being developed as the Jupyter Project.

Back to Metaseq

The page here actually has a legend with some of the plots that describes additional aspects of the exploratory analyses done on the page Determining Average ChIP-seq signal over promoters with Metaseq.
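To make the "average signal over TSSs" idea concrete, here is a minimal sketch in plain Python. It does not use metaseq or reproduce its API; the function names, the coverage values and the TSS positions are all made up for illustration. The idea is to cut a fixed-size window around each TSS, flip the minus-strand windows so that upstream is always on the left, and then average position by position.

```python
def window_around_tss(signal, tss, strand, flank):
    """Return the signal from tss-flank to tss+flank, oriented 5'->3'."""
    window = signal[tss - flank : tss + flank + 1]
    # Reverse minus-strand windows so upstream is always on the left.
    return window[::-1] if strand == "-" else window


def average_profile(signal, tss_list, flank):
    """Average the per-position signal across all TSS-centered windows."""
    windows = [
        window_around_tss(signal, tss, strand, flank)
        for tss, strand in tss_list
        # Skip TSSs too close to either end to give a full window.
        if tss - flank >= 0 and tss + flank < len(signal)
    ]
    if not windows:
        return []
    return [sum(column) / len(windows) for column in zip(*windows)]


# Hypothetical per-base coverage along one chromosome (made-up numbers).
signal = [0, 1, 2, 5, 9, 5, 2, 1, 0, 0, 1, 3, 8, 4, 1, 0]
# Hypothetical TSS positions (0-based) with their strands.
tss_list = [(4, "+"), (12, "-")]

profile = average_profile(signal, tss_list, flank=2)
print(profile)  # [1.5, 4.5, 8.5, 4.0, 1.5]
```

A real analysis works from alignment or coverage files and thousands of TSSs, which is where a package such as metaseq comes in; the averaging step itself is no more than this.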

Examples from the Wild 3: Git, Github, and Gists

Git, Github, and Gists

A quick tour of the GitHub site and Gists, since these are the most useful places to start when getting into git for version control.

See a section under ‘Going forward’ for additional resources.

Examples from the Wild 4: R, the Bioconductor Project for R, RStudio

R, the Bioconductor Project for R, RStudio

Quick tour of RStudio to simply show it is exceptional.

R is very much about the concept of tidy data.
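In a tidy table, each variable is a column and each observation is a row. As a rough illustration, sketched in Python rather than R to match the other examples on this page, the snippet below reshapes a made-up "wide" table (one column per sample) into tidy form. The helper `to_tidy` and all the values are hypothetical; in R you would normally reach for a dedicated reshaping package instead.

```python
# "Wide" table: one row per gene, one column per sample (made-up values).
wide = [
    {"gene": "geneA", "sample1": 10, "sample2": 12},
    {"gene": "geneB", "sample1": 3, "sample2": 7},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy_rows


tidy = to_tidy(wide, id_column="gene", variable_name="sample", value_name="count")
for record in tidy:
    print(record)
```

Each of the four printed records now holds one gene, one sample and one count, which is the shape most plotting and modelling tools expect.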

RStudio can easily be run on Amazon Web Services if computing power is an issue.

See a section under ‘Going forward’ for additional resources.

Going forward

Look into

Regular Expressions

Learning R

Questions

  • Try Google; it will probably lead you to one of my listed resources or...
  • Biostars
  • Stack Overflow for general scripting and computing
  • SEQanswers - a high throughput sequencing community
  • Try Twitter - for example this

Literature Selections for ChIP-seq

ChIP-Seq

Bias issues

Motif identification

ACRONYMS