This tutorial assumes that you are familiar with the concepts related to High-Throughput sequencing (reads, read mapping) and ChIP-seq technology (peaks).
This tutorial does not directly use RSAT tools, but explains how to obtain datasets that can be used as input for the RSAT program peak-motifs.
The Galaxy server (http://main.g2.bx.psu.edu/) combines a wide variety of programs for accessing and analyzing genomic sequences. These tools are remarkably powerful and efficient, and they come with excellent documentation, including training videos.
The goal of this tutorial is to briefly explain the sequence of operations used to retrieve peak sequences from the Galaxy server, starting from a set of reads or peaks annotated in the Gene Expression Omnibus database (GEO, http://www.ncbi.nlm.nih.gov/geo/).
Goal: identify the dataset corresponding to the article by Chen et al., 2008 (PubMed ID: 18555785) in GEO DataSets and retrieve the data for the Sox2 experiment.
Open a connection to the PubMed database.
In the text box, enter the title of the article:
This should give a single result. If this is not the case, you can select the publication on the basis of its PubMed ID (18555785).
On the right side of the PubMed record, under the title All links from this record, click the link GEO DataSets. This opens the record GSE11431 with the title Mapping of transcription factor binding sites in mouse embryonic stem cells.
Note: the title of the record differs from the title of the article, which makes it somewhat difficult to identify a record by browsing the GEO datasets alone. The easiest way to go from an article to the corresponding records is generally to use the direct link from PubMed to GEO DataSets, as we did.
In the GEO database, identifiers with the prefix GSE denote series of experiments. Chen et al. (2008) published ChIP-seq results for various transcription factors, so the series associated with this article contains 16 samples in total.
The bottom of the GSE record provides the list of samples (identifiers starting with GSM). Click on the link corresponding to an experiment of your choice (e.g. ES_Oct4, sample ID GSM288346).
Read the information available about this sample. The bottom of the record provides links to the data sets at various processing stages:
The Galaxy Web server offers a wide variety of tools for analyzing genomic data.
To take full advantage of Galaxy, you can open an account on the server, which will allow you to keep track of previous analyses and to store your data and results on the server.
Open a connection to the Galaxy server (http://main.g2.bx.psu.edu/).
At the top of the window, open the command User > Login and provide your email address and password (at the first connection, you must fill in a form to obtain a login).
In the menu at the left of the window, click Get Data > Upload File.
You can either upload a file from your computer (click Browse beside the File text box) or from a Web server (type a link to the file in the URL/Text box).
ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM288nnn/GSM288346/GSM288346%5FOct4%2Ebed%2Egz
In the Genome pop-up menu, select Mouse Feb 2006 (NCBI36/mm8). Tip: the genome is selected if you simply type mm8 while the menu is selected.
Leave the other options at their default values and click Execute. The upload may take several minutes.
Once the file is uploaded, the yellow box on the right side will turn green. Click on this box and check that the format is BED and the genome ("database") is mm8.
chr1 100000123 100000148 0 0 +
chr1 100000387 100000412 0 0 -
chr1 100001969 100001994 0 0 -
chr1 100002597 100002622 0 0 +
chr1 100002637 100002662 0 0 +
chr1 10000261 10000286 0 0 -
chr1 100003474 100003499 0 0 -
chr1 100004023 100004048 0 0 +
chr1 100004191 100004216 0 0 +
chr1 100005158 100005183 0 0 -
chr1 100005335 100005360 0 0 +
...
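We recommend renaming the Galaxy entry for the sake of readability (for example "Chen 2008 ES_Oct4 reads"). After renaming, the box should display the following information: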
~4,700,000 regions
format: bed, database: mm8
Info: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM288nnn/GSM288346/GSM288346%5FOct4%2Ebed%2E
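Each line of the read file uploaded above is a 6-column BED record (chromosome, start, end, name, score, strand). As a minimal sketch, assuming you also keep a local copy of the file, the records could be parsed in Python as follows. The names BedRecord and parse_bed_line are illustrative and are not part of Galaxy or RSAT.

```python
from collections import Counter, namedtuple

# One mapped read: the six columns of the BED preview shown in Galaxy.
BedRecord = namedtuple("BedRecord", "chrom start end name score strand")

def parse_bed_line(line):
    """Split one whitespace-separated BED line into a typed record."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedRecord(chrom, int(start), int(end), name, score, strand)

# First records of the Oct4 read file, copied from the preview above.
sample = """\
chr1 100000123 100000148 0 0 +
chr1 100000387 100000412 0 0 -
chr1 100001969 100001994 0 0 -
chr1 100002597 100002622 0 0 +
"""

records = [parse_bed_line(l) for l in sample.splitlines() if l.strip()]

# Count how many reads map to each strand.
strand_counts = Counter(r.strand for r in records)
print(strand_counts["+"], "reads on the + strand,",
      strand_counts["-"], "on the - strand")
```

On the complete file, the same loop can be fed line by line from an open file handle instead of the sample string.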
Run the same protocol to download the sample GSM288358 from GEO. This sample was obtained by immunoprecipitating green fluorescent protein (GFP). In principle, it should thus not contain any specific peak. It can optionally be used as a control ("mock") for peak-calling programs. After the download, rename the dataset "Chen 2008 ES_GFP reads".
In the left frame of the Galaxy window, you can see a set of specialized tools for analyzing Next Generation Sequencing data (NGS TOOLBOX BETA).
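To locate the files of another sample, it can help to see how the GEO URL used above is built. The sketch below assumes that the sub-directory name is obtained by replacing the last three digits of the GSM identifier with "nnn", as in the GSM288346 URL of this tutorial. The function name geo_sample_dir is hypothetical, and the file names inside the directory still have to be read from the GEO sample record.

```python
from urllib.parse import unquote

GEO_SAMPLES_ROOT = "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples"

def geo_sample_dir(gsm_id):
    """Return the supplementary-file directory for a GSM identifier.

    Assumption: the parent directory replaces the last three digits of
    the identifier with 'nnn' (GSM288346 -> GSM288nnn), following the
    pattern of the URL used in this tutorial.
    """
    return f"{GEO_SAMPLES_ROOT}/{gsm_id[:-3]}nnn/{gsm_id}/"

# Directory of the GFP control sample.
print(geo_sample_dir("GSM288358"))

# The file name in the tutorial URL is percent-encoded:
# %5F stands for '_' and %2E for '.'.
print(unquote("GSM288346%5FOct4%2Ebed%2Egz"))
```

The second call shows that the encoded file name in the upload URL is simply GSM288346_Oct4.bed.gz.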
Click NGS:Peak Calling > MACS Model-based Analysis of ChIP-Seq.
Enter an Experiment Name (e.g. OCT4 Chen-2008 peaks MACS no input).
For the ChIP-seq tag file, select the file you uploaded in the previous step (if you performed all the steps above, it should appear as "Chen 2008 ES_Oct4 reads" in the pop-up menu).
Note: at first, we will run the peak calling without providing any control. If you have a control set, you can enter its mapped-read BED file as Control tag file. This will be done as an exercise below.
Effective genome size: this is the size of the genome considered "usable" for peak calling. The value is given by the MACS developers on their website. It is smaller than the complete genome because many regions are excluded (telomeres, highly repeated regions...). The default value is for human (2700000000.0); since we work on mouse, enter 1870000000.0.
Set the Tag size to 26bp (the default is 25).
Leave all other options to their default values and click Execute.
While the program is running, two yellow boxes should appear in the "History" frame at the right of the Galaxy window. After completion of the job, the boxes will turn green. The first box contains an HTML page with links to the results in various formats. The second box contains a BED file with the coordinates of the peaks. How many peaks ("regions") were detected by MACS?
Once the result is available, click on the pencil to change the information. Rename the dataset (e.g. Oct4 peaks from MACS).
Optionally, you can upload the peak coordinates to the UCSC genome browser to visualize them on the mouse chromosomes. For this, you can simply click on the link display at UCSC main in the green box Oct4 peaks from MACS of the History frame.
In the protocol above, we used the simplest approach to detect peaks with MACS: we entered a single file (the "test" reads) and adapted some parameters to the specifics of our experiment (e.g. genome size, tag size).
Some peak-calling programs (including MACS) allow users to submit a second set of reads as a control. Typical controls are "mock" datasets, i.e. genomic sequences obtained from a non-immunoprecipitated protein, or genomic DNA. In Chen's experiment, the control consisted of performing the ChIP-seq with the Green Fluorescent Protein (GFP) instead of a transcription factor.
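For readers who prefer the command line, the Galaxy form settings used above roughly correspond to the following call. This is only a sketch: it assumes a stand-alone MACS 1.4 installation providing the macs14 executable, which may differ from the version wrapped by Galaxy, and the file names and experiment name are hypothetical. The code builds the command without running it.

```python
def macs_command(treatment, name, control=None, gsize=1.87e9, tsize=26):
    """Build (but do not run) the argument list for a MACS 1.4 call.

    Defaults mirror the tutorial: mouse effective genome size and a
    tag size of 26. Assumes the stand-alone 'macs14' executable.
    """
    args = [
        "macs14",
        "-t", treatment,             # ChIP-seq tag file (test reads)
        "--name", name,              # experiment name
        "--format", "BED",           # input reads are in BED format
        "--gsize", str(int(gsize)),  # effective genome size
        "--tsize", str(tsize),       # tag size
    ]
    if control is not None:
        args += ["-c", control]      # optional control (mock) tag file
    return args

# Hypothetical local file names for the Oct4 reads and the GFP control.
cmd = macs_command("Oct4_reads.bed", "Oct4_vs_GFP", control="GFP_reads.bed")
print(" ".join(cmd))
```

Omitting the control argument gives the equivalent of the first run described above, where no control file was provided.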
The Galaxy Web server allows you to quickly retrieve sequences from a coordinate file (e.g. a BED file). The coordinates can be provided in various forms:
The BED file retrieved in the previous section indicates the chromosomal coordinates of the peaks, but in the next section (motif discovery) we will need to analyze the peak sequences. In the Tools frame at the left of the Galaxy window, click Fetch Sequences - Extract Genomic DNA. Select the Oct4 peaks from MACS dataset and click Execute.
Once the box becomes green in the History frame, click on the pencil icon and rename the dataset (for example Oct4 peaks from GEO).
Open the green box and click on the disk icon to store the result on your computer (for example in a file Oct4_MAC_peak_sequences.fasta).
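To make explicit what fetching sequences from peak coordinates means, here is a minimal sketch on a toy genome. It relies on the BED convention that coordinates are 0-based and half-open (the start position is included, the end position is excluded). The toy chromosome, the function names and the FASTA header format are illustrative assumptions and do not reproduce the exact output of the Galaxy tool.

```python
def fetch_sequences(genome, peaks):
    """Yield (header, sequence) pairs for BED-style peak coordinates.

    BED coordinates are 0-based and half-open, which matches Python
    slicing: genome[chrom][start:end].
    """
    for chrom, start, end in peaks:
        yield f"{chrom}_{start}_{end}", genome[chrom][start:end]

def to_fasta(entries, width=60):
    """Format (header, sequence) pairs as FASTA text."""
    lines = []
    for header, seq in entries:
        lines.append(">" + header)
        # Wrap long sequences on lines of at most `width` residues.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Toy data: one 15 bp "chromosome" and two peaks.
toy_genome = {"chrT": "ACGTACGTTTGACCA"}
toy_peaks = [("chrT", 0, 4), ("chrT", 8, 13)]

fasta = to_fasta(fetch_sequences(toy_genome, toy_peaks))
print(fasta, end="")
```

The resulting FASTA text has one entry per peak, which is the kind of file expected as input for motif discovery.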
You can skip Options B and C.
If you have a BED file (e.g. produced by a stand-alone peak-calling program running on your computer), you can upload this BED file to the Galaxy server and proceed as for Option A.
In this section, we will directly upload the peak coordinates published by Chen in GEO (the above-mentioned "txt" file, which is actually in BED format).
In the menu at the left of the Galaxy page, click Get Data - Upload File.
In the URL/Text box, paste the URL of the Oct4 sample:
ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM288nnn/GSM288346/GSM288346%5FES%5FOct4%2Etxt%2Egz
In the Genome pop-up menu, select Mouse Feb 2006 (NCBI36/mm8).
Leave the other options at their default values and click Execute. The upload may take several minutes. Once the file is uploaded, the yellow box on the right side will turn green.
Click on this box and make sure that the format is BED and the genome is mm8. How many ChIP-seq "regions" are present in this file?
Click on the disk icon in this box to download this sequence file. Save it on your computer.
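The number of regions reported by Galaxy can be cross-checked on the downloaded file. The sketch below counts the data lines of a gzip-compressed BED file, skipping blank lines and header lines. The function name is hypothetical, and the demonstration uses a tiny file built in memory rather than the real GEO file.

```python
import gzip
import io

def count_bed_regions(handle):
    """Count data lines, skipping blanks and '#', 'track', 'browser' lines."""
    n = 0
    for line in handle:
        stripped = line.strip()
        if stripped and not stripped.startswith(("#", "track", "browser")):
            n += 1
    return n

# Self-contained demo: a small gzip-compressed BED file built in memory.
demo = gzip.compress(b"track name=demo\nchr1\t10\t60\nchr2\t5\t80\n\n")
with gzip.open(io.BytesIO(demo), "rt") as handle:
    n_regions = count_bed_regions(handle)
print(n_regions)
```

For a file saved on your computer, pass its path to gzip.open instead of the in-memory buffer.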
We obtained a sequence file that can now be used as input for motif discovery and TF binding site prediction. For this, we will use the RSAT workflow peak-motifs, whose usage is explained in the protocol peak-motifs: motif detection in full-size datasets of ChIP-seq peak sequences.