This tutorial assumes that you are familiar with the concepts developed in the following parts of the theoretical course.
It is better to follow the corresponding tutorials before this one.
A companion tutorial explains how to retrieve peak sequences from Galaxy.
The program peak-motifs combines various programs of the RSAT suite to discover cis-regulatory motifs and predict putative transcription factor binding sites from a set of peak sequences identified by high-throughput methods such as ChIP-seq, ChIP-on-chip or related methods.
In this tutorial, we explain how to tune the parameters and interpret the results for the different steps of the peak-motifs workflow.
To illustrate the features of peak-motifs, we will analyze a set of peak sequences that were obtained by pulling down genomic regions bound by the transcription factor Oct4 in the mouse. The experiment was performed in the context of a wider study, where X. Chen and colleagues characterized the binding location of 12 transcription factors involved in mouse embryonic stem cell differentiation (Chen et al., 2008).
A set of test sequences is available on the supplementary material Web site: http://rsat.bigre.ulb.ac.be/rsat/data/published_data/peak-motifs_2011/
The peak sequences from Chen's article are in the subdirectory data/sequences/Chen_2008/peaks_from_galaxy/
Note: these peak sequences differ from those available in the GEO database. Indeed, Chen and colleagues filtered their peaks on the basis of discovered motifs in order to submit a "cleaned" collection of peaks to GEO. Since the goal of this tutorial is to show how peak-motifs performs on a raw collection of peak sequences, we have re-generated a complete peak collection from the original reads submitted by Chen in the GEO database. The mapping of the reads was performed with Bowtie against the mm9 assembly, then we used the program MACS to call the peak regions, and PeakSplitter to split the large regions into effective peaks. The peak sequences were then collected from Galaxy.
Open a connection to the RSAT Web server.
In the menu on the left side, expand the title NGS - ChIP-seq and select the tool peak-motifs.
Unless you have your own set of peak sequences, you can download the test set provided on the supplementary material Web site (file Oct4vsGFP_MACS_fdr0.02_splitted_peaks_sorted.fa).
Note: the sequences should be saved as a plain (unformatted) text file.
Enter a Title for this analysis (e.g. Oct4 dataset Chen 2008).
Under Peak sequences, click on the Browse button to select your peak sequence file.
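Before uploading, it can be worth checking locally that the file really is plain-text FASTA. The following Python sketch is only an illustration and is not part of RSAT: the function names read_fasta and summarize are hypothetical, and the set of accepted residues (A, C, G, T, N) is an assumption made for this example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: peak sequences only contain these symbols (upper or lower case).
VALID_RESIDUES = set("ACGTNacgtn")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(records: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """Report the number of sequences, their length range, and suspicious entries."""
    lengths: List[int] = []
    suspicious: List[str] = []
    for header, seq in records:
        lengths.append(len(seq))
        if set(seq) - VALID_RESIDUES:
            # Unexpected characters often mean the file was saved with formatting.
            suspicious.append(header)
    return {
        "sequences": len(lengths),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "suspicious": suspicious,
    }


if __name__ == "__main__":
    demo = [">peak1", "ACGTAC", "GGTN", ">peak2", "{\\rtf1 ACGT}"]
    print(summarize(read_fasta(demo)))
```

To check a real file, pass an open file handle instead of the in-memory list, for example summarize(read_fasta(handle)) inside a with open(...) block.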
This panel can be expanded by double-clicking on the triangle on the right.
It allows you to limit the analysis to a given number of top peaks from the input file, or to clip sequences around the centers in order to restrict them to a maximal size. With the peak sequences used in this tutorial, there is no specific need to apply those restrictions. The two steps hereafter just indicate the reasons why you generally don't need to activate the restrictions on peak number and peak size.
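The two restrictions offered by this panel can be pictured with a small Python sketch. It assumes that the input file is already sorted by decreasing peak score, so that the top peaks are simply the first records, and that clipping keeps a window centered on the middle of each sequence. The function names are hypothetical and this is not the code used by peak-motifs.

```python
from typing import List, Optional, Tuple

Record = Tuple[str, str]  # (header, sequence)


def clip_around_center(seq: str, max_len: int) -> str:
    """Keep at most max_len residues, centered on the middle of the sequence."""
    if max_len <= 0:
        raise ValueError("max_len must be a positive integer")
    if len(seq) <= max_len:
        return seq
    start = (len(seq) - max_len) // 2
    return seq[start:start + max_len]


def reduce_peaks(
    records: List[Record],
    top: Optional[int] = None,
    max_len: Optional[int] = None,
) -> List[Record]:
    """Select the first `top` records, then clip each one to `max_len`."""
    # Assumption: records are ordered from the best to the worst peak.
    selected = records if top is None else records[:top]
    if max_len is None:
        return list(selected)
    return [(header, clip_around_center(seq, max_len)) for header, seq in selected]


if __name__ == "__main__":
    peaks = [("a", "AACCGGTT"), ("b", "ACGTA"), ("c", "AC")]
    print(reduce_peaks(peaks, top=2, max_len=3))
```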
This panel contains the parameters for the motif discovery step. For the case study, we will keep the default settings, using the programs oligo-analysis and position-analysis to discover over-represented motifs and motifs with a positional bias.
We explain hereafter how to tune the parameters depending on the properties of the peak collection and the expected structure of the transcription factor binding motif.
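The two kinds of signal mentioned above, over-representation and positional bias, both start from simple word counts. The Python sketch below only illustrates the raw counting and does not reproduce the statistics computed by oligo-analysis or position-analysis. The function names, the choice of the word length k, and the window size are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, Iterable

DNA = set("ACGT")


def count_kmers(sequences: Iterable[str], k: int) -> Counter:
    """Count every overlapping k-letter word made only of A, C, G, T."""
    counts: Counter = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= DNA:  # skip words containing N or other symbols
                counts[kmer] += 1
    return counts


def position_profile(sequences: Iterable[str], kmer: str, window: int) -> Dict[int, int]:
    """Count occurrences of kmer per window, relative to each sequence center.

    Keys are the lower bounds of the windows (negative values are left of
    the center); values are occurrence counts.
    """
    profile: Counter = Counter()
    k = len(kmer)
    for seq in sequences:
        seq = seq.upper()
        center = len(seq) / 2
        start = seq.find(kmer)
        while start != -1:
            offset = start + k / 2 - center  # word midpoint relative to the center
            profile[math.floor(offset / window) * window] += 1
            start = seq.find(kmer, start + 1)
    return dict(sorted(profile.items()))


if __name__ == "__main__":
    seqs = ["TTTTACGTTTTT", "ACGTTTTTTACG"]
    print(count_kmers(seqs, 3).most_common(3))
    print(position_profile(seqs, "ACG", window=4))
```

A word that is concentrated in the windows closest to zero in such a profile is the kind of pattern a positional-bias analysis looks for, whereas a word with a high total count relative to expectation is a candidate for over-representation.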
Discovered motifs can be compared to databases of known motifs. We directly support various public databases such as JASPAR and UniPROBE. Users may also upload their own private collections of matrices here (e.g. TRANSFAC).
Keep JASPAR core Vertebrates checked, and also check JASPAR PBM (UNIPROBE) Mouse, since our dataset was obtained from mouse.
If the sequences are provided in an appropriate format, the positions of the predicted sites can automatically be converted from peak-relative to genomic coordinates.
Keep the Search putative binding sites option checked.
Assuming that you followed the steps above, select Sequences were fetched from Galaxy. This will recalculate the genomic coordinates of the predicted binding sites, and generate a custom UCSC track to visualize the results in this popular genome browser.
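The conversion from peak-relative to genomic coordinates can be sketched in Python as follows. This is an illustration under stated assumptions, not the RSAT implementation: it assumes that each sequence header has the form assembly_chromosome_start_end_strand (for example mm9_chr1_3473041_3473370_+), that the peak start is 0-based and the peak end exclusive, and that site positions are 1-based and inclusive within the peak sequence. The function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Peak(NamedTuple):
    assembly: str
    chrom: str
    start: int   # assumed 0-based
    end: int     # assumed exclusive
    strand: str


def parse_header(header: str) -> Peak:
    """Parse a header such as 'mm9_chr1_3473041_3473370_+' (assumed format)."""
    assembly, rest = header.lstrip(">").split("_", 1)
    # Split from the right, since chromosome names may contain underscores.
    chrom, start, end, strand = rest.rsplit("_", 3)
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand in header: {header!r}")
    return Peak(assembly, chrom, int(start), int(end), strand)


def to_genomic(header: str, site_start: int, site_end: int) -> Tuple[str, int, int]:
    """Convert a 1-based inclusive site within a peak to (chrom, start, end).

    The returned interval is 0-based with an exclusive end, on the genome.
    """
    peak = parse_header(header)
    if peak.strand == "+":
        g_start = peak.start + site_start - 1
        g_end = peak.start + site_end
    else:
        # On the minus strand the sequence is read from the peak end backwards.
        g_start = peak.end - site_end
        g_end = peak.end - site_start + 1
    return peak.chrom, g_start, g_end


if __name__ == "__main__":
    print(to_genomic("mm9_chr1_3473041_3473370_+", 11, 18))
```

Whichever convention the actual headers follow, the key point is that the peak coordinates must be recoverable from the sequence identifier, which is why the source of the sequences has to be declared in the form.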
Enter your email address and click GO.
A link to the result should appear on the new page. The results appear progressively, to enable the users to analyse their results more quickly.