RSA-tools - Tutorials - Gibbs sampler
Contents
Introduction
We illustrate here the functioning of the gibbs sampler. Beware that gibbs sampliong is a stochastic process. In consequence, each run of the program can return a different result, which can be disappointing at first sight. The good practice is to run the gibbs sampler repeatedly, and check which motifs are selected frequently.
The results are also affected by two important parameters: matrix weight and expected number of matches. When analyzing a new gene cluster, it is a good idea to try different settings and compare the resulting matrices.
Warnings
- The version of the gibbs sampler installed on this site is that developed by Andrew Neuwald in 1995. This version was primarily developed for the detection of motifs in sets of peptidic sequences, and does not take into account the strand-insensitivity of regulatory motifs. To circumvent this problem, we added an option "add reverse complement", which calculates the reverse complement of each input sequences and adds it to the data set before analysis. This is tricky, but it works in many cases.
- A more recent version of the gibbs sampler has been developed for the analysis of yeast regulatory regions (Roth et al., 1998, Tavazoie et al., 1999, Hughes et al., 1999). This DNA-specific version, called AlignACE, performs an a posteriori filtering of the motifs, to get rid of spurious motifs (e.g. AT-rich sequences in the yeast). This fixes several of the problems encountered with the original gibbs sampler and mentionned in this tutorial. ALIGNace can be downloaded or used online via its web interface (http://atlas.med.harvard.edu/).
- Even more recently, Gert Thijs, from the University of Gent, developed a motif sampler with Markov chain estimates of expected probabilities. This Markov-based gibbs sampler should in principle provide better results than the native one, since it takes into account the dependencies between adjacent positions. The MotifSampler can be installed on your computer or used via its web interface (http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html).
Example of utilization
- Retrieve upstream sequences from -800 to -1 for PHO genes of Saccharomyces cerevisiae, as we saw during the tutorial on sequence retrieval. Make sure to inactivate the option Prevent overlap with neighbour genes.
PHO5 PHO8 PHO11 PHO81 PHO84- At the top of the result page, click the button labelled gibbs sampler.
- Leave all parameters unchanged and click GO
- Look carefully at the results:
- Which motif is selected ?
- Does it look like the known Pho4p-binding site (CACGTGgg or CACGTTtt) ?
- If yes, what is the value of the information per parameter.
- Click on the back icon of your browser, to come back to the gibbs form. Do not modify any parameters and click GO once again.
- Do you obtain the same result as the previous time ?
- Is the pattern alike ?
- Has it the same information content ?
- Repeat the process a few time, and compare the patterns discovered.
- How frequently do you obtain patterns that look like the known Pho4p binding site (CACGTGggg or CACGTTtt) ?
Interpreting the results
The gibbs sampler returns two matrices, and each matrix is presented in two units (residue frequencies, and entropy, respectively).
Why are there two matrices ?
The first matrix is the result of the expectation-maximisation (EM) cycles. Since the expectation-maximization phase is stochastic, this matrix might contain some suboptimal sites.
The matrix resulting from the EM cycles is used to scan the input sequence set, and the best sites are extracted, to form the second matrix. This second matrix is thus very similr, if not identical, to the first one.
Is the result significant ?
The main parameter for evaluating the result is the information per parameter (IPP). The higher the IPP, the more significant is the matrix.
It is important to realize that the IPP depends on the number of sequences and of their size. One empirical way to evaluate the significance of a matrix is to run the gibbs sampler on random selections of genes, with the same number of genes and sequence sizes as in your data set.
Additional exercises
- Try to extract regulatory motifs from the NIT family.
DAL5 GAP1 MEP1 MEP2 MEP3 PUT4 DAL80The known regulatory element is GATAAg. Can you extract this motif ? Try different matrix sizes and expected number of matches.- Use the gibbs sampler to predict regulatory elements in the GAL regulon.
GAL1 GAL2 GAL3 GAL7 GAL80 GCY1Note that you will need to sample different matrix width for detecting the Gal4p motif (you can for exmple try 10, 20, 30, and 40). For each matrix width, evaluate the significance of the result on the basis of the information per parameter. Which matrix width gives the highest significance ? Which motif is returned ? Does it contain to the Gal4p consensus ?- Select a random set of 10 yeast genes (with the program random-genes, retrieve their upstream sequences (without overlap with upstream ORFs), and test this random selection with consensus. Perform this test a few times, with different numbers of random genes, and different settings. Examine the information per parameter of the resulting matrices.
You can now come back to the tutorial main page and follow the next tutorials.
For suggestions or information request, please contact