RSA-tools - Tutorials - consensus

Contents

Introduction
Example of utilization
Interpreting the results
Additional exercises

Introduction

In this tutorial, we will use consensus to discover motifs in the upstream sequences of the PHO family, following the same approach as in the tutorial on oligo-analysis.

Example of utilization
We will retrieve the upstream sequences of a set of genes involved in phosphate metabolism, and try to predict their regulatory motif with consensus.
The two main parameters to estimate are the matrix width and the expected number of sites. For the matrix width, you should try several values, since some factors (like Pho4p) have a binding core of 5-10 contiguous base pairs, whereas other factors (like Gal4p) recognize a pair of trinucleotides, separated by a spacer of fixed width but variable content (see the tutorial on dyad-analysis).
Retrieve the upstream sequences of the PHO genes. Make sure to inactivate the option Prevent overlap with neighbour genes.
PHO5
PHO8
PHO11
PHO81
PHO84
Send the resulting sequences to consensus by clicking on the button consensus at the bottom of the result page.
In the consensus form, set the expected number of sites to 10. This relies on the observation that, in yeast, there are often between 1 and 3 binding sites for the same transcription factor per upstream sequence. Since there are 5 genes in our test set, assuming 2 sites per gene gives an expected number of sites of 10. Two binding sites per gene is a reasonable first guess for yeast regulons, but beware that this may vary from organism to organism.
Leave all other parameters unchanged and click GO.
After a few tens of seconds, you should see the result of this analysis.
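The value of 10 used for the expected number of sites in the steps above is simple arithmetic: number of genes times an assumed number of sites per gene. A minimal Python sketch of that first guess is given below. The function name `expected_site_count` and its default of 2 sites per gene are illustrative assumptions, not part of consensus or of RSA-tools.

```python
def expected_site_count(n_genes: int, sites_per_gene: int = 2) -> int:
    """Return a first-guess value for the expected number of sites.

    Hypothetical helper: it only multiplies the number of genes by an
    assumed number of binding sites per upstream sequence.
    """
    if n_genes < 1 or sites_per_gene < 1:
        raise ValueError("n_genes and sites_per_gene must be positive")
    return n_genes * sites_per_gene


# The five PHO genes used in this tutorial.
pho_genes = ["PHO5", "PHO8", "PHO11", "PHO81", "PHO84"]

# 5 genes x 2 sites per gene (the tutorial's first guess for yeast).
print(expected_site_count(len(pho_genes)))
```

Since the tutorial mentions 1 to 3 sites per upstream sequence in yeast, the same helper can be used to bracket the guess (5 to 15 sites for this gene set) when you repeat the analysis with other settings.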
Interpreting the results

The consensus output starts with a header, which describes the parameters and the data set. After the header, the program returns a set of matrices.
Each matrix represents the alignment of a set of sequence fragments, whose size was specified in the consensus parameters.
By default, the program returns the 4 most significant matrices. These matrices are often slight variants of the same motif. With the PHO family, the 4 matrices are centered on CACGTG, which is indeed the core of the Pho4p binding site.
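To make concrete what "a matrix representing the alignment of a set of sequence fragments" means, here is a small Python sketch that counts residues column by column and derives a majority consensus. The four fragments are invented for illustration only (they are not actual consensus output), and the function names `count_matrix` and `majority_consensus` are hypothetical. The sketch does not reproduce the output format or the scoring of the consensus program.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_matrix(sites):
    """Count residues per column for a list of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    columns = [Counter(column) for column in zip(*sites)]
    # One row per residue, one entry per aligned position.
    return {base: [col[base] for col in columns] for base in ALPHABET}


def majority_consensus(matrix):
    """Return the most frequent residue at each position."""
    width = len(matrix["A"])
    return "".join(
        max(ALPHABET, key=lambda base: matrix[base][i]) for i in range(width)
    )


# Invented fragments of width 8, used only to illustrate the layout.
sites = ["GCACGTGG", "ACACGTGC", "GCACGTTG", "TCACGTGG"]

matrix = count_matrix(sites)
for base in ALPHABET:
    print(base, *matrix[base])
print("consensus:", majority_consensus(matrix))
```

With these toy fragments, the majority consensus contains the CACGTG core, flanked by less conserved positions, which is the kind of pattern to look for when comparing the matrices returned for the PHO family.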
The most informative statistic for estimating the significance of the discovered matrices is the expected frequency. With the settings used in the above example, we obtained an expected frequency of 0.07. This value indicates the number of matrices of equal or higher significance that would be obtained by chance alone.

Additional exercises
Use consensus to predict regulatory elements in the GAL regulon.
GAL1
GAL2
GAL3
GAL7
GAL80
GCY1
Note that you will need to try different matrix widths to detect the Gal4p motif (you can for example try 10, 20, 30, and 40). For each matrix width, evaluate the significance of the result on the basis of the expected frequency. Which matrix width gives the highest significance? Which motif is returned? Does it contain the Gal4p consensus?
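To check whether a returned motif contains the Gal4p consensus, a regular expression can be convenient. The sketch below assumes the commonly cited form of the Gal4p site, CGG followed by an 11 bp spacer and CCG (a pair of trinucleotides separated by a spacer of fixed width but variable content). The function name `find_gal4_like` and the test sequence are illustrative assumptions. The sketch only handles plain A, C, G, T strings, not degenerate IUPAC letters.

```python
import re

# Assumed Gal4p-like pattern: CGG, 11 arbitrary bases, CCG (17 bp in total).
# The lookahead lets overlapping matches be reported as well.
GAL4_LIKE = re.compile(r"(?=(CGG[ACGT]{11}CCG))")


def find_gal4_like(sequence):
    """Return (start, matched_site) for each Gal4p-like site in sequence."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in GAL4_LIKE.finditer(sequence)]


# Invented sequence containing one site, for illustration only.
example = "TTCGGAGCACTGTTGACCGAA"
print(find_gal4_like(example))
```

Because the reverse complement of CGG is CCG, the same pattern matches a site on either strand, so a single scan of the motif is enough for this check.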
Select a random set of 10 yeast genes with the program random-genes, retrieve their upstream sequences (without overlap with neighbour genes), and test this random selection with consensus. Perform this test a few times, with different numbers of random genes, and different settings. Examine the expected frequency of the resulting matrices.
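If you prefer to draw the random gene sets yourself rather than with random-genes, the idea can be sketched in Python as follows. This is a stand-in for illustration, not the random-genes program: the function name `pick_random_genes` is hypothetical, and the small gene pool below is a placeholder built from gene names used in this tutorial (a real control would sample from the complete list of yeast genes).

```python
import random


def pick_random_genes(gene_ids, n, seed=None):
    """Draw n distinct gene identifiers at random, without replacement."""
    gene_ids = list(gene_ids)
    if n > len(gene_ids):
        raise ValueError("cannot draw more genes than the pool contains")
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(gene_ids, n)


# Placeholder pool; replace it with the full list of yeast gene names.
pool = ["PHO5", "PHO8", "PHO11", "PHO81", "PHO84",
        "GAL1", "GAL2", "GAL3", "GAL7", "GAL80", "GCY1"]

# Repeat the draw with different set sizes, as suggested in the exercise.
for size in (3, 5, 10):
    print(size, pick_random_genes(pool, size))
```

Recording the seed of each draw makes it possible to rerun consensus on exactly the same random selection when comparing settings.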
You can now go back to the main tutorial page and follow the next tutorials.
