Knowing that a set of genes are co-expressed (e.g. they show similar expression profiles in an microarray experiment), we can suppose that some elements are shared by their upstream region, and one would like to detect such elements.
We implemented a simple and fast method to extract such elements, based on a detection of over-represented oligonucleotides. This method has been described in detail in J.Mol. Biol. (1998) 281, 827-842.
We will illustrate this approach with a set of genes for which the regulatory sites is well known : the genes responding to a stress in methionine. The motif discovery program will of course not receive any indication about these known regulatory motifs. We will only use them a posteriori to compare the discovered motifs with those documented in the literature.
Retrieve upstream sequences from -800 to -1 for the following yeast genes (as you have seen in the tutorial on sequence retrieval).
Since we are working with an eukaryote, make sure that the option Prevent overlaps with neighbour genes is inactivated before retrieving the sequence.
MET1 MET2 MET3 MET6 MET14 MET19 MET25 MET30 MUP3 SAM1 SAM2
Once you get the result, go to the bottom of the sequences. You should see a list of buttons, which allow you to send these sequences as input for another program. Click the button labelled oligonucleotide analysis.
A new form appears.
Leave all parameters unchanged and click GO.
The results of the analysis are displayed in a table. Each row corresponds to one oligonucleotide, and each column to one statistical criterion. The header indicates the content of each column, which depends on the return parameters selected in the oligo-analysis form.
Note the particularly high degree of over-representation for the hexanucleotide CACGTG. Thirteen occurrences are observed, whereas only 1.22 would be expected by chance. CACGTG is the core of the binding site for Met4p, the main transcription factor for the regulation of methionine biosynthesis.
The P-value (column occ_P) represents the probability to observe at least 13 occurrences when expecting 1.22. For CACGTG, it is of the order of 10-9. However, this P-value might be misleading, because in this analysis we considered 2080 possible patterns. Indeed, there are 4096 possible hexanucleotides, but we regrouped each of them with its reverse complement, resulting in 2080 distinct patterns. Thus, we performed a simultaneous test of 2080 hypotheses.
The E-value (column occ_E) provides a more reliable and intuitive statistics than the P-value. The E-value is simply obtained by multiplying the P-value by the number of distinct patterns. It represents the number of patterns with the same level of over-representation which would be expected by chance alone. For CACGTG, the E-value is of the order of 10-6, indicating that, if we would submit random sequences to the program, such a level of over-representation would be expected every 1,000,000 trials.
An even more intuitive statistics is provided by the significance index (column occ_sig), which is the minus log transform (in base 10) of the E-value. The higher values are associated to the most significant patterns. On the average, a significance higher than 0 would be expected by chance alone once per trial (sequence set). A score higher than 1 every ten trials, a score higher than 2 every 100 trials, and a score higher than 6 every 106 trials.
The default parameters were chosen to return no more than one pattern per sequence set (threshold on significance is 0). With these settings, the analysis of a small regulon typically returns half a dozen of hexanucleotides, among the 2080 possibilities. In addition, these hexanucleotides are related with each other, and can be assembled to provide a more refined description of the predicted binding sites, as discussed below.
Below the oligo-analysis result (the table with one row per oligo and one column per statistics), you can see the result of pattern-assembly. with the MET family, two clusters of patterns can be formed.
The first cluster contains the most significant hexanucleotide, CACGTG, and the strongly overlapping ACGTGa. In addition, since in our analysis we considered both strands as equivalentm, the reverse complement of this hexanucleotide, tCACGT,is also considered. The assembly of these three hexanucleotides results in an octanucleotide, tCACGTGa, which perfectly corresponds to the Met4p consensus.
The second cluster of hexanucleotides forms a completely different motif, whose consensus is aactgtgg. This corresponds to the binding site of the homolog transcription factors Met31p and Met32p, which cooperate with Met4p for the regulation of methionine biosynthesis and sulfur assimilation.
The pattern assembly left one isolated pattern, TAGTCA, which is related to neither Met4p nor to Met31p binding sites. The significance of this isolated pattern is very close to 0, and it is likely to be a false poitive.
One possibility would be to increase the threshold of significance in order to reduce the rate of false positive. however, this would also result in the loss of some biologically relevant patterns, like ACTGTG and CTGTGG, which contributed to the description of the Met31p binding site. Thus, our strategy is to set the significance threshold to 0, and to ignore isolated patterns which have a low significance index.
In the bottom of the result page, click on the button Pattern matching (dna-pattern). The same form is displayed as in the previous tutorial on pattern matching, but this time the Query pattern(s) box has been automatically filled with the result of oligonucleotide analysis. Do not change anything to the form and click GO.
The list of matching positions of the over-represented oligonucleotides is displayed. Notice that each match is associated to a score, which is the significance obtained for that hexanucleotide in oligo-analysis.
On the bottom of the result page, click the Feature map bbutton.
A form appears which will allow yo to generate a graphical representation of the matching positions.
Fill the From box with -800,
Fill the To box with 0
For the Feature handle option, select "none".
Make sure that the option Feature thickness proportional to score is activated.
Click GO.
The feature map appears. The boxes have now different thicknesses, reflecting the respective matching score.
This proportional thickness strongly emphazises the CACGTG pattern, which is the core of the main transcription factor for the selected genes. Notice the overlap between these very high boxes and smaller boxes. These clusters of mutually overlapping boxes indicate that the consensus extends over 6 base pairs of the core (most matches correspond to the octanucleotide tCACGTGa). You can check this by positioning the cursor above a feature box, and reading the matching sequence in the status bar of your web browser.
A problem with the automatic feature thickness, is that some patterns with a lower significance are barely visible, including the Met31p binding sites. we can now refine the representation to increase the visibility of these low-scoring motifs.
Come back to the feature-map form.
Set the min feature thickness to 5. This will forc the boxes to have a thickness of at least 5 pixels.
Leave all other parameters as descriibed above, and click GO.
This modification allows you to better detect the Met31p binding sites. Notice that these also appear as clusters of mutually overlapping boxes, which emphazises the conserved octanucleotide AAACTGTG.
Use oligo-analysis to predict cis-acting elements with the following family of phosphate-responding genes.
PHO5 PHO8 PHO11 PHO81 PHO84
Compare the discovered motifs with the known consensus (CACGTGggg or CACGTTtt) for Pho4p, the transcription factor mediating response to phosphate stress.
Compare the results obtained with the same family when the option Prevent overlap with upstream matches is activated and inactivated respectively, during sequence retrieval. Explain the difference.
Select a family of 10 random genes with the program random gene selection (in the left frame), and analyze their upstream region with oligo-analysis. Try this a few times, with different numbers of genes (e.g. 5, 10, 20, 50). How many patterns are predicted as significant ?
Use oligo-analysis to predict cis-acting elements involved in galactose utilization, on the basis of the following genes.
GAL1 GAL2 GAL3 GAL5 GAL6 GAL7 GAL10 GAL80 GCY1
How do you interpret the result ? o you obtain patterns with a high significance ? Come back to this exercise after the practivcal on dyad-analysis.
Use retrieve-seq and oligo-analysis to count the frequencies of the different stop codons in the complete yeast genomes. Tips
Use the same procedure to check the start codons in all genes from the yeast genome.
You can now come back to the tutorial main page and follow the next tutorials.