RSA-tools - Tutorials - dyad-analysis
Contents
- Introduction
- Motif discovery with a small gene family
- Interpreting the results
- Drawing a feature-map with the discovered patterns
- Additional exercises
Introduction
The analysis of word frequencies gives good results for many families of co-regulated genes, but fails for a specific class of transcription factors: the Zn cluster proteins. The reason is that these proteins have two distant points of contact with DNA. Each contact point imposes specificity over 3 base pairs, and the two contact points are separated by an intermediate region of fixed width but variable content. The width of the spacing is specific to each transcription factor. This kind of pattern is not only found in yeast; it is also characteristic of the HTH proteins in prokaryotes, which also bind spaced pairs of trinucleotides.
We designed a specific algorithm to extract such motifs: dyad-analysis. The statistical treatment has been described in detail in:
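To make the notion of a spaced pair of trinucleotides concrete, here is a minimal Python sketch that counts every dyad (left trinucleotide, spacing, right trinucleotide) in a set of sequences. It is an illustration only and not the dyad-analysis implementation: the function names, the toy sequences and the GATn{5}TCA style of label are assumptions made for this example, only one strand is scanned, and no statistics are computed.

```python
from collections import Counter

def count_dyads(sequences, monad_len=3, min_spacing=0, max_spacing=20):
    """Count occurrences of each (left, spacing, right) dyad on the given strand.

    Hypothetical helper for illustration; it only counts, it does not
    evaluate over-representation.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for spacing in range(min_spacing, max_spacing + 1):
            span = 2 * monad_len + spacing  # total width covered by the dyad
            for start in range(len(seq) - span + 1):
                left = seq[start:start + monad_len]
                right = seq[start + monad_len + spacing:start + span]
                # Skip windows containing undefined residues such as N.
                if set(left + right) <= set("ACGT"):
                    counts[(left, spacing, right)] += 1
    return counts

def dyad_label(left, spacing, right):
    """Write a dyad as left + n{spacing} + right (assumed notation)."""
    return f"{left}n{{{spacing}}}{right}"

# Invented toy sequences: both contain GAT and TCA separated by 5 bases
# whose content differs between the two sequences.
toy = ["CCGATACGTATCAGG", "TGATCCCCCTCAA"]
counts = count_dyads(toy)
print(dyad_label("GAT", 5, "TCA"), counts[("GAT", 5, "TCA")])
```

The spacer content differs between the two toy sequences, so no single 11-letter word is repeated, yet the dyad is counted twice. This is the situation in which counting plain words loses the signal.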
- van Helden, J., Rios, A. F. & Collado-Vides, J. (2000). Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 28(8):1808-18. Pubmed 10734201
[Free access to the PDF file].
Motif discovery with a small gene family
We will illustrate the usage of dyad-analysis with a family of genes which are expressed when galactose is provided in the culture medium. Starting from the set of upstream sequences for these genes, we will try to extract over-represented motifs. We will first try to analyse these sequences with oligo-analysis and see how the program performs with this kind of motif. Then we will analyse the same sequences with dyad-analysis and compare the results.
- Retrieve upstream sequences from -800 to -1 for the following yeast genes (as you have seen in the tutorial on sequence retrieval). Make sure to inactivate the option prevent overlap with neighbour genes.
GAL1 GAL2 GAL3 GAL7 GAL10 GAL80 GCY1 MTH1 PCL10 FUR4
- Once you have the sequences, click the button labelled oligonucleotide analysis. Leave all other parameters unchanged and click GO.
Which result do you obtain? How many patterns were selected? What is their significance?
Remember that patterns with a significance index (occ_sig) lower than 1 should not be considered too seriously as putative regulatory elements. They are likely to appear more or less once per random sequence set. So, when an analysis does not return any pattern with a significance > 1, it can be considered as a negative answer.
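The threshold on the significance index can be read as a statement about expected false positives. The sketch below assumes that the index is the negative base-10 logarithm of the E-value, and that the E-value is the P-value multiplied by the number of patterns tested. This is our reading of the statistics described in the paper cited in the introduction; the function names are hypothetical and are not part of the program.

```python
import math

def significance(p_value, n_tested):
    """Significance index, ASSUMING sig = -log10(P-value * number of tests)."""
    e_value = p_value * n_tested  # expected number of false positives
    return -math.log10(e_value)

def expected_false_positives(sig):
    """Invert the assumed definition: E-value = 10 ** (-sig)."""
    return 10 ** (-sig)

# How many patterns of at least this significance would be expected
# by chance in one random sequence set, under the assumption above.
for sig in (0, 1, 2):
    print(sig, expected_false_positives(sig))
```

Under this assumption, each additional unit of significance divides the number of patterns expected by chance by ten, which is why an analysis returning nothing above the threshold is treated as a negative answer.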
With the GAL family, oligo-analysis did not detect any pattern with a significance higher than 1. The program is thus unable to identify any really significant motif in the upstream sequences of the GAL genes. This comes from the fact that these genes are regulated by a Zn-cluster protein, Gal4p, which binds a spaced dyad. This is precisely the type of pattern for which dyad-analysis has been designed.
- Click the back button of your browser until you return to the page with the upstream sequences you retrieved. This time, click on the button dyad-analysis in the Next step box.
- The default parameters are to scan pairs of trinucleotides spaced by any length between 0 and 20. These parameters are appropriate when you have no a priori idea of the spacing, since they will evaluate a good range of possible spacing values.
Note that the computation time is directly proportional to the spacing range: when 21 possible spacings are tested (from 0 to 20), the processing can take a few tens of seconds to a few minutes, depending on the server load.
- Leave all other parameters unchanged and click GO. You will have to wait for the answer a bit longer than for oligo-analysis (it usually takes 20 seconds for this test case).
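The link between spacing range and computation time noted above follows from the size of the search space. As a rough sketch, the number of possible dyads on one strand is the number of trinucleotide pairs multiplied by the number of spacings. The function below is hypothetical and ignores any grouping of reverse-complementary dyads, so the count reported in the header of the result page may differ.

```python
def dyad_search_space(monad_len=3, min_spacing=0, max_spacing=20, alphabet_size=4):
    """Number of possible (left, spacing, right) dyads on a single strand."""
    n_spacings = max_spacing - min_spacing + 1
    n_pairs = alphabet_size ** (2 * monad_len)  # 4^3 left words times 4^3 right words
    return n_pairs * n_spacings

print(dyad_search_space())               # 4096 pairs x 21 spacings = 86016
print(dyad_search_space(max_spacing=0))  # a single spacing: 4096
```

Each additional spacing value adds another 4096 candidate dyads to count and test, which is why restricting the range shortens the analysis when the expected spacing is known in advance.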
Interpreting the results
The results of dyad-analysis are displayed in the same format as those of oligo-analysis. In principle, you should already have performed the tutorial on oligo-analysis, and you should thus be able to interpret the dyad-analysis result page.
- How many distinct dyads were analyzed? (This information appears in the header at the beginning of the result.)
- How many dyads are selected as significant?
- What is the highest significance index?
- Now look at the result of the pattern-assembly, at the bottom of the result page. How many patterns are assembled into an alignment?
- How many nucleotides of the consensus are specified (different from N)?
The pattern-assembly returned two alignments, but you can easily see that these alignments are closely related: they only differ by one substitution. Actually, the pattern-assembly has an option to allow a given number of substitutions, but with dyads, allowing one substitution tends to assemble too many patterns, so we inactivated it.
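The two checks suggested above, counting the specified positions of a consensus and comparing two assembled patterns, can be done by hand or with a few lines of Python. The sketch below uses invented consensus strings rather than the actual output for the GAL family, and the helper names are hypothetical. It treats only N as unspecified and compares patterns position by position, assuming they are already aligned to the same length.

```python
def specified_positions(consensus):
    """Count positions of a consensus that are not the wildcard N."""
    return sum(1 for base in consensus.upper() if base != "N")

def substitutions(pattern_a, pattern_b):
    """Number of positions at which two aligned patterns differ."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must be aligned to the same length")
    return sum(1 for a, b in zip(pattern_a.upper(), pattern_b.upper()) if a != b)

# Invented consensus strings for illustration only.
first = "GATNNNNNTCA"
second = "GATNNNNNTCC"

print(specified_positions(first))    # 6 specified positions
print(substitutions(first, second))  # 1 substitution
```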
Additional exercises
- Analyze the MET family (see tutorial on sequence retrieval) with dyad-analysis, and compare the results to those previously obtained with oligo-analysis. Discuss the differences.
- As a negative control, select some families of random genes (tool random gene selection in the menu Genomes and genes of the left frame), and apply dyad-analysis to discover patterns in their upstream regions. Discuss the result.
- Until now, we only analyzed "easy" cases, since we used groups of genes which are all regulated by the same factor. In reality, we will often use these tools to analyze noisy data sets, like those obtained from microarray data. The programs are quite robust to noise, and are still able to detect regulatory patterns even if the data set contains some genes which are not regulated by the same transcription factor as the other ones.
We tested this with various data sets, but you can try it yourself, for example with the clusters of genes expressed at different stages of the cell cycle (Spellman et al., 1998). The clusters defined by Spellman and co-workers are available in the data repository. Select some of these clusters, and apply the different motif discovery approaches described in these tutorials to detect putative regulatory elements. You can then compare these elements with those annotated in SCPD, and evaluate whether the predicted motifs correspond to binding sites for factors involved in the cell cycle.
You can now come back to the tutorial main page and follow the next tutorials.