RSA-tools - Tutorials - genome-scale dna-pattern
Introduction
We now consider the case where we know the binding specificity of a given transcription factor, and we are looking for all the genes potentially regulated by this factor. We will use nitrogen regulation as an example. The response to nitrogen is mediated by the so-called GATA box, whose consensus is GATAAG. An interesting property is that a single GATA box is insufficient to affect transcription: all genes effectively controlled by these elements possess at least one group of 2 to 4 closely associated GATA boxes.
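The clustering property described above can be illustrated with a minimal Python sketch. It is not part of RSA-tools: the function names, the toy sequence, and the 50 bp span used to define "closely associated" are assumptions chosen for illustration, and only the given strand is scanned.

```python
def match_positions(seq, pattern="GATAAG"):
    """Return the start positions of every occurrence of pattern in seq."""
    seq = seq.upper()
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        start = seq.find(pattern, start + 1)
    return positions


def has_cluster(positions, min_boxes=2, max_span=50):
    """True if at least min_boxes matches start within max_span bp of each other.

    max_span is a hypothetical cutoff; the tutorial does not define
    how close "closely associated" boxes have to be.
    """
    positions = sorted(positions)
    for i in range(len(positions) - min_boxes + 1):
        if positions[i + min_boxes - 1] - positions[i] <= max_span:
            return True
    return False


# Toy sequence (not real yeast data): two boxes close together, one far away.
toy = "GATAAG" + "A" * 10 + "GATAAG" + "A" * 100 + "GATAAG"
toy_positions = match_positions(toy)
clustered = has_cluster(toy_positions)
isolated = has_cluster([0, 122])
```

Here `clustered` is true because the first two boxes start 16 bp apart, whereas two boxes 122 bp apart do not form a cluster under the assumed cutoff.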
Selecting genes on the basis of pattern counts
We will use dna-pattern to select all genes having 3 or more occurrences of the GATA box in their 500 bp upstream region.
- On the left frame, click genome-scale dna-pattern (strings).
- In the Sequence retrieval options, make sure that the selected organism is Saccharomyces cerevisiae, and that Sequence type is set to upstream.
- Specify the sequence limits from -500 to -1.
- In the Query pattern(s) box, type
GATAAG
- In the Return option, select match counts.
- Change the Threshold to 3.
- Click GO
After a few seconds you will see a list of all the yeast genes having 3 or more occurrences of the GATAAG pattern in their 500 bp upstream sequence. This list contains some genes of completely unknown function, some genes with a weak similarity to other proteins, and some genes of known function (generally, those having a 3-letter + 1-number name).
Notice that, despite the rudimentary criterion used in this search (3 occurrences of a single hexanucleotide), most of the genes with known function are associated with nitrogen metabolism (actually, this is one of the very rare cases where a single pattern count returns decent results, see remarks below). The genes with unknown function are good candidates for testing the response to nitrogen, and might be used for further experimental characterization.
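The selection made through the web form can be mimicked offline with a short Python sketch. This is not the dna-pattern implementation: the function names and the toy upstream sequences are hypothetical, and counting matches on both strands is an assumption about how occurrences are tallied.

```python
import re


def reverse_complement(seq):
    """Reverse complement of a DNA string made of A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def count_pattern(seq, pattern, both_strands=True):
    """Count occurrences of pattern in seq, overlapping matches included."""
    seq = seq.upper()
    # A zero-width lookahead lets re.findall report overlapping matches.
    count = len(re.findall(f"(?={re.escape(pattern)})", seq))
    if both_strands:
        rc = reverse_complement(pattern)
        if rc != pattern:  # avoid counting a palindromic pattern twice
            count += len(re.findall(f"(?={re.escape(rc)})", seq))
    return count


def select_genes(upstream, pattern="GATAAG", threshold=3):
    """Keep genes whose upstream sequence has at least threshold matches."""
    counts = {gene: count_pattern(seq, pattern) for gene, seq in upstream.items()}
    return {gene: n for gene, n in counts.items() if n >= threshold}


# Toy upstream sequences (hypothetical, much shorter than 500 bp).
upstream = {
    "geneA": "TTGATAAGCCCTTATCAAGATAAGTT",  # 2 direct + 1 reverse match
    "geneB": "ACGTACGTGATAAGACGT",          # 1 direct match
}
selected = select_genes(upstream)
```

With the threshold of 3 used in the tutorial, only `geneA` is retained in this toy example.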
Links to external databases
Each ORF identifier appears as a hyperlink, which is connected to the information on this ORF in the yeast genome database.
The right column of the result table links each gene to various databases. By clicking on a link, you will see the information on that particular gene in the selected database.
At the bottom of the page, you should see two buttons, which allow you to send the whole gene selection as a query to KEGG, the pathway database, or to yMGV, the yeast gene expression database. Beware, the link to yMGV can take a bit of time, because the profile drawings are generated on the fly, and there is one profile per gene and per microarray publication.
- Send your results to yMGV transcription profiles, and wait for the answer. Try to detect conditions under which most of the selected genes are either activated or repressed. Are these conditions associated with modifications of the nitrogen supply in the culture medium?
- Come back to the result page. Now send the results to KEGG pathway coloring. Are the selected genes associated with some specific pathway?
Remarks
Selecting genes on the basis of a single pattern count in their upstream regions is a very poor way to predict their regulatory properties. For this tutorial, we applied a very stringent criterion (at least 3 GATA boxes in a region restricted to 500 bp), which returned a good selection, but selectivity comes at the cost of coverage, and it is obvious that we missed many of the genes regulated by GATA boxes.
When the pattern is relatively large and degenerate (which is not the case for GATA boxes), matrix-based genome-scale pattern matching (with patser) provides better results than string-based pattern counts. However, even in such a case, there is a tradeoff between selectivity and coverage.
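To give an idea of how matrix-based matching differs from string counts, here is a minimal Python sketch of scoring a sequence with a log-odds matrix. It does not reproduce patser: the aligned sites are invented for illustration, and the uniform background (0.25 per base) and the pseudocount of 1 are assumptions.

```python
import math


def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix (one dict per column) from aligned sites."""
    width = len(sites[0])
    matrix = []
    for col in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log(counts[base] / total / background) for base in "ACGT"}
        )
    return matrix


def best_score(seq, matrix):
    """Return (score, position) of the best-scoring window, or None."""
    seq = seq.upper()
    width = len(matrix)
    best = None
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(matrix[j][base] for j, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, i)
    return best


# Hypothetical aligned binding sites, for illustration only.
sites = ["GATAAG", "GATAAG", "GATTAG", "GATAAC"]
matrix = build_log_odds(sites)
hit = best_score("CCCCGATAAGCCCC", matrix)
```

Unlike a string match, every window receives a graded score, so a threshold on the score (rather than an exact match) decides which positions are reported.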
We are currently working on more complex approaches, based on combinations of motifs, which return more accurate results. These approaches are time-consuming and are currently not supported on the web server, but we will soon develop a web interface for them.
You can now come back to the tutorial main page and follow the next tutorials.
For suggestions or information request, please contact