RSA-tools - Tutorials - dna-pattern
Introduction
dna-pattern is a program that matches known patterns against a set of sequences. Patterns are described as strings. Several features make the program particularly well suited to recognizing regulatory motifs in DNA sequences:
- Patterns can automatically be searched on both strands of the sequences.
- Patterns can contain spacers of fixed length (e.g. CGGn{11}CCG) or variable length (e.g. GATAAGn{0,60}GATAAG).
- Several patterns can be entered in the query box (the first word of each row is considered as a distinct pattern); all of them will be matched against the sequences.
- Several sequences can be entered in the sequence box, patterns will be matched against all of them.
- Patterns can contain letters from the degenerate nucleotide alphabet (e.g. N for "any nucleotide", W for "A or T", ..., see the manual page for the complete list).
- Regular expressions are supported, which makes it possible to search for complex patterns including spacers (e.g. GATAAn{0,10}GATAA).
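The features listed above can be illustrated with a minimal Python sketch. This is not the dna-pattern implementation: the function names (`iupac_to_regex`, `reverse_complement`, `search_both_strands`) are hypothetical, and the sketch simply assumes that each IUPAC letter can be expanded into a regular-expression character class while `{m,n}` quantifiers are passed through unchanged. The sequence is searched on the direct strand, then its reverse complement is searched and the coordinates are mapped back to the direct strand.

```python
import re

# IUPAC degenerate nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def iupac_to_regex(pattern):
    """Expand IUPAC letters; copy anything inside {...} quantifiers as-is."""
    out = []
    in_braces = False
    for ch in pattern:
        if ch == "{":
            in_braces = True
        if in_braces:
            out.append(ch)
            if ch == "}":
                in_braces = False
        else:
            out.append(IUPAC.get(ch.upper(), ch))
    return "".join(out)


def reverse_complement(seq):
    """Reverse complement of a plain A/C/G/T sequence (case is preserved)."""
    return seq.translate(_COMPLEMENT)[::-1]


def search_both_strands(pattern, seq):
    """Return (strand, start, end, match) tuples in 1-based direct-strand coordinates."""
    regex = re.compile(iupac_to_regex(pattern), re.IGNORECASE)
    n = len(seq)
    hits = []
    for m in regex.finditer(seq):
        hits.append(("D", m.start() + 1, m.end(), m.group()))
    for m in regex.finditer(reverse_complement(seq)):
        # Map positions on the reverse strand back onto the direct strand.
        hits.append(("R", n - m.end() + 1, n - m.start(), m.group()))
    return hits
```

For example, `iupac_to_regex("CGGn{11}CCG")` gives `CGG[ACGT]{11}CCG`. Note that `finditer` reports non-overlapping matches only, which is a simplification of this sketch.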
Searching pattern positions
Let us suppose that you are searching for the positions of putative GATA boxes and Hap sites within the upstream regions of a selection of nitrogen-responding genes. We will first retrieve the sequences upstream from your genes.
- Retrieve upstream sequences from -800 to -1 for the following yeast genes (as you have seen in the tutorial on sequence retrieval)
DAL5 GAP1 MEP1 MEP2 MEP3 PUT4 DAL80
- Since you are working with a eukaryote, make sure to inactivate the option Prevent overlap with neighbour genes. Check that all your sequences are 800 bp long.
Now that we have the upstream sequences, we will scan them with the consensus patterns for GATA boxes and HAP sites. Below the sequences, a series of buttons is presented. These buttons allow you to send your sequences to a selection of sequence analysis programs. Click on dna-pattern (IUPAC). A new form appears.
Note that the search will automatically be performed on the sequences you just retrieved (sequence transferred from your previous query). This differs from the form you would receive by clicking on "dna-pattern" in the left frame, and which would contain an empty box for entering your own sequences.
- In the Query pattern(s) box, we will enter the patterns to be searched for. Each pattern must come on a separate line. The first word of each line is the string description of the pattern, the second word is an identifier for this pattern. Type the following text in the Query pattern(s) box:
GATAAG Gata_box
CCAAY Hap_site
Note the use of the degenerate IUPAC code: the Y in CCAAY on the second line means "either C or T".
- Leave all other parameters unchanged and click GO.
You now see the positions of all matches with the patterns you entered within the upstream sequences of the selected genes. Each line shows a single match, and the columns indicate respectively:
- pattern identifier
- strand on which the match was found (D for direct, R for Reverse)
- pattern searched for (i.e. the query strings you provided)
- name of the sequence in which it was found
- starting position of the match
- end position of the match
- match sequence. The matching bases are shown in uppercase. The 4 flanking bases on each side are shown in lowercase.
- matching score. In this case all scores equal 1, but we will see later how to use this column.
Notice that positions are returned in negative coordinates, relative to the end of the sequence (the last nucleotide has position -1). This behaviour was selected with the "Origin" option in the dna-pattern form (Origin=end). This option is particularly useful for analyzing regulatory sequences, but it can be inactivated in other cases.
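The coordinate convention and the match display described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `match_row` is a hypothetical helper, not part of dna-pattern, and it takes a 0-based index into the sequence as input. It converts that index into coordinates relative to the end of the sequence (last nucleotide at -1) and renders the match in uppercase with up to 4 lowercase flanking bases on each side.

```python
def match_row(seq, start0, length, flank=4):
    """Return (start, end, display) for one match.

    start0 is the 0-based index of the first matching base in seq;
    start and end are relative to the sequence end (last base = -1).
    """
    n = len(seq)
    start = start0 - n
    end = start0 + length - 1 - n
    # Flanks are truncated when the match lies close to a sequence border.
    left = seq[max(0, start0 - flank):start0].lower()
    core = seq[start0:start0 + length].upper()
    right = seq[start0 + length:start0 + length + flank].lower()
    return start, end, left + core + right
```

With an 800 bp upstream sequence, a 6 bp match whose first base is at 0-based index 0 would therefore be reported from -800 to -795.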
We will now display the same results graphically.
- Click on the Feature map button on the bottom of the result page.
- In the Title box, type
Gata boxes and Hap sites in the upstream regions of NIT genes
- Under the title Display limits, fill
- the from box with -800,
- the to box with 0
- In the pop-up menu "feature handle", select symbol
- make sure the Dynamic map option is checked.
- Leave other parameters unchanged and click GO.
After a few seconds, the feature map should appear. A few comments:
- Gata boxes appear in blue, Hap sites in red
- A specific symbol is associated with each pattern, so that patterns can be distinguished when the feature map is printed in black and white
- Colored boxes are displayed either above or below the horizontal black lines, according to the strand of the match.
- Coordinates are given relative to the ORF start position: negative values indicate an upstream position, and positive coordinates are within the coding sequence (0 corresponds to the first nucleotide of the start codon).
- If your browser is recent, the map is dynamic. With your mouse, position the cursor just above one of the sites in the sequences. Now look at the status bar (at the bottom) of your browser window. The complete information about this site is displayed. Move the cursor to another site and check that the information is updated accordingly. If you are using Internet Explorer, make sure to activate the status bar (in the View menu).
Searching for complex patterns
We will now show an example of a search for patterns containing spacers.
Another characteristic of GATA boxes is that they often come clustered in the upstream region: nitrogen-responding genes usually have a pair of GATA boxes, separated by 0 to 60 base pairs. dna-pattern can search for spaced motifs by using a notation called regular expressions. For example:
- a repetition is specified by a number within curly brackets (e.g. A{6} is equivalent to AAAAAA)
- this can be combined with the IUPAC notation to specify a fixed spacing (e.g. n{30} means a spacing of exactly 30 nucleotides)
- variable number of repeats can be specified by entering two numbers, separated by a comma, in the curly brackets (e.g. n{0,60} means "between 0 and 60 nucleotides")
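The spacer notation described above maps directly onto Python's `re` syntax, which uses the same curly-bracket quantifiers. The following is a minimal sketch, assuming that `n` is expanded to `[ACGT]`. The helper `find_spaced_pairs` is hypothetical and does not reproduce how dna-pattern itself reports such matches. It uses a lookahead so that every possible start position is examined, and a non-greedy spacer so that each left motif is paired with its nearest right motif (a choice made for this sketch only).

```python
import re


def find_spaced_pairs(seq, left, right, min_gap, max_gap):
    """Find left, then min_gap..max_gap arbitrary bases, then right.

    Returns (start, end, gap_length) tuples with 0-based, end-exclusive indices.
    """
    # n{min,max} in the tutorial notation becomes [ACGT]{min,max} here;
    # the trailing ? makes the spacer as short as possible.
    regex = re.compile(
        "(?=(%s)([ACGT]{%d,%d}?)(%s))"
        % (re.escape(left), min_gap, max_gap, re.escape(right)),
        re.IGNORECASE,
    )
    hits = []
    for m in regex.finditer(seq):
        hits.append((m.start(), m.end(3), len(m.group(2))))
    return hits
```

Calling `find_spaced_pairs(seq, "GATAAG", "GATAAG", 0, 60)` corresponds to the pattern GATAAGn{0,60}GATAAG on the direct strand; setting `min_gap` and `max_gap` to the same value gives a fixed spacing.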
Run the tutorial as above, but enter the following patterns.
GATAAGn{0,60}GATAAG Gata_tandem
CTTATCn{0,60}GATAAG Gata_inv1
GATAAGn{0,60}CTTATC Gata_inv2
GATAA Gata_box
GATAAG Gata_box_strict
Counting multiple patterns in multiple sequences
A characteristic of yeast GATA boxes is that they act synergistically, i.e. nitrogen-responsive genes generally contain multiple GATA boxes in their upstream sequences. Thus, for this particular regulation, one might be interested in counting the number of matches rather than returning their precise positions. This can be done with dna-pattern.
- Come back to the dna-pattern form.
- Enter the same list of patterns as before.
- Deselect the checkbox match positions
- Select the checkbox match count table
- GO
The program returns a table, where each row represents a sequence and each column a pattern. Totals per row and per column are optionally included.
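The layout of such a table can be sketched in Python as follows. This is an illustration only: `count_table` is a hypothetical function, the patterns are assumed to be already written as regular expressions (e.g. CCAAY as `CCAA[CT]`), and for brevity only the direct strand is counted, whereas the web form can search both strands. Overlapping occurrences are counted by wrapping each pattern in a lookahead.

```python
import re


def count_table(sequences, patterns):
    """Build a match count table.

    sequences: {sequence_name: dna_string}
    patterns:  {pattern_identifier: regex_string}
    Returns a list of rows: header, one row per sequence, then column totals.
    """
    ids = list(patterns)
    # The lookahead wrapper allows overlapping occurrences to be counted.
    compiled = {pid: re.compile("(?=%s)" % patterns[pid], re.IGNORECASE) for pid in ids}
    rows = [["sequence"] + ids + ["total"]]
    col_totals = [0] * len(ids)
    for name, dna in sequences.items():
        counts = [sum(1 for _ in compiled[pid].finditer(dna)) for pid in ids]
        col_totals = [a + b for a, b in zip(col_totals, counts)]
        rows.append([name] + counts + [sum(counts)])
    rows.append(["total"] + col_totals + [sum(col_totals)])
    return rows


if __name__ == "__main__":
    # Made-up toy sequences, not real yeast upstream regions.
    toy = {"seqA": "GATAAGttGATAAGccaat", "seqB": "ccaacGATAAG"}
    for row in count_table(toy, {"Gata_box": "GATAAG", "Hap_site": "CCAA[CT]"}):
        print("\t".join(str(x) for x in row))
```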
You can now come back to the tutorial main page and follow the next tutorials.