RSAT - pattern search manual
Name
Description
Searches all occurrences of a pattern within DNA sequences. The pattern can be entered as a simple nucleotide sequence, or may include degenerate nucleotide codes and regular expressions.
Options
Query pattern(s)
The pattern(s) to search for. Each pattern must be the first word
of a new line. An identifier can be provided after each pattern, separated
by a tab character.
Example:
TTGTT TGbox
GATWA GATAbox
CCCCT STRE
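As an illustration of this input layout, the following Python sketch reads a pattern list of that form. It is not the program's own parser: the function name parse_patterns is hypothetical, splitting on any whitespace (rather than strictly on tabs) is an assumption, and so is falling back to the pattern itself when no identifier is given.

```python
def parse_patterns(text):
    """Return a list of (pattern, identifier) pairs, one per non-empty line."""
    patterns = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pattern = fields[0]  # the pattern is the first word of the line
        # Assumption: a pattern without identifier is labelled by itself.
        identifier = fields[1] if len(fields) > 1 else pattern
        patterns.append((pattern, identifier))
    return patterns


queries = parse_patterns("TTGTT\tTGbox\nGATWA\tGATAbox\nCCCCT\tSTRE\n")
for pattern, identifier in queries:
    print(pattern, identifier)
```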
Ambiguous nucleotide codes of the IUPAC-IUB commission are supported.
A (Adenine)
C (Cytosine)
G (Guanine)
T (Thymine)
R = A or G (puRines)
Y = C or T (pYrimidines)
W = A or T (Weak hydrogen bonding)
S = G or C (Strong hydrogen bonding)
M = A or C (aMino group at common position)
K = G or T (Keto group at common position)
H = A, C or T (not G)
B = G, C or T (not A)
V = G, A or C (not T)
D = G, A or T (not C)
N = G, A, C or T (aNy)
Upper and lower case are considered equivalent.
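The table above can be read as a mapping from each code to a set of nucleotides. The Python sketch below shows one way to turn a degenerate pattern into a regular expression and search with it, ignoring case. The names IUPAC, iupac_to_regex and find_all are hypothetical, and reporting 0-based start offsets is an assumption made for the example, not a description of the program's output.

```python
import re

# One character class per IUPAC code, following the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]",
    "H": "[ACT]", "B": "[CGT]", "V": "[ACG]", "D": "[AGT]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate a pattern made only of IUPAC letters into a regex string."""
    return "".join(IUPAC[code] for code in pattern.upper())


def find_all(pattern, sequence):
    """Return the 0-based start of every match, overlapping ones included."""
    # The lookahead (?=...) lets the scan report overlapping occurrences.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")", re.IGNORECASE)
    return [match.start() for match in regex.finditer(sequence)]


print(iupac_to_regex("GATWA"))
print(find_all("GATWA", "ccGATAAgatta"))
```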
The pattern can also contain regular expression elements:
- GAT[TA]AG means "GATAAG or GATTAG"
(equivalent to GATWAG).
- CGGN{11}CCG means CGG followed by 11 N followed
by CCG.
- GATAAGN{0,30}GATAAG means two GATAAG spaced by 0
to 30 nucleotides.
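These three constructs behave the same way in most regular expression engines, so they can be checked with Python's re module. The sketch below is only an illustration: the helper n_to_class is hypothetical, and it expands only the code N, leaving the other degenerate codes aside.

```python
import re


def n_to_class(pattern):
    """Replace the code N by the character class [ACGT].

    Assumption: the pattern contains no degenerate code other than N.
    """
    return pattern.upper().replace("N", "[ACGT]")


alternative = re.compile(n_to_class("GAT[TA]AG"), re.IGNORECASE)
fixed_spacer = re.compile(n_to_class("CGGN{11}CCG"), re.IGNORECASE)
variable_spacer = re.compile(n_to_class("GATAAGN{0,30}GATAAG"), re.IGNORECASE)

# N{11} requires exactly 11 nucleotides between CGG and CCG.
print(bool(fixed_spacer.fullmatch("CGG" + "A" * 11 + "CCG")))
print(bool(fixed_spacer.fullmatch("CGG" + "A" * 10 + "CCG")))
```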
Input sequence
The sequence in which the pattern has to be searched.
Input sequence format:
The following sequence formats are supported:
- raw: the raw sequence without any identifier or comment.
- multi: several raw sequences concatenated.
- IG: IntelliGenetics format.
- FastA: the sequence format used by FastA, BLAST, Gibbs sampler and many other bioinformatics programs.
- Wconsensus: the format defined by Jerry Hertz for his programs (patser, consensus, wconsensus).
Search strands
The pattern can be searched on the direct (W for Watson), the
reverse complement (C for Crick), or both strands.
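Searching a word on the reverse complementary strand is equivalent to searching its reverse complement on the direct strand. The Python sketch below illustrates this for plain A, C, G, T words. The function names and the strand labels passed as arguments ("W", "C", "both") are assumptions chosen for the example, as is the use of 0-based offsets on the direct strand.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each nucleotide, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def search_strands(word, sequence, strands="both"):
    """Return (strand, start) pairs; start is a 0-based offset on the direct strand."""
    seq = sequence.upper()
    word = word.upper()
    hits = []
    if strands in ("W", "both"):
        regex = "(?=" + re.escape(word) + ")"
        hits += [("W", m.start()) for m in re.finditer(regex, seq)]
    if strands in ("C", "both"):
        # A hit on the Crick strand appears as the reverse complement
        # of the word when reading the Watson strand.
        regex = "(?=" + re.escape(reverse_complement(word)) + ")"
        hits += [("C", m.start()) for m in re.finditer(regex, seq)]
    return hits


print(reverse_complement("GATAAG"))
print(search_strands("CCCCT", "aaCCCCTttAGGGGaa"))
```

Note that a word identical to its own reverse complement is reported once per strand at each position in this sketch.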
Substitutions
Imperfect matches can be allowed, with a given number of substitutions.
Insertions and deletions are not supported.
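Since only substitutions are allowed, a match with at most k substitutions is a window of the same length as the pattern that differs from it at no more than k positions. The Python sketch below shows this naive window scan for plain nucleotide words. It is an illustration under that assumption, not the algorithm used by the program, and the function name is hypothetical.

```python
def find_with_substitutions(word, sequence, max_subst=0):
    """Return (start, matched_word, mismatches) for each accepted window.

    start is a 0-based offset; the comparison ignores case.
    """
    word = word.upper()
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - len(word) + 1):
        window = seq[start:start + len(word)]
        # Count the positions at which the window differs from the word.
        mismatches = sum(1 for a, b in zip(word, window) if a != b)
        if mismatches <= max_subst:
            hits.append((start, window, mismatches))
    return hits


print(find_with_substitutions("CCCCT", "AACCCGTAA", max_subst=1))
```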
Return
The program either returns the matching positions, or the number of
matches within each input sequence.
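The two return types differ only in how the same list of hits is reported. A minimal Python sketch, assuming the input is a dictionary of sequence identifiers to sequences and that the pattern is a plain word; the function name and the mode values "positions" and "counts" are hypothetical.

```python
import re


def count_or_locate(word, sequences, mode="positions"):
    """Report, for each sequence, either the 0-based match starts or their number."""
    regex = re.compile("(?=" + re.escape(word) + ")", re.IGNORECASE)
    result = {}
    for seq_id, seq in sequences.items():
        starts = [match.start() for match in regex.finditer(seq)]
        result[seq_id] = starts if mode == "positions" else len(starts)
    return result


example = {"seq1": "GATAAGATAA", "seq2": "cccc"}
print(count_or_locate("GATAA", example, mode="positions"))
print(count_or_locate("GATAA", example, mode="counts"))
```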
Flanking
When the option "return matching position" is selected, the
program also returns the sequence of the matching word. This sequence can
be extended leftwards and rightwards with the flanking option.
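Extending a match by its flanks amounts to widening the extracted window on both sides, without running past the ends of the sequence. The Python sketch below illustrates this; the function name is hypothetical, and printing the match in upper case with the flanks in lower case is an assumption made to keep the two visually distinct.

```python
def match_with_flanks(sequence, start, length, flank=0):
    """Return the matching word plus up to `flank` nucleotides on each side.

    start is the 0-based offset of the match; flanks are truncated at the
    sequence boundaries.
    """
    left = max(0, start - flank)
    right = min(len(sequence), start + length + flank)
    end = start + length
    return (sequence[left:start].lower()
            + sequence[start:end].upper()
            + sequence[end:right].lower())


print(match_with_flanks("AACCCCTTTAGG", start=2, length=5, flank=2))
```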
Origin
Matching positions can be calculated either relative to the sequence start, or to the sequence end. The second option is most useful for upstream sequences, since it directly indicates the distance between the site and the ORF start. It is thus selected by default.
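The two origins differ by a shift equal to the sequence length. The Python sketch below converts a 0-based match offset into coordinates relative to either origin. The exact conventions are assumptions made for the example and may differ from the program's output: positions count from 1 at the sequence start, and are negative relative to the end, with the last nucleotide at -1. The function name is hypothetical.

```python
def match_coordinates(start, length, seq_length, origin="end"):
    """Convert a 0-based match offset into (first, last) coordinates.

    origin="start": 1-based positions counted from the sequence start.
    origin="end":   negative positions, the last nucleotide being -1.
    """
    if origin == "start":
        return (start + 1, start + length)
    if origin == "end":
        return (start - seq_length, start + length - 1 - seq_length)
    raise ValueError("origin must be 'start' or 'end'")


# A 5-nucleotide site at offset 700 of an 800-nucleotide upstream sequence.
print(match_coordinates(700, 5, 800, origin="start"))
print(match_coordinates(700, 5, 800, origin="end"))
```

With an upstream sequence that ends just before the ORF, the coordinates relative to the end read directly as a distance upstream of the ORF start.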