RSAT - pattern search manual
Name
dna-pattern
Description
Searches all occurrences of a pattern within DNA sequences. The pattern can be entered as a simple nucleotide sequence, or may include degenerate nucleotide codes and regular expressions.
Options
Query pattern(s)
The pattern(s) you are searching for. Each pattern should be the first word
of a new line. An identifier can be provided after each pattern, separated
from it by a tab character.
Example:
TTGTT TGbox
GATWA GATAbox
CCCCT STRE
Ambiguous nucleotide codes of the IUPAC-IUB commission are supported.
Symbol  Nucleotide(s)    Description
A       A                Adenosine
C       C                Cytidine
G       G                Guanosine
T       T                Thymidine
R       A or G           puRines
Y       C or T           pYrimidines
W       A or T           Weak hydrogen bonding
S       G or C           Strong hydrogen bonding
M       A or C           aMino group at common position
K       G or T           Keto group at common position
H       A, C or T        not G
B       G, C or T        not A
V       G, A or C        not T
D       G, A or T        not C
N       G, A, C or T     aNy
Upper and lower case are considered equivalent.
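As an illustration of how the ambiguity codes above can be interpreted, the following Python sketch translates an IUPAC pattern into a regular expression and reports match positions. It is not the dna-pattern source (which is a Perl script); the names IUPAC, iupac_to_regex and find_matches are hypothetical, and the sketch only reports non-overlapping matches on the direct strand.

```python
import re

# IUPAC-IUB ambiguity codes mapped to the nucleotides they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a character-class regular expression."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]  # raises KeyError on an unknown symbol
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_matches(pattern, sequence):
    """Return 1-based start positions of matches, ignoring letter case."""
    regex = re.compile(iupac_to_regex(pattern), re.IGNORECASE)
    return [m.start() + 1 for m in regex.finditer(sequence)]

print(iupac_to_regex("GATWA"))
print(find_matches("gatwa", "ccGATAAggGATTAcc"))
```

Because the pattern is upper-cased and the search is case-insensitive, "gatwa" and "GATWA" give the same result.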
The pattern can also contain regular expression elements:
- GAT[TA]AG means "GATAAG or GATTAG"
(equivalent to GATWAG).
- CGGN{11}CCG means CGG followed by 11 N followed
by CCG.
- GATAAGN{0,30}GATAAG means two GATAAG spaced by 0
to 30 nucleotides.
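The three examples above can be checked with a short Python sketch that expands the ambiguity codes while leaving brackets and quantifiers untouched. The function pattern_to_regex is a hypothetical helper written for this illustration, not part of dna-pattern; it assumes that every letter outside a quantifier is a nucleotide code.

```python
import re

# IUPAC-IUB ambiguity codes mapped to the nucleotides they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}

def pattern_to_regex(pattern):
    """Expand IUPAC codes; keep [...] classes and {m,n} quantifiers as they are."""
    out = []
    in_class = False
    for ch in pattern.upper():
        if ch == "[":
            in_class = True
            out.append(ch)
        elif ch == "]":
            in_class = False
            out.append(ch)
        elif ch in IUPAC:
            bases = IUPAC[ch]
            # Inside an existing class, add the bases without extra brackets.
            if in_class or len(bases) == 1:
                out.append(bases)
            else:
                out.append("[" + bases + "]")
        else:
            out.append(ch)  # digits, commas and braces of a quantifier
    return "".join(out)

spaced = pattern_to_regex("CGGN{11}CCG")
print(spaced)
# N{11} requires exactly eleven nucleotides between CGG and CCG.
print(bool(re.fullmatch(spaced, "CGG" + "A" * 11 + "CCG")))
print(bool(re.fullmatch(spaced, "CGG" + "A" * 10 + "CCG")))
```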
Input sequence
The sequence in which the pattern has to be searched.
Input sequence format:
The following sequence formats are supported:
- raw: the raw sequence without any identifier or comment.
- multi: several raw sequences concatenated.
- IG: IntelliGenetics format.
- FastA: the sequence format used by FastA, BLAST, Gibbs sampler and a lot of other bioinformatic programs.
- Wconsensus: the format defined by Jerry Hertz for his programs (patser, consensus, wconsensus).
Search strands
The pattern can be searched on the direct strand, the reverse complement strand, or both strands.
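A minimal Python sketch of a two-strand search follows. It scans the reverse complement of the sequence and converts the hits back to direct-strand coordinates. The function names, the strand labels "D" and "R", and the choice to report reverse-strand hits in direct-strand coordinates are assumptions made for this illustration, not a description of the dna-pattern output.

```python
import re

# Translation table for complementing a DNA sequence (case preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def search_strands(pattern, sequence, strands="both"):
    """Return (strand, start, end) tuples, 1-based, in direct-strand coordinates."""
    hits = []
    n = len(sequence)
    if strands in ("direct", "both"):
        for m in re.finditer(pattern, sequence, re.IGNORECASE):
            hits.append(("D", m.start() + 1, m.end()))
    if strands in ("reverse", "both"):
        rc = reverse_complement(sequence)
        for m in re.finditer(pattern, rc, re.IGNORECASE):
            # Map positions on the reverse complement back to the direct strand.
            hits.append(("R", n - m.end() + 1, n - m.start()))
    return sorted(hits, key=lambda hit: hit[1])

print(search_strands("GATAAG", "AAGATAAGTTCTTATCAA"))
```

In this example the second hit is found because CTTATC is the reverse complement of GATAAG.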
Substitutions
Imperfect matches can be allowed, with a given number of substitutions.
Insertions and deletions are not supported.
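Matching with substitutions can be sketched in Python as a sliding window that counts mismatching positions. This is a simplified illustration rather than the dna-pattern algorithm: the function name is hypothetical, and the pattern is assumed to contain only A, C, G and T (no ambiguity codes or regular expression elements).

```python
def search_with_substitutions(pattern, sequence, max_subst):
    """Return (start, matched_word, mismatches) for every window of the
    sequence that differs from the pattern at no more than max_subst positions.
    Start positions are 1-based. No insertions or deletions are considered."""
    pattern = pattern.upper()
    sequence = sequence.upper()
    width = len(pattern)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches <= max_subst:
            hits.append((start + 1, window, mismatches))
    return hits

print(search_with_substitutions("CCCCT", "AACCCCTAACCGCTAA", 1))
```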
Return
The program either returns the matching positions, or the number of
matches within each input sequence.
Threshold
(Only valid with the option "return count".)
Only return sequences having more matches than the specified threshold.
Flanking
When the option "return matching position" is selected, the
program also returns the sequence of the matching word. This sequence can
be extended leftwards and rightwards with the flanking option.
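The effect of the flanking option can be sketched in Python as follows. The function name is hypothetical, and printing the matching word in upper case with its flanks in lower case is a display convention assumed for this illustration. Flanks are truncated at the sequence boundaries.

```python
import re

def matches_with_flanks(pattern, sequence, flank=0):
    """Return (start, end, word) for each match, with 1-based positions.
    The word is the match in upper case, extended on each side by up to
    `flank` neighbouring nucleotides shown in lower case."""
    results = []
    for m in re.finditer(pattern, sequence, re.IGNORECASE):
        left = sequence[max(0, m.start() - flank):m.start()].lower()
        right = sequence[m.end():m.end() + flank].lower()
        results.append((m.start() + 1, m.end(), left + m.group().upper() + right))
    return results

print(matches_with_flanks("GATAAG", "ccttGATAAGttaa", flank=2))
```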
Origin
Matching positions can be calculated either relative to the sequence start, or to the sequence end. The second option is most useful for upstream sequences, since it directly indicates the distance between the site and the ORF start. It is thus selected by default.
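The conversion between the two origins is simple arithmetic, sketched below in Python. The function name is hypothetical, and the sketch assumes that positions relative to the sequence end are negative, with the last nucleotide of the sequence at position -1.

```python
def to_end_origin(start, end, seq_length):
    """Convert 1-based positions counted from the sequence start into
    negative positions counted from the sequence end (last base = -1)."""
    return start - seq_length - 1, end - seq_length - 1

# A 6-nucleotide site at the very end of an 800-nucleotide upstream sequence.
print(to_end_origin(795, 800, 800))
```

Under this assumption, a site ending at -1 lies immediately upstream of the ORF start, so the coordinate directly gives the distance to the ORF.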
Prevent overlapping matches
This option prevents successive matches of a periodic pattern from overlapping with each other. When the option is activated, each occurrence of a given pattern prevents another occurrence of the same pattern from being reported before the end of the first matching segment. For example, the sequence TATATATATATA could be considered to contain either 4 (with overlap) or 2 (without overlap) occurrences of the pattern TATATA.
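The TATATA example can be reproduced with a short Python sketch. The function name and its allow_overlap parameter are hypothetical, and the sketch handles plain nucleotide patterns only.

```python
def count_occurrences(pattern, sequence, allow_overlap=True):
    """Count occurrences of a plain pattern, with or without overlaps."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    pattern = pattern.upper()
    sequence = sequence.upper()
    # With overlaps, resume one position after the previous match start;
    # without overlaps, resume after the end of the previous match.
    step = 1 if allow_overlap else len(pattern)
    count = 0
    pos = sequence.find(pattern)
    while pos != -1:
        count += 1
        pos = sequence.find(pattern, pos + step)
    return count

print(count_occurrences("TATATA", "TATATATATATA", allow_overlap=True))
print(count_occurrences("TATATA", "TATATATATATA", allow_overlap=False))
```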
Limits
Return the limits (start and end) of each sequence. This information
is used by feature-map, in order to represent the length of each
sequence.
Non-ACGT characters
Return the positions corresponding to stretches of non-ACGT characters (e.g. N, X, degenerate nucleotides from the IUPAC code).
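Locating such stretches amounts to finding runs of characters other than A, C, G and T, as in the following Python sketch. The function name and the returned tuple layout are assumptions made for this illustration.

```python
import re

def non_acgt_stretches(sequence):
    """Return (start, end, stretch) for each run of characters other than
    A, C, G or T (either case). Positions are 1-based and inclusive."""
    return [(m.start() + 1, m.end(), m.group())
            for m in re.finditer(r"[^ACGTacgt]+", sequence)]

print(non_acgt_stretches("ACGTNNNNACGTRYacgtX"))
```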
Availability
The program can be used through its web interface at:
http://www.rsat.eu/
dna-pattern is a Perl script running on Unix machines (it has been tested on SUN and SGI). The web interface is a Perl CGI script.