RSAT - pattern search manual

Name

1997-98 by

Description

Searches all occurrences of a pattern within DNA sequences. The pattern can be entered as a simple nucleotide sequence, or may include degenerate nucleotide codes and regular expressions.

Options

Query pattern(s)

Example:

        TTGTT   TGbox
        GATWA   GATAbox
        CCCCT   STRE

IUPAC-IUB commission

    Symbol  Nucleotide(s)   Description                  
    A       A               Adenosine
    C       C               Cytidine
    G       G               Guanosine
    T       T               Thymidine
    R       = A or G        puRines
    Y       = C or T        pYrimidines
    W       = A or T        Weak hydrogen bonding
    S       = G or C        Strong hydrogen bonding
    M       = A or C        aMino group at common position
    K       = G or T        Keto group at common position
    H       = A, C or T     not G
    B       = G, C or T     not A
    V       = G, A or C     not T
    D       = G, A or T     not C
    N       = G, A, C or T  aNy

GAT[TA]AG means "GATAAG or GATTAG" (equivalent to GATWAG).
CGGN{11}CCG means CGG followed by 11 N followed by CCG.
GATAAGN{0,30}GATAAG means two GATAAG spaced by 0 to 30 nucleotides.
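
As an illustration of how such patterns can be interpreted, here is a minimal Python sketch that expands IUPAC letters into regular-expression character classes. The function name iupac_to_regex is hypothetical and this is not the program's actual implementation. It assumes that brackets and braces already follow regular-expression syntax and leaves their contents untouched.

```python
import re

# IUPAC degenerate codes mapped to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "H": "[ACT]", "B": "[CGT]",
    "V": "[ACG]", "D": "[AGT]", "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Expand IUPAC letters; keep [...] and {...} and their contents as-is."""
    out = []
    depth = 0  # greater than 0 while inside brackets or braces
    for ch in pattern.upper():
        if ch in "[{":
            depth += 1
            out.append(ch)
        elif ch in "]}":
            depth -= 1
            out.append(ch)
        elif depth == 0 and ch in IUPAC:
            out.append(IUPAC[ch])
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical usage: the spacer must be exactly 11 nucleotides long
regex = iupac_to_regex("CGGN{11}CCG")
site = "CGG" + "A" * 11 + "CCG"
print(regex, bool(re.fullmatch(regex, site)))
```

A limitation of this sketch is that degenerate letters placed inside brackets (for example [WA]) are not expanded.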

Input sequence

The sequence(s) in which the pattern is searched.

Input sequence format:
The following sequence formats are supported.

raw: the raw sequence without any identifier or comment.
multi: several raw sequences concatenated.
IG: IntelliGenetics format.
FastA: the sequence format used by FastA, BLAST, Gibbs sampler and many other bioinformatics programs.
Wconsensus: the format defined by Jerry Hertz for his programs (patser, consensus, wconsensus).

Search strands

The pattern can be searched on the direct, the reverse complement, or both strands.
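
The idea of a two-strand search can be sketched in Python as follows: searching the reverse complement strand for a word is equivalent to searching the direct strand for the reverse complement of that word. The function names and the strand labels "D" and "R" are assumptions made for this example, and only plain words (no regular expressions) are handled.

```python
# Complement table covering the IUPAC codes (e.g. R = A or G pairs with Y = T or C)
COMPLEMENT = str.maketrans("ACGTRYWSMKHBVDN", "TGCAYRWSKMDVBHN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence or IUPAC word."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_both_strands(seq, word):
    """Return (strand, start, end) tuples with 1-based inclusive positions."""
    seq = seq.upper()
    hits = []
    for strand, w in (("D", word.upper()), ("R", reverse_complement(word))):
        start = seq.find(w)
        while start != -1:
            hits.append((strand, start + 1, start + len(w)))
            start = seq.find(w, start + 1)
    return hits

print(find_both_strands("AAGATAAGCCCTTATCAA", "GATAAG"))
```

Note that in this sketch a word identical to its own reverse complement is reported once per strand.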

Substitutions

Imperfect matches can be allowed, with a given number of substitutions. Insertions and deletions are not supported.
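
Matching with substitutions only amounts to comparing the pattern with every window of the same length and counting mismatching positions. The Python sketch below illustrates this for a plain word (no degenerate codes); the function name and the returned tuple layout are assumptions for this example, not the program's output format.

```python
def find_with_substitutions(seq, word, max_subst):
    """Return (start, matching_word, mismatches) for each window of
    len(word) that differs from word at no more than max_subst positions.
    Start positions are 1-based."""
    seq, word = seq.upper(), word.upper()
    hits = []
    for i in range(len(seq) - len(word) + 1):
        window = seq[i:i + len(word)]
        mismatches = sum(1 for a, b in zip(window, word) if a != b)
        if mismatches <= max_subst:
            hits.append((i + 1, window, mismatches))
    return hits

# Hypothetical usage: GATTAG differs from GATAAG at one position
print(find_with_substitutions("CCGATTAGCC", "GATAAG", 1))
```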

Return

The program either returns the matching positions, or the number of matches within each input sequence.

Threshold

Flanking

When the option "return matching position" is selected, the program also returns the sequence of the matching word. This sequence can be extended leftwards and rightwards with the flanking option.
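
A minimal Python sketch of extending a match with its flanks is given below. The function name is hypothetical, positions are assumed to be 1-based and inclusive, and printing the flanks in lower case is a convention chosen only for this example.

```python
def match_with_flanks(seq, start, end, flank):
    """Return the matching word (upper case) extended by up to `flank`
    residues on each side (lower case), truncated at the sequence ends."""
    left = max(start - 1 - flank, 0)
    right = min(end + flank, len(seq))
    return (seq[left:start - 1].lower()
            + seq[start - 1:end].upper()
            + seq[end:right].lower())

# Hypothetical usage: match at positions 3 to 8, two flanking residues
print(match_with_flanks("AAGATAAGCCC", 3, 8, 2))
```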

Origin

Matching positions can be calculated either relative to the sequence start, or to the sequence end. The second option is most useful for upstream sequences, since it directly indicates the distance between the site and the ORF start. It is thus selected by default.
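
The conversion between the two origins can be sketched as below. This example assumes that, when the origin is the sequence end, the last nucleotide receives coordinate -1. That convention and the function name are assumptions made for illustration.

```python
def relative_to_end(start, end, seq_length):
    """Convert 1-based positions counted from the sequence start into
    negative coordinates counted from the sequence end (last residue = -1)."""
    offset = seq_length + 1
    return start - offset, end - offset

# Hypothetical usage: a site at positions 3 to 8 of an 800 bp upstream sequence
print(relative_to_end(3, 8, 800))
```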

Prevent overlapping matches

This option prevents successive matches of a periodic pattern from overlapping each other. When the option is activated, each occurrence of a given pattern prevents another occurrence of the same pattern from starting before the end of the first matching segment. For example, the sequence TATATATATATA could be considered to contain either 4 (with overlap) or 2 (without overlap) occurrences of the pattern TATATA.
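
The TATATA example can be reproduced with the short Python sketch below. The function name and the allow_overlap parameter are hypothetical. The only difference between the two modes is where the search resumes after a match.

```python
def count_matches(seq, word, allow_overlap=True):
    """Count occurrences of word in seq, scanning from left to right."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        # With overlap, resume one residue further;
        # without overlap, resume after the end of the matched segment.
        step = 1 if allow_overlap else len(word)
        start = seq.find(word, start + step)
    return count

print(count_matches("TATATATATATA", "TATATA", allow_overlap=True))
print(count_matches("TATATATATATA", "TATATA", allow_overlap=False))
```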

Limits

Return the limits (start and end) of each sequence. This information is used by feature-map, in order to represent the length of each sequence.

Non-ACGT characters

Return the positions corresponding to stretches of non-ACGT characters (e.g. N, X, degenerate nucleotides from the IUPAC code).
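
As a sketch of what this option reports, the following Python function locates each maximal stretch of characters other than A, C, G and T. The function name and the returned tuple layout (1-based start, end, stretch) are assumptions for this example.

```python
import re

def non_acgt_stretches(seq):
    """Return (start, end, stretch) for each maximal run of non-ACGT
    characters, with 1-based inclusive positions."""
    return [(m.start() + 1, m.end(), m.group())
            for m in re.finditer(r"[^ACGT]+", seq.upper())]

print(non_acgt_stretches("ACGNNNTTXRA"))
```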

Availability

http://www.rsat.eu/