RSA-tools - Tutorials - string-based representations

Strings with DNA alphabet
Regular expressions
Combination of IUPAC code and Regular expressions
Case insensitivity

Strings with DNA alphabet

The simplest way to represent a transcription factor binding site is with a string composed of the 4-letter alphabet of DNA sequences: A, C, G and T.

Unfortunately, transcription factor binding motifs (TFBM) are generally not restricted to one perfectly specified 4-letter string. The simple DNA alphabet is thus insufficient to represent partly specified or non-specific residues in the DNA/factor interface.

More elaborate representations have been developed for partially specified motifs (IUPAC code, regular expressions, position-specific scoring matrices). These representations are supported by the RSAT pattern matching programs (dna-pattern, matrix-scan).

IUPAC code

IUPAC-IUB code for ambiguous nucleotides

IUPAC	nucleotides	Mnemonics
A		Adenine
C		Cytosine
G		Guanine
T		Thymine
R	A or G	puRines
Y	C or T	pYrimidines
W	A or T	Weak hydrogen bonding
S	G or C	Strong hydrogen bonding
M	A or C	aMino group at common position
K	G or T	Keto group at common position
H	A, C or T	not G
B	G, C or T	not A
V	G, A or C	not T
D	G, A or T	not C
N	G, A, C or T	aNy
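
The table above can be turned into a small lookup that rewrites an IUPAC string as a regular expression. The following Python sketch is only an illustration: the dictionary `IUPAC` and the helper `iupac_to_regex` are hypothetical names introduced here, not part of RSAT.

```python
# IUPAC-IUB ambiguity codes, transcribed from the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Rewrite an IUPAC string as a regular expression (hypothetical helper)."""
    parts = []
    for letter in pattern.upper():
        bases = IUPAC[letter]  # raises KeyError for letters outside the table
        # Fully specified letters stay as-is, ambiguous ones become a class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


print(iupac_to_regex("GATWAG"))  # GAT[AT]AG
```

Each ambiguous letter simply becomes a bracketed set of the nucleotides it stands for, which is why GATWAG and GAT[AT]AG describe the same sites.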

Regular expressions

Regular expressions are a convenient way to express complex patterns with strings. This formalism supports many syntactic features, which are beyond the scope of this tutorial; a complete description can be found in many sources, e.g. in Perl textbooks. We will just provide a few examples of useful expressions.

Brackets can be used to specify a set of alternative letters.
Example
- GAT[TA]AG means "GATTAG or GATAAG" (this is equivalent to GATWAG in IUPAC code)

Fixed multipliers can be specified by a number between curly brackets.
Examples:
- A{8} means "AAAAAAAA"
- [AG]{8} means "a run of 8 letters, each of which is either A or G"
- CGG[ACGT]{11}CCG means "CGG followed by exactly 11 times A,C,G or T, followed by CCG"

Variable multipliers can be specified by two numbers between curly brackets.
Example
- GATAAG[ACGT]{0,30}GATAAG means "two occurrences of GATAAG separated by a spacer of 0 to 30 letters"

Alternative words can be specified by separating the words by a pipe character ("|").
Example
- (CACGTTTT|CACGTGGG) means "either CACGTTTT or CACGTGGG"
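
The expressions listed in this section can be checked with any regular expression engine. The sketch below uses Python's standard `re` module rather than RSAT itself, and writes the alternation with plain parentheses. The test sequences are made up for the illustration, and `matches` is a hypothetical helper.

```python
import re

# Each entry: (pattern, test sequence, whether the whole sequence should match).
examples = [
    ("GAT[TA]AG", "GATAAG", True),                      # bracket: T or A
    ("GAT[TA]AG", "GATCAG", False),                     # C is not in the set
    ("A{8}", "AAAAAAAA", True),                         # fixed multiplier
    ("[AG]{8}", "AGGAGAAG", True),                      # 8 letters, each A or G
    ("CGG[ACGT]{11}CCG", "CGG" + "ACGTACGTACG" + "CCG", True),
    ("GATAAG[ACGT]{0,30}GATAAG", "GATAAGGATAAG", True),  # spacer of length 0
    ("GATAAG[ACGT]{0,30}GATAAG", "GATAAG" + "C" * 31 + "GATAAG", False),
    ("(CACGTTTT|CACGTGGG)", "CACGTGGG", True),          # alternative words
]


def matches(pattern, sequence):
    """Return True if the pattern describes the entire sequence."""
    return re.fullmatch(pattern, sequence) is not None


for pattern, sequence, expected in examples:
    assert matches(pattern, sequence) == expected
    print(pattern, sequence, expected)
```

`re.fullmatch` is used so that each test sequence must correspond exactly to the pattern; to locate sites inside a longer sequence, `re.search` or `re.finditer` would be used instead.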

Combinations of IUPAC and regular expressions

RSAT supports patterns that combine the IUPAC alphabet with regular expression syntax.

Example

GATAAGN{0,30}GATAAG means "two occurrences of GATAAG separated by a spacer of 0 to 30 letters"
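
One way to handle such a mixed pattern is to expand the ambiguous IUPAC letters into bracketed sets and leave the rest of the expression untouched. The Python sketch below illustrates the idea; it is not how RSAT implements it. The helper `expand_iupac` and the test sequence are hypothetical, and the sketch assumes that no ambiguity code appears inside an existing bracketed set.

```python
import re

# Ambiguous IUPAC letters only; A, C, G and T need no expansion.
AMBIGUOUS = {
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def expand_iupac(pattern):
    """Replace each ambiguous letter by a bracketed set (hypothetical helper).

    Assumption: ambiguity codes never occur inside an existing [...] set.
    Multipliers such as {0,30} contain no letters and pass through unchanged.
    """
    return re.sub(
        "[RYWSMKHBVDN]",
        lambda m: "[" + AMBIGUOUS[m.group()] + "]",
        pattern.upper(),
    )


regex = expand_iupac("GATAAGN{0,30}GATAAG")
print(regex)  # GATAAG[ACGT]{0,30}GATAAG

# Made-up sequence: two GATAAG words separated by a 3-letter spacer.
sequence = "TTGATAAGCCCGATAAGTT"
hit = re.search(regex, sequence)
print(hit.start(), hit.end())  # 2 17
```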

Case (in)sensitivity

Whichever string-based representation is used, upper and lower case are considered equivalent by RSAT pattern matching and motif discovery algorithms.

However, some programs support a filtering option that masks either lowercase or uppercase letters before the analysis starts. This option is useful when a specific meaning is attached to letter case. For example, the "Get DNA" tool of the UCSC Genome Browser can use lowercase or uppercase letters to mark specific sequence types (e.g. repetitive sequences, genes, non-coding regions, ...).
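
The effect of case-insensitive matching and of masking can be illustrated with a short Python sketch. It rests on one assumption: masking is modelled here as replacing every lowercase letter with N, which may differ from what a given RSAT program does. The helper `mask_lowercase` and the test sequence are hypothetical.

```python
import re


def mask_lowercase(sequence, mask_char="N"):
    """Replace lowercase letters by mask_char (assumed masking behaviour)."""
    return re.sub("[a-z]", mask_char, sequence)


# Made-up sequence with one lowercase and one uppercase copy of GATAAG.
sequence = "ttgataagCCGATAAGtt"

# Without masking, upper and lower case are treated as equivalent.
hits = [m.start() for m in re.finditer("GATAAG", sequence, re.IGNORECASE)]
print(hits)  # [2, 10]

# After masking lowercase letters, only the uppercase copy can match.
masked = mask_lowercase(sequence)
print(masked)  # NNNNNNNNCCGATAAGNN
hits_masked = [m.start() for m in re.finditer("GATAAG", masked, re.IGNORECASE)]
print(hits_masked)  # [10]
```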

Next steps

You can now return to the tutorial main page and follow the next tutorials, or go directly to one of the following lessons.

Position-specific scoring matrices (PSSM)
dna-pattern: string-based pattern matching
