The simplest way to represent a transcription factor binding site is as a string over the 4-letter alphabet of DNA sequences: A, C, G and T.
Unfortunately, transcription factor binding motifs (TFBM) are generally not restricted to one perfectly specified 4-letter string. The simple DNA alphabet is thus insufficient to represent partly specified or non-specific residues in the DNA/factor interface.
More elaborate representations have been developed for partially specified motifs: the IUPAC alphabet, regular expressions and position-specific scoring matrices. These representations are supported by the RSAT pattern matching programs (dna-pattern, matrix-scan).
IUPAC code | Nucleotides | Mnemonic |
---|---|---|
A | Adenine | |
C | Cytosine | |
G | Guanine | |
T | Thymine | |
R | A or G | puRines |
Y | C or T | pYrimidines |
W | A or T | Weak hydrogen bonding |
S | G or C | Strong hydrogen bonding |
M | A or C | aMino group at common position |
K | G or T | Keto group at common position |
H | A, C or T | not G |
B | G, C or T | not A |
V | G, A or C | not T |
D | G, A or T | not C |
N | G, A, C or T | aNy |
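
The table above can be read as a lookup from each IUPAC code to the set of nucleotides it accepts. The following is a minimal sketch of that idea in Python, not RSAT code; the names `IUPAC` and `iupac_matches` are hypothetical, and the function only compares a pattern with a site of the same length.

```python
# Each IUPAC code mapped to the nucleotides it stands for (taken from the table above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def iupac_matches(pattern, site):
    """Return True if the DNA string `site` is compatible with the IUPAC `pattern`.

    Hypothetical helper: both strings must have the same length, and the
    comparison ignores case.
    """
    if len(pattern) != len(site):
        return False
    return all(
        base in IUPAC[code]
        for code, base in zip(pattern.upper(), site.upper())
    )


# W accepts A or T, so GATWA covers both GATAA and GATTA, but not GATCA.
print(iupac_matches("GATWA", "GATAA"))
print(iupac_matches("GATWA", "GATTA"))
print(iupac_matches("GATWA", "GATCA"))
```

A partially specified motif such as GATWA is thus shorthand for a small set of fully specified strings.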
Regular expressions are a convenient way to express complex patterns as strings. This formalism supports many syntactic features that are beyond the scope of this tutorial; a complete description can be found in many sources, for example in Perl textbooks. We will just provide a few examples of useful expressions.
RSAT supports patterns described as combinations of the IUPAC alphabet and regular expressions.
Example
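A minimal sketch of how a pattern mixing IUPAC codes and regular expression syntax could be matched, assuming Python's standard `re` module. This is an illustration only and not how dna-pattern is implemented; the helper names `to_regex` and `find_sites` are hypothetical. The sketch assumes that the pattern contains only nucleotide codes and quantifiers such as `{3}` (no backslash escapes, no IUPAC codes inside brackets), scans only the given strand, and reports non-overlapping matches.

```python
import re

# Ambiguous IUPAC codes rewritten as regular expression character classes.
AMBIGUOUS = {
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]",
    "H": "[ACT]", "B": "[CGT]", "V": "[ACG]", "D": "[AGT]",
    "N": "[ACGT]",
}


def to_regex(pattern):
    """Replace each ambiguous IUPAC letter by a character class.

    All other characters (A, C, G, T, digits, braces, ...) are kept as they are.
    """
    return "".join(AMBIGUOUS.get(ch.upper(), ch) for ch in pattern)


def find_sites(pattern, sequence):
    """Return (start, matched string) for each match, ignoring case."""
    regex = re.compile(to_regex(pattern), re.IGNORECASE)
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


print(to_regex("CGGN{3}CCG"))                    # CGG[ACGT]{3}CCG
print(find_sites("GATWA", "ttGATAAccgattag"))    # [(2, 'GATAA'), (9, 'gatta')]
print(find_sites("CGGN{3}CCG", "aCGGatgCCGt"))   # [(1, 'CGGatgCCG')]
```

Start positions are 0-based here, as is usual in Python. The `re.IGNORECASE` flag makes upper and lower case equivalent during matching.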
Whichever string-based representation is used, upper and lower case are considered equivalent by RSAT pattern matching and motif discovery algorithms.
However, some programs support a filtering option that masks either lowercase or uppercase letters before starting the analysis. This option is useful when a specific meaning is attached to the case of the letters. For example, the "Get DNA" tool of the UCSC Genome Browser can use lowercase or uppercase to mark specific sequence types (e.g. repetitive sequences, genes, non-coding regions, ...).
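The effect of such a case-based mask can be sketched as follows. This is a hedged illustration in Python, not the RSAT option itself: the function name `mask_case` and its parameters are hypothetical, and replacing masked letters with `N` is an assumption made for this example.

```python
def mask_case(sequence, mask="lower", fill="N"):
    """Replace every lowercase (or uppercase) letter of `sequence` by `fill`.

    Hypothetical helper: `mask` must be "lower" or "upper"; the default
    replacement character "N" is an assumption of this sketch.
    """
    if mask not in ("lower", "upper"):
        raise ValueError("mask must be 'lower' or 'upper'")
    is_masked = str.islower if mask == "lower" else str.isupper
    return "".join(fill if is_masked(ch) else ch for ch in sequence)


# Suppose lowercase marks repetitive sequence: masking it leaves only the
# uppercase segment available for pattern matching.
print(mask_case("acgtGATAAacgt"))           # NNNNGATAANNNN
print(mask_case("acgtGATAAacgt", "upper"))  # acgtNNNNNacgt
```

Because the masked positions keep their place in the string, match coordinates found afterwards still refer to the original sequence.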
You can now return to the tutorial main page and follow the next tutorials, or switch directly to the following lessons.