The theoretical background required for this tutorial can be found in the RSAT course.
In particular, we recommend reading the following slides before starting this tutorial.
Position-specific scoring matrices (PSSM) offer a flexible way to represent the specificity of transcription factor/DNA interactions. PSSM can be built on the basis of a set of known binding sites for the factor of interest.
Knowledge-based PSSM can be obtained from various transcription factor databases, for example:
The matrix below was obtained from SCPD, the Saccharomyces cerevisiae promoter database. It has been built from an alignment of 12 binding sites for the yeast transcription factor Pho4p.
A   3   2   0  12   0   0   0   0   1   3
C   5   2  12   0  12   0   1   0   2   1
G   3   7   0   0   0  12   0   7   5   4
T   1   1   0   0   0   0  11   5   4   4
Each row represents one residue (A, C, G or T), and each column a position in a set of aligned binding sites. Some positions are perfectly or almost perfectly conserved across all known binding sites (the motif CACGT starting at the 3rd position, where T is replaced by C in a single site), whereas other positions present two choices (e.g. G or T at position 8), and others can contain any letter, but with different frequencies (e.g. the first and last positions).
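The reading of the matrix described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `describe_columns` is hypothetical, and the counts are simply copied from the Pho4p matrix shown above.

```python
# Pho4p count matrix from SCPD, copied from the text (rows: A, C, G, T).
PHO4_COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}


def describe_columns(counts):
    """For each column, list the observed residues by decreasing count."""
    width = len(next(iter(counts.values())))
    summary = []
    for j in range(width):
        observed = [(res, row[j]) for res, row in counts.items() if row[j] > 0]
        observed.sort(key=lambda pair: -pair[1])
        summary.append(observed)
    return summary


for position, observed in enumerate(describe_columns(PHO4_COUNTS), start=1):
    label = " ".join(f"{res}={n}" for res, n in observed)
    print(f"position {position:2d}: {label}")
```

Columns with a single residue listed are perfectly conserved; columns listing all four residues accept any letter, with different frequencies.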
When the matrix is used to scan sequences for putative Pho4p binding sites, the more conserved positions impose stronger constraints than those where any nucleotide can be found. Matrix-based motif representations thus capture binding affinity better than string-based representations.
Actually, the frequencies are not used as such to score putative sites. The score assigned to a piece of sequence S is calculated as the log-ratio of two probabilities: the probability of S given the matrix, and the probability of S given the background model.
With the matrix above, we could calculate the probability of a sequence S of 10 nucleotides, as the product of the relative frequencies of these nucleotides in the PSSM.
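As a minimal sketch of this calculation, the function below multiplies the relative frequencies read from the Pho4p count matrix. The name `site_probability` and the example sequence are chosen here for illustration only, and no pseudo-weight is applied, so any residue with a count of 0 gives a probability of 0.

```python
from math import prod

# Pho4p count matrix from SCPD, copied from the text (rows: A, C, G, T).
PHO4_COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}


def site_probability(sequence, counts):
    """Probability of a sequence given the matrix: product of relative frequencies."""
    width = len(next(iter(counts.values())))
    if len(sequence) != width:
        raise ValueError(f"sequence must contain exactly {width} residues")
    # Sum of counts in each column (12 sites for the Pho4p matrix).
    totals = [sum(row[j] for row in counts.values()) for j in range(width)]
    return prod(counts[res][j] / totals[j] for j, res in enumerate(sequence.upper()))


# Hypothetical 10-nucleotide sequences, used only as examples.
print(site_probability("CGCACGTGGG", PHO4_COUNTS))
print(site_probability("AAAAAAAAAA", PHO4_COUNTS))  # A has count 0 at position 3
```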
A critical issue is to choose an appropriate background model. The simplest background models are based on a Bernoulli schema, which means that they rely on the assumption that nucleotides succeed one another independently. More elaborate models have been proposed, based on Markov chains (the description of these models can be found in the RSAT course, slides on sequence models).
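Under a Bernoulli background model, the probability of a sequence is simply the product of the prior probabilities of its residues. The sketch below illustrates this, together with the log-ratio score. The prior values are arbitrary illustrative numbers (not estimated from any genome), the function names are hypothetical, and the natural logarithm is an assumption.

```python
from math import inf, log, prod

# Illustrative residue priors (an assumption, not measured on real sequences).
PRIORS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def bernoulli_probability(sequence, priors):
    """Probability of a sequence when residues are drawn independently."""
    return prod(priors[res] for res in sequence.upper())


def log_ratio(p_motif, p_background):
    """Log-ratio of the motif and background probabilities (natural log)."""
    if p_motif == 0:
        return -inf  # the matrix forbids this sequence when no pseudo-weight is used
    return log(p_motif / p_background)


p_background = bernoulli_probability("CGCACGTGGG", PRIORS)
# Probability of the same sequence under the Pho4p matrix, taken as the product
# of relative frequencies read from the counts (no pseudo-weight).
p_motif = (5 * 7 * 11 * 7 * 5 * 4) / 12**6
print(p_background, log_ratio(p_motif, p_background))
```

A Markov-chain background would replace `bernoulli_probability` with a function in which each residue's probability depends on the preceding ones.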
One weakness of PSSM is that they do not take into account higher order dependencies between residues, i.e. correlations between the residue found at a given position and those found at other positions. Even if, with Markov-chain based background models, such dependencies are taken into account for the background model, correlations between different positions of the binding sites are still not taken into account.
For example, a PSSM cannot specify a pattern like "either CACGTGGG or CACGTTTT": if one builds a matrix where G and T are allowed at the last 3 positions, any combination of them will be allowed (e.g. CACGTGTG, CACGTTGT). Higher order dependencies can be represented with more elaborate methods, such as Hidden Markov Models (HMM), which are out of scope for this tutorial.
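The consequence of treating positions independently can be made concrete by enumerating every site that such a matrix would accept. In this small sketch (the function name `allowed_sites` is hypothetical), each of the last three positions accepts G or T:

```python
from itertools import product


def allowed_sites(core, choices_per_position):
    """List every sequence obtained by combining the per-position choices."""
    return [core + "".join(combo) for combo in product(*choices_per_position)]


sites = allowed_sites("CACGT", ["GT", "GT", "GT"])
print(len(sites), sites)
```

The list contains the two intended sites (CACGTGGG and CACGTTTT) but also the six mixed combinations, since the matrix has no way to link the choice made at one position to the choice made at another.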
RSAT includes a program called convert-matrix, which extracts a PSSM from the output files of different programs (consensus, gibbs, MEME, clustal) or from databases (JASPAR, TRANSFAC).
We will import a matrix representing the binding specificity of the yeast transcription factor Abf1p, and analyze the different parameters which characterize this PSSM.
>ABF17121SCPD
A   0   0  11   2   3   7   3   5   5  14   0   1
T  14   0   2   1   6   2   7   3   0   0   0   0
G   0   0   0   0   2   3   1   2   3   0   1  13
C   0  14   1  11   3   2   3   4   6   0  13   0
Some basic questions
We will now analyze the content of this matrix with the program convert-matrix.
The original matrix was converted into different formats. We will briefly comment on these formats.
Counts are the primary information obtained from SCPD. They represent the number of occurrences (absolute frequency) of each residue at each position of the alignment of the annotated binding sites for the transcription factor Abf1p.
Relative frequencies are obtained by dividing the counts of each cell of the matrix by the sum of counts in its column.
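The division described above can be reproduced as follows. This is a plain sketch without any pseudo-weight, so it will not match the convert-matrix output exactly. The function name `relative_frequencies` is hypothetical, and the counts are those of the Abf1p matrix given earlier.

```python
# Abf1p count matrix from SCPD, rows in the order given in the text (A, T, G, C).
ABF1_COUNTS = {
    "A": [0, 0, 11, 2, 3, 7, 3, 5, 5, 14, 0, 1],
    "T": [14, 0, 2, 1, 6, 2, 7, 3, 0, 0, 0, 0],
    "G": [0, 0, 0, 0, 2, 3, 1, 2, 3, 0, 1, 13],
    "C": [0, 14, 1, 11, 3, 2, 3, 4, 6, 0, 13, 0],
}


def relative_frequencies(counts):
    """Divide each count by the sum of counts in its column."""
    width = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(width)]
    return {
        res: [row[j] / totals[j] for j in range(width)]
        for res, row in counts.items()
    }


FREQUENCIES = relative_frequencies(ABF1_COUNTS)
for res, row in FREQUENCIES.items():
    print(res, " ".join(f"{f:.2f}" for f in row))
```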
You will notice that the frequency matrix does not faithfully reflect the relative frequencies calculated from the counts. In particular, the cells of the original matrix with count values of 0 have values larger than 0 in the frequency matrix.
You can check this by coming back to the convert-matrix form (click the Back button of your browser), and redoing the conversion with a value of 0 for the option pseudo-weight (note: this is only for illustrative purposes, it is generally recommended to use a pseudo-weight of at least 1).
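The effect of the pseudo-weight can be sketched in a few lines. The correction used below, (count + prior × pseudo-weight) / (column total + pseudo-weight), is a commonly used formula and is an assumption here, as are the equiprobable priors of 0.25; the exact values reported by convert-matrix depend on the options selected in the form.

```python
def corrected_frequency(count, column_total, prior, pseudo_weight):
    """Relative frequency corrected by a pseudo-weight shared among residues."""
    return (count + prior * pseudo_weight) / (column_total + pseudo_weight)


# First column of the Abf1p matrix: T is found in all 14 sites.
FIRST_COLUMN = {"A": 0, "T": 14, "G": 0, "C": 0}
PRIOR = 0.25  # assumed equiprobable residues

for pseudo_weight in (0, 1):
    total = sum(FIRST_COLUMN.values())
    row = {
        res: corrected_frequency(n, total, PRIOR, pseudo_weight)
        for res, n in FIRST_COLUMN.items()
    }
    print(pseudo_weight, {res: round(f, 4) for res, f in row.items()})
```

With a pseudo-weight of 0, the cells with a count of 0 keep a frequency of 0; with a pseudo-weight of 1, they receive a small positive frequency, and the column still sums to 1.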
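The weight is the score described above, i.e. the log-likelihood ratio between the frequency of a residue at a given position of the matrix and its prior probability in the background model.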
Positive weights indicate that the residue is considered to favour the binding of the transcription factor; negative weights indicate that it is unfavourable.
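The sign of the weights can be illustrated on the first column of the Abf1p matrix. This sketch assumes equiprobable priors (0.25), a pseudo-weight of 1, the pseudo-weight correction (count + prior × pseudo-weight) / (column total + pseudo-weight), and the natural logarithm; all of these are assumptions, and the function name `weight` is hypothetical.

```python
from math import log


def weight(count, column_total, prior, pseudo_weight=1.0):
    """Log-ratio between the corrected frequency of a residue and its prior."""
    corrected = (count + prior * pseudo_weight) / (column_total + pseudo_weight)
    return log(corrected / prior)


# First column of the Abf1p matrix: T=14, all other residues 0.
print("T:", round(weight(14, 14, 0.25), 3))  # residue found in every site
print("A:", round(weight(0, 14, 0.25), 3))   # residue never observed
```

The residue observed in all sites receives a positive weight, and the residues never observed receive a negative (but finite) weight thanks to the pseudo-weight.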
The information content of each cell of the matrix is calculated by multiplying the weight by the frequency. The information content of a row (column) is the sum of information contents of its cells.
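As a sketch of this calculation, the code below computes the information content of each column of the Abf1p matrix. It relies on the same assumptions as a typical setup (equiprobable priors of 0.25, pseudo-weight of 1, natural logarithm), which may differ from the options you selected in convert-matrix; the function name `column_information` is hypothetical.

```python
from math import log

# Abf1p count matrix from SCPD, rows in the order given in the text (A, T, G, C).
ABF1_COUNTS = {
    "A": [0, 0, 11, 2, 3, 7, 3, 5, 5, 14, 0, 1],
    "T": [14, 0, 2, 1, 6, 2, 7, 3, 0, 0, 0, 0],
    "G": [0, 0, 0, 0, 2, 3, 1, 2, 3, 0, 1, 13],
    "C": [0, 14, 1, 11, 3, 2, 3, 4, 6, 0, 13, 0],
}


def column_information(counts, prior=0.25, pseudo_weight=1.0):
    """Sum, for each column, of frequency * weight over the four residues."""
    width = len(next(iter(counts.values())))
    information = []
    for j in range(width):
        total = sum(row[j] for row in counts.values())
        column_info = 0.0
        for row in counts.values():
            # A positive pseudo-weight keeps the frequency above 0, so log() is defined.
            freq = (row[j] + prior * pseudo_weight) / (total + pseudo_weight)
            column_info += freq * log(freq / prior)
        information.append(column_info)
    return information


COLUMN_INFO = column_information(ABF1_COUNTS)
for position, info in enumerate(COLUMN_INFO, start=1):
    print(f"position {position:2d}: {info:.3f}")
print(f"total: {sum(COLUMN_INFO):.3f}")
```

Strongly conserved columns (such as the first one) carry much more information than columns where the four residues are found in similar proportions.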
You can now come back to the tutorial main page and follow the next tutorials.