RSA-tools - Tutorials - Position-specific scoring matrices

Prerequisite
Introduction
Converting and analyzing PSSM
Additional exercises
Next steps

Prerequisite

The theoretical background required for this tutorial can be found in the RSAT course.

In particular, we recommend reading the following slides before starting this tutorial.

Introduction

Position-specific scoring matrices (PSSM) offer a flexible way to represent the specificity of transcription factor/DNA interactions. PSSM can be built on the basis of a set of known binding sites for the factor of interest.

Knowledge-based PSSM can be obtained from various transcription factor databases, for example:

JASPAR (Eukaryotes)
RegulonDB (Escherichia coli K12)
SCPD (The Promoter Database of Saccharomyces cerevisiae)
Yeastract (Yeast Search for Transcriptional Regulators And Consensus Tracking)
TRANSFAC (Note: the full database access requires a commercial license)

Example

The matrix below was obtained from SCPD, the Saccharomyces cerevisiae promoter database. It has been built from an alignment of 12 binding sites for the yeast transcription factor Pho4p.

PHO4 matrix (source: SCPD)

A  3   2   0  12   0   0   0   0   1   3
C  5   2  12   0  12   0   1   0   2   1
G  3   7   0   0   0  12   0   7   5   4
T  1   1   0   0   0   0  11   5   4   4

Each row represents one residue (A, C, G or T), and each column a position in a set of aligned binding sites. Some positions are perfectly conserved across all known binding sites (the motif CACG starting at the 3rd position, followed by a T found in 11 of the 12 sites), whereas other positions present two choices (e.g. G or T at position 8), and others can contain any letter, but with different frequencies (e.g. the first and last positions).

Matrix scores

When the matrix is used to scan sequences for putative Pho4p binding sites, the more conserved positions impose stronger constraints than those where any nucleotide can be found. Matrix-based motif representations thus capture the binding affinity better than string-based representations.

In practice, the frequencies are not used as such to score putative sites. The score assigned to a piece of sequence S is calculated as the log-ratio of two probabilities:

P(S|M), the probability to observe sequence S given the motif model M (the matrix).
P(S|B), the probability to observe sequence S given the background model B (the genomic context).
The score of a sequence segment is W_S=log[P(S|M)/P(S|B)]

With the matrix above, we could calculate the probability of a sequence S of 10 nucleotides, as the product of the relative frequencies of these nucleotides in the PSSM.
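As an illustration, the product of relative frequencies can be sketched in a few lines of Python. The counts are copied from the PHO4 matrix above; the function name site_probability and the two example sites are ours, chosen only for this sketch.

```python
# Counts copied from the PHO4 matrix above (source: SCPD), one list per residue.
PHO4_COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}


def site_probability(site, counts):
    """Return P(S|M) as the product of the relative frequencies of the residues of `site`.

    Hypothetical helper written for this tutorial sketch; no pseudo-weight is applied.
    """
    width = len(counts["A"])
    if len(site) != width:
        raise ValueError(f"expected a site of {width} nucleotides, got {len(site)}")
    probability = 1.0
    for position, residue in enumerate(site.upper()):
        # Relative frequency = count of the residue / sum of counts in the column.
        column_total = sum(row[position] for row in counts.values())
        probability *= counts[residue][position] / column_total
    return probability


if __name__ == "__main__":
    # Two example sites chosen for illustration; the second has an A at position 3.
    for site in ("CGCACGTGGG", "CGAACGTGGG"):
        print(site, site_probability(site, PHO4_COUNTS))
```

Note that the second site receives a probability of exactly 0, because A was never observed at position 3 among the 12 annotated sites. A single zero count thus vetoes the whole site.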

A critical issue is to choose an appropriate background model. The simplest background models are based on a Bernoulli schema, which means that they rely on the assumption that successive nucleotides are independent of each other. More elaborate models have been proposed, based on Markov chains (the description of these models can be found in the RSAT course, slides on sequence models).
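The sketch below shows how P(S|B) could be computed under a Bernoulli background, and how the log-ratio W_S follows from the two probabilities. The residue frequencies are hypothetical values chosen for illustration (they are not estimated from any genome), the function names are ours, and the natural logarithm is an assumption since the text above does not specify the base.

```python
import math

# Hypothetical residue frequencies, for illustration only (not a genomic estimate).
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def bernoulli_probability(sequence, priors):
    """P(S|B) under a Bernoulli model: each residue is drawn independently."""
    return math.prod(priors[residue] for residue in sequence.upper())


def log_ratio_score(p_motif, p_background):
    """W_S = log[P(S|M) / P(S|B)], here with the natural logarithm (assumption)."""
    if p_motif == 0:
        # A site that is impossible under the motif model gets an infinitely negative score.
        return float("-inf")
    return math.log(p_motif / p_background)


if __name__ == "__main__":
    p_background = bernoulli_probability("CACGTG", BACKGROUND)
    # 1.0 stands for a segment whose residues are all perfectly conserved in the matrix.
    print(p_background, log_ratio_score(1.0, p_background))
```

A Markov-chain background would replace the product of single-residue frequencies by a product of transition probabilities, which requires a transition table that is not given in this tutorial.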

Higher order dependencies

One weakness of PSSM is that they do not take into account higher order dependencies between residues, i.e. correlations between the residue found at a given position and those found at other positions. Markov-chain based background models do capture such dependencies in the background sequences, but correlations between different positions of the binding sites themselves are still ignored.

For example, a PSSM cannot specify a pattern like "either CACGTGGG or CACGTTTT": if one builds a matrix where G and T are allowed at the 3 last positions, any combination of them will be allowed (e.g. CACGTGTG, CACGTTGT). Higher order dependencies can be represented with more elaborate methods, such as Hidden Markov Models (HMM), which are out of scope for this tutorial.
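The following sketch enumerates what an independent-column model accepts when G and T are both allowed at each of the 3 last positions. The variable names are ours, and the model is reduced to a simple "allowed residues per column" list rather than a full matrix.

```python
from itertools import product

CORE = "CACGT"
# Residues allowed at each of the 3 last positions; columns are independent.
ALLOWED = [("G", "T")] * 3

# Every combination of allowed residues is accepted by a column-wise model.
accepted = [CORE + "".join(tail) for tail in product(*ALLOWED)]

# The two patterns we actually wanted to describe.
intended = {"CACGTGGG", "CACGTTTT"}
unintended = [site for site in accepted if site not in intended]

if __name__ == "__main__":
    print("accepted:", accepted)
    print("not intended:", unintended)
```

Out of the 8 accepted sequences, 6 are mixtures of G and T that the pattern "either CACGTGGG or CACGTTTT" would exclude.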

Converting and analyzing PSSM

RSAT includes a program called convert-matrix, which extracts a PSSM from the output files of different programs (consensus, gibbs, MEME, clustal) or databases (JASPAR, TRANSFAC).

Converting the counts into frequencies, weights and information

We will import a matrix representing the binding specificity of the yeast transcription factor Abf1p, and analyze the different parameters which characterize this PSSM.

Connect to SCPD, the Saccharomyces cerevisiae promoter database. To access the list of sites by factor from the SCPD home page, click on the section "Regulatory elements and transcriptional factors".

Click on the link to ABF1, and then on the button Get matrix. The following PSSM is displayed.

>ABF17121SCPD

A   0   0  11   2   3   7   3   5   5  14   0   1
T  14   0   2   1   6   2   7   3   0   0   0   0
G   0   0   0   0   2   3   1   2   3   0   1  13
C   0  14   1  11   3   2   3   4   6   0  13   0

Some basic questions

What is the alphabet size?
What is the matrix width?
How many binding sites were used to build this matrix?
How many Abf1p binding sites are currently stored in SCPD (click on the button Get sites)?
Open a connection to the enologos Web site and generate logos representing the occurrences, frequencies, weights and information content of the matrix.
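The first three questions can also be checked programmatically. The sketch below parses the ABF1 counts as displayed above, assuming a whitespace-separated layout with the residue in the first field; the function parse_matrix is a hypothetical helper, not part of RSAT. It also assumes that every annotated site covers every column, so that the column sum equals the number of sites.

```python
# ABF1 counts as displayed above (source: SCPD).
ABF1_TEXT = """\
A   0   0  11   2   3   7   3   5   5  14   0   1
T  14   0   2   1   6   2   7   3   0   0   0   0
G   0   0   0   0   2   3   1   2   3   0   1  13
C   0  14   1  11   3   2   3   4   6   0  13   0
"""


def parse_matrix(text):
    """Parse a count matrix with one row per residue (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith(">"):
            continue  # skip empty lines and header lines
        counts[fields[0]] = [int(value) for value in fields[1:]]
    return counts


counts = parse_matrix(ABF1_TEXT)
alphabet_size = len(counts)                   # number of rows
width = len(next(iter(counts.values())))      # number of columns
column_sums = [sum(row[j] for row in counts.values()) for j in range(width)]

if __name__ == "__main__":
    print("alphabet size:", alphabet_size)
    print("matrix width:", width)
    print("column sums:", column_sums)
```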

Analyze the impact of the pseudo-weight on the matrix. Try progressively larger values (0, 1, 10, 100, 1000, 10000) and observe the effect on the frequencies, total information content, and shape of the sequence logo.

We will now analyze the content of this matrix with the program convert-matrix.

In the left menu of the RSAT page, select the form convert matrix under the title Matrix tools.

Copy the ABF1 matrix from SCPD to the text area in the form.

Make sure that the selected matrix format is tab.

Select a relevant background model. With the factor Abf1p, check the option Organism-specific, select the organism Saccharomyces cerevisiae, and the sequence type upstream-noorf.

For the Return fields, select the following options:
- counts
- frequencies
- weights
- info
- margins
- parameters
- logo

Click GO.

Matrix conversions

The original matrix was converted into different formats. We will briefly comment on these formats.

Counts

Counts are the primary information obtained from SCPD. They represent the number of occurrences (absolute frequency) of each residue at each position of the alignment of the annotated binding sites for the transcription factor Abf1p.

Frequencies

Relative frequencies are obtained by dividing the counts of each cell of the matrix by the sum of counts in its column.

You will notice that the frequency matrix does not faithfully reflect the relative frequencies calculated from the counts. In particular, the cells of the original matrix with count values of 0 have values larger than 0 in the frequency matrix.

You can check this by going back to the convert-matrix form (click the Back button of your browser), and redoing the conversion with a value of 0 for the option pseudo-weight (note: this is only for illustrative purposes, it is generally recommended to use a pseudo-weight of at least 1).
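To see why zero counts yield non-zero frequencies, here is a minimal sketch of a pseudo-weight correction. It assumes that the pseudo-weight k is distributed among residues proportionally to their prior p, i.e. f' = (n + k*p) / (N + k), and it uses equiprobable priors for simplicity, whereas the form above uses organism-specific ones. Both choices are assumptions of this sketch, and the function name is ours.

```python
# Equiprobable priors: a simplifying assumption for this sketch.
PRIORS = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

# First column of the ABF1 matrix above (T in all 14 sites).
FIRST_COLUMN = {"A": 0, "T": 14, "G": 0, "C": 0}


def corrected_frequencies(column_counts, priors, pseudo_weight=1.0):
    """Relative frequencies of one column, corrected with a pseudo-weight.

    Assumed formula: f' = (n + k * p) / (N + k), where n is the count,
    N the column sum, p the residue prior and k the pseudo-weight.
    """
    total = sum(column_counts.values())
    return {
        residue: (count + pseudo_weight * priors[residue]) / (total + pseudo_weight)
        for residue, count in column_counts.items()
    }


if __name__ == "__main__":
    for k in (0, 1):
        print("pseudo-weight", k, corrected_frequencies(FIRST_COLUMN, PRIORS, k))
```

With a pseudo-weight of 0 the frequencies equal count / column sum; with a pseudo-weight of 1 the cells with a count of 0 receive a small positive frequency, and the column still sums to 1.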

Weights

The weight is the score described above, i.e. the log-likelihood ratio between

P(S|M), the probability to observe the sequence S given the motif model (matrix) M, and
P(S|B), the probability to observe the sequence S given the background model B.

Positive weights indicate that the residue favors the binding of the transcription factor; negative weights indicate that it is unfavorable.
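At the level of a single cell, the weight can be sketched as the logarithm of the ratio between the corrected frequency of the residue and its prior. The code below applies this to the 5th column of the ABF1 matrix. The equiprobable priors, the pseudo-weight formula f' = (n + k*p) / (N + k) and the natural logarithm are assumptions of this sketch; the function name is ours.

```python
import math

# Equiprobable priors: a simplifying assumption for this sketch.
PRIORS = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

# Fifth column of the ABF1 matrix above.
FIFTH_COLUMN = {"A": 3, "T": 6, "G": 2, "C": 3}


def column_weights(column_counts, priors, pseudo_weight=1.0):
    """Weight of each residue in one column: ln(f' / p) (assumed formula)."""
    total = sum(column_counts.values())
    result = {}
    for residue, count in column_counts.items():
        frequency = (count + pseudo_weight * priors[residue]) / (total + pseudo_weight)
        # A zero frequency (only possible without pseudo-weight) gives an infinitely negative weight.
        result[residue] = math.log(frequency / priors[residue]) if frequency > 0 else float("-inf")
    return result


if __name__ == "__main__":
    for residue, weight in column_weights(FIFTH_COLUMN, PRIORS).items():
        print(residue, round(weight, 3))
```

Residues that are more frequent in the column than expected from the prior (here T) get a positive weight, the others a negative one. Under a Bernoulli background, the score of a sequence segment is the sum of the weights of its residues over the columns.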

Information

The information content of each cell of the matrix is calculated by multiplying the weight by the frequency. The information content of a row (column) is the sum of information contents of its cells.
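The sketch below computes the information content per column and for the whole ABF1 matrix, and repeats the calculation for the series of pseudo-weights suggested in the exercise above. As before, the equiprobable priors, the pseudo-weight formula f' = (n + k*p) / (N + k) and the natural logarithm are assumptions of this sketch, and the function name is ours.

```python
import math

# Counts copied from the ABF1 matrix above (source: SCPD).
ABF1_COUNTS = {
    "A": [0, 0, 11, 2, 3, 7, 3, 5, 5, 14, 0, 1],
    "T": [14, 0, 2, 1, 6, 2, 7, 3, 0, 0, 0, 0],
    "G": [0, 0, 0, 0, 2, 3, 1, 2, 3, 0, 1, 13],
    "C": [0, 14, 1, 11, 3, 2, 3, 4, 6, 0, 13, 0],
}
# Equiprobable priors: a simplifying assumption for this sketch.
PRIORS = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}


def information_per_column(counts, priors, pseudo_weight):
    """Information content of each column: sum over residues of f' * ln(f' / p)."""
    width = len(next(iter(counts.values())))
    infos = []
    for j in range(width):
        total = sum(row[j] for row in counts.values())
        info = 0.0
        for residue, row in counts.items():
            f = (row[j] + pseudo_weight * priors[residue]) / (total + pseudo_weight)
            if f > 0:  # a cell with frequency 0 contributes 0 (limit of f * ln f)
                info += f * math.log(f / priors[residue])
        infos.append(info)
    return infos


PSEUDO_WEIGHTS = (0, 1, 10, 100, 1000, 10000)
total_information = {
    k: sum(information_per_column(ABF1_COUNTS, PRIORS, k)) for k in PSEUDO_WEIGHTS
}

if __name__ == "__main__":
    for k, info in total_information.items():
        print(f"pseudo-weight {k:>5}: total information = {info:.4f}")
```

As the pseudo-weight grows, the corrected frequencies are pulled toward the priors and the total information content decreases toward 0, which is what flattens the sequence logo.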

Additional exercises

In the JASPAR database, retrieve the two alternative matrices for Klf4 (IDs MA0039.1 and MA0039.2, see tips below). Both matrices represent the binding sites (TFBS) of the mouse transcription factor Klf4, but their counts are quite different. Make a guess about the origin of the data that served to build these two matrices.
Convert each matrix with the tool convert-matrix, to extract the frequencies, weights, information content, parameters, and logo. Compare the results. How do you interpret the differences between these two matrices, which are supposed to represent the binding specificity of the same transcription factor?

Tips

For some transcription factors, JASPAR contains several matrices built from different datasets. There are for example two matrices for the mouse factor Klf4, denoted by the identifiers MA0039.1 and MA0039.2, respectively. By default, the Web interface only returns the most recent version of the matrix. In order to access all the versions:
- search by name (with Klf4), which will return the matrix MA0039.2;
- click on the sequence logo of the Klf4 matrix;
- in the detailed information window, click Show me all versions.
Read carefully the detailed information of the JASPAR records to understand the relationship between the data source and the composition of the two matrices.

Next steps

PSSM can be used to detect occurrences of a motif in sequences. The theory underlying matrix-based pattern matching is introduced in the slides on matrix-scan, the program that scans sequences with one or more matrices, under different background models, in order to predict binding sites and cis-regulatory modules (CRM). A detailed protocol has been published to explain the theoretical concepts and practical aspects related to sequence scanning with matrices.
- Turatsinze, J.V., Thomas-Chollier, M., Defrance, M. and van Helden, J. (2008) Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc, 3, 1578-1588. Pubmed 18802439

Several pattern-discovery programs can detect significant motifs, represented as PSSM, from a set of unaligned sequences. These programs have been used to predict regulatory elements in the regulatory regions of sets of co-regulated genes. Matrix-based pattern discovery is discussed in the tutorials on the gibbs sampler and consensus.

You can now come back to the tutorial main page and follow the next tutorials.
