The theoretical background required for this tutorial can be found in the RSAT course.

In particular, we recommend reading the following slides before starting this tutorial.

Position-specific scoring matrices (PSSM) offer a flexible way to represent the specificity of transcription factor/DNA interactions. PSSM can be built on the basis of a set of known binding sites for the factor of interest.

Knowledge-based PSSM can be obtained from various transcription factor databases, for example:

- JASPAR (Eukaryotes)
- RegulonDB (*Escherichia coli K12*)
- SCPD (The Promoter Database of *Saccharomyces cerevisiae*)
- Yeastract (Yeast Search for Transcriptional Regulators And Consensus Tracking)
- TRANSFAC (Note: the full database access requires a commercial license)

The matrix below was obtained from SCPD, the
*Saccharomyces cerevisiae* promoter database. It has been built
from an alignment of 12 binding sites for the yeast transcription
factor Pho4p.

    A   3   2   0  12   0   0   0   0   1   3
    C   5   2  12   0  12   0   1   0   2   1
    G   3   7   0   0   0  12   0   7   5   4
    T   1   1   0   0   0   0  11   5   4   4

Each row represents one residue (A, C, G or T), and each column a
position in a set of aligned binding sites. Some positions are
perfectly or almost perfectly conserved across the known binding sites
(the motif `CACGT` starting at the 3rd position, where only the
`T` shows a single exception), whereas other positions present two
choices (e.g. `G` or `T` at position 8), and other positions can
contain any letter, but with different frequencies (e.g. the first and
last positions).

When the matrix is used to scan sequences for putative Pho4p binding sites, the more conserved positions impose stronger constraints than those where any nucleotide can be found. Matrix-based motif representations thus provide better support than string-based representations for describing the binding affinity.

Actually, the frequencies are not used as such to score putative
sites. The score assigned to a piece of sequence *S* is
calculated as the log-ratio of two probabilities:

- *P(S|M)*, the probability to observe sequence *S* given the motif model *M* (the matrix).
- *P(S|B)*, the probability to observe sequence *S* given the background model *B* (the genomic context).
- The score of a sequence segment is $W_S = \log[P(S|M)/P(S|B)]$

With the matrix above, we could calculate the probability of a
sequence *S* of 10 nucleotides, as the product of the relative
frequencies of these nucleotides in the PSSM.
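As a minimal sketch of this calculation, the Python snippet below multiplies the relative frequencies read from the Pho4p count matrix shown above. The function name `site_probability` and the two example sequences are illustrative choices, not part of RSAT, and no pseudo-weight is applied here.

```python
# Pho4p count matrix from SCPD, as given in this tutorial (12 aligned sites).
COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}


def site_probability(seq, counts=COUNTS):
    """Return P(S|M) as the product of the relative frequencies of the residues of S."""
    width = len(counts["A"])
    if len(seq) != width:
        raise ValueError(f"expected a sequence of {width} nucleotides")
    prob = 1.0
    for pos, residue in enumerate(seq.upper()):
        column_total = sum(counts[r][pos] for r in "ACGT")
        prob *= counts[residue][pos] / column_total
    return prob


# Illustrative sequences (hypothetical, chosen for this example only).
p_good = site_probability("CGCACGTGGG")  # most frequent residue at each position
p_zero = site_probability("CGCACGAGGG")  # 'A' was never observed at position 7

print(p_good, p_zero)
```

Note that a single residue with a count of 0 brings the whole product down to 0, whatever the other positions contain. This is one reason why counts are usually corrected with a pseudo-weight before being converted to frequencies.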

A critical issue is to choose an appropriate background model. The simplest background models are based on a Bernoulli schema, which means that they rely on the assumption that successive nucleotides are independent of each other. More elaborate models have been proposed, based on Markov chains (the description of these models can be found in the RSAT course, slides on sequence models).
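The difference between the two families of background models can be sketched in a few lines of Python. Everything below is illustrative: the function names, the residue frequencies and the toy training sequence are made up for the example, and RSAT estimates its own background models from genomic sequences.

```python
import math
from collections import Counter


def bernoulli_log_prob(seq, freqs):
    """log P(S|B) when each residue is drawn independently of its neighbours."""
    return sum(math.log(freqs[r]) for r in seq)


def train_markov1(training_seq, alphabet="ACGT", pseudo=1.0):
    """Estimate first-order transition probabilities P(next | previous)."""
    pair_counts = Counter(zip(training_seq, training_seq[1:]))
    transitions = {}
    for prev in alphabet:
        total = sum(pair_counts[(prev, nxt)] for nxt in alphabet)
        total += pseudo * len(alphabet)  # pseudo-count avoids zero probabilities
        transitions[prev] = {
            nxt: (pair_counts[(prev, nxt)] + pseudo) / total for nxt in alphabet
        }
    return transitions


def markov1_log_prob(seq, freqs, transitions):
    """log P(S|B) when each residue depends on the preceding one."""
    logp = math.log(freqs[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(transitions[prev][nxt])
    return logp


# Made-up values, for illustration only.
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
transitions = train_markov1("ATATATATGCGCATATAT")

for seq in ("ATATAT", "AATTAT"):
    print(seq, bernoulli_log_prob(seq, freqs), markov1_log_prob(seq, freqs, transitions))
```

Both sequences have the same composition, so the Bernoulli model gives them the same probability, whereas the Markov model distinguishes them according to the order of the residues.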

One weakness of PSSM is that they do not take into account higher order dependencies between residues, i.e. correlations between the residue found at a given position and those found at other positions. Even if, with Markov-chain based background models, such dependencies are taken into account for the background model, correlations between different positions of the binding sites are still not taken into account.
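This limitation can be made concrete with a small Python sketch. It assumes a hypothetical collection of ten sites, half `CACGTGGG` and half `CACGTTTT`, and shows that a matrix built from them gives exactly the same probability to a chimeric site that was never observed. The function names are illustrative.

```python
from collections import Counter


def count_matrix(sites):
    """Build one residue counter per aligned position."""
    width = len(sites[0])
    return [Counter(site[pos] for site in sites) for pos in range(width)]


def site_probability(seq, columns):
    """P(S|M) as a product of per-position relative frequencies."""
    prob = 1.0
    for residue, column in zip(seq, columns):
        prob *= column[residue] / sum(column.values())
    return prob


# Hypothetical site collection: two variants, never mixed.
sites = ["CACGTGGG"] * 5 + ["CACGTTTT"] * 5
columns = count_matrix(sites)

p_observed = site_probability("CACGTGGG", columns)  # a real variant
p_chimera = site_probability("CACGTGTG", columns)   # a mix never seen in the sites

print(p_observed, p_chimera)
```

Since each column is scored independently, the matrix cannot tell the observed variants from any combination of `G` and `T` at the last three positions.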

For example, a PSSM does not allow one to specify a pattern
like *"either CACGTGGG or CACGTTTT"*: if one builds a matrix where

RSAT includes a program called *convert-matrix*, which extracts a PSSM from the output files of different programs (consensus, gibbs, MEME, clustal) or databases (JASPAR, TRANSFAC).

We will import a matrix representing the binding specificity of the yeast transcription factor Abf1p, and analyze the different parameters which characterize this PSSM.

- Connect to SCPD, the *Saccharomyces cerevisiae* promoter database. To access the list of sites by factor from the SCPD home page, click on the section "Regulatory elements and transcriptional factors".
- Click on the link to ABF1, and then on the button **Get matrix**. The following PSSM is displayed.

      >ABF17121SCPD
      A   0   0  11   2   3   7   3   5   5  14   0   1
      T  14   0   2   1   6   2   7   3   0   0   0   0
      G   0   0   0   0   2   3   1   2   3   0   1  13
      C   0  14   1  11   3   2   3   4   6   0  13   0

**Some basic questions**

- What is the alphabet size?
- What is the matrix width?
- How many binding sites were used to build this matrix?
- How many Abf1p binding sites are currently stored in SCPD (click on the button **Get sites**)?
- Open a connection to the enologos Web site and generate logos representing the occurrences, frequencies, weights and information content of the matrix.
- Analyze the impact of the pseudo-weight on the matrix. Try progressively larger values (0, 1, 10, 100, 1000, 10000) and analyze the impact on the frequencies, total information content, and shape of the sequence logo.

We will now analyze the content of this matrix with the
program *convert-matrix*.

- In the left menu of the RSAT page, select the form **convert matrix** under the title **Matrix tools**.
- Copy the ABF1 matrix from SCPD to the text area in the form. **Beware**, you should only copy the 4 rows containing the nucleotide information, and not the matrix header.
- Make sure that the selected matrix format is *tab*.
- Select a relevant **background model**. With the factor Abf1p, check the option *Organism-specific*, select the organism *Saccharomyces cerevisiae*, and the sequence type *upstream-noorf*.
- For the *Return fields*, select the following options:
  - counts
  - frequencies
  - weights
  - info
  - margins
  - parameters
  - logo
- Click **GO**.

The original matrix was converted into different formats. We will briefly comment on these formats.

**Counts** are the primary information obtained from
SCPD. They represent the number of occurrences (absolute frequency) of
each residue at each position of the alignment of the annotated
binding sites for the transcription factor Abf1p.

**Relative frequencies** are obtained by dividing the counts of each
cell of the matrix by the sum of counts in its column.

You will notice that the frequency matrix does not faithfully reflect the relative frequencies calculated from the counts. In particular, the cells of the original matrix with count values of 0 have values larger than 0 in the frequency matrix.

You can check this by coming back to the convert-matrix form (click
the *Back* button of your browser), and redoing the conversion
with a value of 0 for the option *pseudo-weight* (note:
this is only for illustrative purposes, it is generally recommended to
use a pseudo-weight of at least 1).
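The effect of the pseudo-weight can be explored with a short Python sketch. It assumes a common correction in which the pseudo-weight `k` is distributed among residues in proportion to their prior probabilities `p_i`, i.e. `f'(i,j) = (n(i,j) + p_i * k) / (N_j + k)`; whether convert-matrix applies exactly this formula is an assumption here. The example reuses the smaller Pho4p matrix from the beginning of this tutorial, with equiprobable priors chosen only for simplicity.

```python
# Pho4p count matrix from SCPD, as given earlier in this tutorial.
COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}
# Assumption: equiprobable residues (a real analysis would use genomic priors).
PRIORS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def corrected_frequencies(counts, priors, pseudo=1.0):
    """Relative frequencies corrected by a pseudo-weight shared according to priors."""
    residues = list(counts)
    width = len(counts[residues[0]])
    freqs = {r: [] for r in residues}
    for pos in range(width):
        total = sum(counts[r][pos] for r in residues)
        for r in residues:
            freqs[r].append((counts[r][pos] + priors[r] * pseudo) / (total + pseudo))
    return freqs


# Frequency of C and A at position 3 (index 2) for increasing pseudo-weights.
for k in (0, 1, 10, 100, 1000, 10000):
    f = corrected_frequencies(COUNTS, PRIORS, pseudo=k)
    print(k, round(f["C"][2], 4), round(f["A"][2], 4))
```

With a pseudo-weight of 0 the frequencies equal the raw proportions, including the zeros. As the pseudo-weight grows, all frequencies drift towards the priors and the motif progressively loses its specificity.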

The **weight** is the score described above, i.e. the
log-likelihood ratio between

- *P(S|M)*, the probability to observe the sequence *S* given the motif model (matrix) *M*, and
- *P(S|B)*, the probability to observe the sequence *S* given the background model *B*.

Positive weights indicate that the residue is considered to favour the binding of the transcription factor, negative weights that it is unfavorable.
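A possible way to compute such weights is sketched below in Python, again with the Pho4p matrix from the beginning of this tutorial. The sketch assumes natural logarithms, equiprobable priors, and a pseudo-weight of 1 distributed according to the priors; these settings and the function names are illustrative and may differ from the defaults of convert-matrix.

```python
import math

# Pho4p count matrix from SCPD, as given earlier in this tutorial.
COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}
# Assumption: equiprobable background (Bernoulli model with p = 0.25).
PRIORS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def weight_matrix(counts, priors, pseudo=1.0):
    """Weight of each cell: ln(corrected frequency / prior)."""
    width = len(next(iter(counts.values())))
    weights = {r: [] for r in counts}
    for pos in range(width):
        total = sum(counts[r][pos] for r in counts)
        for r in counts:
            freq = (counts[r][pos] + priors[r] * pseudo) / (total + pseudo)
            weights[r].append(math.log(freq / priors[r]))
    return weights


def score(seq, weights):
    """Weight score of a sequence segment: sum of the weights of its residues."""
    return sum(weights[residue][pos] for pos, residue in enumerate(seq))


WEIGHTS = weight_matrix(COUNTS, PRIORS)
print(score("CGCACGTGGG", WEIGHTS))  # hypothetical segment matching the motif
print(score("AAAAAAAAAA", WEIGHTS))  # hypothetical segment unrelated to the motif
```

Under a Bernoulli background, summing the cell weights is equivalent to taking the logarithm of the ratio *P(S|M)/P(S|B)*, since both probabilities are products over positions.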

The **information content** of each cell of the matrix is calculated by
multiplying the weight by the frequency. The information content of a
row (column) is the sum of information contents of its cells.
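The calculation can be sketched as follows in Python, with the Pho4p matrix from the beginning of this tutorial. As in the previous examples, the natural logarithm, the equiprobable priors and the pseudo-weight of 1 are assumptions made for the illustration, and the function name is hypothetical.

```python
import math

# Pho4p count matrix from SCPD, as given earlier in this tutorial.
COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}
# Assumption: equiprobable background.
PRIORS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def information_matrix(counts, priors, pseudo=1.0):
    """Information of each cell: frequency * ln(frequency / prior)."""
    width = len(next(iter(counts.values())))
    info = {r: [] for r in counts}
    for pos in range(width):
        total = sum(counts[r][pos] for r in counts)
        for r in counts:
            freq = (counts[r][pos] + priors[r] * pseudo) / (total + pseudo)
            info[r].append(freq * math.log(freq / priors[r]))
    return info


info = information_matrix(COUNTS, PRIORS)
width = len(info["A"])

# Information per column, then total information of the matrix.
column_info = [sum(info[r][pos] for r in info) for pos in range(width)]
total_info = sum(column_info)

for pos, value in enumerate(column_info, start=1):
    print(pos, round(value, 3))
print("total", round(total_info, 3))
```

Individual cells can have a negative information content, but the sum over a column cannot: conserved columns reach the highest values, and columns close to the background composition approach 0.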

- In the JASPAR database, retrieve the two alternative matrices for Klf4 (IDs MA0039.1 and MA0039.2, see tip below). These matrices both represent the TFBS of the mouse transcription factor Klf4. However, their numbers are quite different. Make a guess about the origin of the data that served to build these two matrices.
- Convert each matrix with the tool convert-matrix, to extract the frequencies, weights, information content, parameters, and logo. Compare the results. How do you interpret the differences between these two matrices, which are supposed to represent the binding specificity of the same transcription factor?

- For some transcription factors, JASPAR contains several matrices built from different datasets. There are for example two matrices for the mouse factor Klf4, denoted by the identifiers MA0039.1 and MA0039.2, respectively. By default, the Web interface only returns the most recent version of the matrix. In order to access all the versions:
  - search by name (with Klf4), which returns the matrix MA0039.2;
  - click on the sequence logo of the Klf4 matrix;
  - in the detailed information window, click *Show me all versions*.

- Read carefully the detailed information of the JASPAR records to understand the relationship between the data source and the composition of the two matrices.

- PSSM can be used to detect occurrences of a motif in
sequences. The theory underlying matrix-based pattern matching
is introduced in the course slides. Scanning itself is done with
*matrix-scan*, which scans sequences with one or more matrices, under
different background models, in order to predict binding sites and
cis-regulatory modules (CRM). A detailed protocol has been
published to explain the theoretical concepts and practical
aspects related to sequence scanning with matrices.
- Turatsinze, J.V., Thomas-Chollier, M., Defrance, M. and van Helden, J. (2008) Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc, 3, 1578-1588. Pubmed 18802439

- Several pattern-discovery programs can detect significant motifs, represented as PSSM, from a set of unaligned sequences. These programs have been used to predict regulatory elements in the regulatory regions of sets of co-regulated genes. Matrix-based pattern discovery is discussed in the tutorials on the gibbs sampler and consensus.

You can now come back to the tutorial main page and follow the next tutorials.
