2004 by
Performs inter-conversions between various formats of position-specific scoring matrices (PSSM).
The program also performs a statistical analysis of the original matrix to provide position-specific scores (weights, frequencies, information content), general statistics (E-value, total information content), and synthetic descriptions (consensus).
PSSM can be used to represent the binding specificity of a transcription factor or the conserved residues of a protein domain.
Each row of the matrix corresponds to one residue (nucleotide or amino acid, depending on the sequence type). Each column corresponds to one position in the alignment. The value in each cell is the frequency of the corresponding residue at the corresponding position.
Some formats are supported only for input, others for output. There are more formats accepted for input, because the general use of this program is to convert a PSSM obtained from a database (e.g. TRANSFAC) or a pattern-discovery program (e.g. consensus, gibbs, meme, MotifSampler, ...) and obtain a matrix either for scanning (with matrix-scan) or for computing statistical parameters (see the return fields below). We generally use the TRANSFAC (tf) format, in which we can specify identifiers and names for the matrix.
AC MA0001.1
XX
ID AGL3
XX
DE MA0001.1 AGL3; from JASPAR
PO A C G T
1 0 94 1 2
2 3 75 0 19
3 79 4 3 11
4 40 3 4 50
5 66 1 1 29
6 48 2 0 47
7 65 5 5 22
8 11 2 3 81
9 65 3 28 1
10 0 3 88 6
XX
CC program: jaspar
XX
//
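As an illustration of how the TRANSFAC layout shown above can be read, here is a minimal Python sketch. It is not the parser used by this program: the function name `parse_transfac` and the returned structure are hypothetical, and the sketch assumes one matrix per input, with a `PO` (or `P0`) header line followed by one numbered line per position.

```python
def parse_transfac(text):
    """Parse one TRANSFAC-style matrix (hypothetical helper, illustration only).

    Returns (fields, residues, counts), where counts[residue] is the list
    of values for that residue, one per position.
    """
    fields = {}     # header fields such as AC, ID, DE, CC
    residues = []   # residue order taken from the PO/P0 line
    counts = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] in ("XX", "//"):
            continue  # separator lines carry no data
        key = tokens[0]
        if key in ("PO", "P0"):
            residues = tokens[1:]
            counts = {r: [] for r in residues}
        elif key.isdigit() and residues:
            # zip() stops after the last residue, so a trailing consensus
            # letter on the position line is ignored
            for residue, value in zip(residues, tokens[1:]):
                counts[residue].append(float(value))
        else:
            fields[key] = " ".join(tokens[1:])
    return fields, residues, counts


example = """AC MA0001.1
XX
ID AGL3
XX
DE MA0001.1 AGL3; from JASPAR
PO A C G T
1 0 94 1 2
2 3 75 0 19
3 79 4 3 11
XX
//
"""
fields, residues, counts = parse_transfac(example)
```

The example string holds only the first three positions of the AGL3 matrix, to keep the sketch short.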
The tab format accepts a user-specified set of return fields (option -return), providing different statistics on the matrix (counts, frequencies, weights, information, other parameters: see description below).
; MET4 matrix, from Gonze et al. (2005). Bioinformatics 21, 3490-500.
A | 7 9 0 0 16 0 1 0 0 11 6 9 6 1 8
C | 5 1 4 16 0 15 0 0 0 3 5 5 0 2 0
G | 4 4 1 0 0 0 15 0 16 0 3 0 0 2 0
T | 0 2 11 0 0 1 0 16 0 2 2 2 10 11 8
//
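The tab layout (one row per residue, a `|` after the residue letter, `;` for comments, `//` as terminator) can be read with a few lines of Python. This is a sketch under those assumptions only; `parse_tab` is a hypothetical name, and the sketch handles a plain count matrix, not the additional return fields.

```python
def parse_tab(text):
    """Read a tab-format count matrix into {residue: [counts per position]}."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        # skip blank lines, comment lines and the end-of-matrix marker
        if not line or line.startswith(";") or line.startswith("//"):
            continue
        residue, _, values = line.partition("|")
        counts[residue.strip()] = [float(v) for v in values.split()]
    return counts


met4 = """; MET4 matrix, from Gonze et al. (2005). Bioinformatics 21, 3490-500.
A | 7 9 0 0 16 0 1 0 0 11 6 9 6 1 8
C | 5 1 4 16 0 15 0 0 0 3 5 5 0 2 0
G | 4 4 1 0 0 0 15 0 16 0 3 0 0 2 0
T | 0 2 11 0 0 1 0 16 0 2 2 2 10 11 8
//
"""
met4_counts = parse_tab(met4)
# Sum of counts at each position (one value per column)
column_sums = [sum(col) for col in zip(*met4_counts.values())]
```

For this matrix every column sums to 16, which is a quick sanity check that the rows were split correctly.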
> Mycn
A [ 0 29 0 2 0 0 ]
C [31 0 30 1 3 0 ]
G [ 0 0 0 28 0 31]
T [ 0 2 1 0 28 0 ]
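The bracketed layout of the Mycn example above (a `>` header line, then one row per residue with the counts enclosed in square brackets) can be parsed along the same lines. The sketch below is illustrative only; `parse_bracketed` is a hypothetical name, and it assumes a single matrix per input.

```python
def parse_bracketed(text):
    """Parse a '>name' header followed by rows such as 'A [ 0 29 0 2 0 0 ]'."""
    name = None
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            continue
        residue, _, rest = line.partition("[")
        # drop the closing bracket, then split on whitespace
        values = rest.replace("]", " ").split()
        counts[residue.strip()] = [float(v) for v in values]
    return name, counts


mycn_text = """> Mycn
A [ 0 29 0 2 0 0 ]
C [31 0 30 1 3 0 ]
G [ 0 0 0 28 0 31]
T [ 0 2 1 0 28 0 ]
"""
mycn_name, mycn_counts = parse_bracketed(mycn_text)
```

Partitioning on `[` rather than splitting on whitespace keeps rows such as `C [31 ...` and `G [ ... 31]` readable, where the bracket touches a number.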
>mef2
10 0 0 0 22 0 6 2 3 4 22 10
0 2 12 0 0 0 0 0 0 0 0 0
9 20 2 0 0 0 0 0 0 0 0 10
3 0 8 22 0 22 16 20 19 18 0 2
>myf
7 9 4 0 16 7 0 6 0 0 6 0
8 0 2 15 0 0 15 0 0 10 0 0
1 7 10 1 0 9 1 0 16 6 0 16
0 0 0 0 0 0 0 10 0 0 10 0
Background letter frequencies
A 0.303 C 0.183 G 0.209 T 0.306

MOTIF crp alternative name
letter-probability matrix: alength= 4 w= 19 nsites= 17 E= 4.1e-009
0.000000 0.176471 0.000000 0.823529
0.000000 0.058824 0.647059 0.294118
0.000000 0.058824 0.000000 0.941176
0.176471 0.000000 0.764706 0.058824
0.823529 0.058824 0.000000 0.117647
0.294118 0.176471 0.176471 0.352941
0.294118 0.352941 0.235294 0.117647
0.117647 0.235294 0.352941 0.294118
0.529412 0.000000 0.176471 0.294118
0.058824 0.235294 0.588235 0.117647
0.176471 0.235294 0.294118 0.294118
0.000000 0.058824 0.117647 0.823529
0.058824 0.882353 0.000000 0.058824
0.764706 0.000000 0.176471 0.058824
0.058824 0.882353 0.000000 0.058824
0.823529 0.058824 0.058824 0.058824
0.176471 0.411765 0.058824 0.352941
0.411765 0.000000 0.000000 0.588235
0.352941 0.058824 0.000000 0.588235
>element1 /name=element1
0 4 2 14
12 0 0 8
8 0 1 11
20 0 0 0
....
NA Mync
XX
DE Mync
XX
P0 A C G T
01 0 31 0 0 C
02 29 0 0 2 A
03 0 30 0 1 C
04 2 1 28 0 G
05 0 3 0 28 T
06 0 0 31 0 G
XX
This is actually the same format as tab (described above), but the only return field is the count matrix.
Protein: Cbf1 Seed k-mer: ATCACGTG Enrichment Score: 0.499010437669239
A: 0.251714422716682 0.231020715440932 0.371175995676819 0.343515826416987 0.189181911178663 0.373249743142318 0.159425685466501 0.387398837326962 0.160370450851774 0.00579566973382471 0.984310428811586 0.000520578518462409 0.0512168242470759 0.00554791387069823 0.00108871328362558 0.436684281349379 0.106429865986653 0.0872652424535894 0.2779359708333 0.222894293683715 0.366796870220836 0.226022414885529
C: 0.10226033847082 0.315992694980937 0.148489261324769 0.182792315701972 0.406736016253256 0.213860951366744 0.324485360588445 0.0418650553618826 0.045745403962552 0.99055073718171 0.0105040001038691 0.975244090256611 0.0064124125195024 0.00433861322483728 0.0016092288172844 0.262472975122313 0.184027817720692 0.549338793818378 0.127202171464537 0.198102864294932 0.306135553163069 0.321957177096839
G: 0.130212757211399 0.266959960914667 0.154092799416608 0.241158374156534 0.15046928890062 0.0930127274890034 0.354968645980097 0.554463521652852 0.0956960762910429 0.00242503126912309 0.0010685902059978 0.00922202192948052 0.94180586404824 0.00940766558113401 0.973242502288466 0.0857562483507934 0.0600857360004131 0.129208087201377 0.253172180381345 0.437904236128241 0.1940215817188 0.116641876697499
T: 0.515812481601099 0.186026628663464 0.326241943581804 0.232533483724507 0.253612783667461 0.319876578001935 0.161120307964957 0.0162725856583034 0.698188068894632 0.00122856181534235 0.00411698087854669 0.0150133092954459 0.000564899185181348 0.98070580732333 0.0240595556106241 0.215086495177514 0.649456580292242 0.234187876526655 0.341689677320817 0.141098605893113 0.133045994897295 0.335378531320133
>SIX5_disc1 SIX5_GM12878_encode-Myers_seq_hsa_r1:MEME#1#Intergenic
G 0.008511 0.004255 0.987234 0.000000
A 0.902127 0.012766 0.038298 0.046809
R 0.455319 0.072340 0.344681 0.127660
W 0.251064 0.085106 0.085106 0.578724
T 0.000000 0.046809 0.012766 0.940425
G 0.000000 0.000000 1.000000 0.000000
T 0.038298 0.021277 0.029787 0.910638
A 0.944681 0.004255 0.051064 0.000000
G 0.000000 0.000000 1.000000 0.000000
T 0.000000 0.000000 0.012766 0.987234
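Fij=Cij/SUMi(Cij)
where Cij is the number of occurrences of residue i at position j of the matrix, and Fij is the corresponding relative frequency.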
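F''ij=(Cij+b*Pi)/[SUMi(Cij)+b]
where Pi is the prior probability of residue i, b is the pseudo-weight, and F''ij is the frequency corrected by the pseudo-weight.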
Wij=ln(F''ij/Pi)
Iij = Fij*ln(Fij/Pi)
In addition, we calculate a "corrected" information content which takes pseudo-weights into account.
I''ij = F''ij*ln(F''ij/Pi)
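The position-specific scores defined above can be sketched in Python as follows. This is an illustration, not the program's own code. It reads F''ij as (Cij + b*Pi) / (SUMi(Cij) + b), treats the term Fij*ln(Fij/Pi) as 0 when Fij = 0, and uses a pseudo-weight of 1 by default; the function name `matrix_scores`, the default value, and the zero-frequency convention are all assumptions.

```python
import math


def matrix_scores(counts, priors, pseudo=1.0):
    """Compute F, F'', W, I and I'' for every residue i and position j.

    counts: {residue: [Cij for each position j]}
    priors: {residue: Pi}
    pseudo: the pseudo-weight b (assumed > 0 in this sketch)
    """
    if pseudo <= 0:
        raise ValueError("this sketch needs pseudo > 0 so that F'' is never zero")
    residues = list(counts)
    width = len(counts[residues[0]])
    names = ("frequency", "corrected", "weight",
             "information", "corrected_information")
    scores = {name: {r: [] for r in residues} for name in names}
    for j in range(width):
        column_sum = sum(counts[r][j] for r in residues)  # SUMi(Cij)
        for r in residues:
            c, p = counts[r][j], priors[r]
            f = c / column_sum                               # Fij
            fc = (c + pseudo * p) / (column_sum + pseudo)    # F''ij
            scores["frequency"][r].append(f)
            scores["corrected"][r].append(fc)
            scores["weight"][r].append(math.log(fc / p))     # Wij
            # Iij, with the assumed convention 0*ln(0) = 0
            scores["information"][r].append(
                f * math.log(f / p) if f > 0 else 0.0)
            scores["corrected_information"][r].append(
                fc * math.log(fc / p))                        # I''ij
    return scores


# Count matrix of the Mycn example, with equiprobable priors (an assumption)
mycn = {"A": [0, 29, 0, 2, 0, 0], "C": [31, 0, 30, 1, 3, 0],
        "G": [0, 0, 0, 28, 0, 31], "T": [0, 2, 1, 0, 28, 0]}
equiprobable = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
mycn_scores = matrix_scores(mycn, equiprobable, pseudo=1.0)
```

Because the pseudo-weight is distributed according to the priors, the corrected frequencies of a column still sum to 1, and residues with a zero count get a finite (negative) weight instead of minus infinity.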
Pij = SUM(k=Cij..C.j) [ C.j! / (k!(C.j-k)!) ] * Pi^k * (1-Pi)^(C.j-k)
where C.j is the total number of residues in column j (C.j=SUMi(Cij)).
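Read as the upper tail of a binomial distribution (the probability of observing at least Cij occurrences of residue i among the C.j residues of column j, given the prior Pi), the formula above can be sketched in Python. The function name `binomial_tail` is hypothetical, and the exponent of (1-Pi) is taken to be C.j-k, as in the standard binomial; counts are assumed to be integers.

```python
from math import comb


def binomial_tail(c_ij, c_j, p_i):
    """P(X >= c_ij) for X ~ Binomial(c_j, p_i)."""
    return sum(
        comb(c_j, k) * p_i ** k * (1 - p_i) ** (c_j - k)
        for k in range(c_ij, c_j + 1)
    )


# Residue C at position 1 of the Mycn example: 31 occurrences out of 31,
# with an assumed prior of 0.25
p_c1 = binomial_tail(31, 31, 0.25)
```

When the residue fills the whole column, the sum reduces to the single term Pi^C.j, so `p_c1` equals 0.25 to the power 31.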
Some additional parameters are optionally calculated
The consensus is exported both as a regular expression and with the IUPAC code for ambiguous nucleotides (http://www.chem.qmw.ac.uk/iupac/misc/naseq.html).
A (Adenine)
C (Cytosine)
G (Guanine)
T (Thymine)
R = A or G (puRines)
Y = C or T (pYrimidines)
W = A or T (Weak hydrogen bonding)
S = G or C (Strong hydrogen bonding)
M = A or C (aMino group at common position)
K = G or T (Keto group at common position)
H = A, C or T (not G)
B = G, C or T (not A)
V = G, A or C (not T)
D = G, A or T (not C)
N = G, A, C or T (aNy)
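The table above maps directly onto regular-expression character classes. The following Python sketch shows one way to translate an IUPAC consensus into a regular expression; the function name `iupac_to_regex` is hypothetical, and the exact regular-expression style written by the program (letter case, ordering within brackets) may differ.

```python
import re

# IUPAC nucleotide codes, as listed in the table above
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    parts = []
    for symbol in consensus.upper():
        residues = IUPAC[symbol]
        # a single residue stays as is, ambiguous codes become a class
        parts.append(residues if len(residues) == 1 else "[" + residues + "]")
    return "".join(parts)


pattern = iupac_to_regex("GARWTGTAGT")
matches = re.fullmatch(pattern, "GAGTTGTAGT") is not None
```

Here `pattern` is `GA[AG][AT]TGTAGT`, built from the consensus letters of the SIX5_disc1 example.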
The strict consensus indicates, at each position, the residue with the highest positive weight.
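A strict consensus of this kind can be derived from a weight matrix as sketched below. The text does not say what is reported when no residue has a positive weight at a position; this sketch emits `N` in that case, which is an assumption, as are the function name `strict_consensus` and the illustrative weight values.

```python
def strict_consensus(weights, no_call="N"):
    """Return, per position, the residue with the highest positive weight.

    weights: {residue: [Wij for each position j]}
    no_call: symbol used when no weight is positive (assumed convention)
    """
    residues = list(weights)
    width = len(weights[residues[0]])
    consensus = []
    for j in range(width):
        best = max(residues, key=lambda r: weights[r][j])
        consensus.append(best if weights[best][j] > 0 else no_call)
    return "".join(consensus)


# Made-up weights for three positions, for illustration only
toy_weights = {
    "A": [1.2, -0.5, -1.0],
    "C": [-1.0, -0.3, -1.0],
    "G": [-1.0, -0.2, 0.8],
    "T": [0.1, -0.4, -1.0],
}
toy_consensus = strict_consensus(toy_weights)
```

At the second position all weights are negative, so the sketch reports the no-call symbol rather than the least negative residue.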
Sequence logos are generated using the freeware program WebLogo.