Next: Node degree statistics Up: Network Analysis Tools (NeAT) Previous: Network visualization and format Contents

Subsections

Comparisons between networks

Introduction

Protein interaction networks have deserved a special attention for molecular biologists, and several high-throughput methods have been developed during the last years, to reveal either pairwise interactions between proteins (two-hybrid technology) or protein complexes (methods relying on mass-spectrometry). The term interactome has been defined to denote the complete set of interactions between proteins of a given organism.

Interactome data is typically represented by an un-directed graph, where each node represents a polypeptide, and each edge an interaction between two polypeptides.

The yeast interactome was characterized by the two-hybrid method by two independent groups, Uetz and co-workers [32], and Ito and co-workers [13], respectively. Surprisingly, the two graphs resulting from these experiments showed a very small intersection.

In this tutorial, we will use the program compare-graphs to analyze the interactome graphs published by from Uetz and Ito, respectively.

We will first perform a detailed comparison, by merging the two graphs, and labelling each node according to the fact that it was found in Ito's network, in Uetz' network, or in both. We will then compute some statistics to estimate the significance of the intersection between the two interactome graphs.

Computing the intersection, union and differences between two graphs

Study case

In this demonstration, we will compare the networks resulting from the two first publications reporting a complete characterization of the yeast interactome, obtained using the two-hybrid method.The first network [32] contains 865 interactions between 926 proteins.The second network [13] contains 786 interactions between 779 proteins. We will merge the two networks (i.e. compute their union), and label each edge according to the fact that it is found in Ito's network, Uetz' network, or both. We will also compute the statistical significance of the intersection between the two networks.

Protocol for the web server

In the NeATmenu, select the command network comparison.
In the right panel, you should now see a form entitled ``compare-graphs''.
Click on the button DEMO.
The form is now filled with two graphs, and the parameters have been set up to their appropriate value for the demonstration. At the top of the form, you can read some information about the goal of the demo, and the source of the data.
Click on the button GO.
The computation should take a few seconds only. The result page shows you some statistics about the comparison (see interpretation below), and a link pointing to the full result file.
Click on the link to see the full result file.

Protocol for the command-line tools

If you have installed a stand-alone version of the NeAT distribution, you can use the program compare-graphs on the command-line. This requires to be familiar with the Unix shell interface. If you don't have the stand-alone tools, you can skip this section and read the next section (Interpretation of the results).

We will now describe the use of compare-graphs as a command line tool. The two two-hybrid datasets described in the previous section may be downloaded at the following address http://rsat.scmbb.ulb.ac.be/rsat/data/neat_tuto_data/. These are the files uetz_2001.tab and ito_2002.tab.

Go in the directory where the files containing the graphs to compare are located.

Type the following command

			compare-graphs -v 1 -Q ito_2002.tab -R uetz_2001.tab -return union \ 
			-o uetz_2001_union_ito_2002.tab

Using these options, some comparaison statistics are displayed and the results are stored in the tab-delimited file uetz_2001_union_ito_2002.tab.

In order to compute the difference or the intersection, you must change the -return option. For example, to compute the intersection, you shoud type.

			compare-graphs -v 1 -Q ito_2002.tab -R uetz_2001.tab -return intersection \ 
			-o uetz_2001_inter_ito_2002.tab

Interpretation of the results

The program compare-graphs uses symbols and respectively, to denote the two graphs to be compared. Usually, stands for reference, and for query.

In our case, indicates Ito's network, whereas indicates Uetz' network. The two input graphs are considered equivalent, there is no reason to consider one of them as reference, but this does not really matter, because the statistics used for the comparison are symmetrical,as we will see below.

Union, intersection and differences

The result file contains the union graph, in tab-delimited format. This format is very convenient for inspecting the result, and for importing it into statistical packages (R, Excel, ...).

The rows starting with a semicolon (;) are comment lines. They provide you with some information (e.g. statistics about the intersection), but they will be ignored by graph-reading programs. The description of the result graph comes immediately after these comment lines.

Each row corresponds to one arc, and each column specifies one attribute of the arc.

source: the ID of the source node
target: the ID of the target node
label: the label of the arc. As labels, we selected the option ``Weights on the query and reference''. Since the input graphs were un-weighted, edge labels will be used instead of weights. The label <NULL> indicates that an edge is absent from one input network.
color and status: the status of the arc indicates whether it is found at the intersection, or in one graph only. A color code reflects this status, as indicated below.
- R.and.Q: arcs found at the intersection between graphs and . Default color: green.
- R.not.Q: arcs found in graph but not in graph . Default color: violet.
- Q.not.R: arcs found in graph but not in graph . Default color: red.
The result file contains several thousands of arcs, and we will of course not inspect them by reading each row of this file. Instead, we can generate a drawing in order to obtain an intuitive perception of the graph.

Sizes of the union, intersection and differences

The beginning of the result file gives us some information about the size of the two input files, their union, intersection, and differences.

; Counts of nodes and arcs
;    Graph    Nodes    Arcs     Description
;    R        779      786      Reference graph
;    Q        926      865      Query graph
;    QvR      1359     1529     Union
;    Q^R      346      122      Intersection
;    Q!R      580      743      Query not reference
;    R!Q      433      664      Reference not query

Statistical significance of the intersection between two graphs

The next lines of the result file give some statistics about the intersection between the two graphs. These statistics are computed in terms of arcs.

; Significance of the number of arcs at the intersection
;  Symbol   Value     Description                        Formula
;  N        1359      Nodes in the union                    
;  M        922761    Max number of arcs in the union    M = N*(N-1)/2
;  E(Q^R)   0.74      Expected arcs in the intersection  E(Q^R) = Q*R/M
;  Q^R      122       Observed arcs in the intersection
;  perc_Q   14.10     Percentage of query arcs           perc_Q = 100*Q^R/Q
;  perc_R   15.52     Percentage of reference arcs       perc_R = 100*Q^R/R
;  Jac_sim  0.0798    Jaccard coefficient of similarity  Jac_sim = Q^R/(QvR)
;  Pval     2.5e-228  P-value of the intersection        Pval=P(X >= Q^R)

A first interesting point is the maximal number of arcs () that can be traced between any two nodes of the union graph. In our study case, the graph obtained by merging Ito's and Uetz' data contains nodes. This graph is un-directed, and there are no self-loops. The maximal number of arcs is thus . This number seems huge, compared to the number of arcs observed in either Uetz' () or Ito's () graphs. This means that these two graphs are sparse: only a very small fraction of the node pairs are linked by an arc.

The next question is to evaluate the statistical significance of the intersection between the two graphs. For this, we can already compute the size that would be expected if we select two random sets of arcs of the same sizes as above (, ).

If the same numbers of arcs were picked up at random in the union graph, we could estimate the probability for an arc to be found in the network as follows: . Similarly, the probability for an arc of the union graph to be part of the network is . The probability for an arc to be found independently in two random networks of the same sizes as and is the product of these probabilities.

$\begin{displaymath}P(QR) = P(Q)*P(R) = A_R/M \cdot A_Q/M = 7.98e-07\end{displaymath}$

The number of arcs expected by chance in the intersection is the probability multiplied by the maximal number of arcs.

$\begin{eqnarray*} E(QR) & = & P(QR) \cdot M \\ & = & (A_Q \cdot A_R)/M \\ & = & 7.98e-07 \cdot 922761 = 0.74 \end{eqnarray*}$

Thus, at the intersection between two random sets of interaction, we would expect on the average a bit less than one interaction. It seems thus clear that the 122 interactions found at the intersection between he two published experiments is much higher than the random expectation.

We can even go one step further, and compute the P-value of this intersection, i.e. the probability to select at least that many interactions by chance.

The probability to observe exactly arcs at the intersection is given by the hypergeometrical distribution.

$\begin{displaymath} P(QR=x)=\frac{C^{x}_{R}C^{Q-x}_{M-R}}{C^Q_{M}} \end{displaymath}$

(3.1)

where

: is the number of arcs in the reference graph;
: i the number of arcs in the query graph;
: is the maximal number of arcs;
: is the number of arcs at the intersection between the two graphs.

By summing this formula, we obtain the P-value of the intersection, i.e. the probability to observe at least arcs at the intersection.

$\begin{eqnarray*} Pval = P(QR>=x)=\sum_{i=x}^{min(Q,R)}P(X=i)=\sum_{i=x}^{min(Q,R)} \frac{C^i_{R}C^{Q-i}_{M-R}}{C^Q_{M}} \end{eqnarray*}$

We can replace the symbols by the numbers of our study case.

$\begin{eqnarray*} Pval & = & P(QR>=122) \\ & = & \sum_{i=x}^{min(865,786)} \fr... ...{786}C^{865-i}_{922761-786}}{C^{865}_{922761}} \\ & = 2.5e-228 \end{eqnarray*}$

This probabilty is so small that it comes close to the limit of precision of our program ( $\approx 10^{-321}$ ).

Summary

In summary, the comparison revealed that the number of arcs found in common between the two datasets (Ito and Uetz) is highly significant, despite the apparently small percentage of the respective graphs it represents (14.10% of Ito, and 15.52% of Uetz).

Strengths and weaknesses of the approach

Exercises

Using the tool the tool network randomization, generate two random graphs of 1000 nodes and 1000 arcs each (you will need to store these random networks on your hard drive). Use the tool network comparison to compare the two random graphs. Discuss the result, including the following questions:
1. What is the size of the intersection ? Does it correspond to the expected value ?
2. Which P-value do you obtain ? How do you interpret this P-value ?
Randomize Ito's network with the tool network randomization, and compare this randomized graph with Uetz' network. Discuss the result in the same way as for the previous exercise.

Troubleshooting

The P-value of the intersection between two graphs is 0. Does it mean that it is impossible to have such an intersection by chance alone ?
No. Any intersection that you observe in practice might occur by chance, but the limit of precision for the hypergeometric P-value is $\approx 10^{-321}$ . Thus, a value of 0 can be interpreted as $Pval < 10^{-321}.$
The web server indicates that the result will appear, and after a few minutes my browser displays a message ``No response the server''.
How big are the two graphs that you are comparing ? In principle, compare-graphs can treat large graphs in a short time, but if your graphs are very large (e.g. several hundreds of thousands of arcs), the processing time may exceed the patience of your browser. In such case, you should consider either to install the stand-alone version of NeATon your computer, or write a script that uses NeATvia their Web services interface.

Next: Node degree statistics Up: Network Analysis Tools (NeAT) Previous: Network visualization and format Contents

RSAT 2009-09-04