
3.4 Algorithms and complexity for finding and enumerating stories

3.4.1 Preprocessing the graph

First, we show how a graph may be simplified without essentially changing its set of stories. From a theoretical point of view, the simplifications allow shorter proofs of our results. From a biological point of view, the simplified graphs obtained by these preprocessing steps are interesting in their own right, since they give a more compact representation of the input that is equivalent in terms of story sets.

For the applications of the story enumeration method described in this work, the input graph is a compound graph representation of a metabolic network. In this case, not all paths between black nodes are biologically meaningful. Metabolic reactions are in many cases of the form m₁ + s₁ → m₂ + s₂, where m₁ and m₂ are the main compounds and s₁ and s₂ are so-called side compounds. For example, a typical pair of side compounds is ATP and ADP, which appear in many reactions with the role of providing energy for the reaction to take place. The problem, however, is that a path from m₁ to ATP will be present in the compound graph even if there was no direct exchange of atoms between these two compounds. Consider a second reaction transforming m₃ into m₄ that also uses these side compounds. Figure 3.4 shows a shortcut connecting m₁ to m₄ passing through ATP. Notice that we could also connect m₃ to m₂ simply by passing through ADP. Considering that a large portion of the reactions are reversible, it is clear that almost any pair of nodes may be connected within a few steps. One way to avoid these unrealistic paths is, given some definition of arc weight, to compute the graph corresponding to the union of all lightest paths between all pairs of black nodes and to eliminate the remainder of the network.

Figure 3.4: An example of shortcuts in a metabolic network caused by the presence of hubs like ATP and ADP.

Notice how in the compound graph representation there is a path from m₁ to m₄ passing through ATP that does not exist in the hypergraph representation.

For the computed paths to be more biologically meaningful than simply the shortest ones in terms of reactions, we may adopt as arc weight the out-degree of the target vertex. The weight of a path is then the sum of the weights of its arcs. Naturally, ATP and ADP have high degrees in the compound graph since they are involved in a large number of reactions as side compounds and, therefore, shortcuts passing through them tend to be avoided with the selected weight policy. An even more sophisticated approach to compute meaningful paths is to consider the exchange of carbon atoms between the metabolites and to trace a route between the source and the target of interest in such a way that in every step there is a carbon atom passing from one metabolite to the other (Boyer and Viari (2003); Arita (2004); Blum and Kohlbacher (2008)). However, this approach needs more detailed data and algorithms, such as a graph representation of the chemicals of the whole network, that were not available at the time our experiments were performed. We intend, though, to use this approach in the near future in order to check how it improves our results, especially for the application to automatically recovering metabolic pathways that will be described later. In the remainder of this chapter, we consider that the input graph is a collection of lightest paths between all pairs of black nodes.
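The lightest-path filtering described above can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: the function names `dijkstra` and `lightest_path_subgraph`, the arc-list input format, and the toy network are all assumptions made for the example. An arc (u, v) is kept when it lies on at least one lightest path between some ordered pair of black nodes, with the weight of (u, v) taken to be the out-degree of v.

```python
import heapq

def dijkstra(adj, weight, source):
    """Weights of the lightest paths from `source` to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj.get(u, ()):
            nd = d + weight(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def lightest_path_subgraph(arcs, black):
    """Union of all lightest paths between ordered pairs of black nodes."""
    succ, pred = {}, {}
    for u, v in arcs:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)

    def out_degree(x):
        return len(succ.get(x, ()))

    # Forward search: the arc (u, v) weighs the out-degree of its target v.
    from_black = {b: dijkstra(succ, lambda u, v: out_degree(v), b) for b in black}
    # Search on the reversed graph: stepping from v back to u uses the
    # original arc (u, v), whose weight is the out-degree of v.
    to_black = {b: dijkstra(pred, lambda v, u: out_degree(v), b) for b in black}

    kept = set()
    for s in black:
        for t in black:
            if s == t or t not in from_black[s]:
                continue
            best = from_black[s][t]
            for u, v in arcs:
                if u in from_black[s] and v in to_black[t]:
                    # (u, v) is on a lightest s-t path iff the weights add up.
                    if from_black[s][u] + out_degree(v) + to_black[t][v] == best:
                        kept.add((u, v))
    return kept

# Toy network (hypothetical): ATP is a hub with out-degree 3, so the
# shortcut m1 -> ATP -> m4 is heavier than the route m1 -> a -> m4.
arcs = [("m1", "a"), ("a", "m4"), ("m1", "ATP"),
        ("ATP", "m4"), ("ATP", "x"), ("ATP", "y")]
kept = lightest_path_subgraph(arcs, {"m1", "m4"})
print(sorted(kept))  # only the arcs of m1 -> a -> m4 remain
```

Both routes from m1 to m4 have two arcs, so a plain shortest-path criterion would keep the ATP shortcut; the out-degree weighting is what discards it here. When several paths tie for the lightest weight, this sketch keeps all of them.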

We now define the following four simplification operations:

• A white source and target removal consists in iteratively removing from the graph a white node that is either a source or a target. Clearly such nodes cannot appear in any story. Let de(G) be the graph resulting from such removals.


• A self-loop removal consists in removing all arcs of the form (u, u). Since stories are acyclic, such arcs do not appear in any story. Let sl(G) be the graph resulting from such removals.

• A forward bottleneck removal consists in removing a white node v whose out-degree is equal to 1, and directly connecting every predecessor of v to the unique successor of v (without creating multi-arcs). Let fb(G, v) be the resulting graph.

• A backward bottleneck removal consists in removing a white node v whose in-degree is equal to 1, and directly connecting the unique predecessor of v to every successor of v (without creating multi-arcs). Let bb(G, v) be the resulting graph.
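As an illustration of the two bottleneck operations, here is a minimal sketch that stores a graph as a set of arcs, so that multi-arcs cannot arise. The function names mirror the notation fb(G, v) and bb(G, v), but the representation and the helpers `predecessors` and `successors` are assumptions made for the example. The sketch also assumes that v carries no self-loop.

```python
def predecessors(arcs, v):
    return {a for a, b in arcs if b == v}

def successors(arcs, v):
    return {b for a, b in arcs if a == v}

def fb(arcs, v):
    """Forward bottleneck removal of the white node v."""
    (s,) = successors(arcs, v)  # raises ValueError unless out-degree is 1
    bypass = {(p, s) for p in predecessors(arcs, v)}
    return {(a, b) for a, b in arcs if v not in (a, b)} | bypass

def bb(arcs, v):
    """Backward bottleneck removal of the white node v."""
    (p,) = predecessors(arcs, v)  # raises ValueError unless in-degree is 1
    bypass = {(p, s) for s in successors(arcs, v)}
    return {(a, b) for a, b in arcs if v not in (a, b)} | bypass

# v has two predecessors and a single successor c.
forward_example = fb({("a", "v"), ("b", "v"), ("v", "c")}, "v")

# v has a single predecessor a and two successors.
backward_example = bb({("a", "v"), ("v", "b"), ("v", "c")}, "v")

# Removing a bottleneck can create a self-loop: a -> v -> a becomes a -> a.
loop_example = fb({("a", "v"), ("v", "a")}, "v")
```

The last example shows why self-loop removal has to be applied again after bottleneck removals.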

We prove that the last two operations leave the set of stories essentially unaltered. First an observation:

Observation 1. Let v, p, and s be three nodes such that (p, v), (v, s), (p, s) ∈ E and v is a (white) bottleneck. Then, for any story S, (p, v), (v, s) ∈ S if and only if (p, s) ∈ S.

Proof. The claim follows from two facts. First, (p, v) and (v, s) belong to a cycle C of G if and only if the arc (p, s) belongs to the cycle C′ of G which contains, next to (p, s), all the arcs of C except (p, v) and (v, s). Second, (p, s) and the pair (p, v), (v, s) will create the same white sources or targets, if any.

Given three nodes v, p, s ∈ V with (p, v), (v, s) ∈ E and (p, s) ∉ E, let ab(G, v, p, s) denote the graph obtained by adding to G the arc (p, s).

Lemma 1. Let v ∈ W be a forward bottleneck and let p, s ∈ V be such that (p, v), (v, s) ∈ E and (p, s) ∉ E. Then there exists a bijection from Σ(G) to Σ(ab(G, v, p, s)).

Proof. For any story S ∈ Σ(G), we define f(S) = S ∪ {(p, s)} if (p, v) ∈ S (and hence (v, s) ∈ S, since v is a forward bottleneck), otherwise f(S) = S. To prove that f(S) ∈ Σ(ab(G, v, p, s)), we use Observation 1 to show that f(S) is acyclic if and only if S is acyclic. We now show that f(S) is maximal. Indeed, if (p, s) ∈ f(S), then no set of arcs could be added to f(S), since otherwise it could also be added to S. Otherwise, if (p, s) could be added to f(S), then, from Observation 1, (p, v) and (v, s) could also be added to f(S) and, hence, these two arcs could be added to S.

Let us now prove that, if S₁ and S₂ are two stories such that S₁ ≠ S₂, then f(S₁) ≠ f(S₂). If (p, v) ∉ S₁ ∪ S₂, then f(S₁) = S₁ ≠ S₂ = f(S₂). Otherwise, if (p, v) ∈ S₁ ∩ S₂, then f(S₁) = S₁ ∪ {(p, s)} ≠ S₂ ∪ {(p, s)} = f(S₂). Finally, if (p, v) ∈ S₁ \ S₂ (the other case can be dealt with similarly), then (p, s) ∈ f(S₁) while (p, s) ∉ f(S₂) and, hence, f(S₁) ≠ f(S₂).

It remains to show that, for any S′ ∈ Σ(ab(G, v, p, s)), there exists an S ∈ Σ(G) such that f(S) = S′. Define S = S′ \ {(p, s)}. Since S′ is acyclic, so is S. If (p, s) ∉ S′, then S = S′ and S ∈ Σ(G), since the only difference between G and ab(G, v, p, s) is the arc (p, s). Otherwise, from Observation 1, it follows that (p, v), (v, s) ∈ S′ and, hence, (p, v), (v, s) ∈ S: the maximality of S then follows from the maximality of S′, since any set of arcs that could be added to S could also be added to S′.

By this lemma we may assume that, for any forward bottleneck v ∈ W whose unique successor is s, and for any predecessor p of v, the graph contains the arc (p, s). To complete the forward bottleneck removal operation, we then need to delete the vertex v without changing the story set of the graph. Consider now the following operation: given a graph G with a forward bottleneck v, let dp(G, v) denote the graph obtained by deleting from G the vertex v and all incident arcs.


Lemma 2. Let v ∈ W be a forward bottleneck and s its unique successor. Suppose that for any predecessor p of v, the graph contains the arc (p, s). Then there is a bijection from Σ(G) to Σ(dp(G, v)).

Proof. For any S ∈ Σ(G), we define f(S) = S \ {v}, that is, the subgraph obtained by removing v and all incident arcs from S if v ∈ S. Since S is acyclic, so is f(S). Moreover, from Observation 1, it follows that if (p, v), (v, s) ∈ S, then (p, s) ∈ S and, hence, (p, s) ∈ f(S). The maximality of f(S) then follows from the maximality of S, since any set of arcs that could be added to f(S) could also be added to S.

Let us now prove that, if S₁ and S₂ are two stories such that S₁ ≠ S₂, then f(S₁) ≠ f(S₂). If (p, s) ∉ S₁ ∪ S₂, then (p, v), (v, s) ∉ S₁ ∪ S₂ and f(S₁) = S₁ ≠ S₂ = f(S₂). Otherwise, if (p, s) ∈ S₁ ∩ S₂, then (p, v), (v, s) ∈ S₁ ∩ S₂ and f(S₁) = S₁ \ {(p, v), (v, s)} ≠ S₂ \ {(p, v), (v, s)} = f(S₂). Finally, if (p, s) ∈ S₁ \ S₂ (the other case can be dealt with similarly), then (p, s) ∈ f(S₁) while (p, s) ∉ f(S₂) and, hence, f(S₁) ≠ f(S₂).

Finally, let S′ be a story of dp(G, v). Then the set S obtained by adding to S′ the path (p, v), (v, s) for every predecessor p of v such that (p, s) ∈ S′ is clearly a story, and f(S) = S′.

Using the two previous lemmas, we obtain a justification for the third simplification operation.

Theorem 1. For any forward bottleneck v ∈ W, Σ(G) = Σ(fb(G, v)).

Analogously, we can justify the fourth operation.

Theorem 2. For any backward bottleneck v ∈ W, Σ(G) = Σ(bb(G, v)).

For any graph G, let fb(G) (respectively bb(G)) denote the graph obtained by applying the forward (respectively backward) bottleneck removal operation as many times as possible.

Notice that, even if G does not contain self-loops, it might happen that fb(G) (respectively bb(G)) contains self-loops created by a bottleneck removal. Recall also that sl(G) denotes the graph obtained by the removal of all self-loops from G, and de(G) denotes the graph obtained by the iterative removal of all white sources and targets from G. Our simplification procedure can now be described as follows.

(1) Let G₀ = sl(de(G)) and let i = 0.

(2) Let Gᵢ₊₁ = sl(bb(sl(fb(Gᵢ)))).

(3) If Gᵢ₊₁ = Gᵢ then return Gᵢ, otherwise let i = i + 1 and go to Step 2.
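The three steps above can be sketched as a fixed-point loop. This is a hedged illustration under stated assumptions: graphs are sets of arcs (so isolated nodes are not represented), `white` is the set W, and the helper names `contract_bottlenecks` and `simplify` as well as the five-node example are hypothetical. A white node that carries a self-loop is skipped until sl has removed the loop.

```python
def sl(arcs):
    """Remove all self-loops."""
    return {(u, v) for u, v in arcs if u != v}

def de(arcs, white):
    """Iteratively remove white sources and white targets."""
    arcs = set(arcs)
    while True:
        tails = {u for u, _ in arcs}
        heads = {v for _, v in arcs}
        # A white source has outgoing arcs only, a white target incoming only.
        doomed = {w for w in white if (w in tails) != (w in heads)}
        if not doomed:
            return arcs
        arcs = {(u, v) for u, v in arcs if u not in doomed and v not in doomed}

def contract_bottlenecks(arcs, white, forward):
    """fb(G) if forward is True, bb(G) otherwise."""
    arcs = set(arcs)
    changed = True
    while changed:
        changed = False
        for v in sorted(white):
            out = {b for a, b in arcs if a == v}
            inn = {a for a, b in arcs if b == v}
            side = out if forward else inn
            if len(side) == 1 and v not in side:
                # Bypass v: connect each predecessor to each successor.
                bypass = {(p, s) for p in inn for s in out}
                arcs = {(a, b) for a, b in arcs if v not in (a, b)} | bypass
                changed = True
    return arcs

def simplify(arcs, white):
    g = sl(de(arcs, white))                               # Step 1
    while True:
        fwd = sl(contract_bottlenecks(g, white, True))
        nxt = sl(contract_bottlenecks(fwd, white, False))  # Step 2
        if nxt == g:                                      # Step 3
            return g
        g = nxt

# Hypothetical example: black nodes a, b, c and white nodes x, y.
example = {("a", "x"), ("b", "x"), ("x", "c"), ("c", "y"), ("y", "a")}
simplified = simplify(example, {"x", "y"})
print(sorted(simplified))
```

In this example both white nodes are forward bottlenecks, so a single pass of Step 2 removes them and the second pass only confirms that a fixed point has been reached. Each contraction deletes one white node for good, so the loop performs at most |W| contractions in total.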

As a consequence of the previous results, if H is the graph returned by this procedure, then there is a bijection between Σ(G) and Σ(H), and we may enumerate Σ(H) instead. Hence, from now on, we assume that any v ∈ W has d⁺(v) > 1 and d⁻(v) > 1. Notice that this avoids graphs like the one shown in Figure 3.3. Indeed, in this case, the two arcs (c, y) and (y, a) would disappear and the arc (c, a) would be inserted. Furthermore, x would also disappear and we would get the arcs (b, c) and (a, c). Observe also that this simplification procedure does not guarantee that a minimal FAS enumerator would produce all possible minimal SAS.

We applied the preprocessing steps described in this section to a collection of 107 metabolic networks obtained from MetExplore (Cottret et al. (2010a)). We randomly chose sets of black nodes with sizes varying from 5% to 15% of the total number of nodes of the graph. For each pair of metabolic network and set of randomly chosen black nodes, we then computed the subgraph corresponding to the lightest paths between all pairs of black nodes. These subgraphs vary in number of vertices from 42% to 98% of the number of vertices in the original input graph, and in number of arcs from 46% to 68% of the number of arcs. On average, extracting all lightest paths between the black nodes gives a graph with 68% of the nodes and 69% of the arcs of the original input graph. We then applied the four simplification operations to this new collection of graphs: white source and target removal, self-loop elimination, and forward and backward bottleneck removal. The compression ratio on the number of nodes goes from 65% to 98% with an average reduction of 83%, while the compression ratio on the number of arcs goes from 56% to 99% with an average reduction of 77%, with respect to the original input graph. Overall, this amounts to a reduction of 60% due to the lightest paths and an additional 20% due to the graph simplifications. This more compact representation of the interactions between black nodes greatly facilitates the visualisation and analysis of the input data.
