A Parallel Approximation Hitting Set Algorithm for Gene Expression Analysis
D. P. Ruchkys
Universidade de São Paulo
S. W. Song
Universidade de São Paulo
• Consider an experiment in which the expression levels of thousands of genes are measured.
• We consider the problem of determining which genes affect the expression level of a given gene.
Our Problem
• Given an experiment with n genes forming a set E = {a0, a1, ..., a_{n−1}}, whose expression levels are measured in a time series of m measurements (typically n >> m), we have a total of nm values, each 0 or 1.
• Our algorithm (based on Ideker et al. [ITK00]) receives an m × n matrix of such values and determines, for a given gene a_{n−1}, which other genes are responsible for the expression level of a_{n−1}.
• Example.
[Figure: example matrix with columns x0 to x3 and rows p0 to p2.]
Infer the truth table for a3 of the matrix E shown.
[Figure: matrix M with columns x0 to x3 and rows p0 to p4.]
(1) In step (1), the expression levels of a3 differ in the row pairs (0,1), (0,3), (1,2) and (2,3).
We find:
• For (0,1), S01 = {a0, a2}: the set of all other genes whose expression levels also differ between rows p0 and p1.
• Likewise, for (0,3), S03 = {a2}.
• For (1,2), S12 = {a0, a1}.
• For (2,3), S23 = {a1}.
Result of Step 1: S01 = {a0, a2}, S03 = {a2}, S12 = {a0, a1}, S23 = {a1}.
(2) In Step (2), find Smin = {a1, a2}, a smallest set that has at least one element in common with each of the sets Sij of the previous step.
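Steps (1) and (2) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the 4 × 4 matrix is a hypothetical example chosen so that it reproduces the sets S01, S03, S12 and S23 listed above, the function names `difference_sets` and `minimum_hitting_set` are illustrative, and Step (2) is solved here by brute force, which is only practical for very small inputs.

```python
from itertools import combinations

# Hypothetical binary expression matrix: rows p0..p3, columns a0..a3.
# The last column (a3) is the gene under study.
MATRIX = [
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]


def difference_sets(matrix, target):
    """Step (1): for every row pair where the target gene differs,
    collect the other genes whose values also differ in that pair."""
    num_genes = len(matrix[0])
    sets = {}
    for i, j in combinations(range(len(matrix)), 2):
        if matrix[i][target] != matrix[j][target]:
            sets[(i, j)] = {
                g for g in range(num_genes)
                if g != target and matrix[i][g] != matrix[j][g]
            }
    return sets


def minimum_hitting_set(sets, universe):
    """Step (2), by exhaustive search: try candidate sets in order of
    increasing size and return the first one that hits every set."""
    for size in range(len(universe) + 1):
        for candidate in combinations(sorted(universe), size):
            chosen = set(candidate)
            if all(chosen & s for s in sets):
                return chosen
    return None  # some set is empty, so no hitting set exists


S = difference_sets(MATRIX, target=3)
S_MIN = minimum_hitting_set(list(S.values()), universe={0, 1, 2})
```

On this matrix, `S` maps the row pairs (0,1), (0,3), (1,2) and (2,3) to the gene index sets {0, 2}, {2}, {0, 1} and {1}, and `S_MIN` is {1, 2}.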
• Given a finite set E and a finite collection S = {S1, ..., Sw} of subsets of E, find a subset A ⊆ E of the smallest size such that A ∩ Si ≠ ∅ for all i = 1, ..., w.
The Hitting Set Problem
[Figure: a ground set E with elements 1 to 9, a collection S of subsets of E, and a hitting set A.]
Primal-Dual Approximation Algorithm [FMCF01]
• Due to Bar-Yehuda and Even [BYE81]; originally conceived for the minimum set cover problem.
• It is an α-approximation algorithm, where α = max{|Si| : i = 1, ..., w}.
• Since each Si contains at most n − 1 genes, α = O(n).
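A minimal Python sketch of a primal-dual scheme of this kind is shown below. It follows the usual textbook presentation of the Bar-Yehuda and Even technique adapted to hitting sets and may differ in detail from the formulation in [FMCF01]. The function name `primal_dual_hitting_set` and the optional `weight` argument are assumptions made for the illustration; with unit weights the procedure reduces to taking every element of each set that is not yet hit.

```python
def primal_dual_hitting_set(sets, weight=None):
    """Return a hitting set of size (or weight) at most alpha times the
    optimum, where alpha is the size of the largest set.

    sets   -- list of non-empty sets of hashable elements
    weight -- optional dict of positive integer weights (default: all 1)
    """
    elements = set().union(*sets)
    # Residual weight of each element; an element becomes "tight" at 0.
    residual = {e: (1 if weight is None else weight[e]) for e in elements}
    chosen = set()
    for s in sets:
        if chosen & s:
            continue  # this set is already hit
        # Raise the dual variable of s until some element becomes tight.
        delta = min(residual[e] for e in s)
        for e in s:
            residual[e] -= delta
            if residual[e] == 0:
                chosen.add(e)
    return chosen


example_sets = [{0, 2}, {2}, {0, 1}, {1}]
unit_result = primal_dual_hitting_set(example_sets)
```

On the four example sets, the unit-weight run picks {0, 1, 2}, one element more than the optimum {1, 2}, which is within the factor α = 2 guaranteed for this instance.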
Greedy Approximation Algorithm [J74]
• Constructs the set A by repeatedly choosing the element that occurs most often in the subsets of S not yet hit.
• The approximation ratio is ln|S| + 1.
• With |S| = O(m²) (one set per pair of rows), ln|S| + 1 = O(log m²) = O(log m).
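The greedy strategy can be sketched in a few lines of Python. This is an illustrative version only: the name `greedy_hitting_set` is an assumption, ties are broken in favour of the smallest element (the slides do not state a tie-breaking rule), and occurrences are recounted from scratch in every iteration for clarity rather than speed.

```python
def greedy_hitting_set(sets):
    """Greedy approximation: repeatedly pick the element contained in
    the largest number of sets that have not been hit yet.

    sets -- list of non-empty sets of comparable, hashable elements
    """
    remaining = [set(s) for s in sets]
    chosen = []
    while remaining:
        # Count occurrences of each element among the sets still unhit.
        counts = {}
        for s in remaining:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        # max() keeps the first maximum, so ties go to the smallest element.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen


example = greedy_hitting_set([{0, 2}, {2}, {0, 1}, {1}])
```

On these four sets every element initially occurs twice, so the greedy choice depends on tie-breaking; with the rule used here the result is [0, 1, 2], while the optimum {1, 2} has size 2.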
[Figure: step-by-step illustration of the greedy algorithm on a ground set E and a collection S, producing the hitting set A.]
The Sequential Algorithm
[Figure: data structures of the sequential algorithm for an example matrix with columns x0 to x3 and rows p0 to p2. Labels: gene vector, set vector, occurrence list, covered list, i1, j1. A later panel shows the state HS:{0}, with some covered entries set to true.]
Time and Space Complexities
• Constructing the data structures takes O(m²n) time.
• Let k be the size of the hitting set. We must find the element with the largest number of occurrences k times, at O(n) each, for a total of O(kn) time.
• For each such element, the data structures must be updated in O(m²n) time. With k elements, the total time to update the data structures is O(km²n).
• The total time complexity is therefore O(m²n) + O(kn) + O(km²n) = O(km²n).
• The size k of the hitting set is O(m²), since each chosen element hits at least one new set and there are O(m²) sets. Therefore, the time complexity of the algorithm can be expressed as O(m⁴n).
• The space complexity is O(m²n).
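The sequential algorithm and its data structures can be sketched as follows. The names gene vector, set vector, occurrence list, covered, i1 and j1 come from the slides, but the concrete layout (a dict of lists and a list of dicts) is an assumption, as is the hypothetical 4 × 4 input matrix. For brevity the sketch recounts the uncovered occurrences in each iteration instead of maintaining counters.

```python
from itertools import combinations


def build_structures(matrix, target):
    """Build the set vector (one entry per row pair in which the target
    gene differs) and the gene vector (for each other gene, the list of
    set indices in which it occurs)."""
    set_vector = [
        {"i1": i, "j1": j, "covered": False}
        for i, j in combinations(range(len(matrix)), 2)
        if matrix[i][target] != matrix[j][target]
    ]
    gene_vector = {}
    for g in range(len(matrix[0])):
        if g == target:
            continue
        # Occurrence list of gene g: indices of the sets that contain it.
        gene_vector[g] = [
            s for s, entry in enumerate(set_vector)
            if matrix[entry["i1"]][g] != matrix[entry["j1"]][g]
        ]
    return gene_vector, set_vector


def sequential_hitting_set(matrix, target):
    """Greedy hitting set over the structures built above."""
    gene_vector, set_vector = build_structures(matrix, target)

    def uncovered(g):
        return sum(1 for s in gene_vector[g] if not set_vector[s]["covered"])

    hitting_set = []
    while any(not entry["covered"] for entry in set_vector):
        best = max(sorted(gene_vector), key=uncovered)
        if uncovered(best) == 0:
            raise ValueError("an uncovered set contains no gene")
        hitting_set.append(best)
        for s in gene_vector[best]:
            set_vector[s]["covered"] = True  # update step
    return hitting_set


# Hypothetical input: rows p0..p3, columns a0..a3, target gene a3.
MATRIX = [[0, 0, 0, 0], [1, 0, 1, 1], [0, 1, 1, 0], [0, 0, 1, 1]]
HS = sequential_hitting_set(MATRIX, target=3)
```

With ties broken towards the smallest gene index, the first gene chosen on this input is 0 and the final result is [0, 1, 2].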
• The input matrix M is partitioned vertically (by columns), and each processor stores one block of columns.
• Example of the partitioning:
M =
             processor 0                processor 1                processor 2
             a0       a1       a2       a3       a4       a5       a6       a7       a8
row 0        x0,0     x0,1     x0,2     x0,3     x0,4     x0,5     x0,6     x0,7     x0,8
row 1        x1,0     x1,1     x1,2     x1,3     x1,4     x1,5     x1,6     x1,7     x1,8
row 2        x2,0     x2,1     x2,2     x2,3     x2,4     x2,5     x2,6     x2,7     x2,8
row 3        x3,0     x3,1     x3,2     x3,3     x3,4     x3,5     x3,6     x3,7     x3,8
...          ...      ...      ...      ...      ...      ...      ...      ...      ...
row m−1      xm−1,0   xm−1,1   xm−1,2   xm−1,3   xm−1,4   xm−1,5   xm−1,6   xm−1,7   xm−1,8
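A column partitioning of this kind can be sketched in Python as below. The function names `partition_columns` and `local_block` are illustrative, and the rule for distributing leftover columns (one extra column to each of the first processors) is an assumption; the slides only show the evenly divisible case.

```python
def partition_columns(num_genes, num_procs):
    """Split column indices 0..num_genes-1 into num_procs contiguous
    blocks whose sizes differ by at most one."""
    base, extra = divmod(num_genes, num_procs)
    blocks = []
    start = 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def local_block(matrix, columns):
    """Extract the sub-matrix (all rows, selected columns) that one
    processor would store."""
    return [[row[c] for c in columns] for row in matrix]


# Nine genes a0..a8 over three processors, as in the example.
blocks = partition_columns(9, 3)
```

For nine genes and three processors this yields the column blocks 0 to 2, 3 to 5 and 6 to 8.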
The Parallel Algorithm
• Each processor reads a piece of the input of size m × (n−1)/p, where p is the number of processors.
• All the processors store a vector v, corresponding to the expression levels of the gene under study, a_{n−1}.
• Each processor pi also stores a gene vector, with information about the genes for which it is responsible.
The gene vector in each processor has size O(m²n/p).
• Each processor also has a set vector, such that only elements of set S of its
• Example:
[Figure: example matrix E with columns x0 to x4 and rows p0 to p4, together with the gene vector, set vector (i1, j1), occurrence lists and covered list stored by each processor.]
The Parallel Algorithm Time and Space Complexities
• Time complexity: O(m⁴n/p).
• Requires O(k) communication rounds, where k is the size of the hitting set. In terms of m, this is O(m²) rounds.
• Requires O(m²n/p) space.
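The round structure of the parallel algorithm can be illustrated with a sequential simulation in Python. This is a sketch under stated assumptions, not the authors' code: no message-passing library is used, the "processors" are list entries, the reduction and broadcast are ordinary loops, and the names (`parallel_greedy_sim`, `local`, `covered`) as well as the tie-breaking rule are hypothetical. Each iteration of the main loop stands for one communication round.

```python
from itertools import combinations


def parallel_greedy_sim(matrix, target, num_procs):
    """Simulate the parallel greedy hitting set; return (genes, rounds)."""
    m = len(matrix)
    pairs = [(i, j) for i, j in combinations(range(m), 2)
             if matrix[i][target] != matrix[j][target]]
    genes = [g for g in range(len(matrix[0])) if g != target]
    size = -(-len(genes) // num_procs)  # ceiling division
    # local[r]: gene vector of processor r (gene -> occurrence list).
    local = []
    for r in range(num_procs):
        owned = genes[r * size:(r + 1) * size]
        local.append({
            g: [s for s, (i, j) in enumerate(pairs)
                if matrix[i][g] != matrix[j][g]]
            for g in owned
        })
    # Every processor keeps its own copy of the covered flags.
    covered = [[False] * len(pairs) for _ in range(num_procs)]
    hitting_set, rounds = [], 0
    while not all(covered[0]):
        # Local step: each processor proposes its best gene.
        candidates = []
        for r in range(num_procs):
            best = None
            for g, occ in local[r].items():
                count = sum(1 for s in occ if not covered[r][s])
                if best is None or count > best[0]:
                    best = (count, g)
            if best is not None:
                candidates.append((best[0], -best[1], r))
        # Global step (one round): reduce to the overall best proposal.
        if not candidates or max(candidates)[0] == 0:
            raise ValueError("an uncovered set contains no gene")
        _, neg_gene, owner = max(candidates)
        gene = -neg_gene
        # The owner broadcasts the sets hit by the chosen gene.
        for r in range(num_procs):
            for s in local[owner][gene]:
                covered[r][s] = True
        hitting_set.append(gene)
        rounds += 1
    return hitting_set, rounds


# Hypothetical input: rows p0..p3, columns a0..a3, target gene a3.
MATRIX = [[0, 0, 0, 0], [1, 0, 1, 1], [0, 1, 1, 0], [0, 0, 1, 1]]
RESULT = parallel_greedy_sim(MATRIX, target=3, num_procs=2)
```

On this input the simulation returns the hitting set [0, 1, 2] after 3 rounds, one round per chosen gene, which matches the O(k) bound on communication rounds.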
[Figure: running time in seconds (axis from 0 to 0.04) against the number of processors (2 to 8) for inputs of size 20×1024, 20×2048 and 20×4096.]
Bibliographical References
[BYE81] R. Bar-Yehuda and S. Even. A linear time approximation algorithm for the weighted vertex cover problem. Journal of Algorithms, 2:198-203, 1981.
[FMCF01] C. G. Fernandes, F. K. Miyazawa, M. Cerioli, P. Feofiloff. Uma introdução sucinta a algoritmos de aproximação. 23 Colóquio Brasileiro de Matemática, 2001.
[ITK00] T. E. Ideker, V. Thorsson, R. Karp. Discovery of regulatory interactions through perturbation: inference and experimental design. Pacific Symposium on
Biocomputing, 5:302-313, 2000.
[J74] D. S. Johnson. Approximation algorithms for
combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.