A Parallel Approximation Hitting Set Algorithm for Gene Expression Analysis

(1)

D. P. Ruchkys

Universidade de São Paulo

S. W. Song

Universidade de São Paulo

(2)

• Given an experiment where expression levels of thousands of genes are measured.

• We consider the problem of determining which genes affect the expression level of a given gene.

(3)

Our Problem

• Given an experiment with n genes of a set E = {a0, a1, ..., a_{n−1}} whose expression levels are measured in a time series of m measurements (typically n >> m), we have a total of nm values, each 0 or 1.

• Our algorithm (based on Ideker et al. [ITK00]) receives an m × n matrix of such values and determines, for a given gene a_{n−1}, which other genes are responsible for the expression level of a_{n−1}.

• Example: [Figure: expression matrix with columns x0, x1, x2, x3 and rows p0, p1, p2]

(4)

Infer the truth table for a3 of the matrix E shown.

[Figure: matrix M with columns x0, x1, x2, x3 and rows p0, ..., p4; entries are 0, 1, and the marks - and +]

(1) In step (1), the expression levels of a3 differ in the row pairs (0,1), (0,3), (1,2) and (2,3).

We find:

• for (0,1), S01 = {a0, a2}, containing all the other genes whose expression levels also differ between rows p0 and p1.

• the same is done for (0,3), giving S03 = {a2}.

• for (1,2), S12 = {a0, a1}.

(5)

Result of Step 1

Result of Step 1: S01 = {a0, a2}, S03 = {a2}, S12 = {a0, a1}, S23 = {a1}.
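Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix `rows` below is hypothetical (it is not the matrix on the slide, and it contains no perturbation marks), chosen so that it yields the same four sets; the function name `differing_sets` is also ours.

```python
from itertools import combinations

def differing_sets(matrix, target):
    """For each row pair where the target gene differs, collect the
    other genes whose values differ in that same pair of rows."""
    sets = {}
    for i, j in combinations(range(len(matrix)), 2):
        if matrix[i][target] != matrix[j][target]:
            sets[(i, j)] = {
                g for g in range(len(matrix[i]))
                if g != target and matrix[i][g] != matrix[j][g]
            }
    return sets

# Hypothetical fully observed binary matrix, columns a0..a3 (assumption).
rows = [
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
]
step1 = differing_sets(rows, target=3)
for pair, genes in sorted(step1.items()):
    print(pair, sorted(genes))
```

With m rows there are at most m(m−1)/2 row pairs, and each pair is compared over n columns, so this construction costs O(m²n) comparisons.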

(2) In Step (2), find S_min = {a1, a2}, the smallest set such that each one of the sets S_ij of the previous step contains at least one element of S_min.
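For an instance this small, S_min can be checked by exhaustive search. The sketch below is not the algorithm of the talk: it simply tries candidate subsets in order of increasing size, which takes exponential time in general. The function name `minimum_hitting_set` is an assumption made for illustration, and genes are written as the integers 0, 1, 2.

```python
from itertools import combinations

def minimum_hitting_set(universe, sets):
    """Return a smallest subset of `universe` that intersects every set.
    Exhaustive search: only sensible for tiny instances."""
    for size in range(len(universe) + 1):
        for candidate in combinations(sorted(universe), size):
            if all(set(candidate) & s for s in sets):
                return set(candidate)
    return None  # reached only if some set cannot be hit (e.g. it is empty)

step1_sets = [{0, 2}, {2}, {0, 1}, {1}]  # S01, S03, S12, S23
s_min = minimum_hitting_set({0, 1, 2}, step1_sets)
print(s_min)
```

No single gene hits both S03 = {a2} and S23 = {a1}, so any hitting set needs both a1 and a2, and those two already hit S01 and S12.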

(6)

• Given a finite set E and a finite collection S = {S_1, ..., S_w} of subsets of E, find a subset A ⊆ E of the smallest size such that A ∩ S_i ≠ ∅ for all i = 1, ..., w.
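The same problem can be written as a 0/1 integer program. This is a standard textbook formulation rather than something shown on the slides; the variable x_e (our notation) is 1 exactly when element e is placed in A.

```latex
\begin{align*}
\text{minimize}\quad   & \sum_{e \in E} x_e \\
\text{subject to}\quad & \sum_{e \in S_i} x_e \ge 1, && i = 1, \dots, w, \\
                       & x_e \in \{0, 1\},           && e \in E.
\end{align*}
```

Each constraint says that the set S_i contains at least one chosen element, which is the condition A ∩ S_i ≠ ∅.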

(7)

The Hitting Set Problem

[Figure: a ground set E with elements 1 to 9, a collection S of subsets of E, and a hitting set A]

(8)

Primal-Dual Approximation Algorithm [FMCF01]

• Due to Bar-Yehuda and Even [BYE81]; originally conceived for the minimum set cover problem.

• It is an α-approximation algorithm, where α = max_{1 ≤ i ≤ w} |S_i|.

• Since every S_i is a subset of the n genes in E, α = O(n).
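A minimal sketch of a scheme in the Bar-Yehuda and Even style, adapted to hitting set, is given below. It is an illustration under assumptions, not the formulation used in [FMCF01]: the function name `primal_dual_hitting_set`, the integer weights, and the order in which sets are scanned are all choices made here. For each set not yet hit, the smallest residual weight in the set is subtracted from all of its elements, and elements whose residual weight reaches zero enter the solution.

```python
def primal_dual_hitting_set(sets, weights):
    """Pay down residual weights on each unhit set; elements whose
    residual weight drops to zero are added to the hitting set.
    Assumes every set is non-empty and weights are positive integers."""
    residual = dict(weights)
    chosen = set()
    for s in sets:
        if chosen & s:
            continue                        # this set is already hit
        eps = min(residual[e] for e in s)   # amount charged to this set
        for e in s:
            residual[e] -= eps
            if residual[e] == 0:
                chosen.add(e)               # element fully paid for
    return chosen

example_sets = [{0, 2}, {2}, {0, 1}, {1}]   # S01, S03, S12, S23
unit_weights = {0: 1, 1: 1, 2: 1}
hs = primal_dual_hitting_set(example_sets, unit_weights)
alpha = max(len(s) for s in example_sets)
print(sorted(hs), alpha)
```

On the gene example with unit weights this returns three genes while the optimum has two, which is within the factor α = 2.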

(9)

The Hitting Set Problem

Greedy Approximation Algorithm [J74]

• Strategy: construct the set A by repeatedly choosing the element that occurs in the largest number of subsets of S not yet hit.

• The approximation ratio is ln|S| + 1.

• Since S has at most one subset per pair of rows, |S| = O(m²) and ln|S| + 1 = O(log m²) = O(log m).
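The greedy strategy can be sketched as follows. The function name `greedy_hitting_set` and the tie-breaking rule (smallest element first) are assumptions made for this illustration; the sketch recounts occurrences in every round for clarity rather than speed, and it assumes that no set is empty.

```python
def greedy_hitting_set(sets):
    """Repeatedly pick the element occurring in the most unhit sets.
    Ties are broken in favour of the smallest element (assumption)."""
    remaining = [set(s) for s in sets]
    chosen = []
    while remaining:
        counts = {}
        for s in remaining:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        best = min(counts, key=lambda e: (-counts[e], e))
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen

example_sets = [{0, 2}, {2}, {0, 1}, {1}]   # S01, S03, S12, S23
greedy = greedy_hitting_set(example_sets)
print(greedy)
```

On the gene example all three genes initially occur twice, so the tie-break picks a0 first and the result has three genes, one more than the optimum {a1, a2}. This is consistent with the ratio ln|S| + 1, which is a worst-case bound and not a promise of optimality.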

(10)

Greedy Approximation Algorithm

[Figure: the ground set E, the collection S of subsets and the hitting set A built by the greedy algorithm]

(11)

The Hitting Set Problem Greedy Approximation Algorithm

[Figure: the ground set E and the collection S of subsets during the greedy algorithm]

(12)

Greedy Approximation Algorithm

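[Figure: the greedy algorithm on the gene example, showing the sets in S over the genes a0, a1, a2, a3 of E and the hitting set A]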

(13)

The Sequential Algorithm

[Figure: data structures of the sequential algorithm built from the first rows of the matrix (columns x0 to x3): the gene vector with its occurrence lists, and the set vector with the fields i1, j1 and covered]

(14)

[Figure: gene vector with occurrence lists and set vector (fields i1, j1, covered) for the matrix with columns x0 to x3 and rows p0, p1, p2; all covered flags are false]

(15)

The Sequential Algorithm

[Figure: gene vector and set vector after gene 0 is chosen, HS:{0}; the sets containing gene 0 have their covered flag set to true]

(16)

Time and Space Complexities

• To construct the data structures: O(m²n).

• Let k be the size of the hitting set. We have to find, k times, the element with the largest number of occurrences, which takes O(kn) time in total.

• For each such element, we have to update the data structures in O(m²n) time. Since we have k elements, the total time to update the data structures is O(km²n).

(17)

The Sequential Algorithm Time and Space Complexities

• The total time complexity is therefore O(m²n) + O(kn) + O(km²n) = O(km²n).

• The size k of the hitting set is O(m²), since there is at most one set per pair of rows and one element per set suffices. Therefore, the time complexity of the algorithm can be expressed as O(m⁴n).

• The space complexity is O(m²n).
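One possible layout of the gene vector, set vector, occurrence lists and covered flags is sketched below. The slides name these structures but do not give their exact fields, so the dictionary keys, the function name `sequential_hitting_set` and the input matrix are assumptions made for illustration. The matrix is hypothetical and fully observed (no perturbation marks).

```python
from itertools import combinations

def sequential_hitting_set(matrix, target):
    genes = [g for g in range(len(matrix[0])) if g != target]
    set_vector = []                         # one entry per differing row pair
    occurrences = {g: [] for g in genes}    # gene -> indices into set_vector
    for i, j in combinations(range(len(matrix)), 2):
        if matrix[i][target] == matrix[j][target]:
            continue
        members = [g for g in genes if matrix[i][g] != matrix[j][g]]
        for g in members:
            occurrences[g].append(len(set_vector))
        set_vector.append({"i": i, "j": j, "covered": False, "genes": members})
    # Gene vector: number of uncovered sets in which each gene occurs.
    count = {g: len(occ) for g, occ in occurrences.items()}

    hitting_set = []
    while True:
        best = max(genes, key=lambda g: count[g])  # first maximum wins ties
        if count[best] == 0:
            break                           # no uncovered set can still be hit
        hitting_set.append(best)
        for idx in occurrences[best]:
            entry = set_vector[idx]
            if not entry["covered"]:
                entry["covered"] = True
                for g in entry["genes"]:
                    count[g] -= 1           # update the gene vector
    return hitting_set, set_vector

rows = [[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0]]
hs, sv = sequential_hitting_set(rows, target=3)
print(hs, [e["covered"] for e in sv])
```

The set vector holds at most m(m−1)/2 entries with up to n genes each, which matches the O(m²n) space bound. Each of the k selections scans the gene vector once.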

(18)

• The input matrix M is partitioned vertically (by columns), and each processor stores one block of columns.

• Example of the partitioning:

[Figure: the matrix M with columns a0, ..., a8 and rows 0, ..., m−1 (entries x_{i,j}), split by columns among processor 0, processor 1 and processor 2]

(19)

The Parallel Algorithm

• Each processor reads a piece of the input of size m × (n−1)/p.

• All the processors store a vector v, corresponding to the expression levels of the gene under study, a_{n−1}.

• Each processor p_i also stores a gene vector, with information about the genes for which it is responsible. The gene vector in each processor has size O(m²n/p).

• Each processor also has a set vector, such that only elements of set S of its

(20)

• Example:

[Figure: example matrix E with columns x0, ..., x4 and rows p0, ..., p4, together with the gene vector, set vector, occurrence lists and covered lists held by each processor]

(21)

The Parallel Algorithm Time and Space Complexities

• Time complexity: O(m⁴n/p).

• Requires O(k) communication rounds, where k is the size of the hitting set. In terms of m, this is O(m²) rounds.

• Requires O(m²n/p) space.
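The round structure can be imitated in a single process. The sketch below is a simulation under assumptions, not the implementation from the talk: the p "processors" are plain Python dictionaries, the all-reduce is a call to `max`, and the names `parallel_greedy`, `owned` and `local` are ours. Each simulated processor owns a contiguous block of gene columns, proposes its best gene, and the global maximum is then applied to the replicated covered flags.

```python
from itertools import combinations

def parallel_greedy(matrix, target, p):
    v = [row[target] for row in matrix]     # replicated on every processor
    pairs = [(i, j) for i, j in combinations(range(len(matrix)), 2)
             if v[i] != v[j]]
    genes = [g for g in range(len(matrix[0])) if g != target]
    chunk = -(-len(genes) // p)             # ceiling division
    owned = [genes[r * chunk:(r + 1) * chunk] for r in range(p)]
    # Local occurrence lists: indices of the pairs in which each gene differs.
    local = [{g: {k for k, (i, j) in enumerate(pairs)
                  if matrix[i][g] != matrix[j][g]} for g in block}
             for block in owned]
    owner = {g: r for r, block in enumerate(owned) for g in block}

    covered = set()                         # replicated covered flags
    hitting_set, rounds = [], 0
    while True:
        # Local step: each processor proposes (uncovered count, -gene).
        proposals = [max((len(occ[g] - covered), -g) for g in occ)
                     for occ in local if occ]
        if not proposals:
            break
        count, neg_gene = max(proposals)    # stands in for an all-reduce
        if count == 0:
            break
        rounds += 1                         # one communication round per gene
        gene = -neg_gene
        hitting_set.append(gene)
        covered |= local[owner[gene]][gene] # owner announces the sets now hit
    return hitting_set, rounds

rows = [[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0]]
hs, rounds = parallel_greedy(rows, target=3, p=2)
print(hs, rounds)
```

In this simulation the number of selection rounds equals the size k of the hitting set found, and the result does not depend on p because the same tie-breaking rule is applied globally.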

(22)

[Figure: running time in seconds (axis from 0 to 0.04) against the number of processors (2 to 8), with one curve per input size: 20x1024, 20x2048 and 20x4096]

(23)

Bibliographical References

[BYE81] R. Bar-Yehuda and S. Even. A linear time approximation algorithm for the weighted vertex cover problem. Journal of Algorithms, 2:198-203, 1981.

[FMCF01] C. G. Fernandes, F. K. Miyazawa, M. Cerioli, P. Feofiloff. Uma introdução sucinta a algoritmos de aproximação. 23º Colóquio Brasileiro de Matemática, 2001.

[ITK00] T. E. Ideker, V. Thorsson, R. Karp. Discovery of regulatory interactions through perturbation: inference and experimental design. Pacific Symposium on Biocomputing, 5:302-313, 2000.

[J74] D. S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.