A Parallel Approximation Hitting Set Algorithm for Gene Expression Analysis

(1)

D. P. Ruchkys

Universidade de São Paulo

S. W. Song

Universidade de São Paulo

(2)

• We are given an experiment in which the expression levels of thousands of genes are measured.

• We consider the problem of determining which genes affect the expression level of a given gene.

(3)

Our Problem

• We are given an experiment with $n$ genes forming a set $E = \{a_0, a_1, \ldots, a_{n-1}\}$, whose expression levels are measured in a time series of $m$ measurements (typically $n \gg m$). This gives a total of $nm$ values, each 0 or 1.

• Our algorithm (based on Ideker et al. [ITK00]) receives an $m \times n$ matrix of such values and determines, for a given gene $a_{n-1}$, which other genes are responsible for the expression level of $a_{n-1}$.

• Example. [Matrix: rows $p_0$ to $p_2$ (measurements), columns $x_0$ to $x_3$ (genes); entries 0, 1 or "-".]

(4)

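Infer the truth table for $a_3$ from the matrix $M$ shown.

$$
M = \begin{array}{c|cccc}
 & x_0 & x_1 & x_2 & x_3 \\ \hline
p_0 & 1 & 1 & 1 & 1 \\
p_1 & - & 1 & 0 & 0 \\
p_2 & 1 & - & 0 & 1 \\
p_3 & 1 & 1 & - & 0 \\
p_4 & 1 & 1 & 1 & +
\end{array}
$$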

Step 1: The expression levels of $a_3$ differ in the row pairs (0,1), (0,3), (1,2) and (2,3). For each such pair we find:

• for (0,1), $S_{01} = \{a_0, a_2\}$, the set of all other genes whose expression levels also differ between rows $p_0$ and $p_1$;

• for (0,3), in the same way, $S_{03} = \{a_2\}$;

• for (1,2), $S_{12} = \{a_0, a_1\}$;

• for (2,3), $S_{23} = \{a_1\}$.

(5)

Result of Step 1

$S_{01} = \{a_0, a_2\}$, $S_{03} = \{a_2\}$, $S_{12} = \{a_0, a_1\}$, $S_{23} = \{a_1\}$.

Step 2: Find $S_{\min} = \{a_1, a_2\}$, a smallest set that contains at least one element of each of the sets $S_{ij}$ from the previous step.
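The two steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the 0/1 matrix `M` is a hypothetical input chosen so that it yields exactly the sets listed for Step 1, and the function names `difference_sets` and `minimum_hitting_set` are assumptions. Step 2 is solved here by exhaustive search, which is only feasible for tiny examples.

```python
from itertools import combinations

# Hypothetical 0/1 matrix: rows p0..p3, columns a0..a3 (a3 is the gene under study).
M = [
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]

def difference_sets(matrix, target):
    """Step 1: for each row pair where the target gene differs,
    collect the other genes that also differ in that pair."""
    sets = {}
    for i, j in combinations(range(len(matrix)), 2):
        if matrix[i][target] != matrix[j][target]:
            sets[(i, j)] = {
                g for g in range(len(matrix[0]))
                if g != target and matrix[i][g] != matrix[j][g]
            }
    return sets

def minimum_hitting_set(sets, universe):
    """Step 2 by brute force: try candidate subsets in order of size."""
    sets = list(sets)
    for size in range(len(universe) + 1):
        for candidate in combinations(sorted(universe), size):
            if all(set(candidate) & s for s in sets):
                return set(candidate)
    return None  # some set is empty, so no hitting set exists

S = difference_sets(M, target=3)
print(S)  # {(0, 1): {0, 2}, (0, 3): {2}, (1, 2): {0, 1}, (2, 3): {1}}
print(minimum_hitting_set(S.values(), {0, 1, 2}))  # {1, 2}
```

The exhaustive search takes time exponential in the number of genes, which is why the rest of the talk turns to approximation algorithms.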

(6)

• Given a finite set $E$ and a finite collection $S = \{S_1, \ldots, S_w\}$ of subsets of $E$, find a subset $A \subseteq E$ of smallest size such that $A \cap S_i \neq \emptyset$ for all $i = 1, \ldots, w$.

(7)

The Hitting Set Problem

[Figure: a ground set $E = \{1, \ldots, 9\}$, a collection $S$ of subsets of $E$, and a hitting set $A$.]

(8)

Primal-Dual Approximation Algorithm [FMCF01]

• Due to Bar-Yehuda and Even [BYE81]; originally conceived for the minimum set cover problem.

• It is an $\alpha$-approximation algorithm, where $\alpha = \max_{i=1}^{w} |S_i|$.

• For our problem, $\alpha = \max_{i=1}^{w} |S_i| = O(n)$.
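A minimal sketch of the primal-dual idea, adapted to the hitting set formulation, is given below. The function name `primal_dual_hitting_set` and the optional `cost` parameter are illustrative assumptions and are not taken from [FMCF01]. For each set not yet hit, the sketch raises that set's dual variable until some element becomes tight (its residual cost reaches zero) and adds every tight element to the solution.

```python
def primal_dual_hitting_set(sets, cost=None):
    """Primal-dual sketch: returns a hitting set whose cost is at most
    alpha times the optimum, where alpha is the size of the largest set."""
    sets = [set(s) for s in sets]
    elements = set().union(*sets)
    # Residual cost of each element; unit costs when none are given.
    residual = {e: (cost[e] if cost else 1) for e in elements}
    chosen = set()
    for s in sets:
        if not s:
            raise ValueError("an empty set cannot be hit")
        if chosen & s:
            continue  # already hit, its dual variable stays at zero
        # Raise the dual variable of s until an element becomes tight.
        delta = min(residual[e] for e in s)
        for e in s:
            residual[e] -= delta
            if residual[e] == 0:
                chosen.add(e)
    return chosen

# Sets from the gene example, with genes written as indices 0, 1, 2.
example = [{0, 2}, {2}, {0, 1}, {1}]
print(primal_dual_hitting_set(example))                             # {0, 1, 2}
print(primal_dual_hitting_set(example, cost={0: 3, 1: 1, 2: 1}))    # {1, 2}
```

With unit costs every element of an unhit set becomes tight at once, so the whole set is added. This is where the factor $\alpha = \max_i |S_i|$ comes from.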

(9)

The Hitting Set Problem

Greedy Approximation Algorithm [J74]

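• Strategy: construct the set $A$ by repeatedly choosing the element that occurs most often in the subsets of $S$ not yet hit.

• The approximation ratio is $\ln|S| + 1$.

• In our problem there is at most one set per pair of rows, so $|S| = O(m^2)$ and $\ln|S| + 1 = O(\log m^2) = O(\log m)$.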
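The greedy strategy can be sketched in a few lines of Python. This is a set-based illustration rather than the data structures used later in the talk. The function name and the tie-breaking rule (smallest element first) are assumptions.

```python
def greedy_hitting_set(sets):
    """Repeatedly pick the element occurring in the most sets not yet hit."""
    remaining = [set(s) for s in sets]
    if any(not s for s in remaining):
        raise ValueError("an empty set cannot be hit")
    chosen = []
    while remaining:
        # Count occurrences of each element among the sets not yet hit.
        counts = {}
        for s in remaining:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        # Highest count wins; ties are broken by the smallest element.
        best = min(counts, key=lambda e: (-counts[e], e))
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen

# Sets from the gene example, with genes written as indices 0, 1, 2.
print(greedy_hitting_set([{0, 2}, {2}, {0, 1}, {1}]))  # [0, 1, 2]
```

On this input all three genes occur twice, so the tie-break picks gene 0 first and the result has three elements, one more than the optimum $\{a_1, a_2\}$. The greedy choice is only guaranteed to be within the stated ratio, not optimal.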

(10)

Greedy Approximation Algorithm

[Figure: the ground set $E = \{1, \ldots, 9\}$, the collection $S$, and the hitting set $A$ built by the greedy algorithm.]

(11)

Greedy Approximation Algorithm

[Figure: the ground set $E = \{1, \ldots, 9\}$ and the collection $S$.]

(12)

Greedy Approximation Algorithm

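[Figure: the greedy algorithm on the gene example, with $E = \{a_0, a_1, a_2, a_3\}$, the collection $S$ of sets over $a_0$, $a_1$ and $a_2$, and the hitting set $A$.]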

(13)

The Sequential Algorithm

[Figure: the data structures for an example matrix with columns $x_0$ to $x_3$: a gene vector with an occurrence list per gene, and a set vector whose entries hold the fields $i_1$, $j_1$, covered and list.]

(14)

[Figure: the gene vector (with occurrence lists) and the set vector (fields $i_1$, $j_1$, covered, list) for the example matrix with rows $p_0$ to $p_2$ and columns $x_0$ to $x_3$; all covered flags are false.]

(15)

The Sequential Algorithm

[Figure: the gene vector and set vector after gene 0 is added to the hitting set (HS: {0}); some covered flags are now true.]

(16)

Time and Space Complexities

• Constructing the data structures takes $O(m^2 n)$ time.

• Let $k$ be the size of the hitting set. We have to find the element with the largest number of occurrences $k$ times, which takes $O(kn)$ time in total.

• For each such element, we have to update the data structures, which takes $O(m^2 n)$ time. Since there are $k$ elements, the total time to update the data structures is $O(k m^2 n)$.

(17)

The Sequential Algorithm: Time and Space Complexities

• The total time complexity is therefore $O(m^2 n) + O(kn) + O(k m^2 n) = O(k m^2 n)$.

• The size $k$ of the hitting set is $O(m^2)$. Therefore, the time complexity of the algorithm can be expressed as $O(m^4 n)$.

• The space complexity is $O(m^2 n)$.
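The bound on $k$ is stated without derivation. One way to obtain it, assuming the collection contains at most one set $S_{ij}$ per pair of rows $i < j$ and that every greedy choice hits at least one set not yet covered, is:

```latex
\begin{align*}
w = |S| &\le \binom{m}{2} = \frac{m(m-1)}{2} = O(m^2), \\
k &\le w = O(m^2), \\
O(k\, m^2 n) &= O(m^2 \cdot m^2 n) = O(m^4 n).
\end{align*}
```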

(18)

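• The input matrix $M$ is partitioned vertically (by columns), and each processor stores one part.

• Example of the partitioning, with columns $a_0, \ldots, a_8$ and rows $0, \ldots, m-1$:

$$
M = \begin{pmatrix}
x_{0,0} & x_{0,1} & x_{0,2} & x_{0,3} & x_{0,4} & x_{0,5} & x_{0,6} & x_{0,7} & x_{0,8} \\
x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} & x_{1,5} & x_{1,6} & x_{1,7} & x_{1,8} \\
x_{2,0} & x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} & x_{2,5} & x_{2,6} & x_{2,7} & x_{2,8} \\
x_{3,0} & x_{3,1} & x_{3,2} & x_{3,3} & x_{3,4} & x_{3,5} & x_{3,6} & x_{3,7} & x_{3,8} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_{m-1,0} & x_{m-1,1} & x_{m-1,2} & x_{m-1,3} & x_{m-1,4} & x_{m-1,5} & x_{m-1,6} & x_{m-1,7} & x_{m-1,8}
\end{pmatrix}
$$

[Figure: the columns of $M$ are divided among processor 0, processor 1 and processor 2.]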

(19)

The Parallel Algorithm

• Each processor reads a piece of the input of size $m \times \frac{n-1}{p}$, where $p$ is the number of processors.

• All the processors store a vector $v$ holding the expression levels of the gene under study, $a_{n-1}$.

• Each processor $p_i$ also stores a gene vector, with information about the genes for which $p_i$ is responsible. The gene vector in each processor has size $O\!\left(\frac{m^2 n}{p}\right)$.

• Each processor also has a set vector, such that only elements of set S of its
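The column partitioning can be sketched as below. It assumes contiguous blocks of nearly equal size over the $n-1$ candidate genes, with the last column (the gene under study) copied to every processor as the vector $v$. The function names `column_blocks` and `local_input` are hypothetical, and the slides do not say how leftover columns are distributed when $p$ does not divide $n-1$.

```python
def column_blocks(num_candidates, num_procs):
    """Split column indices 0..num_candidates-1 into contiguous blocks,
    one per processor, with sizes differing by at most one."""
    base, extra = divmod(num_candidates, num_procs)
    blocks, start = [], 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def local_input(matrix, rank, num_procs):
    """Return what processor `rank` would store: its column indices,
    its block of the matrix, and the shared target vector v."""
    n = len(matrix[0])
    cols = column_blocks(n - 1, num_procs)[rank]
    block = [[row[c] for c in cols] for row in matrix]
    v = [row[n - 1] for row in matrix]  # gene under study, replicated
    return cols, block, v

# Eight candidate genes a0..a7 over three processors.
print([list(b) for b in column_blocks(8, 3)])
# [[0, 1, 2], [3, 4, 5], [6, 7]]
```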

(20)

• Example: [Figure: an example matrix $E$ with rows $p_0$ to $p_4$ and columns $x_0$ to $x_4$, together with the gene vector (with occurrence lists) and the set vector (fields $i_1$, $j_1$, covered, list) held by each processor.]

(21)

The Parallel Algorithm: Time and Space Complexities

• Time complexity: $O\!\left(\frac{m^4 n}{p}\right)$.

• Requires $O(k)$ communication rounds, where $k$ is the size of the hitting set. In terms of $m$, this is $O(m^2)$ rounds.

• Requires $O\!\left(\frac{m^2 n}{p}\right)$ space.
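The slides do not spell out the communication pattern, so the following single-process simulation shows only one plausible reading of why $O(k)$ rounds suffice. In each round, every processor proposes its best local gene, a reduction selects the global best, and the sets hit by that gene are announced to all processors. The function name, the input layout (one dictionary of occurrence lists per processor) and the tie-breaking rule are assumptions.

```python
def simulate_parallel_greedy(occurrences_by_proc, num_sets):
    """occurrences_by_proc[r] maps each gene owned by processor r to the
    indices of the sets in which that gene occurs (its occurrence list)."""
    covered = [False] * num_sets  # replicated on every processor
    hitting_set, rounds = [], 0
    while not all(covered):
        # Local step: each processor finds its gene hitting most uncovered sets.
        candidates = []
        for occ in occurrences_by_proc:
            best = None
            for gene, set_ids in occ.items():
                count = sum(1 for s in set_ids if not covered[s])
                if best is None or (count, -gene) > (best[0], -best[1]):
                    best = (count, gene, set_ids)
            if best is not None:
                candidates.append(best)
        # Communication step: reduce to the global best candidate.
        if not candidates:
            raise ValueError("no genes available")
        count, gene, set_ids = max(candidates, key=lambda c: (c[0], -c[1]))
        if count == 0:
            raise ValueError("some set cannot be hit")
        # Broadcast: every processor marks the newly hit sets as covered.
        for s in set_ids:
            covered[s] = True
        hitting_set.append(gene)
        rounds += 1
    return hitting_set, rounds

# Gene example: set indices 0..3 stand for S01, S03, S12, S23.
procs = [{0: [0, 2], 1: [2, 3]}, {2: [0, 1]}]
print(simulate_parallel_greedy(procs, num_sets=4))  # ([0, 1, 2], 3)
```

One gene is added per round, so the number of rounds equals the size $k$ of the hitting set found. In a real message-passing implementation the reduction and broadcast would be collective operations.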

(22)

[Figure: running time in seconds (0 to 0.04) versus number of processors (2 to 8), for inputs of size 20 × 1024, 20 × 2048 and 20 × 4096.]

(23)

Bibliographical References

[BYE81] R. Bar-Yehuda and S. Even. A linear time approximation algorithm for the weighted vertex cover problem. Journal of Algorithms, 2:198-203, 1981.

[FMCF01] C. G. Fernandes, F. K. Miyazawa, M. Cerioli, P. Feofiloff. Uma introdução sucinta a algoritmos de aproximação. 23º Colóquio Brasileiro de Matemática, 2001.

[ITK00] T. E. Ideker, V. Thorsson, R. Karp. Discovery of regulatory interactions through perturbation: inference and experimental design. Pacific Symposium on Biocomputing, 5:302-313, 2000.

[J74] D. S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.
