poster baco lrec 2006

Luís Sarmento

Universidade do Porto (NIAD&R) and

Linguateca

[email protected]

Setup

* 2.8 GHz PIV

* 2 GB RAM

* 160 GB IDE HD

* Fedora Core 2

* Perl 5.6

* MySQL 5.0.15

* DBI + DBD-mysql

Optimize Queries…

* Text at sentence level: QA, Definition Extraction

* 1-4 word window contexts: find MWE, collocations

* Word co-occurrence data: WSD, context clustering

Global Motivation

* Obtain fast text query methods for a variety of “data-driven” NLP techniques

* Develop practical methods for querying current gigabyte corpora (web collections…)

* Experiment with scalable methods for querying the next generation of terabyte corpora

Statistics

Table            # tuples (millions)   Table size (GB)   Index size (GB)
Metadata                       1.529               0.2              0.05
Sentences                     35.575              6.55              5.90
dictionary                     6.834              0.18              0.27
2-grams                       54.610              1.50              0.92
3-grams                      173.608              5.43              2.97
4-grams                      293.130             10.40              6.35
co-occurrence                761.044             20.10              7.56
BACO total                         -              44.4              ~ 24
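The totals in the statistics table can be checked with a few lines of arithmetic. The sketch below copies the figures from the table and also derives an average storage cost per tuple; it assumes 1 GB = 10^9 bytes, which the poster does not state, and the names `stats` and `bytes_per_tuple` are illustrative only.

```python
# (tuples in millions, table size in GB, index size in GB), copied from the table
stats = {
    "Metadata": (1.529, 0.2, 0.05),
    "Sentences": (35.575, 6.55, 5.90),
    "dictionary": (6.834, 0.18, 0.27),
    "2-grams": (54.610, 1.50, 0.92),
    "3-grams": (173.608, 5.43, 2.97),
    "4-grams": (293.130, 10.40, 6.35),
    "co-occurrence": (761.044, 20.10, 7.56),
}

# Column totals, to compare against the "BACO total" row
table_total = sum(table_gb for _, table_gb, _ in stats.values())
index_total = sum(index_gb for _, _, index_gb in stats.values())


def bytes_per_tuple(name, bytes_per_gb=10**9):
    """Average table bytes per tuple (assumes decimal gigabytes)."""
    tuples_millions, table_gb, _ = stats[name]
    return table_gb * bytes_per_gb / (tuples_millions * 10**6)


print(f"table total: {table_total:.2f} GB, index total: {index_total:.2f} GB")
for name in stats:
    print(f"{name}: {bytes_per_tuple(name):.1f} bytes/tuple")
```

Under that assumption the co-occurrence table, the largest one, costs a little over 26 bytes per tuple before indexing.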

BACO

A large database of text and co-occurrences

Some Practical Problems

* How to compile lists of n-grams (2,3,4…) in a 1B word collection?

* How to obtain co-occurrence info for all pairs of words in a 1B word collection?

* Which data structures are best (and easily available in Perl): hash tables? Trees? Others (Judy? T-Trees?)…

* How should all this data be stored and indexed in a standard RDBS?
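To make the first two problems concrete, here is a minimal in-memory sketch of n-gram and co-occurrence counting with hash tables. It is written in Python rather than the Perl used for BACO, uses a naive whitespace tokenizer, and counts each unordered word pair once per sentence; all of these are assumptions for illustration, not the poster's actual method. At the 1B-word scale the counters would not fit in 2 GB of RAM, which is why the problem is hard.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_sentence_stats(sentences, max_n=4):
    """Count 2..max_n-grams and unordered within-sentence word pairs."""
    ngram_counts = {n: Counter() for n in range(2, max_n + 1)}
    cooc_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenizer (assumption)
        for n in ngram_counts:
            ngram_counts[n].update(ngrams(tokens, n))
        # Each distinct pair of words is counted once per sentence.
        for pair in combinations(sorted(set(tokens)), 2):
            cooc_counts[pair] += 1
    return ngram_counts, cooc_counts


# Toy input, not taken from WPT03
sentences = ["o gato preto dorme", "o gato branco dorme"]
ngram_counts, cooc_counts = count_sentence_stats(sentences)
print(ngram_counts[2].most_common(3))
print(cooc_counts.most_common(3))
```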

Some conclusions

* RDBS are a good alternative for querying gigabyte text collections for NLP purposes

* Complex data pre-processing tasks, data modeling and system tuning may be required

* The current implementation deals with raw text, but the models may be extended for annotated corpora

* Query speed depends on internal details of the MySQL indexing mechanism

* Current performance may be improved by a more efficient database schema and parallelization

Current Deliverables

* MySQL-encoded database of text, n-grams and information about co-occurrence pairs

* Perl Module to easily query BACO instances
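As an illustration of what querying such a database might look like, the sketch below builds a tiny dictionary table and a 2-gram table and looks up the most frequent right neighbours of a word. It uses Python's built-in `sqlite3` as a stand-in for MySQL and DBI; the table names, column names, integer word IDs, index layout and toy frequencies are all hypothetical and are not BACO's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dictionary (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE bigrams (w1 INTEGER, w2 INTEGER, freq INTEGER);
    -- Index on the first word so neighbour lookups avoid a full table scan
    CREATE INDEX bigrams_w1 ON bigrams (w1, freq);
""")

# Toy data with made-up frequencies
words = ["casa", "de", "campo", "banho"]
conn.executemany("INSERT INTO dictionary (word) VALUES (?)", [(w,) for w in words])
ids = {word: word_id
       for word_id, word in conn.execute("SELECT word_id, word FROM dictionary")}
rows = [("casa", "de", 120), ("de", "campo", 15), ("de", "banho", 40)]
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?)",
                 [(ids[a], ids[b], freq) for a, b, freq in rows])


def right_neighbours(conn, word, limit=10):
    """Return (word, freq) rows for the most frequent words following `word`."""
    query = """
        SELECT d2.word, b.freq
        FROM bigrams b
        JOIN dictionary d1 ON d1.word_id = b.w1
        JOIN dictionary d2 ON d2.word_id = b.w2
        WHERE d1.word = ?
        ORDER BY b.freq DESC
        LIMIT ?
    """
    return conn.execute(query, (word, limit)).fetchall()


result = right_neighbours(conn, "de")
print(result)
```

Storing word IDs instead of strings in the n-gram table is one common way to keep rows small; whether BACO does this is not stated on the poster.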

Stage 1: Data preparation and loading

* WPT03 (12 GB)

* Duplicate removal (by Nuno Seco [email protected]): 6 GB, 1.5M docs

* Sentence splitting: document metadata + text sentences, in tabular format

* Load data

* Index data

* Indexed database: metadata + text sentences

Stage 2: compiling dictionary + 2,3,4-grams + co-occurrence pairs

* Input: text sentences

* DIC (dictionary): single pass

* 2-GRAMS: 13 iterations, disjoint division based on number of chars

* 3-GRAMS, 4-GRAMS, CO-OC PAIRS: multiple iterations, N documents per iteration, temp files are sorted

* Load data

* Index data

* BACO

Final Tables:

* metadata

* text sentences

* dictionary

* 2,3,4-grams

* co-occurrence pairs
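The "multiple iterations, N documents per iteration, temp files are sorted" step in Stage 2 resembles a classic external-sort pattern: count one batch in memory, write it out as a sorted run, then merge the runs. The sketch below shows that pattern in Python; it is a guess at the general technique, not the poster's implementation, and the function names, file format and toy documents are hypothetical.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby


def write_sorted_run(counts, directory):
    """Write one batch of n-gram counts to a temp file, sorted by n-gram."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for gram in sorted(counts):
            out.write(f"{gram}\t{counts[gram]}\n")
    return path


def read_run(path):
    """Yield (ngram, count) pairs from a sorted run file."""
    with open(path, encoding="utf-8") as run:
        for line in run:
            gram, count = line.rstrip("\n").split("\t")
            yield gram, int(count)


def count_ngrams_in_batches(documents, n, batch_size, directory):
    """Count n-grams batch by batch, then merge the sorted runs."""
    runs = []
    for start in range(0, len(documents), batch_size):
        counts = Counter()  # only one batch is held in memory at a time
        for doc in documents[start:start + batch_size]:
            tokens = doc.split()
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
        runs.append(write_sorted_run(counts, directory))
    # heapq.merge streams the runs in sorted order, so equal n-grams are adjacent
    merged = heapq.merge(*(read_run(path) for path in runs))
    totals = {}
    for gram, group in groupby(merged, key=lambda pair: pair[0]):
        totals[gram] = sum(count for _, count in group)
    return totals


# Toy documents, two per batch
docs = ["a b c", "a b d", "b c a b"]
with tempfile.TemporaryDirectory() as tmp:
    totals = count_ngrams_in_batches(docs, n=2, batch_size=2, directory=tmp)
print(totals)
```

In a real pipeline the merged stream would be written straight to a tabular file for bulk loading rather than collected in a dictionary.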

Linguateca

* Improving processing and research on the Portuguese language

* Fostering collaboration among researchers

* Providing public and free-of-charge tools and resources to the community

http://www.linguateca.pt

WPT03 - A public resource

* WPT03 is a resource built by the XLDB Group (xldb.di.fc.ul.pt) and distributed by Linguateca (www.linguateca.pt)

* 12GB, 3.7M web documents and ~1.6B words

* Obtained from the Portuguese web search engine TUMBA!

http://www.tumba.pt

NIAD&R

* Research group started in 1998 as part of the LIACC (AI Lab) @ Universidade do Porto

* Research topics: Multi-Agent Systems, E-business Technology, Machine Learning, Robotics, Ontologies

http://www.fe.up.pt/~eol/