Luís Sarmento
Universidade do Porto (NIAD&R) and
Linguateca
[email protected]
Setup
* 2.8 GHz PIV
* 2 GB RAM
* 160 GB IDE HD
* Fedora Core 2
* Perl 5.6
* MySQL 5.0.15
* DBI + DBD::mysql
Optimize Queries…
* Text at sentence level: QA, Definition Extraction
* 1-4 word window contexts: find MWE, collocations
* word co-occurrence data: WSD, context clustering
Global Motivation
* Obtain fast text query methods for a variety of “data-driven” NLP techniques
* Develop practical methods for querying current gigabyte corpora (web collections…)
* Experiment with scalable methods for querying the next generation of terabyte corpora
Statistics
Table            # tuples (millions)   Table size (GB)   Index size (GB)
Metadata                       1.529               0.2              0.05
Sentences                     35.575              6.55              5.90
dictionary                     6.834              0.18              0.27
2-grams                       54.610              1.50              0.92
3-grams                      173.608              5.43              2.97
4-grams                      293.130             10.40              6.35
co-occurrence                761.044             20.10              7.56
BACO total                         -              44.4              ~ 24
BACO
A large database of text and co-occurrences
Some Practical Problems
* How to compile lists of n-grams (2,3,4…) in a 1B word collection?
* How to obtain co-occurrence info for all pairs of words in a 1B word collection?
* Which data structures are best (and easily available in Perl): hash tables? Trees? Others (Judy? T-Trees?)…
* How should all this data be stored and indexed in a standard RDBS?
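As a rough illustration of the first two problems, the following Python sketch counts n-grams and sentence-level co-occurrence pairs with in-memory hash tables. It is not the Perl code used to build BACO: the function names, the whitespace tokenization and the choice of the sentence as the co-occurrence window are assumptions made for the example, and a 1B word collection would not fit in memory this way.

```python
from collections import Counter
from itertools import combinations


def count_ngrams(sentences, n):
    """Count word n-grams over an iterable of sentences (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: plain whitespace tokenization
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def count_cooccurrences(sentences):
    """Count unordered pairs of distinct words that share a sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))  # sorted so each pair has one key
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


sample = ["o gato preto dorme", "o gato preto come", "o rato dorme"]
bigrams = count_ngrams(sample, 2)
pairs = count_cooccurrences(sample)
print(bigrams[("gato", "preto")])
print(pairs[("gato", "o")])
```

Keys are tuples of words, so the same counters could be dumped to a tabular file for later loading into a database.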
Some conclusions
* RDBS are a good alternative for querying gigabyte text collections for NLP purposes
* complex data pre-processing tasks, data modeling and system tuning may be required
* the current implementation deals with raw text, but the models may be extended to annotated corpora
* query speed depends on internal details of the MySQL indexing mechanism
* current performance may be improved by a more efficient database schema and parallelization
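To make the role of indexing concrete, here is a minimal sketch of an n-gram table with an index, using Python's built-in sqlite3 module as a stand-in for MySQL. The table name, column names and sample counts are hypothetical; the poster does not give BACO's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE bigrams (w1 TEXT NOT NULL, w2 TEXT NOT NULL, freq INTEGER NOT NULL)"
)
# Hypothetical sample rows, not real BACO counts.
rows = [("banco", "de", 120), ("banco", "central", 45), ("de", "dados", 300)]
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?)", rows)

# Build the index after bulk loading (load first, index second).
conn.execute("CREATE INDEX idx_bigrams_w1 ON bigrams (w1)")


def following_words(connection, word):
    """Return (w2, freq) for bigrams starting with `word`, most frequent first."""
    cursor = connection.execute(
        "SELECT w2, freq FROM bigrams WHERE w1 = ? ORDER BY freq DESC", (word,)
    )
    return cursor.fetchall()


print(following_words(conn, "banco"))
```

An index on the first word lets lookups of this kind avoid a full table scan, which matters once a table holds tens or hundreds of millions of tuples.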
Current Deliverables
* MySQL-encoded database of text, n-grams and information about co-occurrence pairs
* Perl Module to easily query BACO instances
Stage 1: Data preparation and loading
[Diagram: WPT03 (12 GB) -> duplicate removal (by Nuno Seco [email protected]) -> 6 GB, 1.5M docs -> sentence splitting + document metadata -> tabular format -> load data -> index data -> indexed database (metadata + text sentences)]
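The Stage 1 steps (duplicate removal, sentence splitting, tabular output) can be sketched as follows in Python. This is only an illustration under simplifying assumptions: exact-duplicate detection through a hash of the whitespace-normalized text, a naive punctuation-based sentence splitter, and a hypothetical (doc_id, sentence_no, sentence) row layout. The poster does not describe the actual duplicate removal or splitting methods.

```python
import hashlib
import re


def remove_duplicates(docs):
    """Yield (doc_id, text) pairs, skipping documents whose text was already seen."""
    seen = set()
    for doc_id, text in docs:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc_id, text


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_rows(docs):
    """Produce tab-separated rows ready for bulk loading."""
    rows = []
    for doc_id, text in remove_duplicates(docs):
        for number, sentence in enumerate(split_sentences(text), start=1):
            rows.append(f"{doc_id}\t{number}\t{sentence}")
    return rows


docs = [
    (1, "Bom dia. Como vai?"),
    (2, "Bom  dia.  Como vai?"),  # duplicate of document 1 up to whitespace
    (3, "Outro texto."),
]
for row in to_rows(docs):
    print(row)
```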
Stage 2: compiling dictionary + 2,3,4-grams + co-occurrence pairs
[Diagram: text sentences -> DIC and 2-GRAMS (labels: single pass; 13 iterations; disjoint division based on number of chars) and 3-GRAMS, 4-GRAMS, CO-OC PAIRS (3, 4-grams + co-occurrence pairs: multiple iterations; N documents per iteration; temp files are sorted) -> load data -> index data -> BACO]
Final Tables:
* metadata
* text sentences
* dictionary
* 2,3,4-grams
* co-occurrence pairs
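The multi-iteration scheme of Stage 2 (N documents per iteration, sorted temp files) resembles a classic external sort-merge count. The Python sketch below shows one possible reading of it: each batch is counted in memory and written to a sorted temporary file, and the files are then merged while summing the counts of equal keys. The batch size, file format and function names are assumptions; the poster does not give these details.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby


def write_sorted_batch(sentences, n, directory):
    """Count n-grams in one batch and write sorted 'ngram<TAB>count' lines."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    fd, path = tempfile.mkstemp(dir=directory, suffix=".tsv", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for ngram in sorted(counts):
            out.write(f"{ngram}\t{counts[ngram]}\n")
    return path


def read_pairs(path):
    """Stream (ngram, count) pairs from one sorted temp file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            ngram, count = line.rstrip("\n").split("\t")
            yield ngram, int(count)


def merge_counts(paths):
    """Merge sorted temp files, summing the counts of identical n-grams."""
    merged = heapq.merge(*(read_pairs(p) for p in paths))
    for ngram, group in groupby(merged, key=lambda pair: pair[0]):
        yield ngram, sum(count for _, count in group)


def count_in_batches(sentences, n, batch_size):
    """Count n-grams batch by batch, keeping only one batch in memory at a time."""
    with tempfile.TemporaryDirectory() as directory:
        paths = [
            write_sorted_batch(sentences[i:i + batch_size], n, directory)
            for i in range(0, len(sentences), batch_size)
        ]
        return list(merge_counts(paths))


corpus = ["a b c", "a b d", "b c a", "a b c"]
print(count_in_batches(corpus, 2, batch_size=2))
```

Because every temp file is sorted, the merge only needs one open line per file, so memory use is bounded by the batch size rather than by the size of the collection.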
Linguateca
* Improving processing and research on the Portuguese language
* Fostering collaboration among researchers
* Providing public and free-of-charge tools and resources to the community
http://www.linguateca.pt
WPT03 - A public resource
* WPT03 is a resource built by the XLDB Group (xldb.di.fc.ul.pt) and distributed by Linguateca (www.linguateca.pt)
* 12 GB, 3.7M web documents and ~1.6B words
* Obtained from the Portuguese web search engine TUMBA!
http://www.tumba.pt
NIAD&R
* Research group started in 1998 as part of the LIACC (AI Lab) @ Universidade do Porto
* Research topics: Multi-Agent Systems, E-business Technology, Machine Learning, Robotics, Ontologies