Corpus analysis for indexing:
Corpus analysis for indexing:
when corpus
when corpus
-
-
based
based
terminology makes a
terminology makes a
difference
difference
D
DééboraboraOliveiraOliveira Lu
LuííssSarmentoSarmento Belinda BelindaMaiaMaia Diana DianaSantosSantos
Linguateca Linguateca
Corpus
Corpus
-
-
based indexing of a
based indexing of a
specialized Web portal in PT & EN
specialized Web portal in PT & EN
Interdisciplinary work
Interdisciplinary work
––Information retrieval Information retrieval –
–CorpusCorpus--based terminologybased terminology
Corp
Corpó
ógrafo
grafo
––WebWeb--based environment for terminology work based environment for terminology work
Busca
Busca
––LinguatecaLinguateca’’sssite search enginesite search engine
LINGUATECA
LINGUATECA
Linguateca is a distributed language resource Linguateca is a distributed language resource centre for Portuguese
centre for Portuguese
Aim: contributing to the quality of NLP resources Aim: contributing to the quality of NLP resources for Portuguese
for Portuguese
Increasingly large website at Increasingly large website at http://www.linguateca.pt
http://www.linguateca.ptsince mid 1998since mid 1998
–
–Several onSeveral on--line resources (corpora, tools, line resources (corpora, tools, publications, etc) produced by Linguateca publications, etc) produced by Linguateca –
–Catalogue of resources produced by other Catalogue of resources produced by other researchers
researchers –
–1300 web documents and 2500 external links1300 web documents and 2500 external links
Busca
Busca
: a simple search engine
: a simple search engine
A search
A search-
-engine for our site:
engine for our site:
1.1. Person Search (simple database query)Person Search (simple database query) 2.
2. Publication Search (simple database query)Publication Search (simple database query) 3.
3. Simple keyword search (FreeSimple keyword search (Free--text Search):text Search):
Processing of rtf,
Processing of rtf, pspsand and pdfpdffiles includedfiles included Whole system based on CQP:
Whole system based on CQP: ““Site as a corpusSite as a corpus”” All words are
All words are ““alikealike””: no TF/IDF, no document : no TF/IDF, no document clustering, no terminological knowledge clustering, no terminological knowledge
Search
Search
Systems
Systems
1 and
1
and
2 are OK but
2 are OK
but
not
not
System
System
3
3
(too
(
too
naive!
naive
! too
too
simple...)
simple
...)
How could we improve
How could we improve
Busca
Busca
?
?
Our group has an extensive experience in Our group has an extensive experience in terminology
terminology
Terminology and IR/search
Terminology and IR/search--engines seem a engines seem a “
“perfectperfect--matchmatch””
–
–BUT terminology has not been widely accepted in IRBUT terminology has not been widely accepted in IR
Our question: is the knowledge of Our question: is the knowledge of
terminologically relevant units going to help us terminologically relevant units going to help us improve
improve BuscaBusca??
–
–AtAtindexingindexingstagestage –
–AtAtqueryqueryprocessingprocessingstagestage –
–AtAtresultresultranking ranking stagestage –
–...
Looking at
Looking at
Busca
Busca
logs
logs
January 2003
January 2003 --April 2005April 2005 1527
1527 ““freefree--text searchestext searches””queries:queries:
–
–Excluding own searchesExcluding own searches –
–Very few queries for more than 2 years!!Very few queries for more than 2 years!!
Some statistics: Some statistics:
Repetition of the search strings
Four times; 25; 2% Twice; 170; 15% Three times; 55; 5% Five times Once; 835; 77%
Number of queries vs size of the search string
590 242 126 66 74 0 100 200 300 400 500 600 700
What
What
was
was
being
being
searched
searched
in
in
Busca?
Busca?
4 Consultoria 4 Concordância 4 Autor 4 Árvore 4 Admir 4 Adjetivos 4 About 5 Trail 5 Tradução 5 Tesouro 5 Sexo 5 registros doque é Conjuções coordenadas
5 Peniche
5 linguagem natural
5 corpus da folha de são Paulo
5 Corpus 7 Verbos 8 Cabeça 9 Adjunto 10 Variaçoes # search string 3 floresta sintactica 3 ensino%2C portugues%2C lingua estrangeira
3 emprego do artigo 3 dicionário técnico 3 concordancia verbal 3 comparable corpora 3 cetem publico um milhao de palavras
3 adjunto adniminal 3 verbos irregulares 4 Vanguardaeuropeia 4 singno linguistico 4 redação coerência e coesão
4 o cortiço
4 lingua portuguesa 7%AA série
4 há momentos
4 ele é nada mais nada menos que um idiota
4 creme de legumes
5 Registros doque é Conjuções coordenadas
5 linguagem natural
5 corpus da folha de são paulo
# search string (2 or more tokens)
310 310 Enciclopedias Enciclopedias 315 315 dicionario
dicionarioportuguesportuguesononlineline
334
334
dicionario
dicionarioportuguesportuguesinglesingles
349 349 curriculum vitae curriculum vitae 360 360 dicionario
dicionarioportuguesportugues
381 381 dinalivro dinalivro 384 384
português para estrangeiros
português para estrangeiros
391
391
dicionario
dicionarioportuguesportuguesaurelioaurelio
392
392
dicionario
dicionarioportuguesportuguesinglesinglesononlineline
424
424
livrarias
livrarias portugalportugal
431
431
power
powertranslatortranslator
431 431 editoras editoras 451 451 avalon avalon 457 457 compara compara 463 463 priberam priberam 582 582 portugues
portuguespara estrangeiros para estrangeiros
602 602 livrarias livrarias 625 625 literatura infantil literatura infantil 812 812 dicionario
dicionarioinglesinglesportuguesportuguesononlineline
832 832 linguateca linguateca # # queriesqueries Search
Searchstringstring
What
What
was
was
being
being
searched
searched
in
in
to
to
get
get
to
to
Linguateca
Linguateca
’
’
s site?
s site?
2895 Termos 3034 tradução 3350 lingua 4230 portuguesa 4953 online 5054 e 5063 do 5349 inglês 5612 da 6746 em 7941 para 7966 line 8270 on 8419 português 8757 download 10920 ingles 11725 dicionário 14228 dicionario 18102 portugues 36151 de # ocorrences Word in search string
Overview of queries found in logs
Overview of queries found in logs
Informatics in general
Informatics in general
–
–E.g.: E.g.: ““CADCAD””, , ““PascalPascal””, , ““JavaJava””, , ““AutocadAutocad2000 2000 Topics concerning Portuguese language
Topics concerning Portuguese language
(literature, grammar, use)
(literature, grammar, use)
–
–E.g.E.g.: : ““figuras de estilofiguras de estilo””, , ““verbosverbos””, , ““Tipos de Sujeito Tipos de Sujeito Indeterminado e Ora
Indeterminado e Oraçção sem Sujeitoão sem Sujeito””, , ““verbo verbo inacusativo
inacusativo””, , ““expressões idiomexpressões idiomááticasticas””.. General tools or resources.
General tools or resources.
–
–E.g.E.g.: : ““corporacorpora””, , ““diciondicionááriorio””, , ““conjugador de verbosconjugador de verbos””
Overview of queries found in logs
Overview of queries found in logs
Specific fields or knowledge domains.
Specific fields or knowledge domains.
–
–E.g.E.g.: : ““extracextracçção de informaão de informaççãoão””, , ““terminologiaterminologia””, , “
“semântica lexicalsemântica lexical””, , ““PortuguesePortugueselanguagelanguagehistoryhistory””.. Queries about specific tools or resources.
Queries about specific tools or resources.
–
–E.g.: E.g.: ““CetempCetempúúblicoblico””, , ““CetenfolhaCetenfolha””(two corpora from (two corpora from Linguateca
Linguateca), ), ““COMPARACOMPARA””, , ““CorpCorpóógrafografo””
Queries that seem to be intended for our on
Queries that seem to be intended for our on-
-line concordance tools rather than for the
line concordance tools rather than for the
search engine.
search engine.
–
–E.g.: E.g.: ““semsemnadanada””, ", "abonadabonad.+", ".+", "ansiosoansiosoparapara", ", ““porporéémm (
(ocorrênciasocorrências))””. .
Some conclusions
Some conclusions
All six cases suggest that users have:All six cases suggest that users have:
–
–different goals in minddifferent goals in mind –
–different knowledge about the content of the site different knowledge about the content of the site Users ARE familiar with terminological units:
Users ARE familiar with terminological units:
–
–especially noun phrases especially noun phrases –
–use them in search expressions naturally use them in search expressions naturally
even if the
even if the TUsTUsare inappropriate in respect to the are inappropriate in respect to the
content of our website
content of our website
Sometimes users type incomplete, ill
Sometimes users type incomplete, ill--defined defined
or misspelled terminological units.
or misspelled terminological units.
Initial
Initial
improvements
improvements
for Busca
for Busca
Each document in the site should be
Each document in the site should be
indexed using only the
indexed using only the TUs
TUs
it contains
it contains
Quite easy if complete list of
Quite easy if complete list of TUs
TUs
known:
known:
the
the Corp
Corpó
ógrafo
grafo
may help us in this!
may help us in this!
Knowing all possible variants and
Knowing all possible variants and
synonyms of a given TU
synonyms of a given TU
For more problematic search strings
For more problematic search strings
(ambiguous, incomplete) > set of
(ambiguous, incomplete) > set of TUs
TUs
suggesting re
Empirical work
Empirical work
Subcorpus
Subcorpus
-
-
178 files in Portuguese
178 files in Portuguese
Total number of tokens approximately 1M.
Total number of tokens approximately 1M.
Corp
Corpó
ógrafo
grafo
> extracted and manually
> extracted and manually
validated 1209
validated 1209 TUs
TUs
5+ words 4% 2 words 42% 3 words 18% 4 words 9% 1 word 27%
Frequency and Distribution of the 1209 TUs extracted. The axis are set to logarithmic scale. Region 1 Region 1 Region 3 Region 3 Region 2 Region 2
Explanation of chart
Explanation of chart
Region 1Region 1: frequent but not widely distributed : frequent but not widely distributed TUs
TUs. E.g.: . E.g.: ““modelomodelococlearcoclear””, , ““taxataxade disparosde disparos””
--usually compound words.usually compound words. Region 2
Region 2: frequent and widely distributed : frequent and widely distributed TUsTUs. . E. g.:
E. g.: ““ananááliselise””, , ““corpuscorpus””, , ““modelomodelo””, , “
“lingulinguíísticastica””, etc. , etc. --usually very generic usually very generic TUsTUs, , and /or single words (they nevertheless have and /or single words (they nevertheless have multiple possible modifiers).
multiple possible modifiers). Region 3
Region 3: where less frequent and less : where less frequent and less distributed
distributed TUsTUsmay be found. E.g.: may be found. E.g.: ““verbo verbo intransitivo
intransitivo””, , ““relarelaçção semâticaão semâtica””,,””vibravibraçção ão macromecânica
macromecânica””..
Items to help searches
Items to help searches
Synonyms Portuguese (53 pair) Synonyms Portuguese (53 pair) --E.g.: E.g.: “
“adjectivo: adjetivoadjectivo: adjetivo””, , ““bibliografia: documento: bibliografia: documento: publica
publicaççãoão””;;
Translation equivalents between Portuguese Translation equivalents between Portuguese- -English (107 pairs)
English (107 pairs)--E.g.: E.g.: ““diciondicionááriorio: : dictionary
dictionary””;;
Synonyms English (23 pair)
Synonyms English (23 pair)--E.g.: E.g.: ““parsing parsing system: parser
system: parser””;;
Acronyms in Portuguese and English (81) Acronyms in Portuguese and English (81)- -E.g.:
E.g.: ““RI: RecuperaRI: Recuperaçção de Informaão de Informaççãoão””..
modelo auditivo de Seneff, modelo coclear de Goldstein 0,2
3 CN + ADJ + PRP + PN
processamento automático da linguagem natural, criação semi-automática de recursos lexicais 0,7
9 CN + ADJ + PRP + CN + ADJ
bd, cce, IA, lil 1,2
14 Acronym/abbreviation
modelo de Kanis-Deboer, teorema de Bayes, rede de Elman 1,6
19 CN + PRP + PN
Legendagem automática de notícias, reconhecimento óptico de caracteres 1,7
20 CN + ADJ + PRP + CN
arquitectura do sistema de interrogações, processo de aquisição de vocabulário 2,3
28 CN + PRP + CN + PRP + CN
dicionário Aurélio, sistema Edite 2,9
35 CN + PN
reconhecimento de dígitos isolados, resolução da ambigüidade lexical 3,1 37 CN + PRP + CN + ADJ COMPARA, Corpógrafo 4,3 52 PN
sistema de tradução, sinal de fala 14,7 178 CN + PRP + CN dicionário, gramática 18,7 226 CN
vagueza grammatical, sumarização automática 41,6 504 CN + ADJ Examples % occur. POS
The distribution of existing POS structures (ADJ – adjective; CN – common name; PN – Proper Name; PRP - Preposition)
Semantic Classification 1
Semantic Classification 1
Language resources
Language resources. E.g.:
. E.g.: “
“corpora
corpora”
”,
,
“
“CETEMP
CETEMPú
úblico
blico”
”,
, “
“dicion
dicioná
ário
rio”
”,
, “
“Wordnet
Wordnet”
”,
,
“
“COMPARA
COMPARA”
”
etc.
etc.
Tools and systems
Tools and systems.
.
E.g.: “
E.g.:
“anotador
anotador”
”,
,
“
“analisador morfol
analisador morfoló
ógico
gico”
”,
, “
“Corp
Corpó
ógrafo
grafo”
”,
,
etc.
etc.
Actions and processes
Actions and processes.
.
E.g.:
E.g.:
“
“aquisi
aquisiç
ção de vocabul
ão de vocabulá
ário
rio”
”,
, “
“extrac
extracç
ção de
ão de
terminologia
Semantic Classification 2
Semantic Classification 2
Specific theories and models
Specific theories and models..E.g.: E.g.: ““modelo modelo
auditivo de Seneff
auditivo de Seneff””, , ““algoritmo de Earleyalgoritmo de Earley””, etc. , etc.
Linguistic concepts and phenomena
Linguistic concepts and phenomena..E.g.: E.g.:
“
“polissemiapolissemia””, , ““ambiguidade lexicalambiguidade lexical””, , ““verbo verbo incusativo
incusativo””, , ““advadvéérbio de temporbio de tempo””, , ““adjectivoadjectivo””, , etc.
etc.
Disciplines or knowledge fields
Disciplines or knowledge fields..E.g.: E.g.:
“
“lexicografialexicografia””, , ““engenharia da linguagemengenharia da linguagem””, , “
“inteligência artificialinteligência artificial””, , ““semântica lexicalsemântica lexical””, etc. , etc.
Suggestions
Suggestions
For:
For:
–
–Improvement of Improvement of BuscaBusca’’sssearch capabilities search capabilities –
–User satisfaction.User satisfaction.
Easier searching
Easier searching
Single words
Single words
–
–Suggest possible modifiers of wordSuggest possible modifiers of word –
–With names of resources > to resource –With names of resources > to resource –e.g. e.g. COMPARA
COMPARA
Mechanism to cope with different varieties
Mechanism to cope with different varieties
of spelling in Portuguese
of spelling in Portuguese
Lists of synonym lists, acronym lists and
Lists of synonym lists, acronym lists and
translation equivalents
translation equivalents
Clustering of results
Clustering of results
More suggestions
More suggestions
Semantic classification of keywords + pragmatic rules of Semantic classification of keywords + pragmatic rules of thumb
thumb
If interested in a particular technology/tool/resource, > If interested in a particular technology/tool/resource, > systems that apply or implement such a technology or systems that apply or implement such a technology or function
function E.g.
E.g. --““morphologymorphology”” > choice > choice
–
– ““scientific disciplinescientific discipline””
–
– ““applications that deal with morphologyapplications that deal with morphology””(morphological (morphological analysers, stemmers, morphological generators, POS analysers, stemmers, morphological generators, POS taggerstaggers)) –
– ““specific systems that perform any of these tasksspecific systems that perform any of these tasks””
(
(PalavrosoPalavroso, PALMORF, etc.) , PALMORF, etc.) –
– ““evaluationevaluation””
More suggestions
More suggestions
Manually select correct semantic
Manually select correct semantic
classification of each TU
classification of each TU
(partially done)
(partially done)
Automatic text categorization system
Automatic text categorization system
Corp
Corpó
ógrafo
grafo
tools for finding semantic
tools for finding semantic
relations
relations
and building thesaurus/ontologies
and building thesaurus/
ontologies
for helping navigation
for helping navigation
ETC
ETC
Conclusions on
Conclusions on
Interdisciplinary work
Interdisciplinary work
Requires
Requires
––Mutual understandingMutual understanding –
–Tolerance Tolerance –
–Mental gymnastics Mental gymnastics
Exemplified here with
Exemplified here with
–
–Computer scienceComputer science –
–Computational linguisticsComputational linguistics –