LX-Tagger

1. BASIC INFORMATION

1.1 Tool name

LX-Tagger.

1.2 Overview and purpose of the tool

This tool was built to deal with Portuguese-specific issues in syntactic categorization. It assigns a single morpho-syntactic tag, from the tagset below, to every token. The tag is attached to the token with a slash (/) as separator:

um exemplo → um/IA exemplo/CN
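A minimal sketch of how this output convention could be read back into (token, tag) pairs. It assumes items are separated by whitespace and that the tag is whatever follows the last slash of each item; `parse_tagged` is a hypothetical helper name, not part of the tool.

```python
def parse_tagged(line):
    """Split a line of 'token/TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Assumption: the tag follows the LAST slash, so a token that
        # itself contains "/" keeps its slashes.
        token, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError(f"no tag separator in {item!r}")
        pairs.append((token, tag))
    return pairs


print(parse_tagged("um/IA exemplo/CN"))
# [('um', 'IA'), ('exemplo', 'CN')]
```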

Each individual token in multi-token expressions gets the tag of that expression prefixed by "L" and followed by the number of its position within the expression:

de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
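The positional tags make it possible to reassemble a multi-token expression after tagging. The sketch below assumes the tag shape is exactly "L" + base tag + position number and that the pieces of one expression are adjacent; `group_expressions` and `MWE_TAG` are hypothetical names introduced for illustration.

```python
import re

# Assumed shape of a multi-token tag: "L", the expression's tag, a position.
MWE_TAG = re.compile(r"^L([A-Z]+)(\d+)$")


def group_expressions(pairs):
    """Merge runs tagged L<TAG>1, L<TAG>2, ... into (expression, TAG)."""
    grouped = []     # list of ([tokens], tag)
    expected = None  # (base tag, next position) while inside an expression
    for token, tag in pairs:
        match = MWE_TAG.match(tag)
        if match:
            base, pos = match.group(1), int(match.group(2))
            if pos == 1:
                # Position 1 always opens a new expression.
                grouped.append(([token], base))
                expected = (base, 2)
                continue
            if expected == (base, pos):
                # Next piece of the expression currently being built.
                grouped[-1][0].append(token)
                expected = (base, pos + 1)
                continue
        # Ordinary token, or an out-of-sequence piece: keep it unchanged.
        grouped.append(([token], tag))
        expected = None
    return [(" ".join(tokens), tag) for tokens, tag in grouped]


pairs = [("de", "LCJ1"), ("maneira", "LCJ2"), ("a", "LCJ3"), ("que", "LCJ4")]
print(group_expressions(pairs))
# [('de maneira a que', 'CJ')]
```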

This tagger was developed with the TnT software on a small (260 Ktoken), accurately hand-tagged corpus. It reached an accuracy of 96.87% when trained on 90% of the corpus and evaluated on the held-out 10%; this was repeated over 10 different test runs and the results were averaged.

LX-Tagger was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.
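The evaluation protocol (ten runs of a 90/10 hold-out split, with accuracy averaged) can be sketched as follows. This is an illustration only: the source does not say how the ten splits were drawn, so random sentence-level shuffling is an assumption, and `train_baseline` is a hypothetical most-frequent-tag stand-in, not TnT. The corpus needs at least two sentences.

```python
import random
from collections import Counter, defaultdict


def train_baseline(sentences):
    """Stand-in tagger (NOT TnT): each token gets its most frequent tag."""
    per_token = defaultdict(Counter)
    all_tags = Counter()
    for sentence in sentences:
        for token, tag in sentence:
            per_token[token][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # used for unseen tokens
    lexicon = {tok: c.most_common(1)[0][0] for tok, c in per_token.items()}
    return lambda token: lexicon.get(token, default)


def averaged_accuracy(sentences, runs=10, held_out=0.10, seed=0):
    """Mean token accuracy over `runs` random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = sentences[:]
        rng.shuffle(shuffled)  # assumption: splits are drawn at random
        cut = max(1, round(len(shuffled) * held_out))
        test, training = shuffled[:cut], shuffled[cut:]
        tagger = train_baseline(training)
        gold = [(tok, tag) for sent in test for tok, tag in sent]
        correct = sum(tagger(tok) == tag for tok, tag in gold)
        scores.append(correct / len(gold))
    return sum(scores) / len(scores)


# Toy corpus: 20 copies of one hand-tagged sentence.
corpus = [[("um", "IA"), ("exemplo", "CN")] for _ in range(20)]
print(averaged_accuracy(corpus))
```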

1.3 A short description of the algorithm

For the algorithm description, please see Brants (2000).

2. TECHNICAL INFORMATION

2.1 Software dependencies and system requirements

Linux.

2.2 Installation

Not applicable.

2.3 Execution instructions

The tagger works as a command line filter tool (it reads input from stdin and writes to stdout). Accordingly, it is meant to be used as part of pipe constructs in the (UNIX/Linux) command line.

Example:

$ cat input.txt | /path/to/tagger/run-Tagger.sh > output.txt

Note that:

1. The tool needs the following files to be executable in order to run:

./run-Tagger.sh
./Chunker/chunker-one
./Chunker/run-Chunker.sh
./Tokenizer/run-Tokenizer.sh
./Tokenizer/tokenizer
./Tokenizer/no-lo_fix.pl
./Tokenizer/post-tok.pl
./Tagger/run-tagger.sh

2. This tool uses pre-compiled C code. On a 64-bit system, make sure the ia32-libs package is installed for 32-bit support; we ran into this problem ourselves when we switched to 64-bit.
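The executable-permission requirement in item 1 can be checked before the first run. The sketch below is one possible check, assuming the files sit under a single installation directory; `REQUIRED` copies the paths listed above, and `not_executable` is a hypothetical helper name.

```python
import os

# Paths copied from the list above, relative to the installation directory.
REQUIRED = [
    "run-Tagger.sh",
    "Chunker/chunker-one",
    "Chunker/run-Chunker.sh",
    "Tokenizer/run-Tokenizer.sh",
    "Tokenizer/tokenizer",
    "Tokenizer/no-lo_fix.pl",
    "Tokenizer/post-tok.pl",
    "Tagger/run-tagger.sh",
]


def not_executable(root, relative_paths=REQUIRED):
    """Return the listed files that are missing or lack execute permission."""
    problems = []
    for rel in relative_paths:
        full = os.path.join(root, rel)
        if not (os.path.isfile(full) and os.access(full, os.X_OK)):
            problems.append(rel)
    return problems


# Hypothetical usage; the directory is a placeholder:
# print(not_executable("/path/to/tagger"))
```

Any path returned by the function would then need `chmod +x` (or to be restored, if it is missing).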

The input text must be encoded in UTF-8. The output uses the same encoding.

The underlying POS tagger is MXPOST, by Adwait Ratnaparkhi. A model for the POS tagger is provided, but you must obtain the POS tagger yourself, since we cannot redistribute that software. MXPOST can be downloaded from the following address: ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz.

You must set the path to your local installation of MXPOST. For this, edit the Tagger/run-Tagger.sh script and set the MXPOST_JAR variable.
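Because the tagger is a stdin/stdout filter, it can also be driven from a script rather than from a shell pipe. The following is a minimal sketch using Python's standard `subprocess` module; `run_filter` is a hypothetical helper, and the script path is the same placeholder used in the shell example.

```python
import subprocess


def run_filter(command, text):
    """Send `text` to a stdin/stdout filter and return what it writes."""
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),  # the tool requires UTF-8 input
        stdout=subprocess.PIPE,
        check=True,                  # raise if the command exits with an error
    )
    return result.stdout.decode("utf-8")  # output uses the same encoding


# Hypothetical usage; adjust the placeholder path to the local installation:
# tagged = run_filter(["/path/to/tagger/run-Tagger.sh"], "um exemplo\n")
```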

POS-Tagger tagset:

Tag             Category                                      Examples

ADJ             Adjectives                                    bom, brilhante, eficaz, …
ADV             Adverbs                                       hoje, já, sim, felizmente, …
CARD            Cardinals                                     zero, dez, cem, mil, …
CJ              Conjunctions                                  e, ou, tal como, …
CL              Clitics                                       o, lhe, se, …
CN              Common Nouns                                  computador, cidade, ideia, …
DEM             Demonstratives                                este, esses, aquele, …
DFR             Denominators of Fractions                     meio, terço, décimo, %, …
DGTR            Roman Numerals                                VI, LX, MMIII, MCMXCIX, …
DGT             Digits                                        0, 1, 42, 12345, 67890, …
DM              Discourse Marker                              olá, …
EADR            Electronic Addresses                          http://www.di.fc.ul.pt, …
EOE             End of Enumeration                            etc
EXC             Exclamative                                   ah, ei, …
GER             Gerunds                                       sendo, afirmando, vivendo, …
GERAUX          Gerund "ter"/"haver" in compound tenses       tendo, havendo, …
IA              Indefinite Articles                           uns, umas, …
IND             Indefinites                                   tudo, alguém, ninguém, …
INF             Infinitive                                    ser, afirmar, viver, …
INFAUX          Infinitive "ter"/"haver" in compound tenses   ter, haver, …
INT             Interrogatives                                quem, como, quando, …
ITJ             Interjection                                  bolas, caramba, …
LTR             Letters                                       a, b, c, …
MGT             Magnitude Classes                             unidade, dezena, dúzia, resma, …
MTH             Months                                        Janeiro, Dezembro, …
NP              Noun Phrases                                  idem, …
ORD             Ordinals                                      primeiro, centésimo, penúltimo, …
PADR            Part of Address                               Rua, av., rot., …
PNM             Part of Name                                  Lisboa, António, João, …
PNT             Punctuation Marks                             ., ?, (, …
POSS            Possessives                                   meu, teu, seu, …
PPA             Past Participles not in compound tenses       afirmados, vivida, …
PP              Prepositional Phrases                         algures, …
PPT             Past Participle in compound tenses            sido, afirmado, vivido, …
PREP            Prepositions                                  de, para, em redor de, …
PRS             Personals                                     eu, tu, ele, …
QNT             Quantifiers                                   todos, muitos, nenhum, …
REL             Relatives                                     que, cujo, …
STT             Social Titles                                 Presidente, drª., prof., …
SYB             Symbols                                       @, #, &, …
TERMN           Optional Terminations                         (s), (as), …
UM              "um" or "uma"                                 um, uma
VAUX            Finite "ter" or "haver" in compound tenses    temos, haveriam, …
V               Verbs (other than PPA, PPT, INF or GER)       falou, falaria, …
WD              Week Days                                     segunda, terça-feira, sábado, …

Multi-Word Expressions

LADV1…LADVn     Multi-Word Adverbs                            de facto, em suma, um pouco, …
LCJ1…LCJn       Multi-Word Conjunctions                       assim como, já que, …
LDEM1…LDEMn     Multi-Word Demonstratives                     o mesmo, …
LDFR1…LDFRn     Multi-Word Denominators of Fractions          por cento
LDM1…LDMn       Multi-Word Discourse Markers                  pois não, até logo, …
LITJ1…LITJn     Multi-Word Interjections                      meu Deus
LPRS1…LPRSn     Multi-Word Personals                          a gente, si mesmo, V. Exa., …
LPREP1…LPREPn   Multi-Word Prepositions                       através de, a partir de, …
LQD1…LQDn       Multi-Word Quantifiers                        uns quantos, …
LREL1…LRELn     Multi-Word Relatives                          o qual, …

2.4 Input/Output data formats

Input is in raw text (see 3.1).

Output is in raw text with syntactic tags (see 3.2).

2.5 Integration with external tools

Not applicable.

3. CONTENT INFORMATION

3.1 A test input file

Esta frase serve para testar o funcionamento da suite. Esta outra frase faz o mesmo.

3.2 The output file

Esta/DEM frase/CN serve/V para/PREP testar/V o/DA funcionamento/CN de_/PREP a/DA suite/CN .*//PNT

Esta/DEM outra/ADJ frase/CN faz/V o/LDEM1 mesmo/LDEM2 ./PNT
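As a quick sanity check on a tagged file, the tags in the sample output above can be tallied. The sketch assumes whitespace-separated items and takes the tag to be whatever follows the last slash, which matters for the item `.*//PNT`, since it contains two slashes. `tag_counts` is a hypothetical helper name.

```python
from collections import Counter

# The two output lines shown above, copied verbatim.
SAMPLE = (
    "Esta/DEM frase/CN serve/V para/PREP testar/V o/DA funcionamento/CN "
    "de_/PREP a/DA suite/CN .*//PNT\n"
    "Esta/DEM outra/ADJ frase/CN faz/V o/LDEM1 mesmo/LDEM2 ./PNT\n"
)


def tag_counts(tagged_text):
    """Count how often each tag occurs in tagger output."""
    counts = Counter()
    for item in tagged_text.split():
        # Assumption: the tag is the part after the last "/".
        _token, _sep, tag = item.rpartition("/")
        counts[tag] += 1
    return counts


counts = tag_counts(SAMPLE)
print(counts["CN"], counts["PNT"])
# 4 2
```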

3.3 Approximation of the time necessary to process the test input file.

4. ADMINISTRATIVE INFORMATION

4.1 Contact person

Name: António Branco

Address: Departamento de Informática, NLX - Grupo de Fala e Linguagem Natural, Faculdade de Ciências da Universidade de Lisboa, Edifício C6, Campo Grande, 1749-016 Lisboa

Position: Assistant professor

Affiliation: Faculty of Sciences, University of Lisbon

Telephone: +351 217 500 087

Fax: +351 217 500 084

E-mail: antonio.branco@di.fc.ul.pt

5. LICENSE

This tool is free for research purposes, with attribution and no redistribution or derivatives allowed. It will be available on the META-SHARE platform.

6. RELEVANT REFERENCES AND OTHER INFORMATION

Branco, António and João Silva (2004). “Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese.” In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004), Paris, ELRA, ISBN 2-9517408-1-6, pp. 507-510.

Brants, T. (2000). “TnT – A Statistical Part-of-Speech Tagger.” In Proceedings 6th Applied Natural Language Processing Conference, ACL, pp. 224-231.

Silva, João (2007). Shallow Processing of Portuguese: From Sentence Chunking to Nominal Lemmatization. MSc thesis, University of Lisbon. Published as Technical Report
