DEMONSTRATION SESSION

Following the spirit of the demo sessions of PROPOR 2010, 2012, 2014, and 2016, the PROPOR 2018 demonstration track aims at bringing together Academia and Industry and at creating a forum where more than written or spoken descriptions of research are available. Demos allow attendees to try and test the presented work during a dedicated session with a more informal setting. Products, systems, and tools are examples of accepted demos; both early research prototypes and mature systems were considered.

July 24-26, 2018, Canela - Brazil

Valeria de Paiva (Nuance, USA)

Rodrigo Wilkens (Université Catholique de Louvain, Belgium)
Fernando Batista (INESC-ID & ISCTE-IUL, Portugal)

Demos Chairs


ACCEPTED DEMOS

A computational grammar for Portuguese
Bruno Cuconato and Alexandre Rademaker

LX-SemanticSimilarity

João Silva, Marcos Garcia, João Rodrigues, and António Branco

TEITOK: TEI for corpus linguistics
Maarten Janssen

DeepBonDD: a Deep neural approach to Boundary and Disfluency Detection
Marcos Treviso, Anderson Smidarle, Lilian Hubner, and Sandra Aluisio

CL-CONLLU: Universal Dependencies in Common Lisp
Alexandre Rademaker, Fabricio Chalub, Bruno Cuconato, Henrique Muniz, and Guilherme Paulino-Passos

Automatically Generating Temporal Proto-Narratives From Portuguese Headlines
Arian Pasquali, Vítor Mangaravite, Ricardo Campos, Alípio Mário Jorge, and Adam Jatowt

SMILLE: Supporting Portuguese as Second Language

Leonardo Zilio, Rodrigo Wilkens, Maria José Finatto, and Cédrick Fairon


A computational grammar for Portuguese

Bruno Cuconato¹ and Alexandre Rademaker¹,²

1 FGV/EMAp

2 IBM Research

Abstract. This work presents an ongoing effort towards a Portuguese grammar under the Grammatical Framework (GF) formalism. GF and the new grammar are briefly introduced, and then we employ the grammar to parse HPSG's Matrix MRS test suite. We will demonstrate the use of the grammar in the parsing of text and in natural language applications.

Keywords: grammatical framework · computational grammar · type theory · functional programming

1 Introduction

Grammatical Framework (GF) is a programming language for grammar writing. It is a functional programming language, with syntax inspired by the Haskell programming language [2]; it draws from intuitionistic type theory for its type system [3].

GF's forte lies in multilingual processing. It applies to natural languages the distinction made for programming languages: that of abstract and concrete syntaxes. Separating them allows GF to specify a single abstract grammar for several concrete languages. Translation between two natural languages therefore becomes parsing of concrete syntax to its abstract representation, followed by linearization to the target language.

Writing a grammar for even a fragment of a natural language is a complex task. GF boasts a module system, so GF grammars can import other grammars for code reuse. GF grammars can thus be divided into resource and application grammars: while the former intend to provide useful linguistic constructs for downstream grammars in a suitable and stable application programming interface (API) (like software libraries do for programs [4]), the latter aim to apply these and other definitions to domain-specific applications.

The GF Resource Grammar Library (RGL) declares a common abstract syntax for resource grammars, with a number of grammatical categories, construction functions, and a small test lexicon. Each resource grammar then defines this structure in parallel, and is also free to add language-specific extensions.

Listing 1.1. RGL API, resource grammar, and application grammar output examples

    > import -retain present/TryEng.gfo
    > cc -one mkS (mkCl (mkNP this_Det (mkN "candy")) (mkA "good"))
    this candy is good

    > import present/LangEng.gfo
    > p -lang=Eng "these fish are rotten"
    PhrUttNoPConj (UttS (UseCl (TTAnt TPres ASimul) PPos
      (PredVP (DetCN (DetQuant this_Quant NumPl) (UseN fish_N))
        (UseComp (CompAP (PositA rotten_A)))))) NoVoc

    > import FoodsEng.gf FoodsPor.gf
    > p -lang=Eng -tr "that pizza is delicious" | l -lang=Por
    Pred (That Pizza) Delicious
    essa pizza é deliciosa

In Listing 1.1, we can see (in order): the user importing and using the English resource grammar API to build a simple sentence; the user importing the English resource grammar interface and parsing a sentence with it (notice the detailed output of the syntactic structure); and finally, the user importing a domain-specific application grammar, parsing a sentence with it, and linearizing the obtained tree in Portuguese. Because the application grammar is specialized to a domain, it can produce smaller and more semantic trees.³

2 The Portuguese resource grammar

The current GF RGL supports more than thirty languages, with varying degrees of completeness. Here we present ongoing work on the addition of a Portuguese resource grammar (henceforth PRG) to the RGL.

As an example of the utility of the PRG, a programmer wanting to create a multilingual application grammar involving a Portuguese lexicon would otherwise have to hard-code the lexicon's inflection tables in the application. With the PRG, she can import the resource grammar, which includes a concrete syntax and a complete set of paradigms for building words. She can then use an overloaded constructor mkC (for any given class C), which accepts a variable number of arguments depending on the word's irregularity. For most words, simply providing the uninflected form is sufficient to obtain the correct inflection table [1].

3 Experiments and Discussion

In order to test the PRG, we used HPSG's Matrix MRS test suite of 107 sentences in English.⁴ Our experiment was as follows: we parsed the English sentences into trees, removing spuriously ambiguous ones, and linearized the resulting trees into Portuguese. The Portuguese linearizations were then compared to their corresponding sentences in the test suite and analyzed with respect to grammatical correctness. We do not test the translated sentences for equivalence, because translation equivalence is not a goal of the RGL [4].

³ Generally, application grammars also produce fewer trees than resource grammars.

⁴ http://moin.delph-in.net/MatrixMrsTestSuiteEn

Even when parsing the simple sentences of the test suite, the issue of ambiguity arises. Consider the sentence [Some bark]. Given the context of the other sentences, it is clear that 'bark' here is meant as a verb. But our grammar cannot know such a thing, and thus outputs three possible trees: one for bark as a noun, and two for bark as a verb.

Another example of ambiguity is in [the dog could bark]. The RGL distinguishes between 'can' in the sense of 'know' and in the sense of 'being capable'. These have the same linearization in English, but the Portuguese grammar can then offer two possible translations: [o cachorro sabia ladrar] and [o cachorro podia ladrar].

The test suite allowed us to find several mistakes in our implementation. For instance, the handling of compound nouns is wrong, translating [the tobacco garden dog barked] to *[o tabaco o jardim o cachorro ladrava].

Besides the mistakes found in the Portuguese linearizations, there are missing constructors that prevented the linearization of some trees, and a few phenomena that are still to be implemented, such as the contraction in *[havia gatos em o jardim].

4 Conclusion

When complete, the Portuguese resource grammar will be one of the few freely available computational grammars for Portuguese. In addition to being open source, GF offers a whole ecosystem of tools for using GF grammars in NLP applications: compilation of grammars to several formats (such as a portable binary format and formats for speech recognition grammars), the possibility of embedding grammars in Haskell, Java, Python, and C# programs, and, of course, the use of the RGL for multilingual applications.
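The possibility of embedding grammars in Python mentioned above invites a small illustration. The following is a hedged sketch, assuming the pgf runtime bindings and a hypothetical Foods.pgf compiled from FoodsEng.gf and FoodsPor.gf (the file and concrete-syntax names follow Listing 1.1, but are not taken from the paper):

    # Hedged sketch: embedding a compiled GF grammar in Python via the pgf
    # runtime bindings; Foods.pgf is a hypothetical binary compiled from
    # FoodsEng.gf and FoodsPor.gf (names follow Listing 1.1).
    import pgf

    grammar = pgf.readPGF("Foods.pgf")
    eng = grammar.languages["FoodsEng"]
    por = grammar.languages["FoodsPor"]

    # Parse English into an abstract tree, then linearize it in Portuguese,
    # mirroring the translation pipeline shown in Listing 1.1.
    prob, tree = next(eng.parse("that pizza is delicious"))
    print(tree)                 # e.g. Pred (That Pizza) Delicious
    print(por.linearize(tree))  # e.g. essa pizza é deliciosa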

In the demo, we will give examples of GF morphology paradigms and their use in Portuguese, as well as offer examples of application grammars using the PRG, such as a logic-to-natural-language translator following [5].

References

1. Détrez, G., Ranta, A.: Smart paradigms and the predictability and complexity of inflectional morphology. In: Proceedings of the 13th Conference of the EACL. pp. 645–653. ACL, Stroudsburg, PA, USA (2012)

2. Marlow, S., et al.: Haskell 2010 language report (2010)

3. Ranta, A.: Grammatical Framework: a type-theoretical grammar formalism. Journal of Functional Programming 14(2), 145–189 (2004)

4. Ranta, A.: The GF resource grammar library. Linguistic Issues in Language Technology 2(2), 1–63 (2009)

5. Ranta, A.: Translating between language and logic: What is easy and what is difficult. In: CADE. pp. 5–25. Springer (2011)


LX-SemanticSimilarity⋆

João Silva¹, Marcos Garcia², João Rodrigues¹, and António Branco¹

¹ University of Lisbon
NLX—Natural Language and Speech Group, Department of Informatics
Faculdade de Ciências, Campo Grande, 1749-016 Lisboa, Portugal
{jsilva,joao.rodrigues,antonio.branco}@di.fc.ul.pt

² University of Coruña, Faculty of Philology
marcos.garcia.gonzalez@udc.gal

Abstract. We present the LX-SemanticSimilarity web service and the respective demo, offered as an online service for human users. The web service provides an API to common operations over the LX-DSemVectors word embeddings for Portuguese without requiring the embeddings to be loaded locally.

Keywords: Web service · Online service · Distributional semantics · Word embeddings · Portuguese

1 Introduction

Distributional semantic models, also known as word embeddings, represent the meaning of an expression as a high-dimensional vector of real numbers. This vectorial representation of meaning makes it possible, among other things, to reify semantic similarity in terms of distance in a vector space. Having a way to quantitatively measure semantic similarity has opened up many avenues of research that explore how the integration of distributional features can improve a variety of natural language processing tasks, such as determining similarity between words [4], formal semantics [1], and sentiment analysis [2].

High-quality embeddings are hard to obtain due to the amount of data and computational effort required. LX-DSemVectors [7] are publicly available word embeddings for Portuguese, and their existence helps in this regard, though they still require a great deal of RAM and some technical skill to operate, which may pose problems for some researchers, including those from the Digital Humanities.

In this paper, we present the LX-SemanticSimilarity web service, which provides access to the LX-DSemVectors through an API with several operations commonly used on such semantic representations.

⋆ The research presented here was partly supported by the ANI/3279/2016 grant, by the Infrastructure for the Science and Technology of the Portuguese Language (PORTULAN/CLARIN), and by a Juan de la Cierva grant (IJCI-2016-29598).

(7)

2 LX-DSemVectors embeddings and LX-LR4DistSemEval evaluation datasets

LX-DSemVectors [7] are the first publicly available word embeddings for Portuguese. Trained over a corpus of 1.7 billion words, these embeddings were evaluated on the LX-4WAnalogies dataset [7], a translation of the de facto standard English dataset for analogies [4], and were found to perform at the level of the state of the art.

LX-LR4DistSemEval [5] is a collection of datasets adapted via translation from various English gold standard datasets for different mainstream evaluation tasks for embeddings, namely the analogy task, the conceptual categorization task, and the semantic similarity task. These datasets provide a standard way to intrinsically evaluate and compare distributional semantic models for Portuguese.

3 LX-SemanticSimilarity

The embeddings in LX-DSemVectors require nearly 6 GB of RAM when loaded, making them infeasible to use on many desktop computers. We have found that loading them on a server and accessing them through a web service helps to neatly circumvent this issue.

3.1 Web service

The LX-SemanticSimilarity web service exposes an API with operations commonly used on word embeddings, namely: getting the (cosine) similarity between two words, and also between two sets of words; finding the top-n most similar words, with the option of specifying words that contribute positively and words that contribute negatively; and getting the n words closest to a given word.

The server works as an XML-RPC wrapper around the gensim [6] library.

Having a standard protocol like XML-RPC makes it easy to use any of a variety of programming languages on the client side, as seen in the following Python example, which queries the service for the similarity between the words "batata" (potato) and "banana":

    import xmlrpc.client

    lxsemsim = xmlrpc.client.ServerProxy(url)
    result = lxsemsim.similarity("batata", "banana")
    print(result)
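On the server side, the paper states only that it wraps gensim behind XML-RPC. A minimal sketch of such a wrapper, assuming the standard library's xmlrpc.server module and a word2vec-format embeddings file (the file name, port, and function bodies are illustrative, not the actual service code), could look like this:

    # Hedged sketch of an XML-RPC wrapper around gensim; the file name and
    # port are illustrative, not the actual LX-SemanticSimilarity deployment.
    from xmlrpc.server import SimpleXMLRPCServer
    from gensim.models import KeyedVectors

    # Load the embeddings once, server-side, so clients need no local copy.
    vectors = KeyedVectors.load_word2vec_format("lx-dsemvectors.bin", binary=True)

    def similarity(w1, w2):
        # Cosine similarity between two words; cast for XML-RPC marshalling.
        return float(vectors.similarity(w1, w2))

    def most_similar(positive, negative, topn):
        # Top-n neighbours, given positively and negatively contributing words.
        return [(w, float(s)) for w, s in
                vectors.most_similar(positive=positive, negative=negative,
                                     topn=topn)]

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(similarity)
    server.register_function(most_similar)
    server.serve_forever()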

3.2 Online service and demo

The LX-SemanticSimilarity online service/demo (http://lxsemsimil.di.fc.ul.pt/) is built on top of the web service and showcases some simple examples of possible applications of embeddings. Users are presented with two modes of operation: they can either (i) provide two words to see their distance and an interactive visualization of their surrounding vector space; or (ii) provide a single word to see a list of the words most similar to it, in tabular format and as a word cloud. The outputs of these two modes are exemplified in Figure 1.

Fig. 1. Examples of output by the LX-SemanticSimilarity online service and demo

The first mode is supported by the t-SNEJS JavaScript library,³ an implementation of the t-SNE [3] dimensionality reduction technique, while the word-cloud image is generated with the wordcloud⁴ Python package.

References

1. Baroni, M., Bernardi, R., Zamparelli, R.: Frege in space: A program for compositional distributional semantics. Linguistic Issues in Language Technology 9, 241–346 (2014)

2. Li, J., Jurafsky, D.: Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070 (2015)

3. van der Maaten, L., Hinton, G.: Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)

4. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

5. Querido, A., de Carvalho, R., Rodrigues, J., Garcia, M., Correia, C., Rendeiro, N., Pereira, R.V., Campos, M., Silva, J., Branco, A.: LX-LR4DistSemEval: A collection of language resources for the evaluation of distributional semantic models of Portuguese. Revista da Associação Portuguesa de Linguística 3 (2017)

6. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50 (2010)

7. Rodrigues, J., Branco, A., Neale, S., Silva, J.: LX-DSemVectors: Distributional semantics models for the Portuguese language. In: Proceedings of the 12th International Conference on the Computational Processing of Portuguese (PROPOR'16). pp. 259–270 (2016)

3 https://github.com/karpathy/tsnejs

4 https://github.com/amueller/word_cloud


TEITOK: TEI for corpus linguistics

Maarten Janssen

CELGA-ILTEC

Abstract. TEITOK is an online environment for building, maintaining, and searching TEI-based corpora. It provides a wide range of tools to work with different corpora, for instance for visualizing manuscript transcriptions alongside their facsimile images, searching the corpus using CWB, visualizing documents on the world map, searching dependency relations, visualizing time-aligned spoken corpora, etc. This demo will display a number of key uses of the TEITOK framework.

Keywords: Dependency grammar, TEI/XML, Corpus annotation

1 Introduction

TEITOK [1] is an online platform for visualizing, searching, and editing corpora in which the corpus texts are kept in the TEI/XML format, a rich and widely used XML standard for digital texts (http://www.tei-c.org). Unlike plain text-based corpora, TEI-based corpora can keep all the typographic information the original contains, with a very rich set of annotations defined to account for the demands of very different types of corpora. This makes TEI the preferred framework for corpora where detailed annotation is crucial, for instance historical corpora where detailed paleographic annotation is essential, spoken corpora where pauses and truncations are highly relevant, and learner corpora where the annotation of corrections and errors by the student is of primary importance.

TEITOK provides an online platform that makes the notoriously difficult task of using TEI for linguistic annotation easier by providing a visualization for TEI documents and by using generic scripts to interact with a number of computational tools behind the scenes. These scripts allow you to tokenize, annotate, and parse the TEI documents in the browser, all with simple buttons. It can also export the collection of TEI files to a CWB corpus [2], making the corpus searchable. And it provides the option to edit the metadata in the XML file, as well as the annotations on a given token, using a simple HTML form. This allows researchers from a wide range of linguistic areas to build annotated corpora without needing detailed knowledge of the processes behind the scenes.

TEITOK in principle assumes that you have created a TEI document with any of the many available TEI writing tools, and takes over from there, allowing you to add linguistic annotation to the XML to enrich it and turn it into a searchable corpus. However, it also has the option to create specific types of TEI documents directly from within the interface. And it comes with a set of scripts to convert various formats into a TEITOK corpus.

(10)

TEITOK can be used with virtually any type of TEI document, independently of the purpose for which the document was created. It can hence make a wide variety of TEI documents tokenized, editable, and searchable. To make TEITOK even more usable for specific linguistic areas, it has various dedicated visualization modules that turn it into a dedicated tool for historical corpora, learner corpora, spoken corpora, and LRL corpora.

During this demo we will show how useful TEITOK is for each of these areas, showing examples from the growing number of corpora using the system.

TEITOK is freely available software that can be installed on a local server. More information about how to obtain the tool, as well as links to a growing number of projects using the system, can be found at the website http://www.teitok.org.

2 Dedicated modules

TEITOK has a modular design in which the same XML file can be displayed and edited in a number of different ways, depending on its content. Below is a selection of tools for two specific types of documents: transcriptions of manuscripts and transcriptions of sound. The modules listed are not mutually exclusive: learner corpora often combine transcriptions of written and oral exams, where TEITOK can combine both types into a single corpus while keeping their very different nature. They could even be used on the very same XML file if we have a manuscript that was read out aloud. And these modules can then be combined with any number of additional modules to add dependency relations, interlinear glossed texts, visualization of documents on the world map, and many more.

2.1 Manuscript transcription

For corpora that are transcriptions of manuscripts, typically either historical corpora or learner corpora, TEITOK provides a number of facsimile-oriented options. Firstly, TEITOK can store multiple orthographic realizations for each token, for instance the original spelling and the modernized orthography. In the text visualization, you can select which of those versions to display. This makes it possible to switch between, say, a paleographic, a critical, and a normalized version of the same text with a simple click, all built from a single XML source.
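As a rough illustration of such layered tokens, a TEITOK-style TEI token might look as follows; the attribute names here (nform, lemma, pos) are illustrative assumptions rather than the tool's documented schema:

    <!-- Hypothetical TEITOK-style token: the element content keeps the
         paleographic form, while attributes hold alternative orthographic
         realizations that the interface can switch between. -->
    <tok id="w-42" nform="escrito" lemma="escrever" pos="PCP">eſcripto</tok>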

TEITOK can also keep track of bounding boxes for each XML node, i.e., which part of the facsimile image corresponds to the paragraph, manuscript line, or word. When lines (line breaks in TEI) contain bounding boxes, it can present the document in an interleaved version, showing the transcription of each line below a cut-out of the manuscript (see http://alfclul.clul.ul.pt/teitok/junius11).

It also provides an option to split a facsimile image into lines and then transcribe the page line by line, keeping track of the progress of the transcription.

Because TEITOK displays the manuscript line directly above the transcription in this way, the transcription process is not only quicker but often also more accurate, since you get direct visual verification.


When each token is provided with bounding boxes, it can also provide a visualization similar to a searchable PDF: an image with a hidden text layer, allowing you to search directly in the image, select text from the image, and get information about each token when moving the mouse over a word.

2.2 Spoken corpora

For spoken corpora, TEITOK provides a number of options to create and visualize time-aligned transcriptions. When the sound file corresponding to a transcription is provided in the metadata, TEITOK can display the sound file above the text transcription, and if the utterances are time-aligned, it can produce a play button in front of each utterance. To create a time-aligned TEI document, TEITOK provides a script to convert EXMARALDA [3] files into the TEI format, turning segments on the tiers into utterances, ordered in an interview-style manner.

When utterances are time-aligned, searching through the CWB corpus will not only provide a list of resulting utterances, but also allow you to directly listen to the corresponding sound fragment. This makes it easy to find spoken examples in the corpus based on orthographic clues.

To get a more dedicated speech-driven interface, TEITOK also provides an interface similar to that found in speech software like ELAN [4] or Praat [5]: a waveform image with the transcription below. The utterances are ordered vertically and scroll along as the sound plays, and you can click on any utterance to listen to the corresponding sound.

The system also allows you to create a time-aligned transcription directly in the TEITOK interface: you can select a segment on the timeline, create an utterance for it, and type in the transcription in an HTML form. This way, TEITOK provides a quick interface to create time-aligned spoken corpora in TEI.

References

1. Janssen, M.: TEITOK: Text-faithful annotated corpora. In: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (2016) 4037–4043

2. Evert, S., Hardie, A.: Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In: Corpus Linguistics 2011 (2011)

3. Schmidt, T.: EXMARaLDA – ein Modellierungs- und Visualisierungsverfahren für die computergestützte Transkription gesprochener Sprache. In: Buchberger, E. (ed.): Proceedings of Konvens 2004. Volume 5 (2004)

4. Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., Sloetjes, H.: ELAN: a professional framework for multimodality research. In: Proceedings of LREC 2006 (2006)

5. Boersma, P., Weenink, D.: Praat, a system for doing phonetics by computer. Glot International 5(9/10) (2001) 341–345


DeepBonDD: a Deep neural approach to Boundary and Disfluency Detection

Marcos Treviso¹, Anderson Smidarle², Lilian Hubner², and Sandra Aluisio¹

¹ Institute of Mathematics and Computer Science, University of São Paulo, Brazil
² Pontifical Catholic University of Rio Grande do Sul, Brazil

marcostreviso@usp.br, and.dick@gmail.com, lilian.c.hubner@gmail.com, sandra@icmc.usp.br

Abstract. In this paper, we present DeepBonDD, a web application that segments transcripts and detects the disfluencies they contain. Applying DeepBonDD to transcripts allows the further application of natural language processing tools that depend on well-formed texts, such as taggers and parsers.

Keywords: Sentence Segmentation · Disfluency Detection · Impaired Speech

1 Introduction

In recent years, mild cognitive impairment (MCI) has received great attention because it may represent a preclinical stage of Alzheimer's Disease (AD). Several studies have shown that speech production is a task sensitive enough to detect aging effects and to differentiate individuals with MCI from healthy ones. Automatic linguistic analysis tools have been applied to transcripts of narratives in English [2] and also in Brazilian Portuguese [1]. However, the absence of sentence boundary information and the presence of disfluencies in transcripts prevent the direct application of Natural Language Processing (NLP) methods that depend on well-formed texts, such as taggers and parsers. Fig. 1 shows a transcript from a neuropsychological retelling task, which includes neither capitalization nor sentence segmentation, and contains disfluencies.

The Sentence Segmentation (or sentence boundary detection) task can be seen as a specific case of the punctuation recovery task, which attempts not only to detect sentence boundaries but also the types of punctuation that occur in these places. The Disfluency Detection task is concerned with finding regions of disfluencies and categorizing them into their types, such as: (i) fillers, which are usually used by the interlocutor to indicate hesitation or to keep control of a conversation, e.g. "ah, hm, bom, então, digo"; and (ii) edit disfluencies, which occur when the interlocutor makes a statement that is not complete or correct and therefore corrects or changes it, e.g. "pro castelo na verdade ela vai trabalhar no castelo né" in Fig. 1.

Here, we present DeepBonDD, a web application that segments transcripts [3] and detects the disfluencies they contain [4].


cinderela a história da cinderela... ela:: encontra um cavaleiro com com um cavalo dai ela fica amiga desse cavalo tudo isso é próximo de um castelo e ela vai pro castelo pro castelo na verdade ela vai trabalhar no castelo né e ela começa a fazer lá...

Fig. 1: Narrative excerpt transcribed using the NURC annotation manual³

2 DeepBonDD

The full pipeline of DeepBonDD can be seen in Fig. 2. After receiving a clean transcript, i.e., one with no punctuation marks and in lower case, the processing steps of DeepBonDD are: (1) call the sentence boundary detector and the filler remover (filled pauses and discourse markers only) in parallel; (2) combine the output of these two processes, that is, insert the sentence boundary signals in their proper places and remove the filled pauses that were classified as fillers (a minimal sketch of this combination step is given after Fig. 2); (3) apply the edit-disfluency remover (repetitions and revisions), generating a final transcription that is segmented and free of disfluencies.

Fig. 2: Processing steps in DeepBonDD
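The combination step (2) can be pictured with a short sketch. This is a hedged Python illustration: the per-token detector outputs and the combine function are hypothetical stand-ins, not DeepBonDD's actual API:

    # Hedged sketch of DeepBonDD's combination step (2); the detector outputs
    # below are hypothetical stand-ins, not the system's actual API.
    def combine(tokens, boundary_after, filler_flags):
        """Insert sentence boundaries and drop tokens flagged as fillers.

        tokens         -- transcript tokens, e.g. ["bom", "ela", ...]
        boundary_after -- per-token booleans from the boundary detector
        filler_flags   -- per-token booleans from the filler remover
        """
        out = []
        for tok, boundary, is_filler in zip(tokens, boundary_after, filler_flags):
            if not is_filler:   # drop filled pauses and discourse markers
                out.append(tok)
            if boundary:        # insert a boundary signal after this token
                out.append(".")
        return " ".join(out)

    # Toy detector outputs: "bom" is a filler; a boundary follows "castelo".
    print(combine(["bom", "ela", "vai", "pro", "castelo"],
                  [False, False, False, False, True],
                  [True, False, False, False, False]))
    # -> "ela vai pro castelo ."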

The DeepBonDD web interface for the full pipeline can be seen in Fig. 3a before the processing steps, and in Fig. 3b after applying the pipeline to the transcript. Furthermore, the user can detect sentence boundaries, fillers, or edit disfluencies independently. The user can also select configurations for each detection, choosing whether or not to use part-of-speech tags and handcrafted features, and choosing among the available trained models (such as MLP, RCNN, and RCNN with CRF) to deal with lexical and audio information. To detect filled pauses, an optional list of filled pauses to be removed can be uploaded by the user.

3 http://www.letras.ufrj.br/nurc-rj/

Fig. 3: DeepBonDD's web interface: (a) before and (b) after processing

3 Evaluation and Demonstration

In the demonstration session to be held at PROPOR 2018, attendees will be able to use DeepBonDD's web interface to segment narratives into sentences and to remove disfluencies from them. We will make available three datasets with narratives in Brazilian Portuguese: (i) the Cinderella Narrative Dataset: 60 narrative samples (20 subjects with MCI, 20 with AD, and 20 normal elderly control subjects (CTL)) from a narrative production test based on 22 pictures; (ii) the Dog Story Dataset: 10 narrative transcripts (6 CTL and 4 MCI) from a narrative production test based on seven pictures; (iii) Lucia's Story: 10 narrative transcripts (5 CTL, 2 MCI, and 3 AD) from the retelling test of BALE (Bateria de Avaliação da Linguagem no Envelhecimento).

References

1. Aluísio, S.M., da Cunha, A.L., Scarton, C.: Evaluating progression of Alzheimer's disease by regression and classification methods in a narrative language test in Portuguese. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds.) International Conference on Computational Processing of the Portuguese Language. pp. 109–114. Springer (2016)

2. Lehr, M., Prud'hommeaux, E.T., Shafran, I., Roark, B.: Fully automated neuropsychological assessment for detecting mild cognitive impairment. In: INTERSPEECH. pp. 1039–1042 (2012)

3. Treviso, M., Shulby, C., Aluísio, S.: Evaluating word embeddings for sentence boundary detection in speech transcripts. In: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology. pp. 151–160 (2017)

4. Treviso, M.V.: Segmentação de sentenças e detecção de disfluências em narrativas transcritas de testes neuropsicológicos. Master's thesis, Universidade de São Paulo (2017), http://www.teses.usp.br/teses/disponiveis/55/55134/tde-05022018-090740/


CL-CONLLU

Universal Dependencies in Common Lisp

Alexandre Rademaker¹,², Fabricio Chalub¹, Bruno Cuconato², Henrique Muniz¹,², and Guilherme Paulino-Passos¹,³

1 IBM Research

2 FGV/EMAp

3 PESC/COPPE/UFRJ

Abstract. The growing interest in the Universal Dependencies project for creating corpora in different languages, using a common set of morphological and syntactic tags, has created among the research groups involved in corpus creation and maintenance a demand for tools for editing, correcting, and displaying syntactic trees. Here we present cl-conllu, a Common Lisp library for manipulating CoNLL-U files, the file format used by the Universal Dependencies project.

1 Introduction

The use of different tags for morphological and syntactic annotations, as well as different annotation conventions, makes it difficult to develop multi-language syntax analysis tools and to study common linguistic phenomena between different languages [3]. To solve this problem, the Universal Dependencies (UD) project aims to create consistent linguistic annotations between different languages. Recently, the UD project launched version 2.0 of its treebanks [4], already used in the shared task of the Conference on Computational Natural Language Learning (CoNLL 2017).

The advancement of the UD project demands tools to help with corpus maintenance. In particular, as part of the UD-Portuguese-Bosque [6] corpus maintenance effort, we developed a library called cl-conllu for manipulating CoNLL-U files in the Common Lisp (CL) language. The library provides features such as reading and writing CoNLL-U files, annotation validation, batch transformations, queries, the production of different views of syntax trees, evaluation of annotations, and comparison of different annotations of the same sentence.

2 The CoNLL-U format

Following a syntactic model of dependencies, UD considers that each word is dependent on some other word (except for the root of the phrase), called its head, through a specific dependency relation. Besides, by its adoption of lexicalism, in UD the basic units of annotation are syntactic words (not orthographic or phonological words) [3]. Hence, contractions and clitics are split: for example, "do" is broken into the two tokens "de" + "o".

For the representation of annotations following these principles, the CoNLL-U format was developed. Each file can contain multiple sentences, separated by a blank line. Sentences start with metadata, which is then followed by one word per line, each comprising 10 tab-separated fields, such as sequential numbering (ID), original form in the text (FORM), lemma (LEMMA), the UD PoS tag (UPOS), morphological attributes (FEATS), index of its head token (HEAD), and the universal dependency relation (DEPREL). Multi-word tokens (orthographic tokens that have been broken into more than one word) also receive a line of their own.
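For illustration, a minimal CoNLL-U fragment for the contraction "do" might look as follows (most fields abridged with underscores; the 1-2 line is the multi-word token line spanning the two syntactic words):

    # text = do castelo
    1-2   do        _         _      _  _  _  _     _  _
    1     de        de        ADP    _  _  3  case  _  _
    2     o         o         DET    _  _  3  det   _  _
    3     castelo   castelo   NOUN   _  _  0  root  _  _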

3 The cl-conllu library

The primary data structures of the library are the sentence, token, and multiword-token classes. A sentence has as its attributes ('slots' in CL) the list of its tokens and the list of its multiword tokens. We chose to keep the tokens and multiword tokens in separate lists to facilitate the use of these structures by various other library functions.

The functions read-conllu and write-conllu read and write CoNLL-U files, respectively. The first receives a 'string' or an object of class pathname⁴ and returns a linked list of objects of the class sentence. The write-conllu function receives a linked list of sentence objects and a file name and writes the sentences to the file. Among format conversions, the library currently supports the conversion of CoNLL-U files to Prolog and RDF.

Three significant recent additions to the library are: (1) a rule language to facilitate batch transformations; (2) visualization of syntactic trees; and (3) a query language over syntax trees. These recent additions are the focus of this article.

Starting with visualization, the function tree-sentence receives a sentence and a 'stream' and produces a vertical tree showing the token connections. This function was inspired by a similar function in the UDAPI library [5].

To allow batch transformations, the apply-rules-from-file function has been implemented. This functionality was inspired by the program 'Corte e Costura' [2] and was built for batch correction of annotations (syntactic and morphological). The function receives a list of rules, a CoNLL-U file to be read, and a CoNLL-U file to be generated. It also produces a log file of the rule applications. Listing 1.1 presents a rule with more than one pattern, with variables, on the left-hand side, followed by a list of conditions. The variables are CL identifiers beginning with the character "?". The conditions are formed by an operator, a token field that we are interested in testing, and a string that can be a regular expression.

4 In CL, the ‘pathname’ class represents a path in the operating system’s file system [7].


Listing 1.1. Example of rule

    (=> ((?a (match lemma "[aA]te"))
         (?b (= lemma "entao")))
        ((?a (set upostag "ADV"))
         (?b (set upostag "ADV"))))

The query function operates over the trees. It was created to facilitate locating sentences that match a given pattern in the corpus. The query language was inspired by [1].

Listing 1.2. Query example

    CL> (query '(nsubj
                  (advcl
                    (and (upostag "VERB") (lemma "correr"))
                    (upostag "VERB"))
                  (upostag "PROP"))
               *sentences*)

4 Conclusion

We intend to continue adding features to the library, such as: (1) better support for sentence validation; (2) expansion of the rule language with support for variables over expressions and not just variables for tokens, possibly combining the query language with the rule language; and (3) support for interactive sentence editing and other forms of syntactic tree visualization. Finally, we intend to add even more test cases to increase the robustness of the library. The library and its source code can be downloaded from the http://github.com/own-pt/cl-conllu repository, and initial documentation is available in the same repository.

References

1. Luotolahti, J., Kanerva, J., Pyysalo, S., Ginter, F.: SETS: Scalable and efficient tree search in dependency graphs. Proc. of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations (2015)

2. Mota, C., Santos, D.: Corte e costura no AC/DC: auxiliando a melhoria da anotação nos corpos (Sep 2009), http://www.linguateca.pt/acesso/corte-e-costura.pdf

3. Nivre, J., de Marneffe, M., Ginter, F., Goldberg, Y., Hajic, J., Manning, C.D., McDonald, R.T., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., Zeman, D.: Universal Dependencies v1: A multilingual treebank collection (2016)

4. Nivre, J. et al.: Universal Dependencies 2.0 (2017), http://hdl.handle.net/11234/1-1983, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University

5. Popel, M., Žabokrtský, Z., Vojtek, M.: Udapi: Universal API for Universal Dependencies. Proc. of the NoDaLiDa 2017 Workshop on Universal Dependencies (May 2017)

6. Rademaker, A., Chalub, F., Real, L., Freitas, C., Bick, E., de Paiva, V.: Universal Dependencies for Portuguese. Proc. of the Fourth International Conference on Dependency Linguistics (Depling) (Sep 2017)

7. Steele Jr., G.L.: Common Lisp: the Language. Digital Press (1984)


Automatically Generating Temporal Proto-Narratives From Portuguese Headlines

Arian Pasquali¹,², Vítor Mangaravite¹, Ricardo Campos¹,³, Alípio Mário Jorge¹,², and Adam Jatowt⁴

¹ LIAAD – INESCTEC
vima@inesctec.pt

² FCUP, University of Porto, Portugal
{arian.pasquali,amjorge}@fc.up.pt

³ Polytechnic Institute of Tomar – Smart Cities Research Center, Portugal
ricardo.campos@ipt.pt

⁴ Kyoto University, Japan
adam@dl.kuis.kyoto-u.ac.jp

Abstract. Conta-me Histórias (Tell me stories) is an online tool that allows users to automatically generate proto-narrative timelines based on search queries over the Portuguese Web Archive. Our approach relies on a keyphrase extraction algorithm and a peak detection method to select relevant stories over time. It offers a friendly user interface that enables users to study and revisit topics from the past, thus providing a different perspective on historical narratives.

Keywords: Information Retrieval · Temporal Summarization

1 Introduction

Despite the latest advances in the field of natural language processing, generating consistent narratives is still an open problem, one that has been attracting increasing attention from the research community [4], Portuguese included, over the last few years. In this work, we aim to take a step forward in this promising research area by proposing an unsupervised method to automatically generate a proto-narrative timeline from a (temporal) collection of news headlines. In our work, a proto-narrative is a rudimentary narrative made of weakly connected and temporally ordered sentences.

Conta-me Histórias² is an online application that allows users to explore topics and relevant events throughout time without having to read an entire collection of news articles. This may be very useful not only for journalists looking for historical information, but also for anyone interested in researching forgotten stories. Such a tool may work as an innovative way to explore past events, contributing to a better-informed society.

2 http://bit.ly/ContameHistorias


2 Proposed Solution

In Conta-me Histórias, users interact with the system through a friendly interface where it is possible to specify a query (a free-text field) and a time interval (last five, ten, or twenty years) to build an interactive timeline. In order to guarantee the plurality and diversity of the information, we consider news from 24 selected Portuguese news providers.

We make use of the Portuguese Web Archive API to obtain a set of news headlines related to the user's query. The Portuguese Web Archive is a third-party initiative that aims to preserve web pages collected from the web since 1996. It periodically collects and stores entire websites, processing the data to make it searchable, and provides a full-text search service that enables the retrieval of past versions of a site. In order to build the temporal narratives, we propose a framework to identify relevant headlines and important dates. This process can be described in three steps. (1) Time intervals: to determine the time intervals, we begin by dividing the timespan into 60 equi-width intervals (partitions). These partitions are used to find peaks of occurrences. The interval boundaries are then given by the smallest partition (lowest peak) between each pair of peaks. (2) Keyphrase candidate detection: for every time interval, we then select keyphrase candidates based on the relevance of each term that is part of the headline. To this end, we use an adapted version of the keyword extractor YAKE! [1, 2] which, unlike the original version, considers multiple (news) documents. (3) Deduplication: finally, we eliminate similar keyphrases in a deduplication analysis applying sequence-matcher similarity (a minimal sketch of this step is given below). As an add-on, we also provide a sentiment analysis tool to classify each headline according to its sentiment, and a named entity detection method to select related entities in order to present a word cloud with the most relevant names in the interface. For sentiment analysis, we apply a lexicon-based method that uses SentiLex-PT01 [3], a specialized lexicon for the Portuguese language. Finally, to perform named entity detection we apply PAMPO (PAttern Matching and POs tagging based algorithm for NER) [5]; the method relies on flexible pattern matching, part-of-speech tagging, and lexicon-based rules, and it was developed to process texts written in Portuguese.
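Since step (3) relies on sequence-matcher similarity, a short hedged sketch using Python's difflib may clarify it; the 0.8 threshold and the greedy keep-first strategy are illustrative assumptions, not the system's actual settings:

    # Hedged sketch of keyphrase deduplication via sequence-matcher similarity;
    # threshold and strategy are illustrative, not the system's actual settings.
    from difflib import SequenceMatcher

    def deduplicate(keyphrases, threshold=0.8):
        kept = []
        for phrase in keyphrases:
            # Keep the candidate only if it is sufficiently different from
            # every phrase already kept.
            if all(SequenceMatcher(None, phrase, k).ratio() < threshold
                   for k in kept):
                kept.append(phrase)
        return kept

    print(deduplicate(["impeachment de dilma", "impeachment de dilma rousseff",
                       "eleições no brasil"]))
    # -> ['impeachment de dilma', 'eleições no brasil']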

A snapshot of the result for the query "Dilma Rousseff" can be seen in Figure 1. The result shows the most relevant keyphrases selected by the YAKE! algorithm [2] for each time interval. Each keyphrase may be constituted by 3-grams up to 100. In addition, by hovering over the titles, we may not only access the publication date of the document, but also the archived web page, thus giving the user the possibility to enhance his/her knowledge about a given topic.

Fig. 1. Example of a resulting narrative timeline for the query Dilma Rousseff

3 Conclusion

In the era of post-truth and fake news, web archive initiatives are important contributions to preserving history. In this context, our demo may be considered an additional solution that allows users to better explore this kind of repository.

We believe that making this demo publicly available and accessible to everyone is an important contribution, fostering not only related research but also the user's search experience when looking for past events and summarizing complex information. Although it uses a web archive as its data source, this tool can be adapted to support different kinds of data sources. Therefore, in the future, we aim to test this framework on top of alternative collections.

4 Acknowledgements

This work is partially funded by the ERDF through the COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT as part of project UID/EEA/50014/2013.

References

1. Campos, R. and Mangaravite, V. and Pasquali, A. and Jorge, A.M. and Nunes, C. and Jatowt, A.: YAKE! Collection-independent automatic keyword extractor. ECIR, 806–810 (2018)

2. Campos, R. and Mangaravite, V. and Pasquali, A. and Jorge, A.M. and Nunes, C. and Jatowt, A.: A Text Feature Based Automatic Keyword Extraction Method for Single Documents. ECIR, 684–691 (2018)

3. Silva, M.J. and Carvalho, P. and Costa, C. and Sarmento, L.: Automatic Expansion of a Social Judgment Lexicon for Sentiment Analysis. Technical Report TR 10-08. University of Lisbon, Faculty of Sciences, LASIGE (2010)

4. Jorge, A. and Campos, R. and Jatowt, A. and Nunes, S.: First International Workshop on Narrative Extraction from Text (2016)

5. Rocha, C. and Jorge, A. and Sionara, R. and Brito, P. and Pimenta, C. and Rezende, S.: PAMPO: using pattern matching and pos-tagging for effective Named Entities recognition in Portuguese (2016)

6. Gomes, D. and Cruz, D. and Miranda, J. and Costa, M. and Fontes, S.: Acquiring and providing access to historical web collections. 10th International Conference on Preservation of Digital Objects (2013)


SMILLE: Supporting Portuguese as Second Language⋆

Leonardo Zilio¹, Rodrigo Wilkens¹, Maria José Finatto², and Cédrick Fairon¹

¹ Université catholique de Louvain, Belgium
{leonardo.zilio,rodrigo.wilkens,cedrick.fairon}@uclouvain.be

² Universidade Federal do Rio Grande do Sul, Brazil
mariafinatto@gmail.com

Abstract. This demo presents SMILLE for Portuguese, a system for enhancing pedagogically relevant grammatical structures that can help language learners to read Web-based texts and, at the same time, focus on grammatical structures that are important for their learning process.

Keywords: Second language acquisition · Text enhancements · Grammatical structures · Reading Assistant

1 Introduction

Research in the field of Second Language Acquisition (SLA) has shown that the mere presentation of raw input to a language learner is not enough to ensure that learning will take place [3]. One way of addressing the lack of salience in raw input, as suggested by Smith and Truscott [6], is the use of "input enhancements", so that the relevant linguistic information is highlighted. This "focus-on-form" strategy [1] has provided a new way to assist language learners, and some studies have shown that input enhancements represent a positive step in transforming input into intake (e.g. [4, 5]).

In this demo, we present the Smart and Immersive Language Learning Environment (SMILLE) in its version for Portuguese. SMILLE is a system that can automatically analyze and enhance written texts by employing Natural Language Processing (NLP) techniques to retrieve pedagogically relevant grammatical structures. These structures are directly linked to the different language levels described by the Common European Framework of Reference for Languages (CEFR) [7]. SMILLE was developed for a scenario in which the users are already taking second language classes and wish to continue the language learning activity by reading Web-based material that corresponds to their interests.

In this case, SMILLE can help not only with the text-understanding process, for it has built-in access to dictionaries and meaning-related information, but also with improving the users' awareness of the grammatical structures that correspond to their language level. As such, SMILLE can be seen as an application complementary to a language course, where the grammatical structures associated with the user's level will be in focus (by means of text enhancements), with the bonus of having a plethora of new vocabulary available, since it is designed to process any user-chosen, Web-based text.

⋆ Supported by the Walloon Region (Projects BEWARE 1510637 and 1610378) and Altissia International.

2 System Description

SMILLE was originally developed for English [8] and was later extended to Portuguese, in both cases understood as a foreign language for the learner. Following the idea of WERTi [3], it was designed to give users independence in choosing online reading material in the foreign language. Using the selected text, SMILLE provides a reading-assistant module that helps users notice linguistic content of the target language by enhancing (i.e., highlighting) language structures in context, while also offering the possibility of looking up meaning and word-class information.

SMILLE links the detected information to the guidelines of the CEFR [7], so that the grammatical enhancements are not limited to isolated linguistic structures but cover the needs of a given language level. By applying rules on top of the parser annotation, SMILLE also detects grammatical structures that are not directly retrieved from parsing.

To retrieve the relevant content of the chosen Web page, SMILLE crawls over the HTML structure and extracts its text content. This text content is then parsed with PassPort [9], a dependency parsing model for Portuguese trained with the Stanford parser [2]. The parsed text content is then analyzed with rules that create new tags for each relevant grammatical structure. After this process, a new Web page is constructed, showing the same information as the original one, but with newly encoded information that allows for on-the-fly modifications of the text.

Much of the grammatical information detected by SMILLE for Portuguese requires only that PassPort correctly analyze the word or structure in question. However, some structures are retrieved based on rules specifically written for them. As such, SMILLE combines the analysis done by the parser with hand-written rules to extract textual information that would not be easily identified, and would not be salient, in raw input. The grammatical annotation provided by SMILLE for Portuguese was tested on different genres in Zilio et al. [10] and achieved an overall mean precision of 84.07%.
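To picture how rules on top of the parser annotation can single out a pedagogically relevant structure, here is a hedged Python sketch; the token format and the clitic-pronoun rule are simplified stand-ins for SMILLE's actual rule set:

    # Hedged sketch of a rule over parser output; the token format and the
    # clitic rule are simplified stand-ins for SMILLE's actual rule set.
    def find_clitic_pronouns(tokens):
        """Return indices of clitic pronouns to enhance, given (form, pos) pairs."""
        CLITICS = {"me", "te", "se", "lhe", "o", "a", "nos", "vos"}
        hits = []
        for i, (form, pos) in enumerate(tokens):
            # Flag pronouns from the clitic set that attach to an adjacent verb.
            if pos == "PRON" and form.lower().lstrip("-") in CLITICS:
                neighbours = [tokens[j][1] for j in (i - 1, i + 1)
                              if 0 <= j < len(tokens)]
                if "VERB" in neighbours:
                    hits.append(i)
        return hits

    sentence = [("Ela", "PRON"), ("deu", "VERB"), ("-lhe", "PRON"),
                ("um", "DET"), ("livro", "NOUN")]
    print(find_clitic_pronouns(sentence))  # -> [2]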

SMILLE is divided into two modules: a reading assistant and a teaching assistant. The reading assistant is responsible for enhancing grammatical structures, providing access to language dictionaries, and giving grammatical explanations of each of the structures being displayed. The enhancements are done in real time, so that the user can change the highlighted structures on the fly. They modify the text in terms of color coding (font and background colors) and by changing it to boldface (Fig. 1). The choice of these three modifications is based on the results of [5], which showed that the use of three modifications was among the best ways of enhancing a text structure without saturation. The teaching assistant suggests what information should be displayed based on a user profile.

Fig. 1. Enhanced sentence highlighting the use of a clitic pronoun

SMILLE for Portuguese recognizes up to 71 pedagogically relevant grammatical structures in written texts. The rules for these structures encompass both the Brazilian and the European variants and are based on the CEFR levels from A1 to B2, so that each rule is linked to a specific level.

In this demonstration, the users will be able to see the system working on the Web pages that they choose. The interface will be presented via SMILLE's Chrome extension, and users will be able to choose different structures to highlight on the fly by interacting with SMILLE's menu. Access to online dictionaries will be available, as will some features of automatic exercise generation. SMILLE allows for two types of interfaces: one that preserves as much as possible of the original website, and another that extracts the raw text and presents it in a separate window; users will be able to interact with both of them.

References

1. Doughty, C.: Second language instruction does make a difference. Studies in Second Language Acquisition 13(04), 431–469 (1991)

2. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Association for Computational Linguistics (ACL) System Demonstrations. pp. 55–60 (2014), http://www.aclweb.org/anthology/P/P14/P14-5010

3. Meurers, D., Ziai, R., Amaral, L., Boyd, A., Dimitrov, A., Metcalf, V., Ott, N.: Enhancing authentic web pages for language learners. In: Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications. pp. 10–18. Association for Computational Linguistics (2010)

4. Plonsky, L., Ziegler, N.: The CALL–SLA interface: Insights from a second-order synthesis (2016)

5. Simard, D.: Differential effects of textual enhancement formats on intake. System 37(1), 124–135 (2009)

6. Smith, M.S., Truscott, J.: Explaining input enhancement: A MOGUL perspective. International Review of Applied Linguistics in Language Teaching 52(3), 253–281 (2014)

7. Verhelst, N., Van Avermaet, P., Takala, S., Figueras, N., North, B.: Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press (2009)

8. Zilio, L., Wilkens, R., Fairon, C.: Using NLP for enhancing second language acquisition. In: Proceedings of Recent Advances in Natural Language Processing. pp. 839–846 (2017)

9. Zilio, L., Wilkens, R., Fairon, C.: PassPort: A dependency parsing model for Portuguese. In: Proceedings of the International Conference on the Computational Processing of Portuguese (2018)

10. Zilio, L., Wilkens, R., Fairon, C.: SMILLE for Portuguese: Annotation and analysis of grammatical structures in a pedagogical context. In: Proceedings of the International Conference on the Computational Processing of Portuguese (2018)
