Beyond Readability: a corpus-based proposal for text difficulty analysis


UNIVERSIDADE FEDERAL DE MINAS GERAIS FACULDADE DE LETRAS

PROGRAMA DE PÓS-GRADUAÇÃO EM ESTUDOS LINGUÍSTICOS

FILIPE RUBINI CASTANO

BEYOND READABILITY:

A CORPUS-BASED PROPOSAL FOR TEXT DIFFICULTY ANALYSIS

BELO HORIZONTE 2018


Dissertation submitted to the Programa de Pós-Graduação em Estudos Linguísticos of the Faculdade de Letras, Universidade Federal de Minas Gerais, in partial fulfillment of the requirements for the degree of Master in Theoretical and Descriptive Linguistics.

Area of Concentration: Theoretical and Descriptive Linguistics

Research Line: (1D) Corpus-Based Linguistic Studies

Adviser: Prof. Dr. Heliana Ribeiro de Mello

Belo Horizonte

Faculdade de Letras da UFMG 2018


Catalog record prepared by the librarians of the FALE/UFMG Library

Castano, Filipe Rubini.

C346c Beyond readability [manuscript] : a corpus-based proposal for text difficulty analysis / Filipe Rubini Castano. – 2018.

210 p., bound : ill., color, tables.

Adviser: Heliana Ribeiro de Mello.

Area of concentration: Theoretical and Descriptive Linguistics. Research line: Corpus-Based Linguistic Studies.

Dissertation (master's) – Universidade Federal de Minas Gerais, Faculdade de Letras.

Bibliography: p. 201-210. Appendices: p. 123-200.

1. Linguistics – Theses. 2. Corpus linguistics – Theses. 3. Vocabulary – Theses. 4. Linguistics – Methodology – Theses. I. Mello, Heliana. II. Universidade Federal de Minas Gerais. Faculdade de Letras. III. Title.


A C K N O W L E D G M E N T S

After 290 exchanged e-mails, 14 research project drafts, 12 thesis drafts, and fewer actual meetings than she probably would have liked, it is an understatement to say how well my adviser, Prof. Heliana Ribeiro de Mello, has guided me in every step of the way, with blazing fast replies that were always filled with great suggestions; how she has given me sound advice as well as space to not follow it and fail terribly (which corresponds to all the mistakes and errors in this work); and, perhaps most important of all, taught me by example not only to be a better researcher but also a better human.

This project would not have existed without her support, patience, and encouragement, and I offer her, yet again, the 2-gram that is likely to be one of the most frequent and well dispersed in my interactions with her: thank you!

I would also like to thank:

• The defense committee members, who shared some of their precious time and experience in evaluating this work: Prof. Bárbara Malveira Orfanó, Prof. Ricardo Augusto de Souza, and Prof. Vander Viana, with special regard to Prof. Ricardo and Prof. Vander, who provided invaluable, helpful criticism;

• The six professors from whom I had the privilege of learning during the course: César Nardelli Cambraia, Larissa Santos Ciríaco, Heliana Ribeiro de Mello, Tommaso Raso, Rui Rothe-Neves, and Ricardo Augusto de Souza. Prof. Ricardo analyzed a version of this project that was about four times as long as mandated. Despite its length, he provided a carefully thought out and insightful assessment that contributed a lot to the project, for which I am very grateful.

[Margin note: See? What I wrote about the 2-gram was true.]

• My fellow colleagues in the Master’s program: Clarice Fernandes, Gustavo Fonseca, Eduardo Lacerda, Jessica Queiroz, Saulo Santos, Isabelle de Sousa, and João Souza, who were generous and supportive throughout.

• The professors who were my early inspirations at UFMG: Lúcia Fulgêncio, Heliana Mello, Laura Miccoli, and Adriana Tenuta.

[Margin note: Again, see?]

• Profs. Herzila Bastos, Ana Larissa Adorno Marciotto Oliveira, and Magda Veloso, for my 18-month stint as a trainee teacher of English at CENEX – FALE, which proved to be very enriching and influenced this work;

• My dear friends, whose enduring friendship, generosity, and kindness have been a privilege to experience:

– Mariana Lemos de Aquino;
– Thaiane Santos Araújo Braga;
– Marcela Diniz Guimarães;
– Kent James Polkinghorne;
– Mariana Ferreira Quintela;
– Allyson Mendes Rosa;
– Alisson Carvalho dos Santos;
– Rosane Santos Vilaça.

• The best family one could hope for: my parents, Kátia and Sérgio, my brother, Fábio, my grandmother, Sondina, my aunts and uncle Elaine, Márcia and Francklin, all of whom have supported, encouraged, and put up with me all this time.

Lastly, I would like to thank the unsung heroes who have, often for no financial reward, either put hundreds of hours of their time into making many of the tools that allowed this project to be pursued and this thesis to be written, or helped others to use them: Guido van Rossum for creating the Python programming language [1]; Steven Bird, Edward Loper, and Ewan Klein for NLTK [2]; Leslie Lamport for LaTeX [3]; Christian Schenk for MiKTeX [4]; Jonathan Kew for XeTeX; Hàn Thế Thành for Microtype [5]; André Miede and Ivo Pletikosić for the ClassicThesis [6] style; and the Stack Overflow [7] community, which helped me in most of the binds a programmer finds oneself in.

[1] https://www.python.org/
[2] https://www.nltk.org/
[3] https://www.latex-project.org/
[4] https://miktex.org/
[5] https://ctan.org/pkg/microtype?lang=en
[6] https://bitbucket.org/amiede/classicthesis/
[7] https://stackoverflow.com/


A B S T R A C T

Since the first half of the twentieth century (Flesch, 1948), the task of assessing text difficulty has been primarily tackled by the design and use of readability formulas in many areas: selecting grade level-appropriate books for schoolchildren (Spache, 1953), simplifying dense subjects, such as medical and legal texts (L. M. Baker et al., 1997; Razek et al., 1982), and, in more recent years, assisting writers in making themselves more understandable (Readable.io, n.d.). However, there is little empirical demonstration of the validity of readability formulas, as shown for instance in Begeny and Greene (2014), Leroy and Kauchak (2014), Schriver (2000), and Sydes and Hartley (1997), and many of the tools that are currently available for assessing text difficulty, e.g. ATOS for Text, ATOS for Books (n.d.), Miltsakaki and Troutt (2007), and Readable.io (n.d.), depend on those formulas to function. In addition, these tools are quite limited, meant to be used for a specific language, text type, and intended audience.

In this work, we develop a corpus linguistics-based, lexicon-oriented approach to propose a Text Difficulty Scale (TDS) which, unlike previous efforts, can be adapted for texts of virtually any language, including those that use non-Latin writing systems. To that end, we have used sounder statistical measurements, such as deviation of proportions (DP) (Gries, 2008, 2010); included 2-grams and 3-grams as sources of numerous yet often disregarded idioms and phrasemes (Bu et al., 2011, p. 3); and built a 60+ million token collection of Wikipedia articles in English for demonstration purposes. Furthermore, we have made our work available, free and open-source, as a set of Jupyter Notebooks in the Python programming language.

We argue that our proposal not only offers a much-needed flexible measurement of text difficulty, in particular for teachers and students of foreign languages, but also that it could be useful for researchers in cognitive linguistics and psycholinguistics, editors, writers, and children acquiring their first language.
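As a rough illustration of the two corpus measurements named in this abstract, the sketch below computes deviation of proportions (DP) under its usual definition: half the sum of the absolute differences between the share of a token's occurrences found in each corpus part and that part's share of the whole corpus. It also generates n-grams from a token list. The function names and toy numbers are assumptions made for illustration; this is not the code of the thesis's Jupyter Notebooks.

```python
def deviation_of_proportions(part_sizes, part_counts):
    """Return DP for a single token.

    part_sizes:  number of tokens in each corpus part (illustrative input).
    part_counts: occurrences of the token in each corpus part.
    DP is 0 when the token is spread exactly in proportion to part sizes
    and grows toward 1 as the token concentrates in a small part.
    """
    if len(part_sizes) != len(part_counts):
        raise ValueError("one count is needed per corpus part")
    corpus_size = sum(part_sizes)
    token_total = sum(part_counts)
    if corpus_size == 0 or token_total == 0:
        raise ValueError("corpus size and token frequency must be non-zero")
    return 0.5 * sum(
        abs(count / token_total - size / corpus_size)
        for size, count in zip(part_sizes, part_counts)
    )


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    sizes = [100, 100, 100]  # three equally sized, made-up corpus parts
    print(deviation_of_proportions(sizes, [5, 5, 5]))   # evenly spread token
    print(deviation_of_proportions(sizes, [15, 0, 0]))  # token in one part only
    print(ngrams("thank you very much".split(), 2))
```

An evenly spread token gets a DP near zero, while a token confined to a single part gets a much higher value, which is the property that makes dispersion a useful complement to raw frequency.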


R E S U M O

Since the first half of the 20th century (Flesch, 1948), the task of assessing text difficulty has been primarily tackled through the design and use of readability formulas in several areas: selecting books for children in particular school grades (Spache, 1953), simplifying complex subjects, such as medical and legal texts (L. M. Baker et al., 1997; Razek et al., 1982), and, in recent years, helping writers make themselves more intelligible (Readable.io, n.d.).

However, there is little empirical demonstration of the validity of readability formulas, as evidenced, for example, in Begeny and Greene (2014), Leroy and Kauchak (2014), Schriver (2000), and Sydes and Hartley (1997), and many of the tools available for assessing text difficulty, for example Miltsakaki and Troutt (2007), depend on those formulas to function. In addition, these tools are quite limited, made to be used with a specific language, text type, and audience.

In this work, we develop a corpus-based, lexicon-focused approach to propose a Text Difficulty Scale (TDS; in Portuguese, Escala de Dificuldade de Texto, EDT), which, unlike previous approaches, is adaptable to texts in virtually any language, including those that use non-Latin writing systems. To achieve this goal, we use sounder statistical measures, such as deviation of proportions (DP) (Gries, 2008, 2010); we include 2-grams and 3-grams as sources of numerous and often neglected expressions (Bu et al., 2011, p. 3); and we build a text collection of more than 60 million tokens from English Wikipedia articles, for demonstration. Furthermore, we make our open-source work freely available as a set of Jupyter Notebooks written in the Python programming language.

We argue that our proposal not only offers a flexible and much-needed measure of text difficulty, especially for teachers and students of foreign languages, but that it could also be useful for researchers in cognitive linguistics and psycholinguistics, editors, writers, and children acquiring their first language.


O N S T Y L E

POSLIN Resolution 03/2013 (UFMG, 2013), later reissued in April 2018, states that “the appropriate style guides” should be followed when writing a dissertation or thesis in a foreign language.

However, there are at least a dozen different style guides for publications in English-speaking countries, and each university either has its own specifications for the formatting of theses or recommends that its students choose one and stick to it, as is the case with MIT (Massachusetts Institute of Technology, n.d.).

We have chosen an author-year, Harvard-esque citation style that includes publication dates and can be parenthetical (Meade and Smith, 1991) or not, e.g. Meade and Smith (1991).

In the electronic (.pdf) version of this thesis, sections of the text in blue, green, and red work as hyperlinks, so it is possible to jump easily to the desired section (for instance, to check the bibliography). The bibliography, in turn, lists the pages of the thesis where the citations appeared. Acronyms work in a similar way – for instance, clicking on COCA takes you to the list of acronyms.

As to formatting, margin sizes, and typography, we have followed Bringhurst (2004)’s seminal recommendations as adapted by André Miede and Ivo Pletikosić for the ClassicThesis style.

This thesis will make use of margin notes which expand on or

[Margin note: This is an example of a margin note.]


C O N T E N T S

1 Introduction

I Theoretical underpinnings
2 The many facets of difficulty
  2.1 Difficulty as a concept
  2.2 Desirable difficulty
  2.3 Actual and perceived difficulty
  2.4 Relative text difficulty
  2.5 Non-relative text difficulty
  2.6 What about grammar?
  2.7 Vocabulary and proficiency
3 Readability
  3.1 Type A: Orthography-based formulas
  3.2 Type B: Familiarity-based formulas
  3.3 Effectiveness of readability formulas
    3.3.1 Correlations to the Oral Reading Fluency test
    3.3.2 Correlations to cloze scores
  3.4 Summary
4 Frequency and dispersion
  4.1 Word familiarity and word frequency
  4.2 Scientific evidence
    4.2.1 Kandula et al. (2010): lexical simplification
    4.2.2 Leroy, Kauchak, and Mouradi (2013): lexical simplification
    4.2.3 Leroy and Kauchak (2014): word frequency influences difficulty
    4.2.4 Martinez-Gómez and Aizawa (2013): a Bayesian approach
    4.2.5 Summary
  4.3 Word dispersion
    4.3.1 Word frequency vs. word dispersion
    4.3.2 Scientific evidence
5 Multiword expressions
  5.2 Summary
6 Polysemy
  6.1 Defining polysemy
  6.2 Polysemy and frequency
  6.3 Counting polysemy
  6.4 Accounting for polysemy
  6.5 Summary

II Methodology
7 Existing approaches
  7.1 Read-X
  7.2 Readable.io
  7.3 WordandPhrase.info
  7.4 ATOS
  7.5 DeLite
  7.6 Our approach
    7.6.1 User requirements
8 The text collection
  8.1 Size
  8.2 Text source
  8.3 Processing the dump file
  8.4 Going big or going deep?
  8.5 Text file structure
9 Choosing a lexical unit
  9.1 Word families
  9.2 Lemmas
  9.3 Alphabetical types
  9.4 Types
  9.5 Choosing a lexical unit
    9.5.1 What is a word?
    9.5.2 The learner’s perspective
    9.5.3 Tokenization and our definition of word
  9.6 Summary
10 From text to data
  10.1 Experiments on n-gram coverage
  10.2 Converting to lowercase
  10.3 From text to tokens
  10.4 N-gram generation
  10.5 The database
  10.7 Summary

III The Text Difficulty Scale
11 Difficulty at the word level
  11.1 Frequency, dispersion, and the data
  11.2 Measuring dispersion
  11.3 Calculation of DP
  11.4 The distribution of dispersion in the WP data
12 Difficulty at the text level
  12.1 Unique-to-total token ratio
  12.2 Occurrences and unique tokens
    12.2.1 2-grams and 3-grams
  12.3 Our proposal for the TDV
  12.4 Other factors in text difficulty
  12.5 Summary
13 Exploring text difficulty values
  13.1 Spoken–academic English
    13.1.1 Spoken data
    13.1.2 Academic data
    13.1.3 Calculating Text Difficulty Values
    13.1.4 Analysis results
  13.2 Further demonstrations
    13.2.1 The 475 random Wikipedia articles
  13.3 Spoken–academic Portuguese
    13.3.1 The spoken data
    13.3.2 The academic data
  13.4 Analysis results
  13.5 Summary
14 The Difficulty Highlighter
  14.1 Design considerations
  14.2 Johann Sebastian Bach in five languages
    14.2.1 English
    14.2.2 Portuguese, Japanese, Hebrew, and Persian
15 Final remarks

IV Appendix
A Extracting Wikipedia dump files
    A.1.1 Extracting the Wikipedia dump into separate files
    A.1.2 Choosing text files randomly
  A.2 Text data
B The Text Difficulty Analyzer code
  B.1 Automatic Wikipedia extractor
  B.2 The corpus builder
  B.3 The Text Difficulty Analyzer + Difficulty Highlighter
C Assembling the academic and spoken corpora
  C.1 Obtaining TDVs from the SketchEngine wordlists
  C.2 English
    C.2.1 The academic text corpus
    C.2.2 The spoken text corpus
  C.3 Example of the cleaning results for Portuguese
D Difficulty Highlighter translations
E Token lists
  E.1 English
  E.2 Portuguese
  E.3 Japanese
  E.4 Hebrew
  E.5 Persian

Bibliography


L I S T O F F I G U R E S

Figure 4.1 Average actual difficulty for words grouped by word frequency of occurrence, from Leroy and Kauchak (2014, e171).

Figure 6.1 Polysemy and frequency. Reproduced from Zipf (1949).

Figure 10.1 Concept map of the process of data extraction.

Figure 10.2 Two scenarios of n-gram extraction, with (top) and without (bottom) sentence tokenization. Inappropriate n-grams are shown in red, with strings inside single quotation marks.

Figure 12.1 Histograms of the number of unique 1-grams and total occurrences of 1-grams (y axis) and their ranges of DP values (x axis) for the Wikipedia (WP) article Miscegenation.

Figure 12.2 Histograms of the number of unique 2-grams and total occurrences of 2-grams (y axis) and their ranges of DP values (x axis) for the WP article Miscegenation.

Figure 12.3 Histograms of the number of unique 3-grams and total occurrences of 3-grams (y axis) and their ranges of DP values (x axis) for the WP article Miscegenation.

Figure 14.1 Johann Sebastian Bach Portuguese WP article highlighted for difficulty.

Figure 14.2 Johann Sebastian Bach Japanese WP article highlighted for difficulty.

Figure 14.3 Johann Sebastian Bach Hebrew WP article highlighted for difficulty.

Figure 14.4 Johann Sebastian Bach Persian WP article highlighted for difficulty.

Figure D.1 Johann Sebastian Bach Portuguese WP article translated into English.

Figure D.2 Johann Sebastian Bach Japanese WP article translated into English.

Figure D.3 Johann Sebastian Bach Persian WP article translated into English.

Figure D.4 Johann Sebastian Bach Hebrew WP article translated into English.

L I S T O F TA B L E S

Table 4.1 Breakdown of lemma composition in the Oxford English Corpus (OEC) according to the top x most frequent lemmas. Adapted from https://goo.gl/CvfPY1.

Table 9.1 Occurrences of play word forms in the Corpus of Contemporary American English (COCA).

Table 10.1 Overall information on the Wikipedia data in the first run of experiments with 129,000 articles.

Table 10.2 Overall statistics for the final Wikipedia 124,000-article text collection.

Table 11.1 Example of the calculation of DP in a corpus with unequally-sized parts for a token that occurs equally throughout the corpus.

Table 11.2 Example of the calculation of DP in a corpus with unequally-sized parts for a token that occurs in only one part.

Table 11.3 Example of the calculation of DP in a corpus with unequally-sized parts for a token that occurs only in the largest part.

Table 11.4 The number of unique tokens in each range of DP values in the WP data.

Table 12.1 Analysis of 475 WP articles according to their mean and median article lengths, average unique-to-total ratio, and DP.

Table 13.1 Results of the spoken–academic text comparison.

Table 13.2 Difference in text difficulty values (TDVs) after ascribing the value of 1.000 to tokens not found in the English data.

Table 13.3 Academic Portuguese data statistics, with the number of tokens estimated by SketchEngine.

Table 13.4 Results of the spoken–academic text comparison for English and Portuguese.

Table 13.5 Difference in TDVs after ascribing the value of 1.000 to tokens not found in the Portuguese data.

Table 14.1 Overall statistics for the Wikipedia text data in English, Portuguese, Japanese, Hebrew, and Persian.

Table C.1 Santa Barbara Corpus of Spoken American English (SBCSAE) transcription file before and after cleaning.

Table C.2 C-ORAL-BRASIL (COB) transcription file before and after cleaning.

Table E.1 Top 100 English 1-grams

Table E.2 Top 100 Portuguese 1-grams

Table E.3 Top 100 Japanese 1-grams

Table E.4 Top 100 Hebrew 1-grams

Table E.5 Top 100 Persian 1-grams

L I S T I N G S

Listing 10.1 Example of MongoDB document for the token serendipitous

Listing 10.2 Different queries for the token database

Listing A.1 Extracting subdirectories of WP dump into text files

Listing A.2 Choosing text files randomly

Listing B.2 The corpus builder

Listing B.3 The Text Difficulty Analyzer + Difficulty Highlighter

Listing C.1 Obtaining DP and TDVs from SketchEngine wordlists

Listing C.2 Choosing academic article files randomly

Listing C.3 Cleaning the Santa Barbara corpus files

A C R O N Y M S

BNC British National Corpus
CD contextual diversity
COB C-ORAL-BRASIL
COCA Corpus of Contemporary American English
DP deviation of proportions
MWE multiword expression
NLTK Natural Language Toolkit
OEC Oxford English Corpus
OED Oxford English Dictionary
ORF Oral Reading Fluency
RF relative frequency
SBCSAE Santa Barbara Corpus of Spoken American English
TDS Text Difficulty Scale
TDV Text Difficulty Value
UFMG Universidade Federal de Minas Gerais
WD word dispersion
WF word frequency


1

I N T R O D U C T I O N

Language teachers are often tasked with selecting authentic or appropriate pedagogical materials for their students (Okamoto, 2015, p. 9). They either rely on “readers”, usually simplifications of famous literary works for specific grade levels (Davanzo, 2016), or on their own instincts as teachers, facing the risk of under- or overestimating their students’ reading abilities.

This work originated from the intention to aid teachers and learners of languages in selecting texts whose difficulty better suits the proficiency level of the reader.

The concept of assessing text difficulty is not new; it has been around since at least the first half of the twentieth century, one early example being Flesch (1948). The first attempts to assess difficulty were readability formulas, which are used to this day. However, as we will show in chapter 3, despite the great number of readability formulas, they present little evidence of utility.

Therefore, we need a tool that is able to assess difficulty in a more reliable manner by taking into consideration the available scientific evidence, in particular the linguistics-related evidence. Most readability formulas are focused on children learning their first language (Benjamin, 2012, p. 83); instead, we mean to focus on teachers and learners of second or foreign languages.

We argue that such an endeavor is justifiable by its potential, among other things, to:

1. Restate the importance of vocabulary in language learning and second language acquisition, areas where grammar is often one of the main targets of interest (Cook, 2008, p. 6);

2. Aid learners in choosing reading materials appropriate for their current proficiency level, as well as assisting teachers and language professionals in that regard, by reducing or eliminating much of the work in guessing whether a text is difficult or not;

3. Inform the formulation of language courses, learning materials, and vocabulary books;

4. Make the comparative analysis of texts easier wherever word frequency, dispersion, and other vocabulary-related aspects are concerned.

Computer programs that attempt to assess text difficulty do exist; however, they present at best limited applicability. We outline and compare them to our own approach in chapter 7, p. 37. With the ultimate goal of constructing such a program in mind, this work is divided into three parts:

• In Part I, we examine the theoretical and empirical research in terms of the potential connections between corpus measurements – such as word frequency, word dispersion, and word familiarity – and text difficulty;

• In Part II, we apply what can be learned from such a research framework to the development of a flexible, hopefully improved methodology for text difficulty assessment;

• Finally, in Part III, we demonstrate our proposal for text difficulty assessment on two different languages, as well as our Difficulty Highlighter application on five different languages.


Part I

T H E O R E T I C A L U N D E R P I N N I N G S

The theoretical and empirical background to the notions of readability, corpus measurements such as word frequency and word dispersion, and their potential influence on text difficulty.


2

T H E M A N Y F A C E T S O F D I F F I C U L T Y

Everyone is a genius. But if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.

— Unknown author; often wrongfully attributed to Albert Einstein

One of the most deceptively simple concepts to define, in linguistics or any other field, is difficulty. It is easy to overlook or take for granted; for a work that relies so much on the concept, we must spend some time exploring its properties. In this section, we will discuss the concept of difficulty in more general (common sense) terms; what difficulty means in learning; and which linguistic factors, if any, could influence the difficulty level of a text.

[Margin note: This chapter is not a thorough visitation of the subject of difficulty. We will focus on the aspects of difficulty that could help in creating a text difficulty analyzer.]

2.1 Difficulty as a concept

The dictionary definition of difficulty is straightforward: “the state or condition of being difficult; (...) viz. hard to accomplish, deal with, or understand” (Press, n.d.).

We must, however, remember that difficulty can be relative. A fish has little difficulty swimming, yet will face insurmountable difficulty when attempting to climb a tree. Each individual has a different skill set or background, which influences how hard something is for that particular individual.

The relativity of difficulty presupposes that certain underlying factors can influence difficulty. In the tree-climbing fish example, the shape, size, and even instinct of the animal could influence its difficulty in climbing trees.

The dictionary definition, however, conveys a binary quality to difficulty: either something is difficult or it is not. Acknowledging the existence of factors that influence difficulty (i.e. increase or decrease it) allows for building a scale of difficulty, e.g. ranging from easy to difficult. Thus scales or degrees of difficulty depend on the factors influencing difficulty (the background of the task and of its actors, etc.) as well as on how the scale designer represents and assigns a weight to each of those factors.

Nevertheless, difficulty can also be independent of skill and/or background, i.e. non-relative. For two able-bodied identical twins with similar fitness backgrounds, climbing a step in a flight of stairs is much less difficult than climbing Mount Everest.

2.2 Desirable difficulty

Desirable difficulty is a concept in cognitive psychology. It proposes that, when designing the training for a given task, introducing difficulties for the learner increases retention and success in that task (Bjork, 1994, p. 189). The assumption behind desirable difficulty is that the more extensive the processing, and consequently the deeper the cognitive strategy, the better the learning. Desirable difficulty has not always been found to correlate significantly with higher performance, but it would be counterintuitive to claim that providing students solely with easy tasks (in our project’s case, unchallenging texts) will improve their learning.

[Margin note: For the effect of difficulty in performance, see Deslauriers et al. (2012), p. 82, for some examples of weak or insignificant correlations, and studies like Hughes et al. (2013) and Pyc and Rawson (2009) for significant correlations.]

We are of the opinion that there must be a balance: the desirable spot in text difficulty is somewhere between the overbearingly difficult and the tediously simple. Students of languages, especially, must be “injected” periodically with a healthy dose of new words and expressions. In order to do this, teachers and students need to be able to clearly separate what is too difficult from what is too easy.

2.3 Actual and perceived difficulty

In addition to being influenced by relative and non-relative factors, difficulty can be looked at in other ways. We can talk of perceived difficulty, i.e. the impressions of an individual regarding how difficult a given task is. Actual difficulty, on the other hand, is measured through tests that ascertain objectively how well the subject fared in the task, as in Leroy and Kauchak (2014).

Testing, we argue, is the gold standard to scientifically gauge actual difficulty, considering that it excludes the possibility that one may over- or underestimate their comprehension of the text, as one study (Martinez and V. A. Murphy, 2011) we discuss in section 5.1, p. 28, shows; yet, the psychological element of difficulty cannot be completely ignored.

Having established the different facets of difficulty (relative, non-relative, perceived, and actual), we must approach the subject of our work, text difficulty, investigating it from these four perspectives.

2.4 Relative text difficulty

Relative text difficulty is influenced by the social, economic, and linguistic background of the reader. For instance: a native speaker of Portuguese learning English as a foreign language will probably not find much trouble when encountering the word gingivitis, as it is similar to a Portuguese word of the same meaning, gengivite, thus potentially accelerating the learning of the word. An English native speaker would be more likely to know the expression gum swelling, as observed in Maylath (1997).

[Margin note: Lingo can even be used to increase difficulty, as in the case of Polari, 19th century slang shared by homosexual individuals in Britain as a disguise, as homosexuality was illegal at the time (P. Baker, 2003).]

Similarly, a white kid growing up in an upper-class family in Manhattan may have to rely on their context-awareness skills in order to understand some of the African-American English spoken in inner city ghettos (Bailin and Grafstein, 2001, p. 288).

Career choices also influence difficulty. For instance, a physicist will likely understand a paper on quantum tunneling [1] with less difficulty than the average person.

We acknowledge the importance of social factors in influencing difficulty. However, this work does not intend to dwell on them, for they are too numerous to control: imagine trying to adjust for factors like native tongue, gender, age, area of birth, ethnicity, income level, education level, number and subjects of books read, career history, travels undertaken, religion, and a myriad of other life experiences, in order to gauge the difficulty of a text for a person.

[1] Quantum tunneling could create new Big Bangs in the vacuum in the far future, giving rise to new universes (Carroll and Chen, 2004).

It is simply not feasible to go down this route if one considers the diversity of humankind: people can be many things at the same time, and a person who belongs to a group (say, Jews, African-Americans, atheists, etc.) will not necessarily know the lingo or jargon of that group.

On the other hand, one crucial facet of relative difficulty, in particular for language learners, is vocabulary size (I. Nation, 2006; P. Nation, 1997; P. Nation and Coady, 1988). If a student of English as a foreign language knows only a few thousand words – many of which we can assume are the most frequent words – they will probably struggle with some types of texts (contemporary and 20th century novels, academic and newspaper articles) and have an easier time with others (sitcoms, talk shows, informal conversation, etc.).

[Margin note: The most conservative study we could find for English vocabulary size (Goulden et al., 1990) arrived at a “native-like” vocabulary consisting of 17,000 words, whereas the other side of the spectrum found figures as large as 216,000 words (Diller, 1978), which of course depends on what is defined as a word.]
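The link between vocabulary size and how much of a text a reader can be expected to recognize can be illustrated with a small sketch. It assumes, as a simplification, that a learner knows exactly the n most frequent words of some reference corpus, and it measures the share of a text's tokens that fall inside that set. The function names and the toy sentences are hypothetical and serve only to make the idea concrete; they are not taken from the thesis.

```python
from collections import Counter


def top_n_words(corpus_tokens, n):
    """Return the set of the n most frequent tokens in a reference corpus."""
    return {word for word, _ in Counter(corpus_tokens).most_common(n)}


def coverage(text_tokens, known_words):
    """Return the fraction of tokens in a text that belong to known_words."""
    if not text_tokens:
        return 0.0
    known = sum(1 for token in text_tokens if token in known_words)
    return known / len(text_tokens)


if __name__ == "__main__":
    # Toy reference corpus standing in for a large text collection.
    reference = "the cat sat on the mat the cat".split()
    # Simulated learner vocabulary: the two most frequent words.
    vocabulary = top_n_words(reference, 2)
    text = "the dog sat on the cat".split()
    print(sorted(vocabulary))
    print(coverage(text, vocabulary))
```

With a realistic reference corpus, raising n from a few thousand words upward would show how quickly coverage grows for conversational material and how slowly it grows for academic or literary texts.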

2.5 n o n - r e l at i v e t e x t d i f f i c u lt y

It is a foregone conclusion that, independently of language back-ground, socioeconomic status, gender, etc., it is of paramount im-portance to learn the meanings of the building blocks of texts – words and expressions – either directly (through, for instance, word definitions) or indirectly (through context or other means), to understand them. If one’s vocabulary is limited, understanding a text becomes too difficult or outright impossible.

Having a diverse vocabulary (in this work used to mean the set of words an individual knows, in opposition to lexicon, which comprises all the words and expressions in a language) is important not only for children learning to speak and read, but also for those studying a second or foreign language (P. Nation and Coady, 1988). For the latter, especially as adults, the effort can be daunting, because the lexicon of any language has tens of thousands of words and expressions (emphasis added):

Knowing words is the key to understanding and being understood. The bulk of learning a new language consists of learning new words. Grammatical knowledge does not make for great proficiency in a language (Vermeer, 1992, p. 147)

Students of foreign languages face a particular struggle in learning new vocabulary. Although learning materials present abundant instruction on pronunciation, conversational strategies, grammar rules and numerous grammatical exercises, they do not usually present a sufficiently large or useful set of words, as demonstrated by O’Loughlin (2012). In fact, it is impractical to try to cram a language’s lexicon into didactic textbooks; not only would the books become unwieldy, but there is also a limitation on how much of English’s lexical diversity one can include in a language course, for reasons of time, pedagogical principles, etc.

2.6 What about grammar?

As shown previously, the often understated importance of vocabulary, and the amount of effort required to acquire it, is among the reasons why, in this project, we will focus on the lexicon rather than on grammar as a measure of difficulty.

Naturally, vocabulary deficiency is not the only factor in text difficulty. The target language’s grammar, and the errors that are bound to occur when one learns it, do shed an important light on the process of language acquisition; some grammatical constructions seem to be more difficult to process and learn than others, and there is a multitude of research on different languages and grammatical domains, such as argument realization and morphology (Boerma et al., 2017; Souza and Mello, 2007).

However, attempting to integrate different linguistic factors (for instance, the complexity of the morphosyntactic system – whether it has case, gender, number, etc., and the properties of these features; number of irregular verb forms; number and specific properties of verbal conjugation; and so on) into a unified “theory of difficulty” that would predict text difficulty for a wide range of languages would likely be unfruitful, as languages can be (and often are) very different from each other. There is an “amazing diversity of linguistic structures”, as persuasively argued by Evans and Levinson (2009, p. 445).

Thus, focusing on the lexicon allows a single framework to be applied to many languages – considering that one would be hard pressed to find a language without some sort of lexicon – making unnecessary the inclusion of specific details of the grammar of individual languages (as well as the assessment of learners’ degree of mastery of said details). The only requirement would be obtaining texts in the target language and finding or training a tokenizer for that language.

2.7 Vocabulary and proficiency

The difficulty in acquiring new vocabulary can be remedied by reading, in particular extensive reading, as demonstrated in the literature. According to P. Nation (1997), the benefits of expanding one’s own vocabulary through extensive reading are not limited to just how much one can understand of the language:

experimental studies have shown that not only is there improvement in reading, but that there are improvements in a range of language uses and areas of language knowledge (p. 16).

[Margin note: For studies on reading and its effects on language ability, see Saragi et al. (1978), Renandya and Jacobs (2016), Pigada and Schmitt (2006), and P. Nation and Coady (1988).]

By this principle, we could make the case that the more texts read and the more words an individual has learned², the better for their overall proficiency level.

However, it is in theory possible to learn a language more efficiently, since some words are more useful than others. One would probably not say that the noun aardvark is, in general, more useful than the verb eat for a language student. The problem becomes: how do we know which words are useful and which are not? And, by extension, which texts contain the most useful words for the language student?

2 In this work, knowing at least one definition of a word will be referred to interchangeably as “learning a word”, “acquiring a word”, “knowing a word”, etc. By those terms, we are referring to vocabulary acquisition by both children and adults, as well as to first and second language vocabulary acquisition.

3 READABILITY

Some of the earliest efforts in grading text difficulty were readability formulas. Readability can be defined as the measure of how easy a given text is to understand. Therefore, the more readable a text is, the easier it is. Traditional readability research concerns itself with the design of suitable readability formulas to ascertain how readable a given text is (Benjamin, 2012; Meade and Smith, 1991).

Readability formulas have been the de facto standard for gauging relative difficulty of texts. Their number is probably in the hundreds (Benjamin, 2012, p. 63), and their use is widespread. Some of their applications are laid out by Begeny and Greene (2014, p. 199):

• simplify texts that are perceived as hard to understand due to their subject matter, such as accounting textbooks (Razek et al., 1982), surgical consent forms (Grundner, 1980), or medical texts (L. M. Baker et al., 1997);

• select, simplify, and grade texts for young native speakers of English in school years (Spache, 1953);

• grade exams and entrance forms for the US Army (Sticht, 1973).

Formulas come in many shapes, sizes, and intended targets (Begeny and Greene, 2014). We will now discuss the two main types of readability formulas, which we have termed in this work Type A and Type B.

3.1 Type A: orthography-based formulas

This type, which includes the influential Flesch formula (Flesch, 1948), measures the number of syllables in words and how many words a sentence contains. In other words, this type of formula assumes that both long words and word-filled sentences have an impact on reading comprehension.

Using these criteria to establish readability is a questionable decision, argue Bailin and Grafstein (2001, p. 289):

First of all, there appear to be a significant number of instances where mono or bisyllabic words are more esoteric, more unfamiliar, than longer polysyllabic terms. Consider, for example, the short, monosyllabic curr, and the longer, morphologically complex, reinventing. The number of readers, even children, who know the latter term is quite likely greater than those who know the former. Is the word aardvark more familiar than unemployment? One would hardly think so. But again, the reading formulas in question would treat the latter as contributing more to the complexity of the text than the former.
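The objection above can be made concrete with a minimal Python sketch of a Type A formula. It uses the commonly cited Flesch Reading Ease constants; the syllable counter is a crude vowel-group heuristic of our own and the two test sentences are invented for illustration, so the scores are only indicative.

```python
import re

def count_syllables(word):
    """Rough estimate: count groups of consecutive vowels (a heuristic, not Flesch's procedure)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores are meant to indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Same sentence frame, one rare short word versus one common long word.
rare_short = flesch_reading_ease("The curr was heard.")
common_long = flesch_reading_ease("The reinventing was heard.")
print(rare_short, common_long)
```

Under these assumptions the sentence containing curr scores as easier than the one containing reinventing, which is precisely the behavior Bailin and Grafstein criticize.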

3.2 Type B: familiarity-based formulas

On the other hand, Type B readability formulas, such as the Dale-Chall (Chall and Dale, 1948) and its reincarnation 47 years later (Chall and Dale, 1995), take into consideration word familiarity. In the case of Dale-Chall, the more words in a text that do not belong to a list of the 3,000 most familiar words, the less readable the text is. We will discuss word familiarity and its related concepts, word frequency and word dispersion, in chapter 4.
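The core ingredient of a Type B formula, the share of words falling outside a familiar-word list, can be sketched as follows. The eight-word list and the example sentences are hypothetical stand-ins; the actual Dale-Chall list and the remaining terms of the formula are not reproduced here.

```python
import re

# Toy stand-in for a familiar-word list (hypothetical; the real list has about 3,000 entries).
FAMILIAR = {"the", "dog", "ate", "a", "big", "bone", "in", "garden"}

def unfamiliar_ratio(text, familiar=FAMILIAR):
    """Proportion of word tokens in `text` that are not on the familiar-word list."""
    words = re.findall(r"[a-z']+", text.lower())
    unfamiliar = [w for w in words if w not in familiar]
    return len(unfamiliar) / len(words)

easy = unfamiliar_ratio("The dog ate a big bone in the garden.")
hard = unfamiliar_ratio("The aardvark ate a big calyx in the garden.")
print(easy, hard)
```

A higher ratio would translate into a lower readability rating, regardless of how long or short the unfamiliar words are.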

3.3 Effectiveness of readability formulas

Despite their variety and extensive use, “the validity of readability formulas is inconclusive in the scientific literature” (Begeny and Greene, 2014, p. 201); “there is little evidence that readability formula outcomes relate to text understanding” (Leroy and Kauchak, 2014, p. 169); using readability formulas to perform revision of text has been shown to be unsuccessful in terms of improvements in comprehension (Duffy and Kabance, 1982); their limitations can make them quite misleading (Pichert and Elam, 1985), and a critical review (Schriver, 2000, p. 140) states that there are only two upsides to their use:

the formulas serve to remind people who would otherwise be unaware of issues of readability to consider them (...) [and] have served the very useful function of igniting debate among (...) researchers.

Before taking these researchers’ statements as fact, let us explore the evidence on readability formula effectiveness in a little more detail.

3.3.1 Correlations to the Oral Reading Fluency test

What is most surprising in regards to the empirical evidence on readability is that the majority of researchers in the field did not seem to correlate formulas to actual difficulty or even perceived difficulty; instead, they often made correlations to the Oral Reading Fluency (ORF) test, which simply assesses how well children were able to read texts aloud¹ (Begeny and Greene, 2014). Logically, being able to do so does not entail an understanding of the words being spoken, especially if the subjects are children.

[Margin note: A list of the ORF-related readability studies is available in Begeny and Greene (2014, p. 201).]

Even correlating readability formulas to such an inappropriate measurement of difficulty yielded little to no result in several studies, as stated in Begeny and Greene (2014, pp. 201–203), with only the Dale-Chall readability formula (Type B, word frequency-based) ultimately showing a correlation to ORF performance (Begeny and Greene, 2014, p. 209).

3.3.2 Correlations to cloze scores

Let us now examine the available evidence – which is in very short supply – when it comes to readability formulas and actual difficulty. The most practical way to assess the level of reading comprehension – and, by consequence, difficulty – seems to be cloze tests, which are constructed by a procedure similar to this:

replacing every fifth word in a passage with an underlined blank of a standard length. Students who have not read the intact passage are asked to write in each blank the word they think was deleted, and their responses are scored correct only when they exactly match the word deleted (Bormuth, 1971, p. 13).

The rationale is that cloze tests share the same underlying principles as traditional comprehension tests. However, counting as correct only an exact match with the original text seems a bit restrictive. For instance:

“I saw a man lay his jacket on a puddle for a woman crossing the street. I thought that was very ______”²

In this passage, there are several options for the blank: romantic, chivalrous, gallant, and depending on the test taker’s opinion, cheesy or even foolish. All of these options seem valid, and they could reveal a good amount of information on the subject’s vocabulary. If one limits the correct answer to only one alternative – say, romantic – the results of the test have at best limited value.
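The cloze procedure and the effect of the scoring rule can be sketched in a few lines of Python. The function names, the blank string and the `acceptable` mapping are our own illustrative choices; the sketch splits on whitespace only, so punctuation stays attached to words.

```python
def make_cloze(text, step=5, blank="_____"):
    """Replace every `step`-th word with a blank; return the gapped text and the deleted words."""
    words = text.split()
    answers = []
    for i in range(step - 1, len(words), step):
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers

def score(responses, answers, acceptable=None):
    """Exact-match scoring by default; `acceptable` maps a deleted word to other responses counted as correct."""
    acceptable = acceptable or {}
    correct = 0
    for given, expected in zip(responses, answers):
        if given == expected or given in acceptable.get(expected, set()):
            correct += 1
    return correct / len(answers)

gapped, answers = make_cloze("one two three four five six seven eight nine ten")
strict = score(["five", "tenth"], answers)
lenient = score(["five", "tenth"], answers, acceptable={"ten": {"tenth"}})
print(gapped, strict, lenient)
```

The same set of responses yields different scores depending on whether near-equivalent answers are accepted, which is the design choice questioned above.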

A 2009 study on 25 African-American men’s understanding of prostate cancer information found that “correlations between Flesch–Kincaid readability and Cloze comprehension scores were not significant” (Friedman et al., 2009, p. 454), Flesch-Kincaid being one example of Type A formulas.

Some readability formulas, such as Dale-Chall, have been considered “valid” empirically by Benjamin (2012, p. 81). A closer look at this statement reveals that this validity is based on performance on these limited-type cloze tests, especially those developed in Bormuth (1971) – a U.S. government Office of Education report – which included several other factors, for instance students’ opinions of the passages and whether they wanted to continue reading (i.e. an assumption that if an elementary student shows interest in a passage, the passage must have been readable). Not to mention that Bormuth’s “evidence” focuses mainly on the suitability of textbooks and pedagogical materials.

Benjamin (2012, p. 66) further states that the Dale-Chall formula, in addition to having been found valid on cloze tests, “was also successfully validated by comparing predicted difficulty levels with various standardized reading tests”. No sources are given for those standardized tests, making us suppose that these may well be the ORF tests.

3.4 Summary

We have thus far outlined the following negative aspects of readability research and readability formulas:

• Using criteria such as the length of words or the number of words in a sentence is not the best methodology to assess difficulty, as there are plenty of short words (e.g. gouge, bias, err) that could be mistakenly thought of as less difficult than longer words (e.g. unemployment, teasing, amazing);

• Most of the research on readability shows the limited validity of the formulas, which fail to correlate even with the ORF tests;

• The results of actual difficulty tests, e.g. cloze tests, depending on how they are designed, can be either limiting or misleading.

Only in the mid-2010s do we see studies, such as Leroy and Kauchak (2014), gauging actual difficulty of specific components of readability formulas, such as word frequency. In the next chapters, we will look into those components separately.

4 FREQUENCY AND DISPERSION

Human language, despite its immense potential for creativity and innovation, is often very predictable and formulaic, especially in terms of the lexicon (Griffiths, 2011). One analysis of the OEC¹ shows that a very small number of lemmas account for a great deal of the content in most text corpora, as shown in table 4.1.

Table 4.1: Breakdown of lemma composition in the OEC according to the top 𝑥 most frequent lemmas. Adapted from https://goo.gl/CvfPY1.

Top 𝑥 most frequent lemmas    % of content in OEC    Example lemmas
10                            25%                    the, of, and, to, that, have
100                           50%                    from, because, go, me, our, well, way
1000                          75%                    girl, win, decide, huge, difficult, series
7000                          90%                    tackle, peak, crude, purely, dude, modest
50,000                        95%                    saboteur, autocracy, calyx, conformist
>1,000,000                    99%                    laggardly, endobenthic, pomological

Far from an equal distribution, there is a small set of words that are highly frequent, in opposition to a very large set of words that are comparatively uncommon, something that was noticed in the mid-twentieth century (Zipf, 1949).
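A coverage figure of the kind reported in table 4.1 can be computed for any tokenized text. The sketch below is only illustrative: it counts raw tokens rather than lemmas, and the eleven-word sample is invented.

```python
from collections import Counter

def coverage(tokens, top_x):
    """Share of all tokens accounted for by the `top_x` most frequent word types."""
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(top_x))
    return covered / len(tokens)

tokens = "the cat and the dog and the bird saw the aardvark".split()
for x in (1, 2, 7):
    print(x, round(coverage(tokens, x), 2))
```

Even in this tiny sample, the two most frequent types already cover more than half of the tokens, mirroring on a small scale the skew seen in large corpora.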

Word frequency, a statistical measure applied to a corpus that counts the occurrences of each word either in relation to the entire corpus (absolute frequency, as in 347 occurrences) or in relation to the other words in the corpus (relative frequency, as in x words per million or x percent) can thus separate the words that are frequent from those that are rare.

[Margin note: The definition of word can vary from study to study. They can be, for instance, tokens, types, lemmas, or word families.]
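The two ways of reporting frequency can be illustrated with a short Python sketch; the function name and the four-token sample are ours, and words are taken to be raw tokens here.

```python
from collections import Counter

def frequencies(tokens):
    """Map each word to (absolute frequency, relative frequency per million tokens)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (n, n / total * 1_000_000) for word, n in counts.items()}

sample = ["a", "b", "a", "c"]
freq = frequencies(sample)
print(freq["a"])
```

Relative frequency makes counts comparable across corpora of different sizes, which absolute counts alone do not allow.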

4.1 Word familiarity and word frequency

Before going further, some clarification is needed. The term word familiarity has been used interchangeably, in a somewhat imprecise manner, with the term word frequency in research – see, for instance, Aziz et al. (2010), Begeny and Greene (2014), and Leroy and Kauchak (2014). We argue that they are not the same.

Word frequency is simply a type of count. The concept of word familiarity, on the other hand, assumes that the most frequent words are also the most familiar, which may not always be the case (Bailin and Grafstein, 2001, p. 287), given the differences in people’s backgrounds, as discussed in section 2.4. For instance, the word aardvark may be familiar to biologists yet completely unfamiliar to non-biologists. We will thus dispense with the muddled term “familiarity” and use frequency instead, unless we are citing or discussing a particular study that has used the former, for fidelity to the original text.

4.2 Scientific evidence

In the following sections, we will outline and discuss a few studies that investigate the influence of word frequency on actual difficulty. It will be possible to discuss them individually, for they are the only ones we have found. In fact, the most recent – Leroy and Kauchak (2014) – even claims to be the first of its kind (i.e. to investigate actual difficulty in relation to word frequency).

4.2.1 Kandula et al. (2010): lexical simplification

Focusing on electronic medical records and academic articles, Kandula et al. (2010) explored the concept of lexical simplification by replacing difficult terms with easier synonyms (making the assumption that more frequent synonyms are easier). Cloze scores showed a statistically significant improvement after simplification.

Despite the improved cloze scores, it is hard to establish a clear link between frequency and difficulty in this particular study as the authors included other types of simplifications to deal with “difficult terms”: explanation generators, syntactic simplifications, and further simplifications to grammar by using part-of-speech taggers and cohesion estimates (p. 368). In addition, only four subjects were tested with the cloze tests (which have their own methodological problems, as we have discussed).

In the words of the authors (p. 369), “although all the improvements are statistically significant, the magnitude of improvements is rather small”.

4.2.2 Leroy, Kauchak, and Mouradi (2013): lexical simplification

This study by Leroy, Kauchak, and Mouradi (2013), conducted over the Internet with 187 subjects, aimed to test four different types of text for actual and perceived difficulty (tested with Likert scales, multiple choice questions and cloze tests): the original unmodified text; lexically simplified text (changing less frequent words to their more frequent synonyms); coherence-enhanced text; and text that was both coherence-enhanced and lexically simplified. According to the authors, coherence is improved by “ensuring that no gaps exist in the flow of a document by use of anaphoric referents, connective ties, synonyms, etc” (p. 719).

[Margin note: The texts chosen were: eight sentences on smoking cessation, each taken from separate texts, and two abstracts that were not descriptions of experiments. They were found via PubMed search results with the query smoking cessation.]

Lexical simplification was found to reduce the perceived difficulty of texts, whereas coherence enhancement reduced actual difficulty (measured by multiple choice questions). The cloze tests, on the other hand, showed that “lexical simplification can negatively impact the flow of the test”, and that “for all types of words, the participants performed better with text that was not lexically simplified” (Leroy, Kauchak, and Mouradi, 2013, p. 726).

4.2.3 Leroy and Kauchak (2014): word frequency influences difficulty

Possibly the largest study to this day to investigate actual difficulty of words, Leroy and Kauchak (2014) used 239 subjects over the Internet.

As a basis for frequency counts, the authors used a large corpus (the Google Web Corpus², with over 13 million unique 1-grams) and a word list with 64,000 common English dictionary words (reviewed manually to exclude proper names, number-letter combinations, internet-specific syntax, and formulas). Then, they selected 25 words from each threshold or percentile of frequency: 25 words taken randomly from the top 1 percent most frequent words, then another 25 words from the 9-10 percent most frequent words and so on, until the 99-100 percent most frequent words, for a total of 275. Pairing the correct definition (a rephrased, simplified definition from WordNet) of each of those 275 words, they found that actual difficulty was correlated with word familiarity (i.e., word frequency):

the results show that words with a lower frequency of occurrence (higher percentile) are more difficult and less often correctly defined by participants (p. e171),

as shown in fig. 4.1:

Figure 4.1: Average actual difficulty for words grouped by word frequency of occurrence, from Leroy and Kauchak (2014, e171).

[Margin note: In addition to actual difficulty, the authors also tested subjects’ perceived difficulty through a Likert five-point scale ranging from very easy (1) to very difficult (5) and found a correlation between perceived difficulty and word familiarity: the words that were perceived as most difficult were among the least frequent.]

Reinforcing the criticism of Type A (orthography-based) readability formulas mentioned in chapter 3, Leroy and Kauchak (2014) state that no relationship was found “between the word length and actual difficulty” (p. e171).

4.2.4 Martinez-Gómez and Aizawa (2013): a Bayesian approach

Constructing a Bayesian causal network based on three different corpora (Simple Wikipedia, the English Wikipedia and PubMed³), Martinez-Gómez and Aizawa (2013) investigated the correlation of 22 different linguistic features to fixation times (measured on 40 participants – most of whom were non-native English speakers – with an eye-tracker while they read three texts each), the assumption being that the longer the eye fixates on a specific word, the more cognitive effort (and therefore difficulty) is involved. Their findings are interesting to us:

According to the cognitively-grounded reading difficulty, lexical perplexity (surprise), the occurrence of named entities, out of vocabulary words, passive clauses, academic words, nouns and abstraction (hypernyms) are the linguistic features that required longer fixation times in order to understand those documents (Martinez-Gómez and Aizawa, 2013, p. 1389).

[Margin note: Perplexity and surprise are measurements in probabilistic language models. In essence, the greater the deviation from the word that is expected (probabilistically), the greater the surprise and perplexity (Lopopolo et al., 2017).]

Despite not investigating word frequency directly, nearly every linguistic feature found to correlate with longer fixation times in this study has a connection to frequency: out of vocabulary words (words that were not present in lists of highly frequent words); occurrence of named entities (bound to be less frequent than non-named entities like common nouns); and “academic” words, nouns and abstractions (which are arguably more complex and less likely to be frequent).

4.2.5 Summary

A clear-as-day relationship between word frequency and text difficulty cannot be established. There are not enough studies to say conclusively that higher frequency equates to less difficulty and vice-versa. Two of the studies analyzed – Kandula et al. (2010) and Leroy, Kauchak, and Mouradi (2013) – provided either inconclusive or neutral results in regards to difficulty. Especially when considering the understanding of passages of text, lexical simplification – replacing lower frequency words with higher frequency equivalents – does not seem to yield significant results, at least on the health-related texts used in the studies.

On the other hand, the difficulty of individual words does appear to be affected by frequency, according to a large study by Leroy and Kauchak (2014). Still in regards to individual words, the experiment by Martinez-Gómez and Aizawa (2013), with a completely different approach (equating longer fixation times at word level to greater difficulty) seems to reinforce these findings. When writing a text, context can be construed so as to lessen the difficulty of the individual words it contains, by clarifying lower frequency (rare) words through, say, didactic explanations and restatements that are composed of higher frequency words; in other words, one could say that the argument that individual words can be difficult does not allow us to say, in turn, that sentences, paragraphs, and entire texts are difficult.

We do agree with this statement, especially considering that there is a persuasive case to be made that the semantic content of words is mainly underspecified, with context and use filling the semantic gap (Jaszczolt, 2005, p. 4). We can attempt to address this discrepancy between individual word difficulty and text difficulty in two ways.

First, we can obtain median values of difficulty for all the words in a given text, i.e. make an appraisal of difficulty of all tokens combined, even if repeated, instead of only looking at the unique (different) words in a given text. We will discuss this approach in more detail in section 12.2, p. 85.
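The difference between appraising all tokens and appraising only unique words can be shown with a small Python sketch. The difficulty scores below are hypothetical placeholders (they could stand for frequency ranks), and the sketch is one possible reading of the idea, not the procedure detailed later in this work.

```python
from statistics import median

# Hypothetical per-word difficulty scores; higher means harder.
difficulty = {"the": 1, "cat": 900, "sat": 1200, "on": 10, "mat": 4000}

def text_difficulty(tokens, scores, unique_only=False):
    """Median difficulty over all tokens (repeats included) or over unique words only."""
    words = set(tokens) if unique_only else tokens
    return median(scores[w] for w in words if w in scores)

tokens = "the cat sat on the mat".split()
by_token = text_difficulty(tokens, difficulty)
by_type = text_difficulty(tokens, difficulty, unique_only=True)
print(by_token, by_type)
```

Because the easy word the is counted twice in the token-based appraisal, the text comes out as less difficult than when each word is counted once.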

Secondly, in addition to the individual word level, we could look at sequences of words, capturing, for instance, phrasal verbs and other types of formulaic language which may often have different frequencies (and, we suppose, difficulties) from their individual parts. This we will discuss in chapter 5.
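As a minimal sketch of what counting word sequences involves, the following Python function counts contiguous n-grams in a token list. The example sentence is invented, and plain n-gram counting is only a simplification we assume here for illustration; it is not a method for identifying formulaic expressions.

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    """Count contiguous sequences of n tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "he did not bat an eye and she did not bat an eye either".split()
trigrams = ngram_counts(tokens, 3)
print(trigrams[("bat", "an", "eye")], Counter(tokens)["an"])
```

Sequence counts of this kind can then be compared with the counts of the individual words they contain.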

4.3 Word dispersion

The potential correlation between word frequency and word difficulty may, in fact, not be due to frequency, but instead to contextual diversity (CD) (also called context variability). Especially in the corpus linguistics literature, CD has been “neglected or even completely ignored” (Gries, n.d., p. 10).

Contextual diversity refers to the different instances where a word can be encountered, i.e. “the number of contexts in which words are experienced” (Adelman et al., 2006, p. 814), also defined as “the environment in which the stimulus [e.g., a word] was encountered” (Shiffrin and Steyvers, 1997, p. 760).

Initially, this concept appears to have been used in first language acquisition research, in terms of the number of different contexts a child finds themselves in (different conversations at home; conversations at school; at the doctor; while playing with friends, etc.), with the definition of “context” varying in the literature (Shiffrin and Steyvers, 1997, p. 760).

Later, the same term, CD, was applied in corpus linguistics to refer to “the number of passages (documents) in a corpus containing the word” (Brysbaert and New, 2009, p. 984); the same concept has been called word dispersion (as defined in Okamoto, 2015, p. 3: “how evenly a given word is spread across a corpus”), which, for reasons of clarity, we will use. However, we will keep the term contextual diversity or CD when discussing the literature in which that term appeared.

4.3.1 Word frequency vs. word dispersion

A word that is very frequent may not necessarily be useful for a learner. We will illustrate this with an example.

Suppose a word frequency count was performed for a text collection comprised of 1,000 different texts in English on various subjects (art, architecture, biology, etc.). With a frequency word list generated, we see on rank 500 the word manuscript. Anyone who has ever inspected a frequency word list would not expect this word to appear among the 500 most frequent in the English language, whereas value or building probably would.

[Margin note: Value and building are placed near rank 500 on a list of the most frequent words in English, according to COCA.]

However, while investigating the text collection, we see that two of the 1,000 texts – less than half a percent of all texts – are catalogs of 15th century illuminated manuscripts. The words in those texts are repeated constantly because of their subject matter, in a similar way as we, in this work, have repeated the words word, difficulty, language, learner, frequency, and so on, thus increasing the frequency of those words to an inordinate degree. Despite the high frequency in that particular collection of texts, the word manuscript is not very useful, given how circumscribed it is to a particular domain.

Gries and Nick C. Ellis (2015, p. 232) describe the difference between frequency and dispersion neatly:

(...) frequency answers the question “how often does x happen?” whereas dispersion asks “in how many contexts will you encounter x at all?”
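The manuscript example can be reproduced in miniature with the sketch below. It takes dispersion in its simplest form, the number of documents containing a word (the definition quoted earlier from Brysbaert and New, 2009); the three toy documents and the function name are our own, and more refined dispersion measures exist.

```python
from collections import Counter

def frequency_and_range(documents):
    """Per word: (total occurrences, number of documents containing the word)."""
    freq, doc_count = Counter(), Counter()
    for tokens in documents:
        freq.update(tokens)            # every occurrence counts
        doc_count.update(set(tokens))  # each document counts once per word
    return {word: (freq[word], doc_count[word]) for word in freq}

docs = [
    "manuscript manuscript manuscript manuscript value".split(),
    "the value of art".split(),
    "a building of value".split(),
]
stats = frequency_and_range(docs)
print(stats["manuscript"], stats["value"])
```

Here manuscript is the most frequent word yet occurs in a single document, whereas value is less frequent but appears in all three.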

4.3.2 Scientific evidence

From the early 2000s, researchers in the field of cognitive linguistics and psycholinguistics started paying more attention to contextual diversity in regards to its effect on word recognition and the speed of lexical access.

A careful study conducted in 2003 with over 141 participants, controlling for many variables (the degrees of ambiguity and concreteness of words, the strong correlation between CD and word frequency (WF), the clustering properties of the corpora used, etc.), found that

CD predicts word-processing times independently of WF and, moreover, that there is no evidence for a facilitatory effect of WF independent of CD (...) CD had a unique effect, with high CD leading to fast responses, whereas WF had either no unique effect or a suppressor effect, with high WF leading to slow responses (Adelman et al., 2006, pp. 815, 821)

These findings were replicated in a lexical decision experiment for young readers, again showing an effect of word dispersion, but not word frequency, in word identification times (Perea et al., 2013).

Also in first language acquisition, CD seems to be an important factor, with “the earliest learned words [being] the most contextually diverse in the learning environment” (Hills et al., 2010, p. 259). In addition, in terms of selecting words to include in pedagogical materials, a criterion such as word dispersion seems desirable (Okamoto, 2015, p. 7).

An evaluation of word frequency norms by Brysbaert and New (2009) again found word dispersion to be a better measure for psycholinguistic tests (such as prediction of reaction times, lexical decision, etc.) compared to word frequency.

None of these studies has touched upon word dispersion in terms of language learning in adults; as stated by Okamoto (2015, p. 7), this is a neglected lexical property in teaching: “there is no research that addresses this issue [word dispersion] from the perspective of goal setting for vocabulary teaching”.

Our line of reasoning, if we presuppose a link between frequency and difficulty, now becomes: words used throughout a collection of texts will naturally be less difficult, as speakers are expected to find them more often and therefore have a higher likelihood of learning them. If we count these more dispersed words, we will find that they are often the most frequent (there is a strong correlation between frequency and dispersion, especially for the top 6,000 most frequent words, as stated for instance in Steyvers and Malmberg (2003, p. 761) and Okamoto (2015, p. 7)).

Due to the lack of empirical investigation on the interplay between dispersion and difficulty, we can only make an indirect correlation between them through the “third wheel” that is frequency; but in terms of language students, the target audience of this project, word dispersion is arguably more useful than word frequency. A language student will not be confined to a single book, film, album, or TV show, nor a single subject in the target language; they will (or, at least, they should) attempt to understand texts from many different contexts, in order to increase their proficiency in the language studied.

Thus, we argue that the more likely a word is to be encountered – i.e., the greater its word dispersion or contextual diversity – the more important it becomes for learning. We will discuss this

5 MULTIWORD EXPRESSIONS

One neglected feature in difficulty-related linguistic research is ac-counting for multiword expressions (MWEs) such asbat an eye, to kick the bucket, in the light of, etc. These can also be called phrasemes or chunks, although the definition of a multiword expression is more strict:

any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002, originally cited in Caseli et al. (2010)).

Studies on the optimal vocabulary size for language proficiency do not usually take MWEs into consideration: “(...) individual words are all that is mentioned in current research on vocabulary thresholds” (Martinez and Schmitt, 2012, p. 268).

This neglect of MWEs may be due to the fact that obtaining them from corpora is an especially troublesome task. There is even an article that compares the extraction of MWEs to a “pain in the neck” for natural language processing (Sag et al., 2002). An overview of MWE-related research is given in Omidian et al. (2017, p. 490), where the authors stress the difficulty in defining proper criteria for classifying and extracting MWEs from texts.

Therefore, MWEs may well be considered an elephant in the room in vocabulary research. Their large number – and consequently their importance – is described as follows:

As Jackendoff (1997) notes, the magnitude of MWEs is far greater than what has traditionally been realized within linguistics. He estimates that the number of MWEs in a speaker’s lexicon is of the same order of magnitude as the number of single words. In WordNet 1.7 (Miller and Fellbaum, 2008), 41% of the entries are multi-words. Some specialized domain vocabulary, such as terminology, overwhelmingly consists of MWEs. (Bu et al. (2011), p. 3)


Several attempts to extract MWEs from corpora have been made, some of them being Granger (2014)’s lexical bundles, Simpson-Vlach and Nick C Ellis (2010)’s Academic Formulas List, and Martinez and Schmitt (2012)’s Phrasal Expressions List. No single list, however, will be definitive, and new MWEs are likely to appear as frequently as new individual words.

To our knowledge, no readability formula or text difficulty measurement has taken MWEs into account. In our opinion, if a correlation can be established empirically between the difficulty of individual words and frequency, it is only natural to suppose that there is a correlation between frequency (arguably as well as dispersion) and the difficulty of an MWE.
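To make the supposition above more concrete, the sketch below shows one way in which frequency and dispersion could be counted for word sequences rather than single words. It is an illustration under stated assumptions: contiguous word sequences of a fixed length stand in for MWE candidates (they do not by themselves satisfy the definition of an MWE quoted earlier), dispersion is again approximated by the number of texts containing the sequence, and the names `ngrams` and `sequence_stats` as well as the toy corpus are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sequence_stats(texts, n):
    """Count frequency and text range for every n-token sequence.

    frequency[g] is the total number of occurrences of sequence g.
    doc_range[g] is the number of texts containing g at least once.
    """
    frequency = Counter()
    doc_range = Counter()
    for tokens in texts:
        grams = ngrams(tokens, n)
        frequency.update(grams)
        doc_range.update(set(grams))
    return frequency, doc_range


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "he did not bat an eye".split(),
    "she would not bat an eye at that".split(),
    "the bat flew away".split(),
]

freq3, range3 = sequence_stats(corpus, 3)
candidate = ("bat", "an", "eye")
print(candidate, freq3[candidate], range3[candidate])
```

The word "bat" occurs in all three toy texts, but only the sequence-level counts separate the idiomatic use from the animal sense, which is why counting single words alone would miss it.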

In the next section, we will discuss the only study that, to our knowledge, investigates MWEs with regard to text difficulty.

5.1 Martinez and V. A. Murphy (2011)

Martinez and V. A. Murphy (2011) tested the reading comprehension of 101 adult Brazilian learners of English with two texts. The first text contained only the top 2,000 most frequent words in English, whereas the second included some multiword expressions made up of these 2,000 most frequent words.

The two texts had practically the same length (416 vs 412 words) and used the same set of words, the only difference being that the second had MWEs. Participants were asked to rate their own comprehension; to record their completion time for each test; and also to answer true-or-false questions that tested their understanding. Thus we have here a study of the influence of MWEs on both actual and perceived difficulty.

With regard to actual difficulty, the authors found that “participants’ scores were significantly lower on Test 2 [with MWEs added] relative to Test 1 (...) with a strong effect size” (Martinez and V. A. Murphy (2011), p. 278). Time spent on the text with MWEs was also greater (p. 280).

As to perceived difficulty, participants rated themselves as understanding more than their results in the true-or-false test showed, leading the authors to conclude that


learners may believe they understand more than they actually do by virtue of their simply understanding the individual words in a text (Martinez and V. A. Murphy (2011), p. 278),

with around half of the participants overestimating their abilities.

5.2 Summary

From the limited evidence we gathered, it appears that multiword expressions could affect text difficulty. For this reason, and also because it is in line with our lexicon-oriented approach, multiword expressions will be included as a factor in our text difficulty estimations. However, our way of accounting for multiword expressions will be different from other studies: instead of struggling to conform to the principles that make MWEs what they are, we will develop a more inclusive approach through combinatorial searches – thus, more specifically, we will be extracting n-gram data, an n-gram being simply a sequence of items from a text sample. Section 10.4, p.
