Towards a Dictionary of Polish Political Discourse 2015-2019


Universidade do Minho

Instituto de Letras e Ciências Humanas

Sonia Kropiowska

October 2019

Supervised by:

Prof. Dr. Idalete Maria da Silva Dias

Prof. Dr. Stefan Evert

MA thesis


COPYRIGHT AND CONDITIONS OF USE OF THE WORK BY THIRD PARTIES

This is an academic work that may be used by third parties provided that the internationally accepted rules and good practices concerning copyright and related rights are respected. Accordingly, this work may be used under the terms of the licence indicated below.

Should the user need permission to use the work under conditions not covered by the indicated licence, they should contact the author through the RepositóriUM of the Universidade do Minho.


STATEMENT OF INTEGRITY

I hereby declare having conducted this academic work with integrity. I confirm that I have not used plagiarism or any form of undue use of information or falsification of results along the process leading to its elaboration.


Abstract

Towards a Dictionary of Polish Political Discourse 2015-2019

The MA thesis aims at providing a theoretical sketch of a discourse dictionary, together with its partial implementation, that specifically tackles a controversial period in modern Polish history, namely the eighth term of the Polish Parliament (Sejm) spanning between October 25th, 2015 and October 12th, 2019. The thesis focusses on: the motivation behind the reference work, the practical considerations of target and reference corpus compilation for a discourse dictionary (also discourse analysis), a mixed-method application of corpus linguistics for discourse lexicography (establishing the semantic network) and a qualitative analysis of keywords, collocates and attitudes for lemma description. The author hopes to introduce new considerations for discourse lexicography, specifically in its corpus linguistics methodological approach, by shedding some light on the socially embedded language of a turbulent period in modern Polish history.

Keywords:


Abstract

Towards a Dictionary of Political Discourse in Poland 2015-2019

This Master's thesis aims to propose a model for compiling a dictionary of political discourse covering a controversial period in the modern history of Poland, namely the eighth term of the Polish Parliament (Sejm), which spans the period between 25 October 2015 and 12 October 2019. The thesis focuses on the following aspects: the motivation behind the proposal to create a dictionary of political discourse for the period in question; practical considerations regarding the compilation of the corpora, including the reference corpus; the application of the mixed-method approach of corpus linguistics to discourse lexicography (establishing a semantic network); and a qualitative analysis of keywords, collocations and attitudes for lemma description. The author hopes to introduce new considerations for the compilation of discourse dictionaries, chiefly with regard to the methodological approach of corpus linguistics, applied to the political discourse of a turbulent period in the modern history of Poland.


Table of Contents

Abstract

Chapter 1: Introduction
1.1 Polish Socio-Political Context of 2015-2019
1.2 Polish Socio-Political Context of 2015-2019: Media
1.3 Dictionary of Polish Political Discourse 2015-2019
1.4 The Scope of the Thesis
1.4.1 Corpus Compilation
1.4.2 Dictionary Structure
1.5 Thesis Contribution

Chapter 2: Theoretical Framework
2.1 Discourse Lexicography – Heidrun Kämper
2.1.1 Lexicography – Wiegandian Typology
2.1.2 Discourse – Foucauldian Definition
2.2 Discourse Dictionary – Semantic Reconstruction of Discourse
2.2.1 Macro-, Medio-, and Microstructure
2.2.1.1 Macrostructure
2.2.1.2 Mediostructure
2.2.1.3 Microstructure
2.2.2 Attitude-Based "Meaning Definition" and Collocations

Chapter 3: Methodology
3.1 Corpus
3.1.1 Design
3.1.2 Tools
3.1.3 Data Collection
3.2 Methods
3.2.1 Keyword Analysis – "Knots" of the Semantic Network
3.2.2 Keyword Collocation Analysis – Linkages of the Semantic Network
3.2.3 Concordancing – Attitude and Sentiment Attribution

Chapter 4: Analysis
4.1 Semantic Network – Macro- and Mediostructure
4.1.1 Establishing Keywords
4.1.2 Connecting Keywords
4.2 Defining Keywords – Microstructure
4.2.1 Choice of Association Measure
4.2.2 Discourse and Collocational Analysis
4.2.3 Different Kinds of Collocates
4.2.4 Attitude-Based "Meaning Definition"
4.2.4.1 Standard Definition
4.2.4.2 Definition Provided by the Source
4.2.4.3 No Definition

Chapter 5: Conclusion

References

Appendix 1 Keywords
Appendix 2 "Morawiecki" Collocates (separate file)
Appendix 3 "LGBT" Collocates
Appendix 4a Final Analysis Results EN
Appendix 4b Final Analysis Results PL


Chapter 1 Introduction

The chapter starts out with a discussion of the 2015-2019 Polish socio-political context of language use that is to be represented via discourse lexicographic means. In subchapter 1.2 the author zooms in on the specific situation of Polish media arguing for the selection of data defining the dictionary content and structure. Subchapter 1.3 identifies a lexicographic need within groups potentially interested in understanding the linguistic aspect of the larger socio-political context in question and offers a discourse dictionary solution to satisfy that need. Subchapter 1.4 discusses the exact extent of the work carried out for this thesis. Subchapter 1.5 names the contributions that the author would like to make.

1.1 Polish Socio-Political Context of 2015-2019

In our country there are people who would like to, using their term, 'modernize' Poland. One could say modernize against her will. Yes, Poland needs to be modernized, in terms of her organization, technology, social services... Yes. We are going to modernize Poland. But we will not modernize the Polish spirit!¹

Jarosław Kaczyński (PiS party leader since 2003)²

Independence Day speech from November 11th, 2015

On 25th October 2015 Poland underwent a radical change of power. For the first time since free elections were established in 1989, the victorious party, Law and Justice (PiS), won a majority in the larger, more powerful lower chamber of the Polish Parliament (Sejm). Soon afterwards PiS proceeded to implement its own vision of Poland, employing specific language to describe it. One example is the names of the social support programmes introduced by the 2015-2019 government that end in "+", meaning "more", such as Rodzina 500+ (eng. Family 500+), in which families receive 500zł (slightly over 100 euros) every month for each of their children except the first one³.

1 “Mamy w naszym kraju tych, którzy by chcieli Polskę, używam tutaj ich języka, modernizować. Można by powiedzieć modernizować na siłę. Tak, Polska musi być zmodernizowana, jeśli chodzi o organizację, jeśli chodzi o technikę, jeśli chodzi o usługi społeczne. Tak, my będziemy Polskę modernizować, ale nie będziemy modernizować polskiego ducha!” (author's translation from Polish). The video of the speech was made available on YouTube by the user PrawicowyInternet at www.youtube.com/watch?v=PjIURr0VnYc

PiS's vision and subsequent actions elicited a vehement response from the Opposition, media, artists, and ordinary citizens. Some political transformations took place shortly before the parliamentary elections as new political groups were formed: Nowoczesna⁴ (eng. Modern Party), Kukiz'15 (after the name of its leader Paweł Kukiz) in 2015⁵, and Razem⁶ (eng. Together Party). The period has been marked by many protests. One of them was Czarny Marsz (eng. Black March), organized in 2016 by the grassroots feminist social movement Strajk Kobiet, in which black-clad women took to the streets of more than 200 cities in Poland and abroad as the government moved to introduce new abortion laws⁷; another has been the teachers' strike⁸ that began in spring 2019 in reaction to the need for reform of, and the government's changes to, the education system. The media landscape was also transformed. An example of that is the bulk replacement of key personalities in TVP, the Polish public television, at the beginning of 2016 (Pallus 2016). From an artistic perspective, Robert Górski, the leader of the Kabaret Moralnego Niepokoju comedy group, created a satirical multi-season series called Ucho Prezesa (eng. The Chairman's Ear) that alludes to the operation of the current government⁹. Some episodes were viewed on YouTube by millions¹⁰ and the production was the most googled Polish series in 2017¹¹.

Due to Poland's involvement in global affairs, e.g. its membership in NATO (since 1999) and the EU (since 2004), those dramatic changes did not go unnoticed outside of the country, causing concern and subsequent intervention on the part of the EU, such as the launch of an infringement procedure to protect judges in Poland from political control in April 2019¹².

3 PiS (2014). Program PiS 2014. Nowy, powszechny dodatek rodzinny. http://pis.org.pl/dokumenty, accessed 15.07.2019.

Marszałek Sejmu Rzeczpospolitej Polskiej. Obwieszczenie Marszałka Sejmu Rzeczypospolitej Polskiej z dnia 25 października 2018 r. w sprawie ogłoszenia jednolitego tekstu ustawy o pomocy państwa w wychowywaniu dzieci (Dz.U. 2018 poz. 2134): http://prawo.sejm.gov.pl/isap.nsf/DocDetails.xsp?id=WDU20180002134, accessed 15.07.2019.

4 Nowoczesna. Kim jesteśmy?: https://nowoczesna.org/kim-jestesmy/, accessed 15.07.2019.

5 Kim jesteśmy: http://ruchkukiz15.pl/, accessed 15.07.2019.

6 Razem. O nas: http://partiarazem.pl/o-nas/, accessed 15.07.2019.

7 Strajk Kobiet. O nas: http://strajkkobiet.eu/co-robimy/, accessed 15.07.2019.

8 KJ (2019). Strajk nauczycieli 2019 zawieszony! Do kiedy? Czy 30.04 są lekcje w szkołach? [MAPA] [POSTULATY] Czy matury się odbędą?: https://polskatimes.pl/strajknauczycieli-2019-zawieszony-do-kiedy-czy-3004-sa-lekcje-w-szkolach-mapa-postulaty-czymatury-sie-odbeda/ar/c1-13749626, accessed 15.07.2019.

tvn24 (2019). Coraz bliżej strajku nauczycieli. Wstępne wyniki referendum w szkołach.

9 Ucho Prezesa. About: https://www.youtube.com/channel/UCggq3qvOrrM2u9gRL7oMOfA/about, accessed 15.07.2019.

10 tw (2019). Wraca „Ucho Prezesa”, nowe odcinki w internecie: https://www.wirtualnemedia.pl/artykul/ucho-prezesa-nowe-odcinki-2019-na-youtube-w-internecie, accessed 1.09.2019.

11 InteriaTECH (2017). Czego Polacy szukali w Google w 2017 r.: https://nt.interia.pl/internet/news-czego-polacy-szukali-w-google-w-2017-r,nId,2477400, accessed 1.09.2019.

12 European Commission (2019). Rule of Law: European Commission launches infringement procedure to protect judges in Poland from political control. Press Release Database: http://europa.eu/rapid/press-release_IP-19-1957_en.htm, accessed 16.06.2019.


1.2 Polish Socio-Political Context of 2015-2019: Media

The now endangered state of the media has been signalled in the most recent report by Freedom House, “an independent watchdog organization dedicated to the expansion of freedom and democracy around the world”¹³. The main reason indicated by the experts is politicians who “shaped news coverage by undermining traditional media outlets, exerting their influence over public broadcasters, and raising the profile of friendly private outlets”¹⁴. Reporters Without Borders, “an independent NGO with consultative status” with the United Nations and the Council of Europe, has also been recording Poland's steadily falling position among the world's countries in terms of press freedom¹⁵. The status of media in Poland was changed from “free” to “partially free” in 2016, and in 2019 the country was ranked 59th, far below the European average¹⁶.

1.3 Dictionary of Polish Political Discourse 2015-2019

The dictionary idea is intimately connected with the genuine purpose of the dictionary. The term genuine purpose (de. genuiner Zweck) refers to the satisfaction of the lexicographic needs of a potential dictionary user group:

Ein genuiner sprachlexikographischer Zweck liegt vor genau darin, wenn es die Intention des Lexikographen war, dass der potentiale Benutzer aus den lexikographischen Textdaten Informationen über einen sprachlichen Gegenstand aus K1 [=der Klasse der genuinen sprachlexikographischen Zwecke] gewinnen kann und wenn dies tatsächlich möglich ist. (Wiegand 1988: 745-746)

The genuine purpose of the dictionary of Polish Political Discourse 2015-2019 is to allow both Polish-speaking and English-Polish-speaking users to become better informed about the language-mediated political discourse in Poland during the eighth term of the Polish Parliament (Sejm).

The dictionary can be used by Poles who are interested, either privately or professionally, in politics. They would be able to orientate themselves in how different groups understand the key concepts in modern Polish politics, for example “freedom”, “justice”, “equality”, etc.¹⁷

13 Freedom House. About Us: https://freedomhouse.org/about-us, accessed 15.07.2019.

14 Freedom of The Press 2017. Press Freedom's Dark Horizon.

15 Reporters Without Borders (2016). 2016 World Press Freedom Index – leaders paranoid about journalists: https://rsf.org/en/2016-world-press-freedom-index-leaders-paranoidaboutjournalists, accessed 15.07.2019. Reporters Without Borders (2019). 2019 World Press Freedom Index – A cycle of fear: https://rsf.org/en/2019-world-press-freedom-index-cycle-fear, accessed 15.07.2019.

16 World Press Freedom Index 2016-2019: https://rsf.org/en/ranking, accessed 16.06.2019.


For foreigners the dictionary, with its English translations and presentation of equivalent concepts in other languages wherever necessary, would provide an insight into the diversity of views on internationally relevant concepts and enable better international cooperation, or at least comprehension.

1.4 The Scope of the Thesis

The thesis's main aim is to test its theoretical assumptions and exemplify the use of its methodology, not to compile an actual dictionary. Due to limitations of time, data and tool availability, only texts published between July 2018 and June 2019 by four media sources are examined. As a consequence, the empirical results of this thesis support only limited generalizations about the eighth term of the Polish Parliament or the period 2015-2019.

1.4.1 Corpus Compilation

The corpus contains data from a representative sample of media outlets covering July 2018-June 2019 and July 2013-June 2015. Because media are the only origin of participants and texts, the discourse analysis is a media analysis rather than a political one. Only media outlets are used because their data are relatively easy to collect, whereas other sources, such as parliamentary session transcripts, are not easily obtainable until after the end of the parliamentary term, in this case until after October 2019.

The researched period (July 2018-June 2019) has been shortened because of a restriction of one of the corpus methods used: the reference corpus for keyword analysis should be bigger than the observed corpus, and the still developing nature of Polish online journalism does not allow for the collection of a much larger amount of data from before October 2015. An analysis of the whole period will nevertheless be possible in the future, because the reference corpus can contain data from before 25th October 2015 and from after 13th October 2019 without compromising the theoretical assumptions of this thesis. The reason is that a reference corpus “can approximate a model reader's exposure to language patterns, which in turn reflect the typical reader's point of view” (Fidler and Cvrček 2015: 204). Using the data available today, we therefore look at the target corpus from the perspective of a hypothetical online news reader from 2015; including later materials would shift that perspective to that of a hypothetical reader at a later point in time. Of course, the perspective of this thesis's news reader is further restricted: it is assumed that their exposure is limited to four specific news outlets from July 2013 to June 2015 (see 3.1.1 Design). In general, there are no restrictions on the perspective taken on the observed corpus; in fact, a multiplicity of such perspectives would be welcome, as it would add to the discursive aspects of the dictionary.
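The relation between an observed (target) corpus and a larger reference corpus can be illustrated with a minimal keyword-analysis sketch. It assumes the log-likelihood statistic as the keyness measure and plain whitespace-separated tokens; this section does not state which measure or preprocessing the thesis itself relies on, and the function names are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of one word (assumed measure, two-term form)."""
    total = size_target + size_ref
    combined = freq_target + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score


def keywords(target_tokens, ref_tokens, top_n=10):
    """Rank words that are relatively more frequent in the target corpus."""
    if not target_tokens or not ref_tokens:
        raise ValueError("both corpora must contain at least one token")
    target_counts = Counter(target_tokens)
    ref_counts = Counter(ref_tokens)
    n_target, n_ref = len(target_tokens), len(ref_tokens)
    scored = []
    for word, freq in target_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Keep only positive keywords (over-represented in the target corpus).
        if freq / n_target > ref_freq / n_ref:
            scored.append((word, log_likelihood(freq, n_target, ref_freq, n_ref)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Swapping in a different reference corpus changes the ranking, which is one way to picture the "reader's perspective" discussed above.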


CQPweb was initially chosen as the corpus analysis tool, and some methodological considerations were therefore tailored to it. Because they may be useful for similar research, these considerations were retained, and additional ones were added for the tool actually used, AntConc.

1.4.2 Dictionary Structure

The practical application of the methodology is limited to the first 100 keywords for the nodes of the semantic network (de. semantisches Netz), the first 30 keywords for the connections between those nodes, and one detailed node analysis. However, the theoretical and methodological parts go a few steps further, discussing, e.g., wide- and narrow-range collocational analysis of keywords for the distinction of different lemma types.

Unlike functional and hypo-/hyperonymous relations, synonymy and antonymy were neglected but would otherwise be important in enhancing the interconnectedness of the dictionary.

No exact structuring of the entry is given. Only elements that should belong to the entry are named. Because of time limitations, the thesis does not include all microstructural items that can be found in existing discourse dictionaries, such as etymology or morphologically related lexical items in Heidrun Kämper's Demokratiediskurs 1967/68 (Kämper 2013). The focus is on arriving at the lemma, participant-specific definitions, corpus-derived examples of use, attitude labels accompanying each example, information on the example chronology, lemma collocates, and guiding comments.
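The entry elements named above could be held in a simple data structure such as the following sketch. Since no exact entry structure is given in the thesis, the class and field names and the nesting of examples under participant-specific senses are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UsageExample:
    text: str      # corpus-derived example of use
    source: str    # media outlet the example comes from
    date: str      # ISO date (YYYY-MM-DD), used for the example chronology
    attitude: str  # attitude label accompanying the example


@dataclass
class ParticipantSense:
    participant: str  # discourse participant, e.g. a media outlet
    definition: str   # participant-specific definition
    examples: List[UsageExample] = field(default_factory=list)


@dataclass
class Entry:
    lemma: str
    guiding_comment: str = ""
    collocates: List[str] = field(default_factory=list)
    senses: List[ParticipantSense] = field(default_factory=list)

    def chronology(self):
        """Return all examples across senses, ordered by date."""
        examples = [ex for sense in self.senses for ex in sense.examples]
        return sorted(examples, key=lambda ex: ex.date)
```

Storing dates as ISO strings keeps the chronological sort trivial; a real implementation would probably use proper date objects.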

1.5 Thesis Contribution

The contribution of this thesis is threefold.

Firstly, it analyses the media language of a volatile period in modern Polish history, defined by many government-initiated changes, political reorganization, and citizen initiatives.

Secondly, it attempts to apply corpus linguistics methods, namely keyword analysis, collocational analysis between keywords and between tokens, and concordancing, to discourse lexicography, which as of now seems to use no specific corpus linguistics analysis methods apart from concordancing, despite employing annotated-by-topic (Kämper and Rothenhöfer 2008: 88), searchable (Kämper 2006: 342, Kämper 2013: 10), specialized (Kämper 2006: 340) corpora. The compilation process also differs from Kämper's: she mentions the digitalization of an analogue corpus into a “databank”¹⁸, while for this thesis a digital corpus is built from texts found on the Internet. The collocational analysis of keywords gives rise to a newly defined division of lemmas within a semantic network: central lemmas, interdependent lemmas, dependent lemmas and theoretically possible free lemmas.

18 “Das hier zugrunde gelegte Korpus wurde digitalisiert, nach einem differenzierten topikalischen Indizierungssystem ausgezeichnet und in eine Datenbank überführt. Diese Datenbank ermöglichte die Abfrage und Sortierung der Korpustexte nach den, durch das Indizierungssystem

Thirdly, the work experiments with the choice of elements needed for the description of discourse in a dictionary entry, based on the results of the collocational analysis and the data viewed.


Chapter 2

Theoretical Framework

The chapter opens with an overview of the task of discourse lexicography and the place of a discourse dictionary in Herbert Ernst Wiegand's dictionary typology. It goes on to discuss the discursive potential of the main building blocks of any dictionary: micro-, medio- and macrostructure. Finally, it proposes a qualitative analysis grounded in attitudes and collocations as an alternative to the approach Heidrun Kämper used for her Schulddiskurs 1945/55 (2002).

2.1 Discourse Lexicography – Heidrun Kämper

The dictionary idea proposed in this thesis follows Heidrun Kämper's understanding of discourse lexicography, which sees its task in representing “the relevant vocabulary, that constitutes a discourse” and describing it following lexicographic principles, both semasiologically and onomasiologically (Kämper and Klosa 2010: 1). Three of Kämper's discourse dictionaries have been published on the OWID portal; one of them is Schulddiskurs 1945-55¹⁹, which testifies to the language-mediated reality of early post-war Germany. The reference work does not represent a new vocabulary that only emerged in German after 1945, but rather the period's discourse on guilt that manifests itself in the frequency and function of its vocabulary. It documents the violence to which the victims bear witness, the strategies of defense against guilt and justification used by the perpetrators to relieve their personal burden, and the construction and dismantling of the identity with which the non-perpetrators analyze the phenomenon of National Socialism and attempt to rehabilitate the Germans. The dictionary's target user groups are linguists, especially lexicographers and language historians, contemporary historians, teachers, pupils, students, and, last but not least, non-linguists who seek information on how contemporary history is reflected linguistically.

2.1.1 Lexicography – Wiegandian Typology

Typologically, a discourse dictionary should be distinguished from (a) a general dictionary in that the former is more precise, more contextually specific, and less diverse in the grammar information it provides; and (b) a specialized dictionary, such as an author dictionary, in that it describes the perspective of a collective (Kämper and Klosa 2010: 2). A discourse dictionary focusses on language description that often necessitates contextual information and in that sense is akin to Wiegand's Sprachwörterbuch:


Er [Wiegand] teilt Nachschlagewerke in drei Klassen auf: Sprachwörterbücher, Sachwörterbücher und Allbücher und gibt dazu in etwa folgende Erklärung: Ein Sprachwörterbuch hat als genuinen Zweck über die Sprache zu informieren, ein Sachwörterbuch über die Sache und ein Allbuch sowohl über die Sprache als auch über die Sache. In einem Sprachwörterbuch kann man, wenn es sinnvoll erscheint, auch über die Sache informieren, so wie man in einem Sachwörterbuch auch über die Sprache informieren kann. (Weber 1996: 177)

2.1.2 Discourse – Foucauldian Definition

Discourse here should be understood in the Foucauldian sense as: “the total of collective communicative acts, with one identical topic, which is realized by one or several associations of participants, and which is represented by a typical and relevant vocabulary” (Kämper and Klosa 2010: 2). As a logical consequence of that assumption, Kämper names five characteristics of discourse – topic, time period, participant(s), relevant texts, and function(s) (Kämper 2006) – that should be considered in devising a discourse dictionary, and specifically its basis (corpus data), structure, and content. In Kämper's reference work for Schulddiskurs 1945-55 (eng. guilt discourse):

• the topic is guilt (de. Schuld)
• the time period is 1945-55
• the participants constitute three groups (victims, offenders and non-offenders)
• the primary sources' authors represent discourse participants²⁰
• the functions include that of witnesses (victims), the accused (offenders), and a mix of the accused, lawyers and judges (non-offenders)

2.2 Discourse Dictionary – Semantic Reconstruction of Discourse

A discourse dictionary contains linguistic information, primarily semantic, that is contextually restricted and might not overlap with what can be found in other dictionaries. The semantic information of a discourse dictionary has to do with (a) the interconnectedness of the discourse and (b) the expression of attitude. Discourse dictionaries may differ from each other in the lexicographic means by which they express this interconnectedness and attitude. For instance, in the reference work for Protestdiskurs 1967/68 the vocabulary that corresponds to the high scientific and theoretical demands of those involved in discourse constitutes its own lexical network within the dictionary structure²¹.

20 Schulddiskurs 1945-1955. Quellenverzeichnis: https://www.owid.de/wb/disk45/projekt/quellenverz.html, accessed 29.10.2019.

21 The elements of this lexical network include: "Bedürfnis, Begriff, Bewußtsein, exemplarisch, falsch, formal, Funktion, immanent, Kategorie, objektiv, subjektiv, Personalisierung, Prinzip, richtig, Struktur, Transformation, Verdinglichung" (Kämper 2013: 13)


Discourse participants provide the analytical and argumentatively justified discourse sequences (de. Diskurssequenzen), in particular with description categories of the Critical Theory of Marxist Teaching and Psychoanalysis, in order to give a scientific foundation to their concept of an anti-fascist and anti-authoritarian, participatory and enlightened democracy (Kämper 2013: 13).

2.2.1 Macro-, Medio-, and Microstructure

There are three main building blocks of a dictionary: the macro-, the medio-, and the microstructure. According to Mautner and Rainer (2017: 567), the macrostructure systematizes “the order of the main articles across a dictionary”, the mediostructure describes “the overall organization of the resource, that is, the system of cross-referencing different components of a dictionary”, and the microstructure deals with “the data organization within a specific article”.

2.2.1.1 Macrostructure

In a discourse dictionary the macrostructure is composed of a semantic network of lexical items representing the discourse: the "keywords" (Leitwörter in Kämper 2013: 10), in which the meaning of the discourse is lexically condensed and which also function as the lemmas of the dictionary. These should not be confused with the corpus linguistics keywords that are derived from keyword analysis. On the basis of lemmatic connections that are part of the semantic network, Kämper distinguishes between main lemmas and sublemmas; the former are characterized by numerous connections within the network and head their own articles:

Liste der Hauptlemmata [...] der abgeleiteten Sublemmata, sowie, vereinzelt, der in einer Bedeutungsbeziehung zu den Hauptlemmata stehenden, aber nicht zur Wortfamilie gehörenden Sublemmata. (Kämper 2006: 344)

Sublemmas, on the other hand, are only related to the main lemmas and do not head their own articles:

[Sublemmata sind] bedeutungsverwandte, in einer begrifflichen Beziehung zum Hauptlemma stehende Lemmata … , die keinen eigenen Artikel haben, sondern in dem Artikel zu dem Hauptlemma mit dargestellt sind (Kämper 2006: 346)
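The idea that main lemmas are characterized by numerous connections within the network can be sketched by counting each lemma's links. The numeric threshold `min_links`, the function names and the placeholder lemma labels are hypothetical; Kämper's distinction is drawn qualitatively, not by a numeric cut-off.

```python
from collections import defaultdict


def build_network(links):
    """Build an undirected network from (lemma, lemma) pairs."""
    network = defaultdict(set)
    for first, second in links:
        if first != second:  # ignore self-links
            network[first].add(second)
            network[second].add(first)
    return network


def split_lemmas(network, min_links=2):
    """Separate well-connected lemmas from weakly connected ones.

    min_links is an illustrative threshold, not a value from the thesis.
    """
    main = sorted(w for w, nbrs in network.items() if len(nbrs) >= min_links)
    sub = sorted(w for w, nbrs in network.items() if len(nbrs) < min_links)
    return main, sub


# Placeholder data: four lemmas and the connections between them.
example_links = [
    ("lemma_a", "lemma_b"),
    ("lemma_a", "lemma_c"),
    ("lemma_b", "lemma_c"),
    ("lemma_a", "lemma_d"),
]
main_lemmas, sublemmas = split_lemmas(build_network(example_links))
```

Here `lemma_d` has a single connection and therefore falls below the assumed threshold, while the other three lemmas are treated as main-lemma candidates.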


A topic, one of the indispensable discourse dictionary characteristics, is ideally but not necessarily the lemma that enjoys the most central position in the network:

Dieser Name kann im Idealfall aus dem Textkorpus, welches dem zu beschreibenden Diskurs zugrunde liegt, abgeleitet werden, und zwar dann, wenn ein Diskursthema in Texten des Korpus sich in einem häufig belegten zentralen 'Diskurswort' zu einem begrifflichen Zentrum verdichtet. (Kämper 2006: 339)

For example, in Schulddiskurs (eng. guilt discourse) Schuld (eng. guilt) is the discourse topic, the name of the discourse, and the central lemma, enjoying the most connections with other keywords and used by all three discourse participants (Kämper 2002: XXII). This aspect is further pursued in 3.2.2, where a more systematic definition of lemmatic division is attempted.

2.2.1.2 Mediostructure

Apart from the overarching semantic network, other interlemmatic connections can be created. They can relate to discourse- and text-specific participants, e.g. all three groups (WWII victims, offenders, and non-offenders) have to do with "guilt" (de. Schuld) and "duty" (de. Pflicht) (Kämper 2002, Kämper and Klosa 2010: 2); or to function(s), for instance actions can include "protest", "discussion", "revolution" (Handlungskonzeption, Kämper 2013: 13). Links based on other criteria can also be set up to further the interconnectedness of discourse representation. For the same reason synonyms, antonyms, hyperonyms, and hyponyms are desirable too. A more discourse-specific form of connection is the functional relation, which depends on a lexical item's function in discourse. For example, the lexical item Czarny Marsz (eng. Black March) mentioned in 1.1 can be defined in conservative or Christian media as “a feminist protest against the rights of unborn children”, which expresses an anti-abortion, anti-feminist, pro-government attitude and has the function of supporting the actions of the government (if “government” is the topic of the discourse).

2.2.1.3 Microstructure

It is characteristic of a discourse lexicographic resource to rely heavily not only on referencing between articles but also on addressing relations within them, “a lexicographic substitute for the missing natural language syntactic relations” (Wiegand and Gouws 2013: 276). A discourse-specific solution here would be to precede the many senses that relate to different attitudes expressed within discourse with an introductory part giving organizational and contextual guidance to the lexical item in question:

Wenn ein Lemma von Angehörigen verschiedener Sprecherperspektiven bzw. politischen Systemen gebraucht wird, besteht der Artikel aus einem allgemeinen, einführenden Teil, dem sich dann der perspektiven und systemgebundenen Darstellungsteil anschließt. (Kämper 2006: 347)

Polysemy is expected, as the notion of discourse entails a nuanced multiplicity of perspectives. For example, "the Opposition" might mean one thing when members of the Opposition use it of themselves (perspective on the self – Autoperspektive, Kämper 2013: 11) and another when the ruling party or coalition uses it of them (perspective on the other – Heteroperspektive, Kämper 2013: 11).

Morphological and etymological information may also be present to the extent that it reveals something about the interconnectedness of relevant discourse vocabulary or grants access to greater contextual depth. As lemma companions, lexical collocations, derivatives, compounds, etc. are desirable as long as they shed more light on how language is used in discourse. For example, due to different preferences in linguistic productivity, it makes more sense to include compounds in German than in Polish, where they arise only in restricted contexts such as dialects and terminologies (Grochola-Szczepanek 2008).

In Kämper's dictionaries there are also other categories that contribute to the lemmatic semantic description: discourse participant role, e.g. in her Schulddiskurs 1945/55 the victims primarily played the role of witnesses (Beteiligungsrolle, Kämper 2006: 340-341); and temporal orientation, e.g. democracy in the Demokratiediskurs 1967/68 is generally future-oriented and there are lemmas, such as "emancipation" or "autonomy", that are characterized by the same temporal orientation (Kämper 2013: 12-13).

Corpus evidence is essential because discourse should be regarded as a language use phenomenon and therefore be subject to empirical reconstruction with the use of a dedicated corpus (Kämper and Rothenhöfer 2008: 88). Examples of use should be given with as much contextual detail as possible, even at the cost of conciseness. Preferably, all corpus examples relating to a given item should be provided, as they constitute individual cases.

The rigidity of article organization has to be balanced, however, by an individualized approach to its structural delineation: each discourse dictionary represents a reconstructed discourse semantics in relation to a specific corpus, and as a consequence the data necessarily shapes the dictionary content and its organization.

2.2.2 Attitude-Based "Meaning Definition" and Collocations

Kämper does not seem to explain the discourse analytical procedure she follows when selecting a given lemma as the main lemma or when arriving at its description. Rather, she provides lexicographic principles, such as that "meaning definitions" (de. Bedeutungserläuterungen) should be written in "full sentences, following specific patterns for different semantic and syntactic classes of words". This thesis attempts to define a lemma through its collocates, including the most frequent word forms, and through the attitudes expressed within the self-contained textual segment, e.g. a paragraph, in which the lexical item being defined is located in the corpus.

Although not employed in Kämper's original discourse dictionaries, such as Schulddiskurs 1945/55, the approach agrees with the general premise of conceptual and semantic description of lexical items in their communicative context in relation to the discourse functions and intentions of the groups that use them²² and can therefore be considered an alternative that is slightly more quantitative than Kämper's purely qualitative approach.

This chapter presents the concept of a discourse dictionary in relation to both Foucauldian discourse and Wiegandian lexicography and indicates points of contact between the two disciplines. Chapter 3 explains the methodology that can be used to arrive at so defined a product.

22 (“begriffliche bzw. semantische Einheiten im kommunikativen Kontext, d.h. in Bezug auf Funktionen und auf Absichten von Sprechergruppen“, Kämper 2006: 338).


Chapter 3 Methodology

The methodological approach towards obtaining data and transforming it into dictionary content is a mixed qualitative-quantitative one, which is increasingly common in discourse analysis (Baker 2014, Griebel 2017, Wang 2013). This thesis argues that a discourse dictionary can benefit from a mixed-method approach as it involves both lexical and discourse analysis.

There are eight corpus building and analysis aspects that should be considered here:

1. Designing a specialized corpus for a dictionary of Polish media discourse July 2018 - June 2019

2. Designing a reference corpus to carry out a keyword analysis involving the target corpus in (1)

3. Collecting data for the target and reference corpora in (1) and (2)

4. Establishing the final lemma list with the use of keyword analysis

5. Establishing connections between lemmas in (4) by identifying keyword collocations

6. Categorizing lemmas from (4) based on the number of connections they make

7. Analyzing examples of lemma use with regard to the attitude that is expressed

8. Establishing the definition of a lemma on the basis of attributed attitudes and lemma collocates

3.1 Corpus

A discourse dictionary is descriptive in nature and requires a specialized corpus, one that has the same Foucault-derived characteristics that Kämper ascribes to the dictionary itself (Kämper 2006: 338). Existing discourse dictionaries such as Protestdiskurs 1967/1968, Schulddiskurs 1945-1955 and Schlüsseldiskurs 1989/90 employ digitized texts and corpus tools for running queries and concordancing. In this thesis the texts are collected directly from Internet sources and, apart from concordancing, the corpus methods used are keyword analysis, collocational analysis between keywords, and regular collocational analysis between tokens.

3.1.1 Design

Ideally, a political discourse dictionary would draw on data coming from different discourse participants, such as politicians, citizens, media, etc. Necessarily, the thesis will focus only on media coverage, technically making it a dictionary of media discourse on Polish domestic affairs, a step towards the dictionary idea described in 1.3. In Table 1 below media outlets, or discourse participants, have been categorized according to their political bias (Matuszewski 2019: 195-196) and form of organization (Geudes Bailey et al. 2007: 31). The former reflects the activity of Polish Facebook users affiliated with a given political group on different media pages in 2017. From a discursive point of view it is worth mentioning that only some outlets officially endorse a political character; to those belong the alternative niezalezna.pl²³ and krytykapolityczna.pl²⁴ but not the commercial tygodnikpowszechny.pl and the public tvp.info, which has a legal obligation to remain impartial²⁵. A connection can also be made between the discourse participants and political groups in Poland. The readership of tygodnikpowszechny.pl expresses an affinity to the centrist Civic Platform (pl. Platforma Obywatelska) and Modern Party (pl. Nowoczesna), as well as the left-wing Together Party (pl. Razem); that of krytykapolityczna.pl only to the Together Party; and that of tvp.info and niezalezna.pl to the right-wing Law and Justice Party (pl. Prawo i Sprawiedliwość). The views of the remaining major political groups of 2017, i.e. the centrist Polish People's Party (pl. Polskie Stronnictwo Ludowe) and the two anti-establishment political organizations and groups, i.e. Kukiz'15 and the Freedom Party (pl. Wolność), seem not to be reflected in the Facebook users' activity of 2017 in relation to particular media outlets.

form of organization    left-leaning            centrist                 right-leaning

public                  -                       -                        tvp.info

commercial              -                       tygodnikpowszechny.pl    -

alternative             krytykapolityczna.pl    -                        niezalezna.pl

Table 1 The division of corpus media sources according to their ideological bias (right-leaning/centrist/left-leaning) and form of organization (public/commercial/alternative)

In this thesis a view is taken that a discourse dictionary should be shaped by important lexical items and linkages between them that can be generated via quantitative analysis accompanied by a subsequent qualitative analysis. These lexical items, found in the target corpus, are only important in relation to a given reference corpus.

The question of what constitutes a good reference corpus has been the topic of a number of articles (Baker 2004, Berber-Sardinha 2000, Fidler and Cvrček 2015, Goh 2011, Scott 2009, Tribble 1999, McEnery 2006). It has been shown that differences between reference corpora cause variations in keyword results (Berber-Sardinha 2000, Scott 2009, Goh 2011). Although large, linguistically representative reference corpora, such as BNC or Brown, are used by default, it has been shown that differences in the number and relative quality (measured quantitatively)²⁶ of obtained keywords are insignificant or debatable unless specific variables are introduced (Baker 2004, Scott 2009, Goh 2011, Tribble 1999, McEnery 2006, Scott and Tribble 2006). Goh (2011) and Scott (2009) mention genre, mode, time period, and linguistic variation as possible variables, of which genre and time period were found to be significant.

23 plk (2015). Niezalezna.pl nadal liderem wśród prawicowych portali. https://niezalezna.pl/63258-niezaleznapl-nadal-liderem-wsrod-prawicowych-portali, accessed 16.06.2019.

24 Krytyka Polityczna. About Us. https://krytykapolityczna.pl/o-nas/eng/, accessed 16.06.2019.

25 KRRiT. Status, zadania i finansowanie nadawców publicznych.

This thesis follows the standard practice of employing a bigger reference corpus and approaches the variables that define a corpus from a strictly Foucauldian perspective, congruent with Kämper's take on discourse lexicography and corpus linguists' conclusions on the factors influencing keyword results. It needs to be said that it would be preferable to follow Berber-Sardinha's 1:5 corpus size ratio (2000: 12) for good measure, but since the MA project is severely limited in time, in space for the discussion of a larger number of examples, and in available data (some outlets do not go back to years preceding 2012), a different proportion should suffice.

Although Kämper does not mention keyword analysis as a way of obtaining lemmas for a discourse dictionary²⁷, she mentions five characteristics of discourse that in this thesis serve as structure points for both the target and reference corpus. Those characteristics, now also "variables"²⁸, include: discourse topic (defined by keywords and their structure), time period (here: July 2018-June 2019), participants (different outlets representing the media landscape), texts (various online news articles classified as relating to domestic issues) and function (expressive of the participants' attitude towards the topic, e.g. defining Czarny Marsz as "a feminist protest against the rights of unborn children" would suggest a strong anti-Czarny Marsz, anti-abortion and possibly anti-feminist attitude, with the function being that of a government supporter if the topic is the government and that of an anti-Oppositionist if the topic is the Opposition).

In the case of the dictionary that is the subject of this thesis, it would be of interest to know the topic of all relevant online news articles from a given period with the exclusion of vocabulary typical of online news articles (genre), and more specifically of articles belonging to the same media outlets but from other periods. For this reason the reference corpus should be defined through the same participants (=the same media outlets) from a period preceding the period of interest (July 2013-June 2015).²⁹ The keyword and collocational analyses are carried out in search of the discourse topic. The function is relative to the discourse participants and the topic. So defined, a reference corpus would reflect a perspective representative of the media readership in Poland from before the 2015 election on the July 2018-June 2019 period (Fidler and Cvrček 2015). Prospectively, it would be interesting to use other reference corpora with the only difference lying in the time period, e.g. 2011-2015, 2019-2022 or 2011-2015 and 2019-2023, to see how the perception of the 2015-2019 period changes over time.

26 I.e. as a function of popularity and precision. In Scott (2009: 85) popularity is defined as presence of each keyword in at least 20 of the 22 genre-specific reference corpus sets, while precision is computed following Oakes (1998: 176) and indicates "the proportion of retrieved items that are in fact relevant (the number of relevant items obtained divided by the total number of retrieved items)."

27 In Diskurs – semiotisch: Aspekte multiformaler Diskurskodierung, Maria Mast and Verena Weiland (2017: 231) define keywords as "resulting from a comparison in terms of lemma frequency between a given corpus and a thematically varied reference one in the same language" (in the original: "Keywords resultieren aus einem Abgleich der nach der Häufigkeit ihrer Okkurrenzen vorkommenden Lemmata im jeweiligen Korpus mit denjenigen eines thematisch verschiedenen Referenzkorpus in der jeweiligen Sprache."). Such a definition does not recognize the full potential of keyword analysis for discourse as it excludes the use of specialized reference corpora, i.e. ones that would result in discursively, not linguistically, relevant keywords.

If the thesis put the dictionary idea from subchapter 1.3 into action, the target corpus would cover online news articles published between October 25th, 2015, the announcement date of the parliamentary election results, and October 12th, 2019, one day before the announcement of the next parliamentary election results. Considering the approach described above, covering a four-year period would demand a large collection of very specific data that is not yet available. Therefore, a smaller-scale study of the same kind is carried out, i.e. one that compares a target corpus with data from July 1st, 2018 – June 30th, 2019 with a reference corpus covering the period July 1st, 2013 – June 30th, 2015. The month overlap between the target and the reference corpus helps to avoid any not fully predictable seasonal biases, e.g. the newspapers' discussion of the end-of-school exams in April and May. The resulting keywords should be typical of the second half of 2018 and the first half of 2019, i.e. of only part of the eighth term of the Polish Parliament, for a given set of media sources (same participants, same text types, possibly differing topic and discourse functions relative to discourse topic and participants) in comparison with a reference corpus that covers a period twice as long from the earlier (seventh) term of the Polish Parliament.
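The two date windows can be made explicit in a few lines of code. The following is a minimal sketch and not the procedure used in the thesis; the names `assign_corpus` and `months_covered` are hypothetical and only the dates themselves come from the design described above.

```python
from datetime import date

# Date windows taken from the corpus design described above.
TARGET = (date(2018, 7, 1), date(2019, 6, 30))
REFERENCE = (date(2013, 7, 1), date(2015, 6, 30))


def assign_corpus(published):
    """Return 'target', 'reference', or None for an article's publication date."""
    if TARGET[0] <= published <= TARGET[1]:
        return "target"
    if REFERENCE[0] <= published <= REFERENCE[1]:
        return "reference"
    return None  # outside both windows: the article is left out


def months_covered(window):
    """Set of calendar months (1-12) that a (start, end) window touches."""
    start, end = window
    months = set()
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.add(month)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months
```

Under these assumptions both windows touch all twelve calendar months, which is the property the seasonal-bias argument relies on.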

3.1.2 Tools

Compiling new corpora requires tools for building, processing, and analysing them.

In this thesis Web Scraper is used for data collection. This freely available browser plug-in enables the download of selected website elements, such as article titles, images, image captions, body text, etc., as well as a step-by-step definition of its workflow. Web Scraper affords great control over, and precision of, the obtained data. A popular crawling programme, such as BootCaT or TextSTAT, "reads web pages, indexes their text, and systematically follows their links" (Thissen-Roe 2016: 3), which in practice means reliance on the linking of the website and control only over the desired number of accessed pages³⁰ or pre-defined lexical items³¹. Greater control reduces the corpus cleaning time, i.e. the time spent removing unwanted information from the collected data.

29 The 2015-2019 period is still ongoing and therefore no data from the succeeding period is available. It would be possible to compare the information from 2015-2019 with data from a preceding and a succeeding period.

The data for this thesis is not lemmatized or tagged due to time restrictions, but if it were, PANTERA would be the preferred software choice. PANTERA is an open-source tool, originally created for the Polish National Corpus (pl. Narodowy Korpus Języka Polskiego), that allows for disambiguation of the morphosyntactic description of corpus texts. It carries out sentence segmentation, morphosyntactic analysis, segment disambiguation (flexemes included) and part-of-speech tagging (Acedański 2012: 197).

There are different corpus analysis toolkits available. The initial choice for this thesis was CQPWeb but a more easily available programme – AntConc – was used in the end. The reasoning behind both options deserves an explanation. CQPWeb is "a web-based corpus analysis system" (Hardie 2012: 380) allowing concordancing, keyword analysis, and collocational analysis, also one calculated on annotation (lemmas, part-of-speech tags or word forms) (Hardie 2012: 396-8), i.e. the methods relevant for the analysis proposed in 3.2. AntConc is "a freeware corpus analysis toolkit for concordancing and text analysis"³² that also enables the said concordancing, keyword analysis, and collocational analysis, together with a wide choice of keyness values and collocate measures. It does not, however, offer collocational analyses with spans that are both adjustable and wide, i.e. ones from 11 to 25 tokens apart, while CQPWeb makes it possible to adjust the spans but only up to 10 tokens in the left or right direction. Other practical limitations of using AntConc in comparison with CQPWeb are mentioned in 4.1.2.

3.1.3 Data Collection

Four portals – tvp.info, niezalezna.pl, krytykapolityczna.pl, and tygodnikpowszechny.pl – were selected for data collection as a compromise between what is theoretically needed and what is practically possible. The portals exhibit different levels of scraping difficulty. For example, although it is possible to scrape the page content from tvp.info and krytykapolityczna.pl, the two sites do not make it possible to define how the programme should move from one page of articles to the next. Although the content on both sites is open to the public, the data needs to be downloaded in increments by changing the start URL in the Web Scraper metadata each time a new batch of articles is scraped. Niezalezna.pl and tygodnikpowszechny.pl do not pose any such restrictions.

30 TextSTAT (2015). TextSTAT - Simple Text Analysis Tool: http://neon.niederlandistik.fu-berlin.de/en/textstat/, accessed 28.08.2019.

31 BootCaT (2018). Main Page. What BootCaT does: https://bootcat.dipintra.it/, accessed 28.08.2019.

As there is no need for semiotic analysis involving images or videos and as AntConc does not accommodate storing of metadata without incorporating it in the analysed data, the collected data types are narrowed down to article title and article body. The articles from tvp.info have two original sources – tvpparlament.pl and tvpinfo.pl – but are displayed the same way on the latter site, and are therefore included under the label of tvp.info. Unlike the other portals, tygodnikpowszechny.pl does not have a separate section for domestic affairs and all articles are included in the corpus.

After a clean-up, the total target corpus size is 5,100,000 tokens and the total reference corpus size is 6,200,000 tokens.

3.2 Methods

To render the conceptual complexity of a discourse dictionary, a few corpus methods are used – keyword analysis (3.2.1), keyword collocational analysis (3.2.2), concordancing (3.2.3), and collocational analysis (3.2.4) – contributing to the dictionary building blocks: macro-, medio-, and microstructure. All quantitative results benefit from a subsequent qualitative evaluation.

3.2.1 Keyword Analysis – "Knots" of the Semantic Network

A discourse dictionary is to represent the reconstructed semantics of a given discourse (Kämper 2006: 336). The first step towards its compilation is to establish the lexical items (de. Leitwörter, Kämper 2013: 10) in which the discourse is condensed (Kämper 2006: 339-340) and which constitute the "knots" of an interconnected network. These lexical items are also dictionary lemmas and in this thesis emerge out of a keyword analysis whose results are manually curated, because keywords do not automatically become lemmas and because multi-unit lemmas are permitted.

Generally, the dictionary puts no restrictions on the number of included lemmas – anything that passes the qualitative and quantitative tests is welcome. For practical reasons, only around 100 top keywords are examined with regards to their suitability for discourse analysis. It is only possible to identify lemma candidates at this stage.

3.2.2 Keyword Collocation Analysis – Linkages of the Semantic Network

Once "knots" are identified, it is crucial to find linkages between them to form a semantic network. This thesis follows Scott and Tribble (2006: 65-70) in their approach towards finding meaningful connections between keywords.


The meaningful connections between keywords in this thesis are collocational in character; therefore, a collocational analysis between keywords is carried out (Scott and Tribble 2006: 66-7): a collocate list is generated for a keyword and the presence of other keywords on that list is checked. In this thesis only the relations between the main keyword and the other 23 top keywords are tested, and only with regard to their existence, not their strength. The organization of the elements of the semantic network is not attempted, but the following theoretical suggestions can be used in similar projects.

In their analysis of Romeo and Juliet, Scott and Tribble ignore all text boundaries. In this thesis only portal boundaries are maintained, while articles need to be ordered by date, i.e. the node-collocate pairs can appear in different articles but not on different portals, with the exception of tvp.info and tvpparlament.pl, whose articles appear in the same chronological sequence on tvp.info. Collocate lists for all keywords would be filtered in search of potential keyword collocates in order to identify the position of each keyword in the semantic network. To identify central lemmas, a narrow-span (5 tokens to the left and 5 to the right) and a wide-span (from 11 to 25 tokens to the left and to the right) collocational analysis would need to be carried out.
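The difference between narrow-span and wide-span linkages can be illustrated with a short sketch. It is not the procedure of Scott and Tribble or of any corpus tool; the function name `keyword_links` and the token sequence are invented for illustration, and the tokens are assumed to come from a single portal with articles already ordered by date.

```python
def keyword_links(tokens, node, keywords, min_dist, max_dist):
    """Keywords found between min_dist and max_dist tokens away from any
    occurrence of node, looking both to the left and to the right."""
    found = set()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for dist in range(min_dist, max_dist + 1):
            for j in (i - dist, i + dist):
                if 0 <= j < len(tokens) and tokens[j] in keywords and tokens[j] != node:
                    found.add(tokens[j])
    return found


# Invented token sequence; "x" stands for any non-keyword token.
tokens = ["Morawiecki", "x", "sędziów"] + ["x"] * 12 + ["Trzaskowski"]
keywords = {"Morawiecki", "sędziów", "Trzaskowski"}

narrow = keyword_links(tokens, "Morawiecki", keywords, 1, 5)    # 5 left, 5 right
wide = keyword_links(tokens, "Morawiecki", keywords, 11, 25)    # 11 to 25 apart
overlap = narrow & wide  # keywords linked at both spans
```

In this toy sequence "sędziów" is a narrow-span link and "Trzaskowski" a wide-span link of the node, so the overlap is empty; in real data the size of the overlap is what would point to central lemmas.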

As a consequence of the methods used, the thesis also offers a more quantifiable classification of discourse dictionary lemmas than Kämper’s division into main lemmas that are interconnected and therefore deserve to head their own articles and sublemmas that are fully dependent on the former. For Kämper the discourse topic is preferably defined through a main lemma that connects all the other lemmas, that is also productive in terms of derivatives and compounds, that represents all discourse participant perspectives in its description, and that acts as the centre of the semantic network (Kämper 2006: 339-340, 347-348).

To translate a classification like this into Scott and Tribble's terms, it is necessary to distinguish between two types of keywords: global, "dispersed more or less evenly throughout the text", and local, concentrated only in some parts of the text (Scott and Tribble 2006: 66), as well as two types of linkages between keywords: narrow-span (5 tokens to the left and 5 to the right) and wide-span (from 11 to 25 tokens apart) linkages. The four notions relate to each other as the most common linkages "are omnipresent and would be picked up using any system of linkage whatever the span" (Scott and Tribble 2006: 69), meaning that, as part of the narrow-/wide-span overlap, these lexical items are not only popular and well-distributed but also well-connected. Low popularity on narrow-span collocational lists and absence from wide-span collocation lists might indicate localized lemmas with single connections. Absence of a keyword from any collocational list and lack of any keywords among its own collocates indicate "free lemmas" (see below). These and other theorized correspondences need to be tested in an actual data analysis.

The four lemma types proposed, whose names have been changed to avoid confusion with Kämper's terminology, can be characterized as follows:

• central lemma – a lemma or a set of lemmas, also in the form of word combinations, that includes the keyword appearing in the highest number of narrow-/wide-span overlaps resulting from a keyword collocational analysis; in case of a "draw" all qualifying lemmas or lemma sets create their own centres;

• interdependent lemma – a lemma or a set of lemmas, also in the form of word combinations, that includes a keyword that collocates with more than one keyword other than itself;

• dependent lemma – a lemma or a set of lemmas, also in the form of word combinations, that includes a keyword that collocates with only one keyword other than itself;

• free lemma – a lemma or a set of lemmas, also in the form of word combinations, that includes a keyword that has no keyword collocates other than itself.

In this scheme all central lemmas are interdependent lemmas and only some interdependent lemmas are central lemmas. The collocational analysis results might be used for more detailed descriptions, for example ones concerning the relative strength of lemma connections within the semantic network.
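The scheme can be expressed as a small classification routine. This is only a sketch under stated assumptions: the function name `classify_lemmas`, the input format, and all sample values are hypothetical, and a lemma is treated as central only if it is also interdependent, in line with the paragraph above.

```python
def classify_lemmas(links, overlaps):
    """links: keyword -> set of keywords it collocates with.
    overlaps: keyword -> number of narrow-/wide-span overlaps it appears in.
    Returns keyword -> (lemma type, is_central)."""
    top = max(overlaps.values(), default=0)
    result = {}
    for keyword, partners in links.items():
        partners = partners - {keyword}  # a keyword does not count as its own collocate
        if len(partners) > 1:
            label = "interdependent"
        elif len(partners) == 1:
            label = "dependent"
        else:
            label = "free"
        # Ties for the highest overlap count all become centres.
        is_central = (
            label == "interdependent"
            and top > 0
            and overlaps.get(keyword, 0) == top
        )
        result[keyword] = (label, is_central)
    return result


# Invented sample data, not results from the thesis.
links = {
    "Morawiecki": {"sędziów", "Trzaskowski"},
    "Trzaskowski": {"Morawiecki", "LGBT"},
    "LGBT": {"Trzaskowski"},
    "r": set(),
}
overlaps = {"Morawiecki": 3, "Trzaskowski": 1}
types = classify_lemmas(links, overlaps)
```

With this invented input, "Morawiecki" comes out as a central (and interdependent) lemma, "Trzaskowski" as interdependent, "LGBT" as dependent, and "r" as free.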

Finally, it is important to point out that the thesis does not impose rules about whether the semantic network has to have a single or multiple centres or whether all keywords need to be connected with each other. The thesis proposes one way of looking at how keywords are linked with each other in discursively meaningful ways, an approach that does not compromise the basic assumptions of discourse lexicography:

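The essence of a discourse vocabulary is that it forms a conceptually or semantically coherent complex whose elements are related to one another through different meaning relations (in the original: "Das Wesen eines Diskurswortschatzes besteht darin, dass er einen durch unterschiedliche Bedeutungsrelationen aufeinander bezogenen begrifflichen bzw. semantischen kohärenten Komplex bildet.", Kämper 2006: 348).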

3.2.3 Concordancing – Attitude Attribution

The thesis also benefits from concordances, lists "of all of the occurrences of a particular search term in a corpus, presented within the context in which they occur" (Baker, Hardie, and McEnery 2006: 42-43), which are used to carry out a language-mediated analysis of attitude³³. Since the corpus size is relatively small (5,100,000 tokens), all instances of a term's occurrence are expected to be analyzed. All self-contained segments with a given lemma become part of an entry, and each such segment is analyzed closely to determine the attitude it expresses. It is possible to mark each segment-specific attitude as positive, negative, or neutral to calculate the overall attitude of the media outlet towards a given lemma. For example, "infringes the law" would be negative, "was announced" would be neutral and "should be supported more" positive.
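One possible way of deriving an overall attitude from the segment labels is a simple majority count. The thesis does not prescribe a formula, so the majority rule, the "mixed" label for ties, the function names and the outlet names below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def overall_attitude(labels):
    """Most frequent label among the segment labels; 'mixed' on a tie,
    None if there are no labels at all."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "mixed"
    return ranked[0][0]


def attitude_by_outlet(annotations):
    """annotations: iterable of (outlet, label) pairs for one lemma."""
    grouped = defaultdict(list)
    for outlet, label in annotations:
        grouped[outlet].append(label)
    return {outlet: overall_attitude(labels) for outlet, labels in grouped.items()}


# Invented manual annotations for a single lemma.
annotations = [
    ("outlet_a", "negative"),
    ("outlet_a", "negative"),
    ("outlet_a", "neutral"),
    ("outlet_b", "positive"),
    ("outlet_b", "negative"),
]
summary = attitude_by_outlet(annotations)
```

The labels themselves would still come from the close reading of each segment; the code only aggregates them.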

3.2.4 Collocational Analysis – Discourse Inner Structure

Dictionary mediostructure facilitates representing discourse complexity in terms of how its elements are connected with one another. One example of that is linking entries via their most common attributes, whether shared attitudes ("infringes the law"), sentiments (e.g. positive), or accompanying lexical items ("children", "Spokesperson for Children's Rights", "depravation"). The last case may be expressed via collocations, and therefore a collocational analysis is carried out for each keyword and its results are distributed among relevant lemmas. Different ways of viewing the dictionary content give its users more opportunities for approaching discourse.
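Linking entries through accompanying lexical items amounts to intersecting the collocate sets of lemma pairs. The sketch below assumes that each lemma already has a set of collocates; the function name `shared_collocate_links`, the placeholder lemma names and the collocates are hypothetical.

```python
from itertools import combinations


def shared_collocate_links(collocates_by_lemma):
    """Return {(lemma_a, lemma_b): shared collocates} for every pair of
    lemmas that has at least one collocate in common."""
    links = {}
    for a, b in combinations(sorted(collocates_by_lemma), 2):
        shared = collocates_by_lemma[a] & collocates_by_lemma[b]
        if shared:
            links[(a, b)] = shared
    return links


# Invented collocate sets; "lemma_b" and "lemma_c" are placeholders.
collocates_by_lemma = {
    "LGBT": {"children", "ideology"},
    "lemma_b": {"children", "law"},
    "lemma_c": {"budget"},
}
cross_references = shared_collocate_links(collocates_by_lemma)
```

Each resulting pair is a candidate cross-reference between two dictionary entries; whether it is discursively meaningful remains a matter of qualitative evaluation.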

Chapter 3 discusses the details of the mixed-method approach. It specifies the target and reference corpus design, makes suggestions for corpus analysis tools, and argues in favour of the four corpus linguistics methods – namely keyword analysis, keyword collocational analysis, concordancing, and collocational analysis – that help to build the discourse dictionary project put forward in this thesis. Chapter 4 reports on the practical application of the tools and methods described in Chapter 3.


Chapter 4 Analysis

4.1 Semantic Network – Macro- and Mediostructure

The semantic network is the overarching connecting structure of the main elements of the dictionary, i.e. lemmas. In this thesis the network is defined through keywords (=lemma candidates) and collocational relations between those keywords (=cross-references), which follows Scott and Tribble’s quantitative method of establishing meaningful connections between keywords.

4.1.1 Establishing Keywords – Lemma Candidates

The corpus analysis toolkit AntConc is employed for keyword identification.

The programme offers a wide range of keyword statistics and effect size measures. The default options in AntConc are used:

• Log-Likelihood with a 4-term co-occurrence, a keyword statistic threshold of p < 0.05 and Bonferroni correction to counteract the problem of making multiple statistical inferences;

• Dice coefficient with the keyword effect size threshold accepting all values.

The first 100 keywords are examined. Out of these 100 suggested keywords, 96 are considered suitable for discourse analysis, i.e. seem relevant to the events taking place in 2018-2019 (see Appendix 1), 5 are part of unremoved boilerplate ("like us", "see more", etc.), and one is an ambiguous abbreviation ("j") that is unfortunately frequent in various contexts during 2018-2019. The preliminary conclusion is that using a specialized discourse corpus (5,100,000 tokens) against a slightly bigger (6,200,000 tokens) specialized reference corpus yields very satisfying results.
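To make the keyness settings more tangible, the following sketch computes a log-likelihood statistic over the full 2×2 contingency table and applies a Bonferroni-corrected threshold. It is an illustration of the general technique and does not reproduce AntConc's implementation; the function names, the number of tests and the word frequencies are invented, and only the two corpus sizes come from the thesis.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood (G2) over all four cells of the 2x2 contingency table."""
    observed = [
        [freq_target, freq_ref],
        [size_target - freq_target, size_ref - freq_ref],
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [size_target, size_ref]
    total = size_target + size_ref
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2


def is_key(freq_target, size_target, freq_ref, size_ref, n_tests, alpha=0.05):
    """True if the word is over-represented in the target corpus and significant
    after Bonferroni correction (alpha divided by the number of words tested)."""
    g2 = max(log_likelihood(freq_target, size_target, freq_ref, size_ref), 0.0)
    p_value = math.erfc(math.sqrt(g2 / 2))  # chi-square tail, 1 degree of freedom
    overused = freq_target / size_target > freq_ref / size_ref
    return overused and p_value < alpha / n_tests


# Corpus sizes from the thesis; word frequencies and n_tests are invented.
example = is_key(500, 5_100_000, 50, 6_200_000, n_tests=10_000)
```

A word with the same relative frequency in both corpora yields a statistic of zero and is never reported as key, whatever the number of tests.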

4.1.2 Connecting Keywords – Cross-References

Out of 30 top keywords, 27 are suitable for discourse analysis. Because the data is not lemmatized (the initial plan was to lemmatize it and to integrate the annotated data into the web-based corpus analysis system CQPWeb³⁴), some of the keywords in that 27-item group constitute different forms of the same word (two proper names: nom. sing. "Morawiecki" and gen./acc. sing. "Morawieckiego", nom. sing. "Trzaskowski" and gen./acc. sing. "Trzaskowskiego"; and one common noun meaning "judge": gen. pl. "sędziów" and gen./acc. sing. "sędziego"). In those cases it is enough for either word form to frequently co-occur with "Morawiecki" because, if the data had been lemmatized prior to statistical analysis, different word forms would appear as one item. The goal is to establish whether there is any connection between keywords and not how strong this connection is. Both word forms of "Morawiecki" are omitted in the procedure because, as the top keyword, "Morawiecki" is treated as the node in keyword node-collocate pairs. It has to be mentioned that one of the keywords in the top 30 is "Mateusz", which appears mainly as the first name of Prime Minister Morawiecki and would be treated together with the surname as one entity constituting the lemma "Mateusz Morawiecki" in the dictionary. This example shows that lemma candidates are not always identical with keywords. The categorization of the top 30 keywords can be viewed in Appendix 1. It is worth noting that having "Morawiecki" as the top keyword and potential main lemma would agree with the premise of the target corpus design, which is to represent the media discourse during the eighth term of the Polish Parliament, that is, during the term of office of Prime Minister Mateusz Morawiecki as the leader of the government. In that sense, "Morawiecki" condenses the discourse through its definition and the connections it makes.

34 CQPWeb would be more adequate for this kind of analysis because, unlike AntConc, it accommodates viewing metadata, such as the publication date and publication ID, is largely independent of the equipment used by the analyst, and recognizes characters other than letters or numbers in analysis, such as "+" (as in LGBT+). In contrast, throughout the writing of this thesis AntConc refused to work with certain data sets or while performing certain actions, usually involving large files; metadata, which is treated only as part of the analyzed content, can be viewed via 'File view' or by storing each article in a separate file with the date and source in the title. Both AntConc and CQPWeb do allow for collocational search spans bigger than 5 tokens to the left and 5 tokens to the right, which would benefit establishing wide-span linkages between keywords as devised by Scott and Tribble (see 3.2.2).

In order to check whether the other keywords are collocates of "Morawiecki", Log-Likelihood is used to render a collocate list. The minimum frequency is 1 as, again, the goal is only to establish whether there is a connection between the keywords or not. After that, the list (see Appendix 2) is cloned in AntConc and pasted into LibreOffice Calc, where it is searched for each of the 23 top items from the keyword list.
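The spreadsheet filtering step can also be expressed programmatically. The sketch below is an illustration only: the function name `linked_keywords`, the grouping of word forms under a citation form, and the sample collocate list are assumptions and do not reproduce the data in Appendix 2.

```python
def linked_keywords(collocates, keyword_forms, node_lemma):
    """collocates: word forms on the node's collocate list.
    keyword_forms: citation form -> set of word forms found among the keywords.
    Returns citation form -> True if any of its word forms is a collocate."""
    collocate_set = set(collocates)
    return {
        lemma: bool(forms & collocate_set)
        for lemma, forms in keyword_forms.items()
        if lemma != node_lemma  # the node itself is left out
    }


# Word forms as discussed in the text; the collocate list is invented.
keyword_forms = {
    "Morawiecki": {"Morawiecki", "Morawieckiego"},
    "sędzia": {"sędziów", "sędziego"},
    "Trzaskowski": {"Trzaskowski", "Trzaskowskiego"},
    "LGBT": {"LGBT"},
}
collocates = ["Mateusz", "sędziego", "Trzaskowskiego"]
connections = linked_keywords(collocates, keyword_forms, node_lemma="Morawiecki")
```

Since only the existence of a connection is tested, one matching word form is enough for a keyword to count as connected to the node.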

It might be possible to establish how important a collocate, or how strong the connection, is in relation to "Morawiecki" and shape the semantic network accordingly. This is, however, not attempted in this thesis.

Predictably, the top keyword ("Morawiecki", the surname of the Prime Minister of Poland, Mr. Mateusz Morawiecki) connects with many other keywords, namely 21 out of 23. Only "r" (24th on the list), a keyword of questionable relevance standing for an abbreviation of "year", and "LGBT" (15th on the list) do not collocate with "Morawiecki". The latter case is most probably due to the fact that "LGBT" is mainly framed as a local government issue, and not a national issue, in 2018-2019 and, instead, the name of the Mayor of Warsaw (Rafał Trzaskowski) comes up as the main actor. A more detailed analysis of LGBT is carried out in 4.2.

Further analyses are not carried out due to time restrictions and the limited functionality of AntConc as it does not allow for collocational analyses with adjustable and wide spans, i.e. ones from 11 to 25 tokens apart (see 3.2.2).

On the basis of the scarce evidence provided by the comparison of the collocates of the top keyword "Morawiecki" with 23 other top keywords as its potential collocates, it can be tentatively said that keyword collocational analysis does produce discursively meaningful connections between potential lemmas, but its actual utility would have to be tested with more examples, especially lower-ranking keywords, and checked more closely in a concordance line examination. As mentioned in the case of "Mateusz Morawiecki", keywords do not always equate to lemmas. A further example of this is presented in 4.2, where "LGBT" rarely appears alone but rather in word combinations, such as "LGBT people", "LGBT community", "LGBT+ Declaration", "LGBT+ ideology", and "LGBT rainbow".

4.2 Keywords – Microstructure

One top-ranking (15th) keyword, i.e. "LGBT", is chosen for detailed analysis.

4.2.1 Choice of Association Measure

In order to check whether LGBT occurs in word combinations as the modifier, collocational analyses need to be carried out. AntConc offers four collocate measures: Mutual Information (MI), Mutual Information + Log-Likelihood (p < 0.05), Log-Likelihood (LL), and T-score. Mutual Information is the default one, a choice less preferable than Log-Likelihood for discourse analysts (McEnery, Xiao, and Tono 2016: 217) who follow Dunning (1993). As a side experiment, both Log-Likelihood and the default Mutual Information were used with a frequency threshold of 5. This was not done at the step of building the semantic network because only now does the strength of the bond between the node and the collocate come into play in this thesis. The threshold is raised from 1 to 5 to capture all node-collocate pairs whose relationship can be generalized, i.e. ones that co-occur often enough to establish a globally relevant context of their use. The two collocate measures render two drastically different lists (see Appendix 3), as elaborated on in this section.
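Why the two measures can rank the same collocates so differently may be easier to see with a small calculation. The sketch below uses the standard textbook formulations of pointwise Mutual Information and of the log-likelihood ratio over a 2x2 contingency table. It is not a reproduction of AntConc's internal implementation, which may treat the span and the sample size differently, and all frequencies are invented.

```python
import math


def association_scores(o11, f_node, f_coll, n):
    """Return (MI, log-likelihood) for one node-collocate pair.

    o11: co-occurrence frequency; f_node, f_coll: total frequencies of
    node and collocate; n: sample size (all values are assumptions here).
    """
    # Observed 2x2 table: rows = node present/absent,
    # columns = collocate present/absent.
    observed = [
        [o11, f_node - o11],
        [f_coll - o11, n - f_node - f_coll + o11],
    ]
    rows = [f_node, n - f_node]
    cols = [f_coll, n - f_coll]
    ll = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if observed[i][j] > 0:  # empty cells contribute nothing
                ll += observed[i][j] * math.log(observed[i][j] / expected)
    ll *= 2
    mi = math.log2(o11 / (f_node * f_coll / n))
    return mi, ll


# Invented frequencies: a rare collocate that only ever occurs with the
# node, and a frequent collocate that also occurs widely elsewhere.
mi_rare, ll_rare = association_scores(o11=5, f_node=100, f_coll=5, n=1_000_000)
mi_freq, ll_freq = association_scores(o11=50, f_node=100, f_coll=2000, n=1_000_000)
print(f"rare pair:     MI={mi_rare:.2f}  LL={ll_rare:.2f}")
print(f"frequent pair: MI={mi_freq:.2f}  LL={ll_freq:.2f}")
```

In this toy case MI places the rare pair above the frequent one, whereas Log-Likelihood reverses the order, which is the kind of divergence that low-frequency items tied to a single event can produce.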

The list that served in further analysis is the one generated with Log-Likelihood for two reasons.

Firstly, the Log-Likelihood-generated list provides a better lexical overview of discourse, i.e. elements that combine with LGBT as the modifier appear at the top ("parade"), their collocates come next ("mass") and collocates primarily relating to collocates of the node ("celebrate") are still lower on the list. With the Mutual Information-generated list it is more difficult to see how elements link in discourse. Also, the two top collocates generated with MI, i.e. "świdnicki" (Eng. relating to Świdnica, a town in the Lower Silesia region, m. sing.) and "malopolscy" (Eng. relating to the Lesser Poland region, masculine personal), refer to a single peripheral event and are therefore not representative of the period.

Secondly, the problems with using MI are amplified by the data set. Even though it is thoroughly cleaned of noise, it still contains data that seemed useful during collection but turned out to be questionable during analysis. This data includes titles that are recommended mid-text for further reading in some articles. On the one hand, readers who access just a few articles from the page are exposed to the recommended title even if they do not read the article which it heads; on the other hand, these titles do not introduce new contexts for the same lemma and give one specific occurrence a lot of weight in defining an entity, especially in cases where a keyword appears solely in one frequently recommended title. In the future such recommendations can be treated as metadata giving additional information on connectivity within the discourse. Hashtags and media sources are also examples of this. Due to time restrictions, in-article boilerplate and metadata were not removed, so the analyst had to be mindful of repetitions of this kind.

4.2.2 Discourse and Collocational Analysis

To establish each portal's unique relationship with the keyword "LGBT", four collocational analyses are carried out, one for tvp.info, one for tygodnikpowszechny.pl, one for niezależna.pl, and one for krytykapolityczna.pl. Each portal has its own preferred top collocates, which, on closer examination, turn out to combine with LGBT as the modifier. There is an overlap among them but also a clear preference for one or more given combinations. For example, for tvp.info these are LGBT+ Declaration, LGBT communities ("environments"), LGBT ideology, LGBT rainbow, and Love Parade ("Equality March"); for tygodnikpowszechny.pl LGBT people; for niezależna.pl LGBT+ Declaration and LGBT communities ("environments"); for krytykapolityczna.pl LGBT people, LGBT community, and LGBT+ Declaration, in this order of frequency. The focus on specific issues might characterize a media outlet, e.g. tvp.info and niezależna.pl focus less on LGBT in the context of individuals (people) and more in the context of a special group (community) introducing certain changes (a charter). Additionally, after sorting all the word combinations with LGBT as the modifier, it turns out that the keyword LGBT appears very rarely in isolation.
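The per-portal comparison of word combinations can be sketched as a simple count of the word adjacent to the node, grouped by portal. This is a minimal sketch under stated assumptions: the input is a hypothetical mapping from portal name to a tokenized text, the portal names and sentences are invented English glosses, and the `offset` parameter (1 for the word to the right, -1 for the word to the left) has to be set to match the word order of the language analysed.

```python
from collections import Counter


def combinations_by_portal(tokens_by_portal, node="LGBT", offset=1):
    """Count, per portal, the token found `offset` positions from the node."""
    result = {}
    for portal, tokens in tokens_by_portal.items():
        counts = Counter()
        for i, tok in enumerate(tokens):
            j = i + offset
            # Exact string match: "LGBT+" would need its own node value.
            if tok == node and 0 <= j < len(tokens):
                counts[tokens[j]] += 1
        result[portal] = counts
    return result


# Invented sample texts for two hypothetical portals
sample = {
    "portal_a": "the LGBT ideology and the LGBT ideology debate".split(),
    "portal_b": "LGBT people and other LGBT people met LGBT community".split(),
}

per_portal = combinations_by_portal(sample)
for portal, counts in per_portal.items():
    print(portal, counts.most_common(2))
```

Comparing the top entries across portals gives a first impression of which combination each outlet prefers; the results would still need to be verified in concordance lines.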