Computational text analysis: Thoughts on the contingencies of an evolving method

(1)

Computational text analysis: Thoughts on

the contingencies of an evolving method

Daniel Marciniak

Abstract

Mapping a public discourse with the tools of computational text analysis comes with many contingencies in the areas of

corpus curation, data processing and analysis, and visualisation. However, the complexity of algorithmic assemblies and

the beauty of resulting images give the impression of ‘objectivity’. Instead of concealing uncertainties and artefacts in

order to tell a coherent and all-encompassing story, retaining the variety of alternative assemblies may actually strengthen

the method. By utilising the mobility of digital devices, we could create mutable mobiles that allow access to our

laboratories and enable challenging rearrangements and interpretations.

Keywords

Topic modelling, science and technology studies, text analysis, Big Data, network analysis, visualisation

Inspired by recent texts on quali-quantitative methods

(Latour et al., 2012; Venturini and Latour, 2010), the

research project recounted in this essay focussed on

mapping public discourse around Big Data in science

and politics. The aim was to test methods of

computa-tional text analysis on a comparatively large corpus of

documents collected from US, UK and EU

governmen-tal websites as well as articles from Web of Science. The

more I delved into the work, the more it became clear

that it would not only be about describing the topics

involved in the discourse, but also about the process

itself, that is, the assemblage of algorithms. In what

follows, I describe some of the contingencies involved

in the process and argue that retaining them may

actu-ally strengthen the method.

In its core, the procedure of computational text

ana-lysis is very similar to the puriﬁcation process described

by Latour and Woolgar (1986) in

Laboratory Life

.

Texts are transformed into bags-of-words (organic

tissue is pure´ed), important words are ﬁltered (the

material passes selective sifts), and over the course of

many iterations topics are inferred (spikes are

diﬀeren-tiated from noise). Just as the construction of facts in

science hides the circumstances of their production, it

would be possible to hide the plethora of necessary

decisions and present the resulting topics as facts. But

maybe because there is not yet an established routine to

computational text analysis, it is still possible to see all

the possible paths one could take in arranging

inscrip-tion devices. They are not yet blocked by what one

‘ought’ to do or by hard coded presets within software.

The task of mapping a discourse with computational

text analysis involves three interrelated areas of

con-cern: (1) the curation of a corpus of documents, (2)

actual data processing and analysis and (3)

visualisa-tion (see Figure 1). The diﬃculties associated with the

ﬁrst task are not really new, as they come with every

kind of content analysis. Where are the boundaries of a

discourse? In addition, it is unclear what requirements a

corpus must meet so that the methods work properly

(DiMaggio, 2015: 3). For example, the disproportion

between the number of documents I had collected for

science and for politics seems to have resulted in a

model that is biased towards topics within science.

Department of Sociology, Goethe University Frankfurt am Main,

Frankfurt am Main, Germany

Corresponding author:

Daniel Marciniak, Goethe-Universita¨t Frankfurt am Main,

Theodor-W.-Adorno-Platz 6, 60323 Frankfurt am Main, Germany.

Email: [email protected]

Big Data & Society

July–December 2016: 1–5

!

The Author(s) 2016

Reprints and permissions:

sagepub.com/journalsPermissions.nav

DOI: 10.1177/2053951716670190

bds.sagepub.com

(2)

The inﬂuences of a corpus’ compositionality are easily

overlooked when the amount of collected documents

suggests that you have ‘seen it all’ – a mistake quickly

made with Big Data (boyd and Crawford, 2012;

Harford, 2014). However, because a computer is, in

contrast to humans, able to ‘unread’ a text by deleting

it from memory, computational text analysis may

pro-vide a platform on which the inﬂuence of a corpus’

compositionality on the outcome can be tested (e.g.

by drawing random samples).

The ability of testing multiple pathways also applies

to the second area of data processing and analysis.

There are numerous algorithms involved in processing

the text that can be combined in many diﬀerent ways

(e.g. transforming a PDF ﬁle into text, applying

part-of-speech tags, detecting bigrams, named entity

recog-nition, topic modelling). And each of them provides a

source of uncertainty regarding how reliable they are

and how diﬀerent orders of assembly may inﬂuence the

outcome. Take the question of employing bigram

detec-tion: Bigrams are pairs of words that co-occur

fre-quently, such as ‘Big Data’. Treating bigrams as

singular tokens (if the second word was a noun)

increased the number of words in my dictionary by

about 30 percent. This, in turn, changed word

frequen-cies tremendously and thereby changed the outcomes of

topic modelling algorithms. Furthermore, when we use

external material, such as stop word lists or sentiment

dictionaries, it is unclear whether they are universally

applicable (Diesner, 2015). However, as the task of

part-of-speech tagging illustrates, computational

meth-ods can be successively improved upon until they

per-form comparable to humans or even better (Manning,

2011). So, by seemingly reducing human interference,

the promise of computational text analysis is an

increase in ‘objectivity’ compared to classical methods

of content analysis. It utilises only the information

con-tained in the corpus, providing a result that is

essen-tially free from interpreter’s biases, for example

preconceptions about the documents (Buurma, 2015).

The topics stem from a process of reading between texts

that is impossible for humans to accomplish otherwise.

The result is a multidimensional space of meanings, a

space that somehow has to be reduced to a

two-dimen-sional

visualisation

to

make

it

accessible

to

interpretation.

Finding an appropriate visualisation to the various

computed results again came with a plethora of

deci-sions on layout algorithms, cut-oﬀ points, and

param-eters. The number of possible graphs is virtually endless

and none of them are necessarily ‘wrong’. Some yield

similar results albeit on diﬀerent levels of abstraction

(e.g. healthcare vs. cancer treatment), some are

dis-torted, and most are plainly not interpretable albeit

not contradictory. Figures 2 and 3 show two diﬀerent

read PDF

PDF Miner

read file

clean te

xt

remo

ve ur

ls

limit to Latin char

acter set

remo

ve e

xcess whitespace

natur

al language processing

split sentences

par

t-of-speech tagging

natur

al entity recognition (NER)

chunking

SENNA

Stanf

o

rd NLP

sentiment dictionar

y

compute sentiment scores

lemmatise

W

o

rdNet

w

ord_tag

big

rams

tr

ig

rams

N-g

rams

limit to letters

map str

ings to n

umbers

dictionar

y

bag-of-w

ords

cor

pus matr

ix

gensim

Latent Dir

ichlet Allocation

Latent Semantic Inde

xing

cluster

ing

topic-w

ord-probability matr

ix

w

ord-dimension matr

ix

tf*idf

Wikipedia (mean w

ord frequency and SD)

W

elch's t-test

tok

en frequency and sd

sentence co-occurrence

inclusion inde

x

NER statistics

stop w

ords

sentiment score dictionar

y

search b

y hand

par

liament.uk

go

v.uk

europa.eu

whitehouse

.go

v

senate

.go

v

house

.go

v

computational cor

pus cur

a

tion

146 of 2970 political documents

W

e

b of Science

title includes'Big Data'

705 of 2291 science ar

ticles

remo

ve duplicates

full te

xt accessibility

Big Data as main topic

NL

TK

scip

y

matplotlib

g

raph f

o

rm

ats

*.net

*.ge

xf

manipulate v

ector g

raphic

pie char

ts

bar char

ts

scatter plots

netw

or

k

s

siz

e

colour

shape

pajek

gephi

cut-off point

distance measures

w

ord-w

ord-matr

ix

n

umber of topics

la

y

out algor

ithms

Kamada-Ka

w

ai

F

o

rce Atlas

F

ruchter

man Reingold

Openord

Y

ifan Hu

Blondel

k-Means

Multidimensional Scaling

inter

pretation

collection of documents

nodes

edges

line char

ts

n

umber of dimensions

cor

p

us cur

ation

data processing and analysis

visualisation

inter

p

retation

used in final result

not used in final result

e

x

ter

nal resources (tools and datasets)

Figur

e

1. Assemblage

of

tasks,

algorithms,

and

resour

ces

emplo

yed

in

the

pr

(3)

maps of my corpus based on the same set of words.

They are the result of two diﬀerent approaches to

making an underlying topical structure visible: as is

apparent from the isolated nodes in Figure 2, the ﬁrst

strategy was to set a cut-oﬀ point to edge weights in

order to reduce the number of edges shown. While in

Figure 2 the nodes’ colours inform how the map is read,

Figure 3 adds readability by including the topics

them-selves as nodes. In this case only maximum weights or

those above a certain threshold are shown as edges. But

is the map that is more easily read also more ‘true’?

Even after being computationally ﬁltered for words

that are important to the context (by frequency or in

comparison with other text corpora), topics remain

groupings of words that have to be made sense of.

After all, topic modelling only shifts the task of

inter-pretation

to

the

very

end

(Mu¨tzel,

2015:

2).

Paradoxically, the visualisations seem to lend

objectiv-ity to the subjective ﬁctions that topics necessarily are:

While I was aware of the troubles I had to go through

a a a about access accessed address marciamcnutt advantage age aim al all althou am american among an analysis and application approach appropriate april are area article association at author available basis be because been benet best between big big administrative broad build but by can canada chairman cardiovascular care case center challenge biometrics clinical clinician cohort held collaboration colleague collect collected college come comparison comprehensive condition consider core cost could criterio current database decision delivery developed developing development diagnosis diagnostic did disease dj doi downloaded edge mic e ed editorial trade electronic element ensure era et evidence example proposal f failure foundation from full further g government great guideline ha have health heart help high history hospital howB however identify if imaging important improve improvement entrepreneur increase increasing individual rmation initiative insight interest intervention into is issue itP k key large vol learning like likely made main making many march may md postgraduate measu med medical medical medicine medium method ginni more must national necessary need need needed new new

next no

no not now encourage offer on one online online opinion house or order organization other others ourP$ outcome p pain part particular patient patient performance functional physician pmid policy commitment population post potential potential plant practice probability procedure process professor proposed provide provides providing public published quality rather record reduce reduction registry regulation report reporting request required requirement result review role rst scenario selection set should show simil so social society some stakeholder state step study such support support surgery system take task technology test testing that that thatT the theirP$

there these

this tool trial understand university use use used using variable variation view vol wa washington way weP were what whenB whereB whether whichT wide will with without work would york a kin ability able accessible according accuracy achieve acm across podesta addition frame advance advancing alan algorithm victoria allows already also amount fun any apache architecture around array aspect kumar associated australia b based become benefit better better block book both building c cache call nsf capability capacity category centre change chapter characteristic federal chen china chinese choice class classification client climate cloud cluster clustering code collaborative collection coming committee common communication company complex complexit compression comput computation computational computational computer computing computing concept concern n conference congress connection consideration considered consortium content context neural control council cpu creating critical currently cycle d datasets day itP$ defense definition degree demand department design detail award develop device different digital buyer charging director discipline discus discussion disk distance distributed diverse do doing domain due during dynamic each earth ecology seizing education effect effective efficiency efficient efficiently effort electronic enable engine engineering invest enterprise environment environmental especially estimation europe european evaluation even every exascale exascale execution existing experience expertise exploit pm fact factor family far feature few field fig file finally find first sustainable focus transportation form format vehicle framework freeman frequency fully function fusion future gap generating generation give global global go good google gpu grant graph greater grid growing growth guest hadoop hand hardware heP commission higher hisP$ hold hpc fema huge human hundred iP ibm ict

idea ieee

image implementation importance improving include included including industry engage inform informatics infrastructure innovation institute int integration interface international international internet introduced involved breach j tcm job john crime july june just identity key kind know knowledge enforcement laboratory language layer lead eading led length level le library informed liu load local located location london lot m machine make manage management map multimedia mapreduce massive producer may mean measurement mechanism member memory message metadata mike million million partnership mit model modeling monitor logistics more polygon teaching most most movement much multiple network noaa node novel number object comment october often erratum only open open operation optimization specially out specie over own

paper paradigm parallel parallel anonymization pattern per period personal perspective phase physical platform point releasing possible power pp press primary priority problem proc proceeding processing hiv processor product productivity program programme programming project property provided publication ecosystem purpose put query protecting r range rate re real recent pbrn region related relationship observation remote unlock require requires research researcher resource respectively live retrieval cube right sS same opportunity scalability scalable scale scheduling scheme scholar school sci science scientific scientist search second section security sense sensing sensor freely september

funded funding

server service gao several sharing shown side significant nature simulation since single size small software solution source space spatial special special specific stage standard statistic statistical still storage store stored monitoring strategy defence structure structured student successful supercomputer supercomputing supporting surface survey symposium syst table team technique technological temperature temporal preserve protected than the themP therefore theyP thing those three through throughput thus time today together too top topic total total towards traditional transaction transfer transmission transparency transport tree trend two type uk under unit united unstructured up up us usa user variety various vector velocity version via virtual volume wang weather web well while rdf within work workshop world xu year zhang additional after allT laughter alzheimer analytics analyze assessment average base before being encourages benefit blood brain breast business cancer chain child claim clinical combination combined compared complication conducted wikipedia contribution create created creation dataset deliver dementia minister themed description diabetes difference differential bd direct disorder distribution doe drive drug early economic bulk end bipartisan estimate event found four fundamental generated given group had identified rm impact increasingly index individual input institution interaction postnotes bise defra life link literature low major making metric voter mortality national ranking objective obtained oncology out participant pathway person court prediction predictive present presentation pricing privacy private protection output question random recommendation regression relevant risk s sample screening selected sequence set sharing site six stroke subject summary surgical survival syndrome systemati target technical then therapy training treatment trust nadler uncertainty understanding coordination useful utilization parameter value visual weight word neuroimaging setting action advanced against agency agenda coalition copyright criminal analytical eq attention august smes expand bioinformatics biol biomedical fund bmc body box citizen neuron clear commercial community component consumer earthserver country course culture biobanks up document dynamic easier applicant economy cde emerging emr error neuroscience et explore extract federal figure finding funding future operator general get getting pcast ozone grow h healthcare here bmt increase increased innovative insurance intelligence datasim accountability last learned learning ppp line market smart societal written strengthen mining module month workforce name nation natural nt ontology particularly people place plan presented prototype quantitative quantity read reliability repository representation response right rule processing score semantic so congressman stated strong t term text urge usage very visualization want who working edp berkeley billion budget continue financial indiana cognitive looking mobile ohio optical predict coherentpaas protein provider signal longo south white y activity allow authority chinese correlation dimension ethic european facility focused ecpa illness mccourt investment item ssg pharmaceutical positive price horizon provision public recognition reform sampling infectious sector ceeds title u investigator administration cfpb asset justice meP judiciary candidate markey discovery driven emergency epidemiology equipment goal hypothesis inform informed lab battelle portman manuscript material matrix much daines office partner pharma nih profile protocol staff submission subset servicemembers odar sdrs tac postnote ci city energy implication integrating li bct prevention subgroup stone december density eu expert expression governance indicator metering uP v woman academic discrimination behavioral bureau california detection educational raise reaction safety series session skill supply week asked bias directive chief competition eld election genome instance jmir manager mckinsey fuzzy operational pathology political re regulatory announced revolution specic statement surveillance vast youP agriculture attribute believe called collecting commerce ecg form genetic genomics i inference intelligent maker news physic progress relation secure affair theory video academia accelerate columbus liberty biological biologist biology bring campaign parliament causal cell chemistry classication compound consent conversation obama decade dna personalised ehrs enhance ethical experiment flu food gene genetics genomic human keep launched law legal look matter might molecular mutation myP$ nat open personalized player plo portal production promise quickly official scientic see sequencing signature signicant speed transform tumor twitter vertex vision website whyB journal analyse animal annotation thursday apps assay audience suicide gft georgetown tweet board building x ceo chem chemical advisor feasibility credit crop david deployment driver abuse engagement ever executive expanding expect experimental gaining harnessing digital leader molecule organisation ostp pace park patent excellence teen president protect reader nursing similarity cornerstone editor stream subject update visualisation museum mirna hearing immune bioassay korea interoperability january industrial pair think act adoption advance american behavior bill biometric card consumption cultural customer debate donor mod drawing electric constitutional kid competitiveness firm forward hope yourP$ inquiry subcommittee l immunology join marketing marketplace procedia related universitaetsbibliothek forest mobility travel senator secretary manufacturing speech crapo original page phone prevent glennbegley erp reasoning turing released leadership revenue run s exploding said sam say sentiment sheP social anomaly civil nsa trajectory tv traffic us ve volunteer wiley donaldberry ericlander legislation adapted boost johnholdren fraud creative

(4)

in drawing the maps, others found them compelling

because of their complexity and beauty. Not to mention

the visualisations of aggregated topic relations that

mask all the uncertainties and artefacts within the

results. Because visualisations also form a crucial

inter-face with the data, mistranslations of numbers into

col-ours, sizes and spatial distances may also mislead the

researcher in his or her interpretations.

Different readers read texts differently. Different

algorithms do so, too. Computational text analysis is

able to reduce human interference to the task of

assem-bling algorithms which are much easier to check then

(5)

use of the mobility of digital devices (Ruppert et al.,

2013) and create mutable mobiles.

Im

mutable mobiles,

such as books and articles, allow for the unchanged

dissemination of ﬁndings, which then can be compared

against each other in order to produce new knowledge.

But the contestation of facts ‘created’ by others used to

necessitate the construction of another laboratory,

causing an ‘arms race’ within science (Latour, 1986).

Now, with the laboratory and its object being digital,

it becomes possible to transport both, the corpus and

all the used and unused algorithms, together with our

stories in order to allow for reconﬁgurations. Being

able to change parameters enables the viewer to get

an idea of how the researcher’s interpretation holds

up across diﬀerent variations of analytic assemblies.

The researcher on the other hand has to argue why

he or she had preferred a certain representation and

interpretation over others. As much as Big Data

changes the nature of the data used in social sciences

from data being constructed by researchers to found

data (Ruppert et al., 2013), it also provides an

oppor-tunity in reducing the role of a researcher’s authority in

vouching for his or her results. By producing

amend-able mutamend-able mobiles we can produce ‘objective’ results

thanks to the multiplicity of our assemblages.

Acknowledgement

I thank Endre Da´nyi and Susanne Bauer for their valuable

comments and support during the project.

Declaration of conflicting interest

The author(s) declared no potential conﬂicts of interest with

respect to the research, authorship, and/or publication of this

article.

Funding

The author(s) received no ﬁnancial support for the research,

authorship, and/or publication of this article.

References

boyd d and Crawford K (2012) Critical questions for Big

Data. Provocations for a cultural, technological, and

scholarly phenomenon.

Information, Communication &

Society

15(5): 662–679.

Buurma RS (2015) The fictionality of topic modelling:

Machine reading Anthony Trollope’s Barsetshire series.

Big Data & Society

2(2): 1–6.

Diesner J (2015) Small decisions with big impact on data

analytics.

Big Data & Society

2(2): 1–6.

DiMaggio P (2015) Adapting computational text analysis to

social science (and vice versa).

Big Data & Society

2(2):

1–5.

Harford T (2014) Big data: Are we making a big mistake?

Financial Times

28 March, 14.

Latour B (1986) Visualization and cognition: Thinking with

eyes and hands.

Knowledge and Society

6: 1–40.

Latour B and Woolgar S (1986)

Laboratory Life: The

Construction of Scientific Facts

Princeton University Press.

Latour B, Jensen P, Venturini T, et al. (2012) The whole is

always smaller than its parts – A digital test of Gabriel

Tarde’s Monads.

The British Journal of Sociology

63(4):

590–615.

Manning C (2011) Part-of-Speech tagging from 97% to

100%: Is it time for some linguistics? In:

Computational

Linguistics

and

Intelligent

Text

Processing:

12th

International Conference

(ed AF Gelbukh) Tokyo, Japan,

20–26 February 2011, pp. 171–189. Berlin: Springer.

Mu¨tzel S (2015) Facing Big Data: Making sociology relevant.

Big Data & Society

2(2): 1–4.

Ruppert E, Law J and Savage M (2013) Reassembling social

science methods: the Challenge of digital devices.

Theory,

Culture & Society

30(4): 22–46.

Venturini T and Latour B (2010) The social fabric: Digital

traces and quali-quantitative methods. In:

Proceedings of

Futur en Seine 2009

87–101. Paris: Editions Futur en Seine.

Available at:

www.tommasoventurini.it/web/uploads/tom-maso_venturini/TheSocialFabric.pdf

(accessed

7 May