Computational text analysis: Thoughts on
the contingencies of an evolving method
Daniel Marciniak
Abstract
Mapping a public discourse with the tools of computational text analysis comes with many contingencies in the areas of
corpus curation, data processing and analysis, and visualisation. However, the complexity of algorithmic assemblies and
the beauty of resulting images give the impression of ‘objectivity’. Instead of concealing uncertainties and artefacts in
order to tell a coherent and all-encompassing story, retaining the variety of alternative assemblies may actually strengthen
the method. By utilising the mobility of digital devices, we could create mutable mobiles that allow access to our
laboratories and enable challenging rearrangements and interpretations.
Keywords
Topic modelling, science and technology studies, text analysis, Big Data, network analysis, visualisation
Inspired by recent texts on quali-quantitative methods
(Latour et al., 2012; Venturini and Latour, 2010), the
research project recounted in this essay focussed on
mapping public discourse around Big Data in science
and politics. The aim was to test methods of
computa-tional text analysis on a comparatively large corpus of
documents collected from US, UK and EU
governmen-tal websites as well as articles from Web of Science. The
more I delved into the work, the more it became clear
that it would not only be about describing the topics
involved in the discourse, but also about the process
itself, that is, the assemblage of algorithms. In what
follows, I describe some of the contingencies involved
in the process and argue that retaining them may
actu-ally strengthen the method.
In its core, the procedure of computational text
ana-lysis is very similar to the purification process described
by Latour and Woolgar (1986) in
Laboratory Life
.
Texts are transformed into bags-of-words (organic
tissue is pure´ed), important words are filtered (the
material passes selective sifts), and over the course of
many iterations topics are inferred (spikes are
differen-tiated from noise). Just as the construction of facts in
science hides the circumstances of their production, it
would be possible to hide the plethora of necessary
decisions and present the resulting topics as facts. But
maybe because there is not yet an established routine to
computational text analysis, it is still possible to see all
the possible paths one could take in arranging
inscrip-tion devices. They are not yet blocked by what one
‘ought’ to do or by hard coded presets within software.
The task of mapping a discourse with computational
text analysis involves three interrelated areas of
con-cern: (1) the curation of a corpus of documents, (2)
actual data processing and analysis and (3)
visualisa-tion (see Figure 1). The difficulties associated with the
first task are not really new, as they come with every
kind of content analysis. Where are the boundaries of a
discourse? In addition, it is unclear what requirements a
corpus must meet so that the methods work properly
(DiMaggio, 2015: 3). For example, the disproportion
between the number of documents I had collected for
science and for politics seems to have resulted in a
model that is biased towards topics within science.
Department of Sociology, Goethe University Frankfurt am Main,
Frankfurt am Main, Germany
Corresponding author:
Daniel Marciniak, Goethe-Universita¨t Frankfurt am Main,
Theodor-W.-Adorno-Platz 6, 60323 Frankfurt am Main, Germany.
Email: [email protected]
Big Data & Society
July–December 2016: 1–5
!
The Author(s) 2016
Reprints and permissions:
sagepub.com/journalsPermissions.nav
DOI: 10.1177/2053951716670190
bds.sagepub.com
The influences of a corpus’ compositionality are easily
overlooked when the amount of collected documents
suggests that you have ‘seen it all’ – a mistake quickly
made with Big Data (boyd and Crawford, 2012;
Harford, 2014). However, because a computer is, in
contrast to humans, able to ‘unread’ a text by deleting
it from memory, computational text analysis may
pro-vide a platform on which the influence of a corpus’
compositionality on the outcome can be tested (e.g.
by drawing random samples).
The ability of testing multiple pathways also applies
to the second area of data processing and analysis.
There are numerous algorithms involved in processing
the text that can be combined in many different ways
(e.g. transforming a PDF file into text, applying
part-of-speech tags, detecting bigrams, named entity
recog-nition, topic modelling). And each of them provides a
source of uncertainty regarding how reliable they are
and how different orders of assembly may influence the
outcome. Take the question of employing bigram
detec-tion: Bigrams are pairs of words that co-occur
fre-quently, such as ‘Big Data’. Treating bigrams as
singular tokens (if the second word was a noun)
increased the number of words in my dictionary by
about 30 percent. This, in turn, changed word
frequen-cies tremendously and thereby changed the outcomes of
topic modelling algorithms. Furthermore, when we use
external material, such as stop word lists or sentiment
dictionaries, it is unclear whether they are universally
applicable (Diesner, 2015). However, as the task of
part-of-speech tagging illustrates, computational
meth-ods can be successively improved upon until they
per-form comparable to humans or even better (Manning,
2011). So, by seemingly reducing human interference,
the promise of computational text analysis is an
increase in ‘objectivity’ compared to classical methods
of content analysis. It utilises only the information
con-tained in the corpus, providing a result that is
essen-tially free from interpreter’s biases, for example
preconceptions about the documents (Buurma, 2015).
The topics stem from a process of reading between texts
that is impossible for humans to accomplish otherwise.
The result is a multidimensional space of meanings, a
space that somehow has to be reduced to a
two-dimen-sional
visualisation
to
make
it
accessible
to
interpretation.
Finding an appropriate visualisation to the various
computed results again came with a plethora of
deci-sions on layout algorithms, cut-off points, and
param-eters. The number of possible graphs is virtually endless
and none of them are necessarily ‘wrong’. Some yield
similar results albeit on different levels of abstraction
(e.g. healthcare vs. cancer treatment), some are
dis-torted, and most are plainly not interpretable albeit
not contradictory. Figures 2 and 3 show two different
read PDF
PDF Miner
read file
clean te
xt
remo
ve ur
ls
limit to Latin char
acter set
remo
ve e
xcess whitespace
natur
al language processing
split sentences
par
t-of-speech tagging
natur
al entity recognition (NER)
chunking
SENNA
Stanf
o
rd NLP
sentiment dictionar
y
compute sentiment scores
lemmatise
W
o
rdNet
w
ord_tag
big
rams
tr
ig
rams
N-g
rams
limit to letters
map str
ings to n
umbers
dictionar
y
bag-of-w
ords
cor
pus matr
ix
gensim
Latent Dir
ichlet Allocation
Latent Semantic Inde
cluster
ing
topic-w
ord-probability matr
ix
w
ord-dimension matr
ix
tf*idf
Wikipedia (mean w
ord frequency and SD)
W
elch's t-test
tok
en frequency and sd
sentence co-occurrence
inclusion inde
x
NER statistics
stop w
ords
sentiment score dictionar
y
search b
y hand
par
liament.uk
go
v.uk
europa.eu
whitehouse
.go
v
senate
.go
v
house
.go
v
computational cor
pus cur
a
tion
146 of 2970 political documents
W
e
b of Science
title includes'Big Data'
705 of 2291 science ar
ticles
remo
ve duplicates
full te
xt accessibility
Big Data as main topic
NL
TK
scip
y
matplotlib
g
raph f
o
rm
ats
*.net
*.ge
xf
manipulate v
ector g
raphic
pie char
ts
bar char
ts
scatter plots
netw
or
k
s
siz
e
colour
shape
pajek
gephi
cut-off point
distance measures
w
ord-w
ord-matr
ix
n
umber of topics
la
y
out algor
ithms
Kamada-Ka
w
ai
F
o
rce Atlas
F
ruchter
man Reingold
Openord
Y
ifan Hu
Blondel
k-Means
Multidimensional Scaling
inter
pretation
collection of documents
nodes
edges
line char
ts
n
umber of dimensions
cor
p
us cur
ation
data processing and analysis
visualisation
inter
p
retation
used in final result
not used in final result
e
x
ter
nal resources (tools and datasets)
Figur
e
1.
Assemblage
of
tasks,
algorithms,
and
resour
ces
emplo
yed
in
the
pr
maps of my corpus based on the same set of words.
They are the result of two different approaches to
making an underlying topical structure visible: as is
apparent from the isolated nodes in Figure 2, the first
strategy was to set a cut-off point to edge weights in
order to reduce the number of edges shown. While in
Figure 2 the nodes’ colours inform how the map is read,
Figure 3 adds readability by including the topics
them-selves as nodes. In this case only maximum weights or
those above a certain threshold are shown as edges. But
is the map that is more easily read also more ‘true’?
Even after being computationally filtered for words
that are important to the context (by frequency or in
comparison with other text corpora), topics remain
groupings of words that have to be made sense of.
After all, topic modelling only shifts the task of
inter-pretation
to
the
very
end
(Mu¨tzel,
2015:
2).
Paradoxically, the visualisations seem to lend
objectiv-ity to the subjective fictions that topics necessarily are:
While I was aware of the troubles I had to go through
a a a about access accessed address marciamcnutt advantage age aim al all althou am american among an analysis and application approach appropriate april are area article association at author available basis be because been benet best between big big administrative broad build but by can canada chairman cardiovascular care case center challenge biometrics clinical clinician cohort held collaboration colleague collect collected college come comparison comprehensive condition consider core cost could criterio current database decision delivery developed developing development diagnosis diagnostic did disease dj doi downloaded edge mic e ed editorial trade electronic element ensure era et evidence example proposal f failure foundation from full further g government great guideline ha have health heart help high history hospital howB however identify if imaging important improve improvement entrepreneur increase increasing individual rmation initiative insight interest intervention into is issue itP k key large vol learning like likely made main making many march may md postgraduate measu med medical medical medicine medium method ginni more must national necessary need need needed new new
next no
no not now encourage offer on one online online opinion house or order organization other others ourP$ outcome p pain part particular patient patient performance functional physician pmid policy commitment population post potential potential plant practice probability procedure process professor proposed provide provides providing public published quality rather record reduce reduction registry regulation report reporting request required requirement result review role rst scenario selection set should show simil so social society some stakeholder state step study such support support surgery system take task technology test testing that that thatT the theirP$
there these
this tool trial understand university use use used using variable variation view vol wa washington way weP were what whenB whereB whether whichT wide will with without work would york a kin ability able accessible according accuracy achieve acm across podesta addition frame advance advancing alan algorithm victoria allows already also amount fun any apache architecture around array aspect kumar associated australia b based become benefit better better block book both building c cache call nsf capability capacity category centre change chapter characteristic federal chen china chinese choice class classification client climate cloud cluster clustering code collaborative collection coming committee common communication company complex complexit compression comput computation computational computational computer computing computing concept concern n conference congress connection consideration considered consortium content context neural control council cpu creating critical currently cycle d datasets day itP$ defense definition degree demand department design detail award develop device different digital buyer charging director discipline discus discussion disk distance distributed diverse do doing domain due during dynamic each earth ecology seizing education effect effective efficiency efficient efficiently effort electronic enable engine engineering invest enterprise environment environmental especially estimation europe european evaluation even every exascale exascale execution existing experience expertise exploit pm fact factor family far feature few field fig file finally find first sustainable focus transportation form format vehicle framework freeman frequency fully function fusion future gap generating generation give global global go good google gpu grant graph greater grid growing growth guest hadoop hand hardware heP commission higher hisP$ hold hpc fema huge human hundred iP ibm ict
idea ieee
image implementation importance improving include included including industry engage inform informatics infrastructure innovation institute int integration interface international international internet introduced involved breach j tcm job john crime july june just identity key kind know knowledge enforcement laboratory language layer lead eading led length level le library informed liu load local located location london lot m machine make manage management map multimedia mapreduce massive producer may mean measurement mechanism member memory message metadata mike million million partnership mit model modeling monitor logistics more polygon teaching most most movement much multiple network noaa node novel number object comment october often erratum only open open operation optimization specially out specie over own
paper paradigm parallel parallel anonymization pattern per period personal perspective phase physical platform point releasing possible power pp press primary priority problem proc proceeding processing hiv processor product productivity program programme programming project property provided publication ecosystem purpose put query protecting r range rate re real recent pbrn region related relationship observation remote unlock require requires research researcher resource respectively live retrieval cube right sS same opportunity scalability scalable scale scheduling scheme scholar school sci science scientific scientist search second section security sense sensing sensor freely september
funded funding
server service gao several sharing shown side significant nature simulation since single size small software solution source space spatial special special specific stage standard statistic statistical still storage store stored monitoring strategy defence structure structured student successful supercomputer supercomputing supporting surface survey symposium syst table team technique technological temperature temporal preserve protected than the themP therefore theyP thing those three through throughput thus time today together too top topic total total towards traditional transaction transfer transmission transparency transport tree trend two type uk under unit united unstructured up up us usa user variety various vector velocity version via virtual volume wang weather web well while rdf within work workshop world xu year zhang additional after allT laughter alzheimer analytics analyze assessment average base before being encourages benefit blood brain breast business cancer chain child claim clinical combination combined compared complication conducted wikipedia contribution create created creation dataset deliver dementia minister themed description diabetes difference differential bd direct disorder distribution doe drive drug early economic bulk end bipartisan estimate event found four fundamental generated given group had identified rm impact increasingly index individual input institution interaction postnotes bise defra life link literature low major making metric voter mortality national ranking objective obtained oncology out participant pathway person court prediction predictive present presentation pricing privacy private protection output question random recommendation regression relevant risk s sample screening selected sequence set sharing site six stroke subject summary surgical survival syndrome systemati target technical then therapy training treatment trust nadler uncertainty understanding coordination useful utilization parameter value visual weight word neuroimaging setting action advanced against agency agenda coalition copyright criminal analytical eq attention august smes expand bioinformatics biol biomedical fund bmc body box citizen neuron clear commercial community component consumer earthserver country course culture biobanks up document dynamic easier applicant economy cde emerging emr error neuroscience et explore extract federal figure finding funding future operator general get getting pcast ozone grow h healthcare here bmt increase increased innovative insurance intelligence datasim accountability last learned learning ppp line market smart societal written strengthen mining module month workforce name nation natural nt ontology particularly people place plan presented prototype quantitative quantity read reliability repository representation response right rule processing score semantic so congressman stated strong t term text urge usage very visualization want who working edp berkeley billion budget continue financial indiana cognitive looking mobile ohio optical predict coherentpaas protein provider signal longo south white y activity allow authority chinese correlation dimension ethic european facility focused ecpa illness mccourt investment item ssg pharmaceutical positive price horizon provision public recognition reform sampling infectious sector ceeds title u investigator administration cfpb asset justice meP judiciary candidate markey discovery driven emergency epidemiology equipment goal hypothesis inform informed lab battelle portman manuscript material matrix much daines office partner pharma nih profile protocol staff submission subset servicemembers odar sdrs tac postnote ci city energy implication integrating li bct prevention subgroup stone december density eu expert expression governance indicator metering uP v woman academic discrimination behavioral bureau california detection educational raise reaction safety series session skill supply week asked bias directive chief competition eld election genome instance jmir manager mckinsey fuzzy operational pathology political re regulatory announced revolution specic statement surveillance vast youP agriculture attribute believe called collecting commerce ecg form genetic genomics i inference intelligent maker news physic progress relation secure affair theory video academia accelerate columbus liberty biological biologist biology bring campaign parliament causal cell chemistry classication compound consent conversation obama decade dna personalised ehrs enhance ethical experiment flu food gene genetics genomic human keep launched law legal look matter might molecular mutation myP$ nat open personalized player plo portal production promise quickly official scientic see sequencing signature signicant speed transform tumor twitter vertex vision website whyB journal analyse animal annotation thursday apps assay audience suicide gft georgetown tweet board building x ceo chem chemical advisor feasibility credit crop david deployment driver abuse engagement ever executive expanding expect experimental gaining harnessing digital leader molecule organisation ostp pace park patent excellence teen president protect reader nursing similarity cornerstone editor stream subject update visualisation museum mirna hearing immune bioassay korea interoperability january industrial pair think act adoption advance american behavior bill biometric card consumption cultural customer debate donor mod drawing electric constitutional kid competitiveness firm forward hope yourP$ inquiry subcommittee l immunology join marketing marketplace procedia related universitaetsbibliothek forest mobility travel senator secretary manufacturing speech crapo original page phone prevent glennbegley erp reasoning turing released leadership revenue run s exploding said sam say sentiment sheP social anomaly civil nsa trajectory tv traffic us ve volunteer wiley donaldberry ericlander legislation adapted boost johnholdren fraud creative