The curious anomaly of skewed judgment distributions and systematic error in the wisdom of crowds.

(1)

and Systematic Error in the Wisdom of Crowds

Ulrik W. Nash*

Strategic Organization Design, University of Southern Denmark, Odense, Denmark

Abstract

Judgment distributions are often skewed and we know little about why. This paper explains the phenomenon of skewed judgment distributions by introducing the augmented quincunx (AQ) model of sequential and probabilistic cue categorization by neurons of judges. In the process of developing inferences about true values, when neurons categorize cues better than chance, and when the particular true value is extreme compared to what is typical and anchored upon, then populations of judges form skewed judgment distributions with high probability. Moreover, the collective error made by these people can be inferred from how skewed their judgment distributions are, and in what direction they tilt. This implies not just that judgment distributions are shaped by cues, but that judgment distributions are cues themselves for the wisdom of crowds. The AQ model also predicts that judgment variance correlates positively with collective error, thereby challenging what is commonly believed about how diversity and collective intelligence relate. Data from 3053 judgment surveys about US macroeconomic variables obtained from the Federal Reserve Bank of Philadelphia and the Wall Street Journal provide strong support, and implications are discussed with reference to three central ideas on collective intelligence, these being Galton’s conjecture on the distribution of judgments, Muth’s rational expectations hypothesis, and Page’s diversity prediction theorem.

Citation:Nash UW (2014) The Curious Anomaly of Skewed Judgment Distributions and Systematic Error in the Wisdom of Crowds. PLoS ONE 9(11): e112386. doi:10.1371/journal.pone.0112386

Editor:Gonzalo G. de Polavieja, Cajal Institute, Consejo Superior de Investigaciones Cientı´ficas, Spain

ReceivedFebruary 7, 2014;AcceptedOctober 15, 2014;PublishedNovember 18, 2014

Copyright:ß2014 Ulrik W. Nash. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding:This work was funded through normal wages paid by the University of Southern Denmark, and thereby taxpayers of Denmark, to the author. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests:The author has declared that no competing interests exist.

* Email: [email protected]

Introduction

We measure and navigate our environment by making intuitive judgments, but these are fallible. By also giving weight to judgments made by others we can diminish our mistakes. The mean of many intuitive judgments, made by numerous different people, is accurate when judgments scatter around the truth. In fact, the mean is perfect when judgments scatter in symmetry around the truth, because then all mistakes of underestimation are matched by counterpart errors of overestimation. However, when the weight of judgments distribute in greater proportion on either side of the truth, the mean has error. Indeed, systematic error in the mean of judgments exists to the extent these distributions can be predicted. Given the trust bestowed upon popular judgment in democratic societies, it would have considerable implication for outcomes of decision making if such a phenomenon of predict-ability existed, because it would imply an avoidable type of mistake is currently being made in many domains, from the misdiagnosis of patients by consensus seeking doctors, to the misallocation of resources by consensus seeking managers, investors, and politi-cians. From that perspective this paper brings bad news, because it contains evidence of systematic error in the wisdom of crowds. However, there is also potentially good news, because collective error appears predictable by the way judgments observably scatter. Judgment distributions are often curiously skewed, something long known [1][2][3], but something we have deferred efforts to understand. Here I argue that when people use cues to make inferences about their environment, when people use these cues

with adeptness, and when the environment is extreme compared to the central tendency of peoples’ prior experience, then skewed judgment distributions occur with high probability. Moreover, collective error can be inferred from the degree and direction of judgment distribution skew, and from the degree of judgment distribution variance, implying that judgment distributions are shaped by cues, and are cues themselves for the wisdom of crowds; decision makers can infer collective intelligence by the shape of judgment distributions, and can moderate their confidence in popular judgment accordingly. We can even hope to repair the systematic error of our collective intelligence, and gain greater knowledge about our world. One candidate procedure for doing that is outlined in the discussion.

(2)

The AQ Model of Judgment

Neurons that categorize and accumulate information about the environment have been modeled with success by researchers in order to understand the neural basis of choice [7][8]. The cognitive mechanism studied involves competing neurons tuned to opposing hypotheses, whose firing rates indicate their level of confidence, and where these firing rates or levels of confidence are accumulated by other neurons positioned further downstream to generate an overall inference. When discrete packets of informa-tion consistent or inconsistent with competing hypotheses arrive at the brain, an accumulating variable is thereby created, which breaches one of two decision thresholds after a while. These models give researchers good reason to believe information clarity determines the speed at which people choose A over B (or B over A) and the amount of evidence needed to decide. But while current models thereby advance our understanding of binary choice, they cannot explain situations where people are required to form refined judgments.

So let us consider an alternative model of intuitive judgment, and let us, in the spirit of Galton’s seminal study, assume judges are competitors trying to guess the weight of an ox at the West of England Fat Stock and Poultry Exhibition.

The Problem of Discriminating

The problem faced by our judges involves discriminating between the weight of the particular ox presented to them,T, and the typical weight of oxen,m_T, the latter being common knowledge gained through experience.

Judges cannot measure T directly, but oxen have numerous perceptible regions,C, correlating withT. Judges use these regions to make inferences about how much the ox weighs. I follow standard practice and call these regionscues, while assuming they are stochastically independent. Regions might include, for instance, the height of the ox, or the degree to which its ribs are showing.

Across the population of all oxen, particular cues share the same correlation with weight, but they have various magnitudes, with some shoulders, say, being larger than others. The information value of the particular cue,ic, derives from both magnitude and correlation.

Typical cues have zero information value, while information values in general are distributed in symmetry around this level; particular cues contain more or less information by which to discriminate T positively or negatively from m_T. For example, while there is a typical degree of rib visibility, the visibility for any particular ox will be higher or lower, with more visible ribs informing judges they face an abnormally light ox(Tvm_T).

In their attempt to discriminateTfrommT, judges discriminate cues from what is typical, gathering evidence across all cues before guessing. I call this processcategorization, because cues are either greater or smaller than their central tendency. Cues at the central tendency are referred to astypicalcues.

Categorizing Cues by Voting Neurons

The process of categorizing leads to refined judgment (Figure 1). Consistent with models of binary choice, refined judgment involves two neuron classes. The first class is represented by two populations,LV andSV, which contain neurons tuned to different regions of cues, and which respond with mean spike rates proportional to the information presented in their preferred region; these are voting neurons, as indicated by the chosen subscript, whileLandSindicate their preference for competing hypotheses ‘‘larger than typical’’ and ‘‘smaller than typical’’. The

second class is represented by two counterpart neurons,LP and

SP, which receive and accumulate evidence from LV and SV respectively; these are polling neurons. Once again the chosen subscript indicates neuron class, while L and S again indicate preferences for the hypotheses ‘‘larger than typical’’ and ‘‘smaller than typical’’.

I assume that voting neurons remain unresponsive to cues outside their preferred region, implying that for every cue there are two active voting neurons, one inLV and one inSV. Voting neurons inLV andSV represent the competing hypothesesicw0 and icv0 respectively, and each population reveals its level of confidence through the firing rates of active members. For higher values ofic, the mean spike rate acrossLV, denotedmL_Vic, rises linearly, while the mean spike rate acrossSV, denotedmS_Vic, falls by the same absolute magnitude:

m_LV ic

~NVzic,m_SV ic

~NV{ic, ð1Þ whereNVis the mean firing rate of the neuron activated by the typical cue.

Opposing neurons inLVandSV compete using their response toic, with the neuron demonstrating greatest activity winning the right to encode ic as being compatible with the hypothesis it supports. This evidence is subsequently passed to the counterpart polling neurons inLPandSP using the following instructions:

L_Vic :

zDNV{m_LV

icj for win {DNV{m_LV

icj for loss 8

<

:

S_Vic :

zDNV{m_SV

icj for win {DNV{m_SV

icj for loss 8

<

:

ð2Þ

In words, when voting neurons win their competition, they instruct their counterpart polling neuron to increase its firing rate by an amount equal to the difference between the voting neuron’s usual response, and the firing rate occurring under typical circumstances. Losing has the effect of reducing firing rates by the same magnitude.

Noise and Categorization Error

If voting neurons could respond to ic using only their mean firing rate, then battles between voting neurons would lead to perfect categorization because the spike rate within LV would always surpass the spike rate withinSVwhenicw0, while the spike rate within SV would always surpass the spike rate within LV whenicv0. But firing rate variance introduces the possibility of categorization error.

Variance introduces the possibility that LV displays greater confidence even though icv0, or that SV displays greater confidence even thoughicw0. In each caseLP andSPare given false instructions, implying that an error of categorization has occurred.

(3)

rates for the active neuron in LV, from the distribution of firing rates for the competing neuron inSV, and second, we integrate the resultant Gaussian distribution from{?to0. This yields:

1{p~1 2

ffiffiffiffiffi 1

s2 r

sErfc m

ffiffiffiffiffi 1

s2 r

ffiffiffi 2

p

2

6 6 4

3

7 7 5

, ð3Þ

wherepdenotes the probability of correct categorization,mis the difference between the mean firing rates of active voting neurons, ands2is the sum of their firing rate variance.

Differentiating error equation (3) with respect tomandsreveals that frequencies of categorization error increase with s, and decrease withm; when voting neurons vary their firing rates more, or when the mean firing rates are less separated, which they are when the particular cue is typical, then the probability thatLVic

Figure 1. Information categorization by neurons leading to intuitive judgment. A. The weight of an ox must be judged. This true value

cannot be observed directly, but numerous perceptible parts of the ox correlate with how much it weighs. These parts, here indicated as shoulder and rump, are cues (C1andC2), and their validity as indicators of weight are either larger or smaller than what is typical for such cues, with validity relative this reference point determining their information content.B. Information categorization byvotingneurons. Two populations of voting neurons (SV andLV) support opposing hypotheses about the true value, in the present case, that weight is larger or smaller than typical. Each population contains neurons tuned to particular cues, so that for each cue there are two competing neurons. As visualized here by the size of dendrites, these neurons moderate their firing rates according to how consistent information is with their supported hypothesis, with the neuron displaying greatest activity wining the right to encode information as consistent with its preference. This information categorization process is fallible because firing rates of voting neurons vary around their appropriate levels, as visualized here using probability distributions. C. Evidence accumulation bypollingneurons. Polling neurons (SPandLP) receive instructions from counterpart voting neurons to increase or decrease their firing rates above normal levels (NP). More specifically, defeated voting neurons send inhibitory instructions, while winning neurons send instructions encouraging greater activity. As streams of instructions are received, two accumulating variables are thereby establishedASPandALP.D. Intuitive judgment. The firing rate of the most active polling neuron, in this caseLP, is ultimately converted to judgment by scaling this rate using an appropriate factor (f). This provides an inference about the true value’s degree of extremeness (et), while subsequent addition of the typically encountered true value,mT, provides an inference about the true value in absolute terms (eT).

(4)

and LVic will communicate wrong instructions to LP and SP increases. This probability drives most predictions, as we shall see.

Accumulating Evidence by Polling Neurons

As voting neurons categorize cues, polling neurons create evolving inferences aboutT. More specifically, neuronsLPandSP change their firing rates by increments equal to instructions received from neurons in LV and SV, as summarized by instruction equation (2). This creates two accumulating spike rates,ALP and ASP, anchored at the level expected forT~m_T, denotedNP, and moving up or down as instructions are received. After allC cues have been categorized, the deviation between max (ALP,ASP) and NP is proportional to how much the judge thinks T deviates from mT. When max (ALP,ASP) = ALP the deviation is proportional to how much the judge thinks T

surpassesmT, while the deviation is proportional to how much the judge thinksT is surpassed bym_T whenmax (ALP,ASP) = ASP.

Converting Spikes to Relevant Scale

Since ic is linearly related to mL_Vic and mS_Vic, the deviation between T and m_T can be measured on the relevant scale by multiplying max (ALP,ASP) by the appropriate constant f. Subsequent addition ofm_T gives the judgment aboutT:

eT~

z½max (A_LP,A_SP){NPfzm_T for max (A_LP,A_SP)~A_LP

{_½max (A_LP,A_SP){NPfzmT for max (ALP,ASP)~ASP (

ð4Þ

The Truth

While judgments are probabilistic according to the model, the true value is deterministic, and so is the true deviation betweenm_T and T. This value, henceforth referred to simply asextremeness, equals the sum of information values across all cues:

t~T{mT~ XC

c~1

ic: ð5Þ

Simplifying

When many competitions at the West of England Fat Stock and Poultry Exhibition are simulated (Code S1), skewed judgment distributions are often seen, and we notice negative correlation between collective error and judgment distribution skew, irrespec-tive of us adopting the mean or the median as vox populi

(Figure 2). Categorization error plays an important role in generating these patterns, but simplification is needed to understand how.

Things are made easier if we introduce the following assumptions: first, we restrict ic to the binary{1or1; now all cues have identical absolute information values. Second, we set DNV{mL_VicD~1so information, and the response to information by voting neurons, equate in absolute terms. As corollary, f~₁ because spike rates and the true value are measured on the same scale. Third, we make p an independent and homogeneous variable across judges to understand the effect of categorization error in general. Finally, we focus on t rather than T, so that examined judgments concern extremeness.

Sir Francis Galton’s [1] was the first scholar to publish ideas about how judgment distributions obtain their particular shape, but he admitted feeling uncertain about the answer due to his limited knowledge of psychology. From that perspective it is quite peculiar that distillation of the detailed model creates aquincunx, Sir Francis Galton’s eminent probability device [11], which he invented in 1873 to demonstrate the central limit theorem. However, unlike Galton’s version, which shows the dispersion of falling balls in general as they deflect past multiple rows of pins, this particular quincunx captures the probabilistic relation between the motion of balls in general, and the motion of an ‘‘attractor ball’’, given probabilities of the former balls following the direction of the latter ball at every row. In this augmented quincunx model, the path of the attractor ball captures extreme-ness of a true value, as indicated by numerous cues (pin rows), while the other balls together reveal the probabilistic process of categorizing those cues to make an inference (the particular compartment each ball settles in) (Figure 3). I have included an application (Application S1), which the reader can use to visualize the AQ Model and its main predictions.

Deriving the Distribution of Judgments

The distribution of judgments predicted by the AQ Model can be derived by conducting nz independent Bernoulli trials over cues positively associated witht, and n{ independent Bernoulli trials over cues negatively associated witht. Lety(q,n)denote the number of cues out ofn cues that are thought to be positively associated withtwhen the probability for such an opinion isq.

Every judgment equals the number of cues perceived to be positively associated witht, minus the number of cues perceived to be negatively associated witht, and each of these two categories can be further divided into cues perceived correctly and cues perceived incorrectly. Since these four numbers are constrained by

n+ the judgment is reduced to

et~y(p,nz){(nz{y(p,nz))

zy(1{p,n{){(n{{y(1{p,n{))

ð6Þ

~2y(p,nz)z2y(1{p,n{){C ð7Þ

Further constraining the two independent stochastic variables

y(p,nz) and y(1{p,n{), so they sum to a particular judgment, yields the distribution of judgments:

Prob (etDt,C,p)~

X

min (Cz₂et,Czt

2 )

k~max (0,etz₂t)

Czt

2

k

! C{t

2

Cz_et 2 {k

!

p2k{etzt

2 (1{p)C{(2k{etzt 2 )ð

8Þ

Deriving Moments of the Judgment Distribution

While the moments of judgments can be calculated from the distribution equation (8), it is more instructive to take an alternative view of the process by which the Bernoulli trials unfold. Let the judge’s categorization of one particular sequence of cues~cc be described by~uu~_fuigCi~1, where each categorization

(5)

the sequence of categorizations represents C independent Bernoulli trials with the following outcomes:

u~

z1 p

with probability

{1 1{p

8 > <

> :

ð9Þ

The judgment is simply the inner product of the cue and the categorization vectors:

et~ XC

i~1

ciui~~cc:~uu ð10Þ

From here the moments and cross-moments of the atomic variables of equation (9) are

E (u)~2p{1:U ð11Þ

E (u2)~1 ð12Þ

E (u2kz1₎_~_E₍_u₎_~₂_p_{₁

ð13Þ

E (u2k)~E(u2)~1 ð14Þ

Var (u)~1{(2p{1)2~1{_U2:V ð15Þ

Figure 2. Simulation of judgment distributions using the detailed neuronal model.Two separate batches of10000judgment distributions,

(6)

V_I(_f1,2,. . .,Cg : EðPi[IuiÞ~Pi[IE (ui) ð16Þ

V_i=j : CoVar (u ui j)~0 ð17Þ and moments of the judgment distribution are calculated using these. For example, the mean judgment is derived as

E (et)~E XC

i~1 ciui

!

~X

C

i~1

ciE (ui)~U XC

i~1

ci~Ut ð18Þ

In turn, higher order raw moments can be calculated by carefully categorizing terms according to how many indices collide. For example, the second raw moment is derived as

E (e2

t)~E XC

i~1 ciui

! XC

j~1 cjuj

!!

ð19Þ

~E X

C

i~1

c2_iu2_izX C

i~1

X

j=i

cicjuiuj !

ð20Þ

~X

C

i~1

c2_iE (u2_i)zX C

i~1

X

j=i

cicjE (uiuj) ð21Þ

~CzU2X

C

i~1

X

j=i

cicjzU2 XC

i~1

X

j~i

cicj{U2 XC

i~1

X

j~i

cicj ð22Þ

~CzU2XC i~1

XC

j~1

cicj{U2C ð23Þ

~VCzU2t2 ð24Þ

noting thatc2

i~1. Subtracting equation (18) from equation (24), the variance s2~VC is obtained. Following this scheme, the mean, variance, and skew of the judgment distribution are found:

mean(et)~Ut~(2p{1)t ð25Þ

var(et)~VC~4C(1{p)p ð26Þ

skew(et)~{ 2Ut C ffiffiffiffiffiffiffiffi

VC

p ~ { (2p{1)t

C ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

C(1{p)p

p ð27Þ

Predictions

As described under the subheadings below, the AQ model makes four central predictions. The first prediction relates to what Galton described as the ‘‘curious anomaly’’ of judgment distribu-tion skew [1], with reference to his observadistribu-tion in Plymouth.

Prediction 1: Judgment Distributions are often Skewed

Setting skew(et)~0and solving for t, p, and C provides the conditions for judgment distribution symmetry. These conditions

are t~0 or p~1

2, while there is no solution for C. If the ox presented at Plymouth had weighed what oxen typically did in 1906, or if judges had categorized cues arbitrarily, then according to the AQ model, Galton would probably have observed a symmetric judgment distribution. However, Galton reported skew [1].

Skewed judgment distributions are predicted to occur when

p=1

2andt=0, and we can hypothesize that judges at Plymouth

Figure 3. Deriving the AQ Model.The AQ model can be derived

from the detailed neuronal model by assuming unit information content of cues ({1or1), unit firing rates among voting neurons ({1 or 1), and by capturing categorization error directly (1{p). This distillation transforms the detailed model into what can be character-ized as an augmented quincunx, that is to say, an augmented version of Sir Francis Galton’s original probability device. Although simple, the AQ model captures the probabilistic relation between our inferences about how unusual situations are (et) and what actually is (t), which is argued to originate from an uncertain cognitive process of categorizing information contained in cues (C). In the AQ model, rows of pegs represent cues, while balls falling through the system into one of various compartments represent the probabilistic categorization of these cues. The true value is computed by the distinct path taken by an attractor ball around pegs in the correct way.

(7)

not only confronted an ox with exceptional weight, but possessed neurons capable of categorizing the associated cues adeptly. Indeed, since the denominator of skew equation (27) is positive for all permitted values of C, and since 2p{1 is positive for all relevant values of p, we can hypothesize judges were not just confronted with an exceptional ox broadly speaking, but were confronted with an exceptionally heavy ox. This hypothesis is made considering the negative skew reported by Galton, and notingskew(et)v0only whentw0. According to the AQ model, had the ox been exceptionally light, then Galton would probably have oberved positive skew.

Prediction 2: There is Systematic Error in the Wisdom of Crowds

The AQ model predicts systematic error in the wisdom of crowds. To see this, first subtract the mean judgment equation (8) fromtto get the expression for collective error:

CE~t{(2p{1)t, ð28Þ

Now set CE~0 and solve for t and p. This provides the conditions where the wisdom of crowds is perfect, which it is when

p~1ort~0. In words, the wisdom of crowds is infallible when neurons of judges make no categorization errors, or when the true value is typical, or both. Conversely, collective error is predicted when the true value is extreme and neurons categorize cues imperfectly under that condition.

The condition p~1 is simple to understand, because when every person makes the judgmentet~t, then the mean judgment equation (25) equals t, and equation (28) becomes 0. In comparison, the condition t~0 is more involved, because understanding why the wisdom of crowds can sometimes be perfect, even though neurons categorize cues arbitrarily, and why sometimes arbitrary categorization is the direct cause of collective error, can seem strange.

But consider the level of the individual and the probability of making various judgments, as captured by distribution equation (8). More specifically, consider in turn the two possible casest=0 andt~0. Whent=0the stream of perceived information broadly agrees in the sense total evidence for the hypothesistw0(ortv0) surpasses the alternative. The implication is that miscategorization of any cue supporting the correct hypothesis is more costly, because to compensate other neurons would need to miscategor-ize, in equal proportion, cues supporting the alternative. But these cues are rarer whent=0. Indeed, in the most extreme case there is no tolerance for miscategorization, because there are no conflict-ing cues.

In contrast, whent~0the neurons can, in principle, be exactly wrong about every cue and still cause a perfect judgment aboutt, because the weight of evidence for tw0 equals the weight of evidence fortv0. More generally, the cost of miscategorization is smallest when t~0 because here the probability of cancelling mistakes by chance is greatest. Furthermore, when the true value is typical, the probability is identical irrespective of the direction of miscategorization, which is untrue when the true value is extreme. Moving to the collective level again, these observations are important because they affect the probability that judgments made by many people will scatter in symmetry around the truth, and consequently the likelihood of an errorless crowd. Whent~0the judgment distribution tends to be symmetric for all crowds of fallible people (p=1), but when t=0 more than half the judgments tend to be smaller than t (Figure 4). Moreover, the

degree to which smaller judgments outweigh larger judgments increases ast grows, and as papproaches 0:5. As corollary, to counter the rising effect on collective error of greater extremeness,

pmust rise, and for this reason individual adeptness is predicted to be important for collective intelligence in extreme situations, but not under typical circumstances.

Prediction 3: The Power of Diversity is Absent

One of the most common perceptions about the wisdom of crowds is that more predictive diversity leads to greater collective intelligence. We can use the appealingdiversity prediction theorem

introduced by Page [6] to examine if the AQ model agrees. The theorem is as follows:

CE2_~_AIE_{_var(_e

t), ð29Þ

where AIE denotes average individual error. The message provided by diversity prediction theorm (8) is simple: holding average individual error constant, collective error will decrease if judgment variance is raised; apparently there are benefits to forming collectives whose members view the world as differently as possible.

But the problem with the diversity prediction theorem is equally plain. While (8) is an identity, and therefore always holds mathematically, it places no restriction on the actual relationship betweenAIE andvar(et), except thatAIEw~var(et). In other words, diversity might increase to compensate, or even overcom-pensate for greater average individual error in reality, or it might not. The diversity prediction theorem may hold either way, and cannot be used in isolation to predict if diversity actually has the effect so often bestowed upon it [6][12]. The AQ model, however, is clear on this matter: the power of diversity is absent.

To see why, start by isolating AIE in (13) using (9) and (11) to get:

AIE~4(1{p)t2(1{p)zCp

: ð30Þ

Now differentiate equation (30) with respect top,Candtto see howAIEchanges with these variables, and differentiate variance equation (26) with respect topandCto see howvar(et)is affected; the relative movement of var(et) and AIE is what we must understand:

dAIE

dp ~C(4{8p){8t

2₍₁_{_p_), _ð₃₁_Þ

dAIE

dC ~4(1{p)p, ð32Þ

dAIE

dt ~8t(1{p)

2_, _ð₃₃_Þ

dvar(e)

dp ~C(4{8p), ð34Þ

(8)

dvar(e)

dC ~4(1{p)p: ð35Þ

Numerous observations can be made from equation (31) to equation (35). First, notice that when 0:5ƒpƒ1, then 8t2₍₁_{_p₎_§₀_{in equation (31) and}_C₍₄_{₈_p₎_ƒ₀_{in equation (31)}

and equation (34). Increasing p therefore never raises average individual error, nor does it ever increase the variance of judgments. Indeed, for all permitted values of p=0, average individual error and variance both increase when p decreases, implying strictly positive association.

Second, from equation (31) and equation (34) we notice that any rise in judgment variance, produced by decreasing p, is never

larger than the simultaneous rise in average individual error, implying collective error can never decrease from this effect. In other words, diversity never has positive consequence for collective error overall.

Third, the increase in diversity arising from greater numbers of cues, as indicated by equation (35), equals the simultaneous increase in average individual error indicated by equation (32), thereby cancelling the effect of raising diversity again. Finally, the rise in average individual error occurring whenDtDrises above0in equation (33) is not accompanied by any alleviating effect of greater diversity, because judgment variance is not predicted to change with extremeness. In short, according to the AQ model, diversity has no power to increase collective intelligence overall. On the contrary, crowds producing less predictive diversity are predicted to be wiser.

Figure 4. Judgment distributions formed by different crowds in typical and extreme situations.According to the AQ model, true value

(9)

Prediction 4: Judgment Distributions are Cues for Collective Intelligence

The AQ model predicts that when neurons categorize cues imperfectly, but not arbitrarily (0:5vpv1), then collective error is negatively correlated with judgment distribution skew. Under this condition, and when the true value is atypically small(tv0), then judgment distribution skew is positive(skew(et)w0)and collective error is negative(CEv0), while the opposite occurs when the true value is atypically large(tw0).

Collective error is not predicted to cause judgment distribution skew, nor is skew predicted to cause collective error. Rather, both are caused by the presence of an extreme true value, combined with better than chance categorization of cues by neurons of judges. To see this, consider the situation where every cue points towards the same conclusion. Here perfect judgment by the crowd demands perfect categorization by every neuron of every judge, because otherwise the mean judgment will overestimate the truth whentv0, and underestimate the truth whentw0.

However, while some neurons categorize accurately, others will, by chance and fallibility, categorize imperfectly, creating collective error and judgment distribution asymmetry simultaneously. As corollary, because skew is observable, it can be used to foretell collective error.

Materials and Methods

With the aim of testing predictions of the AQ model, I gathered 3053 publicly available judgment distributions from The Federal Reserve Bank of Philadelphia’s Survey of Professional Forecasters (FRBP), and The Wall Street Journal’s Economic Forecasting Survey (WSJ). Distributions came from six datasets (Table 1), three from each source, and concern official measures of the US economy. From WSJ these measures were consumer price inflation (Data S1), GDP (Data S2), and unemployment (Data S3), while measures from FRBP were housing starts (Data S4), nominal GDP (NGDP) (Data S5), and unemployment (Data S6). Participants in each survey were economists. Surveys by FRBP are conducted quarterly, with participants required to respond midway through each focus quarter; the surveys I used were those available since initiation in 1969 until 2011. In comparison, the surveys by WSJ are conducted monthly, with participants required to respond during the first week of the focus month. Here I used available surveys since initiation in 2004 until 2011.

For the purpose of testing the AQ model, judgment data has to meet three important criteria. First, sufficient numbers of individuals must participate in each survey to provide reliable

estimates of judgment distribution moments. Since 36 to 37 economists participated in each FRBP survey on average, while 54 to 55 economists participated on average in surveys from WSJ, the chosen data met this yardstick satisfactorily. Second, the forecast-ing methods applied by participants must involve intuition, since the AQ model is about that process. Based on information provided by Stark [13], an estimated 96 percent of FRBP economists apply intuition when forecasting, and I had no reason to suspect that participants surveyed by WSJ were different. Finally, participants must be adept judges, since the AQ model predicts that judgment distribution skew only occurs when neurons categorize cues better than chance. Since participants in all surveys are select economists, this criterion was also satisfied.

Each hypothesis was tested by calculating the Pearson correlation coefficient between variables in question, and observ-ing levels of significance. One-sided tests were applied throughout, except for H7, where no correlation is predicted by the AQ model. Effect sizes based on Cohen’s [14] classification were also reported.

Variables

All variables exceptpandCwere measurable from the gathered judgment distributions, and from publicly available data on realized economic figures. The expert status of surveyed econo-mists, however, naturally separated the range of p from those equal to 0:5, the latter being associated with judges who use information arbitrarily. In other words, I could be quite certain my analysis would say little about judgment distributions formed by novices. Moreover, the inability to measureCprevented me from testing the predicted effects of cue numerosity. The remaining variables, however, were operationalized as follows:

Mean of judgments: mean(eT)i~ 1

Ni XNi

ji~1eTji, where i denotes the particular survey,jidenotes the individual participant in surveyi,Nidenotes the number of participants in surveyi, and

eTji denotes the judgment submitted by participantji.

Variance of judgments: var(eT)i~ 1

Ni XNi

ji~1(eji{

mean(eT)i)2.

Skew of judgments: skew(eT)i~ 1

Ni XNi

j~1(eTji{ mean(eT)i)

3₌_(var( eT)i)

3=2

.

Table 1.Surveys of Expectations.

Source Survey Period Frequency

The Federal Reserve Bank of Philadelphia (FRBP) Nominal GDP 1969 to 2010 Quarterly

Unemployment 1969 to 2010 Quarterly

Housing Starts 1969 to 2011 Quarterly

The Wall Street Journal (WSJ) GDP 2004 to 2011 Monthly

Unemployment 2004 to 2011 Monthly

Inflation 2004 to 2011 Monthly

Data was obtained from the Federal Reserve Bank of Philadelphia’s Surveys of Professional Forecasters, and the Wall Street Journal’s Economic Forecasting Survey. Judgments from the Federal Reserve Bank of Philadelphia about nominal GDP and housing starts concern annual percentage growth in seasonally adjusted values, while judgments about unemployment concern seasonally adjusted workforce percentages. Judgments from the Wall Street Journal also concern annual percentage growth, except for unemployment, which concerns percentages of the workforce. All judgments concern the US macroeconomic system.

(10)

Average individual error: AIEi~ 1

Ni XNi

ji~1(Ti{eTji)

2

,

where Ti denotes the realized economic value in question for the particular survey.

Collective error:CEi = mean(eT)i{hi. Note that CE2i is applied in H6 below.

Extremeness: ti~(Ti{m)=s, where m is the mean of true

values across all surveys,1

n

Xn

i~1Ti, and wheresis the standard

deviation of true values across these surveys,

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1

n

Xn

i~1(Ti{m) 2

r

.

Hypotheses

The status of economists as adept creates the lower bound

pw0:5, while the upper boundpv1is created by the complexity of macroeconomic systems, combined with the fallibility of human judgment generally. Based on these constraints, the tested predictions of the AQ model were those listed below, with values in brackets indicating the corresponding mathematical expressions presented earlier:

H1: Judgment distributions have greater negative (positive) skew when prevailing true values are progressively large (small) compared to average (26).

H2:The mean of judgment distributions increasingly underes-timates (overesunderes-timates) prevailing true values that are progressively large (small) compared to average (25)(28).

H3: Average individual error is greater when prevailing true values are progressively more extreme compared to average (30).

H4: Judgment distributions have greater negative (positive) skew when the mean of judgments underestimates (overestimates) the prevailing true value by greater margin (25)(27)(28).

H5: Greater judgment variance is associated with greater average individual error (31)–(34).

H6: Greater judgment variance is associated with greater collective error squared (31)–(34).

H7: There is no association between judgment distribution variance and how extreme the prevailing true value is compared to average (26).

Results

Evidence supports the AQ model well. For visual evidence, please refer to Figure 5 for the case of judgments about US GDP

Figure 5. Empirical patterns of judgments about US GDP and unemployment. This figure shows associations between judgment

(11)

and unemployment from FRBP, Figure 6 for the case of housing starts and inflation collected from FRBP and WSJ respectively, and Figure 7 for the case of judgments about US GDP and unemployment from WSJ. In 5 of 6 data sets the skew of judgments correlated negatively with extremeness (H1, Table 2). Moreover, collective error correlated positively with extremeness in all data sets (H2, Table 3), as did average individual errors in 5 of 6 cases (H3, Table 4). Hence individual economists and groups of economists generally overestimated situations where measures of the US economy were smaller than their historic levels, while they underestimated when measures were larger. Moreover, in all cases the skew of judgments correlated negatively with collective error, although signs were significant in only 3 cases (H4, Table 5). Results were also clear on the issue of diversity. In all data sets, greater variance of judgments was positively correlated with average individual error (H5, Table 6). Moreover, when variance was greater, the associated average individual error rose in greater proportion, leading to negative correlation between collective error and judgment variance (H6, Table 7). In other words, observations of greater diversity were generally accompanied by observations of smaller collective intelligence. These findings are all consistent with the AQ model, but the final result is not: in 5 of

6 data sets the variance of judgments made by economists was found to be greater when economic measures were more extreme (H7, Table 8). In comparison, the AQ model predicts that judgment variance and extremeness are independent. For visual evidence, please refer to Figure 5, 6 and 7. Figure 5 concerns judgments about US GDP and unemployment obtained from FRBP, Figure 6 concerns judgments about US housing and inflation obtained from FRBP and WSJ respectively, while Figure 7 concerns judgments about US GDP and unemployment obtained by WSJ.

Discussion

I introduced and tested the augmented quincunx (AQ) model of probabilistic cue categorization by neurons of judges. My purpose was to discover if the scatter of intuitive judgments made by many different people can be predicted from the way neurons generate inferences about the environment, and to discover if the mean of judgments systematically deviates from the truth.

In the process of developing inferences about true values, when neurons categorize cues better than chance, and when the particular true value is extreme compared to what is typical and

Figure 6. Empirical patterns of judgments about US housing starts and inflation.This figure shows associations between judgment

distribution skew, collective error, and extremeness of US housing starts and consumer inflation, compared to what is typical for these variables (solid green line). Judgments in the left column were made by economists about the annual seasonally adjusted growth of US housing starts, as obtained from the Survey of Professional Forecasters conducted by the Federal Reserve Bank of Philadelphia. Judgments in the right column were made by economists about the annual rate of US consumer price inflation, as obtained from the Economic Forecasting Survey conducted monthly by the Wall Street Journal.

(12)

anchored upon by individual judges, then skewed judgment distributions will emerge with high probability according to the AQ model. Moreover, according to the AQ model, collective error can be inferred from the degree and direction of judgment distribution skew, and from judgment distribution variance, implying not just that judgment distributions are shaped by cues, but that judgment distributions are cues themselves for collective intelligence.

Using 3053 distributions of judgments about the US economy formed by leading economists, I found evidence supporting the AQ model. These findings suggest that trust in the wisdom of crowds should be moderated by considering the adeptness of people in the crowd, and by the particular way their judgments are observed to scatter.

Figure 7. Empirical patterns of judgments about US GDP and unemployment.Shown are associations between judgment distribution

skew, collective error, and extremeness of US GDP and unemployment compared to what is typical for these variables (solid green line). Judgments were made by economists about the annual growth of US GDP, and the rate of US unemployment, as obtained from the Economic Forecasting Survey conducted monthly by the Wall Street Journal.

doi:10.1371/journal.pone.0112386.g007

Table 2.Testing H1: Negative association betweentandskew(eT).

Source Survey Correlation Significance (1-tail) Effect Size Sign N

FRBP Nominal GDP 20.180*** 0.000 Small Correct 809

Housing Starts 20.078* 0.013 Small Correct 818

Unemployment 0.004 0.453 Incorrect 819

WSJ GDP 20.281*** 0.000 Small Correct 287

Inflation 20.261*** 0.000 Small Correct 161

Unemployment 20.285*** 0.000 Small Correct 159

Evidence suggests that atypically large true values are associated with judgment distributions that have greater negative skew. The reverse holds for atypically small true values.

(13)

Galton’s Conjecture on the Distribution of Judgments

Three important ideas are affected by this paper. The first is Galton’s conjecture on the distribution of judgments, which Galton expressed in his seminal paper on the wisdom of crowds from 1907 [1]. The ‘‘curious anomaly’’ of the skewed judgment distributions, which Galton observed at the West of England Fat Stock and Poultry Exhibition, was thought to be caused by small varieties of different formulae among those who competed to guess the weight of an exhibited ox. The AQ model, and the presented evidence, agrees to some extent with Galton, because smaller judgment variance and greater judgment skew are predicted to occur often and together when people in the crowd are experts.

Now, we know Galton believed judges were ordinary people, but the opposing view of English botanist Perry-Coste must be considered [15]. After Galton’s seminal article appeared in Nature, Perry-Coste argued with conviction to Norman Lockyer, the founding editor of Nature, that Galton had not been exposed to Vox Populi, as the title of his article indicates he believed, but Vox Expertorum. Judges were, so Perry-Coste argued, butchers and farmers whose livelihood depended on their ability to appraise the weight of farm animals before trading, and it appears to be an excellent point.

But Galton also missed another effect, namely interaction between judge adeptness and environmental extremeness; skewed judgment distributions are rare even when judges are expert, unless the subject of judgment deviates from the central tendency of prior experience.

Of course, we can only speculate about the famous ox; we know its weight, but no information was provided by Galton about its breed. All we know is that Galton described it as being ‘‘fat’’. Nevertheless, my recent correspondence with Professor Van Vleck, an esteemed cattle geneticist, combined with evidence

provided by McMurry [16] of Cargill Animal Nutrition, suggest judges in Plymouth were indeed presented with an exceptionally heavy specimen. According to Van Vleck, a contemporary male ox kept until maturity can reasonably weigh 2000 lb, while its dressed weight often lies in the vicinity reported by Galton. However, the situation was different back then. Due to crossbreeding, improvements in health programs, and improve-ments in nutrition programs, mature sizes have grown. Indeed, according to McMurry, the average bull carcass has become 30 percent heavier in the last 30 years alone. Therefore, while we cannot be sure about the particular breed, there is good reason to believe judges were presented with an exceptionally heavy ox. And this is important, because that direction of extremeness, combined with Galton’s observation of negative skew and underestimation by the mean, creates circumstances exactly consistent with those the AQ model predicts.

Muth’s Rational Expectations Hypothesis

The second important idea affected by the present paper is Muth’s rational expectations hypothesis from 1961 [4][5]. Muth was right, the mean of judgments often does perform well. Nevertheless, the content of the present paper suggests the mean judgment is rational on average only, because in the particular it systematically errors depending on individual expertise and the extremeness of what is being judged.

Muth supported his assumption of rationality by noting empirical observations made by Heady and Kaldor about farmer judgments [2]. These researchers had investigated judgments about agricultural prices during 1948 and 1949, and had discovered that mean judgments across the 168 to 176 surveyed farmers corresponded to eventual prices well. What Muth did not disclose, however, but what Heady and Kaldor had made clear,

Table 3.Testing H2: Positive association betweentandCE.

FRBP Nominal GDP 0.785*** 0.000 Large Correct 809

Housing Starts 0.742*** 0.000 Large Correct 818

Unemployment 0.426*** 0.000 Medium Correct 819

WSJ GDP 0.900*** 0.000 Large Correct 287

Inflation 0.173* 0.015 Small Correct 161

Unemployment 0.840*** 0.000 Large Correct 159

Evidence strongly suggests that atypically large true values are associated with positive collective error, or alternatively, that collectives underestimate atypically large true values. The reverse occurs for atypically small true values.

doi:10.1371/journal.pone.0112386.t003

Table 4.Testing H4: Positive association betweenAIEandDtD.

FRBP Nominal GDP 0.312*** 0.000 Medium Correct 809

Housing Starts 0.620*** 0.000 Large Correct 818

WSJ GDP 0.766*** 0.000 Large Correct 287

Inflation 0.617** 0.001 Large Correct 161

Unemployment 20.035 0.331 Incorrect 159

(14)

was that 3 of 4 examined judgment distributions were noticeably skewed, while the last distribution was ‘‘nearly normal’’. More precisely, all distributions were more or less positively skewed, and their mean overestimated actual prices in 3 of 4 cases by 8 to 27 percent, while the actual price was underestimated by 1 percent in the final case. Given the present paper we recognize these observations as consistent with predictions of systematic error in the wisdom of crowds, and thereby systematic violation of Muth’s influential assumption about collective intelligence.

Page’s Diversity Prediction Theorem

The final idea affected by the content of the present paper is the power of diversity to reduce collective error, as argued by Page using his diversity prediction theorem [6]. While transition from individual, to group, and finally to crowd, will involve the introduction of predictive diversity whenever people are fallible, and while such diversity will introduce collective intelligence, it should not be concluded in haste that introducing greater diversity to already established crowds is beneficial too. Indeed, from the practical perspective of assembling collectives, the designer must be careful to separate the beneficial effects of increasing the number of collective members, from the costly effects of introducing individuals to crowds who are less adept than average.

Weaknesses and Strengths

At this point an apparent weakness must be stated, which limits the scope of supported conclusions about diversity. I wrote the above to coincide with the predictions and findings presented, but the logic of the AQ model leads to the idea that diversity can, contrary to predictions under the chosen assumptions, be positively associated withp, and that increasing the diversity of crowds can be beneficial sometimes after all. I have throughout

assumed homogeneity in the probability of categorization error, but let us consider a situation where homogeneous novices are joined by experts, creating an assortment ofpin the collective. Furthermore, let us assume this event happens under extreme circumstances.

Before the experts arrived, collective error would be quite substantial (revisit Figure 3), but two things now happen. First, the mean judgment moves farther away from the typical value and closer towards the truth, driven by smaller individual errors among the newcomers. Second, the variance of judgments increases. In other words, contrary to predictions under the assumption of homogeneity, there is reason to suspect introducing more diversity can be positive, if the increase in diversity is caused by adept newcomers, and if the situation is extreme.

Yet the assumption of homogeneity appears to generate another weakness too, namely the discovered inability to explain why predictive diversity correlates positively with extremeness (revisit Table 8). To see this, let us continue our story and let us assume the new collection of heterogeneous judges must now evaluate numerous true values in succession. Under typical circumstances the distribution of judgments will centre on the truth, with variance determined mainly by errors among the incumbents. But as the true value becomes more extreme, the combined behaviour of experts and novices becomes pivotal. Judges whose neurons categorize cues arbitrarily will be unresponsive to how extreme the true value is, while adept judges will form judgments moving with the truth. In other words, for the entire collective of heterogeneous judges the spread of judgments increases with extremeness, and the inconsistencies shown in Table 8 are thereby explained. In passing, note that skew is amplified during this process by the combination of unresponsive novices, and experts tuned to developments. Indeed, using the application supplied in Applica-tion S1, the reader can verify these claims.

Table 5.Testing H3: Negative association betweenskew(eT)andCE.

FRBP Nominal GDP 20.111** 0.001 Small Correct 809

Housing Starts 20.011 0.376 Correct 818

Unemployment 20.027 0.217 Correct 819

WSJ GDP 20.163** 0.003 Small Correct 287

Inflation 20.011 0.445 Correct 161

Evidence mildly suggests that negatively skewed judgment distributions are associated with positive collective error, or alternatively, that negative skew of the judgment distribution signals underestimation by the collective. The reverse occurs for positively skewed distributions.

doi:10.1371/journal.pone.0112386.t005

Table 6.Testing H5: Positive association betweenAIEand var(eT).

Housing Starts 0.482*** 0.000 Medium Correct 818

WSJ GDP 0.269*** 0.000 Small Correct 287

Ination 0.249** 0.001 Small Correct 161

(15)

Meanwhile, other assumptions can also be debated. First, there is the assumption individuals form judgments by using the central tendency of prior experience as their initial thought, and adjust away from this anchor as unusual information is received. An alternative approach would be to assume individuals anchor at zero, and adjust their inference up or down by the full information value of categorized cues, as is common when judgments are modeled using regression [17][18]. Either way, reference point logic is applied in ways consistent with neurons accumulating discrete evidence sequentially and probabilistically, but only the chosen approach is consistent with established ideas on self-generated anchoring [19]. Indeed, the chosen assumption leads to conclusions relevant for research on the anchoring heuristic, by providing an explanation for the phenomenon of incomplete adjustment [20].

Second, the assumption individuals have generated identical anchors appears unrealistic for novices with little experience, but for individuals with substantial experience working on the same problem, the assumption appears unproblematic. Indeed, sub-stantial reference point diversity among experts would need explaining.

Third, in reality many aspects of the environment contain redundant information, because they are not independent of other aspects. I have assumed stochastically independent cues, which may, from that perspective, be considered wrong. However, while many aspects of the environment correlate, most correlate imperfectly, which implies that despite carrying some redundant information, not all information carried by dependent aspects is superfluous. Indeed, the assumption of stochastically independent cues can be viewed as an assumption of modularity in the environment, with cues being defined as these modules. Within

modules there is correlation between different elements, while between modules little correlation exists.

Fourth, the discrete nature of the modeled probabilistic process leads to discrete judgment distributions unless neurons categorize many cues, while more continuous distributions usually arise in reality, even when judges process little information. This, however, is an inconsistency diminished simply by introducing exogenous noise, as is common when judgments are modeled using regression. Alternatively, the assumption that only two competing voting neurons are activated by the particular cue can be eased to createtuning curves[21]–[25]. Now polling neurons adjacent to the one with greatest mean response might by chance demonstrate greatest activity, thereby decoupling information value from encoded evidence in random fashion. Whatever approach is taken, however, predictions will be unaffected unless noise removes all neural correlates of information.

Finally, the introduced model appears confined to non-social processes, since the judgments made by different individuals are assumed to be independent draws from the same probabilistic process. Nevertheless, this observation is only true in part, because no assumptions are made about cues being communicated socially or not. What is not captured, however, is the display of misleading cues, or the use of social influence to affect how others categorize cues. Moreover, any effect that limits access to cues, such as social networks, is not captured either.

Nonetheless, despite forces undoubtedly being unrepresented in my argument, the AQ model does an excellent job predicting and explaining the phenomenon it was constructed to help us understand. Moreover, it does so, after all, with higher levels of realism than common least square models of intuitive judgment [29]. Indeed, from the perspective of cue learning psychology, the introduced model is consistent with Brunswik’s [26][27][28] ideas

Table 7.Testing H6: Positive association between CE2and var(eT).

Housing Starts 0.424*** 0.000 Medium Correct 818

Unemployment 0.190*** 0.000 Small Correct 819

WSJ GDP 0.193** 0.001 Small Correct 287

Ination 0.174* 0.014 Small Correct 161

Evidence strongly suggests that greater judgment variance is associated with greater collective error squared. doi:10.1371/journal.pone.0112386.t007

Table 8.Testing H7: No association between var(eT) andDtD.

Source Survey Correlation Significance (2-tail) Effect Sign N

FRBP Nominal GDP 0.201*** 0.000 Small Inconsistent 809

Housing Starts 0.279*** 0.000 Small Inconsistent 818

Unemployment 0.264*** 0.000 Small Inconsistent 819

WSJ GDP 0.392*** 0.000 Medium Inconsistent 287

Ination 0.157* 0.046 Small Inconsistent 161

Unemployment 0.117 0.142 Consistent 159

Evidence suggests that judgment variance is greater when the true value is more extreme. These patterns are, unlike the patterns presented in Table 2 - Table 7, inconsistent with predictions of the AQ model under the applied assumptions. The discussion section includes an explanation of why these patterns may occur using AQ model logic.

(16)

on probabilistic functionalism, and predicts that Brunswik’s most basic subject of study, namely the adjustment of the organism to the environment, is observable from the shape of judgment distributions formed by many facing this problem of cognitive adaptation.

In short, the results I have presented link neuroscience to the social realm in ways consistent with cue learning psychology. As people struggle to adjust to their environment, neurons in the brain encode information mediated by the environment to generate phenomena observable not just from the individual, that is to say, from the intuitive judgments people make, but from the crowd, that is to say, from the distribution of judgments made by many different people. That discovery is significant because it moves scientists across these fields closer to understanding the success and failure of information processing entities at multiple levels of aggregation, from individual neurons to entire human organizations.

Future Research

Future research on systematic error in the wisdom of crowds will likely be done using massive data sets of judgments, using controlled laboratory settings, or ideally, using controlled settings generating massive data. In the immediate future it remains to be examined if cue numerosity really does increase both diversity and average individual error, and if skewed judgment distributions are formed less frequently by novices as predicted. These examina-tions hold the promise of improving our confidence in the specific probabilistic representation depicted by the AQ model. Such confidence would be particularly boosted, however, if predictions about skew, collective error, and extremeness, were also demon-strated to hold forthe crowd within [30][31][32]. At that point what promises to be the most rewarding continuation could then be initiated, namely cue learning experiments among crowds of participants, where algorithms extract relationships between collective error and judgment distribution skew in real-time, for the purpose of correcting mistakes before they occur in said laboratories, and beyond.

Meanwhile, however, the robustness of candidate procedures can already be tested through bootstrapping on existing data sets of substantial size. Indeed, the most obvious candidate process involves merely three parameters: 1) the number of judgment distributions used to linearly estimate the association between collective error and skew, 2) the number of judges used to form these distributions, and 3) the number of judges used to generate the cue for collective error on the next task. Within the constraint of each particular setting, random batches of judges and tasks would be sampled to discover the average error of the skew-adjusted mean, and the variance of this error, so comparisons can be made with errors generated by the wisdom of the crowds, as measured by the arithmetic mean, or the median of judgments.

Conclusion

Every day is filled with decisions based on intuitive judgments. When doctors diagnose patients, their intuitive judgments affect the choice of treatment and the patient’s subsequent chance of recovery. When managers assess company performance, their intuitive judgments affect jobs and the economic livelihood of workers. And when individuals vote for politicians, their decision is based on intuitive judgments about how well the politician will perform in office, with the prosperity of entire nations sometimes at stake. We have long known that harnessing the wisdom of crowds can help us reduce the extent of our individual errors, yet what the present paper indicates is that our trust in popular

judgment should have limits, because there is systematic error in our collective intelligence, emerging from the way our neurons categorize information contained in the cues we use. But as with all previous discoveries of judgment biases, now we know, we are positioned to do something about it.

Supporting Information

Application S1 Zipped Java application that readers can use to become familiar with the AQ Model, and discover what predictions it makes about judgment distributions and collective error. The application is accompanied by further instructions.

(ZIP)

Code S1 Zipped MATLAB code for simulating many judgment competitions of the kind observed by Sir Francis Galton in 1906.Judgments made by each competitor derive from the modeled behaviour of their neurons. Readers may toggle between Gaussian or Poisson distributed neuronal firing, and may also examine the effect of firing rate variance on the character of the judgment distribution that judges collectively form.

(ZIP)

Data S1 Judgment data in Excel about US consumer price inflation, gathered from The Wall Street Journal’s Economic Forecasting Survey. Judgments concern annual percentage growth. Surveys are conducted monthly by the journal. (XLS)

Data S2 Judgment data in Excel about US GDP, gathered from The Wall Street Journal’s Economic Forecasting Survey. Judgments concern annual percentage growth. Surveys are conducted monthly by the journal.

(XLS)

Data S3 Judgment data in Excel about US Unemploy-ment, gathered from The Wall Street Journal’s Econom-ic Forecasting Survey.Judgments concern percentage of the workforce. Surveys are conducted monthly by the journal. (XLS)

Data S4 Judgment data in Excel about US housing starts, gathered from The Federal Reserve Bank of Philadelphia’s Survey of Professional Forecasters. Judg-ments concern annual percentage growth in seasonally adjusted values. Surveys are conducted quarterly by the bank.

(XLS)

Data S5 Judgment data in Excel about US nominal GDP, gathered from The Federal Reserve Bank of Philadelphia’s Survey of Professional Forecasters. Judg-ments concern annual percentage growth in seasonally adjusted values. Surveys are conducted quarterly by the bank.

(XLS)

Data S6 Judgment data in Excel about US unemploy-ment, gathered from The Federal Reserve Bank of Philadelphia’s Survey of Professional Forecasters. Judg-ments concern seasonally adjusted percentages of the workforce. Surveys are conducted quarterly by the bank.

(XLS)

Acknowledgments

(17)

their valuable comments and suggestions. Finally, I must thank Dale Van Vleck for taking time to correspond with me about the probable weight of English oxen at the start of the 20th century.

Author Contributions

Conceived and designed the experiments: UWN. Performed the experiments: UWN. Analyzed the data: UWN. Contributed reagents/ materials/analysis tools: UWN. Wrote the paper: UWN.

References

1. Galton F (1907) Vox Populi. Nature, 75 450–451 (March 7, 1907). 2. Heady EO, Kaldor DR (1959) Expectations and errors in forecasting

agricultural prices. Journal of Political Economy, 62(1):34–47.

3. Carlson JA (1975) Are price expectations normally distributed? Journal of the American Statistical Association, 352(70):749–754.

4. Muth JF (1961) Rational expectations and the theory of price movements. Econometrica, 29(3):315–335.

5. Lucas RE (1978) Asset prices in an exchange economy. Econometrica, 46(6):1429–1445.

6. Page SE (2008) The difference: How the power of diversity creates better groups, firms, schools, and societies (New Edition). Princeton University Press. 7. Gold JI, Shadlen MN (2001) Neural computations that underlie decisions about

sensory stimuli. Trends in Cognitive Sciences, 5(1):10–16.

8. Gold JI, Shadlen MN (2007) The neural basis of decision making. Annual Review of Neuroscience, 30(1):535–574.

9. Shadlen MN, Newsome WT (1998) The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18(10):3870–3896.

10. Thurstone LL (1927) A Law of comparative judgment. Psychological Review, 34:273–286.

11. Stigler SM (1989) Francis Galton’s account of the invention of correlation. Statistical Science, 4(2):73–79.

12. Surowiecki J (2004) The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations, Doubleday.

13. Stark T (2010) SPF panelists’ forecasting methods: A note on the aggregate results of a November 2009 special survey. Federal Reserve Bank of Philadelphia.

14. Cohen J (1992) Quantitative methods in psychology: A power primer. Psychological Bulletin, 112(1):155–159.

15. Perry-Coste FH (1907) The ballot-box. Nature Letters to Editor, 509 (March 28, 1907).

16. McMurry B (2009) Cow size is growing. Beef Magazine, (February 1).

17. Wallace HA (1923) What is on the corn judge’s mind. Journal of the American Society of Agronomy, pages 300–304.

18. Hammond KR (1955) Probabilistic functioning and the clinical method. Psychological Review, 62:255–262.

19. Epley N, Gilovich T (2001) Putting adjustment back in the anchoring and adjustment heuristic: Differential processing of self-generated and experimenter-provided anchors. Psychological Science, 12:391–396.

20. Epley N, Gilovich T (2006) The anchoring-and-adjustment heuristic: Why the adjustments are insufficient. Psychological Science, 17:311–318.

21. Pouget A, Deneve A, Ducom JC, Latham PE (1999) Narrow versus wide tuning curves: What’s best for a population code? Neural Computation, 11:85–90. 22. Nieder A, Freedman DJ, Miller EK (2002) Representation of the quantity of

visual items in the primate prefrontal cortex. Science, 297(5587):1708–1711. 23. Butts DA, Goldman MS (2006) Tuning curves, neuronal variability, and sensory

coding. PLoS Biology, 4:639–646.

24. Roitman JD, Brannon EM, Platt ML (2007) Monotonic coding of numerosity in Macaque lateral intraparietal area. PLoS Biology, 5(8):11.

25. Nikitin AP, Stocks NG, Morse RP, McDonnell MD (2009) Neural population coding is optimized by discrete tuning curves. Physical Review Letters, 103. 26. Brunswik E (1943) Organismic achievement and environmental probability.

Psychological Review, 50(3):255–272.

27. Brunswik E (1952) The conceptual framework of psychology. University of Chicago Press.

28. Brunswik E (1957) Scope and aspects of the cognitive problem. Contemporary Approaches to Cognition, pages 5–31.

29. Yang T, Shadlen MN (2007) Probabilistic reasoning by neurons. Nature, 447(7148):1075–1080.

30. Vul E, Pashler H (2008) Measuring the crowd within: Probabilistic represen-tations within individuals: Short report. Psychological Science, 19:645–647. 31. Herzog SM, Hertwig R (2009) The wisdom of many in one mind. Psychological

Science, 20:231–237.