4.4 Point process models and geostatistics
4.4.2 Preferential sampling
A typical geostatistical data-set consists of a finite number of locations xi and associated measurements Yi. If, in this setting, we acknowledge that both the measurements and the locations are stochastic in nature, then a model for the data is a joint distribution for measurements and locations, which we represent formally as [X, Y].
As discussed briefly in Section 1.2.3, we usually assume that sampling is non-preferential i.e., sampling and measurement processes are independent and the joint distribution of X and Y factorises as [X, Y] = [X][Y]. It follows that a conventional geostatistical analysis, by which we mean an analysis which conditions on X, is correctly targeted at the unconditional distribution of Y, and hence at the unconditional distribution of the underlying signal.
If, in contrast, sampling is preferential, then one of two possible factorisations of the joint distribution of X and Y is as [X, Y] = [X][Y|X]. Hence, the implicit inferential target of a conventional geostatistical analysis, which analyses only the data Y, is the conditional distribution [Y|X], whereas the intended target is usually the unconditional distribution [Y], and there is no reason in general to suppose that the two are equal.
It does not follow from the above argument that inferences which ignore preferential sampling will necessarily be badly misleading, but it does follow that we should be wary of accepting them uncritically. Provided that the model for S(x) is known, standard kriging may still give reasonable results. Suppose, for example, that the stationary Gaussian model holds and that sampling favours locations x for which S(x), and hence Yi = S(xi) + Zi, is atypically large.
The kriging predictor will then down-weight the individual influence of the large values of Yi which would tend to occur in spatial concentrations within the over-sampled regions, and up-weight the influence of small, but spatially isolated, values of Yi.
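The down-weighting of clustered observations can be seen in a small numerical sketch. The configuration below is purely illustrative and not taken from the text: three observations at the same distance from a prediction target, two of them close together and one isolated, with simple kriging weights computed under an assumed unit-variance Matérn correlation (κ = 1.5, φ = 0.2) and no measurement error. The names `matern15`, `pts` and `theta` are hypothetical.

```python
import numpy as np

def matern15(d, phi):
    """Matern correlation for kappa = 1.5: (1 + d/phi) * exp(-d/phi)."""
    u = np.asarray(d) / phi
    return (1.0 + u) * np.exp(-u)

phi = 0.2      # correlation range parameter (illustrative value)
r = 0.3        # every observation lies at this distance from the target
theta = 0.05   # half-angle (radians) separating the two clustered points

target = np.array([0.0, 0.0])
pts = np.array([
    [-r, 0.0],                                  # isolated observation
    [r * np.cos(theta),  r * np.sin(theta)],    # clustered observation 1
    [r * np.cos(theta), -r * np.sin(theta)],    # clustered observation 2
])

# Distances among observations, and from each observation to the target.
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
d0 = np.linalg.norm(pts - target, axis=1)

# Simple kriging weights w solve C w = c0 (assumed: unit variance, no nugget).
w = np.linalg.solve(matern15(D, phi), matern15(d0, phi))
print("isolated:", w[0], "clustered:", w[1], w[2])
```

Although all three observations are equally far from the target, the isolated one receives a larger weight than either clustered one, because the two clustered observations carry largely redundant information.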
When, as is invariably the case in practice, model parameters are unknown, the consequences of ignoring preferential sampling are potentially more serious because standard methods of estimation will tend to produce biased estimates, which in turn will adversely affect the accuracy of predictive inferences concerning the signal. Again assuming that relatively large values are over-sampled, this would result in a positively biased estimate of the mean, and hence a tendency for predictions to be too large on average.
A model-based response to the preferential sampling problem is to formulate a suitable joint model for the response data Y and the locations X. The most natural way to do this is through their mutual dependence on the underlying signal process, S = {S(x) : x ∈ ℝ²}. For example, we might first assume that, conditional on S, the measured values Yi at locations xi are mutually independent, Yi ∼ N(S(xi), τ²), as in the standard Gaussian linear model. A simple, if somewhat idealised, model for the preferential sampling mechanism might then be that, conditional on S, the sampled locations X = (x1, . . . , xn) are generated by a Poisson process with intensity λ(x) = exp{α + βS(x)}.
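To make the role of α and β concrete, the contribution of the locations to the likelihood can be written out using the standard form of the Poisson process log-likelihood. The bounded sampling region A is an assumption introduced here for illustration; the text does not specify one.

```latex
% Log-likelihood of the locations, conditional on S, for an inhomogeneous
% Poisson process on an assumed bounded region A with
% intensity \lambda(x) = \exp\{\alpha + \beta S(x)\}:
\log [X \mid S] = \sum_{i=1}^{n} \{\alpha + \beta S(x_i)\}
                  - \int_A \exp\{\alpha + \beta S(x)\}\, dx + \text{const}.

% Conditioning additionally on the number of points n, \alpha cancels:
[X \mid S, n] = \prod_{i=1}^{n}
                \frac{\exp\{\beta S(x_i)\}}{\int_A \exp\{\beta S(x)\}\, dx}.
```

Under this sketch, α governs only the expected number of sampled locations, whereas β alone controls how strongly the locations concentrate where S(x) is large or small.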
Positive or negative β would correspond to over-sampling of large or small values of S(x), respectively. To complete the model specification, the simplest assumption would be that S is a stationary Gaussian process. To emphasise that the locations at which we observe Y are determined by the point process X, we partition S as S = {S(X), S(X̄)} where X̄ denotes all locations which are not points of X. Then, the joint distribution of S, X and Y can be factorised as
[S, Y, X] = [S][X|S][Y|S(X)].    (4.9)

In most geostatistical problems, the target for inference is [S]. The predictive distribution of S is [S|Y, X] = [S, Y, X]/[Y, X], where [Y, X] follows in principle from (4.9) by integration,

[Y, X] = ∫ [S, Y, X] dS,
although the integral may be difficult to evaluate in practice. Note that the conditional distribution [Y|S(X)] in (4.9) is not of the standard form whereby the Yi are mutually independent, Yi ∼ N(S(xi), τ²), because of the inter-dependence between S and X. We contrast (4.9) with the superficially similar model
[S, Y, X] = [S][X|S][Y|S]    (4.10)

where now [Y|S] is a set of independent univariate Gaussian distributions. The model (4.10) would be appropriate if we observed a point process X and a set of measured values Yi at pre-specified locations xi, rather than at the points of X. This second situation is not without interest in its own right. It would arise, for example, if X represented a set of events whose spatial distribution is of scientific interest and were thought to depend on a spatially varying covariate S(x) which is not directly observable everywhere but can be measured, possibly with error, at a set of pre-specified sample locations xi : i = 1, . . . , n. A specific example is considered by Rathbun (1996) in a study of the association between a point process of tree locations in a forest and an incomplete set of measured elevations.
Simulation results in Menezes (2005) confirm that when geostatistical data are generated from the model (4.9), standard geostatistical inferences which ignore the preferential sampling mechanism can be very misleading. Here, we give a single example to illustrate.
We simulated the signal process on a discrete grid of 100 by 100 points in a unit square, using a stationary Gaussian process with zero mean, unit variance, and Matérn correlation function with parameters κ = 1.5 and φ = 0.2. Holding the signal process fixed, we then took three samples of values, denoted by Y1, Y2 and Y3, which we refer to as random, preferential and clustered, respectively.
[Figure 4.3: three panels, each with horizontal and vertical axes running from 0.0 to 1.0; vertical axis labelled "Y Coord".]

Figure 4.3. Sample locations and underlying realisations of the signal process for the example to illustrate the effects of preferential sampling. The left-hand panel shows the random sample, the centre panel the preferential sample and the right-hand panel the clustered sample. In each case, the grey-scale image represents the realisation of the signal process, S(x), which was used to generate the associated measurement data.

Table 4.1. Sample statistics and parameter estimates for the three samples in the example to illustrate the effects of preferential sampling.

                Sampling statistics     Model parameter estimates
Sample          Mean      Variance      µ̂       σ̂²      φ̂
Random          −0.13     0.42          0.2     0.86    0.21
Preferential    0.38      0.35          0.28    0.97    0.23
Clustered       −0.13     0.51          0.17    0.98    0.22

Each sample consists of the values of the signal at a set of 100 sampling locations from the 100 by 100 grid, as follows. For Y1, the sampling locations are an independent random sample of size 100, i.e., each of the 10,000 points in the grid is equally likely to be selected. For Y2, each grid-point xi has probability of selection proportional to exp{S(xi)}, where S(xi) is the value of the signal at xi. Finally, for Y3, each point xi has probability of selection proportional to exp{S∗(xi)}, where S∗(xi) is the simulated value at xi of a second, independent realisation of the signal process. The samples Y2 and Y3 are spatially clustered to the same extent but Y3, unlike Y2, satisfies the standard geostatistical assumption that X and Y are independent.
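The three sampling schemes can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the example in the text: it assumes a coarser 30 by 30 grid to keep the covariance matrix small, simulates the signal by Cholesky factorisation with a small diagonal jitter, and draws locations without replacement. The function names `matern15`, `simulate_signal` and `draw` are hypothetical.

```python
import numpy as np

def matern15(d, phi):
    """Matern correlation for kappa = 1.5: (1 + d/phi) * exp(-d/phi)."""
    u = d / phi
    return (1.0 + u) * np.exp(-u)

def simulate_signal(coords, phi, rng, jitter=1e-6):
    """Zero-mean, unit-variance Gaussian process values at coords."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cov = matern15(d, phi) + jitter * np.eye(len(coords))
    return np.linalg.cholesky(cov) @ rng.standard_normal(len(coords))

def draw(weights, n, rng):
    """Draw n distinct grid indices with selection weights as given.

    Assumption: sampling is without replacement, so the selection
    probabilities are only approximately proportional to the weights.
    """
    p = weights / weights.sum()
    return rng.choice(len(p), size=n, replace=False, p=p)

rng = np.random.default_rng(0)
m = 30                                   # assumed grid size (text uses 100)
g = np.linspace(0.0, 1.0, m)
coords = np.array([(x, y) for x in g for y in g])

S = simulate_signal(coords, 0.2, rng)        # signal used for the data
S_star = simulate_signal(coords, 0.2, rng)   # independent second realisation

n = 100
idx_random = draw(np.ones(len(S)), n, rng)   # Y1: equal probabilities
idx_pref = draw(np.exp(S), n, rng)           # Y2: proportional to exp{S}
idx_clus = draw(np.exp(S_star), n, rng)      # Y3: proportional to exp{S*}
Y1, Y2, Y3 = S[idx_random], S[idx_pref], S[idx_clus]

print("sample means:", Y1.mean(), Y2.mean(), Y3.mean())
```

Because the weights exp{S(xi)} increase with the signal, the weighted average of S over the grid necessarily exceeds its unweighted average, which is the source of the upward shift in the preferential sample mean.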
Figure 4.3 shows the three samples of locations xi together with the underlying realisation of the signal process. Note in particular that in the left-hand and right-hand panels, the pattern of the sample locations is unrelated to the spatial variation of the signal process.
For each of the three samples we obtained maximum likelihood estimates of the model parameters µ, σ² and φ, treating κ as known and, in the case of Y2, ignoring the preferential nature of the sampling. Table 4.1 shows the maximum likelihood estimates together with the sample means and variances. The preferential sampling has a pronounced effect on the sample mean, as would be expected. In all three cases, the sample variance grossly under-estimates the variance of the signal process. The maximum likelihood estimates give reasonable results for all three model parameters except that, in the case of preferential sampling, there is still some indication of the biasing effect on the estimation of the mean.
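One way to compute such estimates, sketched here under stated assumptions rather than as the procedure actually used for Table 4.1, is to profile the Gaussian log-likelihood over φ: for each candidate φ the maximising µ and σ² are available in closed form. The sketch assumes noise-free observations of the signal, κ = 1.5 known, a grid search over φ, and freshly simulated data at uniformly random locations; `profile_loglik` and the other names are hypothetical.

```python
import numpy as np

def matern15(d, phi):
    """Matern correlation for kappa = 1.5: (1 + d/phi) * exp(-d/phi)."""
    u = d / phi
    return (1.0 + u) * np.exp(-u)

def profile_loglik(phi, y, D, jitter=1e-8):
    """Profile log-likelihood in phi (up to a constant), with mu, sigma2."""
    n = len(y)
    L = np.linalg.cholesky(matern15(D, phi) + jitter * np.eye(n))
    a = np.linalg.solve(L, np.ones(n))   # L^{-1} 1
    b = np.linalg.solve(L, y)            # L^{-1} y
    mu = (a @ b) / (a @ a)               # generalised least squares mean
    resid = b - mu * a
    sigma2 = (resid @ resid) / n         # ML estimate of the variance
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (n * np.log(sigma2) + logdet + n), mu, sigma2

# Illustrative data: 100 uniformly random locations, true phi = 0.2.
rng = np.random.default_rng(1)
n = 100
X = rng.uniform(size=(n, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
L_true = np.linalg.cholesky(matern15(D, 0.2) + 1e-8 * np.eye(n))
y = L_true @ rng.standard_normal(n)

# Grid search over phi; a numerical optimiser could be used instead.
phis = np.linspace(0.05, 0.6, 56)
results = [profile_loglik(p, y, D) for p in phis]
best = int(np.argmax([r[0] for r in results]))
phi_hat = phis[best]
_, mu_hat, sigma2_hat = results[best]
print("mu:", mu_hat, "sigma2:", sigma2_hat, "phi:", phi_hat)
```

Applying the same function to a preferentially sampled data-set would ignore the sampling mechanism in exactly the way described above, since the likelihood conditions on the locations.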
Table 4.2. Mean square prediction errors for the three samples in the example to illustrate the effects of preferential sampling, using true and estimated parameter values.

Parameters    Random    Preferential    Clustered
True          0.0138    0.0325          0.0192
Estimated     0.0138    0.0326          0.0191
[Figure 4.4: scatterplot with axes labelled "true" and "predicted".]
Figure 4.4. Predicted versus true values of the signal at 10,000 grid locations, using preferentially sampled data in conjunction with true values for all model parameters.
We then used each of the three samples to predict the signal at the original 10,000 grid locations, using both true and estimated parameter values. Table 4.2 gives the resulting average squared prediction errors. The larger values for the clustered than for the random sample illustrate that the former is a less efficient design for spatial prediction, whilst the preferential sample gives larger values still. Note also that using true parameters does not necessarily give a smaller average squared prediction error than using estimated values, because the estimated values reflect the characteristics of the particular realisation of the signal process. Finally, Figure 4.4 shows, for the preferential sample, a scatterplot of the 10,000 individual predictions against the true values of the signal, using true values of the model parameters for the predicted values. The preferential sample does a very good job of predicting the larger values of the signal, but is less reliable for smaller values, as a consequence of the under-sampling of sub-regions where the signal takes relatively small values.
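The prediction comparison can be sketched as follows, using simple kriging with the true parameter values (mean 0, variance 1, φ = 0.2, κ = 1.5). As before, this is an illustration under assumptions rather than the computation behind Table 4.2: a 30 by 30 grid replaces the 100 by 100 grid, only the random and preferential designs are compared, and the names `pairwise`, `krige` and `mspe` are hypothetical. The numerical values will not match those in the table.

```python
import numpy as np

def matern15(d, phi):
    """Matern correlation for kappa = 1.5: (1 + d/phi) * exp(-d/phi)."""
    u = d / phi
    return (1.0 + u) * np.exp(-u)

def pairwise(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

rng = np.random.default_rng(2)
phi, jitter = 0.2, 1e-6
g = np.linspace(0.0, 1.0, 30)            # assumed grid size (text uses 100)
grid = np.array([(x, y) for x in g for y in g])
N = len(grid)

# One realisation of the signal, held fixed for both designs.
cov = matern15(pairwise(grid, grid), phi) + jitter * np.eye(N)
S = np.linalg.cholesky(cov) @ rng.standard_normal(N)

def krige(idx):
    """Simple kriging predictions at all grid points from S[idx]."""
    C = matern15(pairwise(grid[idx], grid[idx]), phi) + jitter * np.eye(len(idx))
    C0 = matern15(pairwise(grid, grid[idx]), phi)
    return C0 @ np.linalg.solve(C, S[idx])   # true mean is zero

def mspe(idx):
    """Average squared prediction error over the whole grid."""
    return float(np.mean((krige(idx) - S) ** 2))

idx_random = rng.choice(N, size=100, replace=False)
p = np.exp(S) / np.exp(S).sum()
idx_pref = rng.choice(N, size=100, replace=False, p=p)

mspe_random, mspe_pref = mspe(idx_random), mspe(idx_pref)
print("MSPE random:", mspe_random, "MSPE preferential:", mspe_pref)
```

Plotting `krige(idx_pref)` against `S` would give a scatterplot of the same kind as Figure 4.4.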
Models for preferential sampling can also be considered as models for marked point processes. A marked point process is a point process, each of whose points has an associated random variable called the mark of the point in question.
Marks may be qualitative or quantitative. In this context, it is not necessary for the mark to exist at every point in space, only at each point of the process. For example, the points could be the locations of individual trees in a forest and the marks might denote the species (qualitative) or height (quantitative) of each tree. However, the marks could also be the values, at each point, of an underlying spatially continuous random field. In this case, the model in which the mark process is independent of the point process is called the random field model.
The random field model for a marked point process is therefore the counterpart of non-preferential sampling for a geostatistical model. Schlather, Ribeiro Jr and Diggle (2004) consider methods for investigating the goodness-of-fit of the random field model to marked point process data.