Model-based Inference for Rare and Clustered Populations from Adaptive Cluster Sampling using Auxiliary Variables

Izabel Nolau de Souza

Advisors: Kelly Cristina Mota Gonçalves and João Batista de Morais Pereira

Universidade Federal do Rio de Janeiro

Instituto de Matemática

Departamento de Métodos Estatísticos

2020


Master's dissertation presented to the Graduate Program in Statistics of the Instituto de Matemática of the Universidade Federal do Rio de Janeiro (UFRJ), in partial fulfillment of the requirements for the degree of Master in Statistics. Approved by:

Kelly Cristina Mota Gonçalves Dr.Sc. - IM/UFRJ - Advisor.

João Batista de Morais Pereira Dr.Sc. - IM/UFRJ - Co-advisor.

Carlos Antonio Abanto-Valle Dr.Sc. - IM/UFRJ.

Fernando Antonio da Silva Moura Dr.Sc. - IM/UFRJ.

Pedro Luis do Nascimento Silva Dr.Sc. - ENCE.

Rio de Janeiro, RJ - Brazil, April 30, 2020


Acknowledgments

The completion of this project is due not only to me, but also to everyone who was directly or indirectly involved. The sharing of countless doubts, uncertainties, achievements and lessons learned was enormous and constant.

I first thank God for giving me health and strength, not only during my master's studies but at all times, allowing all of this to happen.

I thank my parents, Izilda and Fernando José, for all the love, support and encouragement they have always given me, celebrating my achievements as if they were their own. Thank you for the education you provided me and for believing in me so much! To all the family members and friends who supported me and followed my whole journey, thank you very much!

I thank my advisors, Kelly and João, for the encouragement, attention and support they gave me and for the corrections they made to this work in order to improve it. Thank you not only for your friendship, but also for the concern you have always shown for my future and for always believing in my potential.

I thank all the teachers I have had throughout my life, especially the professors who accompanied me during my undergraduate and master's studies, who were so patient with me and contributed so much to my education.

I thank professors Carlos Antonio Abanto-Valle (UFRJ), Fernando Antonio Moura (UFRJ) and Pedro Luis do Nascimento (ENCE) for agreeing to be part of my examination committee.

Finally, I thank the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for the financial support of my studies.


Abstract

Rare populations, such as endangered animals and plants, drug users and individuals with rare diseases, tend to cluster in regions. Adaptive cluster sampling is generally applied to obtain information from clustered and sparse populations since it increases survey effort in areas where the individuals of interest are observed. This work proposes a unit-level model which assumes that counts are related to auxiliary variables; this improves the sampling process by assigning different weights to the cells and allows the cells to be referenced spatially. The proposed model fits rare and clustered populations, arranged over a regular grid, in a Bayesian framework. The approach is compared to alternative methods using simulated data and a real experiment in which adaptive samples were drawn from an African Buffalo population in a 24,108 km² area of East Africa. Simulation studies show that the model is efficient under several settings, validating the methodology proposed in this dissertation for practical situations.


List of Figures

3.1 Illustration of adaptive cluster sampling procedure for a rare clustered population distributed in a region with M = 400 grid cells. Figure (a) presents the initial sample of m = 10 cells in dark grey. From this sample, in Figure (b), neighbors are added to the sample whenever there is at least one observation (black dot) in the selected cell, finally setting the sample presented in Figure (c).

3.2 Illustration of important concepts in adaptive cluster sampling: bold bordered squares correspond to the observed cluster, gray squares are the network units, and the hatched part are the border units. The unit initially selected is in darker gray.

4.1 Allocation method illustration of two out-of-sample networks of sizes 3 and 2, based on weights λ (gray background). The lighter the cells’ color, the higher the λ value (intensity) of that cell. The white cells with bold borders, whose weights are equal to zero, are the sampled networks’ cells and the hatched ones correspond to the nonempty networks’ border cells. The red borders surround cells that can be drawn in each stage of the procedure. The cells that compose the first and second allocated networks are blue-painted and green-painted, respectively. The example proceeds as follows: draw one of the red-surrounded cells of Panel (a). The sampled cell is blue-indicated in Panel (b) and only its neighbors can be drawn to keep building this network. In Panels (c) and (d) the allocation of the 3-sized network is finished, and we can draw any cell with the red border to start the allocation of the 2-sized network. Panels (e) and (f)


4.2 Proposed sampling procedure illustration of a population (points) distributed in a region with M = 400 cells and weights (gray background) used in this scheme. The lighter the cells’ color, the higher the weight of that cell. The white cells, whose weights are equal to zero, with bold borders are the sampled networks’ cells and hatched cells correspond to the nonempty networks’ border cells. Panels (a) and (b) present the same grayscale, since all cells have constant weight in the first stage; and Panels (c) and (d) show different shades of gray due to the second stage’s different weights.

4.3 Values of the generated covariate (gray background) and counts for each nonempty cell in a grid with M = 400 cells.

4.4 Plot of the generated covariate versus the population counts and covariate’s histogram, where the black points in the histogram represent the nonempty cells.

4.5 Boxplots with measurements of the point and 95% credibility interval estimates for T over 100 simulations obtained for the fits of MAN, MDN and MDC models.

4.6 Relative frequency of the number of networks sampled over 100 simulations obtained for the fits of MAN, MDN and MDC models.

for T over 500 simulations obtained for the fits of the disaggregated and aggregated models, considering different sample sizes and proportions.

4.8 Relative frequency of the number of networks sampled over 500 simulations obtained for the fits of the disaggregated and aggregated models, considering different sample sizes and proportions.

4.9 Mean (white line) and 2.5% and 97.5% quantiles (gray bars) of the relative bias estimates of T over 500 simulations obtained for the fits of the disaggregated and aggregated models, considering different sample sizes, proportions, numbers of networks sampled (number of simulations given below the gray bars) and coverage (above the gray bars).

4.10 Altitude in a logarithm scale (gray background) and counts of African Buffaloes over parts of Kenya and Tanzania in 2010 in a grid with M = 391 cells.

4.11 Plot of altitude in a logarithm scale versus counts of African Buffaloes, and covariate’s histogram, where the black points in the histogram represent the


4.12 Boxplots with measurements of the point and 95% credibility interval estimates for T over 500 simulations obtained for the fits of the disaggregated and aggregated models to real data.

for T over 100 simulations for each number of networks sampled, obtained for the fits of the disaggregated and aggregated models to real data.

for T over 100 simulations for each number of networks sampled, obtained for the fits of the disaggregated and aggregated models.

4.15 Maps of the posterior mean of African Buffalo counts η(c) (gray background) for all out-of-sample cells c of R, for each number of sampled networks, and its population (points) distributed in a region with 391 cells. The lighter the cells’ color, the higher the posterior mean of that cell. The blue cells are the sampled networks’ cells and the hatched cells correspond to the nonempty networks’ border cells.

of the proposed model and population parameters over 500 simulations for different values of α, β and T.

B.1 Trace plot with the posterior densities of α, β and T obtained from the fits of the disaggregated and the aggregated models to real data, for each number of


List of Tables

4.1 Summary measurements of the point and 95% credibility interval estimates of the population total T, obtained by fitting MAN, MDN and MDC models under 100 samples according to each model.

4.2 Values of the sample sizes m, m1 and m2, according to each fixed percentage.

4.3 Summary measurements of the point and 95% credibility interval estimates for T over 500 simulations obtained for the fits of the disaggregated and aggregated models, considering different sample sizes and proportions.

4.4 Percentage of networks sampled over 500 simulations under each model.

4.5 Summary measurements of the point and 95% credibility interval estimates for T over 100 simulations for different numbers of networks sampled, obtained for the fits of the disaggregated and aggregated models.

4.6 Summary measurements of the point and interval estimates of the population total, obtained by fitting the disaggregated and aggregated models and Raj’s estimator.

4.7 Summary measurements of the point and 95% credibility interval estimates of the proposed model and population parameters over 500 simulations for different values of α, β and T.

B.1 Geweke convergence diagnostic for some of the parameters estimated for the


Chapter 1 Introduction

In several statistical surveys, data collection is hampered because the object of study is hard to observe: the population may be rare, may exhibit a pattern of sparsely distributed groups in a region, or may be mobile over time. Examples of populations with these characteristics include endangered animals and plants, ethnic minorities, drug users, individuals with rare diseases, and recent immigrants. Assume that the population of interest is spatially distributed over a region on which a regular grid with $M$ equal-sized cells is superimposed. Denote the partitioned region by $R = \{c_1, \ldots, c_M\}$.

Let η(c) denote the number of individuals of the population within the grid cell c, for all c ∈ R, that is, this cell’s count. The objective is to estimate the total of a rare and clustered population, $T = \sum_{c \in R} \eta(c)$.

Under traditional sampling methods of grid cells, a subset of m < M cells is drawn and their respective counts η(c) are observed. Due to the population characteristics, small sample sizes result in large numbers of empty grid cells, for which η(c) = 0, leading to inaccurate estimates of the population quantity of interest. In this context, adaptive cluster sampling, introduced by Thompson (1990), is a way to overcome this difficulty by increasing survey effort around non-empty grid cells of the sample. Starting from an initial sample of m grid cells, when we find a non-empty grid cell, for which η(c) ≠ 0, we also sample its neighbors (cells sharing a common edge with the current one) and continue surveying until we obtain a set of contiguous non-empty grid cells surrounded by empty grid cells. In this way, empty grid cells bring no further survey effort. Therefore, to be effective, adaptive cluster sampling requires some prior knowledge about the structure of the underlying population, which may be obtained from a preliminary survey.

According to Thompson (1990), the set of contiguous non-empty grid cells is called a network; this set together with its neighboring empty grid cells is called a cluster; and empty cells are defined as networks of size one. Therefore, R is exhaustively partitioned into disjoint networks, and the final sample contains empty and non-empty networks. Thompson (1990) treated empty edge cells as unobserved and, starting from an initial random sample of grid cells drawn without replacement, assigned inclusion probabilities to the sampled networks; these probabilities are used to construct design-unbiased estimators of T and of their variances. Note that the networks are the basis of the analysis and, although the initial selection of cells is without replacement, the same network can be selected more than once, a problem that Thompson (1990) solved by allowing multiple inclusions of networks. Edge cells can be incorporated into the estimator by taking the conditional expectation given the minimal sufficient statistic, which yields a Rao-Blackwell improved version of the estimator. These estimators were described and computed for small sample sizes in Thompson (1990). Further, Salehi M. & Seber (1997) proposed a scheme whereby the networks are selected one by one without replacement, which avoids selecting the same network more than once.
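The partition of R into networks can be sketched in a few lines of Python. This is only an illustration of the definitions above, not code from the dissertation: the function name `label_networks` and the small grid of counts are hypothetical, and cells are treated as neighbors when they share an edge.

```python
from collections import deque

def label_networks(counts):
    """Partition a grid of counts into networks.

    Non-empty cells that share an edge belong to the same network;
    every empty cell is its own network of size one.
    Returns a grid of integer network labels with the same shape.
    """
    n_rows, n_cols = len(counts), len(counts[0])
    labels = [[None] * n_cols for _ in range(n_rows)]
    next_label = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if labels[r][c] is not None:
                continue
            labels[r][c] = next_label
            if counts[r][c] > 0:
                # Breadth-first search over edge-sharing non-empty neighbors.
                queue = deque([(r, c)])
                while queue:
                    i, j = queue.popleft()
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        a, b = i + di, j + dj
                        if (0 <= a < n_rows and 0 <= b < n_cols
                                and labels[a][b] is None and counts[a][b] > 0):
                            labels[a][b] = next_label
                            queue.append((a, b))
            next_label += 1
    return labels

# Hypothetical 4 x 4 grid of counts (illustrative values only).
counts = [
    [0, 0, 0, 0],
    [0, 3, 1, 0],
    [0, 0, 2, 0],
    [5, 0, 0, 0],
]
labels = label_networks(counts)
n_networks = len({lab for row in labels for lab in row})
```

In this toy grid the three contiguous non-empty cells form one network, the isolated non-empty cell forms another, and each of the twelve empty cells is a network of size one, so the networks are disjoint and cover the whole grid.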

Several studies have been conducted using adaptive sampling designs on real populations. For example, Smith et al. (1995) studied the methodology for rare species of waterfowl, Su & Quinn (2003) discussed adaptive cluster sampling with order statistics and a stopping rule for a fish population, Philippi (2005) showed that it is a viable alternative for the estimation of occurrences in local populations of low-abundance plants and Gattone et al. (2016) applied it to negatively correlated data.

Thompson & Seber (1996) examined some general ideas about model-based inference approaches for adaptive sampling. Among model-based approaches, Bayesian methods have shown promising results. In addition, Bayesian inference methods for adaptive cluster sampling designs have been developed by Rapley & Welsh (2008) and Gonçalves & Moura (2016), who incorporate the prior knowledge that the population is rare and clustered into both inference and sample design. Rapley & Welsh (2008) provided a model at the network level, while Gonçalves & Moura (2016) modeled at the cell level, considering heterogeneity among units belonging to different clusters. Neither work took into account the spatial locations of the networks; under their models this causes no loss of information about the population total, since the total does not depend on where the networks are located.

A possible approach to spatially model clustered data is by using point processes (Diggle, 1975; Baddeley & Turner, 2000; Brix & Diggle, 2001), where the clusters are considered as points and have no internal spatial structure, although there is a spatial relationship between them. Rapley & Welsh (2008) place the clusters and give them a spatial size by superimposing a grid on a region containing a clustered population and modeling it within this grid structure. In this case, it is assumed that the intensity of the counts in each cluster is proportional to its size. However, this assumption is not always valid. In some situations, cells that belong to the same cluster can have different intensities, e.g. the border cells can present a smaller incidence rate than the central ones. Moreover, a cluster can have a higher incidence of the phenomenon not because of its size, but due to other factors that influence its disposition, such as a spatially referenced covariate.

This work presents a disaggregated model, at the cell level, which assumes that the intensity in each cell of a cluster is related to an available covariate value. The proposed model fits rare and clustered populations, arranged over a regular grid, in a Bayesian framework. The key idea of this dissertation is to improve the population estimates by using grid cells as units of analysis and by incorporating additional information into the model. Based on this extra information, we also propose an improved sampling process, in which cells are drawn with different probabilities, and we can spatially reference the estimates of the cell counts. Introducing additional information is an intuitive idea, provided that prior knowledge indicates a relationship between the occurrence of the phenomenon and some covariate.

This dissertation is organized as follows. Chapter 2 introduces the notation of finite population sampling, which will be used throughout the text, as well as the design-based and superpopulation approaches. Chapter 3 presents the methodology of adaptive sampling plans, some extensions, and a model-based approach proposed by Rapley & Welsh (2008), which motivated the ideas of this work. Chapter 4 introduces the proposed model, proposes a new sampling procedure and discusses aspects of inference. Moreover, simulation studies are presented to assess the effectiveness of the proposed model and of the one proposed by Rapley & Welsh (2008), considering the estimation of model parameters for populations with different degrees of rarity and clustering. A real population is also presented and used to evaluate the performance of the proposed model. Finally, Chapter 5 concludes with a brief discussion of the advantages of our methodology and suggestions for further research.


Chapter 2 Inference for finite populations

In this chapter, important notation and definitions in the theory of finite population sampling, that will be used throughout this work, are presented. In this context, there are two possible approaches: (i) fixed population approach, where each population unit is associated with a fixed but unknown real number, that is the value of the variable under study; and (ii) superpopulation approach, where each population unit is associated with a random variable for which a stochastic structure is specified, and the actual value associated with the population unit is treated as the outcome of this random variable. In Section 2.1, the first approach is presented, and in Section 2.2, the second one, for which the sample design could be considered relevant to perform Bayesian inference about the model parameters.

2.1 Fixed population approach

According to Cassel et al. (1977), a finite population, for which we are interested in a characteristic $n$, is a collection of $M$ units, denoted by the index set $P = \{1, \ldots, M\}$, where $M < \infty$ is assumed known. It is important to remark that each unit $i$, $i = 1, \ldots, M$, is associated with the value $n_i$. Thus, when data are observed, we should record not only these values, but also the respective unit that produced each measurement. The complete observation is denoted by the pair $(i, n_i)$ and, therefore, there are $M$ pairs for the whole population.

In statistical inference, a parameter is frequently treated as an unknown quantity indexing a probability distribution. Define $n = (n_1, \ldots, n_M)'$ as the parameter of the finite population, belonging to a parameter space defined in $\mathbb{R}^M$. Any real function of


some disease in $M$ neighborhoods, or the number of animals of a particular species in $M$ locations. Inference in finite populations is usually made about a specific parametric function, such as the population total $T = \sum_{i=1}^{M} n_i$, the population mean $\mu = T/M$ or the population variance $\sigma^2 = \sum_{i=1}^{M} (n_i - \mu)^2 / M$. In particular, in this work, we aim to estimate the population total.
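As a small worked example of these three parametric functions, the following Python sketch evaluates them for a made-up population vector. The values in `n` are hypothetical and serve only to show the arithmetic, including the divisor M in the variance.

```python
# Hypothetical values n_1, ..., n_M for M = 6 units (illustrative only).
n = [0, 0, 4, 0, 2, 0]
M = len(n)

T = sum(n)                                     # population total
mu = T / M                                     # population mean
sigma2 = sum((ni - mu) ** 2 for ni in n) / M   # population variance (divisor M)
```

Here T = 6, µ = 1 and σ² = 14/6.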

We make statistical inferences about these parametric functions based on information obtained from a sample of the population $P$ (for more details, see Cassel et al. (1977)). A sequence $s = \{i_1, \ldots, i_m\}$ such that $i_j \in P$, for $j = 1, \ldots, m$, is called an ordered sample, or simply a sample, of size $m$. The label $i_j$ is called the $j$-th component of $s$. Finite population sampling based on randomization of the sample differs from other parts of statistics in that it treats the population as fixed. In this approach, the probabilistic mechanism of sample selection is a predetermined randomization procedure called the sample design. It is represented by a probability function, known as the sampling plan, on the set $S$ of all possible samples $s$, where $[s]$ denotes the probability of selecting the sample $s$. A sample design $[\cdot]$ is called non-informative if and only if $[\cdot]$ is a function that does not depend on the values of $n$ associated with $s$. Otherwise, it is called an informative sampling plan and is denoted by $[s \mid n]$.
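A non-informative sampling plan can be illustrated with simple random sampling without replacement. The sketch below is an assumption-laden toy: the function name `srs_design` is hypothetical, and samples are represented as unordered sets of distinct labels rather than the ordered sequences defined above. The point is that the plan is built from M and m alone and never looks at the values of n.

```python
from itertools import combinations
from math import comb

def srs_design(M, m):
    """Sampling plan of simple random sampling without replacement:
    every unordered sample of m distinct labels gets probability 1 / C(M, m).
    The values n_i are not an argument, so the plan is non-informative."""
    p = 1 / comb(M, m)
    return {s: p for s in combinations(range(1, M + 1), m)}

design = srs_design(M=5, m=2)
total_probability = sum(design.values())
```

An informative plan, by contrast, would need the population values as an input, for example by making the probability of a sample depend on the counts of its cells.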

Once $s$ is selected, the observed result can be specified as the set of pairs $d = \{(i, n_i) : i \in s\}$. In some cases, the interest is only in the values of $n$ and not in the full pair, so define $n_s = \{n_i : i \in s\}$. Let $\bar{s} = P - s$, so that $n_{\bar{s}} = \{n_i : i \in \bar{s}\}$ are the values of $n$ associated with the units that do not belong to the sample.

In the following section, we present the approach based on superpopulation models, where the sample remains fixed, and the value related with the population unit is treated as the outcome of its associated random variable, and the inferences refer to a hypothetical superpopulation, in which a probability law governs the variables of interest.

2.2 Superpopulation models

Another inferential approach for finite populations is based on superpopulation models. The process of statistical inference from a sample comprises a set of principles and procedures that may include, for example, knowledge of some random process that generated the true unknown value of the characteristic of interest for each unit of the population. This process is represented by a model that is used as a basis for making inferences.


to the population units are treated as fixed constants, under the superpopulation approach, the value of the population vector $n = (n_1, \ldots, n_M)'$ is considered a realization of the random vector $N = (N_1, \ldots, N_M)'$, for which there is a joint distribution of all values of the population.

According to the model, suppose that, given a parametric vector $\theta \in \Theta$, $N$ follows a probability distribution denoted by $[N \mid \theta]$. Let $N = (N_1, \ldots, N_M)'$ be the population vector generated according to the distribution $[N \mid \theta]$. Let $H$ be the vector containing additional variables associated with the structure of the population and suppose that the joint distribution of $H$, which depends on a vector parameter $\psi$, is given by $[H \mid \psi]$.

2.2.1 Informative sampling design

In a wide range of sampling designs, the sample selection mechanism may depend on the values of the variables of interest in the population. This situation characterizes an informative sampling plan. A typical example is a case-control study, where the sample is selected such that there are cases (units with a given condition of interest) and controls (units without this condition), and one is interested in modelling the indicator of presence or absence of the condition as a function of predictor variables. This indicator is one of the research variables and is considered in the sample selection mechanism.

Under the approach of superpopulation models, it is important to analyze whether the selection probabilities of the population elements are related to the response variables. In this case, it is relevant for inference to take into consideration the sampling plan, either in the model design or in the construction of the likelihood function.

Let v be the set of variables that are fully observed for all units of the population P. Even when we are primarily interested in some aspect of the distribution of N , v can provide information through a regression setting. In other words, if v is fully observed, then a model for N | v can lead to more precise inference about new values of N than would be obtained by modeling N alone.

According to Gelman et al. (1995), when considering data collection, it is useful to split the joint probability model into two parts: (i) the model for the underlying complete data, N — including observed and unobserved components; and (ii) the model for the sample s. The complete-data likelihood of sample s, vector N, and variables H, given the parameters in the model and covariates v, is given by:

$$[s, N, H \mid v, \theta, \psi] = [s \mid N, H]\,[N \mid v, H, \theta]\,[H \mid \psi], \qquad (2.1)$$

which depends on the complete data $N$. In fact, the information obtained from a sample is $(s, N_s, H_s)$. Therefore, the likelihood of the observed data, assuming continuity of the

Under the Bayesian approach, we are interested in obtaining the posterior distribution of the parametric vector. In this case, the joint posterior distribution of the model parameters $\theta$ and $\psi$, given the observed information, $(s, N_s, H_s, v)$, is:

If we decide to ignore the sampling design, we can compute the joint posterior distribution of the model parameters $\theta$ and $\psi$ by conditioning only on $N_s$, $H_s$ and $v$ but not $s$, as follows:

$$[\theta, \psi \mid N_s, H_s, v] \propto [\theta, \psi]\,[N_s, H_s \mid v, \theta, \psi] = [\theta, \psi] \iint [N \mid v, H, \theta]\,[H \mid \psi]\, dN_{\bar{s}}\, dH_{\bar{s}}. \qquad (2.4)$$
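For comparison with (2.4), integrating the complete-data likelihood (2.1) over the unobserved components gives the forms below for the observed-data likelihood and for the posterior that keeps the design. This is a sketch that follows the general treatment in Gelman et al. (1995) and is stated here as an assumption, not as a transcription of the dissertation's own displays. The only difference from (2.4) is the factor $[s \mid N, H]$ inside the integral.

```latex
\begin{align*}
[s, N_s, H_s \mid v, \theta, \psi]
  &= \iint [s \mid N, H]\,[N \mid v, H, \theta]\,[H \mid \psi]\; dN_{\bar{s}}\, dH_{\bar{s}}, \\
[\theta, \psi \mid s, N_s, H_s, v]
  &\propto [\theta, \psi] \iint [s \mid N, H]\,[N \mid v, H, \theta]\,[H \mid \psi]\; dN_{\bar{s}}\, dH_{\bar{s}}.
\end{align*}
```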

The posterior distribution of θ, ignoring the sampling design, is obtained from the expression (2.4) and is given by:

$$[\theta \mid N_s, H_s, v] \propto [\theta] \iiint [\psi \mid \theta]\,[N \mid v, H, \theta]\,[H \mid \psi]\, dN_{\bar{s}}\, dH_{\bar{s}}\, d\psi. \qquad (2.5)$$

When unobserved data supply no information, i.e. when $[\theta \mid N_s, H_s, v]$ given in


respect to the proposed model). In general, sampling plans involve some knowledge of the structure of the population, such as stratification, clustering, and unequal selection probabilities (complex sampling).

In this case, a sufficient condition to ensure design ignorability is $[s \mid N, H] = [s \mid N_s, H_s]$. The important consequence of this condition is that, from (2.3), if the sampling plan is ignorable with respect to the parameter of interest $\theta$, then $[\theta \mid s, N_s, H_s, v] = [\theta \mid N_s, H_s, v]$. Thus, the additional information brought through $s$ can be discarded when one wishes to make inference about $\theta$; otherwise, it cannot be ignored. Mistakenly ignoring an informative sampling plan in inference may distort the parameter estimates.
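The effect of ignoring an informative plan can be seen in a deliberately simple, design-based toy example. Everything in the sketch is an assumption made for illustration: the population values, the rule of drawing a single unit with probability proportional to n_i + 1, and the two estimators being compared. It computes exact expectations with fractions, so no simulation is involved.

```python
from fractions import Fraction

n = [0, 0, 0, 0, 2, 6]          # hypothetical population values
M = len(n)
mu = Fraction(sum(n), M)        # true population mean

# Informative design (assumed for illustration): draw one unit with
# probability proportional to n_i + 1, so non-empty units are favored.
weights = [ni + 1 for ni in n]
p = [Fraction(w, sum(weights)) for w in weights]

# Expected value of the naive estimator, which ignores the design and
# simply reports the sampled value.
naive_expectation = sum(pi * ni for pi, ni in zip(p, n))

# Expected value of the design-weighted estimator n_i / (M * p_i).
weighted_expectation = sum(pi * Fraction(ni) / (M * pi) for pi, ni in zip(p, n))
```

The naive expectation is 24/7, well above the true mean 4/3, while the estimator that accounts for the selection probabilities recovers 4/3 exactly. The model-based analogue is that the factor for the design must stay in the likelihood when the design is not ignorable.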

In this work, the approach based on the superpopulation model will be used, focusing on inference about the model parameters and the prediction of $T$ from data obtained by adaptive cluster sampling, which is an informative sampling plan. As usual, we will avoid evaluating the integrals presented in this section by simply drawing posterior simulations of the joint vector of unknowns, $(N_{\bar{s}}, H_{\bar{s}}, \theta, \psi)$, and then focusing on estimates.


Chapter 3 Adaptive cluster sampling

In the context of rare and grouped populations, it is usual to divide the region of interest into cells. When we use traditional sampling methods to sample a population with these characteristics, small samples of cells result in a large number of empty cells, leading us to inaccurate estimates of the population quantity of interest. Adaptive cluster sampling is an alternative to deal with this difficulty since it allows us to increase survey effort in the vicinity of regions where individuals of interest are found by using information from the observed values to be more successful in collecting additional cells.

In Section 3.1, we present the adaptive sampling plan proposed by Thompson (1990), since it is a suitable sampling plan for the type of population we aim to study in this work. In Section 3.2 some extensions of this sampling plan are briefly presented. Finally, in Section 3.3 the model-based approach proposed by Rapley & Welsh (2008) is presented, for which the sampling plan is relevant to perform Bayesian inference about the model parameters.

3.1 Thompson’s (1990) approach

Adaptive cluster sampling is generally used for sparse and clustered populations, since it helps to estimate population totals more accurately. Consider a population spread non-homogeneously over a region, for instance one in which there are clusters, with a grid of M cells superimposed upon it. In this case, conventional sampling may estimate the population total inefficiently: if the sample includes several clusters, the population total will be overestimated, and if the sample includes very few clusters, it will be underestimated. In this situation, an adaptive sampling strategy provides more efficient estimators and is therefore usually preferable.


Initially proposed by Thompson (1990), the method has proven effective in epidemiological research and in studies of rare diseases, animals and plants. Starting from an initial sample of m grid cells, when a selected cell contains a member of the population of interest, the cells sharing a common edge with it are also sampled, and we continue surveying until we obtain a set of contiguous non-empty grid cells surrounded by empty grid cells. This procedure is intuitive, since in a clustered population an element is expected to have others with similar characteristics in its vicinity. In this way, empty grid cells bring no further survey effort. Hence, to be effective, adaptive cluster sampling requires some prior knowledge about the structure of the underlying population, which may be obtained from a preceding survey.

In Figure 3.1, the method is illustrated for a population distributed over a region partitioned into M = 400 grid cells. The sampling procedure starts with a simple random sample without replacement of m = 10 units, which are displayed in gray in the grid (Figure 3.1a). Note that, from the 10 cells selected, only 2 of them contain a member of the population of interest. Next, the units neighboring these 2 units are also included in the sample (Figure 3.1b). We continue surveying neighbors until the process is finalized, Figure 3.1c, with 45 sampled cells, represented by the highlighted ones.

[Figure 3.1 panels: (a) Initial sample; (b) Sampling process; (c) Final sample]

Figure 3.1: Illustration of adaptive cluster sampling procedure for a rare clustered population distributed in a region with M = 400 grid cells. Figure (a) presents the initial sample of m = 10 cells in dark grey. From this sample, in Figure (b), neighbors are added to the sample whenever there is at least one observation (black dot) in the selected cell, finally setting the sample presented in Figure (c).
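The expansion step described above can be sketched as a breadth-first search on the grid. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function names, the toy 5 x 5 grid and the choice of initial cells are hypothetical, and neighbors are the cells sharing an edge.

```python
import random
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # cells sharing an edge

def adaptive_cluster_sample(counts, initial_cells):
    """Expand an initial sample of cells adaptively.

    Whenever a sampled cell is non-empty, its edge-sharing neighbors are
    added to the sample; expansion stops at empty cells.
    Returns the set of all sampled cells (networks plus edge cells).
    """
    n_rows, n_cols = len(counts), len(counts[0])
    sampled = set(initial_cells)
    queue = deque(initial_cells)
    while queue:
        r, c = queue.popleft()
        if counts[r][c] == 0:
            continue  # empty cells trigger no further survey effort
        for dr, dc in NEIGHBORS:
            a, b = r + dr, c + dc
            if 0 <= a < n_rows and 0 <= b < n_cols and (a, b) not in sampled:
                sampled.add((a, b))
                queue.append((a, b))
    return sampled

def initial_sample(n_rows, n_cols, m, seed=None):
    """Simple random sample without replacement of m grid cells."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    return rng.sample(cells, m)

# Hypothetical 5 x 5 grid of counts (illustrative values only).
counts = [
    [0, 0, 0, 0, 0],
    [0, 3, 1, 0, 0],
    [0, 0, 2, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 4],
]
final = adaptive_cluster_sample(counts, [(1, 1), (0, 4)])
```

Starting from one non-empty and one empty initial cell, the final sample contains the three cells of the network, its seven empty edge cells and the empty initial cell, while the isolated non-empty cell in the corner is never reached.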

The set of contiguous cells containing members of the population makes up a network, while the network together with its empty boundary cells is termed a cluster. These boundary cells are called edge cells. It is also convenient to define every empty cell as a network of its own, so an edge cell is, in fact, a network of size one. These definitions are illustrated in Figure 3.2, which shows part of the sample in Figure 3.1. The squares with bold borders correspond to the observed cluster, the squares in gray make up the nonempty network and the hatched cells represent the border units. The unit initially selected is in darker gray.


Figure 3.2: Illustration of important concepts in adaptive cluster sampling: bold bordered squares correspond to the observed cluster, gray squares are the network units, and the hatched part are the border units. The unit initially selected is in darker gray.
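The adaptive procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the dictionary representation of the grid and the 4-cell neighborhood (up, down, left, right) are assumptions made here, since the text does not fix a neighborhood definition.

```python
import random

def neighbors(cell, n_rows, n_cols):
    """Assumed 4-neighborhood (up, down, left, right) of a grid cell."""
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < n_rows and 0 <= j < n_cols]

def adaptive_cluster_sample(counts, m, rng):
    """Draw m initial cells without replacement, then survey neighbors adaptively.

    counts maps every (row, col) cell to its count (0 means empty).
    Returns the initial cells and the set of all surveyed cells (edge cells included).
    """
    n_rows = 1 + max(r for r, _ in counts)
    n_cols = 1 + max(c for _, c in counts)
    initial = rng.sample(sorted(counts), m)
    surveyed = set(initial)
    to_expand = [cell for cell in initial if counts[cell] > 0]
    while to_expand:
        cell = to_expand.pop()
        for nb in neighbors(cell, n_rows, n_cols):
            if nb not in surveyed:
                surveyed.add(nb)           # edge cells are surveyed too
                if counts[nb] > 0:         # only nonempty cells trigger more surveying
                    to_expand.append(nb)
    return initial, surveyed

# Toy 5 x 5 population with a single two-cell network.
counts = {(r, c): 0 for r in range(5) for c in range(5)}
counts[(2, 2)] = 3
counts[(2, 3)] = 1
initial, surveyed = adaptive_cluster_sample(counts, m=4, rng=random.Random(1))
```

The number of surveyed cells depends on which initial cells hit the network, which is why the final sample size is random.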

Although the initially sampled cells are distinct, a cluster may include more than one unit of the initial sample, i.e., if two non-border units in the same cluster are initially selected, then this cluster can occur twice in the final sample. Therefore, an adaptive cluster sample that begins with a selection without replacement of m initial units has a number of distinct nonempty networks less than or equal to m. Moreover, the final number of sampled cells is a random variable and cannot be fixed in advance.

When the data are analyzed, the networks become the unit of analysis and the respective boundaries are ignored if they have not appeared in the original sample (Thompson & Seber, 1996). Networks are used as the unit of analysis because it is possible to calculate inclusion probabilities, based on their size, for each of them in the model. In addition, since the networks are disjoint, they form a partition of the region of interest. Furthermore, grid cells within networks usually have a dependence structure, and taking the network as the unit of analysis avoids the need to define this structure explicitly in the model.

Conventional estimators under an adaptive cluster sampling design tend to be biased since nonempty cells are sampled disproportionately. Based on this idea, Thompson (1990) obtained an unbiased estimator of the population mean under this sample design, allowing multiple inclusions of networks. Moreover, Thompson (1990)’s approach lets edge cells be incorporated into the estimator by taking its conditional expectation given the minimal sufficient statistic, which yields its Rao-Blackwell improved version. These estimators were described and computed for small sample sizes in Thompson (1990). Following this work, some extensions of this sampling design, going beyond an initial selection based on simple random sampling, appeared in the literature and are presented below.


3.2 Extensions of adaptive cluster sampling

Several extensions of simple random sampling (e.g., stratified sampling) can also be applied to adaptive sampling. Methods for stratified adaptive cluster sampling were first proposed in Thompson (1991b) and another extension was proposed in Thompson (1991a). In these approaches, primary sampling units, for example, groups of units arranged in strips or rectangles, are defined and then randomly sampled. If a member of the population is found within a primary unit, secondary units outside of the primary unit are added to the sample in the same way as in ordinary adaptive cluster sampling (Thompson, 1991a). Further, Salehi M. & Seber (1997) proposed a scheme whereby the networks are selected one by one without replacement, avoiding the selection of the same network more than once.

Thompson & Seber (1996) examined some general ideas about model-based inference approaches for adaptive sampling. The likelihood-based methods, such as Bayesian estimation, showed promising results among model-based approaches, some of which are detailed in the next section.

3.3 Model-based inference

Bayesian inference methods for adaptive cluster sampling designs have been developed in Rapley & Welsh (2008) and Gonçalves & Moura (2016), which incorporate prior knowledge that the population is rare and grouped for both inference and sample design. Rapley & Welsh (2008) provided a model at the network level, while Gonçalves & Moura (2016) modeled at the cell level, considering heterogeneity among units belonging to different clusters. Neither work takes into account the spatial locations of the networks, which does not cause any loss of information about the population total since, under the model, the total does not depend on where the networks are located. In this chapter, we will focus on Rapley & Welsh (2008)’s model.

Rapley & Welsh (2008) proposed a complex model, which uses networks as units of analysis. Therefore, we refer to this model as the aggregate model. The use of the Bayesian approach is a natural extension of the idea of adaptive cluster sampling because it incorporates the prior knowledge that the rare population is grouped for both inference and sample design. To illustrate their proposal, Rapley & Welsh (2008) compared their estimators with the estimators developed in Thompson (1990) through a simulation study, in which their estimators proved to be efficient, especially when prior knowledge is available.


Let $R$ be a region containing a rare, clustered population, over which a regular grid partitioned into $M$ cells is overlaid. A cell is considered nonempty if it contains at least one member of the population and empty otherwise. Let $X \le M$ be the number of nonempty cells in $R$. Let $P \le X$ be the number of nonempty networks in $R$, where a network is defined as in Thompson (1990). Let $Y_i$ be the number of nonempty cells within the nonempty network $i$, for $i = 1, \ldots, P$, and therefore $Y = (Y_1, \ldots, Y_P)'$ is the vector with the number of nonempty cells within each nonempty network, so that $X = \sum_{i=1}^{P} Y_i$.

Note that there are $M - X$ empty cells, which are defined as one-sized empty networks, so there are $M - X + P$ networks in $R$. We can extend the $P$-dimensional vector $Y$ to an $(M - X + P)$-dimensional vector given by $Z = (Y', 1'_{M-X})'$, where $1_{M-X}$ is a vector of ones of dimension $M - X$. Thus, it follows that $Z_i = Y_i$ if the $i$-th network is a nonempty one and $Z_i = 1$ otherwise, for $i = 1, \ldots, M - X + P$. Let $N_i$ be the count of a given phenomenon of interest in the nonempty network $i$ and, therefore, $N = (N_1, \ldots, N_P)'$ denotes the vector with the population total in each of the nonempty networks. In order to perform inference about the population total $T = \sum_{i=1}^{P} N_i$, one must specify the joint distribution of $\{X, P, Y, N\}$ for the entire population and the sampling mechanism that provides a particular sample of $m$ networks from the $M - X + P$ networks in the population. First, we model the nonempty network structure and then, conditional on it, model the counts in the nonempty networks. Since the model applies to nonempty cells, to avoid degeneracy problems it is assumed that there is at least one nonempty cell in $R$, so distributions are left truncated at zero. The proposed model can be written as follows:

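$$
\begin{aligned}
N_i \mid P, Y_i, \gamma &\sim \text{independent truncated Poisson}(\gamma Y_i), & i &= 1, \ldots, P, \\
Y \mid X, P &\sim 1_P + \text{Multinomial}\left(X - P, \tfrac{1}{P} 1_P\right), & Y_i &= 1, \ldots, X - P, \\
P \mid X, \beta &\sim \text{truncated Binomial}(X, \beta), & P &= 1, \ldots, X, \\
X \mid \alpha &\sim \text{truncated Binomial}(M, \alpha), & X &= 1, \ldots, M.
\end{aligned}
\tag{3.1}
$$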
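To make the hierarchy in (3.1) concrete, the following sketch simulates one population from it, drawing top-down: X, then P, then Y, then N. The function names and the parameter values in the usage line are hypothetical, and truncation at zero is handled by simple rejection, which is a convenience choice made here rather than anything prescribed by the model.

```python
import numpy as np

def zero_truncated(draw):
    """Rejection sampler: repeat draw() until the value is at least 1."""
    while True:
        value = draw()
        if value >= 1:
            return value

def simulate_aggregate_model(M, alpha, beta, gamma, rng):
    """Simulate (X, P, Y, N) from the aggregate model (3.1)."""
    # Number of nonempty cells and of nonempty networks, both truncated at zero.
    X = zero_truncated(lambda: rng.binomial(M, alpha))
    P = zero_truncated(lambda: rng.binomial(X, beta))
    # Every network has at least one cell; the other X - P cells are spread uniformly.
    Y = 1 + rng.multinomial(X - P, np.full(P, 1.0 / P))
    # Network totals: zero-truncated Poisson with mean parameter gamma * Y_i.
    N = np.array([zero_truncated(lambda: rng.poisson(gamma * y)) for y in Y])
    return X, P, Y, N

# Hypothetical parameter values, chosen only for illustration.
X, P, Y, N = simulate_aggregate_model(M=400, alpha=0.1, beta=0.1, gamma=5.0,
                                      rng=np.random.default_rng(0))
```

By construction the simulated sizes satisfy the constraint that the entries of Y sum to X.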

The model in (3.1) is applied to samples collected according to the adaptive method proposed by Salehi M. & Seber (1997), which consists of observing $Y_i$ for $i \in s$ sequentially. Since the sample design depends on the structure of the population, which is unknown, it is an informative sampling design and should be incorporated into the likelihood function of the model to perform inference. Therefore, the next step is to define the probability of selecting a sample $s = \{i_1, \ldots, i_m\}$, i.e., $[s]$. It is known that


be observed and thus the probability of selecting a network is weighted according to its size. To illustrate the construction of the probability of selection of a sample, consider a population consisting of eight networks of sizes $\{1, 1, 1, 1, 3, 3, 5, 5\}$, from which the sample $\{5, 1, 5, 3\}$ is taken. The probability of selecting the first network is equal to the probability of selecting a network of size 5, which is $\frac{5 \times 2}{20}$; the probability of selecting a network of size 1 in the second step, given the previous one, is $\frac{1 \times 4}{15}$; and thus the probability of selection of this particular sample is equal to

$$
\frac{5 \times 2}{20} \times \frac{1 \times 4}{20 - 5} \times \frac{5 \times 1}{20 - 5 - 1} \times \frac{3 \times 2}{20 - 5 - 1 - 5}.
$$

Therefore, the probability of selection of a particular sample can be generalized as follows:

$$
[s \mid X, P, Y] = \prod_{j=1}^{m} \frac{Z_{i_j} \times g_{i_j, j}}{\sum_{i=1}^{M-X+P} Z_i - \sum_{k=0}^{j-1} Z_{i_k}}, \tag{3.2}
$$

where $g_{i_j, j}$ is the number of networks of size $Z_{i_j}$ unselected after $j - 1$ networks have been selected and $Z_{i_0} = 0$.
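Expression (3.2) can be checked numerically against the eight-network example. The sketch below is a direct transcription of the formula using exact fractions; the function name and its arguments are illustrative choices, not notation from the text.

```python
from collections import Counter
from fractions import Fraction

def selection_probability(sizes, sample_sizes):
    """Probability (3.2) of drawing networks of the given sizes in the given order.

    sizes lists Z_i for every network; sample_sizes lists Z_{i_1}, ..., Z_{i_m}.
    """
    remaining = Counter(sizes)   # g: number of unselected networks of each size
    total = sum(sizes)           # sum of Z_i over the networks not yet selected
    prob = Fraction(1)
    for z in sample_sizes:
        if remaining[z] == 0:
            raise ValueError(f"no unselected network of size {z} is left")
        prob *= Fraction(z * remaining[z], total)
        remaining[z] -= 1        # one fewer network of this size is available
        total -= z               # its cells leave the denominator
    return prob

p = selection_probability([1, 1, 1, 1, 3, 3, 5, 5], [5, 1, 5, 3])
print(p, float(p))  # 2/63, about 0.0317
```

The four factors are 10/20, 4/15, 5/14 and 6/9, whose product reduces to 2/63.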

Making an equivalence with the notation presented in Section 2.2, the following correspondence can be obtained: $M = M - X + P$, $H = (X, P, Y')'$, $\theta = \gamma$ and $\psi = (\alpha, \beta)'$. Note that the probability of selecting the sample $s$ does not depend directly on $N$ but depends on the variables associated with the population structure, so it is said that the sampling plan is informative with respect to $H$.


Chapter 4

Model for rare and clustered populations under adaptive sampling using covariates

Rapley & Welsh (2008) proposed a model that uses networks as units of analysis, avoiding introducing spatial components in the model, which may facilitate the inference. In this case, it is assumed that the intensity of the counts in each cluster is proportional to its size. However, this assumption is not always valid, e.g. a cluster can have a higher incidence of the phenomenon due to external factors that influence its disposition, such as a spatially referenced covariate. Thus, proposing a disaggregated model can be interesting in many contexts with rare and clustered populations. The objective of this chapter is to present a disaggregated model at cell-level, which assumes that the intensity of the counts in each cell of a cluster is related to an available covariate value. The proposed model fits rare and grouped populations sampled under adaptive cluster design. Therefore, the probability of selecting a given sample should be incorporated into the model likelihood function. Introducing additional information in the model seems to be an intuitive idea, provided that the prior knowledge indicates a relationship between the phenomenon occurrence and some covariate.

In Section 4.1, the model is introduced, a new sampling procedure is proposed and aspects of inference are discussed. Section 4.2 presents a simulation study for assessing the effectiveness of the proposed model and the model proposed by Rapley & Welsh (2008). In Section 4.3 both approaches are compared through a design-based perspective under different scenarios and in a real data application. Finally, a simulation study to evaluate the estimation of model parameters under different degrees of rarity and clustering of the population is presented in Section 4.4.

4.1 Proposed model for cell counts using covariates

Suppose the phenomenon of interest is related to covariates whose values are available for each of the cells in $R$. Let $C$ be the set of all nonempty cells of $R$ and $\bar{C}$ the set containing all empty cells of $R$. Let $\eta(c)$ be the count of a given phenomenon of interest in the cell $c$, and $v_c = (1, v_1(c), \ldots, v_k(c))'$ the vector with the $k$ covariates associated with cell $c$, for all $c \in R$. Let $\eta$ be the set with the counts for all nonempty cells, that is, $\eta = \{\eta(c) \mid c \in C\}$.

In order to perform inference about the population total $T = \sum_{c \in C} \eta(c)$, one must specify the joint distribution of $\{X, P, Y, \eta\}$ for the entire population and the sampling mechanism that provides a particular sample of $m$ networks from the $M - X + P$ networks in the population. First, we model the nonempty network structure and then, conditional on it, model the counts in the cells of the nonempty networks, similarly to Rapley & Welsh (2008)’s approach. Since the model applies to nonempty cells, to avoid degeneracy problems it is assumed that there is at least one nonempty cell in $R$, so distributions are left truncated at zero. The proposed model can be written as follows:

$$
\begin{aligned}
\eta(c) \mid v_c, \theta &\sim \text{truncated Poisson}(\lambda(c)), & \eta(c) &\ge 1, \; c \in C, \\
Y \mid X, P &\sim 1_P + \text{Multinomial}\left(X - P, \tfrac{1}{P} 1_P\right), & Y_i &= 1, \ldots, X - P, \\
P \mid X, \beta &\sim \text{truncated Binomial}(X, \beta), & P &= 1, \ldots, X, \\
X \mid \alpha &\sim \text{truncated Binomial}(M, \alpha), & X &= 1, \ldots, M,
\end{aligned}
\tag{4.1}
$$

where $\lambda(c) = \exp\{v_c'\theta\}$, $\theta = (\theta_0, \theta_1, \ldots, \theta_k)'$ represents the vector of regression coefficients associated with $v_c$ and $1_P + \text{Multinomial}(\cdot)$ represents the Multinomial distribution truncated at one. Note that the $M - X$ empty cells have their respective counts equal to zero, that is, $\eta(c) = 0$ for all $c \in \bar{C}$.
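The cell-level part of (4.1) can be illustrated with a short simulation. In this sketch the design matrix, the coefficient values and the function names are hypothetical, and the zero-truncated Poisson draws are obtained by redrawing zeros, which is just one simple way to sample from that distribution.

```python
import numpy as np

def cell_intensity(V, theta):
    """lambda(c) = exp(v_c' theta) for every row v_c of the design matrix V."""
    return np.exp(V @ theta)

def sample_truncated_poisson(lam, rng):
    """Draw one zero-truncated Poisson count per intensity by redrawing zeros."""
    lam = np.asarray(lam, dtype=float)
    counts = rng.poisson(lam)
    while (counts == 0).any():
        zero = counts == 0
        counts[zero] = rng.poisson(lam[zero])
    return counts

# Five nonempty cells with an intercept and one covariate (hypothetical values).
rng = np.random.default_rng(0)
V = np.column_stack([np.ones(5), np.linspace(0.0, 1.0, 5)])
theta = np.array([0.5, 1.0])
lam = cell_intensity(V, theta)
eta = sample_truncated_poisson(lam, rng)
```

With a positive slope, cells with larger covariate values receive larger intensities and therefore tend to have larger counts.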

Making an equivalence with the notation presented in Section 2.2, the following correspondence can be obtained: $H = (X, P, Y')'$ and $\psi = (\alpha, \beta)'$. According to the sampling procedure, the drawn sample is composed of networks, from which the cells that compose each of them and their respective counts are observed and used to model the population total at a more disaggregated level. Furthermore, we can make an equivalence between the vectors $\eta$ and $N$, since the first one contains the counts for each of the $M$ cells of $R$ and the second one for each of the $M - X + P$ networks of $R$. Let $N_i$ be the count associated with the $i$-th network, which can be empty or nonempty. If the $i$-th network is empty, then it is composed of a single cell and $N_i = \eta(c)$, where $c$ is the cell that composes the $i$-th network. On the other hand, if the $i$-th network is nonempty, then it is composed of a set of cells $c_i$ and $N_i = \sum_{c \in c_i} \eta(c)$. Moreover, the probability of selecting the sample $s$ does not depend directly on the quantities of the model, since they do not appear in its expression, but on the allocation process that produces the set of cells that compose unsampled networks and, consequently, the set $G_{i_j, j}$ in equation (4.3). Thus, it is said that the sampling plan is informative.

4.1.1 Model inference

The sampling procedure entails observing $Y_i$ for the networks $\{i_1, \ldots, i_m\}$ and the counts $\eta(c)$ for their respective cells. Since the adaptive cluster sampling procedure depends on the population structure, it is characterized as an informative sampling design and the probability of selecting the sample $s = \{i_1, \ldots, i_m\}$ of $m$ networks, $[s \mid X, P, Y]$, should be incorporated into the model likelihood function. Let the subscript $s$ identify the observed component and $\bar{s}$ the unobserved component, and define $Y = (Y_s', Y_{\bar{s}}')'$, $X = X_s + X_{\bar{s}}$ and $P = P_s + P_{\bar{s}}$ to distinguish between observed and unobserved quantities. Let $C_s$ be the set of the sample’s nonempty cells, i.e., the cells that compose the networks with sizes $Y_s$; and $C_{\bar{s}}$ be the set of the out-of-sample nonempty cells, i.e., the cells that compose the non-sampled networks with sizes $Y_{\bar{s}}$. Thus, define $\eta = (\eta_s', \eta_{\bar{s}}')'$, where $\eta_s = \{\eta(c) \mid c \in C_s\}$ and $\eta_{\bar{s}} = \{\eta(c) \mid c \in C_{\bar{s}}\}$. A natural predictor of the population total $T$ is given by:

$$
T = \sum_{c \in C_s} \eta(c) + \sum_{c \in C_{\bar{s}}} \tilde{\eta}(c),
$$

where $\tilde{\eta}(c)$ represents the posterior mean of the count of the cell $c$, for $c \in C_{\bar{s}}$.

Following the Bayesian paradigm, independent priors are also assumed for the unknown parameters $\theta$, $\alpha$ and $\beta$ and their marginal prior distributions are denoted, respectively, by $[\theta]$, $[\alpha]$ and $[\beta]$. Let $[\theta]$ be a non-informative prior with a zero-mean vector and covariance matrix $\sigma_\theta^2 I_{k+1}$, where $I_{k+1}$ denotes the $(k + 1)$-dimensional identity matrix and $\sigma_\theta^2 = 10^4$. For $\alpha$ we assumed a $\text{Beta}(a_\alpha, b_\alpha)$ distribution with $a_\alpha = 3$ and $b_\alpha = 15$, and for $\beta$ a $\text{Beta}(a_\beta, b_\beta)$ distribution with $a_\beta = 1$ and $b_\beta = 9$. The prior distributions of $\alpha$ and $\beta$ are chosen to reflect the fact that $\alpha$ and $\beta$ are necessarily small in a rare and clustered population, as considered in Rapley & Welsh (2008). In this case, the objective is not only to estimate the parameters of the model based on a sample, but also to make predictions of the unobserved parts.
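As a reading aid for the Beta hyperparameters above, the standard mean of a Beta(a, b) distribution, a/(a + b), gives the prior means below. This is only an arithmetic illustration based on that textbook formula, not a computation reported in the text:

```latex
\[
E[\alpha] = \frac{a_\alpha}{a_\alpha + b_\alpha} = \frac{3}{18} = \frac{1}{6} \approx 0.167,
\qquad
E[\beta] = \frac{a_\beta}{a_\beta + b_\beta} = \frac{1}{10} = 0.1 .
\]
```

Both prior means are small, in line with the stated aim of encoding a rare (small α) and clustered (small β) population.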


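The joint distribution of all the quantities in the model is:

$$
\begin{aligned}
[\eta, Y, P, X, \theta, \beta, \alpha] ={}& [s \mid X, P, Y]\,[\eta \mid \theta]\,[Y \mid X, P]\,[P \mid X, \beta]\,[X \mid \alpha]\,[\theta]\,[\alpha]\,[\beta] \\
\propto{}& [s \mid X, P, Y] \times \prod_{c \in C} \frac{\exp\{-\exp\{v_c'\theta\} + \eta(c)\, v_c'\theta\}}{\eta(c)!\,\left(1 - \exp\{-\exp\{v_c'\theta\}\}\right)} \\
& \times (x - p)! \prod_{i=1}^{p} \frac{1}{(y_i - 1)!} \left(\frac{1}{p}\right)^{y_i - 1}
\times \binom{x}{p} \frac{\beta^{p} (1 - \beta)^{x - p}}{1 - (1 - \beta)^{x}}
\times \binom{M}{x} \frac{\alpha^{x} (1 - \alpha)^{M - x}}{1 - (1 - \alpha)^{M}} \\
& \times \exp\left\{-\frac{1}{2\sigma_\theta^2}\, \theta'\theta\right\}
\times \alpha^{a_\alpha - 1} (1 - \alpha)^{b_\alpha - 1} \times \beta^{a_\beta - 1} (1 - \beta)^{b_\beta - 1}.
\end{aligned}
\tag{4.2}
$$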
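The cell-count factor of (4.2) is the probability mass function of a Poisson distribution left truncated at zero. A minimal sketch of its logarithm is given below; the function name is illustrative, and working on the log scale is a numerical convenience assumed here rather than something stated in the text.

```python
import math

def log_truncated_poisson_pmf(count, lam):
    """log P(eta = count) for a Poisson(lam) left truncated at zero (count >= 1)."""
    if count < 1:
        return float("-inf")
    # Untruncated Poisson term: -lam + count * log(lam) - log(count!)
    log_poisson = -lam + count * math.log(lam) - math.lgamma(count + 1)
    # Truncation constant: log(1 - exp(-lam))
    log_norm = math.log1p(-math.exp(-lam))
    return log_poisson - log_norm

# The probabilities over count = 1, 2, ... should add up to one.
total = sum(math.exp(log_truncated_poisson_pmf(k, 2.5)) for k in range(1, 200))
```

Setting lam = exp(v_c' theta) recovers the term inside the product over c in (4.2).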

We perform inference via MCMC to obtain samples from the resulting posterior distribution. The full conditional posterior distributions and the methods adopted to sample from each of them are detailed in Appendix A. In comparison with the sampling procedure proposed by Salehi M. & Seber (1997), our improved sampling process tends to draw a greater number of networks, providing samples that may include all networks from R (see details in Subsection 4.1.2). Thus, our proposal distribution, unlike that of Rapley & Welsh (2008), may lead to no out-of-sample nonempty cells and, consequently, no out-of-sample networks (see details in Appendix A). The estimation procedure consists of the following steps:

(1) Initialize the counter $j = 2$ and set initial values for the parameters and quantities of the model: $\theta^{(1)}$, $\alpha^{(1)}$, $\beta^{(1)}$, $X_{\bar{s}}^{(1)}$, $P_{\bar{s}}^{(1)}$, $Y_{\bar{s}}^{(1)}$ and $\eta_{\bar{s}}^{(1)}$;

(2) Update the model parameters $\theta$, $\alpha$ and $\beta$ from the conditional distributions $[\theta \mid \alpha^{(j-1)}, \beta^{(j-1)}, X^{(j-1)}, P^{(j-1)}, Y^{(j-1)}, \eta^{(j-1)}]$, $[\alpha \mid \theta^{(j)}, \beta^{(j-1)}, X^{(j-1)}, P^{(j-1)}, Y^{(j-1)}, \eta^{(j-1)}]$ and $[\beta \mid \theta^{(j)}, \alpha^{(j)}, X^{(j-1)}, P^{(j-1)}, Y^{(j-1)}, \eta^{(j-1)}]$, described in Appendix A;

(3) Generate the non-sampled quantities $X_{\bar{s}}$, $P_{\bar{s}}$ and $Y_{\bar{s}}$ according to the proposal distribution described in Section A.4;

(4) Allocate the $P_{\bar{s}}$ networks of $Y_{\bar{s}}$ according to the allocating procedure described in Subsection 4.1.1.1;

(5) Generate $\eta_{\bar{s}}$ and jointly update $X_{\bar{s}}$, $P_{\bar{s}}$, $Y_{\bar{s}}$ and $\eta_{\bar{s}}$ from the conditional distribution $[X_{\bar{s}}, P_{\bar{s}}, Y_{\bar{s}}, \eta_{\bar{s}} \mid \theta^{(j)}, \alpha^{(j)}, \beta^{(j)}, X, \ldots]$;

(6) Increment the counter $j$ to $j + 1$ and iterate from (2).

Note that the regression coefficients $\theta$ are updated in step (2) based only on the sample information. Moreover, from them, we can easily obtain the Poisson distribution’s intensity $\lambda(c) = \exp\{v_c'\theta\}$ for any non-sampled cell $c$ of $R$, which is used later to estimate $\eta(c)$ for every nonempty and non-sampled cell $c$ of $R$. Let $\lambda$ be the set of intensities assigned to all cells of $R$. Then, after generating the non-sampled quantities $X_{\bar{s}}$, $P_{\bar{s}}$ and $Y_{\bar{s}}$, all that remains is to find out which cells form each of these $P_{\bar{s}}$ networks in step (4), according to the allocating procedure presented in Subsection 4.1.1.1.
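As an illustration of how the update of θ in step (2) could look, the sketch below uses a random-walk Metropolis step on the θ-dependent terms of (4.2), evaluated on the sampled nonempty cells only. This is an assumption made for illustration: the actual sampling method is the one detailed in Appendix A, which is not reproduced here, and the function names, the proposal scale `step` and the toy data are hypothetical.

```python
import numpy as np

def log_posterior_theta(theta, V, eta, sigma2=1e4):
    """Log of the theta-dependent terms of (4.2), up to an additive constant.

    V holds one covariate row v_c per sampled nonempty cell; eta holds their counts.
    """
    linear = V @ theta
    lam = np.exp(linear)
    # Truncated Poisson terms (the eta(c)! factor does not depend on theta).
    log_lik = np.sum(-lam + eta * linear - np.log1p(-np.exp(-lam)))
    # Zero-mean prior with covariance sigma2 * I, as in the exponential term of (4.2).
    log_prior = -0.5 * (theta @ theta) / sigma2
    return log_lik + log_prior

def metropolis_step(theta, V, eta, rng, step=0.1):
    """One random-walk Metropolis update of theta (assumed, hypothetical scheme)."""
    proposal = theta + step * rng.standard_normal(theta.shape)
    log_ratio = log_posterior_theta(proposal, V, eta) - log_posterior_theta(theta, V, eta)
    if np.log(rng.uniform()) < log_ratio:
        return proposal
    return theta

# Toy data: three sampled nonempty cells, intercept plus one covariate.
rng = np.random.default_rng(0)
V = np.column_stack([np.ones(3), np.array([0.0, 0.5, 1.0])])
eta = np.array([1, 2, 4])
theta = np.zeros(2)
for _ in range(50):
    theta = metropolis_step(theta, V, eta, rng)
```

Once θ is updated, the intensities for the non-sampled cells follow directly from exp(V_new @ theta) for their covariate rows.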

4.1.1.1 Allocating procedure

Determining the cells that compose the out-of-sample nonempty networks is a crucial step in the estimation of the proposed model, since the resulting allocation directly impacts both the cells that compose $C_{\bar{s}}$ and the estimated value of $\eta_{\bar{s}}$. The generated out-of-sample networks are allocated sequentially according to their size: the bigger networks are allocated first and the smaller ones later. It is assumed that the bigger the network, the higher its cells’ intensity values. Note that the cells that compose the set of out-of-sample cells, $C_{\bar{s}}$, must be part neither of the set of sampled cells, $C_s$, nor of the sampled nonempty networks’ borders (otherwise, a previously sampled network would be modified).

The allocating procedure aims to draw the cells that compose each generated out-of-sample network according to given weights. In this case, we will use the set of intensities $\lambda$, although one could sample the cells based on other practical weights. The weights $\lambda$ of the cells in $C_s$ and of the visited border cells are set to zero. An example of this procedure is illustrated in Figure 4.1. The allocation of a network of size $Y$ proceeds as follows: draw an available cell $c$ with probability proportional to the weights $\lambda$ and, if $Y > 1$, draw another cell from the neighbors of that cell and continue to draw further neighboring cells until we obtain a set of $Y$ contiguous nonempty grid cells surrounded by empty grid cells. Then remove this network from the population, select one of the remaining grid cells with probability proportional to the weights $\lambda$ and proceed in this way until we have allocated all the $P_{\bar{s}}$ networks. Note that the cells that were not


Figure 4.1: Illustration of the allocation method for two out-of-sample networks of sizes 3 and 2, based on weights λ (gray background). The lighter the cell’s color, the higher the λ value (intensity) of that cell. The white cells with bold borders, whose weights are equal to zero, are the sampled networks’ cells and the hatched ones correspond to the nonempty networks’ border cells. The red borders surround cells that can be drawn in each stage of the procedure. The cells that compose the first and second allocated networks are painted blue and green, respectively. The example proceeds as follows: draw one of the red-bordered cells of Panel (a). The drawn cell is shown in blue in Panel (b) and only its neighbors can be drawn to keep building this network. In Panels (c) and (d) the allocation of the 3-sized network is finished, and we can draw any cell with a red border to start the allocation of the 2-sized network. Panels (e) and (f) present the cells chosen to compose this network in green.
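The allocating procedure can be sketched as follows. Several details are assumptions made for this illustration and are not fixed by the text: a 4-cell neighborhood, neighbor draws that are also proportional to the weights, zero weights assigned to the border of each newly allocated network (so that it stays surrounded by empty cells), and a retry when a partially built network has no available neighbor left. All names are hypothetical.

```python
import numpy as np

def grid_neighbors(cell, shape):
    """Assumed 4-neighborhood of a cell inside a grid of the given shape."""
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < shape[0] and 0 <= j < shape[1]]

def weighted_choice(cells, weights, rng):
    """Pick one cell with probability proportional to its weight."""
    w = np.array([weights[c] for c in cells], dtype=float)
    return cells[rng.choice(len(cells), p=w / w.sum())]

def grow_network(size, weights, rng):
    """Try to build one contiguous network; return None if it hits a dead end."""
    available = [tuple(int(v) for v in c) for c in np.argwhere(weights > 0)]
    network = [weighted_choice(available, weights, rng)]
    while len(network) < size:
        frontier = sorted({nb for cell in network
                           for nb in grid_neighbors(cell, weights.shape)
                           if nb not in network and weights[nb] > 0})
        if not frontier:
            return None
        network.append(weighted_choice(frontier, weights, rng))
    return network

def allocate_networks(sizes, weights, rng, max_tries=100):
    """Allocate networks of the given sizes, largest first.

    weights is a 2-D array; zeros mark sampled cells and visited borders.
    """
    weights = np.array(weights, dtype=float)   # work on a copy
    networks = []
    for size in sorted(sizes, reverse=True):
        network = None
        for _ in range(max_tries):
            network = grow_network(size, weights, rng)
            if network is not None:
                break
        if network is None:
            raise RuntimeError(f"could not place a network of size {size}")
        for cell in network:                   # remove the network and its border
            weights[cell] = 0.0
            for nb in grid_neighbors(cell, weights.shape):
                weights[nb] = 0.0
        networks.append(network)
    return networks

# Two out-of-sample networks of sizes 3 and 2 on a 6 x 6 grid with flat weights.
nets = allocate_networks([2, 3], np.ones((6, 6)), np.random.default_rng(0))
```

In practice the weight array would hold the intensities λ, with zeros on the sampled cells and on the visited border cells.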

4.1.2 Sampling procedure

A variation of the sampling procedure proposed by Salehi M. & Seber (1997) is proposed here to improve the sampling process, aiming to sample more nonempty networks. Let π be the set of sampling weights assigned to all cells of R and π(c) the weight for a given cell c. The procedure consists of sampling a grid cell from the set of M grid cells with probability proportional to the weights π and, if it is nonempty, surveying the entire network containing the selected grid cell. After removing this network from the population, a new cell is selected from the remaining set of grid cells and the method proceeds in this way until we have selected m networks in the sample. Note that a nonempty network is surrounded by empty cells that make up its border, and these border cells can be resampled.


4.2, is divided into two stages and is based on weights that are used to draw the sample. In the first stage, $m_1$ networks are selected considering grid cells with equal weights, i.e., $\pi(c)$ is constant for all $c \in R$. The sampling procedure continues until all nonempty cells in the neighborhood are observed and stops when empty units are visited. Thus, the networks are selected with probability proportional to their size. Note that, during this process, although the border cells are visited, they are not added to the sample. Based on the fit of the proposed model in equation (4.1) to this first sample with $m_1$ networks, we obtain the vector of weights $\omega$ for all non-sampled cells of $R$, which are used to select the second sample. Let $\omega(c)$ be the weight defined by the posterior mean of $\eta(c)$, for each cell $c \in R$. Note that the higher the posterior mean of a cell count, the higher the chance of selecting that cell. Due to the inference process, the weights $\omega$ associated with the border cells are set to zero. Since the cells of the networks in the first sample must not be drawn in the second sampling stage, the weights associated with these cells are set to zero too. Then, a second sample of $m_2$ networks is drawn with probability proportional to the weights $\omega$. Hence, the final sample is given by $s = s_1 \cup s_2 = \{i_1, \ldots, i_{m_1}, i_{m_1+1}, \ldots, i_{m_1+m_2}\}$, with size $m = m_1 + m_2$.

[Figure 4.2 panels: (a) Population and constant weights; (b) First sample: $m_1$ networks; (c) First sample and weights $\omega$; (d) Final sample: $m_1 + m_2$ networks. The first stage leads from (a) to (b) and the second stage from (c) to (d).]

Figure 4.2: Illustration of the proposed sampling procedure for a population (points) distributed in a region with M = 400 cells, together with the weights (gray background) used in this scheme. The lighter the cell’s color, the higher the weight of that cell. The white cells with bold borders, whose weights are equal to zero, are the sampled networks’ cells and the hatched cells correspond to the nonempty networks’ border cells. Panels (a) and (b) share the same grayscale, since all cells have constant weight in the first stage; Panels (c) and (d) show different shades of gray due to the different weights of the second stage.

To motivate the notation for the probability of selecting a given sample, consider a population consisting of networks of sizes $Z$ from which we obtain the ordered sample $s = \{i_1, \ldots, i_m\}$. The probability of selecting the $j$-th network of the sample, that is, a network of size $Z_{i_j}$, is given by the sum of the probabilities of selecting each unselected network of size $Z_{i_j}$ after $j - 1$ networks have been observed, since networks with the same size are considered alike. Thus, the probability of selecting a network in the sample depends on its size $Z_i$, which is only observed for the sampled networks after their selection in the sample.

Let $c_j$ be the set of sampled cells in the $j$-th draw. Thus, $c_j$ is composed of the drawn grid cell and, if it is nonempty, $c_j$ contains the entire network containing the selected grid cell. Let $G_{i_j, j}$ be the set of cells that compose unselected networks of size $Z_{i_j}$ after $j - 1$ networks have been selected. Thus, in general, the probability of selecting the sample $s = \{i_1, \ldots, i_m\}$ of $m$ networks is given by:

$$
[s \mid X, P, Y] = \prod_{j=1}^{m} \frac{\sum_{g \in G_{i_j, j}} \pi(g)}{\sum_{r \in R} \pi(r) - \sum_{k=0}^{j-1} \sum_{c \in c_k} \pi(c)}, \tag{4.3}
$$

where $\pi(c)$ represents the weight of the cell $c$ and is:

$$
\pi(c) =
\begin{cases}
\text{constant}, & \text{if } c \in s_1; \\
\omega(c), & \text{if } c \in s_2.
\end{cases}
$$

Note that in equation (4.3), the index $j$ represents the $j$-th draw, so $c \in s_1$ for $j = 1, \ldots, m_1$, and $c \in s_2$ for $j = m_1 + 1, \ldots, m$. When the proposed model in equation (4.1) is fitted to the first sample (to obtain the weights $\omega$), the weights $\pi(c)$ are constant and the probability given in expression (4.3) matches the probability of selecting a sample $s$ given in Rapley & Welsh (2008). On the other hand, differently from Rapley & Welsh (2008), the probability of selecting a given sample $s$ does not depend directly on the quantities of the model, but on the weights of the networks’ cells.
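The remark that (4.3) reduces to the size-based probability under constant weights can be checked numerically. The sketch below evaluates (4.3) for a single stage with fixed cell weights; the function name, the cell labels and the data layout are hypothetical, and handling the change of weights between the two stages is deliberately left out.

```python
from fractions import Fraction

def selection_probability_weighted(networks, order, weight):
    """Evaluate (4.3) for one sampling stage with fixed cell weights.

    networks: list of networks, each given as a list of cell labels.
    order: indices of the sampled networks, in draw order.
    weight: dict mapping each cell label to its weight pi(c).
    """
    unselected = set(range(len(networks)))
    remaining_weight = sum(Fraction(weight[c]) for net in networks for c in net)
    prob = Fraction(1)
    for i in order:
        size = len(networks[i])
        # G_{i_j, j}: cells of the still-unselected networks with the same size.
        same_size_cells = [c for k in unselected if len(networks[k]) == size
                           for c in networks[k]]
        prob *= sum(Fraction(weight[c]) for c in same_size_cells) / remaining_weight
        unselected.remove(i)
        remaining_weight -= sum(Fraction(weight[c]) for c in networks[i])
    return prob

# Eight networks of sizes {1, 1, 1, 1, 3, 3, 5, 5} with constant cell weights.
sizes = [1, 1, 1, 1, 3, 3, 5, 5]
networks, label = [], 0
for z in sizes:
    networks.append(list(range(label, label + z)))
    label += z
weight = {c: 1 for c in range(label)}
# Draw order: a size-5 network, a size-1 network, the other size-5 one, a size-3 one.
p = selection_probability_weighted(networks, [6, 0, 7, 4], weight)
```

With unit weights each numerator counts the cells in the unselected networks of the drawn size, so the result coincides with the 2/63 obtained from the size-based expression for the same example.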

The cells that compose each non-sampled network are defined by their allocation process (described in Subsection 4.1.1.1), which directly impacts the weights $\omega$ used to select the second sample. Therefore, the proposed model must properly determine the cells that compose the out-of-sample nonempty networks. If a cell is part of $C_{\bar{s}}$ in a large number of MCMC iterations, this cell tends to be a nonempty cell of $R$ and the associated posterior mean will be high, while, if a cell does not compose $C_{\bar{s}}$ in a large number of MCMC iterations, this cell tends to be an empty cell of $R$. It is expected that this novel sampling method based on weights will lead to a more efficient selection of networks, as we are assigning higher chances to the cells where the phenomenon of interest is expected to be found and avoiding sampling in areas where the expected intensity of the phenomenon’s occurrence is low.

The proposed sampling methodology consists of the following steps:

(1) Consider a region $R$ containing a rare, clustered population, partitioned into $M$ cells, and draw an adaptive cluster sample of $m_1$ networks, which is equivalent to drawing a sample of $m_1$ networks with probability proportional to their sizes, i.e., the elements of the vector of weights $\pi$ are constant;

(2) Fit the proposed model in equation (4.1) to this first sample to obtain the posterior mean of the cells’ counts $\eta(c)$, given in the vector $\omega$, which will be used as weights to select the second sample;

(3) Since the cells of the networks in the first sample must not be drawn in the second sampling stage, set to zero the weights associated with the cells of the first sample, as well as those of the nonempty networks’ border cells;

(4) From the remaining cells of $R$, draw $m_2$ networks with probability proportional to the weights $\omega$;

(5) Finally, fit the proposed model in equation (4.1) to the final sample of size $m = m_1 + m_2$ to estimate the population total.
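Steps (1) to (4) above can be sketched as a small two-stage routine. This is a schematic illustration under stated assumptions: the model fit of step (2) is replaced by a hypothetical callback `posterior_mean` that returns one weight per cell, `border_of` is a hypothetical helper returning the border cells of a network (empty for one-cell empty networks), and the toy population lives on a line of ten cells rather than on a grid.

```python
import numpy as np

def draw_networks(network_of, weights, m, rng):
    """Draw m distinct networks by picking cells with probability proportional to weights."""
    weights = dict(weights)                      # work on a copy
    sample = []
    for _ in range(m):
        cells = [c for c, w in weights.items() if w > 0]
        p = np.array([weights[c] for c in cells], dtype=float)
        cell = cells[rng.choice(len(cells), p=p / p.sum())]
        network = network_of[cell]               # whole network containing the drawn cell
        sample.append(network)
        for c in network:                        # remove the network from the population
            weights[c] = 0.0
    return sample

def two_stage_sample(network_of, border_of, posterior_mean, m1, m2, rng):
    # Step (1): constant weights, so networks are drawn proportionally to their size.
    s1 = draw_networks(network_of, {c: 1.0 for c in network_of}, m1, rng)
    # Step (2): weights omega from a model fit (hypothetical callback).
    omega = dict(posterior_mean(s1))
    # Step (3): zero weights for first-sample cells and their border cells.
    for network in s1:
        for c in list(network) + list(border_of(network)):
            omega[c] = 0.0
    # Step (4): second sample drawn with probability proportional to omega.
    s2 = draw_networks(network_of, omega, m2, rng)
    return s1, s2

# Toy line of 10 cells: cells 3 and 4 form the only nonempty network.
network_of = {c: (c,) for c in range(10)}
network_of[3] = network_of[4] = (3, 4)
border_of = lambda net: (2, 5) if net == (3, 4) else ()
flat_fit = lambda s1: {c: 1.0 for c in range(10)}   # stand-in for the model fit
s1, s2 = two_stage_sample(network_of, border_of, flat_fit, 2, 2, np.random.default_rng(0))
```

Step (5), the final model fit on the combined sample, is outside the scope of this sketch.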

4.1.2.1 Border-sampling procedure

Through the proposed sampling method, we survey a selected grid cell and, if it is nonempty, the entire network containing the selected grid cell. It is important to remark that nonempty networks are surrounded by empty cells that compose their borders, which are not removed from R unless they are drawn as an empty network. Thus, a surveyed border cell can be drawn later, although we know that it is empty.

To avoid surveying the same border cell twice, we propose an alternative sampling method, given as follows: draw a grid cell from R with probability proportional to the weights π, survey that grid cell and, if it is nonempty, survey the entire network containing the selected cell. After removing this network and its border from the population, select a new cell from the remaining set of grid cells and proceed in this way until we have selected m networks in the sample. In practice, proceeding this way is equivalent to surveying clusters instead of networks, though the final sample structure is the same as
