Karvonen, T. (2023). Small Sample Spaces for Gaussian Processes. Bernoulli, 29(2), 875–900. https://doi.org/10.3150/22-BEJ1483

http://hdl.handle.net/10138/356782

Accepted version (arXiv:2103.03169v3 [math.PR]), downloaded from Helda, the University of Helsinki institutional repository. This electronic reprint may differ from the original article in pagination and typographic detail; please cite the original version.


Small Sample Spaces for Gaussian Processes

TONI KARVONEN¹,²,*

¹The Alan Turing Institute, 96 Euston Road, London NW1 2DB, United Kingdom

²University of Helsinki, Department of Mathematics and Statistics, Pietari Kalmin katu 5, 00560 Helsinki, Finland

E-mail: *toni.karvonen@helsinki.fi

It is known that the membership in a given reproducing kernel Hilbert space (RKHS) of the samples of a Gaussian process $X$ is controlled by a certain nuclear dominance condition. However, it is less clear how to identify a "small" set of functions (not necessarily a vector space) that contains the samples. This article presents a general approach for identifying such sets. We use scaled RKHSs, which can be viewed as a generalisation of Hilbert scales, to define the sample support set as the largest set which is contained in every element of full measure under the law of $X$ in the $\sigma$-algebra induced by the collection of scaled RKHSs. This potentially non-measurable set is then shown to consist of those functions that can be expanded in terms of an orthonormal basis of the RKHS of the covariance kernel of $X$ and have their squared basis coefficients bounded away from zero and infinity, a result suggested by the Karhunen–Loève theorem.

Keywords: Gaussian processes; sample path properties; reproducing kernel Hilbert spaces

1. Introduction

Let $K\colon T \times T \to \mathbb{R}$ be a positive-semidefinite kernel on a set $T$ and consider any Gaussian process $(X(t))_{t \in T}$ with mean zero and covariance $K$, which we denote $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$. Let $H(K)$ be the reproducing kernel Hilbert space (RKHS) of $K$ equipped with inner product $\langle \cdot, \cdot \rangle_K$ and norm $\|\cdot\|_K$. It is a well-known fact, apparently originating with Parzen [28], that the samples of $X$ are not contained in $H(K)$ if this space is infinite-dimensional. Furthermore, Driscoll [9]; Fortet [12]; and Lukić and Beder [24] have used the zero-one law of Kallianpur [15] to show essentially that, given another kernel $R$ and under certain mild assumptions,
$$ \mathbb{P}[X \in H(R)] = 1 \ \text{ if } R \gg K \quad \text{and} \quad \mathbb{P}[X \in H(R)] = 0 \ \text{ if } R \not\gg K, $$
where $R \gg K$ signifies that $R$ dominates $K$ (i.e., $H(K) \subset H(R)$) and, moreover, that the dominance is nuclear (see Section 4.1 for details). This Driscoll's theorem is an exhaustive tool for verifying whether or not the samples from a Gaussian process are contained in a given RKHS. A review of the topic can be found in [13, Chapter 4]. Two questions now arise:

• How to construct a kernel $R$ such that $R \gg K$?

• Is it possible to exploit the fact that $\mathbb{P}[X \in H(R_1) \setminus H(R_2)] = 1$ for any kernels such that $R_1 \gg K$ and $R_2 \not\gg K$ to identify in some sense the smallest set of functions which contains the samples with probability one?

Answers to questions such as these are instructive for theory and design of Gaussian process based learning [11,40], emulation and approximation [18,41], and optimisation [3] methods. For simplicity we assume that the domain $T$ is a complete separable metric space, that the kernel $K$ is continuous and its RKHS is separable, and that the samples of $X$ are continuous. Although occasionally termed "rather restrictive" [9, p. 309], these continuity assumptions are satisfied by the vast majority of domains and Gaussian processes commonly used in the statistics and machine learning literature [31,35], such as


stationary processes with Gaussian or Matérn covariance kernels. Our motivation for imposing these restrictions is that they imply that RKHSs are measurable.

1.1. Contributions

First, we present a flexible construction for a kernel $R$ such that $R \gg K$. For any orthonormal basis $\Phi = (\phi_n)_{n=1}^{\infty}$ of $H(K)$ the kernel $K$ can be written as $K(t, t') = \sum_{n=1}^{\infty} \phi_n(t)\phi_n(t')$ for all $t, t' \in T$. Given a positive sequence $A = (\alpha_n)_{n=1}^{\infty}$ such that $\sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 < \infty$ for all $t \in T$, we define the scaled kernel
$$ K_{A,\Phi}(t, t') = \sum_{n=1}^{\infty} \alpha_n \phi_n(t) \phi_n(t'). \tag{1.1} $$
This is a significant generalisation of the concept of powers of kernels which has been previously used to construct RKHSs which contain the samples by Steinwart [36]. We call the sequence $A$ a $\Phi$-scaling of $H(K)$. If $\alpha_n \to \infty$ as $n \to \infty$, the corresponding scaled RKHS¹, $H(K_{A,\Phi})$, is a proper superset of $H(K)$, though not necessarily large enough to contain the samples of $X$. We show that convergence of the series $\sum_{n=1}^{\infty} \alpha_n^{-1}$ controls whether or not samples are contained in $H(K_{A,\Phi})$. If $\sum_{n=1}^{\infty} \alpha_n^{-1}$ converges slowly, we can therefore interpret $H(K_{A,\Phi})$ as a "small" RKHS which contains the samples.
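To make the construction (1.1) concrete, the following sketch (ours, not part of the article) builds a truncated scaled kernel from the Mercer basis of the Brownian motion covariance $K(t, t') = \min\{t, t'\}$ on $[0, 1]$, for which an orthonormal basis of $H(K)$ is known in closed form. The scaling $\alpha_n = \sqrt{n}$ is a valid $\Phi$-scaling here because $\sum_n \alpha_n \phi_n(t)^2 \lesssim \sum_n n^{-3/2} < \infty$, although $\sum_n \alpha_n^{-1} = \infty$, so by Main Result I below the resulting scaled RKHS still fails to contain the Brownian motion samples; the truncation level and grid are illustrative choices.

```python
import numpy as np

# Mercer pairs of the Brownian motion kernel K(t, t') = min(t, t') on [0, 1]:
# eigenvalues lambda_n = ((n - 1/2) pi)^(-2), eigenfunctions sqrt(2) sin((n - 1/2) pi t),
# so phi_n = sqrt(lambda_n) * sqrt(2) * sin((n - 1/2) pi t) is an orthonormal basis of H(K).
def onb(t, n_terms):
    n = np.arange(1, n_terms + 1)
    lam = 1.0 / ((n - 0.5) ** 2 * np.pi ** 2)
    return np.sqrt(2.0 * lam) * np.sin((n - 0.5) * np.pi * np.asarray(t)[:, None])

def scaled_kernel(t, s, alpha):
    """Truncated scaled kernel K_{A,Phi}(t, s) = sum_n alpha_n phi_n(t) phi_n(s)."""
    Pt, Ps = onb(t, alpha.size), onb(s, alpha.size)
    return (Pt * alpha) @ Ps.T

t = np.linspace(0.0, 1.0, 50)
n_terms = 2000
alpha = np.sqrt(np.arange(1.0, n_terms + 1))   # alpha_n = sqrt(n), a valid Phi-scaling here

# alpha_n = 1 recovers (a truncation of) K itself ...
print(np.max(np.abs(scaled_kernel(t, t, np.ones(n_terms)) - np.minimum.outer(t, t))))
# ... and the scaled Gram matrix is positive semi-definite (cf. Proposition 3.2).
print(np.min(np.linalg.eigvalsh(scaled_kernel(t, t, alpha))) > -1e-8)
```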

Main Result I (Theorem 4.3). Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ a $\Phi$-scaling of $H(K)$. If $K_{A,\Phi}$ is continuous and $d_K(t, t') = \|K(\cdot, t) - K(\cdot, t')\|_K$ is a metric on $T$, then either
$$ \mathbb{P}[X \in H(K_{A,\Phi})] = 0 \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} = \infty \quad \text{or} \quad \mathbb{P}[X \in H(K_{A,\Phi})] = 1 \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} < \infty. $$

In Section 5, we use this result to study sample path properties of Gaussian processes defined by infinitely smooth kernels. These appear to be the first sufficiently descriptive results of their kind. An example of an infinitely smooth kernel that we consider is the univariate (i.e., $T \subset \mathbb{R}$) Gaussian kernel $K(t, t') = \exp(-(t - t')^2 / (2\ell^2))$ with length-scale $\ell > 0$, for which we explicitly construct several scaled kernels $R$ whose RKHSs are "small" but still contain the samples of $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$. In Section 6, Theorem 4.3 is applied to provide an intuitive explanation for a conjecture by Xu and Stein [41] on the asymptotic behaviour of the maximum likelihood estimate of the scaling parameter of the Gaussian kernel when the data are generated by a monomial function on a uniform grid.

Secondly, we use Theorem 4.3 to construct a "small" set which "almost" contains the samples. This sample support set is distinct from the traditional topological support of a Gaussian measure; see the discussion at the end of Section 2. Let $C(T)$ denote the set of continuous functions on $T$.

Main Result II (Theorems 4.8 and 4.11). Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and suppose there is a $\Phi$-scaling $A = (\alpha_n)_{n=1}^{\infty}$ such that $\sum_{n=1}^{\infty} \alpha_n^{-1} < \infty$ and $K_{A,\Phi}$ is continuous. Let $\mathcal{S}(\mathcal{R})$ be the $\sigma$-algebra generated by the collection of scaled RKHSs consisting of continuous functions and $S_{\mathcal{R}}(K)$ the largest subset of $C(T)$ that is contained in every $H \in \mathcal{S}(\mathcal{R})$ such that $\mathbb{P}[X \in H] = 1$. Suppose that $d_K(t, t') = \|K(\cdot, t) - K(\cdot, t')\|_K$ is a metric on $T$. Then $S_{\mathcal{R}}(K)$ consists precisely of the functions $f = \sum_{n=1}^{\infty} f_n \phi_n$ such that
$$ \liminf_{n \to \infty} f_n^2 > 0 \quad \text{and} \quad \sup_{n \geq 1} f_n^2 < \infty. \tag{1.2} $$
Furthermore, for every $H \in \mathcal{S}(\mathcal{R})$ such that $\mathbb{P}[X \in H] = 1$ there exists $F \in \mathcal{S}(\mathcal{R})$ such that $S_{\mathcal{R}}(K)$ is a proper subset of $F$ and $F$ is a proper subset of $H$.

¹These spaces are not to be confused with classical Hilbert scales defined via powers of a strictly positive self-adjoint operator; see [10, Section 8.4] and [20].

The set $S_{\mathcal{R}}(K)$ may fail to be measurable. The latter part of the above result is therefore important in demonstrating that it is possible to construct sets which are arbitrarily close to $S_{\mathcal{R}}(K)$ and contain the samples using countably many elementary set operations of scaled RKHSs. At its core this is a manifestation of the classical result that there is no meaningful notion of a boundary between convergent and divergent series [19, § 41]. The general form of the Karhunen–Loève theorem is useful in explaining the characterisation in (1.2). If $(\phi_n)_{n=1}^{\infty}$ is any orthonormal basis of $H(K)$, then the Gaussian process can be written as $X(t) = \sum_{n=1}^{\infty} \zeta_n \phi_n(t)$ for all $t \in T$, where the $\zeta_n$ are independent standard normal random variables. The series converges in $L^2(\mathbb{P})$, but if almost all samples of $X$ are continuous, the convergence is also uniform on $T$ with probability one [1, Theorem 3.8]. Because $\|X\|_K^2 = \sum_{n=1}^{\infty} \zeta_n^2$ and $\mathbb{E}[\zeta_n^2] = 1$ for every $n$, the Karhunen–Loève expansion suggests, somewhat informally, that the samples are functions $f = \sum_{n=1}^{\infty} f_n \phi_n$ for which the sequence $(f_n^2)_{n=1}^{\infty}$ satisfies (1.2).
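As an informal numerical illustration of this heuristic (ours, not from the article), the sketch below samples a truncated Karhunen–Loève expansion using the explicit Brownian motion basis from the previous sketch and reports running averages of the squared coefficients $\zeta_n^2$, which hover around their mean of one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms = 500
n = np.arange(1, n_terms + 1)
t = np.linspace(0.0, 1.0, 400)

# Orthonormal basis of H(K) for K(t, t') = min(t, t') (illustrative choice of kernel).
lam = 1.0 / ((n - 0.5) ** 2 * np.pi ** 2)
phi = np.sqrt(2.0 * lam) * np.sin((n - 0.5) * np.pi * t[:, None])

# Truncated Karhunen-Loeve sample X(t) = sum_n zeta_n phi_n(t) with zeta_n ~ N(0, 1).
zeta = rng.standard_normal(n_terms)
X = phi @ zeta

# E[zeta_n^2] = 1, so running averages of the squared coefficients stay near one;
# this is the informal sense in which the squared coefficients of a sample behave
# like a sequence satisfying (1.2).
running_mean = np.cumsum(zeta ** 2) / n
print(running_mean[[9, 99, 499]])
```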

1.2. On Measurability and Continuity

Suppose for a moment that $T$ is an arbitrary uncountable set, $K$ a positive-semidefinite kernel on $T$ such that its RKHS $H(K)$ is infinite-dimensional, and $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ a generic Gaussian process defined on a generic probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Let $\mathbb{R}^T$ be the collection of real-valued functions on $T$ and $\tilde{\mathcal{B}}$ the $\sigma$-algebra generated by cylinder sets of the form $\{f \in \mathbb{R}^T : (f(t_1), \ldots, f(t_n)) \in B_n\}$ for any $n \in \mathbb{N}$ and any Borel set $B_n \subset \mathbb{R}^n$. Let $\Phi_X(\omega) = X(\cdot, \omega)$. Then $\tilde{\mu}_X = \mathbb{P} \circ \Phi_X^{-1}$ is the law of $X$ on the measurable space $(\mathbb{R}^T, \tilde{\mathcal{B}})$. Consequently, $\mathbb{P}[X \in H] = \mathbb{P}(\{\omega \in \Omega : X(\cdot, \omega) \in H\}) = \tilde{\mu}_X(H)$ for $H \in \tilde{\mathcal{B}}$. Let $(\mathbb{R}^T, \tilde{\mathcal{B}}_0, \tilde{\mu}_{X,0})$ be the completion of $(\mathbb{R}^T, \tilde{\mathcal{B}}, \tilde{\mu}_X)$ and $R\colon T \times T \to \mathbb{R}$ a positive-semidefinite kernel and $H(R)$ its RKHS. The following facts are known about the measurability of $H(K)$ and $H(R)$:

• In general, $H(R) \notin \tilde{\mathcal{B}}$. For example, if $T$ is equipped with a topology and $R$ is continuous on $T \times T$, then $H(R) \subset C(T)$. However, no non-empty subset of $C(T)$ can be an element of $\tilde{\mathcal{B}}$.

• LePage [21, p. 347] has proved that $H(K) \in \tilde{\mathcal{B}}_0$ and $\tilde{\mu}_{X,0}(H(K)) = 0$, a claim which originates with Parzen [27,28]. A version which requires separability and continuity appears in [16].

• If the RKHS $H(R)$ is infinite-dimensional and $R \not\gg K$, then $H(R) \in \tilde{\mathcal{B}}_0$ and $\tilde{\mu}_{X,0}(H(R)) = 0$. This result is contained in the proof of Theorem 7.3 in [24]. See also [13, Proposition 4.5.1].

• LePage [21, Corollary 2] has proved a dichotomy result which states that if $G \subset \mathbb{R}^T$ is an additive group and $G \in \tilde{\mathcal{B}}_0$, then either $\tilde{\mu}_{X,0}(G) = 0$ or $\tilde{\mu}_{X,0}(G) = 1$. Furthermore, $\tilde{\mu}_{X,0}(G) = 1$ implies that $H(K) \subset G$. This is a general version of the zero-one law of Kallianpur [15, Theorem 2].

It appears that not much more can be said without imposing additional structure or constructing versions of $X$, as is done in [24].

Suppose that $(T, d_T)$ is a complete separable metric space, that the kernel $K$ is continuous, and that almost all samples of $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ are continuous, which is to say that the $(\tilde{\mathcal{B}}_0, \tilde{\mu}_{X,0})$-outer measure of $C(T)$ is one. Define the probability space $(C(T), \mathcal{B}, \mu_X)$ as
$$ \mathcal{B} = C(T) \cap \tilde{\mathcal{B}}_0 \quad \text{and} \quad \mu_X(C(T) \cap H) = \tilde{\mu}_{X,0}(H) \quad \text{for } H \in \tilde{\mathcal{B}}_0. \tag{1.3} $$
The rest of this article is concerned with $(C(T), \mathcal{B}, \mu_X)$ and it is to be understood that $\mathbb{P}[X \in H]$ stands for $\mu_X(H)$ for any $H \in \mathcal{B}$. In this setting Driscoll [9, p. 313] has proved that $H(R) \in \mathcal{B}$ if $R$ is continuous and positive-definite. By using Theorem 1.1 in [12] (Theorem 4.1 in [24]) one can generalise this result to a continuous and positive-semidefinite $R$; see the proof of Theorem 7.3 in [24].


1.3. Notation and Terminology

For non-negative real sequences $(a_n)_{n=1}^{\infty}$ and $(b_n)_{n=1}^{\infty}$ we write $a_n \lesssim b_n$ if there is $C > 0$ such that $a_n \leq C b_n$ for all sufficiently large $n$. If both $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold, we write $a_n \asymp b_n$. If $a_n / b_n \to 1$ as $n \to \infty$, we write $a_n \sim b_n$. For two sets $F$ and $G$ we use $F \subsetneq G$ to indicate that $F$ is a proper subset of $G$. A kernel $R\colon T \times T \to \mathbb{R}$ is positive-semidefinite if
$$ \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j R(t_i, t_j) \geq 0 \tag{1.4} $$
for any $N \geq 1$, $a_1, \ldots, a_N \in \mathbb{R}$, and $t_1, \ldots, t_N \in T$. In the remainder of this article positive-semidefinite kernels are simply referred to as kernels. If the inequality in (1.4) is strict for any pairwise distinct $t_1, \ldots, t_N$ and coefficients $a_1, \ldots, a_N$ that are not all zero, the kernel is said to be positive-definite.

1.4. Standing Assumptions

For ease of reference our standing assumptions are collected here. We assume that (i) $(T, d_T)$ is a complete separable metric space; (ii) the covariance kernel $K\colon T \times T \to \mathbb{R}$ is continuous and positive-semidefinite on $T \times T$; (iii) the RKHS $H(K)$ induced by $K$ is infinite-dimensional and separable²; and (iv) $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ is a zero-mean Gaussian process on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ with continuous paths. The law $\mu_X$ of $X$ is defined on the measurable space $(C(T), \mathcal{B})$ which was constructed in Section 1.2. Some of our results have natural generalisations for general second-order stochastic processes; see [24], in particular Sections 2 and 5, and [36]. We do not pursue these generalisations.

²Most famously, separable RKHSs are induced by Mercer kernels, which are continuous kernels defined on compact subsets of $\mathbb{R}^d$ [29, Section 11.3].

2. Related Work

Reproducing kernel Hilbert spaces which contain the samples of $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ have been constructed by means of integrated kernels in [23], convolution kernels in [5] and [11, Section 3.1], and, most importantly, powers of RKHSs [39] in [17, Section 4] and [36]. Namely, let $T$ be a compact metric space, $K$ a continuous kernel on $T \times T$, and $\nu$ a finite and strictly positive Borel measure on $T$. Then the integral operator $T_\nu$, defined for $f \in L^2(\nu)$ via
$$ (T_\nu f)(t) = \int_T K(t, t') f(t') \, \mathrm{d}\nu(t'), \tag{2.1} $$
has decreasing and positive eigenvalues $(\lambda_n)_{n=1}^{\infty}$, which vanish as $n \to \infty$, and eigenfunctions $(\psi_n)_{n=1}^{\infty}$ in $H(K)$ such that $(\sqrt{\lambda_n}\psi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(K)$. The kernel has the uniformly convergent Mercer expansion $K(t, t') = \sum_{n=1}^{\infty} \lambda_n \psi_n(t) \psi_n(t')$ for all $t, t' \in T$. For $\theta > 0$, the $\theta$th power of $K$ is defined as
$$ K^{(\theta)}(t, t') = \sum_{n=1}^{\infty} \lambda_n^{\theta} \psi_n(t) \psi_n(t'). \tag{2.2} $$


The series (2.2) converges if $\sum_{n=1}^{\infty} \lambda_n^{\theta} \psi_n(t)^2 < \infty$ for all $t \in T$. Furthermore, $H(K^{(\theta_2)}) \subsetneq H(K^{(\theta_1)})$ if $\theta_1 < \theta_2$ and $\mathbb{P}[X \in H(K^{(\theta)})] = 1$ if and only if $\sum_{n=1}^{\infty} \lambda_n^{1-\theta} < \infty$ [36, Theorem 5.2]. When it comes to sample properties, the power kernel construction has two significant downsides: (i) The measure $\nu$ is a nuisance parameter. If one is only interested in sample path properties of Gaussian processes, this measure should not have an intrinsic part to play in the analysis and results. (ii) The construction is somewhat inflexible and unsuitable for infinitely smooth kernels. Because $H(K^{(\theta)})$ consists precisely of the functions $f = \sum_{n=1}^{\infty} f_n \lambda_n^{1/2} \psi_n$ such that $\sum_{n=1}^{\infty} f_n^2 \lambda_n^{1-\theta} < \infty$ and $\lambda_n \to 0$ as $n \to \infty$, how much larger $H(K^{(\theta)})$ is than $H(K) = H(K^{(1)})$ is determined by the rate of decay of the eigenvalues. Power RKHSs are more descriptive and fine-grained when the kernel is finitely smooth and its eigenvalues have polynomial decay $n^{-a}$ for $a > 0$ (e.g., Matérn kernels) than when the kernel is infinitely smooth with at least exponential eigenvalue decay $e^{-bn}$ for $b > 0$ (e.g., Gaussian): the change from the decay condition $\sum_{n=1}^{\infty} f_n^2 < \infty$ for the coefficients $(f_n)_{n=1}^{\infty}$ to $\sum_{n=1}^{\infty} f_n^2 n^{-a(1-\theta)} < \infty$ is arguably less substantial than that from $\sum_{n=1}^{\infty} f_n^2 < \infty$ to $\sum_{n=1}^{\infty} f_n^2 e^{-b(1-\theta)n} < \infty$. Indeed, as pointed out by Kanagawa et al. [17, Section 4.4], when the kernel is Gaussian every $H(K^{(\theta)})$ with $\theta < 1$ contains the samples with probability one, which renders powers of RKHSs of dubious utility in that setting because $H(K^{(1)}) = H(K)$ does not contain the samples. The relationship between powers of RKHSs and scaled RKHSs is discussed in more detail at the end of Section 3. In Section 5 we demonstrate that scaled RKHSs are more useful than powers of RKHSs in describing sample path properties of Gaussian processes defined by infinitely smooth kernels.
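Downside (ii) can be seen in a small numerical experiment of ours (with made-up decay rates, not eigenvalues of any particular kernel): partial sums of Steinwart's membership criterion $\sum_n \lambda_n^{1-\theta}$ either stabilise or keep growing depending on the decay regime.

```python
import numpy as np

# Partial sums of sum_n lambda_n^(1 - theta) at increasing truncation levels:
# stabilising values indicate convergence, steadily growing values divergence.
# Decay rates and truncation levels are illustrative assumptions.
def partial_sums(lam_fn, theta, truncations=(10_000, 100_000, 1_000_000)):
    return [float(np.sum(lam_fn(np.arange(1.0, N + 1)) ** (1.0 - theta)))
            for N in truncations]

poly = lambda n: n ** -2.0          # polynomial decay, as for Matern-type kernels
expo = lambda n: np.exp(-0.5 * n)   # exponential decay, as for the Gaussian kernel

for theta in (0.25, 0.75):
    print("polynomial, theta =", theta, partial_sums(poly, theta))
    print("exponential, theta =", theta, partial_sums(expo, theta))
# Polynomial decay: the sums stabilise for theta = 0.25 but keep growing for
# theta = 0.75, so the power theta genuinely distinguishes spaces that do and do
# not contain the samples. Exponential decay: the sums stabilise for every theta < 1,
# so in this regime all powers H(K^(theta)) with theta < 1 contain the samples and
# the construction cannot resolve how much regularity the samples actually have.
```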

To the best of our knowledge, the question about a "minimal" set which contains the samples with probability one has received only cursory discussion in the literature. Perhaps the most relevant digression on the topic is an observation by Steinwart [36, pp. 369–370], given here in a somewhat applied form and without some technicalities, that the samples are contained in the set
$$ \bigg( \bigcap_{r < s} W_2^r(T) \bigg) \setminus W_2^s(T) \tag{2.3} $$
with probability one if $H(K)$ is norm-equivalent to the fractional Sobolev space $W_2^{s+d/2}(T)$ for $s > 0$ on a suitable domain $T \subset \mathbb{R}^d$. In the Sobolev case the samples are therefore "$d/2$ less smooth" than functions in the RKHS of $K$. Because $W_2^s(T) = \bigcup_{r \geq s} W_2^r(T)$, the set in (2.3) has the same form as the sample support set in (4.2). This observation is, of course, a general version of the familiar result that the sample paths of the Brownian motion, whose covariance kernel $K(t, t') = \min\{t, t'\}$ on $T = [0, 1]$ induces the Sobolev space $W_2^1([0, 1])$ with zero boundary condition at the origin, have regularity $1/2$ in the sense that they are almost surely $\alpha$-Hölder continuous if and only if $\alpha < 1/2$. That is, there is $C > 0$ such that, for almost every $\omega \in \Omega$, $|X(t, \omega) - X(t', \omega)| \leq C|t - t'|^{\alpha}$ for all $t, t' \in [0, 1]$ and any $\alpha < 1/2$. However, Lévy's modulus of continuity theorem [26, Section 1.2] improves this to $|X(t, \omega) - X(t', \omega)| \leq C\sqrt{h \log(1/h)}$ when $h = |t - t'|$ is sufficiently small. Since the Sobolev space $W_2^s(T)$ consists of those functions $f\colon T \to \mathbb{R}$ which admit an $L^2(\mathbb{R}^d)$-extension $f_e\colon \mathbb{R}^d \to \mathbb{R}$ whose Fourier transform satisfies
$$ \int_{\mathbb{R}^d} \big(1 + \|\xi\|^2\big)^s |\widehat{f_e}(\xi)|^2 \, \mathrm{d}\xi < \infty, \tag{2.4} $$
Lévy's modulus of continuity theorem suggests replacing the weight in (2.4) with, for example, $(1 + \|\xi\|^2)^s \log(1 + \|\xi\|)^{-1}$ so that the resulting function space is a proper superset of $W_2^s(T)$ and a proper subset of $W_2^r(T)$ for every $r < s$ and hence a proper subset of the set in (2.3). Some results and discussion in [18, Section 4.2] and [34] have this flavour.


Finally, we remark that classical results about the topological support of a Gaussian measure are distinct from the results in this article. Let $C(T)$ be equipped with the standard supremum norm $\|\cdot\|_\infty$. The topological support, $\mathrm{supp}_{C(T)}(\mu_X)$, of the measure $\mu_X$ is the set
$$ \mathrm{supp}_{C(T)}(\mu_X) = \{ f \in C(T) : \mu_X(B(f, r)) > 0 \text{ for all } r > 0 \}, $$
where $B(f, r)$ is the $f$-centered $r$-ball in $(C(T), \|\cdot\|_\infty)$. It is a classical result [16, Theorem 3] that $\mathrm{supp}_{C(T)}(\mu_X) = \overline{H(K)}$, where $\overline{H(K)}$ is the closure of $H(K)$ in $(C(T), \|\cdot\|_\infty)$. In other words, the topological support of $\mu_X$ contains every continuous function $f$ such that for every $\varepsilon > 0$ there exists $g_\varepsilon \in H(K)$ satisfying $\|f - g_\varepsilon\|_\infty < \varepsilon$. Now, recall that a kernel $R$ is universal if $H(R)$ is dense in $C(T)$ [37, Section 4.6]. Most kernels of interest to practitioners are universal, including Gaussians, Matérns, and power series kernels. But, by definition, the closure of the RKHS of a universal kernel equals $C(T)$. Therefore $\mathrm{supp}_{C(T)}(\mu_X) = \overline{H(K)} = C(T)$ if $K$ is a universal kernel. This result does not provide any information about the samples because we have assumed that the samples are continuous to begin with. See [2, Section 3.6] for further results on general topological supports of Gaussian measures.
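The density behind universality can be illustrated numerically; the sketch below (ours, with an illustrative target function, length-scale, grid sizes, and a small jitter for numerical stability) interpolates a continuous but non-smooth function with the Gaussian kernel on increasingly fine grids and reports the sup error on a hold-out grid.

```python
import numpy as np

def gauss_kernel(s, t, ell=0.05):
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * ell ** 2))

f = lambda x: np.abs(x - 0.5)          # continuous target with a kink at 1/2
t_test = np.linspace(0.0, 1.0, 1001)

for m in (5, 10, 20, 40):
    t_train = np.linspace(0.0, 1.0, m)
    G = gauss_kernel(t_train, t_train) + 1e-10 * np.eye(m)   # small jitter for stability
    w = np.linalg.solve(G, f(t_train))
    interpolant = gauss_kernel(t_test, t_train) @ w          # a member of H(K)
    print(m, float(np.max(np.abs(interpolant - f(t_test)))))
# The sup error shrinks as the grid is refined: elements of the Gaussian RKHS
# approximate f uniformly even though f itself is not in the RKHS, which is the
# density property behind universality (and why the topological support is all of C(T)).
```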

3. Scaled Reproducing Kernel Hilbert Spaces

For any orthonormal basis $\Phi = (\phi_n)_{n=1}^{\infty}$ of $H(K)$ the kernel has the pointwise convergent expansion $K(t, t') = \sum_{n=1}^{\infty} \phi_n(t)\phi_n(t')$ for all $t, t' \in T$. By the standard characterisation of a separable Hilbert space, the RKHS consists of precisely those functions $f\colon T \to \mathbb{R}$ that admit an expansion $f = \sum_{n=1}^{\infty} f_n \phi_n$ for coefficients such that $\sum_{n=1}^{\infty} f_n^2 < \infty$. The Cauchy–Schwarz inequality ensures that this expansion converges pointwise on $T$. For given functions $f = \sum_{n=1}^{\infty} f_n \phi_n$ and $g = \sum_{n=1}^{\infty} g_n \phi_n$ in the RKHS the inner product is $\langle f, g \rangle_K = \sum_{n=1}^{\infty} f_n g_n$.

Definition 3.1 (Scaled kernel and RKHS). We say that a positive sequence $A = (\alpha_n)_{n=1}^{\infty}$ is a $\Phi$-scaling of $H(K)$ if $\sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 < \infty$ for every $t \in T$. The kernel
$$ K_{A,\Phi}(t, t') = \sum_{n=1}^{\infty} \alpha_n \phi_n(t) \phi_n(t') \tag{3.1} $$
is called a scaled kernel and its RKHS $H(K_{A,\Phi})$ a scaled RKHS.

See [30,42,43] for prior appearances of scaled kernels under different names and not in the context of Gaussian processes. Although many of the results in this section have appeared in some form in the literature, all proofs are included here for completeness.

Proposition 3.2. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ a $\Phi$-scaling of $H(K)$. Then (i) the scaled kernel $K_{A,\Phi}$ is positive-semidefinite, (ii) the collection $(\sqrt{\alpha_n}\phi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(K_{A,\Phi})$, and (iii) the scaled RKHS is
$$ H(K_{A,\Phi}) = \bigg\{ f = \sum_{n=1}^{\infty} f_n \phi_n \;:\; \|f\|_{K_{A,\Phi}}^2 = \sum_{n=1}^{\infty} \frac{f_n^2}{\alpha_n} < \infty \bigg\}, \tag{3.2} $$
where convergence is pointwise, and for any $f = \sum_{n=1}^{\infty} f_n \phi_n$ and $g = \sum_{n=1}^{\infty} g_n \phi_n$ in $H(K_{A,\Phi})$ the inner product is
$$ \langle f, g \rangle_{K_{A,\Phi}} = \sum_{n=1}^{\infty} \frac{f_n g_n}{\alpha_n}. \tag{3.3} $$

Proof. By the Cauchy–Schwarz inequality and $\sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 < \infty$ for every $t \in T$,
$$ \sum_{n=1}^{\infty} |\alpha_n \phi_n(t) \phi_n(t')| \leq \bigg( \sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 \bigg)^{1/2} \bigg( \sum_{n=1}^{\infty} \alpha_n \phi_n(t')^2 \bigg)^{1/2} < \infty $$
for any $t, t' \in T$. This proves that the scaled kernel in (3.1) is well-defined via an absolutely convergent series. To verify that $K_{A,\Phi}$ is positive-semidefinite, note that, for any $N \geq 1$, $a_1, \ldots, a_N \in \mathbb{R}$, and $t_1, \ldots, t_N \in T$,
$$ \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j K_{A,\Phi}(t_i, t_j) = \sum_{n=1}^{\infty} \alpha_n \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \phi_n(t_i) \phi_n(t_j) = \sum_{n=1}^{\infty} \alpha_n \bigg( \sum_{i=1}^{N} a_i \phi_n(t_i) \bigg)^2 $$
is non-negative because each $\alpha_n$ is positive. Because
$$ \sum_{n=1}^{\infty} |f_n \sqrt{\alpha_n}\, \phi_n(t)| \leq \bigg( \sum_{n=1}^{\infty} f_n^2 \bigg)^{1/2} \bigg( \sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 \bigg)^{1/2} < \infty \quad \text{for every } t \in T $$
if $\sum_{n=1}^{\infty} f_n^2 < \infty$, the space defined in (3.2) and (3.3) is a Hilbert space of functions with an orthonormal basis $(\sqrt{\alpha_n}\phi_n)_{n=1}^{\infty}$. Since $K_{A,\Phi}(t, t') = \sum_{n=1}^{\infty} \alpha_n \phi_n(t) \phi_n(t')$, the scaled kernel is the unique reproducing kernel of this space [e.g., 25, Theorem 9].

A scaled RKHS depends on the ordering of the orthonormal basis of $H(K)$ used to construct it. For example, let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and suppose that $\alpha_n = n$ defines a $\Phi$-scaling of $H(K)$. Define another ordered orthonormal basis $\Psi = (\psi_n)_{n=1}^{\infty}$ by setting $\psi_{2n+1} = \phi_{2^n}$ for $n \geq 0$ and interleaving the remaining $\phi_n$ to produce $\Psi = (\phi_1, \phi_3, \phi_2, \phi_5, \phi_4, \phi_6, \phi_8, \phi_7, \phi_{16}, \ldots)$. The function $f = \sum_{n=0}^{\infty} \phi_{2^n} =: \sum_{n=1}^{\infty} f_{\Phi,n} \phi_n$ is in $H(K_{A,\Phi})$ because
$$ \|f\|_{K_{A,\Phi}}^2 = \sum_{n=1}^{\infty} \frac{f_{\Phi,n}^2}{\alpha_n} = \sum_{n=0}^{\infty} \frac{1}{2^n} < \infty $$
but not in $H(K_{A,\Psi})$ because $f = \sum_{n=0}^{\infty} \phi_{2^n} = \sum_{n=0}^{\infty} \psi_{2n+1} =: \sum_{n=1}^{\infty} f_{\Psi,n} \psi_n$ and therefore
$$ \|f\|_{K_{A,\Psi}}^2 = \sum_{n=1}^{\infty} \frac{f_{\Psi,n}^2}{\alpha_n} = \sum_{n=0}^{\infty} \frac{1}{2n+1} = \infty. $$
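A two-line numerical check of this example (ours, with illustrative truncation levels): the partial sums of the two squared norms behave exactly as claimed.

```python
import numpy as np

# Partial sums of the two squared norms in the reordering example with alpha_n = n:
# under Phi the unit coefficients of f sit at indices 2^k, under Psi at indices 2k + 1.
for N in (10, 1_000, 100_000):
    k = np.arange(N)
    print(N, float(np.sum(0.5 ** k)), float(np.sum(1.0 / (2.0 * k + 1.0))))
# The first partial sums stabilise at 2 while the second grow without bound, so the
# same function is in H(K_{A,Phi}) but not in H(K_{A,Psi}).
```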

In practice, the orthonormal basis usually has a natural ordering. For instance, the decreasing eigenvalues specify an ordering for a basis obtained from Mercer's theorem (see Section 2) or the basis may have a polynomial factor, the degree of which specifies an ordering (see the kernels in Sections 5.2 and 5.3).

The following results compare sizes of scaled RKHSs: the faster $\alpha_n$ grows, the larger the RKHS $H(K_{A,\Phi})$ is. A number of additional properties between scaled RKHSs can be proved in a similar manner but are not needed in the developments of this article. Some of the results below, or their variants, can be found in the literature. In particular, see [42, Section 6] and [43, Section 4] for a version of Proposition 3.5 and some additional results.

Proposition 3.3. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ and $B = (\beta_n)_{n=1}^{\infty}$ two $\Phi$-scalings of $H(K)$. Then $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$ if and only if $\beta_n \lesssim \alpha_n$. In particular, $H(K) \subset H(K_{A,\Phi})$ if and only if $\inf_{n \geq 1} \alpha_n > 0$.

Proof. If $\beta_n \lesssim \alpha_n$, then for any $f = \sum_{n=1}^{\infty} f_n \phi_n \in H(K_{B,\Phi})$ we have
$$ \|f\|_{K_{A,\Phi}}^2 = \sum_{n=1}^{\infty} \frac{f_n^2}{\alpha_n} = \sum_{n=1}^{\infty} \frac{\beta_n}{\alpha_n} \frac{f_n^2}{\beta_n} \leq \|f\|_{K_{B,\Phi}}^2 \sup_{n \geq 1} \frac{\beta_n}{\alpha_n} < \infty. \tag{3.4} $$
Consequently, $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$. Suppose then that $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$ and assume to the contrary that $\sup_{n \geq 1} \alpha_n^{-1} \beta_n = \infty$, so that there is a subsequence $(n_m)_{m=1}^{\infty}$ such that $\alpha_{n_m}^{-1} \beta_{n_m} \geq 2^m$. Then $f = \sum_{m=1}^{\infty} 2^{-m/2} \sqrt{\beta_{n_m}}\, \phi_{n_m} \in H(K_{B,\Phi}) \setminus H(K_{A,\Phi})$ since
$$ \|f\|_{K_{B,\Phi}}^2 = \sum_{m=1}^{\infty} \bigg( \frac{\sqrt{\beta_{n_m}}}{2^{m/2}} \bigg)^2 \frac{1}{\beta_{n_m}} = \sum_{m=1}^{\infty} 2^{-m} = 1 \quad \text{but} \quad \|f\|_{K_{A,\Phi}}^2 = \sum_{m=1}^{\infty} \frac{2^{-m} \beta_{n_m}}{\alpha_{n_m}} \geq \sum_{m=1}^{\infty} 1 = \infty, $$
which contradicts the assumption that $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$. Thus $\sup_{n \geq 1} \alpha_n^{-1} \beta_n < \infty$. The second statement follows by setting $\beta_n = 1$ for every $n \in \mathbb{N}$ and noting that then $H(K_{B,\Phi}) = H(K)$.

Two normed spaces $F$ and $G$ are said to be norm-equivalent if they are equal as sets and if there exist positive constants $C_1$ and $C_2$ such that $C_1\|f\|_F \leq \|f\|_G \leq C_2\|f\|_F$ for all $f \in F$. From (3.4) it follows that $H(K_{A,\Phi})$ and $H(K_{B,\Phi})$ are norm-equivalent if and only if $\alpha_n \asymp \beta_n$.

Corollary 3.4. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ and $B = (\beta_n)_{n=1}^{\infty}$ two $\Phi$-scalings of $H(K)$. Then $H(K_{A,\Phi})$ and $H(K_{B,\Phi})$ are norm-equivalent if and only if $\alpha_n \asymp \beta_n$.

Proposition 3.5. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ and $B = (\beta_n)_{n=1}^{\infty}$ two $\Phi$-scalings of $H(K)$. Then $H(K_{B,\Phi}) \subsetneq H(K_{A,\Phi})$ if and only if $\sup_{n \geq 1} \alpha_n \beta_n^{-1} = \infty$ and $\beta_n \lesssim \alpha_n$.

Proof. Assume first that $\sup_{n \geq 1} \alpha_n \beta_n^{-1} = \infty$ and $\beta_n \lesssim \alpha_n$. Since $\beta_n \lesssim \alpha_n$, Proposition 3.3 yields $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$. Thus $H(K_{B,\Phi})$ is a proper subset of $H(K_{A,\Phi})$ if $H(K_{A,\Phi})$ is not a subset of $H(K_{B,\Phi})$. But, again by Proposition 3.3, $H(K_{A,\Phi}) \subset H(K_{B,\Phi})$ if and only if $\alpha_n \lesssim \beta_n$, which contradicts the assumption that $\sup_{n \geq 1} \alpha_n \beta_n^{-1} = \infty$. Hence $H(K_{B,\Phi})$ is a proper subset of $H(K_{A,\Phi})$. Assume then that $H(K_{B,\Phi}) \subsetneq H(K_{A,\Phi})$. Then $\beta_n \lesssim \alpha_n$ by Proposition 3.3. If $\sup_{n \geq 1} \alpha_n \beta_n^{-1} = \infty$ did not hold, there would exist $C > 0$ such that $\alpha_n \leq C\beta_n$ for all $n \in \mathbb{N}$, which is to say $\alpha_n \lesssim \beta_n$. But by Proposition 3.3 this would imply that $H(K_{A,\Phi}) \subset H(K_{B,\Phi})$, which would contradict the assumption that $H(K_{B,\Phi})$ is a proper subset of $H(K_{A,\Phi})$. This completes the proof.

Remark 3.6. Let $H(R)$ be another separable RKHS of functions on $T$. The RKHSs $H(K)$ and $H(R)$ are simultaneously diagonalisable if there exists an orthonormal basis $(\phi_n)_{n=1}^{\infty}$ of $H(K)$ which is an orthogonal basis of $H(R)$. That is, $(\|\phi_n\|_R^{-1}\phi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(R)$ and consequently $H(R) = H(K_{A,\Phi})$ for the scaling with $\alpha_n = \|\phi_n\|_R^{-2}$.

We conclude this section by demonstrating that scaled RKHSs generalise powers of RKHSs. We say that a $\Phi$-scaling $A_\rho = (\alpha_n)_{n=1}^{\infty}$ of $H(K)$ is $\rho$-hyperharmonic if $\alpha_n = n^\rho$ for some $\rho \geq 0$. The RKHS $H(K_{A_\rho,\Phi})$ is a $\rho$-hyperharmonic scaled RKHS. Recall from Section 2 that if $T$ is a compact metric space, $K$ is continuous on $T \times T$, and $\nu$ is a finite and strictly positive Borel measure on $T$, then the integral operator in (2.1) has eigenfunctions $(\psi_n)_{n=1}^{\infty}$ and decreasing positive eigenvalues $(\lambda_n)_{n=1}^{\infty}$, and $\Psi = (\sqrt{\lambda_n}\psi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(K)$. For $\theta > 0$ the kernel $K^{(\theta)}(t, t') = \sum_{n=1}^{\infty} \lambda_n^\theta \psi_n(t)\psi_n(t')$ is the $\theta$th power of $K$ and its RKHS $H(K^{(\theta)})$ the $\theta$th power of $H(K)$. These objects are well-defined if $\sum_{n=1}^{\infty} \lambda_n^\theta \psi_n(t)^2 < \infty$ for all $t \in T$. We immediately recognise that $K^{(\theta)}$ equals the scaled kernel $K_{A,\Psi}$ for the scaling $A = (\lambda_n^{\theta-1})_{n=1}^{\infty}$ because
$$ K_{A,\Psi}(t, t') = \sum_{n=1}^{\infty} \lambda_n^{\theta-1} \lambda_n \psi_n(t) \psi_n(t') = \sum_{n=1}^{\infty} \lambda_n^{\theta} \psi_n(t) \psi_n(t') = K^{(\theta)}(t, t'). $$
Polynomially decaying eigenvalues ($\lambda_n \asymp n^{-p}$ for some $p > 0$) are an important special case. For example, this holds with $p = 2s/d$ if $T \subset \mathbb{R}^d$ and $H(K)$ is norm-equivalent to the Sobolev space $W_2^s(T)$ for $s > d/2$ [36, p. 370]. If $\lambda_n \asymp n^{-p}$ and $\rho = p(1 - \theta)$, the $\rho$-hyperharmonic scaled RKHS is norm-equivalent to the power RKHS $H(K^{(\theta)})$ by Corollary 3.4, since $n^\rho \lambda_n \asymp n^{-\theta p} \asymp \lambda_n^\theta$.

4. Sample Path Properties

This section contains the main results of the article. First, we consider a specialisation to scaled RKHSs of a theorem originally proved by Driscoll [9] and later generalised by Lukić and Beder [24]. Then we define general sample support sets and characterise them for $\sigma$-algebras generated by scalings of $H(K)$.

4.1. Domination and Generalised Driscoll’s Theorem for Scaled RKHSs

A kernel $R$ on $T$ dominates $K$ if $H(K) \subset H(R)$. In this case there exists [24, Theorem 1.1] a unique linear operator $L\colon H(R) \to H(K)$, called the dominance operator, whose range is contained in $H(K)$ and which satisfies $\langle f, g \rangle_R = \langle Lf, g \rangle_K$ for all $f \in H(R)$ and $g \in H(K)$. The dominance is said to be nuclear, denoted $R \gg K$, if $H(R)$ is separable and the operator $L$ is nuclear, which is to say that
$$ \operatorname{tr}(L) = \sum_{n=1}^{\infty} \langle L\psi_n, \psi_n \rangle_R < \infty \tag{4.1} $$
for any orthonormal basis $(\psi_n)_{n=1}^{\infty}$ of $H(R)$.³

³A change of basis shows that $\operatorname{tr}(L)$ does not depend on the orthonormal basis.

Define the pseudometric $d_R(t, t') = \|R(\cdot, t) - R(\cdot, t')\|_R = \sqrt{R(t, t) - 2R(t, t') + R(t', t')}$ on $T$. If $R$ is positive-definite, $d_R$ is a metric. However, positive-definiteness is not necessary for $d_R$ to be a metric. For example, the Brownian motion kernel $R(t, t') = \min\{t, t'\}$ on $T = [0, 1]$ is only positive-semidefinite because $R(t, 0) = 0$ for every $t \in T$ but nevertheless yields a metric because $d_R(t, t') = \sqrt{t - 2\min\{t, t'\} + t'}$ vanishes if and only if $t = t'$. See [24, Section 4] for more properties of $d_R$. Injectivity of the mapping $t \mapsto R(\cdot, t)$ is equivalent to $d_R$ being a metric.

By the following theorem, a special case of the zero-one law of Kallianpur [15,21] and a generalisation by Lukić and Beder [24, Theorem 7.5] of an earlier result by Driscoll [9, Theorem 3], the nuclear dominance condition determines whether or not the samples of a Gaussian process $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ lie in $H(R)$. In particular, the probability of them being in $H(R)$ is always either one or zero.

Theorem 4.1 (Generalised Driscoll's theorem). Let $R$ be a continuous kernel on $T \times T$ with separable RKHS and $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$. If $d_R$ is a metric, then either
$$ \mathbb{P}[X \in H(R)] = 0 \ \text{ and } \ R \not\gg K \quad \text{or} \quad \mathbb{P}[X \in H(R)] = 1 \ \text{ and } \ R \gg K. $$

Proof. Theorem 7.5 in [24] is otherwise identical except that $R$ is not assumed $d_T$-continuous and the samples of $X$ are assumed $d_R$-continuous. However, when $R$ is $d_T$-continuous, $d_T$-continuity of the samples, which is one of our standing assumptions, implies their $d_R$-continuity.

Summability of the reciprocal scaling coefficients controls whether or not a scaled RKHS contains the sample paths.

Lemma 4.2. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ a $\Phi$-scaling of $H(K)$. Let $R = K_{A,\Phi}$. If $d_K$ is a metric, then so is $d_R$.

Proof. Because $d_K$ is a metric, $d_K(t, t')^2 = K(t, t) - 2K(t, t') + K(t', t') = \sum_{n=1}^{\infty} [\phi_n(t) - \phi_n(t')]^2$ vanishes if and only if $t = t'$. Since $d_R(t, t')^2 = \sum_{n=1}^{\infty} \alpha_n [\phi_n(t) - \phi_n(t')]^2$ and the $\alpha_n$ are positive, we conclude that $d_R(t, t') = 0$ if and only if $t = t'$.

Theorem 4.3. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$, $A = (\alpha_n)_{n=1}^{\infty}$ a $\Phi$-scaling of $H(K)$, and $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$. If $K_{A,\Phi}$ is continuous and $d_K$ is a metric, then either
$$ \mathbb{P}[X \in H(K_{A,\Phi})] = 0 \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} = \infty \quad \text{or} \quad \mathbb{P}[X \in H(K_{A,\Phi})] = 1 \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} < \infty. $$

Proof. Assume first that the scaling is such that $H(K) \subset H(K_{A,\Phi})$. It is easy to verify using Proposition 3.2 that the dominance operator $L\colon H(K_{A,\Phi}) \to H(K)$ is given by $Lf = \sum_{n=1}^{\infty} f_n \alpha_n^{-1} \phi_n$ for any $f = \sum_{n=1}^{\infty} f_n \phi_n \in H(K_{A,\Phi})$. Because $(\sqrt{\alpha_n}\phi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(K_{A,\Phi})$ and $L(\sqrt{\alpha_n}\phi_n) = \alpha_n^{-1/2}\phi_n$, the nuclear dominance condition (4.1) is
$$ \operatorname{tr}(L) = \sum_{n=1}^{\infty} \big\langle \alpha_n^{-1/2}\phi_n, \alpha_n^{1/2}\phi_n \big\rangle_{K_{A,\Phi}} = \sum_{n=1}^{\infty} \langle \phi_n, \phi_n \rangle_{K_{A,\Phi}} = \sum_{n=1}^{\infty} \frac{1}{\alpha_n}, $$
and the claim follows from Theorem 4.1 since Lemma 4.2 guarantees that $d_R$ for $R = K_{A,\Phi}$ is a metric. Assume then that $H(K) \not\subset H(K_{A,\Phi})$. It is trivial that $K_{A,\Phi} \not\gg K$. Thus $\mathbb{P}[X \in H(K_{A,\Phi})] = 0$. If we had $\sum_{n=1}^{\infty} \alpha_n^{-1} < \infty$, then it would necessarily hold that $\sup_{n \geq 1} \alpha_n^{-1} < \infty$ and consequently $H(K) \subset H(K_{A,\Phi})$ by Proposition 3.3, which is a contradiction. Therefore $\sum_{n=1}^{\infty} \alpha_n^{-1} = \infty$.
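The trace computation above can be illustrated with a quick Monte Carlo experiment of ours (with illustrative hyperharmonic scalings): under the Karhunen–Loève heuristic of Section 1.1, the truncated squared $K_{A,\Phi}$-norm of a sample is $\sum_{n \leq N} \zeta_n^2 / \alpha_n$, whose expectation is exactly the partial sum of $\sum_n \alpha_n^{-1} = \operatorname{tr}(L)$.

```python
import numpy as np

rng = np.random.default_rng(7)

def mean_truncated_norm(alpha, n_samples=20):
    """Average truncated squared K_{A,Phi}-norm sum_n zeta_n^2 / alpha_n of
    Karhunen-Loeve samples; its expectation is the partial sum of sum_n 1 / alpha_n."""
    total = 0.0
    for _ in range(n_samples):
        zeta = rng.standard_normal(alpha.size)
        total += float(np.sum(zeta ** 2 / alpha))
    return total / n_samples

for N in (100, 10_000, 1_000_000):
    n = np.arange(1.0, N + 1)
    print(N, mean_truncated_norm(n), mean_truncated_norm(n ** 2))
# With alpha_n = n (sum 1/alpha_n divergent) the averages grow like log N, so the
# samples almost surely fall outside H(K_{A,Phi}); with alpha_n = n^2 (convergent)
# they stabilise near pi^2 / 6, matching the dichotomy of Theorem 4.3.
```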

4.2. Sample Support Sets

Theorems 4.1 and 4.3 motivate us to define the sample support set of a Gaussian process with respect to a collection of kernels as the largest set on the "boundary" between their induced RKHSs of probabilities one and zero. Let $\mathcal{R}$ be a collection of continuous kernels $R$ on $T$ for which $d_R$ is a metric and $H(\mathcal{R})$ the corresponding set of RKHSs. Every element of $H(\mathcal{R})$ is a subset of $C(T)$. By the generalised Driscoll's theorem each element of $H(\mathcal{R})$ has $\mu_X$-measure one or zero, depending on the nuclear dominance condition. Define the disjoint sets
$$ \mathcal{R}_1(K) = \{R \in \mathcal{R} : R \gg K\} \quad \text{and} \quad \mathcal{R}_0(K) = \{R \in \mathcal{R} : R \not\gg K\} $$
which partition $\mathcal{R}$. We assume that both $\mathcal{R}_1(K)$ and $\mathcal{R}_0(K)$ are non-empty and introduce the notion of a sample support set.

Definition 4.4 (Sample support set). Let $\mathcal{S}(\mathcal{R}) = \sigma(H(\mathcal{R}))$ be the $\sigma$-algebra generated by $H(\mathcal{R})$. The sample support set, $S_{\mathcal{R}}(K)$, of the Gaussian process $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ with respect to $\mathcal{R}$ is the largest subset of $C(T)$ such that $S_{\mathcal{R}}(K) \subset H$ for every $H \in \mathcal{S}(\mathcal{R})$ such that $\mu_X(H) = 1$.

Proposition 4.5. It holds that
$$ S_{\mathcal{R}}(K) = \bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) \setminus \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0). \tag{4.2} $$

Proof. Suppose that there is $f \in S_{\mathcal{R}}(K)$ which is not contained in the set on the right-hand side of (4.2). That is, we have either $f \notin \bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1)$ or $f \in \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0)$. In the former case there is $R_1 \in \mathcal{R}_1(K)$ such that $f \notin H(R_1)$. But because $\mu_X(H(R_1)) = 1$ and $S_{\mathcal{R}}(K) \subset H(R_1)$ by definition, this violates the assumption that $f \in S_{\mathcal{R}}(K)$. In the latter case there is $R_0 \in \mathcal{R}_0(K)$ such that $f \in H(R_0)$. As $\mu_X(H(R_0)) = 0$, we have for any $R_1 \in \mathcal{R}_1(K)$ that $\mu_X(H(R_1) \setminus H(R_0)) = 1$. But since $f \notin H(R_1) \setminus H(R_0)$, the assumption that $f \in S_{\mathcal{R}}(K)$ is again violated and we conclude that $S_{\mathcal{R}}(K) \subset \bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) \setminus \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0)$.

Since all elements of $H(\mathcal{R})$ are either of measure zero or one, so are those of $\mathcal{S}(\mathcal{R})$. It is therefore clear that $\bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) \setminus \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0)$ is contained in every $H \in \mathcal{S}(\mathcal{R})$ such that $\mu_X(H) = 1$. Consequently, $\bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) \setminus \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0) \subset S_{\mathcal{R}}(K)$. This concludes the proof.

The sample support set is the largest set which is contained in every set of probability one under the law of $X$ that can be expressed in terms of countably many elementary set operations of the RKHSs $H(R)$ for $R \in \mathcal{R}$. The larger $\mathcal{R}$ is, the more precisely $S_{\mathcal{R}}(K)$ describes the samples of $X$. But there is an important caveat. If $\mathcal{R}$ is countable, the sample support set is in the $\sigma$-algebra $\mathcal{B}$, defined in (1.3), and has $\mu_X$-measure one. However, when $\mathcal{R}$ is uncountable and does not contain countable subsets $\mathcal{R}_1'(K) \subset \mathcal{R}_1(K)$ and $\mathcal{R}_0'(K) \subset \mathcal{R}_0(K)$ such that
$$ \bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) = \bigcap_{R_1 \in \mathcal{R}_1'(K)} H(R_1) \quad \text{and} \quad \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0) = \bigcup_{R_0 \in \mathcal{R}_0'(K)} H(R_0), $$
it cannot be easily determined whether $S_{\mathcal{R}}(K)$ is an element of $\mathcal{B}$.

We are mainly interested in the sample support set with respect to the collection $\mathcal{R}$ consisting of all scaled kernels (and will in Theorem 4.8 characterise this set). It is nevertheless conceivable that one may want to, or be forced to, work with a less rich set of kernels, scaled or not, and with such an eventuality in mind we have introduced the more general concept of a sample support set. If $\mathcal{R}$ is a collection of scaled kernels, the sample support set takes a substantially more concrete form. For this purpose we introduce the concept of an approximately constant sequence, which is inspired by the results collected in [19, § 41].

Definition 4.6 (Approximately constant sequence). Let $\Sigma$ be a collection of non-negative sequences. A non-negative sequence $(a_n)_{n=1}^{\infty}$ is said to be $\Sigma$-approximately constant if for every $(b_n)_{n=1}^{\infty} \in \Sigma$ the series $\sum_{n=1}^{\infty} b_n$ and $\sum_{n=1}^{\infty} a_n b_n$ either both converge or both diverge.

We mention two properties of approximately constant sequences: (i) If $(a_n)_{n=1}^{\infty}$ and $(a_n')_{n=1}^{\infty}$ are two $\Sigma$-approximately constant sequences, then so is their sum. (ii) The larger $\Sigma$ is, the fewer $\Sigma$-approximately constant sequences there are. That is, if $\Sigma_1$ and $\Sigma_2$ are two collections of non-negative sequences such that $\Sigma_1 \subset \Sigma_2$, then a non-negative sequence is $\Sigma_1$-approximately constant if it is $\Sigma_2$-approximately constant.

For the RKHS $H(K)$ and any of its orthonormal bases $\Phi = (\phi_n)_{n=1}^{\infty}$ we let $\mathcal{R}(\Sigma, \Phi)$ denote the set of all functions $f = \sum_{n=1}^{\infty} f_n \phi_n$ such that the series converges pointwise on $T$ and $(f_n^2)_{n=1}^{\infty}$ is a $\Sigma$-approximately constant sequence. The following theorem provides a crucial connection between sample support sets with respect to scaled kernels and functions defined as orthonormal expansions with approximately constant coefficients.

Theorem 4.7. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $\Sigma_\Phi$ a collection of $\Phi$-scalings of $H(K)$ such that the corresponding scaled kernels are continuous. Suppose that $d_K$ is a metric and let $\mathcal{R} = \{K_{A,\Phi} : A \in \Sigma_\Phi\}$. Then $S_{\mathcal{R}}(K) = \mathcal{R}(\Sigma_\Phi, \Phi)$.

Proof. Note first that, by Lemma 4.2, $d_R$ is a metric for every $R \in \mathcal{R}$. Because every scaling of $H(K)$ has an orthonormal basis that is a scaled version of $(\phi_n)_{n=1}^{\infty}$, every $f \in S_{\mathcal{R}}(K)$ can be written as $f = \sum_{n=1}^{\infty} f_n \phi_n$ for some real coefficients $f_n$. Let $\Sigma_1(K)$ and $\Sigma_0(K)$ stand for the collections of $(\alpha_n)_{n=1}^{\infty} \in \Sigma_\Phi$ such that $\sum_{n=1}^{\infty} \alpha_n^{-1} < \infty$ and $\sum_{n=1}^{\infty} \alpha_n^{-1} = \infty$, respectively. Then, by Theorem 4.3, $K_{A,\Phi} \in \mathcal{R}_1(K)$ if $A \in \Sigma_1(K)$ and $K_{A,\Phi} \in \mathcal{R}_0(K)$ if $A \in \Sigma_0(K)$. Because, by definition, $S_{\mathcal{R}}(K) \subset H(K_{A,\Phi})$ for any $A \in \Sigma_1(K)$ and $S_{\mathcal{R}}(K) \cap H(K_{A,\Phi}) = \emptyset$ for any $A \in \Sigma_0(K)$, it follows that for every $f \in S_{\mathcal{R}}(K)$ and any $(\alpha_n)_{n=1}^{\infty} \in \Sigma_\Phi$ we have
$$ \sum_{n=1}^{\infty} \frac{f_n^2}{\alpha_n} < \infty \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} < \infty \quad \text{or} \quad \sum_{n=1}^{\infty} \frac{f_n^2}{\alpha_n} = \infty \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} = \infty. $$
That is, $(f_n^2)_{n=1}^{\infty}$ is a $\Sigma_\Phi$-approximately constant sequence and thus $S_{\mathcal{R}}(K) \subset \mathcal{R}(\Sigma_\Phi, \Phi)$. Conversely, if $f \in \mathcal{R}(\Sigma_\Phi, \Phi)$, then $f \in H(K_{A,\Phi})$ for every $A \in \Sigma_1(K)$ and $f \notin H(K_{A,\Phi})$ for every $A \in \Sigma_0(K)$. Hence $f \in S_{\mathcal{R}}(K)$ and thus $S_{\mathcal{R}}(K) = \mathcal{R}(\Sigma_\Phi, \Phi)$.

Next we use Theorem 4.7 to describe the sample support set more concretely.

4.3. Sample Support Sets for Scaled RKHSs

Let $\Sigma$ be the set of all positive sequences. Then the collection of $\Sigma$-approximately constant sequences is precisely the collection of non-negative sequences $(a_n)_{n=1}^{\infty}$ such that
$$ \liminf_{n \to \infty} a_n > 0 \quad \text{and} \quad \sup_{n \geq 1} a_n < \infty. \tag{4.3} $$
For suppose that there existed a $\Sigma$-approximately constant sequence $(a_n)_{n=1}^{\infty}$ that violated (4.3). If $\liminf_{n \to \infty} a_n = 0$, then there is a subsequence $(a_{n_m})_{m=1}^{\infty}$ such that $a_{n_m} \leq 2^{-m}$ for all $m \in \mathbb{N}$. Let $(b_n)_{n=1}^{\infty} \in \Sigma$ be a sequence such that $b_n = 2^{-n} a_n^{-1}$ for $n \notin (n_m)_{m=1}^{\infty}$ and $b_{n_m} = 1$ for $m \in \mathbb{N}$. Then $\sum_{n=1}^{\infty} b_n$ diverges but
$$ \sum_{n=1}^{\infty} a_n b_n = \sum_{n \notin (n_m)_{m=1}^{\infty}} a_n b_n + \sum_{m=1}^{\infty} a_{n_m} b_{n_m} \leq \sum_{n \notin (n_m)_{m=1}^{\infty}} 2^{-n} + \sum_{m=1}^{\infty} 2^{-m} < \infty, $$
which contradicts the assumption that $(a_n)_{n=1}^{\infty}$ is a $\Sigma$-approximately constant sequence. A similar argument (with $a_{n_m} \geq 2^m$, $b_n = 2^{-n}$, and $b_{n_m} = 2^{-m}$) shows that the second condition in (4.3) cannot be violated either; thus every $\Sigma$-approximately constant sequence satisfies (4.3). A sequence satisfying (4.3) is trivially $\Sigma$-approximately constant because the conditions imply the existence of constants $0 < c_1 \leq c_2$ such that $c_1 \leq a_n \leq c_2$ for all sufficiently large $n$. This, together with Theorem 4.7, yields the following theorem, which we consider the main result of this article. The full proof is more complicated than the above argument as we cannot assume that every positive sequence is a scaling of $H(K)$.
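Condition (4.3) can be probed numerically for explicitly given coefficient sequences. The helper below is a hypothetical utility of ours, with an illustrative geometric index grid and tolerances; it can only ever be a heuristic because, as noted above, there is no sharp boundary between convergent and divergent series.

```python
import numpy as np

def looks_approximately_constant(f, k_max=40, lower=1e-10, upper=1e10):
    """Heuristic probe of condition (4.3): evaluate f_n^2 along the geometric grid
    n = 2^k and check that it stays within [lower, upper]. The grid and tolerances
    are illustrative; a finite check can never certify liminf or sup conditions."""
    n = 2.0 ** np.arange(1, k_max + 1)
    sq = f(n) ** 2
    return bool(sq.min() > lower and sq.max() < upper)

print(looks_approximately_constant(lambda n: 2.0 + np.sin(n)))   # bounded away: True
print(looks_approximately_constant(lambda n: 1.0 / n))           # f_n^2 -> 0: False
print(looks_approximately_constant(lambda n: np.sqrt(n)))        # f_n^2 -> infinity: False
print(looks_approximately_constant(lambda n: 1.0 / np.log(n)))   # decays, but so slowly
# that the finite probe still returns True: a numerical echo of the fact that no
# finite amount of data separates convergent from divergent behaviour.
```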
