Karvonen, T. (2023). Small Sample Spaces for Gaussian Processes. Bernoulli, 29(2), 875–900. https://doi.org/10.3150/22-BEJ1483

http://hdl.handle.net/10138/356782

Accepted version (arXiv:2103.03169v3 [math.PR]), downloaded from Helda, the University of Helsinki institutional repository. This electronic reprint may differ from the original article in pagination and typographic detail; please cite the original version.


Small Sample Spaces for Gaussian Processes

TONI KARVONEN¹,²,*

¹The Alan Turing Institute, 96 Euston Road, London NW1 2DB, United Kingdom

²University of Helsinki, Department of Mathematics and Statistics, Pietari Kalmin katu 5, 00560 Helsinki, Finland

E-mail: *toni.karvonen@helsinki.fi

It is known that the membership in a given reproducing kernel Hilbert space (RKHS) of the samples of a Gaussian process $X$ is controlled by a certain nuclear dominance condition. However, it is less clear how to identify a "small" set of functions (not necessarily a vector space) that contains the samples. This article presents a general approach for identifying such sets. We use scaled RKHSs, which can be viewed as a generalisation of Hilbert scales, to define the sample support set as the largest set which is contained in every element of full measure under the law of $X$ in the $\sigma$-algebra induced by the collection of scaled RKHSs. This potentially non-measurable set is then shown to consist of those functions that can be expanded in terms of an orthonormal basis of the RKHS of the covariance kernel of $X$ and have their squared basis coefficients bounded away from zero and infinity, a result suggested by the Karhunen–Loève theorem.

Keywords: Gaussian processes; sample path properties; reproducing kernel Hilbert spaces

1. Introduction

Let $K\colon T \times T \to \mathbb{R}$ be a positive-semidefinite kernel on a set $T$ and consider any Gaussian process $(X(t))_{t \in T}$ with mean zero and covariance $K$, which we denote $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$. Let $H(K)$ be the reproducing kernel Hilbert space (RKHS) of $K$ equipped with inner product $\langle \cdot, \cdot \rangle_K$ and norm $\|\cdot\|_K$. It is a well-known fact, apparently originating with Parzen [28], that the samples of $X$ are not contained in $H(K)$ if this space is infinite-dimensional. Furthermore, Driscoll [9]; Fortet [12]; and Lukić and Beder [24] have used the zero-one law of Kallianpur [15] to show essentially that, given another kernel $R$ and under certain mild assumptions,
$$ \mathbb{P}[X \in H(R)] = 1 \ \text{ if } R \gg K \quad \text{and} \quad \mathbb{P}[X \in H(R)] = 0 \ \text{ if } R \not\gg K, $$
where $R \gg K$ signifies that $R$ dominates $K$ (i.e., $H(K) \subset H(R)$) and, moreover, that the dominance is nuclear (see Section 4.1 for details). This Driscoll's theorem is an exhaustive tool for verifying whether or not the samples from a Gaussian process are contained in a given RKHS. A review of the topic can be found in [13, Chapter 4]. Two questions now arise:

• How to construct a kernel $R$ such that $R \gg K$?

• Is it possible to exploit the fact that $\mathbb{P}[X \in H(R_1) \setminus H(R_2)] = 1$ for any kernels such that $R_1 \gg K$ and $R_2 \not\gg K$ to identify in some sense the smallest set of functions which contains the samples with probability one?

Answers to questions such as these are instructive for theory and design of Gaussian process based learning [11,40], emulation and approximation [18,41], and optimisation [3] methods. For simplicity we assume that the domain $T$ is a complete separable metric space, that the kernel $K$ is continuous and its RKHS is separable, and that the samples of $X$ are continuous. Although occasionally termed "rather restrictive" [9, p. 309], these continuity assumptions are satisfied by the vast majority of domains and Gaussian processes commonly used in the statistics and machine learning literature [31,35], such as


stationary processes with Gaussian or Matérn covariance kernels. Our motivation for imposing these restrictions is that they imply that RKHSs are measurable.

1.1. Contributions

First, we present a flexible construction for a kernel $R$ such that $R \gg K$. For any orthonormal basis $\Phi = (\phi_n)_{n=1}^{\infty}$ of $H(K)$ the kernel $K$ can be written as $K(t, t') = \sum_{n=1}^{\infty} \phi_n(t)\phi_n(t')$ for all $t, t' \in T$. Given a positive sequence $A = (\alpha_n)_{n=1}^{\infty}$ such that $\sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 < \infty$ for all $t \in T$, we define the scaled kernel
$$ K_{A,\Phi}(t, t') = \sum_{n=1}^{\infty} \alpha_n \phi_n(t) \phi_n(t'). \tag{1.1} $$
This is a significant generalisation of the concept of powers of kernels which has been previously used to construct RKHSs which contain the samples by Steinwart [36]. We call the sequence $A$ a $\Phi$-scaling of $H(K)$. If $\alpha_n \to \infty$ as $n \to \infty$, the corresponding scaled RKHS¹, $H(K_{A,\Phi})$, is a proper superset of $H(K)$, though not necessarily large enough to contain the samples of $X$. We show that convergence of the series $\sum_{n=1}^{\infty} \alpha_n^{-1}$ controls whether or not samples are contained in $H(K_{A,\Phi})$. If $\sum_{n=1}^{\infty} \alpha_n^{-1}$ converges slowly, we can therefore interpret $H(K_{A,\Phi})$ as a "small" RKHS which contains the samples.
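To make the construction (1.1) concrete, the following sketch (ours, not part of the article) builds a truncated scaled kernel from the Mercer basis of the Brownian motion covariance $K(t, t') = \min\{t, t'\}$ on $[0, 1]$, for which an orthonormal basis of $H(K)$ is known in closed form. The scaling $\alpha_n = \sqrt{n}$ is a valid $\Phi$-scaling here because $\sum_n \alpha_n \phi_n(t)^2 \lesssim \sum_n n^{-3/2} < \infty$, although $\sum_n \alpha_n^{-1} = \infty$, so by Main Result I below the resulting scaled RKHS still fails to contain the Brownian motion samples; the truncation level and grid are illustrative choices.

```python
import numpy as np

# Mercer pairs of the Brownian motion kernel K(t, t') = min(t, t') on [0, 1]:
# eigenvalues lambda_n = ((n - 1/2) pi)^(-2), eigenfunctions sqrt(2) sin((n - 1/2) pi t),
# so phi_n = sqrt(lambda_n) * sqrt(2) * sin((n - 1/2) pi t) is an orthonormal basis of H(K).
def onb(t, n_terms):
    n = np.arange(1, n_terms + 1)
    lam = 1.0 / ((n - 0.5) ** 2 * np.pi ** 2)
    return np.sqrt(2.0 * lam) * np.sin((n - 0.5) * np.pi * np.asarray(t)[:, None])

def scaled_kernel(t, s, alpha):
    """Truncated scaled kernel K_{A,Phi}(t, s) = sum_n alpha_n phi_n(t) phi_n(s)."""
    Pt, Ps = onb(t, alpha.size), onb(s, alpha.size)
    return (Pt * alpha) @ Ps.T

t = np.linspace(0.0, 1.0, 50)
n_terms = 2000
alpha = np.sqrt(np.arange(1.0, n_terms + 1))   # alpha_n = sqrt(n), a valid Phi-scaling here

# alpha_n = 1 recovers (a truncation of) K itself ...
print(np.max(np.abs(scaled_kernel(t, t, np.ones(n_terms)) - np.minimum.outer(t, t))))
# ... and the scaled Gram matrix is positive semi-definite (cf. Proposition 3.2).
print(np.min(np.linalg.eigvalsh(scaled_kernel(t, t, alpha))) > -1e-8)
```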

Main Result I (Theorem 4.3). Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ a $\Phi$-scaling of $H(K)$. If $K_{A,\Phi}$ is continuous and $d_K(t, t') = \|K(\cdot, t) - K(\cdot, t')\|_K$ is a metric on $T$, then either
$$ \mathbb{P}[X \in H(K_{A,\Phi})] = 0 \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} = \infty \quad \text{or} \quad \mathbb{P}[X \in H(K_{A,\Phi})] = 1 \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} < \infty. $$

In Section 5, we use this result to study sample path properties of Gaussian processes defined by infinitely smooth kernels. These appear to be the first sufficiently descriptive results of their kind. An example of an infinitely smooth kernel that we consider is the univariate (i.e., $T \subset \mathbb{R}$) Gaussian kernel $K(t, t') = \exp(-(t - t')^2 / (2\ell^2))$ with length-scale $\ell > 0$, for which we explicitly construct several scaled kernels $R$ whose RKHSs are "small" but still contain the samples of $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$. In Section 6, Theorem 4.3 is applied to provide an intuitive explanation for a conjecture by Xu and Stein [41] on the asymptotic behaviour of the maximum likelihood estimate of the scaling parameter of the Gaussian kernel when the data are generated by a monomial function on a uniform grid.

Secondly, we use Theorem 4.3 to construct a "small" set which "almost" contains the samples. This sample support set is distinct from the traditional topological support of a Gaussian measure; see the discussion at the end of Section 2. Let $C(T)$ denote the set of continuous functions on $T$.

Main Result II (Theorems 4.8 and 4.11). Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and suppose there is a $\Phi$-scaling $A = (\alpha_n)_{n=1}^{\infty}$ such that $\sum_{n=1}^{\infty} \alpha_n^{-1} < \infty$ and $K_{A,\Phi}$ is continuous. Let $\mathcal{S}(\mathcal{R})$ be the $\sigma$-algebra generated by the collection of scaled RKHSs consisting of continuous functions and $S_{\mathcal{R}}(K)$ the largest subset of $C(T)$ that is contained in every $H \in \mathcal{S}(\mathcal{R})$ such that $\mathbb{P}[X \in H] = 1$. Suppose that $d_K(t, t') = \|K(\cdot, t) - K(\cdot, t')\|_K$ is a metric on $T$. Then $S_{\mathcal{R}}(K)$ consists precisely of the functions $f = \sum_{n=1}^{\infty} f_n \phi_n$ such that
$$ \liminf_{n \to \infty} f_n^2 > 0 \quad \text{and} \quad \sup_{n \geq 1} f_n^2 < \infty. \tag{1.2} $$
Furthermore, for every $H \in \mathcal{S}(\mathcal{R})$ such that $\mathbb{P}[X \in H] = 1$ there exists $F \in \mathcal{S}(\mathcal{R})$ such that $S_{\mathcal{R}}(K)$ is a proper subset of $F$ and $F$ is a proper subset of $H$.

¹These spaces are not to be confused with classical Hilbert scales defined via powers of a strictly positive self-adjoint operator; see [10, Section 8.4] and [20].

The set $S_{\mathcal{R}}(K)$ may fail to be measurable. The latter part of the above result is therefore important in demonstrating that it is possible to construct sets which are arbitrarily close to $S_{\mathcal{R}}(K)$ and contain the samples using countably many elementary set operations of scaled RKHSs. At its core this is a manifestation of the classical result that there is no meaningful notion of a boundary between convergent and divergent series [19, § 41]. The general form of the Karhunen–Loève theorem is useful in explaining the characterisation in (1.2). If $(\phi_n)_{n=1}^{\infty}$ is any orthonormal basis of $H(K)$, then the Gaussian process can be written as $X(t) = \sum_{n=1}^{\infty} \zeta_n \phi_n(t)$ for all $t \in T$, where the $\zeta_n$ are independent standard normal random variables. The series converges in $L^2(\mathbb{P})$, but if almost all samples of $X$ are continuous, the convergence is also uniform on $T$ with probability one [1, Theorem 3.8]. Because $\|X\|_K^2 = \sum_{n=1}^{\infty} \zeta_n^2$ and $\mathbb{E}[\zeta_n^2] = 1$ for every $n$, the Karhunen–Loève expansion suggests, somewhat informally, that the samples are functions $f = \sum_{n=1}^{\infty} f_n \phi_n$ for which the sequence $(f_n^2)_{n=1}^{\infty}$ satisfies (1.2).
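As an informal numerical illustration of this heuristic (ours, not from the article), the sketch below samples a truncated Karhunen–Loève expansion using the explicit Brownian motion basis from the previous sketch and reports running averages of the squared coefficients $\zeta_n^2$, which hover around their mean of one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms = 500
n = np.arange(1, n_terms + 1)
t = np.linspace(0.0, 1.0, 400)

# Orthonormal basis of H(K) for K(t, t') = min(t, t') (illustrative choice of kernel).
lam = 1.0 / ((n - 0.5) ** 2 * np.pi ** 2)
phi = np.sqrt(2.0 * lam) * np.sin((n - 0.5) * np.pi * t[:, None])

# Truncated Karhunen-Loeve sample X(t) = sum_n zeta_n phi_n(t) with zeta_n ~ N(0, 1).
zeta = rng.standard_normal(n_terms)
X = phi @ zeta

# E[zeta_n^2] = 1, so running averages of the squared coefficients stay near one;
# this is the informal sense in which the squared coefficients of a sample behave
# like a sequence satisfying (1.2).
running_mean = np.cumsum(zeta ** 2) / n
print(running_mean[[9, 99, 499]])
```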

1.2. On Measurability and Continuity

Suppose for a moment that $T$ is an arbitrary uncountable set, $K$ a positive-semidefinite kernel on $T$ such that its RKHS $H(K)$ is infinite-dimensional, and $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ a generic Gaussian process defined on a generic probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Let $\mathbb{R}^T$ be the collection of real-valued functions on $T$ and $\tilde{\mathcal{B}}$ the $\sigma$-algebra generated by cylinder sets of the form $\{f \in \mathbb{R}^T : (f(t_1), \ldots, f(t_n)) \in B_n\}$ for any $n \in \mathbb{N}$ and any Borel set $B_n \subset \mathbb{R}^n$. Let $\Phi_X(\omega) = X(\cdot, \omega)$. Then $\tilde{\mu}_X = \mathbb{P} \circ \Phi_X^{-1}$ is the law of $X$ on the measurable space $(\mathbb{R}^T, \tilde{\mathcal{B}})$. Consequently, $\mathbb{P}[X \in H] = \mathbb{P}(\{\omega \in \Omega : X(\cdot, \omega) \in H\}) = \tilde{\mu}_X(H)$ for $H \in \tilde{\mathcal{B}}$. Let $(\mathbb{R}^T, \tilde{\mathcal{B}}_0, \tilde{\mu}_{X,0})$ be the completion of $(\mathbb{R}^T, \tilde{\mathcal{B}}, \tilde{\mu}_X)$ and $R\colon T \times T \to \mathbb{R}$ a positive-semidefinite kernel and $H(R)$ its RKHS. The following facts are known about the measurability of $H(K)$ and $H(R)$:

• In general, $H(R) \notin \tilde{\mathcal{B}}$. For example, if $T$ is equipped with a topology and $R$ is continuous on $T \times T$, then $H(R) \subset C(T)$. However, no non-empty subset of $C(T)$ can be an element of $\tilde{\mathcal{B}}$.

• LePage [21, p. 347] has proved that $H(K) \in \tilde{\mathcal{B}}_0$ and $\tilde{\mu}_{X,0}(H(K)) = 0$, a claim which originates with Parzen [27,28]. A version which requires separability and continuity appears in [16].

• If the RKHS $H(R)$ is infinite-dimensional and $R \not\gg K$, then $H(R) \in \tilde{\mathcal{B}}_0$ and $\tilde{\mu}_{X,0}(H(R)) = 0$. This result is contained in the proof of Theorem 7.3 in [24]. See also [13, Proposition 4.5.1].

• LePage [21, Corollary 2] has proved a dichotomy result which states that if $G \subset \mathbb{R}^T$ is an additive group and $G \in \tilde{\mathcal{B}}_0$, then either $\tilde{\mu}_{X,0}(G) = 0$ or $\tilde{\mu}_{X,0}(G) = 1$. Furthermore, $\tilde{\mu}_{X,0}(G) = 1$ implies that $H(K) \subset G$. This is a general version of the zero-one law of Kallianpur [15, Theorem 2].

It appears that not much more can be said without imposing additional structure or constructing versions of $X$, as is done in [24].

Suppose that $(T, d_T)$ is a complete separable metric space, that the kernel $K$ is continuous, and that almost all samples of $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ are continuous, which is to say that the $(\tilde{\mathcal{B}}_0, \tilde{\mu}_{X,0})$-outer measure of $C(T)$ is one. Define the probability space $(C(T), \mathcal{B}, \mu_X)$ as
$$ \mathcal{B} = C(T) \cap \tilde{\mathcal{B}}_0 \quad \text{and} \quad \mu_X(C(T) \cap H) = \tilde{\mu}_{X,0}(H) \quad \text{for } H \in \tilde{\mathcal{B}}_0. \tag{1.3} $$
The rest of this article is concerned with $(C(T), \mathcal{B}, \mu_X)$ and it is to be understood that $\mathbb{P}[X \in H]$ stands for $\mu_X(H)$ for any $H \in \mathcal{B}$. In this setting Driscoll [9, p. 313] has proved that $H(R) \in \mathcal{B}$ if $R$ is continuous and positive-definite. By using Theorem 1.1 in [12] (Theorem 4.1 in [24]) one can generalise this result to a continuous and positive-semidefinite $R$; see the proof of Theorem 7.3 in [24].


1.3. Notation and Terminology

For non-negative real sequences $(a_n)_{n=1}^{\infty}$ and $(b_n)_{n=1}^{\infty}$ we write $a_n \lesssim b_n$ if there is $C > 0$ such that $a_n \leq C b_n$ for all sufficiently large $n$. If both $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold, we write $a_n \asymp b_n$. If $a_n / b_n \to 1$ as $n \to \infty$, we write $a_n \sim b_n$. For two sets $F$ and $G$ we use $F \subsetneq G$ to indicate that $F$ is a proper subset of $G$. A kernel $R\colon T \times T \to \mathbb{R}$ is positive-semidefinite if
$$ \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j R(t_i, t_j) \geq 0 \tag{1.4} $$
for any $N \geq 1$, $a_1, \ldots, a_N \in \mathbb{R}$, and $t_1, \ldots, t_N \in T$. In the remainder of this article positive-semidefinite kernels are simply referred to as kernels. If the inequality in (1.4) is strict for any pairwise distinct $t_1, \ldots, t_N$ and coefficients $a_1, \ldots, a_N$ that are not all zero, the kernel is said to be positive-definite.

1.4. Standing Assumptions

For ease of reference our standing assumptions are collected here. We assume that (i) $(T, d_T)$ is a complete separable metric space; (ii) the covariance kernel $K\colon T \times T \to \mathbb{R}$ is continuous and positive-semidefinite on $T \times T$; (iii) the RKHS $H(K)$ induced by $K$ is infinite-dimensional and separable²; and (iv) $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ is a zero-mean Gaussian process on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ with continuous paths. The law $\mu_X$ of $X$ is defined on the measurable space $(C(T), \mathcal{B})$ which was constructed in Section 1.2. Some of our results have natural generalisations for general second-order stochastic processes; see [24], in particular Sections 2 and 5, and [36]. We do not pursue these generalisations.

²Most famously, separable RKHSs are induced by Mercer kernels, which are continuous kernels defined on compact subsets of $\mathbb{R}^d$ [29, Section 11.3].

2. Related Work

Reproducing kernel Hilbert spaces which contain the samples of $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ have been constructed by means of integrated kernels in [23], convolution kernels in [5] and [11, Section 3.1], and, most importantly, powers of RKHSs [39] in [17, Section 4] and [36]. Namely, let $T$ be a compact metric space, $K$ a continuous kernel on $T \times T$, and $\nu$ a finite and strictly positive Borel measure on $T$. Then the integral operator $T_\nu$, defined for $f \in L^2(\nu)$ via
$$ (T_\nu f)(t) = \int_T K(t, t') f(t') \, \mathrm{d}\nu(t'), \tag{2.1} $$
has decreasing and positive eigenvalues $(\lambda_n)_{n=1}^{\infty}$, which vanish as $n \to \infty$, and eigenfunctions $(\psi_n)_{n=1}^{\infty}$ in $H(K)$ such that $(\sqrt{\lambda_n}\psi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(K)$. The kernel has the uniformly convergent Mercer expansion $K(t, t') = \sum_{n=1}^{\infty} \lambda_n \psi_n(t) \psi_n(t')$ for all $t, t' \in T$. For $\theta > 0$, the $\theta$th power of $K$ is defined as
$$ K^{(\theta)}(t, t') = \sum_{n=1}^{\infty} \lambda_n^{\theta} \psi_n(t) \psi_n(t'). \tag{2.2} $$


The series (2.2) converges if $\sum_{n=1}^{\infty} \lambda_n^{\theta} \psi_n(t)^2 < \infty$ for all $t \in T$. Furthermore, $H(K^{(\theta_2)}) \subsetneq H(K^{(\theta_1)})$ if $\theta_1 < \theta_2$ and $\mathbb{P}[X \in H(K^{(\theta)})] = 1$ if and only if $\sum_{n=1}^{\infty} \lambda_n^{1-\theta} < \infty$ [36, Theorem 5.2]. When it comes to sample properties, the power kernel construction has two significant downsides: (i) The measure $\nu$ is a nuisance parameter. If one is only interested in sample path properties of Gaussian processes, this measure should not have an intrinsic part to play in the analysis and results. (ii) The construction is somewhat inflexible and unsuitable for infinitely smooth kernels. Because $H(K^{(\theta)})$ consists precisely of the functions $f = \sum_{n=1}^{\infty} f_n \lambda_n^{1/2} \psi_n$ such that $\sum_{n=1}^{\infty} f_n^2 \lambda_n^{1-\theta} < \infty$ and $\lambda_n \to 0$ as $n \to \infty$, how much larger $H(K^{(\theta)})$ is than $H(K) = H(K^{(1)})$ is determined by the rate of decay of the eigenvalues. Power RKHSs are more descriptive and fine-grained when the kernel is finitely smooth and its eigenvalues have polynomial decay $n^{-a}$ for $a > 0$ (e.g., Matérn kernels) than when the kernel is infinitely smooth with at least exponential eigenvalue decay $e^{-bn}$ for $b > 0$ (e.g., Gaussian): the change from the decay condition $\sum_{n=1}^{\infty} f_n^2 < \infty$ for the coefficients $(f_n)_{n=1}^{\infty}$ to $\sum_{n=1}^{\infty} f_n^2 n^{-a(1-\theta)} < \infty$ is arguably less substantial than that from $\sum_{n=1}^{\infty} f_n^2 < \infty$ to $\sum_{n=1}^{\infty} f_n^2 e^{-b(1-\theta)n} < \infty$. Indeed, as pointed out by Kanagawa et al. [17, Section 4.4], when the kernel is Gaussian every $H(K^{(\theta)})$ with $\theta < 1$ contains the samples with probability one, which renders powers of RKHSs of dubious utility in that setting because $H(K^{(1)}) = H(K)$ does not contain the samples. The relationship between powers of RKHSs and scaled RKHSs is discussed in more detail at the end of Section 3. In Section 5 we demonstrate that scaled RKHSs are more useful than powers of RKHSs in describing sample path properties of Gaussian processes defined by infinitely smooth kernels.
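Downside (ii) can be seen in a small numerical experiment of ours (with made-up decay rates, not eigenvalues of any particular kernel): partial sums of Steinwart's membership criterion $\sum_n \lambda_n^{1-\theta}$ either stabilise or keep growing depending on the decay regime.

```python
import numpy as np

# Partial sums of sum_n lambda_n^(1 - theta) at increasing truncation levels:
# stabilising values indicate convergence, steadily growing values divergence.
# Decay rates and truncation levels are illustrative assumptions.
def partial_sums(lam_fn, theta, truncations=(10_000, 100_000, 1_000_000)):
    return [float(np.sum(lam_fn(np.arange(1.0, N + 1)) ** (1.0 - theta)))
            for N in truncations]

poly = lambda n: n ** -2.0          # polynomial decay, as for Matern-type kernels
expo = lambda n: np.exp(-0.5 * n)   # exponential decay, as for the Gaussian kernel

for theta in (0.25, 0.75):
    print("polynomial, theta =", theta, partial_sums(poly, theta))
    print("exponential, theta =", theta, partial_sums(expo, theta))
# Polynomial decay: the sums stabilise for theta = 0.25 but keep growing for
# theta = 0.75, so the power theta genuinely distinguishes spaces that do and do
# not contain the samples. Exponential decay: the sums stabilise for every theta < 1,
# so in this regime all powers H(K^(theta)) with theta < 1 contain the samples and
# the construction cannot resolve how much regularity the samples actually have.
```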

To the best of our knowledge, the question about a "minimal" set which contains the samples with probability one has received only cursory discussion in the literature. Perhaps the most relevant digression on the topic is an observation by Steinwart [36, pp. 369–370], given here in a somewhat applied form and without some technicalities, that the samples are contained in the set
$$ \bigg( \bigcap_{r < s} W_2^r(T) \bigg) \setminus W_2^s(T) \tag{2.3} $$
with probability one if $H(K)$ is norm-equivalent to the fractional Sobolev space $W_2^{s+d/2}(T)$ for $s > 0$ on a suitable domain $T \subset \mathbb{R}^d$. In the Sobolev case the samples are therefore "$d/2$ less smooth" than functions in the RKHS of $K$. Because $W_2^s(T) = \bigcup_{r \geq s} W_2^r(T)$, the set in (2.3) has the same form as the sample support set in (4.2). This observation is, of course, a general version of the familiar result that the sample paths of the Brownian motion, whose covariance kernel $K(t, t') = \min\{t, t'\}$ on $T = [0, 1]$ induces the Sobolev space $W_2^1([0, 1])$ with zero boundary condition at the origin, have regularity $1/2$ in the sense that they are almost surely $\alpha$-Hölder continuous if and only if $\alpha < 1/2$. That is, there is $C > 0$ such that, for almost every $\omega \in \Omega$, $|X(t, \omega) - X(t', \omega)| \leq C|t - t'|^{\alpha}$ for all $t, t' \in [0, 1]$ and any $\alpha < 1/2$. However, Lévy's modulus of continuity theorem [26, Section 1.2] improves this to $|X(t, \omega) - X(t', \omega)| \leq C\sqrt{h \log(1/h)}$ when $h = |t - t'|$ is sufficiently small. Since the Sobolev space $W_2^s(T)$ consists of those functions $f\colon T \to \mathbb{R}$ which admit an $L^2(\mathbb{R}^d)$-extension $f_e\colon \mathbb{R}^d \to \mathbb{R}$ whose Fourier transform satisfies
$$ \int_{\mathbb{R}^d} \big(1 + \|\xi\|^2\big)^s |\widehat{f_e}(\xi)|^2 \, \mathrm{d}\xi < \infty, \tag{2.4} $$
Lévy's modulus of continuity theorem suggests replacing the weight in (2.4) with, for example, $(1 + \|\xi\|^2)^s \log(1 + \|\xi\|)^{-1}$ so that the resulting function space is a proper superset of $W_2^s(T)$ and a proper subset of $W_2^r(T)$ for every $r < s$ and hence a proper subset of the set in (2.3). Some results and discussion in [18, Section 4.2] and [34] have this flavour.


Finally, we remark that classical results about the topological support of a Gaussian measure are distinct from the results in this article. Let $C(T)$ be equipped with the standard supremum norm $\|\cdot\|_\infty$. The topological support, $\mathrm{supp}_{C(T)}(\mu_X)$, of the measure $\mu_X$ is the set
$$ \mathrm{supp}_{C(T)}(\mu_X) = \{ f \in C(T) : \mu_X(B(f, r)) > 0 \text{ for all } r > 0 \}, $$
where $B(f, r)$ is the $f$-centered $r$-ball in $(C(T), \|\cdot\|_\infty)$. It is a classical result [16, Theorem 3] that $\mathrm{supp}_{C(T)}(\mu_X) = \overline{H(K)}$, where $\overline{H(K)}$ is the closure of $H(K)$ in $(C(T), \|\cdot\|_\infty)$. In other words, the topological support of $\mu_X$ contains every continuous function $f$ such that for every $\varepsilon > 0$ there exists $g_\varepsilon \in H(K)$ satisfying $\|f - g_\varepsilon\|_\infty < \varepsilon$. Now, recall that a kernel $R$ is universal if $H(R)$ is dense in $C(T)$ [37, Section 4.6]. Most kernels of interest to practitioners are universal, including Gaussians, Matérns, and power series kernels. But, by definition, the closure of the RKHS of a universal kernel equals $C(T)$. Therefore $\mathrm{supp}_{C(T)}(\mu_X) = \overline{H(K)} = C(T)$ if $K$ is a universal kernel. This result does not provide any information about the samples because we have assumed that the samples are continuous to begin with. See [2, Section 3.6] for further results on general topological supports of Gaussian measures.
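The density behind universality can be illustrated numerically; the sketch below (ours, with an illustrative target function, length-scale, grid sizes, and a small jitter for numerical stability) interpolates a continuous but non-smooth function with the Gaussian kernel on increasingly fine grids and reports the sup error on a hold-out grid.

```python
import numpy as np

def gauss_kernel(s, t, ell=0.05):
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * ell ** 2))

f = lambda x: np.abs(x - 0.5)          # continuous target with a kink at 1/2
t_test = np.linspace(0.0, 1.0, 1001)

for m in (5, 10, 20, 40):
    t_train = np.linspace(0.0, 1.0, m)
    G = gauss_kernel(t_train, t_train) + 1e-10 * np.eye(m)   # small jitter for stability
    w = np.linalg.solve(G, f(t_train))
    interpolant = gauss_kernel(t_test, t_train) @ w          # a member of H(K)
    print(m, float(np.max(np.abs(interpolant - f(t_test)))))
# The sup error shrinks as the grid is refined: elements of the Gaussian RKHS
# approximate f uniformly even though f itself is not in the RKHS, which is the
# density property behind universality (and why the topological support is all of C(T)).
```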

3. Scaled Reproducing Kernel Hilbert Spaces

For any orthonormal basis $\Phi = (\phi_n)_{n=1}^{\infty}$ of $H(K)$ the kernel has the pointwise convergent expansion $K(t, t') = \sum_{n=1}^{\infty} \phi_n(t)\phi_n(t')$ for all $t, t' \in T$. By the standard characterisation of a separable Hilbert space, the RKHS consists of precisely those functions $f\colon T \to \mathbb{R}$ that admit an expansion $f = \sum_{n=1}^{\infty} f_n \phi_n$ for coefficients such that $\sum_{n=1}^{\infty} f_n^2 < \infty$. The Cauchy–Schwarz inequality ensures that this expansion converges pointwise on $T$. For given functions $f = \sum_{n=1}^{\infty} f_n \phi_n$ and $g = \sum_{n=1}^{\infty} g_n \phi_n$ in the RKHS the inner product is $\langle f, g \rangle_K = \sum_{n=1}^{\infty} f_n g_n$.

Definition 3.1 (Scaled kernel and RKHS). We say that a positive sequence $A = (\alpha_n)_{n=1}^{\infty}$ is a $\Phi$-scaling of $H(K)$ if $\sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 < \infty$ for every $t \in T$. The kernel
$$ K_{A,\Phi}(t, t') = \sum_{n=1}^{\infty} \alpha_n \phi_n(t) \phi_n(t') \tag{3.1} $$
is called a scaled kernel and its RKHS $H(K_{A,\Phi})$ a scaled RKHS.

See [30,42,43] for prior appearances of scaled kernels under different names and not in the context of Gaussian processes. Although many of the results in this section have appeared in some form in the literature, all proofs are included here for completeness.

Proposition 3.2. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ a $\Phi$-scaling of $H(K)$. Then (i) the scaled kernel $K_{A,\Phi}$ is positive-semidefinite, (ii) the collection $(\sqrt{\alpha_n}\phi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(K_{A,\Phi})$, and (iii) the scaled RKHS is
$$ H(K_{A,\Phi}) = \bigg\{ f = \sum_{n=1}^{\infty} f_n \phi_n \;:\; \|f\|_{K_{A,\Phi}}^2 = \sum_{n=1}^{\infty} \frac{f_n^2}{\alpha_n} < \infty \bigg\}, \tag{3.2} $$
where convergence is pointwise, and for any $f = \sum_{n=1}^{\infty} f_n \phi_n$ and $g = \sum_{n=1}^{\infty} g_n \phi_n$ in $H(K_{A,\Phi})$ the inner product is
$$ \langle f, g \rangle_{K_{A,\Phi}} = \sum_{n=1}^{\infty} \frac{f_n g_n}{\alpha_n}. \tag{3.3} $$

Proof. By the Cauchy–Schwarz inequality and $\sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 < \infty$ for every $t \in T$,
$$ \sum_{n=1}^{\infty} |\alpha_n \phi_n(t) \phi_n(t')| \leq \bigg( \sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 \bigg)^{1/2} \bigg( \sum_{n=1}^{\infty} \alpha_n \phi_n(t')^2 \bigg)^{1/2} < \infty $$
for any $t, t' \in T$. This proves that the scaled kernel in (3.1) is well-defined via an absolutely convergent series. To verify that $K_{A,\Phi}$ is positive-semidefinite, note that, for any $N \geq 1$, $a_1, \ldots, a_N \in \mathbb{R}$, and $t_1, \ldots, t_N \in T$,
$$ \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j K_{A,\Phi}(t_i, t_j) = \sum_{n=1}^{\infty} \alpha_n \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \phi_n(t_i) \phi_n(t_j) = \sum_{n=1}^{\infty} \alpha_n \bigg( \sum_{i=1}^{N} a_i \phi_n(t_i) \bigg)^2 $$
is non-negative because each $\alpha_n$ is positive. Because
$$ \sum_{n=1}^{\infty} |f_n \sqrt{\alpha_n}\, \phi_n(t)| \leq \bigg( \sum_{n=1}^{\infty} f_n^2 \bigg)^{1/2} \bigg( \sum_{n=1}^{\infty} \alpha_n \phi_n(t)^2 \bigg)^{1/2} < \infty \quad \text{for every } t \in T $$
if $\sum_{n=1}^{\infty} f_n^2 < \infty$, the space defined in (3.2) and (3.3) is a Hilbert space of functions with an orthonormal basis $(\sqrt{\alpha_n}\phi_n)_{n=1}^{\infty}$. Since $K_{A,\Phi}(t, t') = \sum_{n=1}^{\infty} \alpha_n \phi_n(t) \phi_n(t')$, the scaled kernel is the unique reproducing kernel of this space [e.g., 25, Theorem 9].

A scaled RKHS depends on the ordering of the orthonormal basis of $H(K)$ used to construct it. For example, let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and suppose that $\alpha_n = n$ defines a $\Phi$-scaling of $H(K)$. Define another ordered orthonormal basis $\Psi = (\psi_n)_{n=1}^{\infty}$ by setting $\psi_{2n+1} = \phi_{2^n}$ for $n \geq 0$ and interleaving the remaining $\phi_n$ to produce $\Psi = (\phi_1, \phi_3, \phi_2, \phi_5, \phi_4, \phi_6, \phi_8, \phi_7, \phi_{16}, \ldots)$. The function $f = \sum_{n=0}^{\infty} \phi_{2^n} =: \sum_{n=1}^{\infty} f_{\Phi,n} \phi_n$ is in $H(K_{A,\Phi})$ because
$$ \|f\|_{K_{A,\Phi}}^2 = \sum_{n=1}^{\infty} \frac{f_{\Phi,n}^2}{\alpha_n} = \sum_{n=0}^{\infty} \frac{1}{2^n} < \infty $$
but not in $H(K_{A,\Psi})$ because $f = \sum_{n=0}^{\infty} \phi_{2^n} = \sum_{n=0}^{\infty} \psi_{2n+1} =: \sum_{n=1}^{\infty} f_{\Psi,n} \psi_n$ and therefore
$$ \|f\|_{K_{A,\Psi}}^2 = \sum_{n=1}^{\infty} \frac{f_{\Psi,n}^2}{\alpha_n} = \sum_{n=0}^{\infty} \frac{1}{2n+1} = \infty. $$
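A two-line numerical check of this example (ours, with illustrative truncation levels): the partial sums of the two squared norms behave exactly as claimed.

```python
import numpy as np

# Partial sums of the two squared norms in the reordering example with alpha_n = n:
# under Phi the unit coefficients of f sit at indices 2^k, under Psi at indices 2k + 1.
for N in (10, 1_000, 100_000):
    k = np.arange(N)
    print(N, float(np.sum(0.5 ** k)), float(np.sum(1.0 / (2.0 * k + 1.0))))
# The first partial sums stabilise at 2 while the second grow without bound, so the
# same function is in H(K_{A,Phi}) but not in H(K_{A,Psi}).
```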

In practice, the orthonormal basis usually has a natural ordering. For instance, the decreasing eigenvalues specify an ordering for a basis obtained from Mercer's theorem (see Section 2) or the basis may have a polynomial factor, the degree of which specifies an ordering (see the kernels in Sections 5.2 and 5.3).

The following results compare sizes of scaled RKHSs: the faster $\alpha_n$ grows, the larger the RKHS $H(K_{A,\Phi})$ is. A number of additional properties between scaled RKHSs can be proved in a similar manner but are not needed in the developments of this article. Some of the results below, or their variants, can be found in the literature. In particular, see [42, Section 6] and [43, Section 4] for a version of Proposition 3.5 and some additional results.

Proposition 3.3. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ and $B = (\beta_n)_{n=1}^{\infty}$ two $\Phi$-scalings of $H(K)$. Then $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$ if and only if $\beta_n \lesssim \alpha_n$. In particular, $H(K) \subset H(K_{A,\Phi})$ if and only if $\inf_{n \geq 1} \alpha_n > 0$.

Proof. If $\beta_n \lesssim \alpha_n$, then for any $f = \sum_{n=1}^{\infty} f_n \phi_n \in H(K_{B,\Phi})$ we have
$$ \|f\|_{K_{A,\Phi}}^2 = \sum_{n=1}^{\infty} \frac{f_n^2}{\alpha_n} = \sum_{n=1}^{\infty} \frac{\beta_n}{\alpha_n} \frac{f_n^2}{\beta_n} \leq \|f\|_{K_{B,\Phi}}^2 \sup_{n \geq 1} \frac{\beta_n}{\alpha_n} < \infty. \tag{3.4} $$
Consequently, $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$. Suppose then that $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$ and assume to the contrary that $\sup_{n \geq 1} \alpha_n^{-1} \beta_n = \infty$, so that there is a subsequence $(n_m)_{m=1}^{\infty}$ such that $\alpha_{n_m}^{-1} \beta_{n_m} \geq 2^m$. Then $f = \sum_{m=1}^{\infty} 2^{-m/2} \sqrt{\beta_{n_m}}\, \phi_{n_m} \in H(K_{B,\Phi}) \setminus H(K_{A,\Phi})$ since
$$ \|f\|_{K_{B,\Phi}}^2 = \sum_{m=1}^{\infty} \bigg( \frac{\sqrt{\beta_{n_m}}}{2^{m/2}} \bigg)^2 \frac{1}{\beta_{n_m}} = \sum_{m=1}^{\infty} 2^{-m} = 1 \quad \text{but} \quad \|f\|_{K_{A,\Phi}}^2 = \sum_{m=1}^{\infty} \frac{2^{-m} \beta_{n_m}}{\alpha_{n_m}} \geq \sum_{m=1}^{\infty} 1 = \infty, $$
which contradicts the assumption that $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$. Thus $\sup_{n \geq 1} \alpha_n^{-1} \beta_n < \infty$. The second statement follows by setting $\beta_n = 1$ for every $n \in \mathbb{N}$ and noting that then $H(K_{B,\Phi}) = H(K)$.

Two normed spaces $F$ and $G$ are said to be norm-equivalent if they are equal as sets and if there exist positive constants $C_1$ and $C_2$ such that $C_1\|f\|_F \leq \|f\|_G \leq C_2\|f\|_F$ for all $f \in F$. From (3.4) it follows that $H(K_{A,\Phi})$ and $H(K_{B,\Phi})$ are norm-equivalent if and only if $\alpha_n \asymp \beta_n$.

Corollary 3.4. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ and $B = (\beta_n)_{n=1}^{\infty}$ two $\Phi$-scalings of $H(K)$. Then $H(K_{A,\Phi})$ and $H(K_{B,\Phi})$ are norm-equivalent if and only if $\alpha_n \asymp \beta_n$.

Proposition 3.5. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ and $B = (\beta_n)_{n=1}^{\infty}$ two $\Phi$-scalings of $H(K)$. Then $H(K_{B,\Phi}) \subsetneq H(K_{A,\Phi})$ if and only if $\sup_{n \geq 1} \alpha_n \beta_n^{-1} = \infty$ and $\beta_n \lesssim \alpha_n$.

Proof. Assume first that $\sup_{n \geq 1} \alpha_n \beta_n^{-1} = \infty$ and $\beta_n \lesssim \alpha_n$. Since $\beta_n \lesssim \alpha_n$, Proposition 3.3 yields $H(K_{B,\Phi}) \subset H(K_{A,\Phi})$. Thus $H(K_{B,\Phi})$ is a proper subset of $H(K_{A,\Phi})$ if $H(K_{A,\Phi})$ is not a subset of $H(K_{B,\Phi})$. But, again by Proposition 3.3, $H(K_{A,\Phi}) \subset H(K_{B,\Phi})$ if and only if $\alpha_n \lesssim \beta_n$, which contradicts the assumption that $\sup_{n \geq 1} \alpha_n \beta_n^{-1} = \infty$. Hence $H(K_{B,\Phi})$ is a proper subset of $H(K_{A,\Phi})$. Assume then that $H(K_{B,\Phi}) \subsetneq H(K_{A,\Phi})$. Then $\beta_n \lesssim \alpha_n$ by Proposition 3.3. If $\sup_{n \geq 1} \alpha_n \beta_n^{-1} = \infty$ did not hold, there would exist $C > 0$ such that $\alpha_n \leq C\beta_n$ for all $n \in \mathbb{N}$, which is to say $\alpha_n \lesssim \beta_n$. But by Proposition 3.3 this would imply that $H(K_{A,\Phi}) \subset H(K_{B,\Phi})$, which would contradict the assumption that $H(K_{B,\Phi})$ is a proper subset of $H(K_{A,\Phi})$. This completes the proof.

Remark 3.6. Let $H(R)$ be another separable RKHS of functions on $T$. The RKHSs $H(K)$ and $H(R)$ are simultaneously diagonalisable if there exists an orthonormal basis $(\phi_n)_{n=1}^{\infty}$ of $H(K)$ which is an orthogonal basis of $H(R)$. That is, $(\|\phi_n\|_R^{-1}\phi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(R)$ and consequently $H(R) = H(K_{A,\Phi})$ for the scaling with $\alpha_n = \|\phi_n\|_R^{-2}$.

We conclude this section by demonstrating that scaled RKHSs generalise powers of RKHSs. We say that a $\Phi$-scaling $A_\rho = (\alpha_n)_{n=1}^{\infty}$ of $H(K)$ is $\rho$-hyperharmonic if $\alpha_n = n^\rho$ for some $\rho \geq 0$. The RKHS $H(K_{A_\rho,\Phi})$ is a $\rho$-hyperharmonic scaled RKHS. Recall from Section 2 that if $T$ is a compact metric space, $K$ is continuous on $T \times T$, and $\nu$ is a finite and strictly positive Borel measure on $T$, then the integral operator in (2.1) has eigenfunctions $(\psi_n)_{n=1}^{\infty}$ and decreasing positive eigenvalues $(\lambda_n)_{n=1}^{\infty}$, and $\Psi = (\sqrt{\lambda_n}\psi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(K)$. For $\theta > 0$ the kernel $K^{(\theta)}(t, t') = \sum_{n=1}^{\infty} \lambda_n^\theta \psi_n(t)\psi_n(t')$ is the $\theta$th power of $K$ and its RKHS $H(K^{(\theta)})$ the $\theta$th power of $H(K)$. These objects are well-defined if $\sum_{n=1}^{\infty} \lambda_n^\theta \psi_n(t)^2 < \infty$ for all $t \in T$. We immediately recognise that $K^{(\theta)}$ equals the scaled kernel $K_{A,\Psi}$ for the scaling $A = (\lambda_n^{\theta-1})_{n=1}^{\infty}$ because
$$ K_{A,\Psi}(t, t') = \sum_{n=1}^{\infty} \lambda_n^{\theta-1} \lambda_n \psi_n(t) \psi_n(t') = \sum_{n=1}^{\infty} \lambda_n^{\theta} \psi_n(t) \psi_n(t') = K^{(\theta)}(t, t'). $$
Polynomially decaying eigenvalues ($\lambda_n \asymp n^{-p}$ for some $p > 0$) are an important special case. For example, this holds with $p = 2s/d$ if $T \subset \mathbb{R}^d$ and $H(K)$ is norm-equivalent to the Sobolev space $W_2^s(T)$ for $s > d/2$ [36, p. 370]. If $\lambda_n \asymp n^{-p}$ and $\rho = p(1 - \theta)$, the $\rho$-hyperharmonic scaled RKHS is norm-equivalent to the power RKHS $H(K^{(\theta)})$ by Corollary 3.4, since $n^\rho \lambda_n \asymp n^{-\theta p} \asymp \lambda_n^\theta$.

4. Sample Path Properties

This section contains the main results of the article. First, we consider a specialisation to scaled RKHSs of a theorem originally proved by Driscoll [9] and later generalised by Lukić and Beder [24]. Then we define general sample support sets and characterise them for $\sigma$-algebras generated by scalings of $H(K)$.

4.1. Domination and Generalised Driscoll’s Theorem for Scaled RKHSs

A kernel $R$ on $T$ dominates $K$ if $H(K) \subset H(R)$. In this case there exists [24, Theorem 1.1] a unique linear operator $L\colon H(R) \to H(K)$, called the dominance operator, whose range is contained in $H(K)$ and which satisfies $\langle f, g \rangle_R = \langle Lf, g \rangle_K$ for all $f \in H(R)$ and $g \in H(K)$. The dominance is said to be nuclear, denoted $R \gg K$, if $H(R)$ is separable and the operator $L$ is nuclear, which is to say that
$$ \operatorname{tr}(L) = \sum_{n=1}^{\infty} \langle L\psi_n, \psi_n \rangle_R < \infty \tag{4.1} $$
for any orthonormal basis $(\psi_n)_{n=1}^{\infty}$ of $H(R)$.³

³A change of basis shows that $\operatorname{tr}(L)$ does not depend on the orthonormal basis.

Define the pseudometric $d_R(t, t') = \|R(\cdot, t) - R(\cdot, t')\|_R = \sqrt{R(t, t) - 2R(t, t') + R(t', t')}$ on $T$. If $R$ is positive-definite, $d_R$ is a metric. However, positive-definiteness is not necessary for $d_R$ to be a metric. For example, the Brownian motion kernel $R(t, t') = \min\{t, t'\}$ on $T = [0, 1]$ is only positive-semidefinite because $R(t, 0) = 0$ for every $t \in T$ but nevertheless yields a metric because $d_R(t, t') = \sqrt{t - 2\min\{t, t'\} + t'}$ vanishes if and only if $t = t'$. See [24, Section 4] for more properties of $d_R$. Injectivity of the mapping $t \mapsto R(\cdot, t)$ is equivalent to $d_R$ being a metric.

By the following theorem, a special case of the zero-one law of Kallianpur [15,21] and a generalisation by Lukić and Beder [24, Theorem 7.5] of an earlier result by Driscoll [9, Theorem 3], the nuclear dominance condition determines whether or not the samples of a Gaussian process $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ lie in $H(R)$. In particular, the probability of them being in $H(R)$ is always either one or zero.

Theorem 4.1 (Generalised Driscoll's theorem). Let $R$ be a continuous kernel on $T \times T$ with separable RKHS and $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$. If $d_R$ is a metric, then either
$$ \mathbb{P}[X \in H(R)] = 0 \ \text{ and } \ R \not\gg K \quad \text{or} \quad \mathbb{P}[X \in H(R)] = 1 \ \text{ and } \ R \gg K. $$

Proof. Theorem 7.5 in [24] is otherwise identical except that $R$ is not assumed $d_T$-continuous and the samples of $X$ are assumed $d_R$-continuous. However, when $R$ is $d_T$-continuous, $d_T$-continuity of the samples, which is one of our standing assumptions, implies their $d_R$-continuity.

Summability of the reciprocal scaling coefficients controls whether or not a scaled RKHS contains the sample paths.

Lemma 4.2. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $A = (\alpha_n)_{n=1}^{\infty}$ a $\Phi$-scaling of $H(K)$. Let $R = K_{A,\Phi}$. If $d_K$ is a metric, then so is $d_R$.

Proof. Because $d_K$ is a metric, $d_K(t, t')^2 = K(t, t) - 2K(t, t') + K(t', t') = \sum_{n=1}^{\infty} [\phi_n(t) - \phi_n(t')]^2$ vanishes if and only if $t = t'$. Since $d_R(t, t')^2 = \sum_{n=1}^{\infty} \alpha_n [\phi_n(t) - \phi_n(t')]^2$ and the $\alpha_n$ are positive, we conclude that $d_R(t, t') = 0$ if and only if $t = t'$.

Theorem 4.3. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$, $A = (\alpha_n)_{n=1}^{\infty}$ a $\Phi$-scaling of $H(K)$, and $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$. If $K_{A,\Phi}$ is continuous and $d_K$ is a metric, then either
$$ \mathbb{P}[X \in H(K_{A,\Phi})] = 0 \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} = \infty \quad \text{or} \quad \mathbb{P}[X \in H(K_{A,\Phi})] = 1 \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} < \infty. $$

Proof. Assume first that the scaling is such that $H(K) \subset H(K_{A,\Phi})$. It is easy to verify using Proposition 3.2 that the dominance operator $L\colon H(K_{A,\Phi}) \to H(K)$ is given by $Lf = \sum_{n=1}^{\infty} f_n \alpha_n^{-1} \phi_n$ for any $f = \sum_{n=1}^{\infty} f_n \phi_n \in H(K_{A,\Phi})$. Because $(\sqrt{\alpha_n}\phi_n)_{n=1}^{\infty}$ is an orthonormal basis of $H(K_{A,\Phi})$ and $L(\sqrt{\alpha_n}\phi_n) = \alpha_n^{-1/2}\phi_n$, the nuclear dominance condition (4.1) is
$$ \operatorname{tr}(L) = \sum_{n=1}^{\infty} \big\langle \alpha_n^{-1/2}\phi_n, \alpha_n^{1/2}\phi_n \big\rangle_{K_{A,\Phi}} = \sum_{n=1}^{\infty} \langle \phi_n, \phi_n \rangle_{K_{A,\Phi}} = \sum_{n=1}^{\infty} \frac{1}{\alpha_n}, $$
and the claim follows from Theorem 4.1 since Lemma 4.2 guarantees that $d_R$ for $R = K_{A,\Phi}$ is a metric. Assume then that $H(K) \not\subset H(K_{A,\Phi})$. It is trivial that $K_{A,\Phi} \not\gg K$. Thus $\mathbb{P}[X \in H(K_{A,\Phi})] = 0$. If we had $\sum_{n=1}^{\infty} \alpha_n^{-1} < \infty$, then it would necessarily hold that $\sup_{n \geq 1} \alpha_n^{-1} < \infty$ and consequently $H(K) \subset H(K_{A,\Phi})$ by Proposition 3.3, which is a contradiction. Therefore $\sum_{n=1}^{\infty} \alpha_n^{-1} = \infty$.
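The trace computation above can be illustrated with a quick Monte Carlo experiment of ours (with illustrative hyperharmonic scalings): under the Karhunen–Loève heuristic of Section 1.1, the truncated squared $K_{A,\Phi}$-norm of a sample is $\sum_{n \leq N} \zeta_n^2 / \alpha_n$, whose expectation is exactly the partial sum of $\sum_n \alpha_n^{-1} = \operatorname{tr}(L)$.

```python
import numpy as np

rng = np.random.default_rng(7)

def mean_truncated_norm(alpha, n_samples=20):
    """Average truncated squared K_{A,Phi}-norm sum_n zeta_n^2 / alpha_n of
    Karhunen-Loeve samples; its expectation is the partial sum of sum_n 1 / alpha_n."""
    total = 0.0
    for _ in range(n_samples):
        zeta = rng.standard_normal(alpha.size)
        total += float(np.sum(zeta ** 2 / alpha))
    return total / n_samples

for N in (100, 10_000, 1_000_000):
    n = np.arange(1.0, N + 1)
    print(N, mean_truncated_norm(n), mean_truncated_norm(n ** 2))
# With alpha_n = n (sum 1/alpha_n divergent) the averages grow like log N, so the
# samples almost surely fall outside H(K_{A,Phi}); with alpha_n = n^2 (convergent)
# they stabilise near pi^2 / 6, matching the dichotomy of Theorem 4.3.
```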

4.2. Sample Support Sets

Theorems 4.1 and 4.3 motivate us to define the sample support set of a Gaussian process with respect to a collection of kernels as the largest set on the "boundary" between their induced RKHSs of probabilities one and zero. Let $\mathcal{R}$ be a collection of continuous kernels $R$ on $T$ for which $d_R$ is a metric and $H(\mathcal{R})$ the corresponding set of RKHSs. Every element of $H(\mathcal{R})$ is a subset of $C(T)$. By the generalised Driscoll's theorem each element of $H(\mathcal{R})$ has $\mu_X$-measure one or zero, depending on the nuclear dominance condition. Define the disjoint sets
$$ \mathcal{R}_1(K) = \{R \in \mathcal{R} : R \gg K\} \quad \text{and} \quad \mathcal{R}_0(K) = \{R \in \mathcal{R} : R \not\gg K\} $$
which partition $\mathcal{R}$. We assume that both $\mathcal{R}_1(K)$ and $\mathcal{R}_0(K)$ are non-empty and introduce the notion of a sample support set.

Definition 4.4 (Sample support set). Let $\mathcal{S}(\mathcal{R}) = \sigma(H(\mathcal{R}))$ be the $\sigma$-algebra generated by $H(\mathcal{R})$. The sample support set, $S_{\mathcal{R}}(K)$, of the Gaussian process $(X(t))_{t \in T} \sim \mathrm{GP}(0, K)$ with respect to $\mathcal{R}$ is the largest subset of $C(T)$ such that $S_{\mathcal{R}}(K) \subset H$ for every $H \in \mathcal{S}(\mathcal{R})$ such that $\mu_X(H) = 1$.

Proposition 4.5. It holds that
$$ S_{\mathcal{R}}(K) = \bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) \setminus \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0). \tag{4.2} $$

Proof. Suppose that there is $f \in S_{\mathcal{R}}(K)$ which is not contained in the set on the right-hand side of (4.2). That is, we have either $f \notin \bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1)$ or $f \in \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0)$. In the former case there is $R_1 \in \mathcal{R}_1(K)$ such that $f \notin H(R_1)$. But because $\mu_X(H(R_1)) = 1$ and $S_{\mathcal{R}}(K) \subset H(R_1)$ by definition, this violates the assumption that $f \in S_{\mathcal{R}}(K)$. In the latter case there is $R_0 \in \mathcal{R}_0(K)$ such that $f \in H(R_0)$. As $\mu_X(H(R_0)) = 0$, we have for any $R_1 \in \mathcal{R}_1(K)$ that $\mu_X(H(R_1) \setminus H(R_0)) = 1$. But since $f \notin H(R_1) \setminus H(R_0)$, the assumption that $f \in S_{\mathcal{R}}(K)$ is again violated and we conclude that $S_{\mathcal{R}}(K) \subset \bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) \setminus \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0)$.

Since all elements of $H(\mathcal{R})$ are either of measure zero or one, so are those of $\mathcal{S}(\mathcal{R})$. It is therefore clear that $\bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) \setminus \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0)$ is contained in every $H \in \mathcal{S}(\mathcal{R})$ such that $\mu_X(H) = 1$. Consequently, $\bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) \setminus \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0) \subset S_{\mathcal{R}}(K)$. This concludes the proof.

The sample support set is the largest set which is contained in every set of probability one under the law of $X$ that can be expressed in terms of countably many elementary set operations of the RKHSs $H(R)$ for $R \in \mathcal{R}$. The larger $\mathcal{R}$ is, the more precisely $S_{\mathcal{R}}(K)$ describes the samples of $X$. But there is an important caveat. If $\mathcal{R}$ is countable, the sample support set is in the $\sigma$-algebra $\mathcal{B}$, defined in (1.3), and has $\mu_X$-measure one. However, when $\mathcal{R}$ is uncountable and does not contain countable subsets $\mathcal{R}_1'(K) \subset \mathcal{R}_1(K)$ and $\mathcal{R}_0'(K) \subset \mathcal{R}_0(K)$ such that
$$ \bigcap_{R_1 \in \mathcal{R}_1(K)} H(R_1) = \bigcap_{R_1 \in \mathcal{R}_1'(K)} H(R_1) \quad \text{and} \quad \bigcup_{R_0 \in \mathcal{R}_0(K)} H(R_0) = \bigcup_{R_0 \in \mathcal{R}_0'(K)} H(R_0), $$
it cannot be easily determined whether $S_{\mathcal{R}}(K)$ is an element of $\mathcal{B}$.

We are mainly interested in the sample support set with respect to the collection $\mathcal{R}$ consisting of all scaled kernels (and will in Theorem 4.8 characterise this set). It is nevertheless conceivable that one may want to, or be forced to, work with a less rich set of kernels, scaled or not, and with such an eventuality in mind we have introduced the more general concept of a sample support set. If $\mathcal{R}$ is a collection of scaled kernels, the sample support set takes a substantially more concrete form. For this purpose we introduce the concept of an approximately constant sequence, which is inspired by the results collected in [19, § 41].

Definition 4.6 (Approximately constant sequence). Let $\Sigma$ be a collection of non-negative sequences. A non-negative sequence $(a_n)_{n=1}^{\infty}$ is said to be $\Sigma$-approximately constant if for every $(b_n)_{n=1}^{\infty} \in \Sigma$ the series $\sum_{n=1}^{\infty} b_n$ and $\sum_{n=1}^{\infty} a_n b_n$ either both converge or both diverge.

We mention two properties of approximately constant sequences: (i) If $(a_n)_{n=1}^{\infty}$ and $(a_n')_{n=1}^{\infty}$ are two $\Sigma$-approximately constant sequences, then so is their sum. (ii) The larger $\Sigma$ is, the fewer $\Sigma$-approximately constant sequences there are. That is, if $\Sigma_1$ and $\Sigma_2$ are two collections of non-negative sequences such that $\Sigma_1 \subset \Sigma_2$, then a non-negative sequence is $\Sigma_1$-approximately constant if it is $\Sigma_2$-approximately constant.

For the RKHS $H(K)$ and any of its orthonormal bases $\Phi = (\phi_n)_{n=1}^{\infty}$ we let $\mathcal{R}(\Sigma, \Phi)$ denote the set of all functions $f = \sum_{n=1}^{\infty} f_n \phi_n$ such that the series converges pointwise on $T$ and $(f_n^2)_{n=1}^{\infty}$ is a $\Sigma$-approximately constant sequence. The following theorem provides a crucial connection between sample support sets with respect to scaled kernels and functions defined as orthonormal expansions with approximately constant coefficients.

Theorem 4.7. Let $\Phi = (\phi_n)_{n=1}^{\infty}$ be an orthonormal basis of $H(K)$ and $\Sigma_\Phi$ a collection of $\Phi$-scalings of $H(K)$ such that the corresponding scaled kernels are continuous. Suppose that $d_K$ is a metric and let $\mathcal{R} = \{K_{A,\Phi} : A \in \Sigma_\Phi\}$. Then $S_{\mathcal{R}}(K) = \mathcal{R}(\Sigma_\Phi, \Phi)$.

Proof. Note first that, by Lemma 4.2, $d_R$ is a metric for every $R \in \mathcal{R}$. Because every scaling of $H(K)$ has an orthonormal basis that is a scaled version of $(\phi_n)_{n=1}^{\infty}$, every $f \in S_{\mathcal{R}}(K)$ can be written as $f = \sum_{n=1}^{\infty} f_n \phi_n$ for some real coefficients $f_n$. Let $\Sigma_1(K)$ and $\Sigma_0(K)$ stand for the collections of $(\alpha_n)_{n=1}^{\infty} \in \Sigma_\Phi$ such that $\sum_{n=1}^{\infty} \alpha_n^{-1} < \infty$ and $\sum_{n=1}^{\infty} \alpha_n^{-1} = \infty$, respectively. Then, by Theorem 4.3, $K_{A,\Phi} \in \mathcal{R}_1(K)$ if $A \in \Sigma_1(K)$ and $K_{A,\Phi} \in \mathcal{R}_0(K)$ if $A \in \Sigma_0(K)$. Because, by definition, $S_{\mathcal{R}}(K) \subset H(K_{A,\Phi})$ for any $A \in \Sigma_1(K)$ and $S_{\mathcal{R}}(K) \cap H(K_{A,\Phi}) = \emptyset$ for any $A \in \Sigma_0(K)$, it follows that for every $f \in S_{\mathcal{R}}(K)$ and any $(\alpha_n)_{n=1}^{\infty} \in \Sigma_\Phi$ we have
$$ \sum_{n=1}^{\infty} \frac{f_n^2}{\alpha_n} < \infty \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} < \infty \quad \text{or} \quad \sum_{n=1}^{\infty} \frac{f_n^2}{\alpha_n} = \infty \ \text{ and } \ \sum_{n=1}^{\infty} \frac{1}{\alpha_n} = \infty. $$
That is, $(f_n^2)_{n=1}^{\infty}$ is a $\Sigma_\Phi$-approximately constant sequence and thus $S_{\mathcal{R}}(K) \subset \mathcal{R}(\Sigma_\Phi, \Phi)$. Conversely, if $f \in \mathcal{R}(\Sigma_\Phi, \Phi)$, then $f \in H(K_{A,\Phi})$ for every $A \in \Sigma_1(K)$ and $f \notin H(K_{A,\Phi})$ for every $A \in \Sigma_0(K)$. Hence $f \in S_{\mathcal{R}}(K)$ and thus $S_{\mathcal{R}}(K) = \mathcal{R}(\Sigma_\Phi, \Phi)$.

Next we use Theorem 4.7 to describe the sample support set more concretely.

4.3. Sample Support Sets for Scaled RKHSs

Let $\Sigma$ be the set of all positive sequences. Then the collection of $\Sigma$-approximately constant sequences is precisely the collection of non-negative sequences $(a_n)_{n=1}^{\infty}$ such that
$$ \liminf_{n \to \infty} a_n > 0 \quad \text{and} \quad \sup_{n \geq 1} a_n < \infty. \tag{4.3} $$
For suppose that there existed a $\Sigma$-approximately constant sequence $(a_n)_{n=1}^{\infty}$ that violated (4.3). If $\liminf_{n \to \infty} a_n = 0$, then there is a subsequence $(a_{n_m})_{m=1}^{\infty}$ such that $a_{n_m} \leq 2^{-m}$ for all $m \in \mathbb{N}$. Let $(b_n)_{n=1}^{\infty} \in \Sigma$ be a sequence such that $b_n = 2^{-n} a_n^{-1}$ for $n \notin (n_m)_{m=1}^{\infty}$ and $b_{n_m} = 1$ for $m \in \mathbb{N}$. Then $\sum_{n=1}^{\infty} b_n$ diverges but
$$ \sum_{n=1}^{\infty} a_n b_n = \sum_{n \notin (n_m)_{m=1}^{\infty}} a_n b_n + \sum_{m=1}^{\infty} a_{n_m} b_{n_m} \leq \sum_{n \notin (n_m)_{m=1}^{\infty}} 2^{-n} + \sum_{m=1}^{\infty} 2^{-m} < \infty, $$
which contradicts the assumption that $(a_n)_{n=1}^{\infty}$ is a $\Sigma$-approximately constant sequence. A similar argument (with $a_{n_m} \geq 2^m$, $b_n = 2^{-n}$, and $b_{n_m} = 2^{-m}$) shows that the second condition in (4.3) cannot be violated either; thus every $\Sigma$-approximately constant sequence satisfies (4.3). A sequence satisfying (4.3) is trivially $\Sigma$-approximately constant because the conditions imply the existence of constants $0 < c_1 \leq c_2$ such that $c_1 \leq a_n \leq c_2$ for all sufficiently large $n$. This, together with Theorem 4.7, yields the following theorem, which we consider the main result of this article. The full proof is more complicated than the above argument as we cannot assume that every positive sequence is a scaling of $H(K)$.
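Condition (4.3) can be probed numerically for explicitly given coefficient sequences. The helper below is a hypothetical utility of ours, with an illustrative geometric index grid and tolerances; it can only ever be a heuristic because, as noted above, there is no sharp boundary between convergent and divergent series.

```python
import numpy as np

def looks_approximately_constant(f, k_max=40, lower=1e-10, upper=1e10):
    """Heuristic probe of condition (4.3): evaluate f_n^2 along the geometric grid
    n = 2^k and check that it stays within [lower, upper]. The grid and tolerances
    are illustrative; a finite check can never certify liminf or sup conditions."""
    n = 2.0 ** np.arange(1, k_max + 1)
    sq = f(n) ** 2
    return bool(sq.min() > lower and sq.max() < upper)

print(looks_approximately_constant(lambda n: 2.0 + np.sin(n)))   # bounded away: True
print(looks_approximately_constant(lambda n: 1.0 / n))           # f_n^2 -> 0: False
print(looks_approximately_constant(lambda n: np.sqrt(n)))        # f_n^2 -> infinity: False
print(looks_approximately_constant(lambda n: 1.0 / np.log(n)))   # decays, but so slowly
# that the finite probe still returns True: a numerical echo of the fact that no
# finite amount of data separates convergent from divergent behaviour.
```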
