
THE INVERSE PROBLEM OF HISTORY MATCHING

A Probabilistic Framework for Reservoir Characterization and Real Time Updating

BY

Júlio Hoffimann Mendes

A dissertation submitted in partial fulfillment of the requirements for the degree of

Master of Science in

Civil Engineering

in the

Department of Civil Engineering
Federal University of Pernambuco

Recife, PE 50670-901


advisors

Ramiro Brito Willmersdorf
Ézio da Rocha Araújo

committee

Alexandre Anozé Emerick
Bernardo Horowitz

Júlio Hoffimann Mendes: The Inverse Problem of History Matching, A Probabilistic Framework for Reservoir Characterization and Real Time Updating



Catalogação na fonte

Bibliotecária Margareth Malta, CRB-4 / 1198

M538i Mendes, Júlio Hoffimann.

The inverse problem of history matching, a probabilistic framework for reservoir characterization and real time updating / Júlio Hoffimann Mendes. - Recife: O Autor, 2014.

xv, 121 folhas, il., gráfs., tabs.

Orientador: Prof. Dr. Ramiro Brito Willmersdorf. Coorientador: Prof. Dr. Ézio da Rocha Araújo.

Dissertação (Mestrado) – Universidade Federal de Pernambuco. CTG. Programa de Pós-Graduação em Engenharia Civil, 2014.

Inclui Referências.

1. Engenharia civil. 2. Simulação de Reservatórios. 3. Ajuste ao Histórico. I. Willmersdorf, Ramiro Brito. (Orientador). II. Araújo, Ézio da Rocha. (Coorientador). III. Título.

UFPE


Dedicated to my family and friends for all their love.

To one of the pioneers of this theory, Albert Tarantola (1949–2009).


RESUMO

Em Engenharia de Petróleo e outras áreas da ciência, Mitigação de Incertezas baseada em Histórico (MIH) é o termo moderno usado por especialistas ao se referirem a ajustes contínuos de um modelo matemático dadas observações. Tais ajustes têm maior valor quando acompanhados de diagnósticos que incluem intervalos de confiança, momentos estatísticos e, idealmente, caracterização completa das distribuições de probabilidade associadas.

Neste trabalho, o bastante conhecido problema de ajuste ao histórico em campos de petróleo é revisado sob uma perspectiva Bayesiana que leva em consideração toda possível fonte de incerteza teórica ou experimental. É uma aplicação direta da metodologia geral desenvolvida por Albert Tarantola no seu livro intitulado “Inverse Problem Theory and Methods for Model Parameter Estimation”.

Nosso objetivo é fornecer a pesquisadores da área de Óleo & Gás um software escrito em uma linguagem de programação moderna (i. e. Python) que possa ser facilmente modificado para outras aplicações; realizar a inversão probabilística com dezenas de milhares de células como uma prova de conceito; e desenvolver casos de estudo reproduzíveis para que outros interessados neste tema possam realizar “benchmarks” e sugerir melhoramentos.

Diferentemente de outros métodos de sucesso para MIH como Ensemble Kalman Filters (EnKF), o método proposto, denominado Ensemble MCMC (EnMCMC), não assume distribuições a priori Gaussianas. Pode ser entendido como uma cadeia de Markov de ensembles e, teoricamente, é capaz de lidar com qualquer distribuição de probabilidade multimodal.

Dois casos de estudo sintéticos são implementados em um cluster de computação de alto desempenho usando o modelo MPI de execução paralela para distribuir as diversas simulações de reservatório em diferentes nós computacionais. Resultados mostram que a implementação falha em amostrar a distribuição a posteriori, mas que ainda pode ser utilizada na obtenção de estimativas maximum a posteriori (MAP) sem fortes hipóteses a respeito dos dados (e. g. a priori Gaussianas).

Palavras-chave: simulação de reservatórios, ajuste ao histórico.


ABSTRACT

In Petroleum Engineering and other fields, History-based Uncertainty Mitigation (HUM) is the modern generic term used by experts when referring to continuous adjustments of a mathematical model given observations. Such adjustments have greater value when accompanied by uncertainty diagnostics including statistical bounds, moments, and ideally full characterization of underlying distributions.

In this work, the well-known, ill-posed history matching problem is reviewed under a Bayesian perspective which takes into account every possible source of experimental and theoretical uncertainty. It is a direct application of the general framework developed by Albert Tarantola in his textbook entitled “Inverse Problem Theory and Methods for Model Parameter Estimation”.

Our aim is to provide other researchers in the field with a software package written in a modern programming language (i. e. Python) that can be easily adapted to their needs; to perform probabilistic inversion with tens of thousands of cells as a proof of concept; and to develop reproducible case studies for others to benchmark and propose improvements.

Unlike other successful HUM methods such as Ensemble Kalman Filters (EnKF), the proposed method, here called Ensemble MCMC (EnMCMC), does not assume Gaussian priors. It can be thought of as a Markov chain of ensembles and can theoretically deal with any (multimodal) probability distribution.

Two synthetic case studies are implemented in a time-shared high-performance computer cluster using the MPI parallel execution model to distribute reservoir simulations among various computational nodes. Results show the implementation fails to sample the posterior distribution but can still be used to obtain classical maximum a posteriori (MAP) estimates without strong assumptions on the data (e. g. Gaussian priors).

Keywords: reservoir simulation, history matching.


PUBLICATIONS

Some ideas and figures have appeared in: Mendes, Willmersdorf, and Araújo [2013a]; Mendes, Willmersdorf, and Araújo [2013b]; Mendes, Willmersdorf, and Araújo [2013c].

Software implementations of the methods presented in this thesis are publicly available at https://github.com/juliohm/HUM. Refer to Appendix B for additional code snippets discussed throughout the text.

Please cite this work if you plan to use the software.


Old wood best to burn, old wine to drink, old friends to trust, and old authors to read. Francis Bacon

ACKNOWLEDGEMENTS

My family and friends are quite important to me. I won't list them all, but I thank my parents for educating me with precious principles. Rigoberto Mendes da Silva & Lindevany Hoffimann de Lima Mendes, I love you. A special thanks goes to my father for his unconditional support and presence.

I would like to thank my advisor, Ramiro Brito Willmersdorf, for pushing me towards hard problems, for his harsh sincerity against my mistakes and for his friendship along these years. He is certainly one of the greatest finds of my career and has definitely shaped who I am today.

Although rare, the long philosophical discussions at the campus with my co-advisor, Ézio da Rocha Araújo, were inspiring. His vast knowledge on various subjects and clear understanding of the fundamentals contributed to many of the insights I had.

Thanks to Alexandre Emerick, Flávia Pacheco and Régis Romeu from CENPES for their trust in our research; to PETROBRAS for the financial support mediated by the PFRH-26 program; and to CENAPAD-PE for providing computational resources.


CONTENTS

I inverse problem theory

1 basic concepts
  1.1 What is an inverse problem?
  1.2 Why inverse problems are hard?
  1.3 The maximum likelihood principle
  1.4 Tarantola's postulate
  1.5 Classical vs. probabilistic framework

2 classical framework
  2.1 Basic taxonomy for inverse problems
  2.2 Linear regression and the least-squares estimate
  2.3 Tikhonov regularization
  2.4 Levenberg-Marquardt solution to nonlinear regression

3 probabilistic framework
  3.1 Definition of probability
  3.2 States of information
  3.3 Bayesian inversion
  3.4 Ensemble Markov chain Monte Carlo

II history matching

4 prelude
  4.1 Problem description
  4.2 Case studies
  4.3 Comments on reproducibility

5 channelized reservoir
  5.1 Setting priors
  5.2 Probabilistic inversion
  5.3 Analysis of the results

6 brugge field
  6.1 Setting priors
  6.2 Probabilistic inversion
  6.3 Analysis of the results

7 conclusion
  7.1 General comments
  7.2 Technical difficulties
  7.3 Suggested improvements

III appendix

a omitted proofs
  a.1 The majority of inverse problems is ill-posed
  a.2 Maximum likelihood estimation for i.i.d. Gaussians
  a.3 System of equations for discrete linear inverse problems
  a.4 Maximum likelihood and least-squares
  a.5 Weighted linear least-squares estimate
  a.6 Levenberg-Marquardt gradient and Hessian
  a.7 Conditional probability by conjunction of states
  a.8 Kernel density estimation as a convolution

b code snippets
  b.1 Iteratively reweighted least-squares
  b.2 Least absolute shrinkage and selection operator
  b.3 Metropolis algorithm
  b.4 Online Bayesian inversion
  b.5 Histogram fitting with kernel density estimation

c kernel pca
  c.1 Kernel Gramian matrix
  c.2 Centering in the feature space
  c.3 Eigenproblem and normalization
  c.4 Preimage problem


LIST OF FIGURES

Figure 1.1 Geostatistical inference
Figure 1.2 Non-bijective map G : M → D
Figure 1.3 3D non-convex surfaces
Figure 2.1 Univariate Gaussians
Figure 2.2 Polynomial regression models
Figure 2.3 Bullet trajectory prediction
Figure 2.4 Ridge regression
Figure 2.5 L-curve criterion
Figure 2.6 L1-norm regression
Figure 2.7 Numerical simulator as a black box
Figure 3.1 Homogeneous distribution for Jeffreys parameters
Figure 3.2 Disjunction for producing histograms
Figure 3.3 Impact clouds on a cathodic screen
Figure 3.4 The p-event of an interval for a Jeffreys parameter
Figure 3.5 Conditioning probability densities
Figure 3.6 Dirac delta function δ(x; x0)
Figure 3.7 Nescience towards Omniscience
Figure 3.8 2D Gaussian N(m_prior, C_M)
Figure 3.9 Uncertainty in the forward operator
Figure 3.10 Graph for a 3-state machine
Figure 3.11 Limiting behavior for a 3-state Markov chain
Figure 3.12 Histogram for a Gaussian mixture
Figure 3.13 Trace and autocorrelation for a Gaussian mixture
Figure 3.14 Probabilistic inversion of y = x²
Figure 3.15 KDE for a 2D Gaussian
Figure 3.16 Anisotropic joint distribution
Figure 3.17 Stretch move θ_k^(t) → θ_k^(t+1)
Figure 4.1 Mind map of algorithms and concepts
Figure 4.2 Training image of size 250x250 pixels
Figure 4.3 Ten-spot well configuration
Figure 4.4 Oil/water saturation within channelized reservoir
Figure 4.5 Production history for channelized reservoir
Figure 4.6 Brugge field
Figure 4.7 Top view of Brugge field realization
Figure 4.8 Brugge field permeability curves
Figure 4.9 Brugge field oil production history
Figure 5.1 Filtersim realizations
Figure 5.2 kPCA for increasing polynomial kernel degrees
Figure 5.3 Stretch move: bean plot of prior and posterior log-probabilities
Figure 5.4 KDE move: bean plot of prior and posterior log-probabilities
Figure 5.5 KDE move: production history for the prior ensemble
Figure 5.6 KDE move: production history for the posterior ensemble
Figure 5.7 KDE move: acceptance fraction for each walker
Figure 5.8 KDE move: 25 most probable images in prior ensemble
Figure 5.9 KDE move: 25 most probable images in posterior ensemble
Figure 5.10 KDE move: maximum a posteriori estimate
Figure 5.11 Filtersim: bean plot of prior and posterior log-probabilities
Figure 5.12 Filtersim: acceptance fraction for each walker
Figure 5.13 Filtersim: 25 most probable images in posterior ensemble
Figure 5.14 Filtersim: production history for the posterior ensemble
Figure 5.15 Filtersim: maximum a posteriori estimate
Figure 5.16 kPCA reconstruction
Figure 6.1 Prior on observations changing over time
Figure 6.2 KDE move: bean plot of prior and posterior log-probabilities
Figure 6.3 KDE move: acceptance fraction for each walker
Figure 6.4 KDE move: production history for the prior ensemble
Figure 6.5 KDE move: production history for the posterior ensemble


LIST OF TABLES

Table 4.1 Channelized reservoir summary table
Table 4.2 Brugge rock formations
Table 5.1 SSIM statistics for Filtersim and KDE-based proposals


LIST OF ALGORITHMS

2.1 Coordinate descent for sparse regularization
3.1 Metropolis-Hastings
3.2 Stretch move in Rⁿ


LISTINGS

Listing 5.1 Filtersim parameters file
Listing B.1 Iteratively reweighted least-squares
Listing B.2 Least absolute shrinkage and selection operator
Listing B.3 Metropolis algorithm
Listing B.4 Online Bayesian inversion
Listing B.5 Histogram fitting with kernel density estimation


ACRONYMS

SVM Support Vector Machine
MPS Multiple-Point Statistics
SVD Singular Value Decomposition
LASSO Least Absolute Shrinkage and Selection Operator
KKT Karush-Kuhn-Tucker
DFP Davidon-Fletcher-Powell
BFGS Broyden-Fletcher-Goldfarb-Shanno
MCMC Markov chain Monte Carlo
EnKF Ensemble Kalman Filter
MAP Maximum a Posteriori
KDE Kernel Density Estimation
RML Randomized Maximum Likelihood
K-L Karhunen-Loève
kPCA Kernel Principal Component Analysis
kMAF Kernel Maximum Autocorrelation Factor
MPI Message Passing Interface
GPGPU General-Purpose Computing on Graphics Processing Units
HUM History-based Uncertainty Mitigation
TPFA Two-Point Flux Approximation
EnMCMC Ensemble Markov chain Monte Carlo
SSIM Structural Similarity


Part I

INVERSE PROBLEM THEORY

In which the general problem of estimating parameters of a (physical) system based on real evidence (i. e. observations) is set up. The classical and probabilistic frameworks for the solution are briefly reviewed, emphasizing the very different questions they target.


1 BASIC CONCEPTS

For certainly it is excellent discipline for an author to feel that he must say all he has to say in the fewest possible words, or his reader is sure to skip them; and in the plainest possible words, or his reader will certainly misunderstand them. John Ruskin

1.1 What is an inverse problem?
1.2 Why inverse problems are hard?
1.3 The maximum likelihood principle
1.4 Tarantola's postulate
1.5 Classical vs. probabilistic framework

What is an inverse problem? Why is it so hard to solve? What to expect as a reasonable solution? All these questions should be addressed in the following sections.

Inverse problems are quite general and don't require concrete examples to be understood. Like the elementary concept of an inverse image (a. k. a. preimage) of a function, these problems are nothing but abstraction. Indeed, the solution of an inverse problem is an inverse image¹.

Without even knowing what an inverse problem is, the smart reader is probably speculating about ill-posedness and how to deal with it. There is no magic, and the most natural question² has to be adapted for approximate answers to be inferred, in a least-squares sense.

However, redesigning the questions can be surprisingly innovative and reward one’s mind with new points of view.

¹ Or a collection of images with known distribution.
² What is x = f⁻¹(y)?


1.1 what is an inverse problem?

Consider a function G : M → D that for each parameter m ∈ M associates a response d ∈ D. It's possible to conceptualize exactly three types of inference [4]:

FORWARD PROBLEM: Given m and G; find d = G(m).
INVERSE PROBLEM: Given d and G; find m such that G(m) = d.
SUPERVISED LEARNING: Given m and d; find G.

Regardless of the meaning of M and D, hereafter called model space and data space, or the complexity of G, the transfer function³, these problems have rather distinct tractability. The forward problem is the easiest, as its solution is obtained by direct function evaluation. Techniques for solving the inverse problem will be discussed throughout this thesis because, clearly, the inverse G⁻¹ may not exist or be available. Finally, supervised learning⁴ has been extensively studied as a Machine Learning subtopic, and various successful classification/regression techniques such as SVM were developed for solving it [5].

important note: In the literature, the term forward problem is also used to designate the inductive process of deriving physical laws from experiments. This is essentially a human quality, and it's much harder, if not impossible, for a computer to reproduce. In this text, the transfer function G is given.

For better understanding, the following concrete examples attach meaning to M, D and G. They're all practical applications.

Example 1.1 (Geostatistics)

A random field Z(x; ω) is a stochastic process on spatial coordinates x ∈ Rⁿ [6]. For the oil industry it characterizes the uncertainty on petrophysical properties in a reservoir model [7, 8, 9]. A very common task a professional in this area has to accomplish is conditional sampling:

³ Often representing an expensive simulation code.
⁴ It isn't the most intuitive name for the third type of inference; system identification problem is a possible alternative [4].


A spatial property is measured at a few locations within an acceptable deviation from the true (unknown) value. This hard data comes from boreholes as the product of laboratory experiments with cores and well tests, or is sometimes indirectly obtained from well logs, see Figure 1.1a. The task is to fill in the 3D reservoir model with the mentioned property honoring the hard data and the prior probability distribution of the field. A complementary step not considered in this example is soft data integration [10, 11].

Figure 1.1: Geostatistical inference on a 3D reservoir model. (a) Rock porosity at well locations. (b) Ordinary Kriging estimation.

The result of applying Ordinary Kriging to the Stanford VI data set [11] is shown in Figure 1.1b. Similar to Stanford V [12], this stratigraphic model was synthesized as a fluvial channel system with the purpose of extensively testing MPS algorithms for reservoir characterization.

In spite of being a neat deterministic solution, it presents non-physical smoothness [13]. This inverse problem is better solved by sequential simulation, which is a technique for drawing random values from univariate distributions sequentially built with the hard data [14]. In fact, this subject has been widely discussed [15, 16, 17]. Sequential simulation allows multiple realizations to be drawn to characterize the random field.

In mathematical language, m is defined as the flattened array of length n_x × n_y × n_z containing all porosity values of the 3D grid; d is of reduced length, the number of cells with hard data; and G is the "selection" operation: it simply discards grid locations that aren't in d, and is represented by a matrix whose entries are either 1 or 0.

Thus, a linear inverse problem of the form:

\[
d \;=\; \underbrace{\begin{pmatrix}
1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 1 & 0 & \cdots & 0 \\
\vdots & & & & & \ddots & \vdots
\end{pmatrix}}_{G}\; m
\tag{1.1}
\]

□

Example 1.2 (Well Testing)

For assessing a newly discovered reservoir under dynamic conditions, the freshly drilled pilot well is provisionally completed for controlled production during a short period of time, and then closed until hydrostatic equilibrium is reestablished.

The well pressure is registered along with the imposed production rate. The phenomenon can be analytically modelled by the diffusion equation with the appropriate boundary conditions for the transient regime⁵. The analytical solution

for vertical wells under constant production rate is [18, 19]:

\[
p_w = p_i - \frac{q\mu}{2\pi\kappa h}\left[\frac{1}{2}\,\mathrm{Ei}\!\left(\frac{\phi\mu c_t r_w^2}{4\kappa t}\right)\right]
\tag{1.2}
\]

with Ei(x) := ∫_x^∞ (e^{−ξ}/ξ) dξ, the exponential integral function. The ultimate goal of the test is to estimate the productivity index for the well in the long term by first estimating the rock permeability—the only unknown in Equation 1.2. Similar estimation is desired when in situ permanent sensors are installed [20]. □

Example 1.3 (Physical Measurements)

A collection of instruments is used for measuring parameters of a physical system. That may involve human interaction and/or environmental conditions that are impossible to control. As a result, instrument readings may be (and generally are) not precise, even though accurate⁶.

Suppliers of these instruments then provide a statistical analysis of the uncertainties involved in the measurement process that should be used to describe the output. For instance, if parameters m are to be measured by an instrument I : M → M, the supplier provides the conditional probability Pr(m_out | m) or the covariance C_M.

⁵ The reservoir has hypothetical infinite extension.


Ideally, the instrument is the identity function I ≡ Id_M, and the forward and inverse problem share the same solution. In the real world, it is common practice to assume the (Gaussian) error ε is independent of the input:

m_out = I(m) = m + ε   (1.3)

The inverse problem is to find the true values of the physical parameters m given the instrument measurements m_out or, under the latter assumption, to subtract m = m_out − ε. Note the error is a random variable. □

This work is concerned with the inverse problem of history matching, a much more computationally expensive problem that the oil industry is encouraging researchers to devote time to. It'll be introduced in the following chapters along with its theoretical and practical difficulties.

1.2 why inverse problems are hard?

Figure 1.2: Non-bijective map between model and data space.

Back in 1902, the French mathematician Jacques Hadamard (1865–1963) conceived the term well-posed problem to designate a very important concept that can be adapted for use in inverse problem theory. A boundary value problem in mathematical physics is said to be well-posed in the sense of Hadamard if it satisfies all of the following criteria [21]:

• There exists a solution
• The solution is unique
• The solution depends continuously on the data⁷


A problem that isn’t well-posed in this sense is termed ill-posed problem, see

Figure 1.2for an illustration.

Theorem 1. The majority of inverse problems is ill-posed.

Proof. Given two equinumerous sets |A| = |B|, the number of bijections from A to B never exceeds the number of non-bijections, except when both sets are empty or singletons.

Infinite case. Denote by f_{A→B}, b_{A→B} and n_{A→B} the number of functions, bijections and non-bijections from A to B, respectively. It follows that:

\[ b_{A\to B} \le f_{A\to B} = f_{A\setminus\{a\}\to B} \le n_{A\to B} \]

f_{A→B} = f_{A∖{a}→B} because A and A ∖ {a} have the same cardinality; f_{A∖{a}→B} ≤ n_{A→B} because a function f : A ∖ {a} → B can be extended by mapping f(a) ∈ f(A ∖ {a}), which is non-injective.

Finite case. |A| = |B| = n, so f_{A→B} = nⁿ, b_{A→B} = n!, n_{A→B} = nⁿ − n!.

Apart from the empty and singleton sets, for which only one (bijective) function is defined, the result (left for the reader) is proved by induction:

\[ n^n - n! > n! \qquad (\forall n > 2) \]

Put differently, well-posed inverse problems are accidental. For instance, assume the model and data space are finite; if |M| ≠ |D|, none of the inverse problems is well-posed; else |M| = |D| = n and the percentage⁸ n!/nⁿ → 0 vanishes very rapidly with n.

Not only ill-posed, inverse problems are sometimes computationally expensive within current solving strategies. If a more detailed characterization of the model space is desired and the transfer function is a demanding physical simulation, the wall time required for the solution increases considerably. To date, two frameworks coexist for the solution of inverse problems, the probabilistic being in general more expensive than the classical, for reasons that will become clear in the next sections.


1.3 the maximum likelihood principle

Given a statistical model for experiment outcomes x₁, x₂, …, x_m parameterized by θ, the likelihood function is a reinterpretation of the joint probability density f as if the observations were "fixed parameters":

\[ L(\theta \mid x_1, x_2, \ldots, x_m) \stackrel{\text{def}}{=} f(x_1, x_2, \ldots, x_m \mid \theta) \tag{1.4} \]

The maximum likelihood principle states the model parameters are to be set to maximize the experiment probability (i. e. an extremum estimator):

\[ \hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta \mid x_1, x_2, \ldots, x_m) \tag{1.5} \]

This is to say over all plausible parameters θ ∈ Θ, only θ̂ is of interest. For instance, if the univariate Gaussian model is assumed and the random variables are i.i.d., the parameters can be shown to match the sample mean and standard deviation (see Appendix A.2):

\[
\hat{\theta} = \begin{pmatrix} \hat{\mu} \\ \hat{\sigma}^2 \end{pmatrix}
= \begin{pmatrix} \tfrac{1}{m}\sum_i x_i \\ \tfrac{1}{m}\sum_i (x_i - \hat{\mu})^2 \end{pmatrix}
\tag{1.6}
\]

For inverse problems, the principle is interpreted likewise: find the parameters m̂ ∈ M that best honor the observed data d ∈ D through the forward operator G according to some loss function (e. g. ‖d − G(m)‖_{L2}):

\[ \hat{m} = \arg\min_{m \in M} \operatorname{loss}\big(d, G(m)\big) \tag{1.7} \]
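As a small numerical illustration (a sketch added here, not part of the thesis code), the closed-form estimates of Equation 1.6 can be checked against a direct numerical maximization of the Gaussian log-likelihood; the use of scipy.optimize.minimize, the sample size and the true parameters are assumptions of this example.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=1000)     # i.i.d. Gaussian outcomes

# closed-form maximum likelihood estimates (Equation 1.6)
mu_hat, var_hat = x.mean(), ((x - x.mean())**2).mean()

# numerical maximization of the likelihood (extremum estimator of Eq. 1.5)
def neg_loglik(theta):
    mu, log_sigma = theta
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_loglik, x0=[0.0, 0.0])
mu_num, var_num = res.x[0], np.exp(res.x[1])**2

print(mu_hat, var_hat)    # close to the true (1.5, 4.0)
print(mu_num, var_num)    # agrees with the closed-form estimates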

It can be, in general, a non-convex optimization problem over a high-dimensional space. In Example 1.1, the Kriging model has parameters with 6 million entries (i. e. cells) and the loss function is the mean square estimation error, or estimation variance, σ² = E[(Ẑ(x; ω) − Z(x; ω))²], with the estimator Ẑ(x; ω) := Σ_i λ_i Z(x_i; ω) a linear combination of the hard data.

The maximum likelihood principle is widely applied in statistics and real life engineering applications.


1.4 tarantola’s postulate

“The most general solution of an inverse problem provides a probability distribution over the model space.” – Albert Tarantola

In his book Inverse Problem Theory and Methods for Model Parameter Estimation [22], Tarantola argued that solutions to inverse problems are of greater value if they come attached to probability distributions. He developed a very clever theory in which all information available about the problem is modeled within a richer probabilistic framework. This dissertation is an attempt to apply this framework to the problem of history matching and, at worst, will serve to clarify which parts of it require further work for practical use by the oil industry.

A careful review of the theory is given in Chapter 3. For now, it’s sufficient to know that once a coordinate system is fixed for the data and model spaces, probability densities can be defined, and the Bayesian updating performed:

σ_M(m) = k ρ_M(m) L(m)   (1.8)

The posterior distribution over the model space, σ_M(m), is obtained from the prior ρ_M(m) by incorporating the likelihood L(m), which expresses how good the parameters m are in explaining the data. In Equation 1.8, k is a normalization constant.

Having a continually updated probability distribution for the parameters allows answering questions such as What is the most probable parameter?⁹ What is the probability of a parameter being in a certain range? and many others. Tarantola's framework is therefore, as previously mentioned, "richer" at providing the analyst a variety of insights.

However, it requires a higher level of abstraction from the implementer so as to identify and model the various states of information. Not to mention, information isn't always available, especially if it costs billions of dollars to be acquired.

The probabilistic framework is relatively recent; few researchers have employed it in large-scale problems to assess its efficiency.


1.5 classical vs. probabilistic framework

It’s clear that the classical framework, as the direct application of the maximum likelihood principle, does not address many important questions. As Tarantola once said: “This is not the solution; it is, rather, the mean of all possible solutions. Looking at this mean provides much less information than looking at a movie of realizations.”. Referring to the least-squares solution of an inverse problem [22].

It’s also clear that, by what was exposed in Section 1.4, the probabilistic frame-work is more general as it considers the maximum likelihood estimation as one of many possible post-processing steps. All the embedded information is preserved and accumulates as new data becomes available.

Following this reasoning, the comparison classical vs. probabilistic is unfair, nevertheless it is important to be raised. The point is to make it clear when it is worth applying one framework over another.

This dissertation endorses the Tarantola’s postulate with the justification being multimodal hypersurfaces in high-dimensional spaces.Figure 1.3is a collection of 3D synthetic surfaces for which the maximum likelihood estimation isn’t helpful.


The classical framework is the option for rough estimates, if no data is available to assemble representative distributions or if the problem is known to be unimodal beforehand.

Concerning the handling of ill-posedness, maximum likelihood estimation does it explicitly with the help of regularization techniques, whereas Bayesian updating guarantees the existence of a unique well-defined posterior. Alternatively, the probabilistic framework always gives a unique answer: the posterior distribution over the model space. This explicit vs. implicit ill-posedness treatment will be revisited in more depth when appropriate.

Algorithmically, Tarantola's framework relies on randomness for effective exploration of high-dimensional spaces. The classical approach uses, in general, deterministic optimization routines for solving slightly modified (i. e. biased) problems. In summary, what to expect as a reasonable solution? Notwithstanding a subjective question, a probability distribution over the model space is almost always a satisfactory answer.


2 CLASSICAL FRAMEWORK

If you don’t know anything about computers, just remember that they are machines that do exactly what you tell them but often surprise you in the result. Richard Dawkins

2.1 Basic taxonomy for inverse problems
2.2 Linear regression and the least-squares estimate
2.3 Tikhonov regularization
2.4 Levenberg-Marquardt solution to nonlinear regression

Likely to be employed by frequentists¹, the classical framework aims at finding the most probable parameters for the (physical) system that honor the observations. It consists of maximizing the likelihood function, no matter the prior state of information on the system². For the practical experimentalist, the framework often translates into minimizing the misfit on the training data, disregarding previous attempts or externally acquired knowledge.

As optimization is omnipresent, it makes sense to approach the framework by informally classifying which inverse problems reflect “good” objective functions and which don’t. A basic taxonomy is presented that groups inverse problems as discrete, continuous, linear and nonlinear.

It occurs that adding engineer-designed terms to the objective can improve the solution process and the estimation itself—a technique generically referred to as regularization. The most commonly used regularization schemes are presented for reference; elaborating on this topic is out of the scope of the present work.

¹ Followers of frequentist statistics.
² A virtual prior is often introduced for stability.


Finally, the problem of nonlinear inversion is investigated with a specialized version of the Gauss-Newton method for unconstrained optimization.

This chapter is heavily influenced by Aster, Borchers, and Thurber's "Parameter Estimation and Inverse Problems" [4].

2.1 basic taxonomy for inverse problems

2.1.1 Discrete vs. continuous

Inverse problems for which the model and data space have continuous functions as elements of study are termed continuous. These problems are often expressed as integral operators:

\[ \int_a^b g(s, x)\, m(x)\, dx = d(s) \tag{2.1} \]

where m(x) is the unknown, d(s) the observation, and g(s, x) the kernel function.

Equation 2.1 occurs so often in mathematical models, it has a name—Fredholm

integral equation of the first kind. For instance, in electrical engineering, the kernel depends explicitly on s − x; convolution/deconvolution arises as one of the most important forward/inverse problems:

\[ \int_{-\infty}^{\infty} g(s - x)\, m(x)\, dx = d(s) \tag{2.2} \]

For further clarification, consider g(s, x) = 1 over [a, b] ⊂ R: the inverse problem has no solution unless d(s) = C is a constant. Moreover, there are multiple functions m(x) for which the definite integral evaluates to C. The problem is ill-posed. Since continuous functions cannot be represented by digital computers, these problems must be discretized first. The parameters m = (m₁, m₂, …, m_n)⊤ and observations d = (d₁, d₂, …, d_m)⊤ are both represented by a finite³ number of coordinates, and the inversion is said to be discrete.
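To connect the continuous and the discrete pictures, the sketch below (an added illustration; the Gaussian kernel and grid are arbitrary assumptions) discretizes the convolution operator of Equation 2.2 with a simple quadrature rule, turning it into a matrix G acting on the coordinate vector m. The large condition number of G is the discrete fingerprint of the ill-posedness just discussed.

import numpy as np

# uniform grid on [0, 1] and a smooth Gaussian kernel g(s - x) (assumption)
n = 100
x = np.linspace(0.0, 1.0, n)
dx = x[1] - x[0]
g = lambda t: np.exp(-0.5 * (t / 0.05)**2)

# discretized forward operator: G[i, j] ~ g(s_i - x_j) * dx  (quadrature rule)
G = g(x[:, None] - x[None, :]) * dx

m = np.sin(2 * np.pi * x)       # some "true" continuous model, sampled
d = G @ m                       # discrete forward problem d = Gm

# G is severely ill-conditioned
print(np.linalg.cond(G))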

Parameter estimation is an alternative name for discrete inverse problems; it originates from the fact that discretization is generally achieved by series expansion and parametrization. For example, the univariate Gaussian distribution is entirely determined by two parameters, N(µ, σ²), no matter the complex approximations the computer does to reproduce it:

Figure 2.1: Univariate Gaussians parameterized by (µ, σ²)⊤.

Given that large-scale inverse problems are solved by computers, discretization is of major importance. Few continuous inverse problems have an elegant solution like the one in Example 2.1.

Example 2.1 (Lagrange Interpolation)

Find a polynomial p(X) ∈ R[X] of degree n with given zeros x₁, x₂, …, x_n ∈ R. This problem is inverse to the direct problem of finding the roots of a given polynomial p(X) ∈ R[X]. In this case, the inverse problem is conceptually easier to solve: p(X) = c(X − x₁)(X − x₂)···(X − x_n) for c ∈ R.

More generally, find a polynomial p(X) ∈ R[X] of degree n that assumes given values y₁, y₂, …, y_n ∈ R at given distinct points x₁, x₂, …, x_n ∈ R. The solution is given by the Lagrange interpolation theorem.

Whether this example is an inverse problem or supervised learning is open to philosophical discussion. □

2.1.2 Linear vs. nonlinear

An inverse problem is linear if its forward operator G satisfies superposition and scaling:

G(m₁ + m₂) = G(m₁) + G(m₂)   (2.3)

G(αm) = α G(m)   (2.4)

It’s nonlinear otherwise, and also harder to solve. The model and data spaces of discrete inverse problems are assumed to be linear manifolds so that it makes sense to add and scale coordinates (a. k. a. components). This is a reasonable assumption with no penalties to most real applications.


Every discrete linear inverse problem can be trivially written as a system of linear equations by looking up the basis for the associated model and data spaces (see Appendix A.3):

Gm = d   (2.5)

with G the matrix of the linear transformation. The problem then reduces to "solving" the system for m even when an inverse G⁻¹ does not exist. Various well-established results from linear algebra have been applied to the solution of these problems; they are reviewed in Section 2.2.

Fredholm integral equations of the first kind are a good example of continuous linear inverse problems; Equation 2.1 is an integral and, as such:

\[
\int_a^b g(s,x)\,\big(\alpha\, m_1(x) + m_2(x)\big)\, dx
= \alpha \int_a^b g(s,x)\, m_1(x)\, dx + \int_a^b g(s,x)\, m_2(x)\, dx
\tag{2.6}
\]

Since coupled multiphysics and complex models are considered in realistic engineering simulations, nonlinear outcomes are obtained in general. Inversion is much more difficult and requires iterative optimization for an appropriate solution. The Levenberg-Marquardt algorithm is discussed in Section 2.4.

2.1.3 Well-posed vs. ill-posed

The Hadamard definition for well-posedness is revisited for completeness with normed spaces [23].

Definition 2.1 (Well-posedness). Let X and Y be normed spaces, T : X → Y a (linear or nonlinear) mapping. The equation T(x) = y is called properly-posed or well-posed if the following holds:

• Existence and uniqueness: ∀y ∈ Y, ∃!x ∈ X, y = T(x).
• Stability: for every sequence (x_n)_{n∈N} with T(x_n) → T(x), it follows that x_n → x.

Equations for which (at least) one of these properties does not hold are called improperly-posed or ill-posed.

important note: Discrete linear inverse problems can be further classified as purely underdetermined, purely overdetermined or mixed determined depending on the range and null space of G [24].


2.2 linear regression and the least-squares estimate

Since no exact preimage exists for noisy observations, parameterized mathematical models are adjusted to fit data with respect to some misfit measure, as illustrated in Figure 2.2. Linear regression is the term used when the model being fit depends linearly on the parameters: d̂ = Gm.

Figure 2.2: Polynomial regression models.

The widely employed procedure is to minimize the sum of squared errors between model and observation d_obs⁴:

\[
m_{L_2} = \arg\min_{m\in M}\; (d_{\text{obs}} - Gm)^\top (d_{\text{obs}} - Gm)
\tag{2.7}
\]

It has a closed form solution m_{L2} = (G⊤G)⁻¹G⊤d_obs when G has full column rank, and probabilistic meaning⁵ when the noise is Gaussian (see Appendix A.4). When the null space of G, denoted N(G), isn't trivial, the pseudo-inverse defined through the SVD is used to compute a least-squares and minimum-length solution:

\[ G^{\dagger} \stackrel{\text{def}}{=} V_p S_p^{-1} U_p^\top \tag{2.8} \]

with rank(G) = p in the singular value decomposition of G:

\[
G = USV^\top =
\begin{bmatrix} U_p & U_0 \end{bmatrix}
\begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} V_p & V_0 \end{bmatrix}^\top
\tag{2.9}
\]

The pseudo-inverse solution m† = G†d_obs always exists and, as a least-squares solution, satisfies the normal equations:

\[ (G^\top G)\, m^{\dagger} = G^\top d_{\text{obs}} \tag{2.10} \]
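These estimates map directly to a few lines of NumPy. The sketch below (illustrative, on a random full-column-rank example that is an assumption of this note) computes the least-squares solution, builds the pseudo-inverse of Equation 2.8 from the SVD, and checks the normal equations (2.10); in the full-rank case the two estimates coincide.

import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(50, 10))        # overdetermined example (assumption)
d_obs = rng.normal(size=50)

# least-squares estimate m_L2 = (G^T G)^-1 G^T d_obs
m_l2 = np.linalg.solve(G.T @ G, G.T @ d_obs)

# pseudo-inverse through the SVD (Equation 2.8)
U, s, Vt = np.linalg.svd(G, full_matrices=False)
p = np.sum(s > 1e-12)                # numerical rank
G_dag = Vt[:p].T @ np.diag(1.0 / s[:p]) @ U[:, :p].T
m_dag = G_dag @ d_obs

# both satisfy the normal equations (2.10); here they coincide (full rank)
assert np.allclose(G.T @ G @ m_dag, G.T @ d_obs)
assert np.allclose(m_l2, m_dag)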

⁴ d_obs and d are used interchangeably.
⁵ It's the maximum likelihood estimate.


Most importantly, it can be shown by pure algebraic manipulation that m† minimizes the length ‖m‖_{L2} among all minimizers of the residual ‖d − Gm‖_{L2} (refer to Aster et al.). The major problem with pseudo-inverses is that they introduce non-negligible bias into the solution, whereas, for instance, the least-squares estimate m_{L2} is unbiased under the Gaussian assumption, E[m_{L2}] = m_true. Bounds are derived for the introduced bias with the concept of model resolution: for the pseudo-inverse G† of the forward operator G, define the resolution matrix R_G := G†G and notice that

\[ R_G\, m = G^{\dagger} G\, m = G^{\dagger} (Gm) \approx m \tag{2.11} \]

is a defeatured version of m that will be an exact approximation, R_G m = m, if nothing is missed in the null space N(G). A simple measure for the bias is the trace Tr(R_G); the closer to that of the identity matrix, the lower the bias. An exact (but not very useful) quantification is obtained by comparing the expected value of m† and the true (unknown) parameters m_true:

\[ E[m^{\dagger}] = E[G^{\dagger} d] = G^{\dagger} E[d] = G^{\dagger} G\, m_{\text{true}} = R_G\, m_{\text{true}} \tag{2.12} \]

\[ \text{BIAS} = E[m^{\dagger}] - m_{\text{true}} = (R_G - I)\, m_{\text{true}} \tag{2.13} \]

Equation 2.13 gives a theoretical bound ‖BIAS‖ ≤ ‖R_G − I‖ ‖m_true‖, since no prior knowledge exists for ‖m_true‖. It can be further manipulated to incorporate the SVD factors, R_G − I = V_p V_p⊤ − V V⊤ = −V_0 V_0⊤.

Another important issue is the instability of the generalized inverse solution. Small singular values cause it to be extremely sensitive to noise in the data, as translated by the condition number of G. In practice, all the solutions discussed—least-squares and generalized inverse—aren't implemented directly; techniques for stabilizing the inverse problem produce considerably better results. The so-called regularization or damped estimation is revised in the next section.

The regression model can also be very sensitive to outliers, as is the case for least-squares. The usual misfit measure for inconsistent data is the L1-norm of the residual, ‖d − Gm‖_{L1}; its resistance to outliers is often explained by arguing that the inconsistency isn't magnified by taking squares. L1 regression is robust; the main reason it's the second choice is non-differentiability. The minimization is performed by iteratively reweighted least-squares [25], which is a sequence of least-squares solutions converging to the L1 estimate m_{L1}. A possible implementation in the GNU Octave programming language is presented in Appendix B.1. Therein, the weighted system G⊤RGm = G⊤Rd, with R := diag(|d − Gm|^{p−2}), is extremely stable for p = 1, especially if a cut-off value is used for rounding up residuals that are close to zero.
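For readers who prefer Python to the GNU Octave listing of Appendix B.1, the loop below is a minimal sketch of the same reweighting idea; the tolerance, cut-off value, iteration count and the usage example are arbitrary choices of this illustration, not the thesis defaults.

import numpy as np

def irls(G, d, p=1.0, cutoff=1e-8, iters=50):
    """Iteratively reweighted least-squares for the Lp misfit (sketch)."""
    m = np.linalg.lstsq(G, d, rcond=None)[0]        # start from the L2 solution
    for _ in range(iters):
        r = np.abs(d - G @ m)
        r = np.maximum(r, cutoff)                   # round up tiny residuals
        R = np.diag(r**(p - 2))                     # weights |d - Gm|^(p-2)
        m = np.linalg.solve(G.T @ R @ G, G.T @ R @ d)
    return m

# usage with an outlier-contaminated straight line
rng = np.random.default_rng(2)
G = np.c_[np.ones(20), np.arange(20.0)]
d = G @ np.array([1.0, 0.5])
d[5] += 30.0                                        # one gross outlier
print(irls(G, d))                                   # close to (1.0, 0.5)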

Example 2.2 (Bullet Trajectory)

This is the classical problem of fitting a parabola to consecutive snapshots of a bullet trajectory with the discrete linear model y(t) = m₁ + m₂t − ½m₃t² for the elevation at instant t > 0. The parameters to be estimated, (m₁, m₂, m₃)⊤, have well-known physical meaning.

Observations are made and the linear system Gm = d built:

\[
\begin{pmatrix}
1 & t_1 & -\tfrac{1}{2} t_1^2 \\
1 & t_2 & -\tfrac{1}{2} t_2^2 \\
\vdots & \vdots & \vdots
\end{pmatrix}
\begin{pmatrix} m_1 \\ m_2 \\ m_3 \end{pmatrix}
=
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \end{pmatrix}
\]

In the presence of an outlier, the L2 regression underestimates the initial bullet velocity and the gravitational field, giving a poor trajectory prediction. The L1 regression is less sensitive to the inconsistent observation, as shown in Figure 2.3.

Figure 2.3: Bullet trajectory prediction with L1 and L2 regression.

This plot was reproduced from the already mentioned textbook (example 2.4), as it illustrates the danger in applying pure least-squares for inversion on arbitrary, realistic data sets. □

Weighted systems like those for the Lp solution naturally arise when uncertainty in the measurements is stochastically modeled. The Euclidean distance is replaced by a scale-invariant metric that is better adapted to settings involving non-spherically symmetric distributions.


Definition 2.2 (Mahalanobis distance). The Mahalanobis distance of a multivariate vector x ∈ Rⁿ from a group of values with mean µ ∈ Rⁿ and covariance Σ ∈ Rⁿˣⁿ is defined as:

\[ D_M(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)} \]

The metric in Definition 2.2 is used to formulate a weighted linear least-squares problem, with C_d the covariance matrix for the measurements:

\[
\arg\min_{m\in M}\; (d_{\text{obs}} - Gm)^\top C_d^{-1} (d_{\text{obs}} - Gm)
\tag{2.14}
\]

where the best estimate⁶ might have the closed form m_M = (G⊤ C_d⁻¹ G)⁻¹ G⊤ C_d⁻¹ d_obs (see Appendix A.5). In practice, the noise is assumed to be separate from the input as in Example 1.3 [26, 27] and the covariance to be diagonal, C_d = diag(σ₁², σ₂², …, σ_m²). The least-squares solution is recovered for C_d = I.

To end this section, it is important to know that uncertainty in the data can be propagated through all linear estimators m_(·) = G_(·) d_obs because of a basic result

from multivariate statistics.

Lemma 1. The covariance of a linear mapping y = Ax + b is C_y = A C_x A⊤, with C_x the covariance of x [6].

The covariance of the least-squares estimate is, for instance, derived by setting A = (G⊤G)⁻¹G⊤ in Lemma 1:

\[
C_{m_{L_2}} = (G^\top G)^{-1} G^\top C_d\, G\, (G^\top G)^{-1}
\tag{2.15}
\]

and if the measurements are uncorrelated under the same degree of uncertainty, C_d = σ²I, Equation 2.15 simplifies to:

\[ C_{m_{L_2}} = \sigma^2 (G^\top G)^{-1} \tag{2.16} \]
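A short sketch of Equations 2.14–2.16 (with illustrative sizes and a constant-variance diagonal C_d as assumptions of this example): the weighted estimate in closed form and the propagated covariance of the least-squares estimator, which indeed reduces to σ²(G⊤G)⁻¹ when C_d = σ²I.

import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(30, 4))
sigma = 0.1 * np.ones(30)                  # measurement std deviations (assumption)
Cd = np.diag(sigma**2)
d_obs = G @ np.ones(4) + sigma * rng.normal(size=30)

# weighted least-squares estimate (closed form below Equation 2.14)
Cd_inv = np.diag(1.0 / sigma**2)
m_w = np.linalg.solve(G.T @ Cd_inv @ G, G.T @ Cd_inv @ d_obs)
print(m_w)

# covariance propagation (Equations 2.15 and 2.16, here Cd = sigma^2 I)
GtG_inv = np.linalg.inv(G.T @ G)
Cm = GtG_inv @ G.T @ Cd @ G @ GtG_inv
assert np.allclose(Cm, sigma[0]**2 * GtG_inv)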

Remark 1. The classical framework for discrete linear inverse problems gives generalized linear estimates for the parameters and models uncertainty through direct covariance propagation.

2.3 tikhonov regularization

In the previous section, the general solution to discrete linear inverse problems was reviewed and, at that time, it was already mentioned that using those estimates could lead to erroneous predictions. This is mainly caused by ill-conditioning, as briefly explained with the SVD of rank-deficient matrices, or by the presence of outliers, as illustrated by Example 2.2.

Regularization consists of solving the trade-off between resolution and stability of the estimate. It can be thought of as the process of penalizing terms in the SVD of G that are highly sensitive to noise—terms associated with small singular values.

Recapping Section 2.2, it was mentioned that the pseudo-inverse estimate m† is also minimum-length, meaning it's the solution to the following optimization problem:

\[
\begin{aligned}
\text{minimize} \quad & \lVert m \rVert_{L_2} \\
\text{s.t.} \quad & \lVert d - Gm \rVert_{L_2} \le t(\delta)
\end{aligned}
\tag{2.17}
\]

where t(δ) is an increasing function of δ. The constraint is usually incorporated into the objective in a damped minimization and the resulting problem is known in statistics as Ridge regression.

\[
\text{minimize} \quad
\underbrace{(d - Gm)^\top (d - Gm)}_{\text{least-squares}}
+ \underbrace{\delta^2\, m^\top m}_{\text{regularizer}}
\tag{2.18}
\]

The regularizer contributes to pushing the solution towards the origin—as δ² increases, resolution is lost—and in the case of Ridge regression it is represented by concentric circles in a 2D model space, see Figure 2.4. For any δ² ∈ [0, ∞), the Ridge estimate m_δ lies on a point of tangency between ellipses (i. e. least-squares) and circles; in the lower extreme, lim_{δ→0} m_δ = m_{L2}.

Figure 2.4: Ridge regression.


The effect of the regularizer on the normal equations is numerically very intuitive; the damped objective is rewritten as an augmented norm:

\[
\text{minimize} \quad
\left\lVert \begin{bmatrix} G \\ \delta I \end{bmatrix} m - \begin{bmatrix} d \\ 0 \end{bmatrix} \right\rVert_{L_2}^2
\tag{2.19}
\]

and the normal projection gives:

\[
\begin{bmatrix} G^\top & \delta I \end{bmatrix}
\begin{bmatrix} G \\ \delta I \end{bmatrix} m =
\begin{bmatrix} G^\top & \delta I \end{bmatrix}
\begin{bmatrix} d \\ 0 \end{bmatrix}
\tag{2.20}
\]

Equation 2.20 simplifies to (G⊤G + δ²I) m = G⊤d; it's very clear that δ² is being added to the diagonal of G⊤G to fix its condition number, in which case the estimate can be safely resolved:

\[ m_\delta = \left( G^\top G + \delta^2 I \right)^{-1} G^\top d \tag{2.21} \]

What is a reasonable value for δ? Once again, it's a trade-off between resolution and stability. If the SVD of G ∈ R^{m×n} is substituted in Equation 2.21, a damped decomposition is produced where the damping factors (a. k. a. filter factors) are well-determined in terms of the singular values s₁, s₂, …, s_{min(m,n)}:

\[ f_i \stackrel{\text{def}}{=} \frac{s_i^2}{s_i^2 + \delta^2} \tag{2.22} \]

The smaller they are, s_i² ≪ δ², the higher the penalty, f_i ≈ 0, whereas bigger values, s_i² ≫ δ², aren't discarded, f_i ≈ 1. Equation 2.21 can be rewritten in terms of the filter matrix F := diag(f₁, f₂, …, f_{min(m,n)}):

\[
m_\delta = \left( G^\top G + \delta^2 I \right)^{-1} G^\top d
= G_\delta^{\dagger} d
= V F S^{\dagger} U^\top d
\tag{2.23}
\]

and the resolution R_{G,δ} := G_δ† G = V F V⊤ is clearly a function of δ², the regularizer damping constant.

A reasonable candidate for δ can be obtained by the L-curve criterion: the log-log cross-plot of ‖m_δ‖_{L2} and ‖d − Gm_δ‖_{L2} parameterized by δ has an L-shape. The damping constant should be selected near the corner to minimize both terms simultaneously, see Figure 2.5⁷.

Figure 2.5: L-curve criterion for choosing reasonable damping constants δ².
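In practice the L-curve of Figure 2.5 can be traced by sweeping δ, recording the solution norm and the residual norm, and plotting them on log-log axes; the range of δ and the random test problem below are assumptions of this sketch, not values from the thesis.

import numpy as np

rng = np.random.default_rng(5)
G = rng.normal(size=(40, 15))
d = rng.normal(size=40)

deltas = np.logspace(-4, 2, 50)
norms, resids = [], []
for delta in deltas:
    m = np.linalg.solve(G.T @ G + delta**2 * np.eye(15), G.T @ d)
    norms.append(np.linalg.norm(m))                # ||m_delta||
    resids.append(np.linalg.norm(d - G @ m))       # ||d - G m_delta||

# the cross-plot of (norms, resids) on log-log axes is the L-curve; a delta
# near its corner balances instability (left branch) and bias (right branch)
for delta, n, r in zip(deltas[::10], norms[::10], resids[::10]):
    print(f"delta={delta:9.3e}  ||m||={n:8.3f}  ||d-Gm||={r:8.3f}")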

Another very common technique for this selection, mostly applied to supervised learning problems, is k-fold cross-validation: the original data set is randomly partitioned into k equal-size subsets, one of which is retained for validation while the other k − 1 are used for training the model. The process is repeated k times and the final estimator is taken as the average of all.

important note: Apart from the mentioned techniques—L-curve and k-fold cross-validation—damping constant selection for Tikhonov regularization can be made implicit within an iterative optimization approach [28].

Ridge regression (a. k. a. zeroth-order Tikhonov regularization) damps the objective with the L2-norm of the model parameters. It's also interesting to consider other norms on different arguments. The first well-known modification is the use of an L1 regularizer, producing the LASSO:

\[
\text{minimize} \quad (d - Gm)^\top (d - Gm) + \delta^2 \lVert m \rVert_{L_1}
\tag{2.24}
\]

It has many advantages over the L2 regularizer, especially when it comes to feature selection (i. e. discarding parameters). The contour lines for the L1-norm have a ♦-shape and, because the minimum occurs at the intersection with the least-squares ellipses, it almost always happens at the diamond corners, see Figure 2.6. The corners are on the axes, meaning many parameters in the solution are exactly zero, hence the name Sparse Regularization.

Figure 2.6: LASSO ♦-shape regularizer.

The augmented objective with L1 regularization is no longer differentiable; the optimum is derived using the concept of subderivative from convex optimization, directly embedded in Algorithm 2.1. A possible implementation of this shrinkage operator [29] is presented in Appendix B.2.

In Tibshirani's "The lasso problem and uniqueness" [30], the LASSO solution is investigated from the KKT conditions using the term "well-defined" in favour of well-posed.

Algorithm 2.1: Coordinate descent for sparse regularization

Input: G ∈ R^{m×n}, d ∈ R^m, δ ∈ R
Output: m_δ = arg min_{m∈M} ‖d − Gm‖²_{L2} + δ² ‖m‖_{L1}

// initialize m randomly or use the Ridge estimate
m ← (G⊤G + δ²I)⁻¹ G⊤d
repeat
    foreach column G⟨j⟩ do update m_j:
        a_j ← 2 G⟨j⟩⊤ G⟨j⟩
        c_j ← 2 G⟨j⟩⊤ (d − Gm + m_j G⟨j⟩)
        if c_j < −δ² then m_j ← (c_j + δ²) / a_j
        else if c_j > δ² then m_j ← (c_j − δ²) / a_j
        else m_j ← 0
    end
until convergence
return m


When combined, these two regularizers form what is known in the literature as Elastic net regularization. This is mainly done to overcome LASSO limitations for "small m, large n" problems.

\[
\text{minimize} \quad (d - Gm)^\top (d - Gm) + \lambda_1 \lVert m \rVert_{L_1} + \lambda_2 \lVert m \rVert_{L_2}^2
\tag{2.25}
\]

Ridge regression and the LASSO are recovered for λ₁ = 0 and λ₂ = 0, respectively.

Note that their exponents differ in the objective function.

Bias towards the origin isn't the only choice. If guesses m₀ are allowed, they are incorporated by translation, ‖m − m₀‖_{L2}. Furthermore, higher-order regularizers are produced with the introduction of finite-difference operators.

The first-order Tikhonov regularization takes the first derivative of the parameters into account using the forward difference matrix D₁. Similarly, second-order Tikhonov regularization uses central differences D₂ to account for second derivatives:

\[
D_1 \stackrel{\text{def}}{=}
\begin{bmatrix}
-1 & 1 & & \\
 & -1 & 1 & \\
 & & \ddots & \ddots \\
 & & & -1 \;\; 1
\end{bmatrix}
\qquad
D_2 \stackrel{\text{def}}{=}
\begin{bmatrix}
1 & -2 & 1 & & \\
 & 1 & -2 & 1 & \\
 & & \ddots & \ddots & \ddots \\
 & & & 1 \;\; -2 \;\; 1
\end{bmatrix}
\tag{2.26}
\]

The most general regularizer discussed here, ‖D_(·) m‖_{L_(·)}, serves a variety of purposes. For instance, Ridge regression is equivalent to D = I and the L2-norm. There are no limits to creativity. . .

\[
\text{minimize} \quad
\underbrace{(d - Gm)^\top C_d^{-1} (d - Gm)}_{\text{weighted least-squares}}
+ \underbrace{\lambda_1 \lVert m - m_0 \rVert_{L_1}}_{\text{regularized towards } m_0}
+ \underbrace{\lambda_2 \lVert D_1 m \rVert_{L_2}^2}_{\text{first derivative}}
\tag{2.27}
\]

and a quote by Alan Kay is appropriate to end this section: "The best way to predict the future is to invent it."

2.4 levenberg-marquardt solution to nonlinear regression

Very often the forward operator G : M → D is much more complex than a matrix multiplication d = Gm. It might represent an entire engineering system with a broad variety of nonlinear interactions. For such systems, linear algebra can't be applied directly as in the previous derivations for generalized linear estimators.


One possible way to tackle a general operator d = G(m) is by iterative optimization. First, recall the Newton-Raphson method for finding the roots of a continuously differentiable nonlinear square system of equations F(m) = 0:

\[ J(m_k)\,\big(m_{k+1} - m_k\big) = -F(m_k) \tag{2.28} \]

with J(m) ≡ [∂F_i(m)/∂m_j] the Jacobian matrix. It can be employed to find the local minima m* = arg min_{m∈M} f(m) of a twice continuously differentiable function f(m) through the necessary condition ∇f(m) = 0:

\[ H(m_k)\,\big(m_{k+1} - m_k\big) = -\nabla f(m_k) \tag{2.29} \]

with H(m) ≡ [∂²f(m)/∂m_i∂m_j] the Hessian. Second, note that the Newton-Raphson update cannot be performed for the system G(m) − d = 0 even though an analytical expression may exist. This is because the inverse problem is not guaranteed to have a solution G(m*) = d and the system to be square (i. e. m ≪ n).

The Levenberg-Marquardt solution to nonlinear inverse problems consists of applying Newton-Raphson to a damped least-squares objective:

\[ f(m) \stackrel{\text{def}}{=} \sum_{i=1}^{m} \left( \frac{G(m)_i - d_i}{\sigma_i} \right)^2 \tag{2.30} \]

As with linear regression, if the observations are assumed to be Gaussian, then to maximize the likelihood is equivalent to minimizing the objective in Equation 2.30⁸.

By introducing the notation f(m) = Σ_{i=1}^m f_i(m)² and the misfit vector F(m) = (f₁(m), f₂(m), …, f_m(m))⊤, the gradient and the Hessian needed for the Newton-Raphson update are given by (see Appendix A.6):

\[ \nabla f(m) = 2\, J(m)^\top F(m) \tag{2.31} \]

\[ H(m) = 2\, J(m)^\top J(m) + Q(m) \tag{2.32} \]

with J(m) ≡ ∇F(m) being the Jacobian and Q(m) := 2 Σ_{i=1}^m f_i(m) ∇²f_i(m). A good approximation to the Hessian is obtained by ignoring the Q(m) term in Equation 2.32; in this case the update ∆m = m_{k+1} − m_k is such that:

\[ J(m_k)^\top J(m_k)\, \Delta m = -J(m_k)^\top F(m_k) \tag{2.33} \]

Equation 2.33 is sometimes referred to as the Gauss-Newton method. The Levenberg-Marquardt algorithm is introduced with a slight modification:

\[
\left( J(m_k)^\top J(m_k) + \lambda^2 I \right) \Delta m = -J(m_k)^\top F(m_k)
\tag{2.34}
\]


where λ² is adjusted during optimization. Larger values lead to steepest descent, whereas steps with small penalization λ²I mimic the Gauss-Newton method. The final effect of this penalizer is very much that of Tikhonov regularization, but with a different dynamic interpretation. Refer to Aster et al. for a comparison of the two.
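The update of Equation 2.34 is simple to prototype when the Jacobian is approximated by finite differences. The sketch below is only an illustration: the toy misfit, the step size h and the λ-adjustment rule are assumptions of the example, not the thesis implementation (and, as discussed below, finite differences are impractical for an expensive simulator).

import numpy as np

def levenberg_marquardt(F, m0, lam=1e-2, iters=50, h=1e-6):
    """Minimize ||F(m)||^2 with the damped update of Equation 2.34 (sketch)."""
    m = np.asarray(m0, dtype=float)
    for _ in range(iters):
        f = F(m)
        # finite-difference Jacobian (costly for demanding simulators)
        J = np.column_stack([(F(m + h * e) - f) / h for e in np.eye(m.size)])
        dm = np.linalg.solve(J.T @ J + lam**2 * np.eye(m.size), -J.T @ f)
        if np.sum(F(m + dm)**2) < np.sum(f**2):
            m, lam = m + dm, lam / 2.0   # accept step, move towards Gauss-Newton
        else:
            lam *= 2.0                   # reject step, move towards steepest descent
    return m

# toy misfit: fit (a, b) in d_i = a * exp(-b * t_i) to synthetic data
t = np.linspace(0.0, 1.0, 20)
d = 2.0 * np.exp(-3.0 * t)
F = lambda m: m[0] * np.exp(-m[1] * t) - d
print(levenberg_marquardt(F, [1.0, 1.0]))   # approaches (2.0, 3.0)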

A specialized derivation for linear operators d = Gm and Gaussian priors can be found in Oliver, Reynolds, and Liu's "Inverse Theory for Petroleum Reservoir Characterization and History Matching" [24].

important note: More advanced and well-established strategies with similar adaptive behavior for solving unconstrained optimization exist as part of the Quasi-Newton family of methods (e. g. DFP, BFGS). They use a numerical approximation of the Hessian based on successive gradient evaluations.

The most critical issue with these algorithms is the need for derivatives. The numerical simulator G is in general a black box, either on purpose to avoid complexity or because its source code isn't available. Moreover, commercial software doesn't necessarily implement adjoint code [31, 32, 33].

Figure 2.7: Numerical simulator as a black box.

Finite differences are very sensitive to the stencil and also very costly. The approach is unfeasible for demanding simulators in high dimensions unless a robust framework such as DAKOTA is used for computing them in parallel, or a low-fidelity proxy is iteratively fitted (e. g. Polynomial, Kriging, Particle filters).

Unlike discrete linear inverse problems—for which inversion is performed by SVD, Cholesky factorization or conjugate gradients⁹—nonlinear inverse problems are challenging due to the lack of characterization of the nonlinearities in black-box solvers. Introspection and dedicated analysis of the source code for G can overcome this limitation and further improve the performance of existing software.

Other methods, such as Kalman filters developed for linear dynamical systems, perform very well in practice even in the presence of nonlinearities. Filters that linearize about the current mean and covariance are sometimes referred to as extended Kalman filters; a detailed explanation is out of the scope of this work. More recent and powerful variations are obtained with the use of an ensemble, producing the EnKF [35, 36, 37], or with probabilistic collocation techniques [38, 39].
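To give a flavour of the ensemble idea without anticipating the implementations cited above, the following is a bare-bones sketch of a single EnKF analysis step with perturbed observations; localization, inflation and iterative variants are deliberately left out.

```python
import numpy as np

def enkf_update(M, D, d_obs, R, rng=None):
    """One EnKF analysis step (a sketch, not the cited implementations).

    M     -- (n_m, n_e) ensemble of model vectors
    D     -- (n_d, n_e) simulated data, D[:, j] = G(M[:, j])
    d_obs -- (n_d,) observed data
    R     -- (n_d, n_d) observation error covariance
    """
    rng = rng or np.random.default_rng()
    n_d, n_e = D.shape
    dM = M - M.mean(axis=1, keepdims=True)          # model anomalies
    dD = D - D.mean(axis=1, keepdims=True)          # data anomalies
    C_md = dM @ dD.T / (n_e - 1)                    # model/data cross-covariance
    C_dd = dD @ dD.T / (n_e - 1)                    # data covariance
    K = C_md @ np.linalg.inv(C_dd + R)              # Kalman gain
    # perturb the observations so the updated ensemble keeps the right spread
    E = rng.multivariate_normal(np.zeros(n_d), R, size=n_e).T
    return M + K @ (d_obs[:, None] + E - D)
```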


3 PROBABILISTIC FRAMEWORK

Information can tell us everything. It has all the answers. But they are answers to questions we have not asked, and which doubtless don't even arise.
Jean Baudrillard

3.1 Definition of probability
3.2 States of information
3.3 Bayesian inversion
3.4 Ensemble Markov chain Monte Carlo

Responding to the statement that one estimate alone isn't informative, especially given the richness of high-dimensional spaces, the probabilistic framework relies on a subjective degree of belief to model the plausibility1 of any given preimage. This means that each estimate m ∈ M is assigned a measure of consistency with previous observations and expert knowledge (e. g. a probability).

The first important distinction that has to be made against the classical framework is the reification2 of plausibility distributions. In this chapter, these are the main objects of attention and carry all the information about the inverse problem to be solved: the state of information. Two extreme states are identified, representing maximum uncertainty (i. e. a flat shape) and total confidence (i. e. a peaked shape); and a Bayesian rule derived from the Kolmogorov axioms, suitable for distributions that aren't necessarily normalizable, is used to navigate from one extreme to the other in the learning direction.

As in Chapter 2, the specificities of the forward operator G : M ↦ D and the associated spaces are completely hidden so as to highlight the core assumptions of the theory and to make the text accessible to readers coming from different fields. They will only be introduced in Part II of the dissertation, for the general history matching problem, or within small, simple examples.

1 The terms plausibility, probability and belief are used interchangeably in this document.
2 In Programming Languages, reification is the process of creating first-class objects.

At a higher abstraction level, the solution is naturally formulated as an integral that represents the marginalization of the posterior with respect to the noisy output measurements d ∈ D. It is in general more computationally demanding than the mathematical optimization techniques previously presented.

The MCMC strategy for exact sampling of the posterior is briefly reviewed with a "hands-on" approach. Among the many variants of the algorithm, only those with support for distributed parallel execution should be considered; this is a strict and important requirement if the forward operator is expensive.
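Before Section 3.4 develops the ensemble variant, a minimal random-walk Metropolis sketch illustrates the accept/reject rule that those samplers build on; the log-posterior log_post, the starting model m0 and the proposal scale are hypothetical inputs.

```python
import numpy as np

def metropolis(log_post, m0, steps=10_000, scale=0.1, rng=None):
    """Random-walk Metropolis: the basic rule underlying the ensemble variants."""
    rng = rng or np.random.default_rng()
    m = np.asarray(m0, dtype=float)
    lp = log_post(m)
    chain = np.empty((steps, m.size))
    for k in range(steps):
        proposal = m + scale * rng.standard_normal(m.size)
        lp_new = log_post(proposal)
        if np.log(rng.random()) < lp_new - lp:      # accept with probability min(1, ratio)
            m, lp = proposal, lp_new
        chain[k] = m                                # rejected steps repeat the current model
    return chain
```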

This chapter is heavily influenced by Tarantola's "Inverse Problem Theory and Methods for Model Parameter Estimation" [22].

3.1 definition of probability

Plausibility measures follow a multitude of similar axioms depending on the goal of the theorist [40, 41]. Herein, the term "probability" is formally redefined and all other names, such as plausibility, belief, etc., are assigned the same meaning. This linguistic abuse is meant to avoid confusion with subtle conceptual differences and to make the reading more pleasant.

Definition 3.1 (Probability). For a probability space (X, F, P) with X a finite-dimensional universe, and F the associated σ-algebra, the probability measure of an event A ⊆ X, denoted P(A) ∈ R, satisfies the Kolmogorov axioms:

• P(A ∪ B) = P(A) + P(B) for disjoint events A ∩ B = ∅.

• There is continuity at zero, i. e. if a non-increasing chain A_1 ⊇ A_2 ⊇ ··· tends to the empty set, then P(A_i) → 0.

The function itself, P : F ↦ R, with no reference to a particular event, is called the probability distribution and is written P(·) for short.

It follows from Definition 3.1 that the empty set has zero probability, P(∅) = 0. The universe X is generic notation for either M or D, but it can also represent the Cartesian product X = M × D. In all cases, there is no guarantee of finite probability.


Non-normalizable distributions evaluate to probabilities that can’t be interpreted intuitively, but are still useful as a relative measure: given two events A, B ∈ F, it’s still possible to compare P(A), P(B) ∈ R. This is a crucial point often obfuscated by overloaded notation.

Definition 3.2 (Density). For any probability distribution P(·) over X and a fixed coordinate system, there exists (Radon-Nikodym theorem) a function f(x), called the probability density, such that ∀A ⊆ X, P(A) = ∫_A f(x) dx.

Example 3.1 emphasizes that probability densities in Definition 3.2 aren't necessarily bounded or intuitive. This exotic behavior will be suppressed with the notion of p-events in the following paragraphs.

Example 3.1 (Jeffreys Parameters)

In Physics, reciprocal variables are usually defined to provide the scientist with different arguments about the same phenomenon:

x ←→ 1/x
Frequency f ←→ Period T = 1/f
Resistivity ρ ←→ Conductivity σ = 1/ρ
Compressibility β ←→ Bulk modulus k = 1/β

Consider that x > 0 is strictly positive, as in the pairs in the above diagram. For any two samples x_a, x_b ∈ X the absolute differences don't match, |x_a − x_b| ≠ |1/x_a − 1/x_b|, and it would be incorrect to arbitrarily select one of them as the distance between points. A good distance is invariant under change of coordinates; for instance, take the resistivity/conductivity pair:

D(ρ_a, ρ_b) def= |log(ρ_a/ρ_b)| = |log(σ_a/σ_b)| = D(σ_a, σ_b)   (3.1)

Equation 3.1 is accounting for "octaves" instead of plain differences. In differential form, the distance element for D(x_a, x_b) = |log(x_a/x_b)| is given by dL(x) = (1/x) dx, and therefore the unbounded density f(x) = 1/x assigns probabilities proportional to the length of the event. These positive reciprocal variables are here called Jeffreys parameters, as suggested by Tarantola [42]. □
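A quick numeric sanity check of Example 3.1, with arbitrary made-up resistivity values:

```python
import numpy as np
from scipy.integrate import quad

rho_a, rho_b = 2.0, 8.0                    # arbitrary resistivities
sigma_a, sigma_b = 1 / rho_a, 1 / rho_b    # corresponding conductivities

# the log-ratio distance of Equation 3.1 is invariant under the reciprocal map
assert np.isclose(abs(np.log(rho_a / rho_b)), abs(np.log(sigma_a / sigma_b)))

# the density f(x) = 1/x assigns to an interval a probability proportional to
# its length in "octaves": the integral of dx/x over [a, b] equals log(b/a)
length, _ = quad(lambda x: 1 / x, rho_a, rho_b)
assert np.isclose(length, np.log(rho_b / rho_a))
```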

Outside engineering and other applied fields, the notion of density is sometimes discarded in favour of volumetric probabilities. Unlike densities, which are affected by a change of coordinates x* = x*(x):

f*(x*) = f(x) |∂x/∂x*|   (3.2)
