disparity, which is used as an observation, becomes most likely. While this approach is applied to a diverse range of applications, such as semantic segmentation and monocular depth estimation (Kendall and Gal, 2017), optical flow (Gast and Roth, 2018) and multi-view stereo based on a single moving camera (Vogiatzis and Hernández, 2011), it is far less popular in the context of dense stereo matching than the concept of confidence. This might be explained by the requirement for more detailed knowledge of the real error distribution. However, contrary to the concept of confidence, this approach additionally allows the uncertainty to be quantified in pixels or metric units.

Kendall (2017) adopts this Bayesian approach for uncertainty estimation in the context of dense stereo matching, assuming a Gaussian distribution over the predictions of his model. By formulating the loss function as the negative log-likelihood of the assumed Gaussian distribution, the likelihood is maximised using common deep learning optimisation strategies (see Sec. 2.3). Kendall and Gal (2017) employ the same procedure for the task of monocular depth estimation, but assume a Laplacian instead of a Gaussian distribution. They argue that the L1 norm, which corresponds to the Laplacian distribution, commonly outperforms the L2 norm for vision-related regression tasks. Moreover, this approach not only allows aleatoric uncertainty to be quantified, but also improves the accuracy of the disparity estimates: the loss acts as a regulariser that weights the disparity error of a training sample by the inverse of the estimated uncertainty.
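
In the spirit of the loss described above, the following minimal sketch shows a per-pixel Laplacian negative log-likelihood in PyTorch; the function and variable names are illustrative, and the network is assumed to predict the logarithm of the scale parameter alongside the disparity.

```python
import torch

def laplacian_nll_loss(disp_pred, log_b, disp_gt):
    """Negative log-likelihood of a Laplacian centred on the predicted disparity.

    disp_pred : predicted disparity per pixel
    log_b     : predicted log of the Laplacian scale b (aleatoric uncertainty)
    disp_gt   : ground truth disparity

    The L1 residual is weighted by the inverse of the estimated uncertainty,
    while the additive log_b term prevents the trivial solution of predicting
    an arbitrarily large uncertainty everywhere. Predicting log b instead of b
    keeps the scale strictly positive and improves numerical stability.
    """
    residual = torch.abs(disp_pred - disp_gt)
    return (residual * torch.exp(-log_b) + log_b).mean()
```

Replacing the absolute residual by half of the squared residual and the Laplacian scale by a variance yields, up to constant factors, the Gaussian variant used by Kendall (2017).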

Both the Gaussian and the Laplacian model assume a uni-modal error distribution with a distinct global optimum that is easy to localise. However, this is a strong simplification that is often violated in real-world scenarios, e.g., in the context of dense stereo matching due to weakly textured or occluded regions in an image. In contrast, Vogiatzis and Hernández (2011) and Pizzoli et al. (2014) assume a mixture distribution to model the uncertainty inherent in the results of multi-view stereo reconstruction from images captured with a single moving camera. The basic idea of this assumption is that a depth sensor produces two types of measurements: good measurements that are normally distributed around the correct depth and outlier measurements that are uniformly distributed in an interval that contains the correct depth. Both distributions are combined by a weighting factor set individually for each pixel, whereby the parameters characterising this mixture distribution are iteratively adjusted by convex optimisation. Although the assumption of such a mixture distribution better approximates the real error distribution, it has so far been investigated neither in the context of dense stereo matching nor in a deep learning framework. Consequently, whether this approach can be combined with a deep learning-based functional model, as described in the previous section, still needs to be verified.
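
For illustration, the per-pixel likelihood of such a Gaussian-plus-uniform mixture could be written as sketched below; the parameter names are illustrative, and the scheme by which Vogiatzis and Hernández (2011) actually estimate these parameters is not reproduced here.

```python
import math
import torch

def gaussian_uniform_likelihood(d_obs, d_true, sigma, inlier_prob, d_min, d_max):
    """Likelihood of a measurement under a Gaussian + uniform mixture.

    Good measurements are normally distributed around the true value d_true
    with standard deviation sigma, outliers are uniformly distributed over the
    valid range [d_min, d_max]; inlier_prob weights the two components and is
    set individually per pixel.
    """
    gaussian = torch.exp(-0.5 * ((d_obs - d_true) / sigma) ** 2) \
        / (sigma * math.sqrt(2.0 * math.pi))
    uniform = 1.0 / (d_max - d_min)
    return inlier_prob * gaussian + (1.0 - inlier_prob) * uniform
```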

Consequently, this section focuses on the possibilities for defining the stochastic model, treating the functional model as given. Compared to aleatoric uncertainty, which is commonly treated as an additional predictive value, the estimation of epistemic uncertainty is typically more difficult. However, this type of uncertainty helps to mitigate the problem of overconfident predictions (Gal and Ghahramani, 2016), which occurs in particular in the context of neural networks, and to identify cases in which a method is highly uncertain regarding its prediction, for example, when processing data outside of the learned data distribution. For this task, the use of stochastic neural networks in particular has proven to be well suited. In contrast to the commonly employed deterministic neural networks that learn point estimates as parameter values, stochastic ones learn a distribution over the parameters (Jospin et al., 2020). Epistemic uncertainty is then commonly estimated via Monte Carlo sampling. By aggregating the predictions of the individual samples, this procedure approximates the central moments of the probability distribution describing the final result, such as the mean and the variance.
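
A minimal sketch of this aggregation step is given below; stochastic_forward is a placeholder for any network whose prediction varies between forward passes, for example because its parameters are re-sampled.

```python
import torch

def monte_carlo_moments(stochastic_forward, x, num_samples=25):
    """Approximate mean and variance of the predictive distribution by
    Monte Carlo sampling from a stochastic network.

    stochastic_forward : callable returning a different prediction on every
                         call, e.g. because the network parameters are
                         re-sampled for each forward pass
    """
    with torch.no_grad():
        samples = torch.stack([stochastic_forward(x) for _ in range(num_samples)])
    mean = samples.mean(dim=0)   # final prediction
    var = samples.var(dim=0)     # estimate of the epistemic uncertainty
    return mean, var
```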

In this context, ensemble learning can be understood as an approximation of such stochastic neural networks. A set of structurally identical deterministic networks is trained independently, and the predictions of the individual networks are combined at test time to estimate the epistemic uncertainty. The most popular way of setting up such an ensemble is to use different initial values for the network parameters, which is typically achieved by random initialisation with varying seed values (Lakshminarayanan et al., 2017). However, other types of ensembles also exist, such as training individual networks on different subsets of the training data (Breiman, 1996; Moukari et al., 2019) or forming an ensemble from the parameter values of the same network obtained after different numbers of training epochs (Huang et al., 2017). Generally, it is argued in the literature that procedures based on ensemble learning require less computational effort than other variants of stochastic neural networks, while still leading to good results (Lakshminarayanan et al., 2017; Ovadia et al., 2019). However, while such an approach may be reasonable for some scenarios, it is not feasible for large architectures, especially if a larger set of networks is to be considered in the ensemble. Because training (and potential fine-tuning) is commonly carried out independently, the computational effort grows linearly with the size of the ensemble, and the parameter values of all networks have to be present at test time, leading to an enlarged memory footprint. Moreover, ensemble learning typically does not allow prior knowledge on the uncertainty to be considered, nor correlations between the network parameters or the individual instances forming an ensemble to be modelled.
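
A minimal sketch of such a seed-based ensemble is given below; make_model and train_fn are placeholders for an arbitrary network constructor and training routine, and the seed values are chosen arbitrarily.

```python
import torch

def train_seed_ensemble(make_model, train_fn, seeds=(0, 1, 2, 3, 4)):
    """Train structurally identical networks that differ only in the seed
    used to initialise their parameters (cf. Lakshminarayanan et al., 2017)."""
    ensemble = []
    for seed in seeds:
        torch.manual_seed(seed)   # different random initialisation per member
        model = make_model()
        train_fn(model)           # every member is trained independently
        ensemble.append(model)
    return ensemble

def ensemble_predict(ensemble, x):
    """Combine the member predictions at test time; their spread serves as an
    estimate of the epistemic uncertainty."""
    with torch.no_grad():
        preds = torch.stack([model(x) for model in ensemble])
    return preds.mean(dim=0), preds.var(dim=0)
```

The linear growth of the training effort and of the memory footprint criticised above is directly visible here: every member is trained and stored separately.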

A second realisation of the concept of stochastic neural networks is Monte Carlo dropout, which is, for example, used by Gal and Ghahramani (2016), Kendall et al. (2017a) and Kendall and Gal (2017). Similar to the variant of dropout commonly used for regularisation during training (Srivastava et al., 2014), Monte Carlo dropout places a Bernoulli distribution over the network weights, setting them to zero with a certain probability. However, the Monte Carlo variant applies this procedure not only during training but also at test time. Thus, with every forward pass a slightly different parametrisation of the same network is used to obtain a prediction, leading to varying results. In contrast to most ensemble learning techniques, only the parameters of a single network need to be learned and be present at test time, which clearly reduces the computational effort during training as well as the memory footprint when testing. Furthermore, using Monte Carlo dropout, the number of forward passes and thus the size of the ensemble can be changed flexibly at test time without the need to train further variants of the network. A potential weakness of Monte Carlo dropout is that the uncertainty predictions are typically not calibrated, in the sense that systematic deviations may exist between the estimated and the actually observable variance in the test data. However, this problem can be mitigated by also learning the dropout probability from the training data, for example, using the variational dropout proposed by Kingma et al. (2015).
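
In practice, this behaviour can be obtained, for example, by keeping only the dropout layers stochastic at test time and repeating the forward pass several times, as sketched below for a network containing standard dropout layers (a PyTorch-based illustration, not the exact implementation used in the cited works):

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model):
    """Switch the network to evaluation mode but keep the dropout layers
    stochastic, so that a new Bernoulli mask is drawn for every forward pass."""
    model.eval()
    for module in model.modules():
        if isinstance(module, (nn.Dropout, nn.Dropout2d)):
            module.train()

def mc_dropout_predict(model, x, num_passes=25):
    """Approximate the predictive mean and the epistemic uncertainty by
    aggregating several stochastic forward passes of the same network."""
    enable_mc_dropout(model)
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(num_passes)])
    return samples.mean(dim=0), samples.var(dim=0)
```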

Another limitation is that prior knowledge and correlations between the parameters of the network can often not be taken into account. Especially the latter appears to be problematic in the context of CNNs, since the convolutional filter kernels are spatially correlated. While Ghiasi et al. (2018) have investigated this issue for conventional dropout and presented a solution to improve the regularisation capability by taking spatial correlations into account, the implications of such correlations for the estimation of epistemic uncertainty remain an open research question.

BNNs constitute the third and last realisation of stochastic neural networks discussed in this section; they allow a prior to be defined for the parameters of the network, treating the uncertainty in a Bayesian manner (cf. Sec. 2.3.2). Despite the fact that the basic concepts of BNNs have been known for decades (MacKay, 1992; Neal, 1995), they have only recently been used in practice for more complex tasks, such as image-based object classification (Brosse et al., 2020).

For a rather long time, the complexity of larger BNNs was a major challenge, causing training via variational inference (Graves, 2011) to converge slowly and to sub-optimal solutions while requiring high computational effort. Consequently, such approaches were mainly theoretically motivated but had limited relevance in practice. However, more recent advances in the field of variational inference, such as stochastic variational inference (Hoffman et al., 2013), Bayes by backprop (Blundell et al., 2015), the reparametrisation trick (Kingma et al., 2015) and flipout (Wen et al., 2018), have mitigated these problems significantly. With a training runtime that is only slightly increased compared to a deterministic baseline, BNNs can nowadays be trained faster than an ensemble of networks. Moreover, this type of stochastic neural network offers the flexibility to model the distribution over the parameters in various ways, considering prior information, and to also take into account correlations between parameters. In this context, the network parameters are not learned directly, but drawn from a learned variational distribution. While the randomness of the sampled network parameters offers a natural regularisation during training and reduces the risk of over-fitting, learning a variational distribution may lead to a significant increase in the number of trainable parameters. Assuming, for example, that every network parameter is drawn from a Gaussian distribution, two variational parameters would need to be learned per network parameter to parameterise this Gaussian. Thus, even a rather simple type of distribution, such as a Gaussian, doubles the number of trainable parameters compared to a deterministic baseline if the covariance matrix is assumed to be diagonal, i.e., if correlations are not considered. Addressing this issue, Zeng et al. (2018) and Brosse et al. (2020) have recently proposed to treat only some layers of the network in a probabilistic manner while keeping all others deterministic. Following this procedure, the number of parameters and the computational effort can be reduced, while comparable results are achieved with respect to the prediction accuracy and the quality of the estimated epistemic uncertainty.

However, so far this approach has only been examined with respect to classification tasks and requires further investigation, especially with respect to regression tasks such as dense stereo matching.
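
To make the parameter overhead discussed above concrete, the following sketch shows a linear layer with a diagonal Gaussian variational distribution over its weights, sampled via the reparametrisation trick; it is an illustrative example rather than a reproduction of any of the cited methods, and the prior as well as the corresponding Kullback-Leibler term in the loss are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalGaussianLinear(nn.Module):
    """Linear layer whose weights are drawn from a diagonal Gaussian
    variational distribution: one mean and one log-variance are learned per
    weight, doubling the number of trainable parameters compared to a
    deterministic layer."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.w_logvar = nn.Parameter(torch.full((out_features, in_features), -6.0))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Reparametrisation trick: w = mu + sigma * eps with eps ~ N(0, I),
        # so the weights are re-sampled on every forward pass while gradients
        # can still flow to mu and logvar.
        std = torch.exp(0.5 * self.w_logvar)
        weight = self.w_mu + std * torch.randn_like(std)
        return F.linear(x, weight, self.bias)
```

Using such layers only for selected parts of a network while keeping the remaining layers deterministic corresponds in spirit to the strategy of Zeng et al. (2018) and Brosse et al. (2020) discussed above.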