High Dynamic Range and Super-Resolution from Raw Image Bursts

BRUNO LECOUAT,

Inria and DI/ENS (ENS-PSL, CNRS, Inria), France

THOMAS EBOLI,

Université Paris-Saclay, ENS Paris-Saclay, Centre Borelli, France

JEAN PONCE,

Inria and DI/ENS (ENS-PSL, CNRS, Inria), France & Center for Data Science, New York University, USA

JULIEN MAIRAL,

Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, France

[Panels: Input, Ours.]

Fig. 1. An example of joint super-resolution (SR) and high-dynamic range (HDR) imaging. Left: An 18-photo burst was shot at night from a hand-held Pixel 4a smartphone at 12MP resolution with an exposure time varying from 1/340s to 1/4s. The left half of the central image from the burst is shown along with the right half of the 192MP HDR image reconstructed by our algorithm with a super-resolution factor of ×4 (after tone mapping). Right: Three small crops of the two images corresponding to the colored square regions on the left. Crops from the central image of the burst are rendered using Adobe Camera Raw to convert raw files into jpg with the highest quality setting. The HDR/SR results are rendered using the PhotoMatix tone mapper (https://www.hdrsoft.com/). Note that the 192MP HDR image on the left is not reproduced at full resolution because of the corresponding file's size.

Photographs captured by smartphones and mid-range cameras have limited spatial resolution and dynamic range, with noisy response in underexposed regions and color artefacts in saturated areas. This paper introduces the first approach (to the best of our knowledge) to the reconstruction of high-resolution, high-dynamic range color images from raw photographic bursts captured by a handheld camera with exposure bracketing. This method uses a physically-accurate model of image formation to combine an iterative optimization algorithm for solving the corresponding inverse problem with a learned image representation for robust alignment and a learned natural image prior. The proposed algorithm is fast, with low memory requirements compared to state-of-the-art learning-based approaches to image restoration, and features that are learned end to end from synthetic yet realistic data. Extensive experiments demonstrate its excellent performance with super-resolution factors of up to ×4 on real photographs taken in the wild with hand-held cameras, and high robustness to low-light conditions, noise, camera shake, and moderate object motion.

Authors' addresses: Bruno Lecouat, Inria and DI/ENS (ENS-PSL, CNRS, Inria), Paris, France, bruno.lecouat@inria.fr; Thomas Eboli, Université Paris-Saclay, ENS Paris-Saclay, Centre Borelli, Paris, France, thomas.eboli@ens-paris-saclay.fr; Jean Ponce, Inria and DI/ENS (ENS-PSL, CNRS, Inria), France & Center for Data Science, New York University, New York, USA, jean.ponce@inria.fr; Julien Mairal, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, France, julien.mairal@inria.fr.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2022 Copyright held by the owner/author(s).
0730-0301/2022/7-ART38
https://doi.org/10.1145/3528223.3530180

CCS Concepts: • Computing methodologies → Image processing.

Additional Key Words and Phrases: Computational photography, raw bursts, high-dynamic range imaging, super-resolution

ACM Reference Format:
Bruno Lecouat, Thomas Eboli, Jean Ponce, and Julien Mairal. 2022. High Dynamic Range and Super-Resolution from Raw Image Bursts. ACM Trans. Graph. 41, 4, Article 38 (July 2022), 21 pages. https://doi.org/10.1145/3528223.3530180

1 INTRODUCTION

Key factors limiting the level of detail of photographs captured by digital cameras are their spatial resolution and dynamic range: High resolution is necessary to zoom on small image regions, and high dynamic range is needed to reveal details hidden in dark areas (e.g., shadows) and avoid color artefacts due to saturation in bright ones (e.g., highlights). For a given sensor size, higher resolution also means smaller pixel size, with less light reaching each photoreceptor, resulting in lower dynamic range and increased noise in dark regions, an effect exacerbated in smartphones by their small sensor size. It is natural, and by now rather common, to use multiple photographs to reconstruct an image with higher spatial resolution, a process known as super-resolution (or SR for short in this presentation, see, for example [Wronski et al. 2019]), or dynamic range, a process known as high dynamic range (or HDR) imaging (see, for example [Debevec and Malik 1997]).

We propose in this paper a novel method for joint SR and HDR imaging from the raw image bursts featuring a range of different exposures that can now be captured by most smartphones and mid-range cameras (Figure 1). A major challenge tackled by our algorithm is the automated alignment with sub-pixel accuracy of the burst elements required to compensate for camera shake and possibly (moderate) object motion, despite the variations in saturation and signal-to-noise ratio due to the different exposures used across the burst. Other notable difficulties include the high contrasts and noise levels encountered in night scenes for example, where a photo might feature both very dark and noisy regions and saturated ones near light sources, as well as the fact that a digital camera only captures one color channel at each pixel according to the corresponding color filter array (or CFA, often a Bayer pattern).

Despite the latter challenge, it now seems clear that it is better to work directly with the raw image data than with the sRGB pictures produced by the image signal processor (or ISP) of the camera, since their construction involves several steps, including white balance, denoising, demosaicking, gamma correction, compression of each color channel content to 8 bits, etc., that result in an unavoidable loss of information in high spatial frequencies and dynamic range.

The approach proposed in the rest of this presentation extends the algorithm for multi-frame super-resolution of [Lecouat et al. 2021] to jointly perform blind denoising, demosaicking, super-resolution and HDR image reconstruction from raw bursts. Its key features can be summarized as follows:

• Our method uses a physically-accurate model of image formation that accounts for the successive transformations applied to the original analog irradiance image, including quantization of the signal, noise, exposure and spatial quantization.

• We combine an iterative optimization algorithm for solving the corresponding inverse problem with a learned image representation for robust alignment and a learned natural image prior. This is the first main technical novelty of our paper, enabling us to address the joint reconstruction of high-resolution, high-dynamic range color images from raw photographic bursts captured by a handheld camera with exposure bracketing.

• The proposed algorithm is fast, with low memory requirements compared to state-of-the-art learning-based approaches to image restoration, and features that are learned end to end from synthetic yet realistic data, generated using again our image formation model.

• We introduce an image alignment method to compensate for camera shake which is robust to (moderate) object motions, and an image fusion technique which is itself tolerant to alignment errors. Together, these form the second main technical novelty of our paper, and they are key factors in the robustness of our algorithm in both the SR and HDR imaging tasks with, notably, significant improvement over [Lecouat et al. 2021] in super-resolution.

• Extensive experiments demonstrate the excellent performance of the proposed approach with super-resolution factors of up to ×4 on real photographs taken in the wild with hand-held cameras, and high robustness to low-light conditions, noise, camera shake, and moderate object motion. These results are confirmed by quantitative and qualitative comparisons with the state of the art in super-resolution and HDR imaging tasks on synthetic and real image bursts.

2 BACKGROUND

2.1 High dynamic range imaging

Bracketing techniques [Debevec and Malik 1997; Granados et al. 2010; Hasinoff et al. 2010; Mann and Picard 1995] construct an HDR image by combining multiple photographs of the same scene with different exposures. The darkest pictures are used to reconstruct areas prone to saturation and the brightest ones are needed for restoring dark regions that are likely to be noisy (we will come back to that point later). They typically work on linRGB images, that is, demosaicked images before they are transformed by the camera's ISP into sRGB images ready for display. A sequence of sRGB input photographs must therefore in general be "linearized" by inverting this mapping, also known as the camera response function (or CRF). The HDR image is then reconstructed as a weighted sum of the linearized bracket images, normalized by the corresponding shutter speed. Its pixel values are typically represented as single-precision floating-point numbers, with min and max those of the image bracket. Bracketing-based approaches to HDR imaging face a number of classical issues, including choosing the optimal fusion weights, estimating the CRF [Debevec and Malik 1997], leveraging accurate raw image noise models [Aguerrebere et al. 2014; Granados et al. 2010; Hanji et al. 2020], selecting the best exposure parameters for a fixed number of frames in the bracket [Gallo et al. 2012; Hasinoff et al. 2010], registering images with different exposures [Gallo et al. 2015; Zimmer et al. 2011], which is significantly more challenging than aligning same-exposure images [Ma et al. 2017], and removing ghosting artefacts [Sen et al. 2012; Tursun et al. 2016] due to misalignment.

Using raw bursts with constant shutter speed. Unlike classical exposure bracketing techniques, HDR+ [Hasinoff et al. 2016] takes as input a burst of raw underexposed images captured with the same exposure time. These are mostly free of saturation but noisy in dark regions. A 12-bit, denoised raw image is obtained by aggregating the 10-bit photos of the burst. It is then demosaicked and tone mapped.


Recent updates of HDR+ use a couple of well-exposed frames to achieve better denoising and deghosting [Ernst and Wronski 2021], or leverage the metering technique of Hasinoff et al. [2010] to adapt the original algorithm to low-light situations [Liba et al. 2019].

Using pixelwise ISO sensitivities. Instead of relying on classical imaging devices, Nayar and Mitsunaga [2000] reconstruct a single HDR image from a sensor with spatially-varying pixel exposures. This approach can be further combined with learning-based methods [Martel et al. 2020; Serrano et al. 2016]. Even though our work focuses on standard sensors, we believe it to be flexible enough to be adapted to pixelwise ISO sensitive sensors under simple modification of the image formation model. This is an interesting research direction for future work, but beyond the scope of our paper.

Learning-based methods for HDR imaging have also been proposed. Kalantari and Ramamoorthi [2017] introduce a convolutional neural network (CNN) to predict the irradiance from three low-dynamic range (LDR) images, with different exposures, camera poses and possibly moving subjects, pre-aligned with an optical flow algorithm. Most recent CNN-based multi-image methods [Niu et al. 2021; Pérez-Pellitero et al. 2021; Wu et al. 2018; Yan et al. 2021, 2020] learn to align and fuse demosaicked images in an end-to-end manner, and they typically operate on image triplets such as those in the dataset of Kalantari and Ramamoorthi [2017]. CNN-based approaches to single-image HDR include [Eilertsen et al. 2017; Endo et al. 2017; Liu et al. 2020; Santos et al. 2020]. They rely on machine learning to recover missing details in the darkest and saturated areas of tone-mapped images.

2.2 Super-resolution

We limit here our discussion to multi-frame super-resolution algorithms. Although single-image learning-based techniques have been used to generate very impressive and highly-detailed images [e.g., Dahl et al. 2017; Menon et al. 2020], their objective is not the same as ours: they aim at generating a high-resolution picture compatible with one input photograph, whereas we want to reconstruct the details that are actually available in the input burst.

Energy-based methods. High-frequency information present in low-resolution (LR) photos with aliasing artefacts is useful for reconstructing a high-resolution (HR) image from multiple LR frames [Farsiu et al. 2006]. Unfortunately, this information is typically lost during the denoising and demosaicking steps performed by the camera ISP pipeline to produce sRGB images. Farsiu et al. [2006] estimate an HR demosaicked image from a sequence of raw photographs by minimizing a penalized energy, that is, by solving an inverse problem via optimization. Wronski et al. [2019] adapt the kernel method of Takeda et al. [2007] and exploit natural hand tremor to jointly demosaick and super-resolve a raw image burst with magnification factors up to ×3 in a fraction of a second on a handheld smartphone.

Learning-based techniques. Bhat et al. [2021a] learn a CNN with an attention module to align, demosaick and super-resolve a burst of raw images. In a follow-up work, Bhat et al. [2021b] minimize a penalized energy including a data term comparing the sum of parameterized feature residuals. Lecouat et al. [2021] learn instead a hybrid neural network alternating between aligning the images with the Lucas-Kanade algorithm [Lucas and Kanade 1981], predicting an HR image by solving a model-based least-squares problem, and evaluating a learned prior function. Luo et al. [2021] propose a neural network architecture that aligns an input burst of images while performing super-resolution with a non-local fusion module.

2.3 Joint HDR imaging and super-resolution

The algorithms proposed by Choi et al. [2009]; Gunturk and Gevrekci [2006] address joint SR and HDR imaging with an existing SR energy-based solver. To tackle the multi-exposure setting, they introduce weights inspired by bracketing techniques in the least-squares term. More generally, this joint image restoration problem has been addressed in a two-stage fashion: (i) image registration with an algorithm robust to varying exposures and (ii) solving a least-squares problem including operators modelling both SR and HDR. For instance, Rad et al. [2007] propose an exposure-invariant transform before applying the FFT-based registration technique of Vandewalle et al. [2006]. The image is then obtained by solving a penalized least-squares problem. Zimmer et al. [2011] use an optical flow approach with normalized gradients for robustness to changes of exposure, and the HR/HDR image is found by solving again a penalized least-squares problem. Traonmilin and Aguerrebere [2014] adapt a backprojection algorithm to the multi-exposure setting and simply solve a weighted least-squares problem without prior, with comparable performance but lower computational cost. Vasu et al. [2018] explore the case where the LDR SR images are also blurred with camera shake or motion blur. Similar to the HDR case, CNNs have also been proposed for single-image joint SR and HDR, e.g., [Kim et al. 2019], while Deng et al. [2021] address instead joint SR, HDR and tone mapping by merging a pair of previously aligned over- and under-exposed images with a two-stream CNN. In contrast with these techniques, we use trainable image features to adapt the raw image registration module of Lecouat et al. [2021] to the varying-exposure setting in a robust manner, and jointly learn these features and a parametric image prior in an end-to-end manner.

Figure 2 shows examples of the input data these methods use and samples of the high-resolution HDR images predicted by the proposed approach.

3 IMAGE FORMATION MODEL

We now describe the process generating a burst of low-dynamic range, low-resolution raw images from a high-resolution HDR image. This process yields a natural inverse problem formulation, which we will leverage later to build a trainable architecture.

3.1 Dynamic range

After analog-to-digital conversion, a camera sensor outputs a black-and-white mosaicked image whose pixel values are integers obtained by quantizing the number of photons collected by each photosite on a linear $q$-bit scale [Clark 2006], where $q$ is called the bit depth of the sensor. We denote by $P_q$ the set of the discrete values a pixel may take, as measured in data numbers (or DNs [Clark 2006; Martinec 2008]), from 0 to $2^q - 1$.


[Panels: LR frame 1, LR frame 6, LR frame 12, LR frame 18, Ours (HDR+SR ×4).]

Fig. 2. Exposure bracketing. Left: Three high dynamic range, high-resolution images obtained by our method from 18-image bursts taken by a handheld Pixel 4a smartphone with a ×4 super-resolution factor. We show post-processed sRGB pictures for the sake of presentation. Right: Small crops from sample photos in the burst and our reconstruction. Note the high level of noise in the short-exposure images, in particular in the second row, and the saturated regions in the long-exposure ones. As shown by the last column of the figure, our algorithm recovers details in saturated areas and removes noise in the darkest regions. The reader is invited to zoom in on a computer screen.

The dynamic range $R(u)$ for a pixel $u$ is defined as the ratio of the largest to the smallest values this pixel may take: the larger the bit depth of the sensor, the greater is its maximal value in $P_q$. The ratio is usually given in photographic stops, where each stop corresponds to a multiple of 2. In practice, the largest value $u$ can take is limited either by the bit depth $q$ or the white level $c$ set by the camera, to prevent color artefacts in highlights [Luijk 2007], whereas the lowest value is actually limited by the noise $\varepsilon(u)$ and by the camera black level $b$ [Foi et al. 2008]. Note that even in the absence of light, $\varepsilon(u)$ is never 0 since any digital camera suffers from various sources of electronic noise [Hasinoff et al. 2010]. This also shows that increasing dynamic range is strongly related to denoising, as discussed later in this section.

3.2 Exposure

As mentioned above, raw pixel values depend linearly on the number of photons captured by each photosite (ignoring quantization effects) and thus on exposure time. In photography, this effect is quantified by the exposure value (or EV): increasing it by +1EV (resp. decreasing by -1EV) corresponds to doubling (resp. halving) the raw pixel values. The EV depends on the ISO gain, aperture size and exposure time. In this work, we will only control the exposure time $\Delta t$, keeping it small enough to (mostly) avoid motion blur, and keep the other two quantities constant since modifying the ISO gain may change the noise distribution [Hasinoff et al. 2010] and adjusting the aperture size changes the blur of out-of-focus regions [Levin et al. 2007].

Fig. 3. Empirical measurements of the shot and read noise levels $\alpha$ and $\beta$ from the metadata of raw images taken with the Google Pixel 3a smartphone (the plot axes are $\log_{10}(\alpha)$ and $\log_{10}(\beta)$). The numbers next to the markers are the corresponding empirical ISO levels. As observed by Brooks et al. [2019], there exists a linear relationship between $\log_{10}(\alpha)$ and $\log_{10}(\beta)$ that we leverage to train our models.

The raw value $y(u)$ in $P_q$ recorded at some pixel $u$ is thus related to the irradiance $x(u)$ in $\mathbb{R}_+$ at the same location by

$$y(u) = S(\Delta t\, x(u)), \qquad (1)$$

where $S$ is the function mapping pixel values from $\mathbb{R}_+$ to $P_q$. This equation is only valid when $S(\Delta t\, x(u)) < 2^q - 1$, with saturation occurring for higher values. Using short exposure times limits saturation, but, as shown in the next section, leads to a poor signal-to-noise ratio (or SNR).
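To make the exposure model concrete, the sketch below simulates Eq. (1) for a single frame, implementing $S$ as rounding and clipping to the $q$-bit range; the bit depth, exposure time, and irradiance values are hypothetical examples, not measurements.

```python
import numpy as np

Q = 10                      # hypothetical sensor bit depth

def S(values, q=Q):
    """Map analog values in R+ to the discrete q-bit scale P_q = {0, ..., 2^q - 1}."""
    return np.clip(np.round(values), 0, 2 ** q - 1).astype(np.int64)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4000.0, size=(8, 8))   # hypothetical irradiance map (arbitrary units)
dt = 1 / 4                                  # exposure time in seconds

y = S(dt * x)                               # Eq. (1): raw values on the q-bit scale
saturated = y == 2 ** Q - 1                 # pixels where Eq. (1) is no longer valid
```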

3.3 Noise and SNR

The raw image noise $\varepsilon(u)$ at each pixel comes from the physics of light and the electronics of the camera. The former is called shot noise, and it can be modelled with a Poisson distribution [Foi et al. 2008]. The latter is often referred to as read noise and corresponds to random signal fluctuations caused by the electronics and quantization effects. It is usually modelled with a zero-mean Gaussian distribution [Foi et al. 2008]. The combination of shot and read noise can be modelled by a single random variable $\varepsilon(u)$ following a zero-mean Gaussian distribution with pixel-dependent standard deviation, defined for any pixel value $y(u)$ as [Brooks et al. 2019; Foi et al. 2008; Plötz and Roth 2017]:

$$s(u) = \sqrt{\alpha y(u) + \beta}, \qquad (2)$$

where $\alpha$ and $\beta$ are respectively the variances of the shot and read noise. Figure 3 shows the distribution of $\alpha$ (shot noise level) and $\beta$ (read noise level) for the Google Pixel 3a camera. We have obtained these values from the EXIF metadata of raw images taken with the smartphone. Each marker corresponds to a couple $(\log_{10}(\alpha), \log_{10}(\beta))$ for an ISO level. In dark regions, read noise dominates shot noise, and limits the total dynamic range.

For the Poissonian-Gaussian noise model of Eq. (2), the SNR is:

$$\mathrm{SNR}(u) = \frac{m(u)\, y(u)}{s(u)} = \frac{m(u)\, y(u)}{\sqrt{\alpha y(u) + \beta}}, \qquad (3)$$

where $m$ is a binary mask excluding the saturated pixels. It is a monotonically increasing function of the pixel value $y(u)$, essentially linear in dark regions (e.g., shadows) where read noise dominates shot noise, and essentially proportional to $\sqrt{y(u)}$ in bright regions (e.g., highlights) where the opposite occurs [Granados et al. 2010]. As already discussed in the previous section, noise removal is essential for generating images with high dynamic range, and Equation (3) shows that high raw pixel values lead to better SNR and thus better dynamic range in both dark and bright image regions. But high pixel values everywhere in an image can typically only be achieved at the cost of saturating the brightest areas. Exposure bracketing avoids this problem by using the longest exposures to eliminate read noise from dark regions and the shortest ones to avoid saturation in bright spots.
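The noise and SNR models of Eq. (2) and (3) are easy to evaluate numerically; the sketch below does so for hypothetical shot and read noise levels chosen within the range displayed in Figure 3.

```python
import numpy as np

def noise_std(y, alpha, beta):
    """Eq. (2): standard deviation of the combined shot + read noise at raw value y."""
    return np.sqrt(alpha * y + beta)

def snr(y, alpha, beta, white_level=1.0):
    """Eq. (3): per-pixel SNR, with m masking out saturated pixels."""
    m = (y < white_level).astype(float)
    return m * y / noise_std(y, alpha, beta)

# Hypothetical noise levels, within the range displayed in Figure 3.
alpha, beta = 10 ** -2.5, 10 ** -5.0
y = np.linspace(1e-4, 1.0, 5)      # raw values normalized to [0, 1]
print(snr(y, alpha, beta))         # SNR grows monotonically with y (0 when saturated)
```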

3.4 Overall image formation model

The original analog image cannot be recovered on a computer, and we instead focus on estimating a discrete HR/HDR $sh \times sw \times 3$ photograph $x$ with pixel values in $\mathbb{R}_+$ from a burst of $K$ raw LR and LDR images $y_k$ ($k = 1, \ldots, K$) of size $h \times w$ with entries in $P_q$. The integer $s$ is the super-resolution factor. Following [Lecouat et al. 2021], let us introduce the warp operator $W_k$ associated with the $k$th photo in the burst and accounting for camera shake, the blur operator $B$, modeled by a convolution, which takes into account the integration of the signal over the pixel area, the decimation operator $D_s$ associated with the super-resolution factor $s$, and the operator $C$, a binary mask modeling the sensor CFA. Putting them together and taking into account the exposure time $\Delta t_k$, the analog low-resolution image associated with the irradiance image $x$ is $a_k = C D_s B W_k (\Delta t_k x)$, which can be rewritten as $a_k = A_k x$, where $A_k = \Delta t_k\, C D_s B W_k$ (the factor $\Delta t_k$ commutes with the operators since it only scales the image values).

Combining this model with Eq. (2) and (1) yields, for all $k = 1, \ldots, K$:

$$y_k = S(A_k x + \varepsilon_k), \qquad (4)$$

where we abuse the notation so $S$ operates on a whole image instead of a scalar, and $\varepsilon_k$ is a zero-mean Gaussian noise with pixel-dependent variance $\alpha A_k x + \beta$ according to Eq. (2). The operator $C D_s B$ impacts the spatial resolution, while $S$ and the noise variance limit the dynamic range of each image $y_k$.
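As an illustration, a minimal noise-free simulation of $a_k = A_k x$ could look as follows in PyTorch; the RGGB layout, the bilinear warping, and the `apply_A` helper are our own simplifying assumptions for this sketch, not the authors' code.

```python
import torch
import torch.nn.functional as F

def apply_A(x, dt, theta, blur_kernel, s):
    """Simulate a_k = dt * C D_s B W_k x for one frame (noise-free part of Eq. 4).

    x: (1, 3, sh, sw) irradiance image; theta: (1, 2, 3) affine warp matrix;
    blur_kernel: (1, 1, k, k) odd-sized pixel-integration kernel; s: SR factor.
    """
    # W_k: affine warp (camera motion), bilinear resampling.
    grid = F.affine_grid(theta, x.shape, align_corners=False)
    x = F.grid_sample(x, grid, align_corners=False)
    # B: blur modeling the integration of the signal over the pixel area.
    k = blur_kernel.expand(3, 1, -1, -1)
    x = F.conv2d(x, k, padding=blur_kernel.shape[-1] // 2, groups=3)
    # D_s: decimation by the super-resolution factor.
    x = x[..., ::s, ::s]
    # C: Bayer CFA mask (assumed RGGB), keeping one channel per pixel.
    h, w = x.shape[-2:]
    mask = torch.zeros(1, 3, h, w)
    mask[:, 0, 0::2, 0::2] = 1  # R
    mask[:, 1, 0::2, 1::2] = 1  # G
    mask[:, 1, 1::2, 0::2] = 1  # G
    mask[:, 2, 1::2, 1::2] = 1  # B
    return dt * (mask * x).sum(dim=1, keepdim=True)  # mosaicked frame, exposure dt
```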

Note that our model assumes that the scene is static during burst acquisition, which may result in ghosting artefacts in the presence of scene motion, when using this model within an inverse problem formulation. We will, however, introduce in the next section simple weighting strategies to make our approach robust to moderate scene motion.

4 PROPOSED APPROACH

The goal of this work is to design a function $F_\theta$ with learnable parameters $\theta$ which, given $K$ raw images $Y = \{y_1, \ldots, y_K\}$ and corresponding exposure times $\Delta = \{\Delta t_1, \ldots, \Delta t_K\}$, predicts a single-precision floating-point estimate $\hat{x}$ of the HR $sh \times sw \times 3$ irradiance map:

$$\hat{x} = F_\theta(Y, \Delta). \qquad (5)$$

As explained later in this section, all images of the burst are automatically aligned on a reference frame $y_{k_0}$ (typically the central one, which in general has a reasonable exposure).

4.1 Formulation of the problem

Inverse problem. Our image formation model (4) suggests using an inverse problem formulation for the design of $F_\theta$ and the recovery of $\hat{x}$. We first convert the discrete raw pixel values from $y_k$ in $P_q$ into 32-bit real values in $[0, 1]$, and construct the binary mask $m(y_k)$ representing saturated pixels containing non-informative values. With an abuse of notation, we keep the notation $y_k$ for the floating-point burst images in the rest of this presentation, and formulate the solution of our inverse problem as the joint recovery of the warp operators $W_1, \ldots, W_K$ (parameterized with a piecewise-affine model, as detailed later), and the irradiance image $x$:

$$\min_{x,\, W_1, \ldots, W_K} \; \frac{1}{2} \sum_{k=1}^{K} \| w_k \odot (y_k - A_k x) \|_F^2 + \lambda \Omega(x), \qquad (6)$$

where $A_k$ is the image formation operator defined in the previous section, $\odot$ denotes pointwise multiplication, and the function $\Omega$ is a regularizer, which will be discussed in detail later. The $h \times w$ maps $w_k$ store pixel-wise weights used to control the relative contribution of each frame to the reconstruction of each pixel, a key factor for robustness in bracketing methods [Aguerrebere et al. 2014; Granados et al. 2010].

Fig. 4. Joint HDR imaging and super-resolution ×4 with a burst taken with a hand-held Pixel 4a at night, facing a spotlight. Top: The original burst. Middle: The central image in the burst (left) and the reconstructed HDR/SR image after tone mapping (right). Bottom: Six crops showing details of the original and HDR/SR images, presented respectively in the first and second rows.

A robust weighting strategy. We write

$$w_k = \frac{\Delta t_k\, m(y_k)}{\sum_{j=1}^{K} \Delta t_j\, m(y_j)} \odot g(y_k, W_k y_{k_0}), \qquad (7)$$

where $m(y_k)$ is the binary mask with zero values at saturated pixels (this formulation assumes the existence of non-saturated pixels at corresponding locations in the burst; when all pixels are saturated, we use uniform weights instead). Here, the function $g$ is a confidence factor, often used in HDR imaging to weight down incorrectly aligned images [Tursun et al. 2016] and avoid ghosting effects. It can be handcrafted from classical image features and/or priors, but we will instead follow a plug-and-play strategy (detailed in the next section) to directly learn a parametric function $g$ from supervisory data. Our overall weighting strategy is useful for HDR since it gives larger weights to frames obtained with longer exposure times, which are less noisy, but it also accounts for registration errors through the learned function $g$, which turns out to be critical for robustness to moderate scene motion.

Warp parameterization. We align images with piecewise-affine warps $W_k = W_{p_k}$, where $W_{p_{k_0}}$ is the identity and $p = \{p_1, \ldots, p_K\}$ is the set of warp parameters. This is implemented by tiling the images into small (e.g., 200×200) crops, that are aligned independently with affine transformations with 6 parameters.
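A minimal sketch of this tile-wise warping is shown below, assuming image dimensions that are multiples of the tile size and ignoring blending at tile boundaries; the `warp_tiles` helper and its argument convention are hypothetical.

```python
import torch
import torch.nn.functional as F

def warp_tiles(img, params, tile=200):
    """Apply an independent 6-parameter affine warp to each tile of the image.

    img: (1, C, H, W); params: dict mapping tile index (i, j) -> (1, 2, 3)
    affine matrix. H and W are assumed to be multiples of the tile size.
    """
    out = img.clone()
    _, _, H, W = img.shape
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            crop = img[..., i:i + tile, j:j + tile]
            theta = params[(i // tile, j // tile)]
            grid = F.affine_grid(theta, crop.shape, align_corners=False)
            out[..., i:i + tile, j:j + tile] = F.grid_sample(crop, grid, align_corners=False)
    return out
```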

Regularizer. Many classical regularizers can be used in the formulation of inverse problems in image processing applications, for example sparse total-variation priors [Choi et al. 2009] or combinations of penalty functions computed from pixel or histogram values [Debevec and Malik 1997; Heide et al. 2014; Rad et al. 2007]. We instead follow the same plug-and-play strategy as for the confidence function $g$, and learn a CNN in place of the proximal operator [Parikh and Boyd 2014] of the penalty function $\Omega$. We detail its implementation in Sec. 4.3.

4.2 Optimization strategy

We solve our optimization problem with half-quadratic splitting (or HQS) [Geman and Yang 1995] by introducing an auxiliary variable $z$ and minimize

$$\min_{x,\, z,\, p} \; \frac{1}{2} \sum_{k=1}^{K} \| w_k \odot (y_k - A_k z) \|_F^2 + \frac{\eta}{2} \| x - z \|_F^2 + \lambda \Omega(x). \qquad (8)$$

The parameter $\eta$ is usually increased at each iteration according to some preset schedule, which guarantees that, as $\eta$ grows, the solution of this relaxed problem converges to that of the original one (6) [Geman and Yang 1995]. As detailed in Sec. 4.3, we choose instead to learn this parameter from training data, which improves performance in practice. Note that we now find the warp operators by minimizing the energy with respect to the warp parameters $p$, and that all operators involved are implemented efficiently by exploiting the image structure (e.g., convolutions instead of large sparse operators, etc.). The optimization is carried out by first initializing $z$ and $p$, then, in an alternate fashion, repeating $T$ times ($T = 3$ in our implementation) an HQS stage consisting of the three steps detailed below. The motivation for this strategy is that it allows us to gracefully convert our optimization method into a trainable architecture, as discussed in Sec. 4.3, thanks to automatic differentiation tools [Baydin et al. 2018] implemented in modern deep learning frameworks.
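The sketch below outlines one possible implementation of these HQS stages, folding in the gradient-based $z$-update of Eq. (9) described next; the callables implementing each $A_k$ and its adjoint, the learned proximal operator, and the constant step sizes are our assumptions (the paper learns the scalar parameters end to end, and the warp-parameter update is omitted here).

```python
import torch

def hqs_reconstruct(burst, weights, A_ops, A_adjoints, prox, x0,
                    T=3, n_grad_steps=3, delta=0.1, eta=1.0):
    """Sketch of the HQS stages minimizing Eq. (8). A_ops and A_adjoints are
    lists of callables implementing each A_k and its adjoint; prox stands in
    for the (learned) proximal operator of Eq. (10)/(11)."""
    z, x = x0.clone(), x0.clone()
    for _ in range(T):                        # T = 3 HQS stages
        for _ in range(n_grad_steps):         # z-update: gradient steps, Eq. (9)
            grad = eta * (z - x)
            for y_k, w_k, A, At in zip(burst, weights, A_ops, A_adjoints):
                grad = grad + At(w_k ** 2 * (A(z) - y_k))
            z = z - delta * grad
        x = prox(z)                           # x-update: proximal step, Eq. (10)
    return x
```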

Updating $z$. The auxiliary image $z$ is updated by a few steps of a simple gradient descent (GD) algorithm:

$$z \leftarrow z - \delta \left( \eta (z - x) + \sum_{k=1}^{K} A_k^\top \left( w_k^2 \odot (A_k z - y_k) \right) \right), \qquad (9)$$

where $\delta$ is a step size (which will be learned automatically by the procedure presented in the next section), and of course $A_k$ depends on the current warping parameters $p$.

Updating $x$. Minimizing (8) with respect to the image $x$ while keeping the other variables fixed amounts to computing the so-called proximal operator $G$ of $\Omega$ [Parikh and Boyd 2014]:

$$x = G(z, \lambda/\eta) = \arg\min_{x} \; \frac{1}{2} \| x - z \|_F^2 + \frac{\lambda}{\eta} \Omega(x). \qquad (10)$$

We will detail in the next section how we implement $G$.

Updating $p$. Lecouat et al. [2021] estimate the warp parameters $p_k$ ($k \neq k_0$) on 200×200 tiles in a 4-scale Gaussian image pyramid, running three iterations of the Lucas-Kanade algorithm [Lucas and Kanade 1981] at each scale. We will show in Sec. 4.3 how to do significantly better, both quantitatively and qualitatively, by using a similar approach to align learned features instead.

Initialization of $p$ and $z$. A fast and coarse initialization of the warp parameters $p$ is obtained using a sub-pixel variant of the FFT-based algorithm of [Anuta 1970] with the features of [Ward 2003]. After having estimated $p$ for the first time with the Lucas-Kanade algorithm and before the first $z$-update stage, we initialize $z$ as follows: we demosaick each frame $y_k$ with bilinear interpolation, align them with the warping operators $W_k$, average them with the normalized weights $\Delta t_k / \sum_{j=1}^{K} \Delta t_j$, and finally upscale the resulting image by a factor $s$ with bilinear interpolation. This procedure yields a fast and coarse estimate of the HR and HDR image to start the GD algorithm in Eq. (9).
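A sketch of this initialization, assuming the frames have already been demosaicked to RGB tensors and that the warps are given as affine matrices (both assumptions of ours for brevity):

```python
import torch
import torch.nn.functional as F

def init_z(frames_rgb, dts, thetas, s):
    """Coarse HDR/HR initialization: align the (already demosaicked) RGB
    frames, average them with exposure-normalized weights, and upscale by
    the super-resolution factor s with bilinear interpolation."""
    total = sum(dts)
    acc = torch.zeros_like(frames_rgb[0])
    for rgb, dt_k, theta in zip(frames_rgb, dts, thetas):
        grid = F.affine_grid(theta, rgb.shape, align_corners=False)
        aligned = F.grid_sample(rgb, grid, align_corners=False)
        acc = acc + (dt_k / total) * aligned   # exposure-normalized average
    return F.interpolate(acc, scale_factor=s, mode="bilinear", align_corners=False)
```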

4.3 Learnable architecture

The optimization procedure described in the previous section is implemented as a function $F_\theta$ that produces an estimate $\hat{x}$ from a burst $Y$ and exposure times $\Delta$, according to Eq. (5). By writing this function as a finite sequence of operations that are differentiable with respect to the model parameters $\theta$, it is then possible to leverage training data, that is, pairs of HR/HDR images $x$ associated to LR/LDR bursts, to learn these parameters for the reconstruction task. This of course raises questions about data collection and generation, which are discussed later, but it also opens up many possibilities for further improvements. In particular, as described in the rest of this section, this allows us to learn implicitly the regularization function $\Omega$ by taking advantage of deep learning principles, as well as learning appropriate weighting strategies and robust features to improve image alignment.

Learnable proximal operator $G$. Following the plug-and-play strategy [Venkatakrishnan et al. 2013], which has proven powerful in the signal processing literature, we replace the proximal operator $G$ above by a function $G_\omega$ represented by a CNN and parameterized by $\omega$, such that the update (10) becomes

$$x = G_\omega(z, \gamma), \qquad (11)$$

where $\gamma$ is also a trainable parameter. The CNN has a residual U-net architecture, which is a smaller variant of the network of Zhang et al. [2020] for single-image super-resolution. This network has four scales with respectively 32, 64, 128, 128 channels per scale. We also run experiments with even smaller versions of the network with 32 features per channel (dubbed small) and 16 features per channel (dubbed tiny). Note that for our problem, the first layer has 4 input channels: three for the predicted RGB auxiliary variable $z$ and one for the scalar $\gamma$.
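The sketch below illustrates the 4-channel input convention of Eq. (11) with a deliberately simplified stand-in for the residual U-net; the layer sizes and the `ProxNet` name are hypothetical.

```python
import torch
import torch.nn as nn

class ProxNet(nn.Module):
    """Stand-in for G_omega: a plain conv stack instead of the residual U-net
    used in the paper, just to show the 4-channel input convention."""
    def __init__(self, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 3, 3, padding=1),
        )

    def forward(self, z, gamma):
        # Broadcast the trainable scalar gamma to a constant fourth channel.
        g = gamma.expand(z.shape[0], 1, z.shape[2], z.shape[3])
        return z + self.net(torch.cat([z, g], dim=1))   # residual prediction

prox = ProxNet()
gamma = nn.Parameter(torch.tensor(0.1))                 # trainable, as in Eq. (11)
x = prox(torch.rand(1, 3, 64, 64), gamma.view(1, 1, 1, 1))
```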

Learnable confidence function $g$. Similarly, since designing the function $g$ by hand is difficult, we choose to learn instead a CNN $g_\rho$, and the fusion weights $w_k$ become, for all $k \neq k_0$:

$$w_k = \frac{\Delta t_k\, m(y_k, c)}{\sum_{j=1}^{K} \Delta t_j\, m(y_j, c)} \odot g_\rho(y_k, W_k y_{k_0}). \qquad (12)$$

The function $g_\rho$ is implemented with the tiny variant of the U-net architecture used above. The network takes as input the concatenation along the channel dimension of RGB versions of the images $y_k$ and $W_k y_{k_0}$ obtained by bilinear interpolation.
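A possible implementation of these weights is sketched below; the saturation threshold, the 6-channel input convention for `g_rho`, and the small constant guarding the denominator are our assumptions.

```python
import torch

def fusion_weights(burst, dts, warped_ref, g_rho, white_level=1.0):
    """Sketch of Eq. (12). `burst` holds RGB versions of the frames y_k;
    `warped_ref` holds W_k y_{k0} for each k; g_rho maps a 6-channel input
    to a confidence map (our assumed convention)."""
    masks = [(y < white_level).float() for y in burst]       # m(y_k, c)
    denom = sum(dt * m for dt, m in zip(dts, masks)) + 1e-8  # avoid divide-by-zero
    weights = []
    for y_k, dt_k, m_k, ref_k in zip(burst, dts, masks, warped_ref):
        conf = g_rho(torch.cat([y_k, ref_k], dim=1))         # learned confidence
        weights.append(dt_k * m_k / denom * conf)
    return weights
```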

Learnable features for alignment. A classical approach to the registration of frames captured with different exposure times is to use MTB features [Ward 2003]. Here, we construct instead a single-channel feature map for each raw image using again the tiny CNN with U-net architecture, then perform the multi-scale Lucas-Kanade algorithm for a fixed number of iterations (3 iterations per scale of the pyramid) directly on the feature map. Our implementation of the forward additive version of the Lucas-Kanade algorithm is fully differentiable. Therefore, we can learn the parameters of the feature map jointly with all the trainable parameters of our model, following a strategy similar to [Chang et al. 2017]. As shown in the experimental section, this significantly improves registration performance.
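To illustrate why this registration is end-to-end trainable, the sketch below implements a single forward-additive Lucas-Kanade step restricted to translations on feature maps; the paper uses a 6-parameter affine model and a multiscale pyramid, and sign conventions depend on the warp parameterization, so treat this as a simplified sketch only.

```python
import torch
import torch.nn.functional as F

def lk_translation_step(feat_ref, feat_mov, shift):
    """One forward-additive Lucas-Kanade step restricted to a 2D translation,
    run on single-channel feature maps of shape (1, 1, H, W). `shift` holds
    2 translation parameters (in pixels). Every operation is differentiable,
    so gradients can flow back into the feature extractor."""
    _, _, H, W = feat_ref.shape
    one, zero = torch.ones(1), torch.zeros(1)
    # Build the affine matrix for the current translation (normalized coords).
    row1 = torch.cat([one, zero, 2 * shift[0:1] / W])
    row2 = torch.cat([zero, one, 2 * shift[1:2] / H])
    theta = torch.stack([row1, row2]).unsqueeze(0)           # (1, 2, 3)
    grid = F.affine_grid(theta, feat_mov.shape, align_corners=False)
    warped = F.grid_sample(feat_mov, grid, align_corners=False)
    # Spatial gradients of the warped features (central differences).
    gx = (warped[..., :, 2:] - warped[..., :, :-2])[..., 1:-1, :] / 2
    gy = (warped[..., 2:, :] - warped[..., :-2, :])[..., :, 1:-1] / 2
    r = (feat_ref - warped)[..., 1:-1, 1:-1]                 # residual, valid area
    # Gauss-Newton normal equations for the two translation parameters.
    J = torch.stack([gx.flatten(), gy.flatten()], dim=1)     # (N, 2)
    H_mat = J.t() @ J + 1e-6 * torch.eye(2)                  # damped for stability
    b = J.t() @ r.flatten()
    return shift + torch.linalg.solve(H_mat, b)
```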


4.4 Learning the model parameters $\theta$

We denote here by $\theta$ all the learnable parameters of our method, including those of the CNNs and the scalar parameters involved in the HQS optimization procedure introduced above (e.g., $\delta$, $\eta$, ...). We use triplets of the form $(x^{(i)}, Y^{(i)}, \Delta^{(i)})$ ($i = 1, \ldots, n$) of training data to supervise the learning procedure. In our setting, where ground-truth HDR/HR images are normally not available for real image bursts, the training data is necessarily semi-synthetic, that is, obtained by applying various transformations to real images. Obtaining robust inference with real raw bursts is thus challenging. The hybrid nature of our algorithm, which exploits both a learning-free inverse problem formulation and data-driven priors, appears to be a key to achieving good generalization on real raw data acquired in various conditions that do not necessarily occur in the training dataset.

Dataset generation. Given a collection of sRGB images, we construct bursts of LDR/LR raw images and HDR/HR RGB targets using the ISP inversion method of [Brooks et al. 2019] and our image formation pipeline, adjusting the gain to simulate different exposure times. The noise levels are sampled following the empirical model of Figure 3.

Training loss. With this training data in hand, we supervise our model using the $\ell_1$ distance between the target irradiance images $x^{(i)}$ and the predicted ones $F_\theta(Y^{(i)}, \Delta^{(i)})$, and minimize the cost function:

$$\min_{\theta} \; \sum_{i=1}^{n} \left\| x^{(i)} - F_\theta(Y^{(i)}, \Delta^{(i)}) \right\|_1. \qquad (13)$$

We have also tried to use the so-called $\mu$-law [Kalantari and Ramamoorthi 2017] to include some kind of tone mapping in the supervision, but it only marginally improved the visual quality of the images predicted by our model.

Optimizer. We minimize Eq. (13) using the Adam optimizer with the learning rate set to $10^{-4}$ for 400k iterations. We decrease the learning rate by 0.5 every 100k iterations. The weights of the CNNs are randomly initialized with the default setting of the PyTorch library.
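A minimal training loop matching this setup might look as follows; the model and the data loader yielding target/burst/exposure triplets are placeholders for our architecture and the semi-synthetic dataset.

```python
import torch

def train(model, loader, n_iters=400_000):
    """Minimal loop: Adam at 1e-4, halved every 100k iterations, l1 loss of
    Eq. (13). `model` and `loader` (yielding (x, Y, Delta)) are placeholders."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100_000, gamma=0.5)
    it = 0
    while it < n_iters:
        for x, burst, dts in loader:
            pred = model(burst, dts)              # F_theta(Y, Delta), Eq. (5)
            loss = torch.abs(x - pred).mean()     # l1 loss, Eq. (13)
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
            it += 1
            if it >= n_iters:
                break
    return model
```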

5 RESULTS

We first show in Section 5.1 several qualitative results illustrating the performance of our method for joint HDR imaging, super-resolution, demosaicking and denoising from real raw image bursts. Qualitative and quantitative comparisons with existing methods for super-resolution, HDR imaging, and registration are presented in Section 5.2, Section 5.3 and Section 5.4, respectively. The effect of the choice of prior and the robustness of our method for real images are discussed in Section 5.5. Additional results, ablation studies, and discussions of its limitations can be found in the appendix.

Note that all the HDR images are rendered using Photomatix¹ for tone mapping, which is itself a challenging task [Reinhard et al. 2002] beyond the scope of this paper. For baselines operating on RGB images instead of raw photographs, we first process raw files with Adobe Camera Raw to generate RGB images with the highest quality possible.

¹ https://www.hdrsoft.com/

5.1 Joint SR and HDR on raw image bursts

To the best of our knowledge, we are the first to address jointly HDR, super-resolution, demosaicking, and denoising on bursts of raw images. Therefore, we will mostly present here qualitative results, and will defer quantitative comparisons to the following sections, which evaluate the performance of our algorithm on separate HDR or SR tasks.

We consider bursts acquired in different settings by a Pixel 3a or 4a camera, using an Android application to shoot bursts of 11 to 18 raw images. We choose an EV step of 1/3 to 2/3 between each shot. This is particularly important for night scenes to avoid motion blur in the longest-exposure frames. Our method successfully restores finer details and extends the dynamic range of the original shot by denoising dark areas and restoring clipped signals. More precisely:

• Figures 4 and 6 show night-time photos with large dynamics, similar to Figure 1, with both under- and over-exposed areas in the low-resolution central frame. Both the dynamics and the resolution are significantly improved by our algorithm.

• An outdoor day-time photograph is shown in Figure 5, with a particularly large dynamic range. The scene contains both under-exposed, noisy areas in the shadows and large bright saturated areas. Note also that the scene contains patterns which are smaller than the resolution of the native image, which is a particularly hard setting for demosaicking. Our approach handles such situations well.

• A night scene with both very dark building parts and light bulbs, resulting in a very large dynamic range (Figure 6). Our approach, unlike our competitors, can recover details in both the dark and saturated areas.

5.2 Pure super-resolution

We now move to pure super-resolution from raw image bursts, and compare our approach with Lecouat et al. [2021], using examples from their paper. The bursts in this section all have the same exposure, making the alignment simpler compared to the previous section. We first perform a quantitative evaluation on the semi-synthetic benchmark of Bhat et al. [2021a], following their experimental setup and using their dataset. Table 1 presents a comparison of our approach for SR, which can be seen as an improved variant of Lecouat et al. [2021]. All methods in the comparison are designed to process raw image bursts. We first note that our improvements in the image registration module yield +1dB over [Lecouat et al. 2021] for similar network capacities. The geometric error, measuring alignment discrepancies, is also four times smaller than that of [Lecouat et al. 2021], which further suggests the usefulness of our modified Lucas-Kanade module. Since the other methods of the panel do not explicitly predict any motion vector, we cannot compute the corresponding geometric errors. We also have a PSNR gain of about 0.5 to 1dB over three of the recent competitors and fall only behind Luo et al. [2021] by less than 1dB, but with 13 times fewer learnable parameters. Therefore, the proposed approach is also a compact and competitive algorithm for SR alone. A speed comparison, presented later in Section A.4, also shows that our method is faster at inference time.


[Panels: Low resolution (dcraw), ACR (×2, 1 frame), Kim et al. (×2, 1 frame), Ours (×4, 18 frames).]

Fig. 5. Day-time comparisons of joint HDR imaging and super-resolution algorithms with bursts acquired by a Pixel 4a. Left: The central image in the burst (top) and our reconstruction (bottom). Right: Comparison of close-ups of the reconstructions obtained by the CNN-based Adobe Camera Raw single-image algorithm for ×2 super-resolution and demosaicking, the CNN-based ×2 super-resolution method of [Kim et al. 2019], and our method. (Note: part of the phone number legible in our case is masked for privacy reasons.)

[Panels: Low resolution (dcraw), ACR (×2, 1 frame), Kim et al. (×2, 1 frame), Ours (×4, 18 frames).]

Fig. 6. Night-time comparisons of joint HDR imaging and super-resolution algorithms with bursts acquired by a Pixel 4a. Left: The central image in the burst (top) and our reconstruction (bottom). Right: Comparison of close-ups of the reconstructions obtained by the CNN-based Adobe Camera Raw single-image algorithm for ×2 super-resolution and demosaicking, the CNN-based ×2 super-resolution method of [Kim et al. 2019], and our method.


Fig. 7. Visual comparison for super-resolution only on real same-exposure raw bursts, of respectively $K = 20$ and $K = 30$ frames, with state-of-the-art competitors. We do not present HDR results in this figure. Our approach limits Moiré artefacts in the first row and reveals in general more high-frequency image details in both rows. The last row shows an example requiring deghosting. The ghosted LR image on the left is obtained by averaging the whole burst to show the pedestrian's motion. Bhat et al. [2021a] and our method effectively handle small object motions. The reader is invited to zoom in. [Panels: Low resolution, Bhat et al. 2021a, Lecouat et al. 2021, Luo et al. 2021, Ours.]

The previous comparison is conducted on semi-synthetic data, both for training the models and for testing, which makes its conclusions difficult to generalize to the real world of raw bursts from handheld cameras. Nevertheless, it remains the best existing quantitative experimental setup, to our knowledge, since it is not possible to acquire reliable HR ground-truth data along with LR raw bursts. Figure 7 shows two challenging real-world examples on which we compare qualitatively the approaches of Bhat et al. [2021a], Lecouat et al. [2021] and Luo et al. [2021] to ours, for a ×4 super-resolution factor. We display in the first row the results for a burst of $K = 20$ raw frames of a textured surface. Moiré artefacts and aliasing can respectively be noticed in the results from Bhat et al. [2021a] and Luo et al. [2021]. Such artefacts are not visible in our reconstruction and that of Lecouat et al. [2021]. The second row shows the results for a burst of $K = 30$ raw images from [Lecouat et al. 2021]. Amongst the four methods in the panel, ours returns the sharpest image, with for instance easier-to-read characters than the competitors. We point out that we have not used any sharpening algorithm on any of these images.

As remarked by Wronski et al. [2019], there is a physical limit to the maximum frequency one can reconstruct with aliasing, due to the sensor pitch or the lens point-spread function. We verify this property in Figure 8, where we show two crops from the same image, with ×2 and ×4 resolution factors. The first row shows details of a balcony clearly benefiting from a ×4 gain in resolution compared to its ×2 counterpart. The second row shows however that sometimes,

Table 1. Super-resolution (×4) comparison with a selected panel of recent methods, with average PSNR and geometric error when it can be computed. We do not perform HDR generation in this experiment. Our method falls behind that of Luo et al. [2021] within a margin of less than 1dB but with 13 times fewer parameters. We gain 1dB compared to [Lecouat et al. 2021] with a similar number of parameters by upgrading the registration module.

Model # parameters PSNR Geom (avg)

[Bhat et al. 2021a] 13M 40.76 N/A

[Lecouat et al. 2021] 3M 41.45 2.56

[Bhat et al. 2021b] - 41.56 N/A

[Dudhane et al. 2021] 6.6M 41.93 N/A

[Luo et al. 2021] 26M 43.35 N/A

Ours 3M 42.42 0.80

as predicted by Wronski et al. [2019], a ×4 upsampling factor may not reveal finer details than its ×2 counterpart.

5.3 Pure HDR imaging

We evaluate the ability of our approach to align and merge raw images into an HDR image at the same resolution as the input.

We compare our approach with a bracketing technique, implemented with the weights of [Hasinoff et al. 2010], two state-of-the-art CNNs [Wu et al. 2018; Yan et al. 2021] trained to predict a 32-bit image from only three LDR images with -2, 0 and +2EV or -3, 0 and +3EV, and recent single-image HDR CNNs [Liu et al. 2020; Santos et al. 2020].


[Panels: LR, HR (×2), HR (×4).]

Fig. 8. Visualizing the super-resolution limit at resolutions increased by ×2 and ×4 with our model. The first image in the first row benefits from the ×4 improvements whereas the one in the second row (from the same photograph) is not further enhanced after ×2. See the discussion in the text.

We generate 266 raw bursts with 32-bit ground-truth images, each burst containing 11 synthetic raw images with small random shifts and rotations and Poissonian-Gaussian noise with parameters $\alpha$ and $\beta$ selected according to the distribution in Figure 3. More details about data generation can be found in Section B of the appendix. To evaluate the CNNs trained on RGB images, we first pick the three raw frames corresponding to {-2.4, 0, +2.4} EV in the burst and demosaick them with the approach of Malvar et al. [2004]. We also demosaick the frames before merging the HDR images with the bracketing technique. If the raw frames are not aligned, after demosaicking, we align the frames either with the phase correlation algorithm [Tursun et al. 2016] on the MTB features of Ward [2003] or with our Lucas-Kanade-based registration technique. For fairness with the CNNs, we compare our approach when there are only three frames in the bracket (the same as for the CNNs) and with the whole burst.

We present in Table 2 the results of our comparison. We evaluate the PSNR and SSIM metrics on both the output of each algorithm and after evaluating the irradiance maps with the $\mu$-law, playing the role of a tone mapping algorithm [Kalantari and Ramamoorthi 2017]. However, these typical image processing metrics may not be adapted to HDR imaging [Aydin et al. 2008; Eilertsen et al. 2021]; we thus also report the HDR-VDP2 perceptual quality score of Narwaria et al. [2015] (version 2.2.2). Note that Wu et al. [2018] and Yan et al. [2021] use RGB images for training, while our method leverages more information by directly processing raw frames. We also compare our method to the single-image methods of Santos et al. [2020] and Liu et al. [2020] running on the central frame of the burst.

Our algorithm using 11 frames achieves the best results, as expected, with HDR-VDP2 margins ranging from +4 to +9 over recent CNN-based methods and of +4 over the bracketing technique of Hasinoff et al. [2010] also using 11 frames. The gap with CNNs comes from our ability to restore the darker areas in raw photographs containing large read noise, whereas these networks are trained on RGB images only. Figure 10 shows qualitative comparisons with the baselines in Table 2 for bursts of 21 images taken during day time and night time. Our method achieves the best visual results in both dark and saturated areas. Note that the CNN baselines considered here have been designed to handle 1 or 3 images only, which is not sufficient to achieve effective denoising through image fusion in challenging settings.

We also compare our approach with a public-domain implementation [Monod et al. 2021] of Google's HDR+ [Hasinoff et al. 2016], which addresses HDR imaging by fusing images with the same exposure. In this setting, HDR essentially boils down to burst denoising, which is effectively handled by our approach. Figure 9 shows a qualitative comparison of HDR+ with our technique. We achieve better denoising, especially in the darkest areas, while also increasing spatial resolution.

5.4 Multi-exposure registration

We evaluate the performance of our registration module based on learnable features. We measure the geometric alignment error between the ground-truth motion and the predicted one [Sanchez 2016] by computing the Euclidean distance between the aligned image corners and the ground-truth ones; the error is counted in number of pixels in the HR image.

We report the mean and the median over 266 validation bursts (containing 11 images per burst) synthesized with the same protocol as for generating the training data. We compare a typical multi-exposure registration scheme combining MTB features and phase correlation [Tursun et al. 2016] (used for prealigning the images in our model) with 3 iterations of the pyramid Lucas-Kanade (PLK) algorithm over plain pixels and over deep features learnt in an end-to-end manner. The three methods are run on the mosaicked and possibly noisy images, prior to any ISP processing. We evaluate this panel over three scenarios: (i) HDR generation without SR from noise-free raw bursts, (ii) HDR generation without SR from raw bursts with noise, and (iii) joint HDR and SR with factor ×4 from raw bursts with noise.

Table 3 shows that, in all cases, our approach achieves the best quantitative results, with a margin ranging from 0.5px for the (unrealistic) noise-free benchmark to more than 1px for the more challenging ones featuring noise. Interestingly, using more iterations does not always mean a better alignment. A plausible explanation is that our model is trained for using three iterations of the LK algorithm, and may be sub-optimal for more iterations.

We have also empirically observed that the errors in this table are always greater than those reported by Lecouat et al. [2021] in their work for aligning frames with the same exposure. This gap is caused in practice by the darkest and brightest frames, which are much harder to align because of the noise in dark regions and large saturated areas.

Figure 11 illustrates the advantage of running the Lucas-Kanade algorithm with deep features rather than plain pixels in a real situation. Note the purple zipping artefacts caused by faulty alignment before


Fig. 9. Comparison between the public-domain implementation [Monod et al. 2021], dubbed here HDR+ Ipol, of Google's HDR+ [Hasinoff et al. 2016] (top) with our HDR/SR ×4 method. Left: The images reconstructed by HDR+ Ipol (top) and our method (bottom) from a burst of 8 same-exposure images acquired by a Nexus 5. Note that they are barely distinguishable at this resolution. Right: Crops showing that our algorithm reveals finer details while effectively suppressing noise in dark areas.

Table 2. Quantitative comparison of various algorithms for HDR imaging (we do not perform super-resolution in this experiment) on a synthetic dataset consisting of bracketed raw bursts simulated with our pipeline. Our method directly takes raw frames as an input. The other methods process RGB frames obtained here with VNG demosaicking. Our algorithm quantitatively outperforms the other HDR methods on this dataset, which is not surprising as it is trained to leverage the information lost in the raw-to-RGB conversion.

Method PSNR (dB) μ-PSNR (dB) SSIM μ-SSIM HDR-VDP2 (Q)

K=1 frames

[Liu et al. 2020] 20.11 24.42 0.611 0.690 57.32

[Santos et al. 2020] 22.14 25.85 0.641 0.702 62.94

K=3 frames

[Hasinoff et al. 2010] + MTB 28.08 29.46 0.819 0.847 61.13

[Hasinoff et al. 2010] + PLK 27.25 28.69 0.814 0.836 60.82

[Wu et al. 2018] 26.47 27.61 0.771 0.782 61.80

[Yan et al. 2021] 26.31 27.11 0.761 0.774 61.14

Ours 33.75 34.39 0.942 0.943 63.24

K=11 frames

[Hasinoff et al. 2010] + MTB 29.54 30.96 0.862 0.892 62.07

[Hasinoff et al. 2010] + PLK 28.80 30.21 0.862 0.888 61.95

Ours 37.83 39.22 0.964 0.971 65.44

image fusion in the left image obtained with the plain-pixel Lucas-Kanade algorithm. These artefacts vanish in the image on the right using deep features.

5.5 Discussion

Choice of the prior function. An important component of our approach is the image proximal operator $G_\omega$. Figure 12 shows a qualitative comparison of a prior-free version, solely aligning and merging the frames, a version using the image gradient soft-thresholding function derived from the classical TV-$\ell_1$ prior, and our approach with a learnable module. The TV-based version is significantly sharper than the one without a prior. The parametric prior returns a better zoomed-in image, e.g., next to the head and the dress of the statue.

Robustness on real images. A key advantage of our approach is the accuracy of its registration module, as detailed in Table 3 and illustrated in Figure 11. We have remarked that this module is particularly efficient for aligning raw frames with the same exposure, as illustrated by Figure 7 in the context of SR with factor ×4. Given


[Panels: Yan et al. 2021 (3 frames), Wu et al. 2018 (3 frames), Santos et al. 2020 (1 frame), Liu et al. 2020 (1 frame), Ours (3 frames, no SR).]

Fig. 10. Comparison with CNN-based HDR methods processing one to three input frames. Left: A sequence of three input frames, followed by our result after tone mapping, for two scenes. Right: Small crops from the scenes obtained by various methods. To be fair, we compare them with a version of our model that does not perform super-resolution (×1 upscaling factor) and only processes a burst of 3 images, in the EV range [-2.4, 0, 2.4]. We observe that, in well-exposed regions, the reconstruction performances of the three methods are similar. Our method appears to be more robust to noise, but more sensitive to non-rigid motion, as shown in the case of the flag.

[Panels: LK | LK, with deep features.]

Fig. 11. Qualitative comparison of the reconstructed image with a pyramid of Lucas-Kanade run on plain pixels or deep features. Note the zipping artefacts along the edges of the large white rectangle. Our learnable variant is faster and leads to more accurate results. The reader is invited to zoom in.

a burst of raw photographs including moving objects with reasonable motion during exposure, e.g., the pedestrian in the figure, we can predict a high-quality HR image well-aligned with the reference frame, whereas the competitors may introduce ghosting or colored artefacts. Notwithstanding, we have also noted that non-rigid motions in the raw frame burst may lead to blur in the final predicted image. For instance, Figure 10 compares the restoration results from Wu et al. [2018], Yan et al. [2021] and our model for a crop featuring a waving flag, i.e., a non-rigid motion. We select $K = 3$ images with EV values of {-2.4, 0, 2.4} for the CNNs and for our model. The CNNs trained to remove ghosting artefacts accurately align the flag with the reference frame, whereas our prediction is blurry in the red section of the flag. This may stem from the fact that multi-exposure image registration is a very challenging problem and that we do not have such non-rigid motions in our training data. For better deghosting, the injection of non-rigid motion in the training data, similarly to the dataset introduced in [Kalantari and Ramamoorthi 2017], is an interesting future research direction.

6 CONCLUSION

We have introduced an effective algorithm for the reconstruction of high-resolution, high-dynamic range color images from raw photographic bursts with exposure bracketing. We have demonstrated its excellent performance with super-resolution factors of up to


Table 3. Quantitative comparison of registration methods on synthetic data with average and median geometric errors [Sanchez 2016]. We compare MTB features combined with sub-pixel phase correlation [Tursun et al. 2016], the pyramid Lucas-Kanade (PLK) algorithm, and our variant of PLK using deep features. The three algorithms are run on the mosaicked images. On each benchmark, we outperform both vanilla PLK and the MTB-based approach.

Model Geom (avg.) Geom (med.)

×1 - No noise - 11 raw frames
MTB + phase correlation 2.93 2.61
3 PLK iterations 1.32 0.97
3 PLK iterations + deep features (ours) 0.91 0.60
5 PLK iterations 1.47 1.10
5 PLK iterations + deep features (ours) 0.88 0.61

×1 - Noise - 11 raw frames
MTB + phase correlation 3.58 2.99
3 PLK iterations 2.77 2.40
3 PLK iterations + deep features (ours) 1.25 0.95
5 PLK iterations 2.76 2.10
5 PLK iterations + deep features (ours) 1.40 1.00

×4 (aliasing) - Noise - 11 raw frames
MTB + phase correlation 5.93 4.67
3 PLK iterations 3.82 3.58
3 PLK iterations + deep features (ours) 2.04 2.03
5 PLK iterations 3.87 3.50
5 PLK iterations + deep features (ours) 2.62 2.17

[Panels: Low-resolution, No prior, TV-$\ell_1$, Ours.]

Fig. 12. Visual comparison of the impact of the prior for joint HDR and ×4 SR. We fuse $K = 20$ images in this example. We note that the three methods effectively suppress the noise present in the original LR frames. However, our learnable prior (here with 300k parameters) yields a higher quality image. The reader is invited to zoom in.

×4 on real photographs taken in the wild with hand-held cameras, and high robustness to low-light conditions, noise, camera shake, and moderate object motion. We have also shown that it compares favorably to the state of the art in both the HDR imaging and super-resolution domains. The keys to the success of this algorithm are the powerful underlying image formation model, a hybrid approach for solving the corresponding inverse problem that combines the interpretability and generalization power of model-based techniques with the flexibility and robustness of (deep) machine learning technology, and new twists on classical alignment techniques that make them robust by design. The proposed approach also has limitations: in particular, its performance in saturated regions is far from perfect, and it is only robust to relatively small scene motions. Future work will include improving these aspects, but also adapting the method to large camera displacements (wide-baseline setting) and scene motions, which in turn will lead to 3D capture applications. A missing piece is of course handling motion and defocus blur, and we also plan to adapt our method to a new range of applications such as focus stacking, where image alignment requires comparing images with different focuses that may be even more different than those obtained under changing exposures. Finally, we are also interested in scientific applications in astronomy, microscopy, and remote sensing.

ACKNOWLEDGMENTS

This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). JM and BL were supported by the ERC grant number 714381 (SOLARIS project) and by ANR 3IA MIAI@Grenoble Alpes (ANR-19-P3IA-0003). JP was supported in part by the Louis Vuitton/ENS chair in artificial intelligence and the Inria/NYU collaboration. This work was granted access to the HPC resources of IDRIS under the allocation 2022-AD011011252R2 made by GENCI. This work was done while TE was a PhD student at Inria.

REFERENCES

Cecilia Aguerrebere, Julie Delon, Yann Gousseau, and Pablo Musé. 2014. Best Algorithms for HDR Image Generation. A Study of Performance Bounds. SIAM Journal on Imaging Sciences 7, 1 (2014), 1–34.

Paul E. Anuta. 1970. Spatial Registration of Multispectral and Multitemporal Digital Imagery Using Fast Fourier Transform Techniques. IEEE Transactions on Geoscience Electronics 8, 4 (1970), 353–368.

Tunç Ozan Aydin, Rafal Mantiuk, and Hans-Peter Seidel. 2008. Extending Quality Metrics to Full Luminance Range Images. In Proceedings of Human Vision and Electronic Imaging (SPIE Proceedings), Bernice E. Rogowitz and Thrasyvoulos N. Pappas (Eds.), Vol. 6806. SPIE, 68060B.

Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2018. Automatic Differentiation in Machine Learning: A Survey. Journal of Machine Learning Research (JMLR) 18 (2018), 1–43.

Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. 2021a. Deep Burst Super-Resolution. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). 9209–9218.

Goutam Bhat, Martin Danelljan, Fisher Yu, Luc Van Gool, and Radu Timofte. 2021b. Deep Reparametrization of Multi-Frame Super-Resolution and Denoising. (2021), 2460–2470.

Tim Brooks, Ben Mildenhall, Tianfan Xue, Jiawen Chen, Dillon Sharlet, and Jonathan T. Barron. 2019. Unprocessing Images for Learned Raw Denoising. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). 11036–11045.

Che-Han Chang, Chun-Nan Chou, and Edward Y. Chang. 2017. CLKN: Cascaded Lucas-Kanade Networks for Image Alignment. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). 2213–2221.

Jongseong Choi, Min Kyu Park, and Moon Gi Kang. 2009. High Dynamic Range Image Reconstruction with Spatial Resolution Enhancement. Computer Journal 52, 1 (2009), 114–125.

Roger N. Clark. 2006. Digital Camera Reviews and Sensor Performance Summary. https://clarkvision.com/articles/digital.sensor.performance.summary/.
