HighRes-net: Recursive Fusion for Multi-Frame Super-Resolution
of Satellite Imagery
Michel Deudon * 1 Alfredo Kalaitzis * 1 Israel Goytom 2 Md Rifat Arefin 2 Zhichao Lin 1 Kris Sankaran 2 3
Vincent Michalski 2 3 Samira E. Kahou 2 4 Julien Cornebise 1 Yoshua Bengio 2 3
Abstract
Generative deep learning has sparked a new wave
of Super-Resolution (SR) algorithms that enhance
single images with impressive aesthetic results,
albeit with imaginary details. Multi-frame Super-
Resolution (MFSR) offers a more grounded ap-
proach to the ill-posed problem, by conditioning
on multiple low-resolution views. This is impor-
tant for satellite monitoring of human impact on
the planet – from deforestation, to human rights vi-
olations – that depend on reliable imagery. To this
end, we present HighRes-net, the first deep learn-
ing approach to MFSR that learns its sub-tasks in
an end-to-end fashion: (i) co-registration, (ii) fu-
sion, (iii) up-sampling, and (iv) registration-at-
the-loss. Co-registration of low-resolution views
is learned implicitly through a reference-frame
channel, with no explicit registration mechanism.
We learn a global fusion operator that is applied re-
cursively on an arbitrary number of low-resolution
pairs. We introduce a registered loss, by learning
to align the SR output to a ground-truth through
ShiftNet. We show that by learning deep represen-
tations of multiple views, we can super-resolve
low-resolution signals and enhance Earth Obser-
vation data at scale. Our approach recently topped
the European Space Agency’s MFSR competition
on real-world satellite imagery.
1. Introduction
Multiple low-resolution images of the same scene con-
tain collectively more information than any individual low-
resolution image, due to minor geometric displacements,
e.g. shifts, rotations, atmospheric turbulence, and instru-
ment noise. Multi-Frame Super-Resolution (MFSR) (Tsai,
1984) aims to reconstruct hidden high-resolution details
*Equal contribution. 1Element AI, London, UK. 2Mila, Montreal, Canada. 3Université de Montréal, Montreal, Canada. 4McGill University, Montreal, Canada. Correspondence to: Alfredo Kalaitzis <freddie@element.ai>.
from multiple low-resolution views of the same scene. Sin-
gle Image Super-Resolution (SISR), as a special case of
MFSR, has attracted much attention in the computer vision
and deep learning communities in the last 5 years, with neu-
ral networks learning complex image priors to upsample and
interpolate images (Xu et al., 2014; Srivastava et al., 2015;
He et al., 2016). However, not much work has explored the
learning of representations for the more general setting of
MFSR to address the additional challenges of co-registration
and fusion of multiple low-resolution images.
This paper shows how Multi-Frame Super-Resolution
(MFSR) — a valuable capability in remote sensing — can
benefit from recent advances in deep learning. Specifically,
this work is the first to introduce a deep-learning approach
that solves the co-registration, fusion and registration-at-the-
loss problems in an end-to-end learning fashion.
This research is driven by the proliferation of planetary-
scale Earth Observation to monitor climate change and the
environment. Such observation can be used to inform policy,
achieve accountability and direct on-the-ground action, e.g.
within the framework of the Sustainable Development Goals
(Jensen & Campbell, 2019).
1.1. Nomenclature
First, we cover some useful terminology:
Registration is the problem of estimating the relative geo-
metric differences between two images (e.g. due to shifts,
rotations, deformations).
Fusion in the context of MFSR, is the problem of map-
ping multiple low-resolution representations into a single
representation.
Co-registration is the problem of registering all low-
resolution views to improve their fusion.
Registration-at-the-loss is the problem of registering the
super-resolved (SR) reconstruction to the high-resolution
(HR) target image prior to computing the loss function.
Similarly, a registered loss function is effectively invariant to geometric differences between the reconstruction and the target, whereas a conventional (un-registered) loss would unduly penalize even a perfect reconstruction that is merely shifted.
Co-registration of multiple images is required for longitu-
dinal studies of land change and environmental degrada-
tion. The fusion of multiple images is key to exploiting
cheap, high-revisit-frequency satellite imagery, but of low-
resolution, moving away from the analysis of infrequent and
expensive high-resolution images. Finally, beyond fusion itself, super-resolved generation is required throughout the technical stack: both for labeling and for human oversight (Drexler, 2019), as demanded by legal contexts (Harris et al., 2018).
1.2. Contributions
HighRes-net We propose a deep architecture that learns
to fuse an arbitrary number of low-resolution frames with
implicit co-registration through a reference-frame channel.
ShiftNet Inspired by HomographyNet (DeTone et al.,
2016), we define a model that learns to register and align the
super-resolved output of HighRes-net, using ground-truth
high-resolution frames as supervision. This registration-at-
the-loss mechanism enables more accurate feedback from
the loss function into the fusion model, when comparing
a super-resolved output to a ground truth high resolution
image. Otherwise, an MFSR model naturally yields blurry outputs to compensate for the lack of registration, averaging out sub-pixel shifts and misalignments in the loss.
End-to-end fusion + registration By combining the two
components above, we propose the first architecture to learn
the tasks of fusion and registration in an end-to-end fashion.
We test and compare our approach to several baselines on
real-world imagery from the PROBA-V satellite of ESA.
Our performance has topped the Kelvins competition on
MFSR, organized by the Advanced Concepts Team of ESA (Märtens et al., 2019) (see section 5).
The rest of the paper is divided as follows: in Section 2, we
discuss related work on SISR and MFSR; Section 3 outlines
HighRes-net and section 4 presents ShiftNet, a differentiable
registration component that drives our registered loss mech-
anism during end-to-end training. We present our results in
section 5, and in Section 6 we discuss opportunities, limitations and risks of super-resolution.
2. Background
How much detail can we resolve in the digital sample of
some natural phenomenon? (Nyquist, 1928) observed that
it depends on the instrument’s sampling rate and the oscil-
lation frequency of the underlying natural signal. (Shan-
non, 1949) built a sampling theory that explained Nyquist’s
observations when the sampling rate is constant (uniform
sampling) and determined the conditions of aliasing in a
sample. Figure 2 illustrates this phenomenon.
Sampling at a high resolution (left) maintains the frequency of the chirp signal (top). When sampling at a lower resolution (right), the apparent chirp frequency is lost due to aliasing: the lower-resolution sample has a fundamentally smaller capacity for resolving the information of the natural signal, whereas a higher sampling rate can resolve more information.
Shannon’s sampling theory has since been generalized
for multiple interleaved sampling frames (Papoulis, 1977;
Marks, 2012). One result of the generalized sampling theory
is that we can go beyond the Nyquist limit of any individual
uniform sample by interleaving several uniform samples
taken concurrently. When an image is down-sampled to a
lower resolution, its high-frequency details are lost perma-
nently and cannot be recovered from any image in isolation.
However, by combining multiple low-resolution images, it is
possible to recover the original scene at a higher resolution.
2.1. Multi-frame / multi-image super-resolution
Different low-resolution samples may be sampled at dif-
ferent phase shifts, such that the same high-resolution fre-
quency information will be packed with various phase shifts.
Consequently, when multiple low-resolution samples are
available, the fundamental problem of MFSR is one of
fusion by de-aliasing — that is, to disentangle the high-
frequency components packed in low-resolution imagery.
Such was the first work on MFSR by Tsai (1984), who
framed the reconstruction of a high-resolution image by
fusion of low-resolution images in the Fourier domain, as-
suming that their phase shifts are known. However, in
practice the shifts are never known, therefore the fusion
problem must be tackled in conjunction with the registra-
tion problem (Irani & Peleg, 1991; Fitzpatrick et al., 2000;
Capel & Zisserman, 2001). Done the right way, a composite
super-resolved image can reveal some of the original high-
frequency detail that would have been unrecoverable from a
single low-resolution image.
Until now, these tasks have been learned and/or performed separately, so any incompatible inductive biases that are not reconciled through co-adaptation limit the applicability of a particular fuser-register combination. To
that end, we introduce HighRes-Net, the first fully end-to-
end deep architecture for MFSR settings, that jointly learns
and co-adapts the fusion and (co-)registration tasks to one
another.
2.1.1. VIDEO AND STEREO SUPER-RESOLUTION
The setting of MFSR is related to Video SR and Stereo
SR, although MFSR is less constrained in a few important
ways: In MFSR, a model must fuse and super-resolve from
sets, not sequences, of low-resolution inputs. The training
input to our model is an unordered set of low-resolution
views, with unknown timestamps. The target output is a
single high-resolution image — not another high-resolution
video or sequence. When taken at different times, the low-
resolution views are also referred to as multi-temporal (see
e.g. Molini et al. 2019).
In Video SR, the training input is a temporal sequence of
frames that have been synthetically downscaled. An auto-
regressive model (one that predicts at time t = T based
on past predictions at t < T ), benefits from this additional
structure by estimating the motion or optical flow in the
sequence of frames (Tao et al., 2017; Sajjadi et al., 2018;
Yan et al., 2019; Wang et al., 2019b).
In Stereo SR, the training input is a pair of low-resolution
images shot simultaneously with a stereoscopic camera, and
the target is the same pair in the original high-resolution.
The problem is given additional structure in the prior knowl-
edge that the pair differ mainly by a parallax effect (Wang
et al., 2019d;a).
2.2. Generative perspective
In addition to aliasing, MFSR deals with random processes
like noise, blur, geometric distortions – all contributing to
random low-resolution images. Traditionally, MFSR meth-
ods have assumed prior knowledge of the motion model,
blur kernel, noise and degradation process that generate the
data; see for example, (Pickup et al., 2006). Given multiple
low-resolution images, the challenge of MFSR is to recon-
struct a plausible image of higher-resolution that could have
generated the observed low-resolution images. Optimiza-
tion methods aim to improve an initial guess by minimizing
an error between simulated and observed low-resolution im-
ages. These methods traditionally model the additive noise
and prior knowledge about natural images explicitly, to
constrain the parameter search space and derive objective
functions, using e.g. Total Variation (Chan & Wong, 1998;
Farsiu et al., 2004), Tikhonov regularization (Nguyen et al.,
2001) or Huber potential (Pickup et al., 2006) to define
appropriate constraints on images.
In some situations, the image degradation process is com-
plex or not available, motivating the development of non-
parametric strategies. Patch-based methods learn to form
high-resolution images directly from low-resolution patches,
e.g. with k-nearest neighbor search (Freeman et al., 2002;
Chang et al., 2004), sparse coding and sparse dictionary
methods (Yang et al., 2010; Zeyde et al., 2010; Kim & Kwon,
2010)). The latter represents images in an over-complete
basis and allows for sharing a prior across multiple sites.
In this work, we are particularly interested in super-
resolving satellite imagery. Much of the recent work in
Super-Resolution has focused on SISR for natural images.
For instance, (Dong et al., 2014) showed that training a
CNN for super-resolution is equivalent to sparse coding and
dictionary based approaches. (Kim et al., 2016) proposed
an approach to SISR using recursion to increase the recep-
tive field of a model while maintaining capacity by sharing
weights. Many more networks and learning strategies have
recently been introduced for SISR and image deblurring.
Benchmarks for SISR (Timofte et al., 2018), differ mainly
in their upscaling method, network design, learning strate-
gies, etc. We refer the reader to (Wang et al., 2019d) for a
more comprehensive review.
Few deep-learning approaches have considered the more
general MFSR setting and attempted to address it in an end-
to-end learning framework. Recently, (Kawulok et al., 2019)
proposed a shift-and-add method and suggested “including
image registration” in the learning process as future work.
In the following sections, we describe our approach to solv-
ing both aspects of the registration problem – co-registration
and registration-at-the-loss – in a memory-efficient manner.
Figure 3. Schematic of the full processing pipeline, trained end-to-end. At test time, only HighRes-net is used. (a) HighRes-net: In the Encode stage, an arbitrary number of LR views are paired with the reference low-resolution image (the median low-resolution image in this work). Each LR view-reference pair is encoded into a view-specific latent representation. The LR encodings are fused recursively into a single global encoding. In the Decode stage, the global representation is upsampled by a certain zoom factor (×3 in this work). Finally, the super-resolved image is reconstructed by combining all channels of the upsampled global encoding. (b) Registered loss: Generally, the reconstructed SR will be shifted with respect to the ground-truth HR. ShiftNet learns to estimate the (∆x, ∆y) shift that improves the loss. Lanczos resampling: (∆x, ∆y) define two 1D shifting Lanczos kernels that translate the SR by a separable convolution.
3. HighRes-net: MFSR by recursive fusion
In this section, we present HighRes-net, a neural network for
multi-frame super-resolution inside a single spectral band
(greyscale images), using joint co-registration and fusion
of multiple low-resolution views in an end-to-end learning
framework. From a high-level, HighRes-net consists of an
encoder-decoder architecture and can be trained by stochas-
tic gradient descent using high-resolution ground truth as
supervision, as shown in Figure 3.
Notation We denote by θ the parameters of HighRes-net trained for a given upscaling factor γ. LR_{v,i} ∈ R^{C×W×H} is one of a set of K low-resolution views from the same site v, where C, W and H are the number of input channels, width and height of LR_{v,i}, respectively. We denote by SR^θ_v = F^γ_θ(LR_{v,1}, . . . , LR_{v,K}) the output of HighRes-net, and by HR_v ∈ R^{C×γW×γH} a ground-truth high-resolution image. We denote by [T_1, T_2] the concatenation of two images channel-wise. In the following we suppress the index v over sites for clarity.
HighRes-Net consists of three main steps: (1) encoding,
which learns relevant features associated with each low-
resolution view, (2) fusion, which merges relevant informa-
tion from views within the same scene, and (3) decoding,
which proposes a high-resolution reconstruction from the
fused summary.
3.1. Encode, Fuse, Decode
Embed, Encode The core assumption of MFSR is that
the low-resolution image set contains collectively more in-
formation than any single low-resolution image alone, due
to differences in photometric or spatial coverage for instance.
However, the redundant low frequency information in mul-
tiple views can hinder the training and test performance of
a MFSR model. We thus compute a reference image ref as
a shared representation for multiple low-resolution views
(LR_i)_{i=1..K} and embed each image jointly with ref. This highlights differences across the multiple views (Sanchez et al., 2019), and potentially allows HighRes-net to focus on dif-
ficult high-frequency features such as crop boundaries and
rivers during super-resolution. The shared representation or
reference image intuitively serves as an anchor for implicitly
aligning and denoising multiple views in deeper layers. We
refer to this mechanism as implicit co-registration.
HighRes-net’s embedding layer embθ consists of a convolu-
tional layer and two residual blocks with PReLU activations (He et al., 2015) and is shared across all views. The embedded hidden states s^0_i are computed in parallel as follows:

ref(c,i,j) = median(LR_1(c,i,j), . . . , LR_K(c,i,j)),   (1)

s^0_i = emb_θ([LR_i, ref]) ∈ R^{Ch×W×H},   (2)

where ref ∈ R^{C×W×H}, and Ch denotes the number of channels of the hidden state.
The imageset is padded if the number of low-resolution
views K′ is not a power of 2: we pad the set with dummy
zero-valued views, such that the new size of the imageset
K is the next power of 2. See Algorithm 1, line 1.
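As a rough illustration of this encoding step, the sketch below computes the per-pixel median reference of Eq. (1), pads the image set to the next power of 2, and embeds each view paired with the reference as in Eq. (2). This is a minimal PyTorch sketch, not the authors' implementation; the single conv layer standing in for emb_θ and the helper name prepare_and_encode are placeholders.

import torch

def prepare_and_encode(lr_views, embed):
    """lr_views: (K', C, W, H) tensor of low-res views from one site; embed stands in for emb_theta."""
    k_prime = lr_views.shape[0]
    K = 1 << (k_prime - 1).bit_length()               # next power of 2
    ref = lr_views.median(dim=0).values               # per-pixel median reference, Eq. (1)
    pad = torch.zeros((K - k_prime, *lr_views.shape[1:]))
    views = torch.cat([lr_views, pad], dim=0)         # dummy zero-valued views
    alphas = [1.0] * k_prime + [0.0] * (K - k_prime)  # 0 marks padded views
    paired = torch.cat([views, ref.expand(K, -1, -1, -1)], dim=1)  # reference-frame channel
    return embed(paired), alphas                      # hidden states s^0_i, Eq. (2)

# toy usage: 5 greyscale views (C = 1), a conv layer standing in for emb_theta
embed = torch.nn.Conv2d(2, 64, kernel_size=3, padding=1)
states, alphas = prepare_and_encode(torch.randn(5, 1, 64, 64), embed)
print(states.shape, alphas)   # torch.Size([8, 64, 64, 64]), five 1.0s and three 0.0s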
Fuse The embedded hidden states s^0_i are then fused recursively, halving the number of low-resolution states at each fusion step t, as shown in Figure 4. Given a pair of hidden states s^t_i, s^t_j, HighRes-net computes a new representation:

[s̃^t_i, s̃^t_j] = [s^t_i, s^t_j] + g_θ([s^t_i, s^t_j]) ∈ R^{2Ch×W×H},   (3)

s^{t+1}_i = s^t_i + α_j f_θ(s̃^t_i, s̃^t_j) ∈ R^{Ch×W×H},   (4)

where s̃^t_i, s̃^t_j are intermediate representations; g_θ is a shared representation within an inner residual block (equation 3); f_θ is a fusion block; and α_j is 0 if the j-th low-resolution view is part of the padding, and 1 otherwise. f_θ squashes 2Ch input channels into Ch channels and consists of a (conv2d + PReLU) layer. Intuitively, g_θ aligns the two representations, and it consists of two (conv2d + PReLU) layers.
Figure 4. HighRes-net's global fusion operator consists of a co-registration block g_θ and a fusion block f_θ, which align and combine two representations into a single representation.
The blocks (f_θ, g_θ) are shared across all pairs and depths, giving the model the flexibility to deal with variable-size inputs while significantly reducing the number of parameters to learn.
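To make the operator concrete, here is a minimal PyTorch sketch of Eqs. (3)-(4): g_θ keeps the 2Ch concatenated channels and f_θ squashes them back to Ch, mirroring the FUSE rows of Table 2 in the supplementary. The exact wiring is assumed, so treat it as an illustration rather than the reference implementation.

import torch
import torch.nn as nn

Ch = 64
g_theta = nn.Sequential(                       # "aligns" the concatenated pair, Eq. (3)
    nn.Conv2d(2 * Ch, 2 * Ch, 3, padding=1), nn.PReLU(),
    nn.Conv2d(2 * Ch, 2 * Ch, 3, padding=1), nn.PReLU(),
)
f_theta = nn.Sequential(                       # squashes 2Ch channels into Ch, Eq. (4)
    nn.Conv2d(2 * Ch, Ch, 3, padding=1), nn.PReLU(),
)

def fuse_pair(s_i, s_j, alpha_j=1.0):
    pair = torch.cat([s_i, s_j], dim=1)        # [s_i, s_j]
    aligned = pair + g_theta(pair)             # [s~_i, s~_j], Eq. (3)
    return s_i + alpha_j * f_theta(aligned)    # s^{t+1}_i, Eq. (4)

s = torch.randn(1, Ch, 32, 32)
print(fuse_pair(s, s).shape)                   # torch.Size([1, 64, 32, 32])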
Upscale and Decode After T = log_2 K fusion layers, the final low-resolution encoded state s^T_i contains information from all K input views. Any information of a spatial location that was initially missing from LR_i is now encoded implicitly in s^T_i. T is called the depth of HighRes-net. Only then is s^T_i upsampled with a deconvolutional layer (Xu et al., 2014) to a higher-resolution space s^T_HR ∈ R^{Ch×γW×γH}. The hidden high-resolution encoded state s^T_HR is eventually convolved with a 1×1 2D kernel to produce a final super-resolved image SR^θ ∈ R^{C×γW×γH}.
The overall architecture of HighRes-net is summarized in
Figure 3(a) and the pseudo-code for the forward pass is
given in Algorithm 1.
Algorithm 1 HighRes-net forward pass
Input: low-resolution views LR_1 . . . LR_{K'}
# pad inputs to next power of 2
(LR_1 . . . LR_K, α_1 . . . α_K) = pad(LR_1 . . . LR_{K'})
s^0_i = encode(LR_i)  # parallelized across K views
T = log_2 K
k = K
for t = 1 . . . T do
  for i = 1 . . . k/2 do
    s^t_i = fuse(s^{t−1}_i, s^{t−1}_{k−i}, α_{k−i})  # fuse encodings
  end for
  k = k/2
end for
SR = decode(s^T_i)  # output super-resolved view
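For concreteness, the recursion in Algorithm 1 can be sketched as below, with fuse_pair standing in for the shared (g_θ, f_θ) operator sketched earlier. Pairing order, padding bookkeeping and the decode call are simplified relative to the actual implementation.

import torch

def recursive_fuse(states, alphas, fuse_pair):
    """Fuse K encoded states (K a power of 2) into one, halving the set at every step."""
    while len(states) > 1:
        half = len(states) // 2
        new_states, new_alphas = [], []
        for i in range(half):
            j = len(states) - 1 - i                           # pair state i with state k - i
            new_states.append(fuse_pair(states[i], states[j], alphas[j]))
            new_alphas.append(max(alphas[i], alphas[j]))
        states, alphas = new_states, new_alphas
    return states[0]                                           # s^T, fed to the decoder

# toy usage: 4 encoded views (one of them padding), fused in T = 2 steps
conv = torch.nn.Conv2d(128, 64, 3, padding=1)
fuse_pair = lambda s_i, s_j, a_j: s_i + a_j * conv(torch.cat([s_i, s_j], dim=1))
states = [torch.randn(1, 64, 16, 16) for _ in range(4)]
print(recursive_fuse(states, [1.0, 1.0, 1.0, 0.0], fuse_pair).shape)  # torch.Size([1, 64, 16, 16])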
4. Registration matters
Co-registration matters for fusion. HighRes-net learns to implicitly co-register multiple low-resolution views LR_i and fuse them into a single super-resolved image SR^θ. We note that the recursive fusion stage accepts only the encoded low-resolution/reference pairs, so no aspect of our low-res co-registration scheme comes with the built-in assumption that the differences between low-resolution images must be explained only by translational motion.
A more explicit registration-at-the-loss can also be used for
measuring similarity metrics and distances between SRθ
and HR. Indeed, training HighRes-Net alone, by minimiz-
ing a reconstruction error such as the mean-squared error
between SRθ and HR, leads to blurry outputs, since the
neural network has to compensate for pixel and sub-pixel
misalignments between its output SRθ and HR.
Here, we present ShiftNet-Lanczos, a neural network that
can be paired with HighRes-net to account for pixel and
sub-pixel shifts in the loss, as depicted in Figure 3(b). Our
ablation study A.2 and qualitative visual analysis suggest
that this strategy helps HighRes-net learn to super-resolve
and leads to clearly improved results.
4.1. ShiftNet-Lanczos
ShiftNet learns to align a pair of images with sub-pixel
translations. ShiftNet registers pairs of images by predicting
two parameters defining a global translation. Once a sub-
pixel translation is found for a given pair of images, it is
applied through a Lanczos shift kernel to align the images.
ShiftNet The architecture of ShiftNet is adapted from
HomographyNet (DeTone et al., 2016). Translations are
a special case of homographies. In this sense, ShiftNet
is simply a special case of HomographyNet, predicting 2
shift parameters instead of 8 homography parameters. See
the Supplementary Material for details on the architecture
of ShiftNet.
One major difference from HomographyNet is the way we
train ShiftNet: In (DeTone et al., 2016), HomographyNet is
trained on synthetically transformed data, supervised with
ground-truth homography matrices. In our setting, Shift-
Net is trained to cooperate with HighRes-net, towards the
common goal of MFSR (see section Objective function
below).
Lanczos kernel for shift / interpolation To shift and
align an image by a sub-pixel amount, it must be convolved
with a filter that shifts for the integer parts and interpolates
for the fractional parts of the translation. Standard options
for interpolation include the nearest-neighbor, sinc, bilinear,
bicubic, and Lanczos filters (Turkowski, 1990). The sinc
filter has infinite support, as opposed to any (finite) digital signal, so in practice it produces ringing or ripple artifacts — an
example of the Gibbs phenomenon. The nearest-neighbor
and bilinear filters do not induce ringing, but strongly at-
tenuate the higher-frequency components (over-smoothing),
and can even alias the image. The Lanczos filter reduces the
ringing significantly by using only a finite part of the sinc
filter (up to a few lobes from the origin). Experimentally,
we found the Lanczos filter to perform the best.
4.2. Objective function
In our end-to-end setting, registration improves super-
resolution as HighRes-net receives more informative gra-
dient signals when its output is aligned with the ground
truth high-resolution image. Conversely, super-resolution
benefits registration, since good features are key to align
images (Clement et al., 2018). We thus trained HighRes-Net
and ShiftNet-Lanczos in a cooperative setting, where both
neural networks work together to minimize an objective
function, as opposed to an adversarial setting where a gener-
ator tries to fool a discriminator. HighRes-net infers a latent
super-resolved variable and ShiftNet maximizes its similar-
ity to a ground truth high-resolution image with sub-pixel
shifts.
By predicting and applying sub-pixel translations in a dif-
ferentiable way, our approach for registration and super-
resolution can be combined in an end-to-end learning frame-
work. Shift-Net predicts a sub-pixel shift ∆ from a pair
of high-resolution images. The predicted transformation
is applied with Lanczos interpolation to align the two im-
ages at a pixel level. ShiftNet and HighRes-Net are trained
end-to-end, to minimize a joint loss function. Our objective
function is composed of a registered reconstruction loss,
described in Algorithm 2.
Algorithm 2 Sub-pixel registered loss
Input: super-resolution SR^θ, ground-truth high-res HR
(∆x, ∆y) = ShiftNet(SR^θ, HR)  # register SR to HR
# 1D Lanczos kernels for x and y sub-pixel shifts
(κ_∆x, κ_∆y) = LanczosShiftKernel(∆x, ∆y)
# 2D sub-pixel shift by separable 1D convolutions
SR^{θ,∆} = SR^θ ∗ κ_∆x ∗ κ_∆y
# sub-pixel registered loss
ℓ_{θ,∆} = loss(SR^{θ,∆}, HR)  # ℓ_{θ,∆} ≤ loss(SR^θ, HR)
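As an illustration of Algorithm 2, a minimal sketch of the Lanczos shift and the registered loss is given below. The number of lobes (a = 3), the normalization and the sign convention are assumptions, and per-sample shifts within a batch are not handled; in the full model the (∆x, ∆y) pair comes from ShiftNet rather than being fixed by hand.

import math
import torch
import torch.nn.functional as F

def lanczos_kernel(shift, a=3):
    """1D Lanczos kernel whose taps are L_a(i - shift) for i = -a..a, normalised to sum to 1."""
    x = torch.arange(-a, a + 1, dtype=torch.float32) - float(shift)
    def sinc(t):
        t = math.pi * t
        return torch.where(t.abs() < 1e-8, torch.ones_like(t), torch.sin(t) / t)
    k = torch.where(x.abs() < a, sinc(x) * sinc(x / a), torch.zeros_like(x))
    return k / k.sum()

def lanczos_shift(img, dx, dy, a=3):
    """Translate a (1, 1, H, W) image by (dx, dy) pixels with two separable 1D convolutions."""
    kx = lanczos_kernel(dx, a).view(1, 1, 1, -1)   # horizontal kernel
    ky = lanczos_kernel(dy, a).view(1, 1, -1, 1)   # vertical kernel
    return F.conv2d(F.conv2d(img, kx, padding=(0, a)), ky, padding=(a, 0))

# registered loss: shift SR by the predicted (dx, dy) before comparing it to HR
sr, hr = torch.rand(1, 1, 96, 96), torch.rand(1, 1, 96, 96)
dx, dy = 0.4, -1.2                                  # would come from ShiftNet(SR, HR)
registered = F.mse_loss(lanczos_shift(sr, dx, dy), hr)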
Leaderboard score & loss We iterated our design based
on our performance on the leaderboard of the ESA competi-
tion. The leaderboard is ranked by the cPSNR (clear Peak
Signal-to-Noise Ratio) score. It is similar to the mean-
squared error, but also corrects for brightness bias and clouds in satellite images (Märtens et al., 2019); the proposed architecture, however, is decoupled from the choice of loss. See Algorithm 2 for computing the registered loss ℓ_{θ,∆}. We further regularize the L2 norm of ShiftNet's output with a hyperparameter λ, and our final joint objective is given by:

L_{θ,∆}(SR^θ, HR) = ℓ_{θ,∆} + λ‖∆‖₂   (5)
5. Experiments
The PROBA-V satellite carries two different cameras for
capturing a high-resolution / low-resolution pair. This makes
the PROBA-V dataset one of the first publicly available
datasets for MFSR with naturally occurring low-resolution
and high-resolution pairs of satellite imagery.
Perils of synthetic data This is in contrast to most of the
work in SR (Video, Stereo, Single-Image, Multi-Frame),
where it is common practice to train with low-resolution
images that have been artificially generated through simple
bilinear down-sampling 1 (Bulat et al., 2018; Wang et al.,
2019c; Nah et al., 2019). Methods trained on artificially
1See also, image restoration track at CVPR19. The vast ma-
jority of challenges were performed on synthetically downscaled /
degraded images: REDS4: “NTIRE 2019 challenge on video de-
blurring: Methods and results”, NTIRE Workshop at CVPR2019.
downscaled datasets are biased, in the sense that they learn
to undo the action of a simplistic downscaling operator.
On the downside, they also tune to its inductive biases.
For example, standard downsampling kernels (e.g. bilin-
ear, bicubic) are simple low-pass filters, hence the eventual
model implicitly learns that all input images come from the
same band-limited distribution. In reality, natural complex
images are only approximately band-limited (Ruderman,
1994). Methods that are trained on artificially downscaled
datasets fail to generalize to real-world low-resolution, low
quality images (Shocher et al., 2018). For this reason, we
experiment only on PROBA-V, a dataset that does not suffer
from biases induced by artificial down-sampling.
5.1. Proba-V Kelvin dataset
The performance of our method is illustrated with satellite
imagery from the Kelvin competition, organized by ESA’s
Advanced Concept Team (ACT).
The Proba-V Kelvin dataset (Märtens et al., 2019) contains
1450 scenes (RED and NIR spectral bands) from 74 hand-
selected Earth regions around the globe at different points
in time. The scenes are split into 1160 scenes for training
and 290 scenes for testing. Each data point consists of exactly one 100m-resolution image, provided as a 384 × 384 greyscale image (HR), and several 300m-resolution images of the same scene, provided as 128 × 128 greyscale images (LR), spaced days apart. We refer the reader to the Proba-V man-
ual (Wolters et al., 2014) for further details on image acqui-
sition.
Each scene comes with at least 9 low-resolution views, and
an average of 19. Each view comes with a noisy quality
map. The quality map is a binary map, that indicates con-
cealed pixels due to volatile features, such as clouds, cloud
shadows, ice, water and snow. The sum of clear pixels (1s
in the binary mask) is defined as the clearance of a low-
resolution view. These incidental and noisy features can
change fundamental aspects of the image, such as the con-
trast, brightness, illumination and landscape features. We
use the clearance scores to randomly sample from the im-
ageset of low-resolution views, such that views with higher
clearance are more likely to be selected. This strategy helps
to prevent overfitting. See the Supplementary Material for
more details.
Working with missing & noisy values A quality map
can be used as a binary mask to indicate noisy or occluded
pixels, due to clouds, snow, or other volatile objects. Such a
mask can be fed as an additional input channel in the respec-
tive low-resolution view, in the same fashion as the reference
frame. When missing value masks are available, neural net-
works can learn which parts of the input are anomalous,
noisy, or missing, when provided with such binary masks
(see e.g. (Che et al., 2018)). In satellite applications where
cloud masks are not available, other segmentation methods would be needed to infer such masks as a preprocessing
step (e.g. (Long et al., 2015)). In the case of the PROBA-V
dataset, we get improved results when we make no use of the
masks provided. Instead we use the masks only to inform
the sampling scheme within the low-resolution imageset to
prevent overfitting.
5.2. Comparisons
All experiments use the same hyperparameters, see Supple-
mentary Material. By default, each imageset is padded to
32 views for training and testing, unless specified otherwise.
Our PyTorch implementation takes less than 9h to train on a
single NVIDIA V100 GPU. At test time, it takes less than
0.2 seconds to super-resolve (×3 upscaling) a scene with
32 low-resolution 128 × 128 views. Our implementation is
available on GitHub 2 .
We evaluated different models on ESA’s Kelvin competition.
Our best model, HighRes-Net trained jointly with ShiftNet-
Lanczos, scored consistently at the top of the public and
final leaderboard, see Table 1.
We compared HighRes-net to several other approaches:
ESA baseline — upsamples and averages the subset of low-
resolution views with the highest clearance in the set.
SRResNet — SISR approach by Ledig et al. (2017).
SRResNet+ShiftNet — trained jointly with ShiftNet.
SRResNet-6 + ShiftNet — at test time, it independently
upsamples 6 low-resolution views, then co-registers/aligns
them with ShiftNet, and averages them.
ACT (Advanced Concepts Team, Märtens et al., 2019) —
CNN with 5 channels for the 5 clearest low-resolution views.
DeepSUM (Molini et al., 2019) — like SRResNet-6+Shift-
Net, it upsamples independently a fixed number of low-
resolution views, co-registers and fuses them. Here, the
co-registration task is learned separately. One caveat with
upsampling-first approaches is that the memory cost and training cycle grow quadratically with the upscaling factor.
On the ESA dataset, this means that DeepSUM must train
with 3 × 3 times the volume of intermediate representations,
which can take from several days to a week to train on a
NVIDIA V100 GPU.
HighRes-net (trained jointly with ShiftNet, see sections 3
and 4) — in contrast to DeepSUM, our approach upsamples
after fusion, conserving memory, and takes up to 9 hours to
train on a NVIDIA V100 GPU.
HighRes-net+ — averages the outputs of two pre-trained HighRes-net models at test time, one with K bounded to 16 input views and the other to 32.

Table 1. Public & final leaderboard cPSNR scores in ESA's Kelvin competition. Lower is better.

METHOD                     PUBLIC   FINAL
SRRESNET                   1.0095   1.0084
ESA BASELINE               1.0000   1.0000
SRRESNET + SHIFTNET        1.0002   0.9995
ACT                        0.9874   0.9879
SRRESNET-6 + SHIFTNET      0.9808   0.9794
HIGHRES-NET (OURS)         0.9496   0.9488
HIGHRES-NET+ (OURS)        0.9474   0.9477
DEEPSUM                    0.9488   0.9474
5.3. ESA Kelvin leaderboard
The Kelvin competition used the corrected clear PSNR
(cPSNR) quality metric as the standardized measure of per-
formance. The cPSNR is a variant of the Peak Signal to
Noise Ratio (PSNR) used to compensate for pixel-shifts
and brightness bias. We refer the reader to (Märtens et al.,
2019) for the motivation and derivation of this quality met-
ric. The cPSNR metric is normalized by the score of the
ESA baseline algorithm so that a score smaller than 1 means
“better than the ESA baseline”. We also use it as our training
objective with sub-pixel registration (see also Figure 3(b)
on ShiftNet).
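For reference, the sketch below shows the bias-corrected, cloud-masked error at the core of the cPSNR, assuming images normalised to [0, 1] and a binary clear-pixel mask; the official metric of Märtens et al. (2019) additionally searches over small integer pixel shifts, which is omitted here.

import torch

def cpsnr(sr, hr, clear_mask):
    """Clear PSNR: MSE over clear pixels after removing the mean brightness bias (sketch only)."""
    clear = clear_mask.bool()
    bias = (hr[clear] - sr[clear]).mean()                      # brightness-bias correction
    cmse = ((hr[clear] - (sr[clear] + bias)) ** 2).mean()      # MSE restricted to clear pixels
    return -10.0 * torch.log10(cmse)

sr, hr = torch.rand(1, 1, 384, 384), torch.rand(1, 1, 384, 384)
mask = torch.ones_like(hr)                                     # all pixels clear in this toy example
print(cpsnr(sr, hr, mask))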
5.3.1. ABLATION STUDY
We ran an ablation study on the labeled data (1450 image
sets), split in 90% / 10% for training and testing. Our re-
sults suggest that more low-resolution views improve the
reconstruction, plateauing after 16 views, see Supplemen-
tary Material. Another finding is that registration matters
for MFSR, both in co-registering low-resolution views, and
the registered loss, see Supplementary Material. Finally,
selecting the k clearest views for fusion leads to overfitting.
A workaround is to randomly sample views with a bias for
clearance, see the Supplementary Material.
6. Discussion
On the importance of grounded detail Scientific and in-
vestigative applications warrant a firm grounding of any pre-
diction on real, not synthetic or hallucinated, imagery. The
PROBA-V satellite (Dierckx et al., 2014) was launched by
ESA to monitor Earth’s vegetation growth, water resources
and agriculture. As a form of data fusion and enrichment,
multi-frame super-resolution could enhance the vision of
such satellites for scientific and monitoring applications
(Carlson & Ripley, 1997; Pettorelli et al., 2005). More
broadly, satellite imagery can help NGOs and non-profits
monitor the environment and human rights (Cornebise et al.,
2018; Helber et al., 2018; Rudner et al., 2019; Rolnick et al.,
2019) at scale, from space, ultimately contributing to the UN
sustainable development goals. Low-resolution imagery is
cheap or sometimes free, and it is frequently updated. How-
ever, with the addition of fake or imaginary details, such
enhancement would be of little value as scientific, legal, or
forensic evidence.
6.1. Future work
Registration matters for the fusion and for the loss. The for-
mer is not explicit in our model, and its mechanism deserves
closer inspection. Also, learning to fuse selectively with attention would allow HighRes-net to reuse all useful parts of a corrupted image. It is hard to ensure the authenticity of
detail. It will be important to quantify the epistemic uncer-
tainty of super-resolution for real world applications. In the
same vein, a meaningful super-resolution metric depends
on the downstream prediction task. More generally, good
similarity metrics remain an open question for many com-
puter vision tasks (Bruna et al., 2015; Johnson et al., 2016;
Isola et al., 2017; Ledig et al., 2017).
6.2. Conclusion
We presented HighRes-net – the first deep learning approach
to MFSR that learns the typical sub-tasks of MFSR in an
end-to-end fashion: (i) co-registration, (ii) fusion, (iii) up-
sampling, and (iv) registration-at-the-loss.
It recursively fuses a variable number of low-resolution
views by learning a global fusion operator. The fusion
also aligns all low-resolution views with an implicit co-
registration mechanism through the reference channel. We
also introduced ShiftNet-Lanczos, a network that learns to
register and align the super-resolved output of HighRes-net
with a high-resolution ground-truth.
Registration is vital, to align many low-resolution inputs
(co-registration) and to compute similarity metrics between
shifted signals. Our experiments suggest that an end-to-
end cooperative setting (HighRes-net + ShiftNet-Lanczos)
improves training and test performance. By design, our approach is fast to train and test, with a low memory footprint, since the bulk of the compute (co-registration + fusion) is done at the low-resolution image height and width.
There is an abundance of low-resolution yet high-revisit, low-cost satellite imagery, but it often lacks the detailed information of expensive high-resolution imagery. We believe MFSR can unlock its potential for NGOs and non-profits that contribute to the UN Sustainable Development Goals.
HighRes-net: Supplementary Material
1. Experimental details
We trained our models on low-resolution patches of size
64 × 64. HighRes-net’s architecture is described in Ta-
ble 2. We denote by Conv2d(in, out, k, s, p)
a conv2D layer with in and out input/output channels,
kernels of size k × k, stride s and padding p. We used
the ADAM optimizer (Kingma & Ba, 2014) with default
hyperparameters and trained our models on batches of size
32, for 400 epochs, using 90% of the data for training and
10% for validation. Our learning rate is initialized to 0.0007,
decayed by a factor of 0.97 if the validation loss plateaus
for more than 2 epochs. To regularize ShiftNet, we set
λ = 10⁻⁶.
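The training configuration above translates to roughly the following PyTorch setup; the two nn modules are placeholders for HighRes-net and ShiftNet, and only the optimizer, learning-rate schedule and epoch count are taken from the text.

import torch
import torch.nn as nn

highresnet = nn.Conv2d(2, 64, 3, padding=1)      # placeholder for the real HighRes-net
shiftnet = nn.Linear(10, 2)                      # placeholder for the real ShiftNet
params = list(highresnet.parameters()) + list(shiftnet.parameters())

optimizer = torch.optim.Adam(params, lr=7e-4)    # Adam with default betas, lr = 0.0007
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.97, patience=2)   # decay by 0.97 on a > 2-epoch plateau

for epoch in range(400):                         # 400 epochs, batches of size 32
    val_loss = 1.0                               # training and validation steps omitted
    scheduler.step(val_loss)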
Table 1. ResidualBlock(h) architecture
LAYER0 Conv2d(in=h, out=h, k3, s1, p1)
LAYER1 PReLU
LAYER2 Conv2d(in=h, out=h, k3, s1, p1)
LAYER3 PReLU
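A possible PyTorch realisation of this block is sketched below; the skip connection is assumed, but the layer sizes reproduce the 73,858 parameters per ResidualBlock(64) reported in Table 2.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two (Conv2d + PReLU) layers with a skip connection, as in Table 1 (sketch)."""
    def __init__(self, h=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(h, h, kernel_size=3, stride=1, padding=1), nn.PReLU(),
            nn.Conv2d(h, h, kernel_size=3, stride=1, padding=1), nn.PReLU(),
        )
    def forward(self, x):
        return x + self.block(x)

print(sum(p.numel() for p in ResidualBlock(64).parameters()))   # 73858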
Thanks to weight sharing, HighRes-net super-resolves
scenes with 32 views in 5 recursive steps, while requir-
ing less than 600K parameters. ShiftNet has more than
34M parameters (34,187,648) but is dropped during test
time. We report GPU memory requirements in table 3 for
reproducibility purposes.
1.1. How many frames do you need?
We trained and tested HighRes-net with ShiftNet using 1
to 32 frames. With a single image, our approach performs
worse than the ESA baseline. Doubling the number of
frames significantly improves both our training and valida-
tion scores. After 16 frames, our model’s performance stops
increasing, as shown in Figure 1.
Figure 1. Public leaderboard scores (train and test) vs. number of views for HighRes-net + ShiftNet. Lower is better.
1.2. Registration matters
1.2.1. REGISTERED LOSS
The only explicit registration that we perform is at the loss stage, to give the model partial credit for a solution that is well enhanced but mis-registered with respect to the ground truth. We trained our base model
HighRes-net without ShiftNet-Lanczos and observed a drop
in performance as shown in Table 4. Registration matters
and aligning outputs with targets helps HighRes-net gener-
ate sharper outputs and achieve competitive results.
1.2.2. IMPLICIT CO-REGISTRATION
The traditional practice in MFSR is to explicitly co-register
the LR views prior to super-resolution (Tsai, 1984; Molini
et al., 2019). The knowledge of sub-pixel misalignments
tells an algorithm what pieces of information to fuse from
each LR image for any pixel in the SR output. Contrary
to the conventional practice in MFSR, we propose implicit
co-registration by pairing LR views with a reference frame,
also known as an anchor. In this sense, we never explicitly
compute the relative shifts between any LR pair. Instead, we
simply stack each view with a chosen reference frame as an
additional channel to the input. We call this strategy implicit
co-registration. We found this strategy to be effective in the
following ablation study which addresses the impact of the
choice of a reference frame aka anchor.
We observe the median reference is the most effective in
terms of train and test score. We suspect the median per-
forms better than the mean because the median is more
robust to outliers and can help denoise the LR views. In-
terestingly, training and testing without a shared reference
performed worse than the ESA baseline. This shows that
co-registration (implicit or explicit) matters. This can be
due to the fact that the model lacks information to align and
fuse the multiple views.
Table 5. Scores for HighRes-net + ShiftNet-Lanczos trained and tested with different references as input. Lower is better.

                           SCORE
REFERENCE                  TRAIN    TEST
NO CO-REGISTRATION         1.0131   1.0088
MEAN OF 9 LRS              0.9636   0.9690
MEDIAN OF 9 LRS (BASE)     0.9501   0.9532

1.2.3. SHIFTNET ARCHITECTURE
ShiftNet has 8 layers of (Conv2d + BatchNorm2d + ReLU) modules. Layers 2, 4 and 6 are followed by MaxPool2d. The final output is flattened to a vector x of size 32,768. Then we compute a vector of size 1,024 as x = ReLU(fc1(dropout(x))). The final shift prediction is fc2(x), of size 2. The bulk of the parameters come from fc1, with 32,768 × 1,024 weights; these alone account for 99% of ShiftNet's parameters. Adding a MaxPool2d on top of layer 3, 5, 7 or 8 would halve the parameters of ShiftNet.
Table 2. HRNet architecture
STEP LAYERS NUMBER OF PARAMETERS
ENCODE Conv2d(in=2, out=64, k3, s1, p1) 1216
PReLU 1
ResidualBlock(64) 73,858
ResidualBlock(64) 73,858
Conv2d(in=64, out=64, k3, s1, p1) 36,928
FUSE ResidualBlock(128) 295,170
Conv2d(in=128, out=64, k3, s1, p1) 73,792
PReLU 1
DECODE ConvTranspose2d(in=64, out=64, k3, s1) 36,928
PReLU 1
Conv2d(in=64, out=1, k1, s1) 65
RESIDUAL
(OPTIONAL) Upsample(scale_factor=3.0, mode='bicubic') 0
591,818 (TOTAL)
Table 3. GPU memory requirements to train HighRes-net + Shift-
Net on patches of size 64 × 64 with batches of size 32, and a
variable number of low-resolution frames.
# VIEWS 32 16 4
GPU MEMORY (GB) 27 15 6
Table 4. Registration matters: Train and test scores for HighRes-
net trained with and without ShiftNet-Lanczos. Lower is better.
SCORE
HIGHRES-NET TRAIN TEST
UNREGISTERED LOSS 0.9616 0.9671
REGISTERED LOSS 0.9501 0.9532
1.3. Towards permutation invariance
A desirable property of a fusion model acting on an un-
ordered set of images, is permutation-invariance: the output
of the model should be invariant to the order in which the LR
views are fused. An easy approach to encourage permutation
invariant neural networks is to randomly shuffle the inputs
at training time before feeding them to a model (Vinyals
et al., 2015).
In addition to randomization, we still want to give more
importance to clear LR views (with high clearance score),
which can be done by sorting them by clearance. A good
trade-off between uniform sampling and deterministic sort-
ing by clearance, is to sample k LR views without replacement and with a bias towards higher clearance:

p(i | C_1, . . . , C_k) = exp(βC_i) / Σ_{j=1..k} exp(βC_j),   (1)

where k is the total number of LR views, C_i is the clearance score of LR_i, and β regulates the bias towards higher clearance scores.
When β = 0, this sampling strategy corresponds to uniform
sampling, and when β = +∞, this corresponds to picking
the k-clearest views in a deterministic way. Our default
model was trained with β = 50 and our experiments are
reported in Table 6.
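A minimal sketch of this sampling scheme is given below; torch.multinomial draws without replacement from the softmax weights of Eq. (1), and the clearance scores and k in the example are made up.

import torch

def sample_views(clearances, k, beta=50.0):
    """Sample k view indices without replacement, with probability ∝ exp(beta * C_i), Eq. (1)."""
    weights = torch.softmax(beta * torch.as_tensor(clearances, dtype=torch.float32), dim=0)
    return torch.multinomial(weights, k, replacement=False)

# e.g. 9 views with clearance scores in [0, 1]; select 4, favouring the clearest
print(sample_views([0.9, 0.2, 0.7, 0.95, 0.5, 0.8, 0.3, 0.6, 0.99], k=4))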
From Table 6, β = ∞ reaches the best training score and the worst testing score. For β = 50 and β = 0, the train/test gap is much smaller. This suggests that the deterministic strategy overfits and that randomness prevents overfitting (diversity matters). On the other hand, β = 50 performs significantly better than β = 0, suggesting that biasing a model towards higher clearances is beneficial, i.e., clouds matter too.
Table 6. Scores per sampling strategy for HighRes-net + ShiftNet.
Lower is better.
SCORE
SAMPLING STRATEGY TRAIN TEST
β = ∞ (K-CLEAREST) 0.9386 0.9687
β = 0 (UNIFORM-K) 0.9638 0.9675
β = 50 (BASE) 0.9501 0.9532
2. On the parallax effect
The parallax p is a measure of apparent displacement, inversely pro-
portional to the distance d from the object, see e.g. (Zeilik
& Gregory, 1998):
p ∝ 1/d
If 32 low-resolution satellite views are acquired during a
single fly-over, the successive geolocations would indeed
be significantly different and the parallax effect would be
magnified. This is indeed the case in aerial photography, for
instance, because the imagery is captured from much closer
to the ground.
With satellite imagery, one might be interested in detect-
ing vegetation growth (PROBA-V), road networks (infras-
tructure), farms / ranches (agriculture), deforestation in the
Amazon, or human presence and buildings. In all these
monitoring applications, the objects of interests are no more
than 50m tall, e.g. trees.
The parallax effect between low-res images does not inhibit
the super-resolution of such objects. The lowest of
Low Earth Orbit (LEO) altitudes for a satellite is 300 km
(PROBA-V is about 800km), so the relative depth variation
is at most 50m / 300,000m ≈ 0.017%. In other words, the
parallax effect is imperceptible for 50m tall objects.
Here is a calculation to support this argument: Given a point A at height 50m (distance d_A = 300,000m − 50m) and a point B at height 0 (distance d_B = 300,000m), their relative change in apparent motion is p_A/p_B = d_B/d_A = 30/29.995. This means that if point A appears to move 30m, then point B appears to move 5 millimeters less than 30m due to parallax. In the case of a
5 millimeters less than 30m due to parallax. In the case of a
fast LEO satellite like PROBA-V, its geolocation is accurate
enough such that the translational shifts are mostly within a
sub-pixel accuracy, and they almost never exceed 2 pixels.
On the ground, 2 pixels amount to a baseline length of (2 px) × (300 m/px) = 600m. So between two images where point A (at 50m altitude) appears to move 600m, point B (at 0m altitude) appears to move 20 × 0.005 = 0.1 m less. Hence, the parallax effect
is imperceptible for 50m tall objects. Even less so for the
objects that we have underlined above, and our experimental
state-of-the-art results support this claim.
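The arithmetic above can be checked in a few lines; the numbers are the ones quoted in this section.

# back-of-the-envelope parallax check
sat_altitude = 300_000.0          # m, lowest LEO altitude considered
obj_height = 50.0                 # m, tallest object of interest
baseline = 600.0                  # m, 2 px * 300 m/px of geolocation jitter on the ground

relative_depth = obj_height / sat_altitude          # ~1.7e-4 (about 0.017%)
parallax_offset = baseline * relative_depth         # ~0.1 m
print(f"{relative_depth:.4%} relative depth, {parallax_offset:.2f} m apparent offset")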