@dan and I had some confusion (#22) about what exactly this embedding learns and about the effect of averaging all 64 embeddings into one. Here I'm trying to reason it out. Please check me, @weiji14 and @srmsoumya.
TL;DR: it is most probably safe to use the average.
My understanding is that:
The 256x256 image "chip" is split into an 8 x 8 grid of "windows".
During "vanilla MAE" training, up to 75% of the windows are randomly blacked out, and the loss function aims to recover their content (see the sketch after this list).
This means that the embeddings of the available windows will try to encode the semantics of the blacked-out ones, so the decoder can recreate their content.
When we tweak the vanilla MAE into Clay v0, we add the absolute location and time, which also gives the windows the capacity to learn about the missing content based not only on the other local windows, but also on the actual absolute location.
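To make the masking step concrete, here is a minimal sketch (not the actual Clay training code) of how a 256x256 chip could be split into an 8x8 grid of 32x32-pixel windows and how ~75% of them could be randomly selected for masking. The band count and function names are illustrative assumptions.

```python
import numpy as np

def split_into_windows(chip: np.ndarray, grid: int = 8) -> np.ndarray:
    """Split a (bands, 256, 256) chip into grid*grid windows."""
    bands, height, width = chip.shape
    win_h, win_w = height // grid, width // grid  # 32 x 32 pixels per window
    windows = (
        chip.reshape(bands, grid, win_h, grid, win_w)
        .transpose(1, 3, 0, 2, 4)  # -> (grid, grid, bands, win_h, win_w)
        .reshape(grid * grid, bands, win_h, win_w)
    )
    return windows  # shape: (64, bands, 32, 32)

def random_mask(num_windows: int = 64, mask_ratio: float = 0.75, seed: int = 0):
    """Pick which window indices stay visible and which get blacked out."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_windows)
    num_masked = int(num_windows * mask_ratio)
    return order[num_masked:], order[:num_masked]  # visible, masked

chip = np.zeros((13, 256, 256), dtype=np.float32)  # e.g. 13 Sentinel-2 bands
windows = split_into_windows(chip)
visible_idx, masked_idx = random_mask()
print(windows.shape, len(visible_idx), len(masked_idx))  # (64, 13, 32, 32) 16 48
```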
If the above is correct, it means that the embedding of each window has learned to predict the content of the surrounding windows, which makes most window embeddings similar. This limits how much information we lose when we average all 64 window embeddings into a single chip embedding.
The similarity can be checked in the embeddings example @weiji14 shared: the minimum cosine similarity is 0.999.
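As a sanity check, something like the snippet below can reproduce that number from the saved embeddings. Here `window_embeddings` is a hypothetical (64, dim) array of one chip's window embeddings, filled with random data only to keep the snippet runnable.

```python
import numpy as np

def min_pairwise_cosine(window_embeddings: np.ndarray) -> float:
    """Minimum cosine similarity between any pair of window embeddings."""
    unit = window_embeddings / np.linalg.norm(window_embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T  # (64, 64) cosine similarity matrix
    return float(similarity.min())

window_embeddings = np.random.default_rng(42).normal(size=(64, 768))  # placeholder data
print(min_pairwise_cosine(window_embeddings))  # on the real embeddings this is ~0.999
```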
For example:
A chip with a small red house in the middle, a road leading to it, and the rest grass.
When the MAE masks out the red house, the windows containing the road (vanilla encoder), and indeed all available windows (using the absolute location from the Clay v0 modification), will converge to learn that there is a house in the blanked-out part.
When we average all 64 window embeddings, the semantics of the house are retained, since they are present in several if not all window embeddings.
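A small numerical sketch of that intuition, using synthetic embeddings for illustration only: if the 64 window embeddings are already very similar, removing any single window barely moves the mean, so the averaged chip embedding keeps the shared semantics.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=768)
# 64 highly similar window embeddings (small per-window perturbations).
window_embeddings = base + 0.01 * rng.normal(size=(64, 768))

chip_embedding = window_embeddings.mean(axis=0)                       # (768,)
without_one = np.delete(window_embeddings, 27, axis=0).mean(axis=0)   # drop one window

cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine(chip_embedding, without_one))  # ~1.0: one window barely shifts the mean
```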
@brunosan your understanding of how the encoder side of MAE works is spot on.
In our modified architecture, we are making two changes for now:
We are splitting the image spatially and channel-wise.
We are adding time and lat/lon information as learnable embeddings. These embeddings provide additional information to the model about when and where a certain feature appears.
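For illustration only, here is one plausible way such metadata embeddings could be added to the window tokens in PyTorch. This is a sketch of the idea, not the actual Clay v0 code; the module, the linear projections, and all variable names are assumptions.

```python
import torch
import torch.nn as nn

class MetadataEmbedding(nn.Module):
    """Add learned lat/lon and time embeddings to every window token."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.latlon_embed = nn.Linear(2, dim)  # (lat, lon) -> token dimension
        self.time_embed = nn.Linear(2, dim)    # e.g. (week, hour) -> token dimension

    def forward(self, tokens, latlon, time):
        # tokens: (batch, 64, dim); latlon, time: (batch, 2)
        meta = self.latlon_embed(latlon) + self.time_embed(time)  # (batch, dim)
        return tokens + meta.unsqueeze(1)  # broadcast over the 64 window tokens

tokens = torch.randn(4, 64, 768)
latlon, time = torch.randn(4, 2), torch.randn(4, 2)
print(MetadataEmbedding()(tokens, latlon, time).shape)  # torch.Size([4, 64, 768])
```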
Transformers have a concept called the cls token, which is used to capture a generic vector representation of the input space (EO imagery in our case). This idea is borrowed from the BERT paper and is commonly used in Vision Transformers. We can choose to use the embeddings from the cls token, which represents what the image as a whole looks like. Alternatively, we can take the mean of the remaining vectors, which should capture the same information but be more robust to certain outliers.
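In code, the two options boil down to something like the snippet below. Shapes and names are illustrative; `encoder_output` is assumed to be the encoder's token sequence with the cls token at index 0.

```python
import torch

encoder_output = torch.randn(4, 1 + 64, 768)  # (batch, cls + 64 windows, dim)

cls_embedding = encoder_output[:, 0, :]                 # (batch, dim) cls-token summary
mean_embedding = encoder_output[:, 1:, :].mean(dim=1)   # (batch, dim) mean of window tokens

print(cls_embedding.shape, mean_embedding.shape)  # both torch.Size([4, 768])
```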
For the example you shared, let's say we completely remove the small red building from the image and ask the MAE to recreate the original image. In this case, there is less chance of the MAE adding a building back to the image. Although we have absolute lat/lon and time information available as embeddings, we can expect the network to learn a general understanding of the information rather than very specific features. However, I might be totally wrong here. Training a large model might actually encode such granular features and be able to recreate them. We will have to test it out and see.