Embedding receptive field #62

Closed
brunosan opened this issue Dec 1, 2023 · 1 comment

@brunosan
Member

brunosan commented Dec 1, 2023

@dan and I had some confusion (#22) about what exactly this embedding learns and about the effect of averaging all 64 embeddings into one. Here I'm trying to reason through it. Please @weiji14 and @srmsoumya, check me.

TL;DR: It's most probably safe to use the average.


My understanding is that:

  1. The 256×256 image "chip" is split into an 8 × 8 grid of "windows".
  2. During "vanilla MAE" training, up to 75% of the windows are randomly blacked out, and the loss function aims to recover their content (see the sketch after this list).
  3. This means that the embeddings of the visible windows will try to encode the semantics of the blacked-out ones, so that the decoder can recreate their content.
  4. When we tweak the vanilla MAE into Clay v0, we add absolute location and time, which also gives the windows the capacity to learn about the masked content based not only on the other local windows, but also on the actual absolute location.
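A minimal sketch of steps 1–2 (assuming PyTorch; the 32-pixel window size, the 13 bands, and the exact 75% mask ratio are illustrative numbers taken from the description above, not the actual Clay code):

```python
import torch

def split_and_mask(chip, window=32, mask_ratio=0.75):
    """Split a (C, 256, 256) chip into an 8x8 grid of windows and randomly
    hide 75% of them, as in vanilla MAE pre-training."""
    C, H, W = chip.shape
    n = H // window  # 8 windows per side -> 64 windows
    # (C, 8, 32, 8, 32) -> (64, C*32*32): one flattened vector per window
    windows = (
        chip.reshape(C, n, window, n, window)
        .permute(1, 3, 0, 2, 4)
        .reshape(n * n, -1)
    )
    keep = int(n * n * (1 - mask_ratio))  # 16 visible windows
    perm = torch.randperm(n * n)
    visible_idx, masked_idx = perm[:keep], perm[keep:]
    # The encoder only sees the visible windows; the decoder must
    # reconstruct the masked ones from their embeddings.
    return windows[visible_idx], visible_idx, masked_idx

chip = torch.randn(13, 256, 256)  # e.g. a multi-band chip
visible, vis_idx, mask_idx = split_and_mask(chip)
print(visible.shape)  # torch.Size([16, 13312])
```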

If the above is correct, it means that the embedding of each window has learned to predict the content of the surrounding windows, which makes most window embeddings similar. Averaging all 64 window embeddings into a single chip embedding therefore changes little and loses little information.

The similarity can be checked on the embeddings example @weiji14 shared: the minimum cosine similarity is 0.999.
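A quick way to reproduce that check (a rough sketch; `embeddings` stands for the 64 per-window vectors of one chip, loaded however they were exported, and the dimension 768 is a placeholder):

```python
import torch
import torch.nn.functional as F

# embeddings: (64, D) tensor of per-window embeddings for one chip
embeddings = torch.randn(64, 768)  # placeholder; load the real vectors here

normed = F.normalize(embeddings, dim=-1)
cosine = normed @ normed.T  # (64, 64) pairwise cosine similarities
print("min cosine similarity:", cosine.min().item())

chip_embedding = embeddings.mean(dim=0)  # the averaged single chip embedding
```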


For example:

  • A chip with a small red house in the middle, a road leading to it, and the rest grass.
  • When MAE masks out the red house, the windows with the road (vanilla encoder), and indeed all visible windows (using the absolute location from the Clay v0 modification), will converge to learn that there is a house in the blanked-out part.
  • When we average all 64 window embeddings, the semantics of the house will be retained, since they are present in several if not all of the window embeddings.

@srmsoumya
Collaborator

@brunosan your understanding of how the encoder side of MAE works is spot on.

In our modified architecture, we are making two changes for now:

  1. We are splitting the image spatially and channel-wise.
  2. We are adding time and lat/lon information as learnable embeddings. These embeddings provide additional information to the model about when and where a certain feature appears (a rough sketch follows this list).
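A rough sketch of what change 2 could look like (assuming PyTorch; the linear projections, the sin/cos encoding of lat/lon and week-of-year, and the embedding dimension of 768 are illustrative assumptions, not the actual Clay v0 code):

```python
import torch
import torch.nn as nn

class MetadataEmbedding(nn.Module):
    """Project (lat, lon) and acquisition time into the token dimension so
    they can be added to (or concatenated with) the window tokens."""
    def __init__(self, dim=768):
        super().__init__()
        self.latlon_proj = nn.Linear(4, dim)  # sin/cos of lat and lon
        self.time_proj = nn.Linear(2, dim)    # sin/cos of week-of-year

    def forward(self, latlon, week):
        lat, lon = torch.deg2rad(latlon[:, 0]), torch.deg2rad(latlon[:, 1])
        latlon_feat = torch.stack([lat.sin(), lat.cos(), lon.sin(), lon.cos()], dim=-1)
        angle = 2 * torch.pi * week / 52.0
        time_feat = torch.stack([angle.sin(), angle.cos()], dim=-1)
        return self.latlon_proj(latlon_feat), self.time_proj(time_feat)

meta = MetadataEmbedding()
pos_tok, time_tok = meta(torch.tensor([[41.9, 12.5]]), torch.tensor([23.0]))
print(pos_tok.shape, time_tok.shape)  # torch.Size([1, 768]) torch.Size([1, 768])
```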

Transformers have a concept called the cls token, which is used to capture a generic vector representation of the input space (EO imagery in our case). This idea is borrowed from the BERT paper and is commonly used in Vision Transformers. We can choose to use the embeddings from the cls token, which represents what the image as a whole looks like. Alternatively, we can take the mean of the remaining vectors, which should capture the same information but be more robust to certain outliers.
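In code, the two pooling choices differ in a single line (a sketch; `tokens` here stands for the encoder output with the cls token at index 0, which is the usual ViT convention, not necessarily the exact Clay layout):

```python
import torch

# tokens: (batch, 1 + num_windows, dim) encoder output, cls token at index 0
tokens = torch.randn(4, 65, 768)  # placeholder values

cls_embedding = tokens[:, 0]                # option 1: use the cls token
mean_embedding = tokens[:, 1:].mean(dim=1)  # option 2: average the window tokens
```

Either vector can then serve as the single chip embedding; the mean-pooled one is the "average all 64" option discussed above.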

For the example you shared, let's say we completely remove the small red building from the image and ask the MAE to recreate the original image. In this case, there is less chance of the MAE adding a building back to the image. Although we have absolute lat/lon and time information available as embeddings, we can expect the network to learn a general understanding of the information rather than very specific features. However, I might be totally wrong here. Training a large model might actually encode such granular features and be able to recreate them. We will have to test that out and see.

@Clay-foundation Clay-foundation locked and limited conversation to collaborators Dec 4, 2023
@brunosan brunosan converted this issue into discussion #67 Dec 4, 2023
