Why use centroids for training and generation #15

Open
ChaofanTao opened this issue Jan 22, 2022 · 2 comments
Comments

@ChaofanTao

Hi,

Thanks for your implementation of image-gpt.

I wonder whether quantizing the input to centroids is optional for both training and generation, and what the advantages of using centroids are. Thanks again.

@teddykoker
Owner

Hi! See this passage from the original paper:

An IR of 32^2 × 3 is still quite computationally intensive.
While working at even lower resolutions is tempting, prior
work has demonstrated human performance on image classification
begins to drop rapidly below this size (Torralba et al.,
2008). Instead, motivated by early color display palettes,
we create our own 9-bit color palette by clustering (R, G,
B) pixel values using k-means with k = 512. Using this
palette yields an input sequence length 3 times shorter than
the standard (R, G, B) palette, while still encoding color
faithfully. A similar approach was applied to spatial patches
by Ranzato et al. (2014). We call the resulting context length
(32^2 or 48^2 or 64^2) the model resolution (MR). Note that
this reduction breaks permutation invariance of the color
channels, but keeps the model spatially invariant.
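
For illustration, the palette step in the passage can be sketched as a nearest-centroid lookup. This is a minimal sketch under stated assumptions, not code from this repository or the paper: the names `quantize` and `unquantize` are hypothetical, and the palette here is random stand-in data, whereas a real palette would come from running k-means with k = 512 on training pixels.

```python
import numpy as np


def quantize(pixels, centroids):
    """Map each RGB pixel to the index of its nearest centroid.

    pixels:    (n, 3) float array of RGB values
    centroids: (k, 3) float array, the color palette
    Returns an (n,) integer array of palette indices.
    """
    # Squared Euclidean distance from every pixel to every centroid: shape (n, k).
    dists = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def unquantize(tokens, centroids):
    """Map palette indices back to (approximate) RGB values."""
    return centroids[tokens]


rng = np.random.default_rng(0)
# Assumption: random stand-in palette; a real one would be fit with k-means.
centroids = rng.uniform(0.0, 1.0, size=(512, 3))
image = rng.uniform(0.0, 1.0, size=(32, 32, 3))

tokens = quantize(image.reshape(-1, 3), centroids)      # one token per pixel
recon = unquantize(tokens, centroids).reshape(32, 32, 3)  # lossy reconstruction
```

Each pixel becomes one integer token instead of three channel values, and `unquantize` turns generated tokens back into colors at sampling time.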

The idea is that by mapping each pixel's three RGB values to a single palette index (its nearest centroid), you reduce the sequence length by a factor of 3. This greatly reduces compute, since the attention mechanism has quadratic complexity with respect to sequence length.
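
As a rough worked example of that saving, assuming a 32 × 32 input and counting only the quadratic attention term (the helper name `seq_len` is hypothetical):

```python
def seq_len(resolution, tokens_per_pixel):
    """Number of tokens for a square image of the given side length."""
    return resolution * resolution * tokens_per_pixel


rgb_len = seq_len(32, 3)      # one token per channel value: 3072
palette_len = seq_len(32, 1)  # one token per pixel: 1024

# Attention cost grows with the square of the sequence length,
# so a 3x shorter sequence cuts that term by roughly 9x.
attention_ratio = (rgb_len / palette_len) ** 2  # 9.0
```

This ignores the parts of the model whose cost is linear in sequence length, so the overall speedup would be somewhere between 3x and 9x.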

@ChaofanTao
Author

Thanks, I get it! Do you have any experiments on the effect of the value of 'k' in k-means, especially for a large dataset such as ImageNet?
