If you are familiar with the paper "Attention is All You Need", you will already know the Transformer architecture. Its attention mechanism lets a model weigh the context and relative importance of words in a sentence. You can read the paper here: Attention is All You Need
The key difference between Transformers and Vision Transformers (ViT) is the input: instead of consuming word tokens, ViT takes in patch embeddings of images. Everything else works the same way. You can find the paper in the README file at the root of this repository. Refer to this link for more clarification: Vision Transformers (ViT) Explained
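To make the analogy concrete, here is a minimal sketch of how an image is turned into the patch embeddings that stand in for word tokens. It is written in PyTorch, and the sizes (224x224 images, 16x16 patches, 768-dimensional embeddings) are assumed defaults from the ViT paper, not necessarily what this repository uses:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding vector."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel_size == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                  # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x

# Example: a batch of two 224x224 RGB images becomes 196 patch embeddings each.
patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
```

From here on, the sequence of patch embeddings is fed to the Transformer encoder exactly as a sequence of word embeddings would be.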