Source Paper: https://arxiv.org/pdf/2010.11929v2
You might have already heard of the Transformer architecture, which underlies the popular GPT models that seem to be all the craze these days. Some of these models can understand images as well. This is made possible with the help of Vision Transformers. A popular example would be OpenAI’s CLIP model.