ViLT (Vision-and-Language Transformer) is a minimal vision-and-language model that processes raw image patches directly, without a separate visual encoder such as a CNN backbone or a region-feature extractor. This makes it significantly faster than region-based pipelines while maintaining competitive performance.
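To make the "raw image patches, no CNN" claim concrete, the sketch below shows ViT-style patch embedding of the kind ViLT uses: a single strided convolution that slices the image into fixed-size patches and linearly projects each one into a token. Dimensions are illustrative (ViLT-B/32 uses 32x32 patches with a 768-dimensional hidden size), and this is a simplified sketch, not ViLT's actual code.

```python
# Simplified sketch of ViT-style patch embedding (illustrative dimensions).
# A kernel and stride equal to the patch size cuts the image into
# non-overlapping patches and linearly projects each patch to a token.
import torch
import torch.nn as nn

patch_size = 32   # ViLT-B/32 uses 32x32 patches
hidden_dim = 768  # ViT-Base transformer hidden size

patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 384, 384)          # dummy RGB image
patches = patch_embed(image)                 # (1, 768, 12, 12)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 144, 768) patch tokens
print(tokens.shape)
```

These patch tokens are concatenated with text tokens and fed to a single transformer, which is why no separate visual encoder is needed.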

Specifications

Context Window: 512 tokens
Released: February 2021

Capabilities

Visual Question Answering
Image-Text Matching
Visual Reasoning
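As a usage sketch for the visual question answering capability, the snippet below runs inference with the Hugging Face `transformers` library. The checkpoint name (`dandelin/vilt-b32-finetuned-vqa`) and the example image URL are assumptions not taken from this card.

```python
# Minimal VQA sketch with Hugging Face transformers; checkpoint and
# image URL are assumptions, not part of this model card.
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Encode image and question jointly, then pick the top-scoring answer class.
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```

Because the image patches and the question are processed by one shared transformer, a single forward pass produces the answer logits.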

Best For

Fast vision-and-language inference where a heavyweight visual backbone is impractical, such as visual question answering and image-text matching under latency constraints.