ViLT (Vision-and-Language Transformer) is a minimal vision-and-language model that processes raw image patches directly, without a separate visual encoder such as a CNN backbone or a region-feature extractor. This makes it significantly faster than region-based pipelines while maintaining competitive performance.
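To make the "raw image patches, no CNN" claim concrete, the sketch below shows ViT-style patch embedding of the kind ViLT uses: a single strided convolution that slices the image into fixed-size patches and linearly projects each one into a token. Dimensions are illustrative (ViLT-B/32 uses 32x32 patches with a 768-dimensional hidden size), and this is a simplified sketch, not ViLT's actual code.

```python
# Simplified sketch of ViT-style patch embedding (illustrative dimensions).
# A kernel and stride equal to the patch size cuts the image into
# non-overlapping patches and linearly projects each patch to a token.
import torch
import torch.nn as nn

patch_size = 32   # ViLT-B/32 uses 32x32 patches
hidden_dim = 768  # ViT-Base transformer hidden size

patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 384, 384)          # dummy RGB image
patches = patch_embed(image)                 # (1, 768, 12, 12)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 144, 768) patch tokens
print(tokens.shape)
```

These patch tokens are concatenated with text tokens and fed to a single transformer, which is why no separate visual encoder is needed.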

Specifications

Context Window: 512 tokens
Released: February 2021

Capabilities

Visual Question Answering
Image-Text Matching
Visual Reasoning
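As a usage sketch for the visual question answering capability, the snippet below runs inference with the Hugging Face `transformers` library. The checkpoint name (`dandelin/vilt-b32-finetuned-vqa`) and the example image URL are assumptions not taken from this card.

```python
# Minimal VQA sketch with Hugging Face transformers; checkpoint and
# image URL are assumptions, not part of this model card.
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Encode image and question jointly, then pick the top-scoring answer class.
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```

Because the image patches and the question are processed by one shared transformer, a single forward pass produces the answer logits.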

Best For

Fast vision-and-language inference where a heavyweight visual backbone is impractical, such as visual question answering and image-text matching under latency constraints.