Model Architectures

CNN versus Visual Transformers

CNNs have an inductive spatial bias baked into them with convolutional kernels whereas vision transformers are based on a much more general architecture. […] “If a CNN’s approach is like starting at a single pixel and zooming out, a transformer slowly brings the whole fuzzy image into focus.”

While CNNs have a proven track record in various computer vision tasks and handle large-scale datasets efficiently, Vision Transformers offer advantages in scenarios where global dependencies and contextual understanding are crucial.

Image Embeddings

An image embedding is a numeric representation of an image that encodes the semantics of contents in the image. Embeddings are calculated by computer vision models which are usually trained with large datasets of pairs of text and image. The goal of such a model is to build an “understanding” of the relationship between images and text.

Sources