Computer Vision

Aug 23, 2024

Model Architectures

CNNs versus Vision Transformers

CNNs have a spatial inductive bias built in: convolutional kernels process small local neighborhoods and share the same weights across positions. Vision Transformers are based on a much more general architecture, in which self-attention lets every image patch interact with every other patch. […] “If a CNN’s approach is like starting at a single pixel and zooming out, a transformer slowly brings the whole fuzzy image into focus.”
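The difference between a local kernel and a global view can be made concrete with a toy sketch. The code below is only an illustration under simplifying assumptions: the "image" is a 1D list of scalars, the attention function uses identity projections instead of learned query/key/value weights, and the helper names (`conv1d`, `self_attention`, `changed_positions`) are hypothetical. It perturbs one input value and reports which outputs change.

```python
import math

def conv1d(signal, kernel):
    """Zero-padded 1D convolution: each output sees only a local window."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]

def self_attention(signal):
    """Toy single-head self-attention over scalars.

    Assumption: queries, keys and values are the raw inputs
    (identity projections), which real transformers do not use.
    """
    out = []
    for q in signal:
        scores = [q * k for k in signal]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, signal)) / total)
    return out

def changed_positions(fn, signal, index, delta=1.0):
    """Indices of outputs that change when signal[index] is perturbed."""
    base = fn(signal)
    bumped = list(signal)
    bumped[index] += delta
    new = fn(bumped)
    return [i for i, (a, b) in enumerate(zip(base, new)) if abs(a - b) > 1e-12]

signal = [0.1, 0.4, 0.2, 0.9, 0.3, 0.7, 0.5, 0.6]
kernel = [0.25, 0.5, 0.25]

conv_changed = changed_positions(lambda s: conv1d(s, kernel), signal, index=0)
attn_changed = changed_positions(self_attention, signal, index=0)

print("conv outputs affected:     ", conv_changed)
print("attention outputs affected:", attn_changed)
```

With a kernel of width 3, changing the first value only affects the outputs next to it, while in the attention sketch every output position depends on it. A real CNN widens its receptive field by stacking layers; a transformer has the global view in every layer.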

While CNNs have a proven track record in various computer vision tasks and handle large-scale datasets efficiently, Vision Transformers offer advantages in scenarios where global dependencies and contextual understanding are crucial.

Image Embeddings

An image embedding is a numeric representation of an image, typically a fixed-length vector, that encodes the semantics of the image's contents. Embeddings are computed by computer vision models, which are often trained on large datasets of image-text pairs; the goal of such a model is to build an “understanding” of the relationship between images and text. Self-supervised models such as DINOv2 instead learn visual features from images alone.
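A common use of embeddings is similarity search: images with similar content should map to vectors that point in similar directions. The sketch below assumes the embeddings have already been computed by some model; the file names and 4-dimensional vectors are made up for illustration (real models emit hundreds of dimensions), and `most_similar` is a hypothetical helper, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example only.
embeddings = {
    "cat_photo.jpg": [0.9, 0.1, 0.0, 0.2],
    "kitten_photo.jpg": [0.8, 0.2, 0.1, 0.3],
    "truck_photo.jpg": [0.0, 0.9, 0.8, 0.1],
}

def most_similar(query_name, table):
    """Return (name, score) of the entry closest to the query, excluding itself."""
    query = table[query_name]
    candidates = [
        (name, cosine_similarity(query, vec))
        for name, vec in table.items()
        if name != query_name
    ]
    return max(candidates, key=lambda pair: pair[1])

best_name, best_score = most_similar("cat_photo.jpg", embeddings)
print(best_name, round(best_score, 3))
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is a frequent choice for embedding comparison; other distance measures are possible.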

Sources

What is an Image Embedding?
Image embeddings. Image similarity and building… | by Romain Beaumont | Medium
DINO-V2: Learning Robust Visual Features without Supervision — Model Explained | by Shrinivasan Sankar | Medium
Vision Transformers vs CNNs at the Edge - Edge AI and Vision Alliance
