ViT applies a pure transformer directly to sequences of image patches.
Influence
Performance scales with dataset size, and ViT has become the new de facto backbone for image tasks.
For an input image, split it into fixed-size patches, flatten each patch, and linearly project it to the embedding dimension.
Sample implementation:
import torch.nn as nn
from einops import rearrange

# img: [batch, channels, H, W]; flatten each patch_size x patch_size patch and project it to dim
proj = nn.Linear((patch_size**2) * channels, dim)
x_p = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size)
embedding = proj(x_p)  # [batch, num_patches, dim]
or equivalently:
conv = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
embedding = rearrange(conv(img), 'b c h w -> b (h w) c')
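As a quick shape check (a sketch, assuming a 224x224 RGB input, 16x16 patches, and an embedding dimension of 768), both variants produce the same [batch, 196, 768] token sequence:
import torch
import torch.nn as nn
from einops import rearrange

img = torch.randn(1, 3, 224, 224)   # dummy batch
patch_size, channels, dim = 16, 3, 768

proj = nn.Linear((patch_size**2) * channels, dim)
x_p = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size)
print(proj(x_p).shape)                                     # torch.Size([1, 196, 768])

conv = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
print(rearrange(conv(img), 'b c h w -> b (h w) c').shape)  # torch.Size([1, 196, 768])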
MHSAs and Convs exhibit opposite behaviors: MHSAs act as low-pass filters, while Convs act as high-pass filters.
MHSAs improve not only accuracy but also generalization by flattening the loss landscape.
ViT models are less effective than CNN models at capturing the high-frequency components of images (those related to local structures) [1].
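One way to see what these low-/high-frequency claims refer to is to take the 2D Fourier amplitude spectrum of a feature map and measure how much of its energy sits near the spectrum center (the low frequencies). A minimal sketch, assuming feature maps of shape [B, C, H, W]; the function name and the 0.25 radius threshold are arbitrary choices here, not from the cited work:
import torch
import torch.nn.functional as F

def low_frequency_energy_ratio(feat, low_radius=0.25):
    # feat: feature map of shape [B, C, H, W]
    B, C, H, W = feat.shape
    # centered 2D amplitude spectrum, averaged over batch and channels
    amp = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1)).abs().mean(dim=(0, 1))
    # frequencies within `low_radius` of the spectrum center count as "low"
    ys = torch.arange(H).view(-1, 1) - H / 2
    xs = torch.arange(W).view(1, -1) - W / 2
    radius = torch.sqrt(ys**2 + xs**2) / (min(H, W) / 2)
    low = amp[radius <= low_radius].sum()
    return (low / amp.sum()).item()

# a blurred (low-pass filtered) map concentrates more of its energy at low frequencies
x = torch.randn(2, 8, 32, 32)
print(low_frequency_energy_ratio(x))
print(low_frequency_energy_ratio(F.avg_pool2d(x, 3, stride=1, padding=1)))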
How can ViTs be made to capture high-frequency components better?
Robustness to input perturbations:
ResNet: the noise has high-frequency components and a localized structure [1]
ViT: the noise has relatively low-frequency components and a larger structure (the patch borders are clearly visible in the perturbation)
Is the decision based on texture or shape?
It turns out that attention is not all you need.
Learning directly from raw text about images leverages a broader source of supervision compared to using a fixed set of predetermined object categories. CLIP introduced a simple yet effective contrastive objective to learn a vision-language embedding space, where similar concepts are pulled together and different concepts are pushed apart.
Sample implementation:
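# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned projection of image features to the joint embedding
# W_t[d_t, d_e] - learned projection of text features to the joint embedding
# t             - learned temperature parameter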
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
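The same symmetric loss in runnable PyTorch (a sketch, assuming the L2-normalized embeddings I_e, T_e and the learned log-temperature t already exist; the random tensors below only stand in for real encoder outputs):
import torch
import torch.nn.functional as F

def clip_loss(I_e, T_e, t):
    # I_e, T_e: L2-normalized embeddings [n, d_e]; t: learnable log-temperature (scalar)
    logits = I_e @ T_e.t() * t.exp()              # [n, n] scaled cosine similarities
    labels = torch.arange(I_e.size(0), device=I_e.device)
    loss_i = F.cross_entropy(logits, labels)      # each image should match its own text
    loss_t = F.cross_entropy(logits.t(), labels)  # each text should match its own image
    return (loss_i + loss_t) / 2

n, d_e = 8, 512
I_e = F.normalize(torch.randn(n, d_e), dim=1)     # placeholder image embeddings
T_e = F.normalize(torch.randn(n, d_e), dim=1)     # placeholder text embeddings
t = torch.tensor(2.3)                             # placeholder temperature parameter
print(clip_loss(I_e, T_e, t))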
The text embeddings provide a flexible label representation and help generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample.
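A minimal sketch of that zero-shot use, assuming the label prompts have already been pushed through the text encoder (random vectors stand in for real CLIP embeddings here, and the prompt template is only illustrative):
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_text_embs):
    # image_emb: [d_e]; class_text_embs: [K, d_e] embeddings of prompts like "a photo of a {label}"
    sims = F.normalize(class_text_embs, dim=1) @ F.normalize(image_emb, dim=0)  # [K] cosine similarities
    return sims.argmax().item()

classes = ["cat", "dog", "car"]                   # can be swapped at test time, no retraining
class_text_embs = torch.randn(len(classes), 512)  # placeholder for text_encoder(prompts)
image_emb = torch.randn(512)                      # placeholder for image_encoder(image)
print(classes[zero_shot_classify(image_emb, class_text_embs)])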
Language-driven Semantic Segmentation (LSeg) embeds text labels and image pixels into a common space, and assigns the closest label to each pixel.
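A sketch of that per-pixel assignment, assuming a dense per-pixel feature map and label embeddings that already live in the shared space (in LSeg these come from a dense image encoder and the CLIP text encoder; the shapes below are placeholders):
import torch
import torch.nn.functional as F

def assign_labels(pixel_feats, label_embs):
    # pixel_feats: [H, W, d_e] per-pixel features; label_embs: [K, d_e] text embeddings of the label set
    sims = F.normalize(pixel_feats, dim=-1) @ F.normalize(label_embs, dim=-1).t()  # [H, W, K]
    return sims.argmax(dim=-1)                      # [H, W] index of the closest label per pixel

labels = ["sky", "road", "tree", "other"]           # the label set is free-form text
pixel_feats = torch.randn(32, 32, 512)              # placeholder dense image features
label_embs = torch.randn(len(labels), 512)          # placeholder text embeddings
print(assign_labels(pixel_feats, label_embs).shape) # torch.Size([32, 32])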
https://theaisummer.com/vision-transformer/