Robot Perception and Control

Robot Perception in 2D

Last updated: Jul 25, 2024
Kashu Yamazaki
kyamazak@andrew.cmu.edu

Image Perception Tasks


Image Classification

ImageNet

ImageNet is a large-scale dataset for image classification, best known for its use in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

  • (2009): ImageNet was released by Fei-Fei Li's team.
  • (2012): AlexNet made a breakthrough, achieving a 15.3% top-5 error rate and kickstarting the boom in deep learning.
  • (2017): Top-5 accuracy surpassed 95% and ILSVRC concluded.

After ILSVRC Challenge

ImageNet continues to be a valuable resource for researchers, particularly for pre-training image encoder (backbone) models.


AlexNet

AlexNet is a convolutional neural network (CNN) architecture designed by Alex Krizhevsky. It contains eight layers: the first five are convolutional layers, some followed by max-pooling layers, and the last three are fully connected layers.


Influence

AlexNet is considered one of the most influential works in computer vision, having spurred many subsequent papers that employ CNNs and GPUs to accelerate deep learning.


ResNet arxiv

ResNet layers learn residual functions with respect to the layer inputs.
A residual function F(x) = H(x) - x represents the difference between the underlying function H(x) and the input x. A general form of a ResNet layer is written as:

y = F(x) + x

The "+ x" operation is called a residual (skip) connection: it performs an identity mapping that connects the input of the subnetwork to its output.
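A minimal sketch of such a layer in PyTorch (for illustration only; the actual ResNet blocks also use batch normalization, downsampling, and bottleneck designs):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = F(x) + x, where F is a small convolutional subnetwork."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual (skip) connection: add the identity mapping to F(x)
        return torch.relu(self.f(x) + x)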

Influence

Learning the residual function makes training very deep networks easier and helps address the vanishing gradient problem: the identity path lets gradients flow directly to earlier layers.


ViT arxiv

ViT applies a pure transformer directly to sequences of image patches.

  • An image patch is a p × p pixel crop of the image that will be treated as an image token.
  • The patchify stem is implemented by a stride-p, p × p convolution (p = 16 by default) applied to the input image [1].

Influence

Performance scales with dataset size, and ViT has become the new de facto standard image backbone.


Representing Image as Patches

For an input image of size H × W and patch size p, N = HW / p² image patches will be created.

  • p = 16 for the original ViT; thus a 224 × 224 image is represented as 196 image tokens.

Sample implementation:

import torch.nn as nn
from einops import rearrange

# flatten each p x p patch and project it to the embedding dimension
proj = nn.Linear((patch_size**2) * channels, dim)
x_p = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size)
embedding = proj(x_p)

or equivalently:

# equivalent patchify stem: a stride-p, p x p convolution, then flatten the spatial grid
conv = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
embedding = rearrange(conv(img), 'b c h w -> b (h w) c')
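To complete the ViT input, a learnable class token is typically prepended and learned positional embeddings are added before the transformer encoder. A minimal sketch continuing the snippet above, assuming a 224 × 224 input (the encoder configuration is illustrative, not the original ViT hyperparameters):

import torch

num_patches = (224 // patch_size) ** 2                          # e.g. 196 for p = 16
cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # learnable [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learned positions

tokens = torch.cat([cls_token.expand(embedding.shape[0], -1, -1), embedding], dim=1)
tokens = tokens + pos_embed
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=6)
features = encoder(tokens)  # features[:, 0] is the class-token representation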

ViT vs ResNet arxiv

  • MHSAs and Convs exhibit opposite behaviors: MHSAs act as low-pass filters, whereas Convs act as high-pass filters (see the Fourier-analysis sketch below).

  • MHSAs improve not only accuracy but also generalization by flattening the loss landscapes.
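One way to probe this low-pass vs. high-pass behavior empirically is a Fourier analysis of the feature maps. A minimal sketch for measuring the fraction of high-frequency energy in a feature map (the 0.25 cutoff is an arbitrary choice):

import torch

def high_freq_ratio(feat: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    # feat: feature map of shape [B, C, H, W]
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1)).abs() ** 2
    H, W = feat.shape[-2:]
    fy = torch.linspace(-0.5, 0.5, H).view(H, 1)
    fx = torch.linspace(-0.5, 0.5, W).view(1, W)
    radius = (fy ** 2 + fx ** 2).sqrt()
    high = spec[..., radius > cutoff].sum(dim=-1)  # energy above the cutoff frequency
    total = spec.flatten(-2).sum(dim=-1)           # total spectral energy
    return (high / total).mean()                   # lower for low-pass (MHSA-like) layers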


ViT vs ResNet

ViT models are less effective in capturing the high-frequency components (related to local structures) of images than CNN models [1].
For ViTs to capture high-frequency components:

  • applying knowledge distillation with a CNN teacher model [2].
  • utilizing convolution-like operations or multi-scale feature maps.
  • using RandAugment [3].

ViT vs ResNet

Robustness to input perturbations:

  • ResNet: the noise has high-frequency components and a localized structure [1]

  • ViT: the noise has relatively low-frequency components and a large-scale structure (patch borders are clearly visible in the perturbation pattern)

    • When pre-trained with a sufficient amount of data, ViTs are at least as robust as their ResNet counterparts on a broad range of perturbations [2].

ViT vs ResNet arxiv

Is the decision based on texture or shape?

  • ResNet: relies on texture rather than shape [1]
  • ViT: a little more robust to texture perturbations
  • Human: much more robust to texture perturbations


Do We Really Need Attention? arxiv

It turns out that attention is not all you need.

  • As long as the tokens can be mixed, the MetaFormer architecture can achieve performance similar to the Transformer's.
  • The meta-architecture of a transformer layer can be viewed as a special case of a ResNet layer with the following components (see the sketch after this list):
    • Token mixers (self-attention, etc.)
    • Position encoding
    • Channel MLP (1x1 convolution)
    • Normalization (LayerNorm, etc.)
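A minimal sketch of such a block in PyTorch, assuming an arbitrary token_mixer module (self-attention, pooling, a token-mixing MLP, etc.); this illustrates the meta-architecture rather than the exact MetaFormer implementation:

import torch
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """Token mixing followed by a channel MLP, each wrapped in LayerNorm and a residual connection."""
    def __init__(self, dim: int, token_mixer: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer     # e.g. self-attention, pooling, ...
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(  # per-token MLP (equivalent to a 1x1 convolution)
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, N, dim]
        x = x + self.token_mixer(self.norm1(x))   # residual connection around the token mixer
        x = x + self.channel_mlp(self.norm2(x))   # residual connection around the channel MLP
        return x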


CLIP arxiv github

Learning directly from raw text about images leverages a broader source of supervision compared to using a fixed set of predetermined object categories. CLIP introduced a simple yet effective contrastive objective to learn a vision-language embedding space, where similar concepts are pulled together and different concepts are pushed apart.


CLIP arxiv github

Pseudocode for the contrastive objective (NumPy-style, following the CLIP paper):

# I[n, h, w, c]: minibatch of images; T[n, l]: minibatch of texts
# W_i[d_i, d_e], W_t[d_t, d_e]: learned projection matrices; t: learned temperature
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T)  #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
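For reference, a minimal zero-shot classification sketch using OpenAI's released clip package (the image path and class names below are placeholders):

import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image path
texts = clip.tokenize([f"a photo of a {c}" for c in ["cat", "dog", "car"]])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # probability of the image matching each text prompt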

Object Detection

YOLO


DETR


Segmentation

LSeg arxiv github

The text embeddings provide a flexible label representation and help generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample.


LSeg arxiv github

Language-driven Semantic Segmentation (LSeg) embeds text labels and image pixels into a common space, and assigns the closest label to each pixel.
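Conceptually, the per-pixel label assignment can be sketched as follows (a minimal illustration assuming dense pixel embeddings and text embeddings already projected into the same d-dimensional space; not the actual LSeg code):

import torch
import torch.nn.functional as F

def assign_labels(pixel_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # pixel_feats: [B, d, H, W] dense image embeddings; text_embeds: [K, d] label embeddings
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_embeds = F.normalize(text_embeds, dim=1)
    # cosine similarity between every pixel and every label: [B, K, H, W]
    logits = torch.einsum('bdhw,kd->bkhw', pixel_feats, text_embeds)
    return logits.argmax(dim=1)  # [B, H, W] index of the closest label per pixel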


SAM arxiv


https://theaisummer.com/vision-transformer/