ViT applies a pure transformer directly to sequences of image patches.
Influence
Performance scales with dataset size, and ViT has become the new de facto backbone for image tasks.
For an input image, split it into fixed-size patches, flatten each patch, and linearly project it to the embedding dimension.
Sample implementation:
import torch.nn as nn
from einops import rearrange

# img: [batch, channels, H, W]; flatten each patch_size x patch_size patch and project it to dim
proj = nn.Linear((patch_size**2) * channels, dim)
x_p = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size)
embedding = proj(x_p)  # [batch, num_patches, dim]
or equivalently:
conv = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
embedding = rearrange(conv(img), 'b c h w -> b (h w) c')
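As a quick shape check (a sketch, assuming a 224x224 RGB input, 16x16 patches, and an embedding dimension of 768), both variants produce the same [batch, 196, 768] token sequence:
import torch
import torch.nn as nn
from einops import rearrange

img = torch.randn(1, 3, 224, 224)   # dummy batch
patch_size, channels, dim = 16, 3, 768

proj = nn.Linear((patch_size**2) * channels, dim)
x_p = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size)
print(proj(x_p).shape)                                     # torch.Size([1, 196, 768])

conv = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
print(rearrange(conv(img), 'b c h w -> b (h w) c').shape)  # torch.Size([1, 196, 768])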
MHSAs and Convs exhibit opposite behaviors: MHSAs act as low-pass filters, while Convs act as high-pass filters.
MHSAs improve not only accuracy but also generalization by flattening the loss landscape.
ViT models are less effective than CNN models at capturing the high-frequency components of images (those related to local structures) [1].
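One way to see what these low-/high-frequency claims refer to is to take the 2D Fourier amplitude spectrum of a feature map and measure how much of its energy sits near the spectrum center (the low frequencies). A minimal sketch, assuming feature maps of shape [B, C, H, W]; the function name and the 0.25 radius threshold are arbitrary choices here, not from the cited work:
import torch
import torch.nn.functional as F

def low_frequency_energy_ratio(feat, low_radius=0.25):
    # feat: feature map of shape [B, C, H, W]
    B, C, H, W = feat.shape
    # centered 2D amplitude spectrum, averaged over batch and channels
    amp = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1)).abs().mean(dim=(0, 1))
    # frequencies within `low_radius` of the spectrum center count as "low"
    ys = torch.arange(H).view(-1, 1) - H / 2
    xs = torch.arange(W).view(1, -1) - W / 2
    radius = torch.sqrt(ys**2 + xs**2) / (min(H, W) / 2)
    low = amp[radius <= low_radius].sum()
    return (low / amp.sum()).item()

# a blurred (low-pass filtered) map concentrates more of its energy at low frequencies
x = torch.randn(2, 8, 32, 32)
print(low_frequency_energy_ratio(x))
print(low_frequency_energy_ratio(F.avg_pool2d(x, 3, stride=1, padding=1)))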
How can ViTs be made to capture high-frequency components better?
Robustness to input perturbations:
ResNet: the noise has high-frequency components and a localized structure [1]
ViT: the noise has relatively low-frequency components and a larger structure (the patch borders are clearly visible in the perturbation)
Is the decision based on texture or shape?
It turns out that attention is not all you need.
Learning directly from raw text about images leverages a broader source of supervision compared to using a fixed set of predetermined object categories. CLIP introduced a simple yet effective contrastive objective to learn a vision-language embedding space, where similar concepts are pulled together and different concepts are pushed apart.
Sample implementation:
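# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned projection of image features to the joint embedding
# W_t[d_t, d_e] - learned projection of text features to the joint embedding
# t             - learned temperature parameter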
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
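The same symmetric loss in runnable PyTorch (a sketch, assuming the L2-normalized embeddings I_e, T_e and the learned log-temperature t already exist; the random tensors below only stand in for real encoder outputs):
import torch
import torch.nn.functional as F

def clip_loss(I_e, T_e, t):
    # I_e, T_e: L2-normalized embeddings [n, d_e]; t: learnable log-temperature (scalar)
    logits = I_e @ T_e.t() * t.exp()              # [n, n] scaled cosine similarities
    labels = torch.arange(I_e.size(0), device=I_e.device)
    loss_i = F.cross_entropy(logits, labels)      # each image should match its own text
    loss_t = F.cross_entropy(logits.t(), labels)  # each text should match its own image
    return (loss_i + loss_t) / 2

n, d_e = 8, 512
I_e = F.normalize(torch.randn(n, d_e), dim=1)     # placeholder image embeddings
T_e = F.normalize(torch.randn(n, d_e), dim=1)     # placeholder text embeddings
t = torch.tensor(2.3)                             # placeholder temperature parameter
print(clip_loss(I_e, T_e, t))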
The text embeddings provide a flexible label representation and help generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample.
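A minimal sketch of that zero-shot use, assuming the label prompts have already been pushed through the text encoder (random vectors stand in for real CLIP embeddings here, and the prompt template is only illustrative):
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_text_embs):
    # image_emb: [d_e]; class_text_embs: [K, d_e] embeddings of prompts like "a photo of a {label}"
    sims = F.normalize(class_text_embs, dim=1) @ F.normalize(image_emb, dim=0)  # [K] cosine similarities
    return sims.argmax().item()

classes = ["cat", "dog", "car"]                   # can be swapped at test time, no retraining
class_text_embs = torch.randn(len(classes), 512)  # placeholder for text_encoder(prompts)
image_emb = torch.randn(512)                      # placeholder for image_encoder(image)
print(classes[zero_shot_classify(image_emb, class_text_embs)])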
Language-driven Semantic Segmentation (LSeg) embeds text labels and image pixels into a common space, and assigns the closest label to each pixel.
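A sketch of that per-pixel assignment, assuming a dense per-pixel feature map and label embeddings that already live in the shared space (in LSeg these come from a dense image encoder and the CLIP text encoder; the shapes below are placeholders):
import torch
import torch.nn.functional as F

def assign_labels(pixel_feats, label_embs):
    # pixel_feats: [H, W, d_e] per-pixel features; label_embs: [K, d_e] text embeddings of the label set
    sims = F.normalize(pixel_feats, dim=-1) @ F.normalize(label_embs, dim=-1).t()  # [H, W, K]
    return sims.argmax(dim=-1)                      # [H, W] index of the closest label per pixel

labels = ["sky", "road", "tree", "other"]           # the label set is free-form text
pixel_feats = torch.randn(32, 32, 512)              # placeholder dense image features
label_embs = torch.randn(len(labels), 512)          # placeholder text embeddings
print(assign_labels(pixel_feats, label_embs).shape) # torch.Size([32, 32])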
https://theaisummer.com/vision-transformer/