Robot Perception and Control

Introduction

Last updated: Jul 25, 2024
Kashu Yamazaki
kyamazak@andrew.cmu.edu

Logistics

Lectures

  • Time: 00:00 - 00:00 (CST), MWF / TuTh
  • Location: JBHT 000 (in person)

Office Hours

  • Instructor: By appointment via email
  • TA: 00:00 - 00:00 (CST), MWF

Logistics

Grading Policy

  • HWs (30%): 5 homework assignments
  • Quizzes (10%): 5 quizzes, each worth 20 points
  • Midterm (30%)
  • Final Project (30%): project report + presentation

A: 90% ~, B: 80% ~ 90%, C: 70% ~ 80%, D: 60% ~ 70%

Submission Policy

  • HWs: a per-day penalty is deducted from the total points of the assignment after the due date.
  • Final Project: No late submission.

Every submission is due at midnight (11:59 pm) on the date specified.


Robot Learning

What is Robot Learning?

Robot learning is a research field at the intersection of machine learning and robotics. It studies techniques allowing a robot to acquire novel skills or adapt to its environment through learning algorithms.

  • Sensing: observe the physical world through multimodal senses
  • Perception: acquiring knowledge from sensor data
  • Action: act on the environment to execute a task / acquire new observations

A key challenge in Robot Learning is to close the perception-action loop.
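
A minimal sketch of such a loop; the robot, perceive, and policy interfaces below are hypothetical placeholders:

def perception_action_loop(robot, perceive, policy):
    obs = robot.sense()            # Sensing: observe the world with multimodal sensors
    while not robot.task_done():
        state = perceive(obs)      # Perception: extract knowledge from sensor data
        action = policy(state)     # decide how to act on the environment
        obs = robot.act(action)    # Action: execute the task, acquire a new observation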


Applications of Robot Learning

  • Manipulation
  • Locomotion
  • Mobile Manipulation


When Should Robots Learn?

Robots should be designed to learn in situations where pre-existing knowledge or established protocols are insufficient or non-existent, requiring them to discover knowledge from data:

  • High Environmental Uncertainty
  • Significant Variation in Observations
  • Lack of Reliable Priors
  • Complex or Unstructured Environments
  • Continuous Improvement

Learning is NOT the solution to every problem in robotics.

When the task can be modeled without knowledge extracted from data, a learning algorithm is not required (and learning algorithms tend to perform worse than well-engineered classical methods). Learning systems can also be combined with classical techniques.


How to Make Robots Learn?

These days, many robot learning methods are based on deep neural networks trained with various learning algorithms (supervised learning, unsupervised learning, reinforcement learning, etc.).

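A minimal sketch of one supervised-learning update in PyTorch (illustrative model, shapes, and data; e.g., behavior cloning of expert actions):

import torch
import torch.nn as nn

# Toy behavior-cloning step: regress actions from observations
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

obs = torch.randn(32, 16)        # batch of observations (hypothetical sizes)
target = torch.randn(32, 4)      # expert actions to imitate (placeholder data)

loss = nn.functional.mse_loss(model(obs), target)
optimizer.zero_grad()
loss.backward()                  # backpropagation computes the gradients
optimizer.step()                 # gradient step updates the parameters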


Multi-modal Sensing

  • LiDAR sensor
  • Stereo depth sensor
  • RGB-D camera, microphone
  • IMU (gyro / accelerometer / barometer)
  • Tactile sensor
  • Joint position / velocity / torque


Deep Learning

Basics


Backpropagation
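
A minimal PyTorch sketch of backpropagation via automatic differentiation (values are illustrative):

import torch

# Tiny computation graph: y = sum((w * x + b)^2)
x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

y = ((w * x + b) ** 2).sum()    # forward pass records the graph
y.backward()                    # backward pass applies the chain rule

print(w.grad)                   # dy/dw = sum(2 * (w*x + b) * x)
print(b.grad)                   # dy/db = sum(2 * (w*x + b))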


Linear/Dense Layer
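
A minimal PyTorch sketch (sizes are illustrative):

import torch
import torch.nn as nn

layer = nn.Linear(in_features=8, out_features=4)   # computes y = x W^T + b
x = torch.randn(2, 8)                              # batch of 2 input vectors
y = layer(x)                                       # output shape: (2, 4)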


Convolution Layer
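
A minimal PyTorch sketch (shapes are illustrative):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
img = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
out = conv(img)                   # (1, 16, 32, 32): padding=1 keeps the spatial size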


Recurrent Cells

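A minimal PyTorch sketch using a GRU cell (sizes are illustrative):

import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=10, hidden_size=20)
h = torch.zeros(1, 20)            # initial hidden state
for t in range(5):                # unroll the cell over a short sequence
    x_t = torch.randn(1, 10)      # input at time step t
    h = cell(x_t, h)              # hidden state carries information across steps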


Scaled Dot-Product Attention

An attention mechanism where the dot products are scaled down by $\sqrt{d_k}$.

  • Motivated by the concern that when the dot products grow large, the softmax function may have an extremely small gradient, which makes learning inefficient.

  • Calculates the similarity between the queries $Q$ and the keys $K$, and weights the values $V$ by this (softmax-normalized) similarity.

  • Can be viewed as a differentiable dictionary lookup.
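
The standard formulation (Vaswani et al., 2017), where $d_k$ is the dimension of the keys:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$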


Multi-Head Attention (MHA)

A module that runs the attention mechanism several times in parallel and concatenates the heads' outputs (see the sketch after the list below).

  • The multiple attention heads allow the model to attend to different parts of the sequence in different ways.
  • When $Q$, $K$, and $V$ are derived from the same input, this is called self-attention.
  • Only a small subset of heads appears to be important for the translation task; in particular, many encoder self-attention heads can be removed without seriously affecting performance [1].
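
A minimal sketch using PyTorch's built-in module (sizes are illustrative):

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)       # (batch, sequence, embedding)
out, weights = mha(x, x, x)      # query = key = value -> self-attention
print(out.shape)                 # torch.Size([2, 10, 64])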

Masked Multi-Head Attention

Masking of the unwanted tokens can be done by setting their scores to $-\infty$. The binary mask is applied to the attention scores so that, after the softmax, the attention weight on those unwanted tokens is zero.

Sample implementation:

import torch
import numpy as np
from einops import rearrange

# assumes: to_qvk = nn.Linear(dim, dim_head * heads * 3, bias=False)
qkv = to_qvk(x)                                               # (b, t, 3 * heads * dim_head)
q, k, v = tuple(rearrange(qkv, 'b t (d k h) -> k b h t d', k=3, h=num_heads))
scale = dim_head ** -0.5                                      # 1 / sqrt(d_k)
scaled_dot_prod = torch.einsum('b h i d , b h j d -> b h i j', q, k) * scale
scaled_dot_prod = scaled_dot_prod.masked_fill(mask, -np.inf)  # mask is True on unwanted tokens
attention = torch.softmax(scaled_dot_prod, dim=-1)            # -inf scores give zero weight
out = torch.einsum('b h i j , b h j d -> b h i d', attention, v)
out = rearrange(out, 'b h t d -> b t (h d)')                  # concatenate the heads
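
For example, a causal (look-ahead) mask for autoregressive decoding can be built with torch.triu; the True entries mark the future positions that get filled with -inf above:

import torch

t = 5  # sequence length (illustrative)
# True strictly above the diagonal: tokens each position must not attend to
mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)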

Self-Attention vs Cross-Attention

Self-Attention: all of the queries $Q$, keys $K$, and values $V$ come from the same input source.

  • Each position in the encoder can attend to all positions in the previous layer of the encoder.

Cross-Attention: the keys $K$ and values $V$ come from the reference source, while the queries $Q$ come from the querying source.

  • This allows every position in the decoder to attend over all positions in the input sequence.
  • One way to realize cross-modal fusion (see the sketch below).
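
A minimal PyTorch sketch of the distinction (module and shapes are illustrative):

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)                 # querying source (e.g., decoder states)
memory = torch.randn(2, 20, 64)            # reference source (e.g., encoder output)

self_out, _ = attn(x, x, x)                # self-attention: Q = K = V = x
cross_out, _ = attn(x, memory, memory)     # cross-attention: Q from x, K/V from memory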

Inductive Bias


Inductive bias (IB) is the set of assumptions about the data that the model holds [1].

  • CNN: information in the data is aggregated locally. (strong IB)
  • RNN: the data is strongly correlated with previous time steps. (strong IB)
  • Self-Attention: simply correlates all features with each other. (weak IB)

Resources: Books


Deep Learning
by Ian Goodfellow, Yoshua Bengio, Aaron Courville


Modern Robotics: Mechanics, Planning, and Control
by Kevin M. Lynch, Frank C. Park


Resources: Online Materials
