Vision Transformers and Contrastive Learning

Masking for Transformers revisited

As setup, recall from Sheet w02:

$$\textrm{Attention}(Q, K, V) = \textrm{softmax}\biggl(\frac{QK^T}{\sqrt{d_k}}\biggr)V. \tag{1}$$

Implement the attention computation (or reuse your existing implementation).
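
As a reference point for Eq. (1), a minimal sketch in PyTorch might look as follows. The function name `attention_sketch` and the tensor shapes are assumptions made for illustration and are not taken from Sheet w02.

```python
import math

import torch


def attention_sketch(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention following Eq. (1).

    Assumed shapes (hypothetical): q is (..., T_q, d_k), k is (..., T_k, d_k),
    v is (..., T_k, d_v).
    """
    d_k = q.shape[-1]
    # Similarities q_j . k_i, scaled by sqrt(d_k): shape (..., T_q, T_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Softmax over the key dimension, so each query's weights sum to 1
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```

A quick sanity check: if all queries are zero, every similarity is zero, the weights are uniform, and the output is the mean of the value vectors.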

“Masking” -- how and at what point do we mask in a transformer?

Implement each of the following ways of masking in your attention implementation. Briefly state potential problems of each approach.
- set the attention entries $\alpha_{j, i}\gets 0$,
- set the calculated similarity $\mathbf{q}_j\cdot \mathbf{k}_i \gets 0$,
- set the calculated similarity $\mathbf{q}_j\cdot \mathbf{k}_i \gets -\infty$ (use `float("-inf")`),
- set the calculated similarity $\mathbf{q}_j\cdot \mathbf{k}_i \gets -10^9$,
- set the attention entries $\alpha_{j, i}\gets 0$, then renormalize the attention weights of each query so that they sum to 1.
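
To see why the point at which the mask is applied matters, the following sketch compares two of the variants above on a single row of similarities. The score values and the mask are made-up illustration values; `True` in the mask marks a position that should be hidden.

```python
import torch

# One query attending to three keys (hypothetical similarity values).
scores = torch.tensor([[2.0, 1.0, 0.5]])
# True marks the key position that should be hidden from the query.
mask = torch.tensor([[False, False, True]])

# Variant: set the masked similarity to 0 before the softmax.
weights_zero = torch.softmax(scores.masked_fill(mask, 0.0), dim=-1)
# Variant: set the masked similarity to -inf before the softmax.
weights_neg_inf = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

print(weights_zero)     # the masked position still receives weight, since exp(0) = 1
print(weights_neg_inf)  # the masked position receives exactly zero weight
```

The remaining variants can be explored in the same way by changing only the line that applies the mask.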

Vision Transformer (ViT)

Calculate the size in GB of the attention matrices for a transformer architecture that works directly on full images with 256 $\times$ 256 pixels, 32 layers, and 16 heads. Assume 32 bit floats.

Now calculate the size in GB of the attention matrices for a ViT architecture with 16 $\times$ 16 patches on the same task.
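
Both size estimates come down to counting matrix entries. The sketch below is one way to set up the arithmetic, assuming one token per pixel for the full-image model, 256 patch tokens for the ViT (either reading of "16 $\times$ 16 patches" gives 256 tokens for a 256 $\times$ 256 image), batch size 1, no classification token, and 1 GB = 10^9 bytes. The function name `attention_size_gb` is hypothetical.

```python
def attention_size_gb(num_tokens: int, layers: int = 32, heads: int = 16,
                      bytes_per_entry: int = 4) -> float:
    """Memory of all attention matrices (one per head and layer) in GB."""
    entries_per_matrix = num_tokens ** 2  # every token attends to every token
    total_bytes = entries_per_matrix * heads * layers * bytes_per_entry
    return total_bytes / 1e9  # assumption: 1 GB = 10^9 bytes


pixel_tokens = 256 * 256         # assumption: one token per pixel
patch_tokens = (256 // 16) ** 2  # 16 x 16-pixel patches form a 16 x 16 grid

print(f"per-pixel tokens: {attention_size_gb(pixel_tokens):.1f} GB")  # 8796.1 GB
print(f"patch tokens:     {attention_size_gb(patch_tokens):.3f} GB")  # 0.134 GB
```

Because the matrix size is quadratic in the number of tokens, reducing the token count by a factor of 256 reduces the memory by a factor of 256 squared.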

What task is the ViT as presented in the lecture usually trained on?

What do positional embeddings for 2D image patches look like in the ViT architecture?

In the Vanilla Transformer discussed in the last lectures, we used token embeddings. Why does the ViT not make use of an embedding layer?

Is the ViT a decoder-only, encoder-only, or encoder-decoder Transformer?

The classification token in ViT is borrowed from BERT, where it was originally introduced (see the paper). Briefly explain what role this token plays and why one might add it rather than, e.g., reading out one of the patch tokens. What is the alternative discussed in the lecture?

Vision Transformers have displaced CNNs in many vision tasks. Why?

Contrastive self-supervised learning

What is the goal of Contrastive Learning of Representations (CLR)? Explain in your own words.

Let $\mathrm{score}$ be a similarity measure and $f_\theta$ a trained encoder model. Reformulate the goal of CLR as a mathematical formula using $f_\theta$ and $\mathrm{score}$.

To train our encoder model $f_\theta$ towards the CLR goal, we need to turn that goal into a loss function that the model can be optimized for. State the InfoNCE loss function introduced in the lecture and explain the components of the formula. In particular, explain each input variable of the formula. Can we interpret the loss function as a classification task?
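
For orientation, one widely used form of the InfoNCE loss is given below; the lecture's notation may differ. Here $x$ is an anchor sample, $x^+$ a positive sample, and $x_1^-, \dots, x_K^-$ are negative samples. These symbols and the number of negatives $K$ are assumptions introduced for this sketch.

$$
\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{\exp\bigl(\mathrm{score}(f_\theta(x), f_\theta(x^+))\bigr)}{\exp\bigl(\mathrm{score}(f_\theta(x), f_\theta(x^+))\bigr) + \sum_{j=1}^{K} \exp\bigl(\mathrm{score}(f_\theta(x), f_\theta(x_j^-))\bigr)}\right]
$$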

Which $\mathrm{score}$ function is used in the SimCLR framework? Explain the role of the linear projection used in the $\mathrm{score}$ computation.

Complete the following Python function:

import torch


def calc_simclr_affinity_matrix(encoded_pairs: torch.Tensor) -> torch.Tensor:
    """
    Construct the SimCLR affinity matrix.
    
    As the similarity measurement, use the cosine similarity directly.

    Args:
        encoded_pairs (torch.Tensor): A torch tensor of shape `(N, 2, D)`
            of `N` positive data pairs with encoding dimension `D`.
    
    Returns:
        torch.Tensor: The affinity matrix as a tensor of shape `(2N, 2N)`.
    """
    # Add your solution here

Complete the following Python function:

def calc_info_nce(affinity_matrix: torch.Tensor) -> float:
    """
    Calculate the InfoNCE loss from the SimCLR affinity matrix.
    
    Args:
        affinity_matrix (torch.Tensor): A torch tensor
            of shape (2N, 2N) with the SimCLR affinity
            matrix of the `N` positive data pairs.
    
    Returns:
        float: The calculated InfoNCE loss.
    """
    # Add your solution here

Using your implementation of calc_simclr_affinity_matrix and calc_info_nce, compute the affinity matrix and the InfoNCE loss for the following example:

import torch

# construct dummy data
data = torch.eye(3)
noise = torch.full_like(data, 0.1)
encoded_pairs = torch.stack([data, data + noise], dim=1)

affinity_matrix = calc_simclr_affinity_matrix(encoded_pairs)
loss = calc_info_nce(affinity_matrix)
# check the result against the expected value (argument order: actual, expected)
torch.testing.assert_close(loss, 0.9665, rtol=1e-4, atol=1e-4)
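
One possible way to fill in the two functions is sketched below, under separate names (`affinity_sketch`, `info_nce_sketch`) so that it does not replace your own solution. It assumes a temperature of 1, excludes each sample's similarity with itself, orders the rows as all first views followed by all second views, and averages over all 2N anchors. These choices are assumptions; they are used here because they reproduce the expected value 0.9665 for the example above.

```python
import torch
import torch.nn.functional as F


def affinity_sketch(encoded_pairs: torch.Tensor) -> torch.Tensor:
    """Cosine similarities between all 2N encodings, shape (2N, 2N)."""
    # Assumed row order: all first views, then all second views.
    z = torch.cat([encoded_pairs[:, 0], encoded_pairs[:, 1]], dim=0)
    z = F.normalize(z, dim=1)  # unit length, so dot products are cosines
    return z @ z.T


def info_nce_sketch(affinity_matrix: torch.Tensor) -> float:
    """InfoNCE with temperature 1, averaged over all 2N anchors."""
    two_n = affinity_matrix.shape[0]
    n = two_n // 2
    device = affinity_matrix.device
    # Exclude self-similarities from the softmax denominator.
    self_mask = torch.eye(two_n, dtype=torch.bool, device=device)
    logits = affinity_matrix.masked_fill(self_mask, float("-inf"))
    # Under the assumed row order, the positive of row i is row (i + N) mod 2N.
    targets = (torch.arange(two_n, device=device) + n) % two_n
    return F.cross_entropy(logits, targets).item()


data = torch.eye(3)
encoded_pairs = torch.stack([data, data + 0.1], dim=1)
loss = info_nce_sketch(affinity_sketch(encoded_pairs))
print(round(loss, 4))  # 0.9665
```

Using `cross_entropy` with the index of the positive sample as the target is a design choice of this sketch; a different row order would only change how `targets` is built.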

How does the training batch size affect SimCLR? Explain the limitations of SimCLR that follow from the influence of the batch size.

SimCLR is a typical example of self-supervised learning. The model is trained on a pretext objective but used for a different downstream task. Explain what the SimCLR pretext task is and name one possible downstream objective that SimCLR is used for.

Consider the following approach:
We train a model on a large set of images to classify images into 1000 image classes. We heavily augment the images during training (similar to the augmentations in SimCLR). The final layer of the model is a linear map followed by a softmax. We then take the activations before the final layer as learned image representations for downstream tasks.
Compare this alternative approach to SimCLR. How do they differ? What are the advantages and disadvantages of each approach?

CLIP

On what kind of data was CLIP trained? What are the risks and benefits of this data collection approach?

CLIP encodes images and text into a shared embedding space and scores matches using cosine similarity.

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\, \|\mathbf{v}\|} \tag{2}$$

Given an image embedding $\mathbf{v}_{\text{img}} = (3,\ 0,\ 4,\ 0)$ and three candidate captions:

Label	Caption	Embedding
A	“a cat sitting on a windowsill”	$(4,\ 0,\ 3,\ 0)$
B	“a red sports car on a highway”	$(0,\ 2,\ 0,\ 0)$
C	“a black dog in a dark room”	$(5,\ 5,\ 5,\ 5)$

Which caption would CLIP predict? Why does CLIP normalize the embeddings first?

Consider the comparison of out-of-distribution behaviour with models trained on ImageNet: why does the ViT generalise better?

How does the CLIP architecture relate to a ViT? Is the ViT necessary for CLIP?

What are some limitations of CLIP?

How does CoCa extend the CLIP approach?

Homework

Finish this exercise sheet and review the solutions! The sheets are not graded, but solving them helps a lot in preparing for the exam. We provide solutions to the exercises on Ilias. Write down your own solutions FIRST (ideally by hand!) and review our solutions only AFTER you have finished the exercises.
Read the DINO paper (Caron et al. 2021) and, optionally, the DINOv2 paper (Oquab et al. 2023).
Read the TabPFN paper (Hollmann et al. 2023), which presents a tabular foundation model.