Vision Transformers and Contrastive Learning

Masking for Transformers revisited

For setup, recall from Sheet w02

$$\textrm{Attention}(Q, K, V) = \textrm{softmax}\biggl(\frac{QK^T}{\sqrt{d_k}}\biggr)V.$$

Implement the attention computation (or reuse your existing implementation).

  1. “Masking” -- how and at what point do we mask in a transformer?

  1. Implement the following masking variants in your attention implementation. Briefly state potential problems of each approach.

    • set the attention entries $\alpha_{j, i} \gets 0$,

    • set the calculated similarity $\mathbf{q}_j \cdot \mathbf{k}_i \gets 0$,

    • set the calculated similarity $\mathbf{q}_j \cdot \mathbf{k}_i \gets -\infty$ (use float("-inf")),

    • set the calculated similarity $\mathbf{q}_j \cdot \mathbf{k}_i \gets -10^9$,

    • set the attention entries $\alpha_{j, i} \gets 0$, renormalize the attention scores to 1 afterwards.
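
As a starting point for the variants listed above, here is a minimal sketch, assuming PyTorch, a single head without a batch dimension, and a boolean mask in which `True` marks entries to hide. The function name `masked_attention` and the toy tensors are illustrative and not taken from the sheet. The sketch applies the $-\infty$ variant before the softmax and, for contrast, zeroes the weights after the softmax without renormalizing.

```python
import math
import torch

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; mask is bool (n_q, n_k), True = hide."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    scores = scores.masked_fill(mask, float("-inf"))  # mask before the softmax
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
q, k, v = torch.randn(3, 4), torch.randn(3, 4), torch.randn(3, 4)
# Causal mask: query j may not attend to keys i > j.
causal = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)

out, weights = masked_attention(q, k, v, causal)
print(weights.sum(dim=-1))  # every row sums to 1

# Contrast: zero the weights after the softmax, without renormalizing.
unmasked = torch.softmax(q @ k.T / math.sqrt(q.shape[-1]), dim=-1)
zeroed = unmasked.masked_fill(causal, 0.0)
print(zeroed.sum(dim=-1))   # rows with hidden entries sum to less than 1
```

Note that with the $-\infty$ variant a row whose entries are all masked yields NaN, because the softmax then divides zero by zero.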

Vision Transformer (ViT)

  1. Calculate the size in GB of the attention matrices for a transformer architecture that works directly on full images with $256 \times 256$ pixels, 32 layers, and 16 heads. Assume 32-bit floats.

  1. Now calculate the size in GB of the attention matrices for a ViT architecture with $16 \times 16$ patches on the same task.
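
The arithmetic behind the two size questions above can be sketched as follows. The helper `attention_matrix_gib` is a hypothetical name, and the sketch rests on several assumptions: batch size 1, one $n \times n$ attention matrix per head and layer, no class token, and 1 GB read as $2^{30}$ bytes. For a $256 \times 256$ image, reading "$16 \times 16$ patches" as the patch size in pixels or as the patch grid gives the same count of 256 tokens.

```python
def attention_matrix_gib(num_tokens, layers=32, heads=16, bytes_per_value=4):
    """Memory of all (num_tokens x num_tokens) attention matrices in GiB."""
    per_matrix = num_tokens * num_tokens * bytes_per_value
    return per_matrix * layers * heads / 2**30

pixel_tokens = 256 * 256         # one token per pixel
patch_tokens = (256 // 16) ** 2  # 16 x 16 grid of patches, no class token

print(attention_matrix_gib(pixel_tokens))  # 8192.0
print(attention_matrix_gib(patch_tokens))  # 0.125
```

Because the matrix size grows quadratically in the number of tokens, reducing the token count by a factor of 256 shrinks the memory by a factor of $256^2$.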

  1. What task is the ViT as presented in the lecture usually trained on?

  1. What do positional embeddings for 2D image patches look like in the ViT architecture?

  1. In the Vanilla Transformer discussed in the last lectures, we used token embeddings. Why does the ViT not make use of an embedding layer?

  1. Is the ViT a decoder-only, encoder-only, or encoder-decoder Transformer?

  1. The classification token in ViT is borrowed from BERT, where it was originally introduced (see the paper). Briefly explain what role this token plays and why one might add it rather than, e.g., reading out one of the patch tokens. What is the alternative discussed in the lecture?

  1. Vision Transformers have displaced CNNs in many vision tasks. Why?

Contrastive self-supervised learning

  1. What is the goal of Contrastive Learning of Representations (CLR)? Explain in your own words.

  1. Let $\mathrm{score}$ be a similarity measure and $f_\theta$ a trained encoder model. Reformulate the goal of CLR as a mathematical formula using $f_\theta$ and $\mathrm{score}$.

  1. In order to train our encoder model $f_\theta$ to follow the CLR goal, we need to express the goal as a loss function that the model is optimized for. State the InfoNCE loss function introduced in the lecture and explain the components of the formula. In particular, explain each input variable of the formula. Can we interpret the loss function as a classification task?

  1. Which $\mathrm{score}$ function is used in the SimCLR framework? Explain the role of the linear projection used in the $\mathrm{score}$ computation.

  1. Complete the following Python function:

```python
import torch


def calc_simclr_affinity_matrix(encoded_pairs: torch.Tensor) -> torch.Tensor:
    """
    Construct the SimCLR affinity matrix.

    As the similarity measurement, use the cosine similarity directly.

    Args:
        encoded_pairs (torch.Tensor): A torch tensor of shape `(N, 2, D)`
            of `N` positive data pairs with encoding dimension `D`.

    Returns:
        torch.Tensor: The affinity matrix as a tensor of shape `(2N, 2N)`.
    """
    # Add your solution here
```

  1. Complete the following Python function:

```python
def calc_info_nce(affinity_matrix: torch.Tensor) -> float:
    """
    Calculate the InfoNCE loss from the SimCLR affinity matrix.

    Args:
        affinity_matrix (torch.Tensor): A torch tensor
            of shape `(2N, 2N)` with the SimCLR affinity
            matrix of the `N` positive data pairs.

    Returns:
        float: The calculated InfoNCE loss.
    """
    # Add your solution here
```

  1. Using your implementations of calc_simclr_affinity_matrix and calc_info_nce, compute the affinity matrix and the InfoNCE loss for the following example:

```python
import torch

# construct dummy data
data = torch.eye(3)
noise = torch.full_like(data, 0.1)
encoded_pairs = torch.stack([data, data + noise], dim=1)

affinity_matrix = calc_simclr_affinity_matrix(encoded_pairs)
loss = calc_info_nce(affinity_matrix)
# assert that the loss matches the reference value
torch.testing.assert_close(loss, 0.9665, rtol=1e-4, atol=1e-4)
```

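
One possible reading of the two functions above is sketched below. It is not necessarily the formulation used in the lecture. The names `simclr_affinity_sketch` and `info_nce_sketch` are hypothetical, and the sketch assumes a temperature of 1, a row order of all first views followed by all second views, self-similarities excluded from the denominator, and the mean taken over all $2N$ anchors. Each row is then treated as a classification problem whose correct class is the index of the positive partner.

```python
import torch
import torch.nn.functional as F

def simclr_affinity_sketch(encoded_pairs: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity matrix of shape (2N, 2N) from pairs of shape (N, 2, D)."""
    # Assumed row order: all first views, then all second views.
    flat = torch.cat([encoded_pairs[:, 0], encoded_pairs[:, 1]], dim=0)
    unit = F.normalize(flat, dim=1)  # scale each row to unit length
    return unit @ unit.T

def info_nce_sketch(affinity_matrix: torch.Tensor) -> float:
    """InfoNCE as row-wise cross-entropy (assumed temperature 1)."""
    two_n = affinity_matrix.shape[0]
    n = two_n // 2
    # Exclude self-similarity so an anchor is never its own candidate.
    self_mask = torch.eye(two_n, dtype=torch.bool)
    logits = affinity_matrix.masked_fill(self_mask, float("-inf"))
    # The positive of row i is row i + N, and vice versa.
    targets = torch.cat([torch.arange(n, two_n), torch.arange(0, n)])
    return F.cross_entropy(logits, targets).item()

data = torch.eye(3)
encoded_pairs = torch.stack([data, data + 0.1], dim=1)
affinity = simclr_affinity_sketch(encoded_pairs)
loss = info_nce_sketch(affinity)
print(affinity.shape, round(loss, 4))
```

Under these assumptions the sketch agrees with the reference value 0.9665 from the example above; a different temperature or normalization convention would change the number.
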
  1. How does the training batch size affect SimCLR? Explain the limitations of SimCLR that follow from the influence of the batch size.

  1. SimCLR is a typical example of self-supervised learning. The model is trained on a pretext objective but used for a different downstream task. Explain what the SimCLR pretext task is and name one possible downstream objective that SimCLR is used for.

  1. Consider the following approach:

    We train a model on a large set of images to classify images into 1000 image classes. We heavily augment the images in training (similar to the augmentations in SimCLR). The final layer of the model is a linear map followed by a softmax. We therefore take the activations before the final layer as learned image representations for downstream tasks.

    Compare the alternative approach to SimCLR. How are they different? What are the advantages and disadvantages of both approaches?

CLIP

  1. On what kind of data was CLIP trained? What are the risks and benefits of this data collection approach?

  1. CLIP encodes images and text into a shared embedding space and scores matches using cosine similarity.

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\, \|\mathbf{v}\|}$$

Given an image embedding $\mathbf{v}_{\text{img}} = (3,\ 0,\ 4,\ 0)$ and three candidate captions:

|   | Caption | Embedding |
|---|---------|-----------|
| A | “a cat sitting on a windowsill” | $(4,\ 0,\ 3,\ 0)$ |
| B | “a red sports car on a highway” | $(0,\ 2,\ 0,\ 0)$ |
| C | “a black dog in a dark room” | $(5,\ 5,\ 5,\ 5)$ |

Which caption would CLIP predict? Why does CLIP normalize the embeddings first?
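
The scores for the caption example above can be checked with a few lines of plain Python. This is a small sketch. The helper `cosine_similarity` and the dictionary `captions` are illustrative names, and the raw dot product is printed alongside to show what changes without normalization.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v_img = (3, 0, 4, 0)
captions = {"A": (4, 0, 3, 0), "B": (0, 2, 0, 0), "C": (5, 5, 5, 5)}

for label, emb in captions.items():
    raw = sum(a * b for a, b in zip(v_img, emb))
    print(label, "dot:", raw, "cosine:", round(cosine_similarity(v_img, emb), 2))
# A dot: 24 cosine: 0.96
# B dot: 0 cosine: 0.0
# C dot: 35 cosine: 0.7
```

The raw dot product ranks caption C highest because its embedding has the largest norm, while the cosine similarity ranks caption A highest.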

  1. Out-of-distribution behaviour compared to ImageNet-trained models: why does ViT generalise better?

  1. How does the CLIP architecture relate to a ViT? Is the ViT necessary for CLIP?

  1. What are some limitations of CLIP?

  1. How does CoCa extend the CLIP approach?

Homework

  1. Finish this exercise sheet and review the solutions! The sheets are not graded, but solving them helps a lot in preparing for the exam. We provide solutions to the exercises on Ilias. Write your own solutions down FIRST (ideally by hand!) and review our solutions only AFTER you have finished the exercises.

  2. Read the paper on DINO, Caron et al. 2021, and optionally on DINOv2, Oquab et al. 2023 as well.

  3. Read TabPFN, Hollmann et al. 2023, a tabular foundation model.