Transformers - Foundation Models

Tokenization

Transformers operate on the embeddings of discrete tokens. However, transforming input data into a stream of tokens is a non-trivial and necessary preprocessing step.

On text data, a naive approach would be to use full words as tokens. What problem do you see with this approach?

The algorithm typically used for text tokenization is byte-pair encoding. Briefly summarize how this works.
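To compare your summary against something concrete, here is a minimal sketch of the training loop of byte-pair encoding on a toy corpus. It is an illustration under simplifying assumptions, not the tokenizer of any particular model: it works on characters rather than bytes, ignores word-boundary markers, and breaks ties by first occurrence. The function name `learn_bpe_merges` and the toy corpus are made up for this example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE training)."""
    # Represent each distinct word as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the pair that was seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one new symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical toy corpus; repeated words count multiple times.
corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)
```

Applying the learned merges in order to a new word yields its token sequence; the number of merges controls the vocabulary size.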

Describe a tokenization procedure for image data. Hint: A widely used approach was introduced in the Vision Transformer (ViT)^[1].
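As a rough illustration of patch-based image tokenization, the sketch below cuts an image into non-overlapping square patches and flattens each patch into one vector. It assumes the image is a nested list of shape height x width x channels and that both sides are divisible by the patch size; the function name `image_to_patch_tokens` is hypothetical. In ViT each flattened patch is additionally passed through a learned linear projection, which this sketch leaves out.

```python
def image_to_patch_tokens(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    tokens = []
    # Walk over the patch grid row by row (raster order).
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    patch.extend(image[y][x])  # append all channel values of this pixel
            tokens.append(patch)
    return tokens


# Toy 4x4 image with 3 channels; pixel (y, x) holds the values [y, x, 0].
toy_image = [[[y, x, 0] for x in range(4)] for y in range(4)]
tokens = image_to_patch_tokens(toy_image, patch_size=2)
print(len(tokens), len(tokens[0]))
```

With a patch size of 2, the 4x4 toy image becomes a sequence of four tokens, each of length 2 * 2 * 3.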

Query, Key, and Values

Remark: If you are confused by the attention mechanism, we highly recommend the excellent YouTube video by 3blue1brown on this topic (youtu.be/eMlx5fFNoYc).

Using the terms “query”, “key”, and “value”, compare the core idea of transformers to a database lookup. What are similarities and what are differences?
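One way to make the comparison tangible is to put a hard lookup next to a soft one. The sketch below is only an analogy under stated assumptions: `hard_lookup` and `soft_lookup` are hypothetical helper names, the keys and values are made-up toy vectors, and the similarity is an unscaled dot product followed by a softmax.

```python
import math


def hard_lookup(table, query):
    """Database-style lookup: exactly one key matches, otherwise a KeyError is raised."""
    return table[query]


def soft_lookup(keys, values, query):
    """Attention-style lookup: every value contributes, weighted by query-key similarity."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    # Blend all values instead of returning a single stored entry.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


table = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
print(hard_lookup(table, "cat"))

keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
blended = soft_lookup(keys, values, query=[1.0, 0.0])
print(blended)
```

The hard lookup returns exactly one stored value, while the soft lookup returns a mixture that is dominated by the value whose key is most similar to the query.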

For the rest of this task we fix the following notation: let $Q$ be the query matrix, $K$ the key matrix, $V$ the value matrix, and $\mathbf{e}_i$ the embedding of token $i$. The keys and queries have dimension $d_k$ and the values have dimension $d_v$.

Write down the formula for computing the attention score $\alpha_{j, i}$ between the query vector $\mathbf{q}_j$ and the key vector $\mathbf{k}_i$. Explain the purpose of each step in the computation.

Write down the formula to compute the attention output $\mathbf{o}_j$ for a query $\mathbf{q}_j$ given the attention scores $\alpha_{j, i}$.

Given the embedding vector $\mathbf{e}_i$ of token $i$, how do we compute the corresponding query vector $\mathbf{q}_i$, key vector $\mathbf{k}_i$ and value vector $\mathbf{v}_i$? How can we derive the vectors from the matrices $Q$, $K$ and $V$?

What dimensions does each matrix $K$, $Q$, $V$ have?

In task 3, we derived the attention outputs for a single query vector. Write down the formula for the outputs in matrix form instead, i.e. for all queries in parallel.

Calculate the attention scores as well as the outputs for the following example:
$$Q = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},\quad K = \begin{pmatrix}1 & 1\\ -1 & 1\end{pmatrix},\quad V = \begin{pmatrix}1 & 5 & 25\\ 5 & 0 & 0\end{pmatrix}. \tag{1}$$
You may use the fact $\exp\bigl(1/\sqrt{2}\bigr) \approx 2$.
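After working through the example by hand, a short script such as the following can be used to check the result. It is a sketch that assumes the common convention in which each row of $Q$, $K$ and $V$ belongs to one token and the scores are scaled by $1/\sqrt{d_k}$; if the lecture uses a column convention, the matrices need to be transposed first. The script uses exact exponentials, so its numbers differ slightly from a hand calculation based on $\exp(1/\sqrt{2}) \approx 2$.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; rows of Q, K, V are assumed to be per-token vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Scaled dot product of this query with every key, then normalise.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is the weighted sum of the value rows.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return weights, outputs


Q = [[1, 0], [0, 1]]
K = [[1, 1], [-1, 1]]
V = [[1, 5, 25], [5, 0, 0]]

weights, outputs = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
for row in outputs:
    print([round(o, 3) for o in row])
```

Here `weights[j][i]` plays the role of $\alpha_{j, i}$ and `outputs[j]` the role of $\mathbf{o}_j$.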

Explain multi-head attention and why it is helpful.
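The mechanics of multi-head attention can be sketched as follows. This is a simplified illustration, not a full layer: it assumes that $Q$, $K$ and $V$ have already been computed, splits their feature columns evenly into heads, runs scaled dot-product attention per head, and concatenates the results. The learned per-head projections and the final output projection of a real transformer layer are omitted, and the helper names `split_heads` and `multi_head_attention` are hypothetical.

```python
import math


def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return outputs


def split_heads(X, num_heads):
    """Cut each row of X into `num_heads` equal column slices, one matrix per head."""
    width = len(X[0]) // num_heads
    return [[row[h * width:(h + 1) * width] for row in X] for h in range(num_heads)]


def multi_head_attention(Q, K, V, num_heads):
    """Run attention independently per head and concatenate the head outputs."""
    head_outputs = [
        attention(q_h, k_h, v_h)
        for q_h, k_h, v_h in zip(
            split_heads(Q, num_heads), split_heads(K, num_heads), split_heads(V, num_heads)
        )
    ]
    # Concatenate along the feature axis: token j gets [head 0 output, head 1 output, ...].
    return [sum((head[j] for head in head_outputs), []) for j in range(len(Q))]


# Two tokens with feature width 4, split into 2 heads of width 2 (toy values).
Q = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
K = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
V = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
out = multi_head_attention(Q, K, V, num_heads=2)
print(out)
```

Each head computes its own attention weights from its own slice of the features, so different heads can attend to different tokens for the same query.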

What is the difference between cross-attention and self-attention?

Positional encoding

What is positional encoding and why do we need it?

Calculate the sinusoidal positional encoding $\mathbf{z}_i \in \mathbb{R}^6$ of position $i$ for $d_k = 6$.
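To check a hand-derived encoding numerically, the following sketch may help. It assumes the convention of the original transformer paper: sine at even indices, cosine at odd indices, and wavelengths that grow geometrically with base 10000. If the lecture uses a different layout (for example all sines first, then all cosines), the entries are the same but ordered differently. The function name `sinusoidal_encoding` is made up for this example.

```python
import math


def sinusoidal_encoding(position, dim, base=10000.0):
    """Sinusoidal encoding of one position; sine at even indices, cosine at odd indices."""
    encoding = []
    for k in range(dim // 2):
        # Frequency pair k uses the angle position / base^(2k / dim).
        angle = position / base ** (2 * k / dim)
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


for i in range(3):
    print(i, [round(z, 4) for z in sinusoidal_encoding(i, dim=6)])
```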

Transformer components

Compare the transformer from “The Annotated Transformer”^[2] with “nanoChat”^[3]. Name three differences and briefly explain each one.

Homework

Finish this exercise sheet and review the solutions! The sheets are not graded, but solving them helps a lot in preparing for the exam. We provide solutions to the exercises on Ilias. Write your own solutions down FIRST (ideally by hand!) and review our solutions only AFTER you have finished the exercises.

Footnotes

[1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al. arxiv.org/pdf/2010.11929
[2] nlp.seas.harvard.edu/annotated-transformer/
[3] github.com/karpathy/nanochat/discussions/481