Transformers

Tokenization

Transformers operate on the embeddings of discrete tokens. However, transforming input data into a stream of tokens is a non-trivial and necessary preprocessing step.

  1. On text data, a naive approach would be to use full words as tokens. What problem do you see with this approach?

  1. The algorithm typically used for text tokenization is byte-pair encoding. Briefly summarize how this works.

  1. Describe a tokenization procedure for image data. Hint: A widely used approach was introduced in the Vision Transformer (ViT)[1].
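For checking a written summary of byte-pair encoding afterwards, the merge-learning loop can be sketched in a few lines. This is a toy version under stated assumptions: it works on characters rather than bytes, splits the corpus on whitespace, and never merges across word boundaries. The function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the example corpus are made up for illustration and are not taken from any library.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters (toy assumption).
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

Each iteration adds one new symbol to the vocabulary, so the number of merges directly controls the vocabulary size.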

Query, Key, and Values

Remark: If you are confused by the attention mechanism, we highly recommend the excellent YouTube video by 3Blue1Brown on this topic (youtu.be/eMlx5fFNoYc).

  1. Using the terms “query”, “key”, and “value”, compare the core idea of transformers to a database lookup. What are similarities and what are differences?

For the rest of this task we fix the following notation: let $Q$ be the query matrix, $K$ the key matrix, $V$ the value matrix, and $\mathbf{e}_i$ the embedding of token $i$. The keys and queries have dimension $d_k$ and the values have dimension $d_v$.

  1. Write down the formula for computing the attention score $\alpha_{j, i}$ between the query vector $\mathbf{q}_j$ and the key vector $\mathbf{k}_i$. Explain the purpose of each step in the computation.

  1. Write down the formula to compute the attention output $\mathbf{o}_j$ for a query $\mathbf{q}_j$ given the attention scores $\alpha_{j, i}$.

  1. Given the embedding vector $\mathbf{e}_i$ of token $i$, how do we compute the corresponding query vector $\mathbf{q}_i$, key vector $\mathbf{k}_i$ and value vector $\mathbf{v}_i$? How can we derive the vectors from the matrices $Q$, $K$ and $V$?

  1. What dimensions does each matrix $K$, $Q$, $V$ have?

  1. In task 3, we derived the attention outputs for a single query vector. Write down the formula for the outputs in matrix form instead, i.e. for all queries in parallel.

  1. Calculate the attention scores as well as the outputs for the following example:

    $$Q = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},\quad K = \begin{pmatrix}1 & 1\\ -1 & 1\end{pmatrix},\quad V = \begin{pmatrix}1 & 5 & 25\\ 5 & 0 & 0\end{pmatrix}.$$

    You may use the fact that $\exp\bigl(1/\sqrt{2}\bigr) \approx 2$.

  1. Explain multi-head attention and why it is helpful.

  1. What is the difference between cross-attention and self-attention?
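To check the hand calculation for the numerical example with $Q$, $K$ and $V$ above, a small script can help once you have written down your own result. The sketch below assumes standard scaled dot-product attention with a softmax over the keys, and it assumes that each row of $Q$, $K$ and $V$ holds one query, key and value vector respectively. The helper names `softmax` and `attention` are chosen for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; rows are queries, keys and values."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted sum of the value rows.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return weights, outputs


Q = [[1, 0], [0, 1]]
K = [[1, 1], [-1, 1]]
V = [[1, 5, 25], [5, 0, 0]]

weights, outputs = attention(Q, K, V)
print(weights)
print(outputs)
```

The script uses the exact value of $\exp(1/\sqrt{2})$, so its numbers differ slightly from a hand calculation that uses the approximation $\exp(1/\sqrt{2}) \approx 2$.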

Positional encoding

  1. What is positional encoding and why do we need it?

  1. Calculate the sinusoidal positional encoding $\mathbf{z}_i \in \mathbb{R}^6$ of position $i$ for $d_k = 6$.
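A hand-derived encoding can be compared against a short script. The sketch below assumes the interleaved sine/cosine convention from "Attention Is All You Need", with base 10000 and an even dimension. If the lecture uses a different layout (for example all sines first, then all cosines), the entries appear in a different order. The function name `sinusoidal_encoding` is chosen for illustration.

```python
import math


def sinusoidal_encoding(pos, d, base=10000.0):
    """Sinusoidal encoding of position `pos` in dimension `d` (d even).

    Entry 2k is sin(pos / base^(2k/d)), entry 2k+1 is the matching cosine.
    """
    z = []
    for k in range(d // 2):
        angle = pos / base ** (2 * k / d)
        z.extend([math.sin(angle), math.cos(angle)])
    return z


for pos in range(3):
    print(pos, [round(x, 4) for x in sinusoidal_encoding(pos, d=6)])
```

Each sine/cosine pair uses its own frequency, which decreases as the index $k$ grows.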

Transformer components

  1. Compare the transformer from “The Annotated Transformer”[2] with “nanoChat”[3]. Name three differences and explain each of them.

Homework

  1. Finish this exercise sheet and review the solutions! The sheets are not graded, but solving them helps a lot when preparing for the exam. We provide solutions to the exercises on Ilias. Write your own solutions down FIRST (ideally by hand!) and review our solutions only AFTER you have finished the exercises.

Footnotes
  1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al. arxiv.org/pdf/2010.11929