Self-Attention as Message Passing¶
Let’s assume that we have three tokens with scalar features $x_1$, $x_2$, and $x_3$.
Note: This is a toy model. In practice, token features are high-dimensional vectors.
For simplicity, we define $q_i = k_i = v_i = x_i$ (identity query, key, and value maps), therefore the attention score and weight between tokens $i$ and $j$ are
$$e_{ij} = q_i k_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}.$$
And the self-attention update for token 1:
$$z_1 = \sum_{j=1}^{3} \alpha_{1j} v_j = \alpha_{11} v_1 + \alpha_{12} v_2 + \alpha_{13} v_3.$$
Before doing any math, answer:
Which token do you expect token 1 to pay most attention to?
Which token should receive the least attention from others?
Now, compute and answer:
All $e_{1j}$ and then $\alpha_{1j}$. Describe what $\alpha_{1j}$ represents in this example. (A short numerical check follows these questions.)
Are $\alpha_{12}$ and $\alpha_{21}$ equal? What does this say about attention as an “edge weight”?
What happens if all $\alpha_{ij}$ are equal? What kind of GNN is this equivalent to?
What changes in the self-attention mechanism when moving from a Transformer to a GAT?
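To check your hand computation of the attention weights, here is a minimal numeric sketch. The feature values x = [1.0, 2.0, 3.0] are placeholders chosen for illustration (substitute the values from the setup above); it assumes the toy setting $q_i = k_i = v_i = x_i$.
import numpy as np
# placeholder scalar features for the three tokens -- not necessarily the exercise values
x = np.array([1.0, 2.0, 3.0])
# with q_i = k_i = v_i = x_i, the raw scores are e_ij = q_i * k_j
e = np.outer(x, x)
# row-wise softmax turns scores into attention weights alpha_ij
alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
# self-attention update: each token aggregates all values, weighted by alpha
z = alpha @ x
print("alpha =\n", np.round(alpha, 3))
print("alpha_12 vs alpha_21:", alpha[0, 1], alpha[1, 0])  # generally not equal
print("z_1 =", z[0])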
Graph Laplacian Magic¶
In previous lectures, we used the Laplacian matrix $L$, but what does it actually mean?
Given an undirected graph with adjacency matrix $A$ and degree matrix $D$, $L$ is defined as:
$$L = D - A$$
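As a quick sanity check on this definition, here is a minimal sketch (the 4-cycle example graph is arbitrary) that builds $L = D - A$ by hand and compares it against NetworkX’s built-in Laplacian:
import networkx as nx
import numpy as np
# an arbitrary small example graph (a 4-cycle), just to illustrate the definition
G = nx.cycle_graph(4)
A = nx.to_numpy_array(G)    # adjacency matrix
D = np.diag(A.sum(axis=1))  # degree matrix (node degrees on the diagonal)
L = D - A                   # Laplacian by definition
print(L)
print(np.allclose(L, nx.laplacian_matrix(G).toarray()))  # should print True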
Consider the following path graph of 3 nodes:
(figure: a path graph with nodes 1, 2, 3 and edges 1-2 and 2-3)
What is the Laplacian matrix for this graph?
Verify that the following are eigenvectors of $L$, and find the corresponding eigenvalues (a numerical check appears after these questions):
$$\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \qquad \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}, \qquad \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix}$$
Refresher: $L v = \lambda v$, where $v$ is the eigenvector and $\lambda$ is the eigenvalue.
Order the eigenvectors from slowly varying to rapidly varying (i.e., smooth vs oscillatory) across the graph. Then, look at the corresponding eigenvalues. What do you observe about how the eigenvalues are ordered relative to the variation speed?
Based on your finding in (3), which eigenvectors capture the global structure and which eigenvectors capture the local variation?
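If you want to double-check your hand calculations, here is a minimal sketch that builds the Laplacian of the 3-node path graph and verifies the eigenvalue equation numerically. The Rayleigh quotient it prints, $v^\top L v / v^\top v$, equals the eigenvalue and measures how rapidly the eigenvector varies across edges.
import numpy as np
# Laplacian of the 3-node path graph (node degrees 1, 2, 1)
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])
# candidate eigenvectors to verify
candidates = [np.array([1., 1., 1.]),
              np.array([1., 0., -1.]),
              np.array([1., -2., 1.])]
for v in candidates:
    Lv = L @ v
    lam = (v @ Lv) / (v @ v)  # Rayleigh quotient; equals the eigenvalue if v is an eigenvector
    print(f"v = {v}, eigenvalue = {lam:.1f}, L v = lambda v: {np.allclose(Lv, lam * v)}")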
Now let’s verify our findings on a larger graph. We will revisit the Karate Club graph and plot what the eigenvectors highlight within the graph.
The following code snippet computes the eigenvectors and sorts them by their corresponding eigenvalues. Plot different eigenvectors across the graph to see what information they encode (local vs. global). Does it match your finding in (4)?
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
# load the Karate Club graph
G = nx.karate_club_graph()
# compute the Laplacian matrix
L = nx.laplacian_matrix(G).toarray()
# compute eigenvalues and eigenvectors
eigvals, eigvecs = np.linalg.eigh(L)
# sort eigenvalues and eigenvectors
idx = np.argsort(eigvals)
eigvals = eigvals[idx]
eigvecs = eigvecs[:, idx]
# plot the eigenvectors
pos = nx.spring_layout(G, seed=42)
for k in [1]:
    plt.figure()
    nx.draw(G, pos, node_color=eigvecs[:, k], cmap='coolwarm', with_labels=False)
    plt.title(f"Laplacian eigenvector {k}")
    plt.show()

Eigenvector Sign Ambiguity¶
Recall that $L v = \lambda v$, where $v$ is the eigenvector and $\lambda$ is the eigenvalue. This also means:
$$L(-v) = \lambda (-v)$$
so if $v$ is an eigenvector, $-v$ is an equally valid eigenvector with the same eigenvalue.
Why is this a problem when we want to use the eigenvectors as positional encodings?
Would an attention-based model automatically be invariant to this sign ambiguity? Why or why not?
If we need sign invariance, why don’t we just take $|v|$ (the element-wise absolute value) before using it as a positional encoding? (A small counterexample sketch appears at the end of this section.)
How does SignNet (paper link) solve this problem? You can play with the following code snippet, which implements a simplified version of SignNet.
import torch
import torch.nn as nn
class SimpleSignNet(nn.Module):
    def __init__(self, hidden_dim=16):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, v):
        # v: (n,) eigenvector
        v = v.unsqueeze(-1)       # (n, 1)
        out_pos = self.phi(v)     # φ(v)
        out_neg = self.phi(-v)    # φ(-v)
        return out_pos + out_neg  # sign-invariant
# toy eigenvector
v = torch.tensor([1.0, -2.0, 0.5])
model = SimpleSignNet()
z1 = model(v)
z2 = model(-v)
print(torch.allclose(z1, z2))
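Back to the question about taking $|v|$: the element-wise absolute value is sign-invariant, but it throws away information. The following minimal sketch uses an arbitrary two-cluster barbell graph; the Fiedler vector (eigenvector of the second-smallest eigenvalue) separates the two clusters by its sign, and taking $|v|$ erases exactly that separation.
import networkx as nx
import numpy as np
# two 4-cliques joined by a single edge (an arbitrary example graph)
G = nx.barbell_graph(4, 0)
L = nx.laplacian_matrix(G).toarray()
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
print("Fiedler vector:", np.round(fiedler, 2))
print("signs (cluster membership):", np.sign(fiedler))
print("|v| (membership is lost):  ", np.round(np.abs(fiedler), 2))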