
Knowledge Graphs

Warm-up: What is actually learned?

In KG embedding models like TransE:

  1. What are the trainable parameters?

  2. Suppose a new entity $e_{\text{new}}$ is added to the KG after training. Can TransE produce an embedding for it without retraining? Why or why not?

  3. Suppose you observe a single triple $(h, r, e_{\text{new}})$. Could you heuristically assign an embedding to $e_{\text{new}}$? What are the limitations? Hint: think of the translation $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$.

  4. Given a trained TransE model and a query $(h, r, ?)$, inference is performed by computing the query embedding $\mathbf{q} = \mathbf{h} + \mathbf{r}$. How can we get a semantic understanding of what the embedding vector $\mathbf{q}$ corresponds to? (A small decoding sketch follows this list.)
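
For question 4, one common way to "read" a query vector is nearest-neighbour search over the entity embeddings. Below is a minimal sketch; entity_emb and q are random placeholder tensors standing in for a trained model:

import torch

# hypothetical stand-ins: one embedding row per entity, plus a query vector
entity_emb = torch.randn(7, 2)   # |E| = 7 entities, k = 2 dimensions
q = torch.randn(2)               # q = h + r for some query (h, r, ?)

# rank entities by L2 distance to q; the nearest entities are the
# model's candidate answers, i.e. one way to interpret what q "means"
dist = torch.norm(entity_emb - q, p=2, dim=-1)
ranking = torch.argsort(dist)    # entity ids, most to least plausible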

TransE Mechanics

Consider the following 2D embeddings:

  • $\mathbf{h} = (1, 0)$

  • $\mathbf{r} = (1, 1)$

  • $\mathbf{t_1} = (2, 1)$

  • $\mathbf{t_2} = (2, 2)$

  1. Compute the TransE score $f_r(h,t) = -\lVert h + r - t \rVert_2$ for $t_1$ and $t_2$. Which triple is more plausible? (You can check your result against the snippet after this list.)

  2. Assume that relation $r$ is symmetric (e.g., “siblingOf”), meaning: $(h,r,t_1)$ and $(t_1,r,h)$ are both true. Write down the TransE equations implied by symmetry. What’s the issue here?

  3. Assume that relation $r$ is 1-to-N (e.g., “studentOf”): $(h,r,t_1)$ and $(h,r,t_2)$ are true with $t_1 \neq t_2$. Write the TransE equations for both triples. What geometric constraint does this impose on $t_1$ and $t_2$?

  4. Suppose we increase the embedding dimension from $k=2$ to $k=100$. Does this resolve the issues in (2) and (3)? Justify mathematically.
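
You can verify your hand computation for (1) numerically. A minimal sketch using the vectors above:

import torch

h  = torch.tensor([1., 0.])
r  = torch.tensor([1., 1.])
t1 = torch.tensor([2., 1.])
t2 = torch.tensor([2., 2.])

# TransE score f_r(h, t) = -||h + r - t||_2; the larger score wins
for name, t in [("t1", t1), ("t2", t2)]:
    print(name, (-(h + r - t).norm(p=2)).item())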

Path Queries

Consider a knowledge graph with the following entity types:

  • Person

  • Company

  • City

  • University

  • Country

and the following directed relations:

  • worksAt(Person → Company)

  • locatedIn(Company → City)

  • studiedAt(Person → University)

  • locatedIn(University → City)

  • basedIn(City → Country)

  • foundedBy(Company → Person)

  1. For each of the following natural-language queries, write the formal path query.

    • $Q_1$: Which company does Alice work at?

    • $Q_2$: In which city is the company where Alice works located?

    • $Q_3$: In which country is the company where Alice works based?

    • $Q_4$: Which people work at companies located in Berlin?

    • $Q_5$: Which people studied at universities located in the same city as the company where Alice works?

    • $Q_6$: Which people studied at universities located in cities where companies founded by Bob are based?

    • $Q_7$: Which people both studied at universities in Berlin and work at companies based in Germany?

  2. Which of the queries in (1) can TransE represent?

  3. What if we simply used $-\mathbf{r}$ to represent the inverse of relation $\mathbf{r}$? Would that work?
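
As a reminder of how TransE handles such compositions: a path query is answered purely in vector space by adding the relation vectors along the path. For $Q_2$, for instance, the query embedding is $\mathbf{q} = \mathbf{v}_{\text{Alice}} + \mathbf{r}_{\text{worksAt}} + \mathbf{r}_{\text{locatedIn}}$, and the answer is the city whose embedding lies nearest to $\mathbf{q}$. This is exactly the "vector-space path query" tested in the programming exercise below.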

Programming: TransE on a Toy KG

In this exercise, we’ll use the PyG implementation of TransE on a toy KG, which is given in the following code snippet:

import torch
from torch_geometric.data import Data
from torch_geometric.nn.kge import TransE

# toy KG
entities = [
    "Alice", "Bob", "Charlie",
    "CompanyA", "CompanyB",
    "Berlin", "Munich"
]
relations = ["worksAt", "locatedIn"]
triples = [
    ("Alice",   "worksAt",   "CompanyA"),
    ("Bob",     "worksAt",   "CompanyA"),
    ("Charlie", "worksAt",   "CompanyB"),
    ("CompanyA","locatedIn", "Berlin"),
    ("CompanyB","locatedIn", "Munich"),
]

# helper mappings
ent2id = {e: i for i, e in enumerate(entities)}
rel2id = {r: i for i, r in enumerate(relations)}

head = torch.tensor([ent2id[h] for h, _, _ in triples], dtype=torch.long)
rel  = torch.tensor([rel2id[r] for _, r, _ in triples], dtype=torch.long)
tail = torch.tensor([ent2id[t] for _, _, t in triples], dtype=torch.long)

# store as PyG data
edge_index = torch.stack([head, tail], dim=0)
edge_type = rel
data = Data(num_nodes=len(entities), edge_index=edge_index, edge_type=edge_type)
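
A quick, optional sanity check of the resulting tensors (the shapes assume the five triples above):

print(edge_index.shape)  # torch.Size([2, 5]) -- one column per triple
print(edge_type.shape)   # torch.Size([5])    -- one relation id per triple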

In the following code snippet, we define the model using the TransE implementation in PyG. Complete the train function, using all the triples given in the KG; one possible completion is sketched after the snippet.

# model setup
model = TransE(
    num_nodes=data.num_nodes,
    num_relations=len(relations),
    hidden_channels=2,
    margin=1.0,
    p_norm=2.0, # use L2 distance
    sparse=False,
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# training loop
def train(steps=2000, seed=42):
    torch.manual_seed(seed)
    model.train()

    # === YOUR CODE HERE ===

    # ======================

train()
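
If you want to compare against a reference, here is one possible completion. It is a sketch that relies on PyG's built-in model.loss, which scores the positive triples against randomly corrupted negatives under a margin ranking loss:

def train(steps=2000, seed=42):
    torch.manual_seed(seed)
    model.train()
    for step in range(steps):
        optimizer.zero_grad()
        # all five triples as positives; negatives are sampled inside loss()
        loss = model.loss(head, rel, tail)
        loss.backward()
        optimizer.step()
        if (step + 1) % 500 == 0:
            print(f"step {step + 1:4d} | loss {loss.item():.4f}")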

Now, using the following inference functions, test the given cases (a usage sketch follows the helper definitions):

  • Rank Company tails for (Alice, worksAt, ?)

  • Rank City tails for (CompanyA, locatedIn, ?)

  • Vector-space path query: q = Alice + worksAt + locatedIn, rank cities by distance

  • Heuristic inverse: q = CompanyA - worksAt, rank persons by distance

# inference helpers
@torch.no_grad()
def rank_tails(head_name: str, rel_name: str, candidate_tail_names=None, topk=5):
    """
    Ranks candidate tails for query (head, rel, ?) by TransE score.
    Higher score = more plausible.
    """
    model.eval()
    h = torch.tensor([ent2id[head_name]], dtype=torch.long)
    r = torch.tensor([rel2id[rel_name]], dtype=torch.long)

    if candidate_tail_names is None:
        cand_ids = torch.arange(len(entities), dtype=torch.long)
        cand_names = entities
    else:
        cand_ids = torch.tensor([ent2id[x] for x in candidate_tail_names], dtype=torch.long)
        cand_names = candidate_tail_names

    # broadcast h,r to match candidates
    h_rep = h.repeat(cand_ids.numel())
    r_rep = r.repeat(cand_ids.numel())

    scores = model(h_rep, r_rep, cand_ids)  # forward() returns score for triplets
    vals, pos = torch.topk(scores, k=min(topk, scores.numel()), largest=True)

    return [(cand_names[i], float(v)) for i, v in zip(pos.tolist(), vals.tolist())]

# for vector-space traversal (q = h + r1 + r2)
@torch.no_grad()
def get_entity_emb(name: str) -> torch.Tensor:
    return model.node_emb.weight[ent2id[name]]

@torch.no_grad()
def get_rel_emb(name: str) -> torch.Tensor:
    return model.rel_emb.weight[rel2id[name]]

@torch.no_grad()
def rank_by_query_vector(q_vec: torch.Tensor, candidate_names):
    cand = torch.stack([get_entity_emb(n) for n in candidate_names], dim=0)
    # TransE score is -||q - t||, so ranking by smallest distance is equivalent
    d = torch.norm(cand - q_vec.unsqueeze(0), p=2, dim=-1)
    order = torch.argsort(d, descending=False)
    return [(candidate_names[i], float(d[i])) for i in order.tolist()]
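
Putting the helpers to work on the four test cases; the candidate lists simply restrict the ranking to the appropriately typed entities of the toy KG:

# (Alice, worksAt, ?) ranked over companies
print(rank_tails("Alice", "worksAt", candidate_tail_names=["CompanyA", "CompanyB"]))

# (CompanyA, locatedIn, ?) ranked over cities
print(rank_tails("CompanyA", "locatedIn", candidate_tail_names=["Berlin", "Munich"]))

# vector-space path query: q = Alice + worksAt + locatedIn
q_path = get_entity_emb("Alice") + get_rel_emb("worksAt") + get_rel_emb("locatedIn")
print(rank_by_query_vector(q_path, ["Berlin", "Munich"]))

# heuristic inverse: q = CompanyA - worksAt
q_inv = get_entity_emb("CompanyA") - get_rel_emb("worksAt")
print(rank_by_query_vector(q_inv, ["Alice", "Bob", "Charlie"]))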