Transformer Self-Attention

[Interactive demo: a small transformer is trained live; a status bar reports epoch, loss, token count (6), and head count (4), with control panels for input, architecture, training, and display. An attention heatmap can be shown per head and layer. The four heads are color-coded (Head 1 cyan, Head 2 magenta, Head 3 gold, Head 4 mint); arc brightness encodes attention weight and token glow encodes embedding magnitude.]
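The per-head heatmap shown in the demo is the attention-weight matrix softmax(QKᵀ/√d_k). A minimal NumPy sketch of how such weights could be computed is below; the dimensions match the demo's 6 tokens and 4 heads, but the embeddings and projection matrices are random placeholders, not the demo's trained parameters:

```python
import numpy as np

def attention_weights(q, k):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d_k))."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, tokens, tokens)
    scores -= scores.max(axis=-1, keepdims=True)      # softmax stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)          # rows sum to 1

rng = np.random.default_rng(0)
n_heads, n_tokens, d_model, d_head = 4, 6, 32, 8     # 4 heads, 6 tokens as in the demo
x = rng.normal(size=(n_tokens, d_model))             # placeholder token embeddings
wq = rng.normal(size=(n_heads, d_model, d_head)) / np.sqrt(d_model)
wk = rng.normal(size=(n_heads, d_model, d_head)) / np.sqrt(d_model)

q = np.einsum('td,hdk->htk', x, wq)                  # per-head queries
k = np.einsum('td,hdk->htk', x, wk)                  # per-head keys
attn = attention_weights(q, k)                       # (4, 6, 6): one heatmap per head
```

Each `attn[h]` is a 6×6 stochastic matrix: row i gives how much token i attends to every other token under head h, which is what the heatmap colors and arc brightness visualize.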