Rotary Positional Embeddings
When a language model processes text, it must determine how much the context of one token helps explain the meaning of another. This depends on both content (*dog* should attend more strongly to *puppy* than to *t-shirt*) and relative position - the word *cat* appearing in the same sentence has far more semantic influence than the same word, *cat*, appearing chapters away. Without positional information, Transformers are permutation-invariant and cannot distinguish between these cases.
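The claim about permutation invariance is easy to check directly. The minimal NumPy sketch below (the function names, the toy dimensions, and the random weights are all illustrative, not from any particular model) implements a single attention head with no positional signal and shows that shuffling the input tokens merely shuffles the outputs in the same way - the mechanism itself has no notion of order.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single attention head with no positional information added to X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

d = 8                                # toy embedding dimension
X = rng.normal(size=(5, d))          # 5 token embeddings, no positions
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)

perm = rng.permutation(5)            # reorder the tokens
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the inputs just permutes the outputs: attention alone
# cannot tell which ordering the tokens arrived in.
print(np.allclose(out_perm, out[perm]))  # True
```

Because each row of attention weights only depends on *which* tokens are present, not *where* they are, some positional signal has to be injected explicitly - which is exactly the job of rotary positional embeddings.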