Understanding Positional Encoding in Transformers

Published 10 May 2021 · 17 min read

What is positional encoding?

As I’ve explained in “Introduction to Attention Mechanism”, attention doesn’t care about the position of the inputs. To fix that problem we have to introduce something called Positional Encoding. This encoding is covered in the original “Attention Is All You Need” paper and it’s added to every input (not concatenated but added).

Transformer architecture
Figure 1: Original Transformer architecture, Source “Attention Is All You Need”, 2017

The paper mainly uses fixed (non-trainable) positional encoding and that’s what I’m going to explain. Nowadays encodings are often trained along with the model, but that requires another article. The formula for positional encoding is given in section 3.5 of the paper. The authors use sin and cos functions to calculate the values for every input vector.

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
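The two formulas from the paper can be turned into a few lines of code. This is only a minimal sketch in plain Python, not the code used for the figures in this article; the function name `positional_encoding` and its parameters are my own choices for illustration.

```python
import math


def positional_encoding(pos, d_model):
    """Return the sinusoidal encoding of position `pos` as a list of d_model floats."""
    pe = []
    for index in range(d_model):
        # Indexes 2i and 2i+1 share the same exponent 2i / d_model.
        two_i = index - (index % 2)
        angle = pos / (10000 ** (two_i / d_model))
        # Even indexes use sin, odd indexes use cos.
        pe.append(math.sin(angle) if index % 2 == 0 else math.cos(angle))
    return pe


if __name__ == "__main__":
    # First six values of the encoding for positions 0..3 (hypothetical d_model = 20).
    for pos in range(4):
        print(pos, [round(v, 3) for v in positional_encoding(pos, d_model=20)[:6]])
```

The resulting vector has the same length as the input vector, which is what makes it possible to add the two together.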

As you can see, these values depend on $d_{model}$ (input dimension) and $i$ (index within the position vector). The original paper operates on 512-dimensional vectors, but for simplicity I’m going to use $d_{model} = 50$ or $d_{model} = 20$. The authors also commented on why they chose this kind of function:

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
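The "linear function" in that quote can be checked numerically. By the angle addition identities, every (sin, cos) pair of $PE_{pos+k}$ is a rotation of the same pair in $PE_{pos}$ by an angle that depends only on $k$ and the pair's frequency. The sketch below assumes an even $d_{model}$; the helper names `encode` and `shift` are hypothetical and not taken from the paper.

```python
import math


def encode(pos, d_model):
    """Sinusoidal encoding of `pos`: sin at even indexes, cos at odd indexes."""
    return [
        math.sin(pos / 10000 ** ((j - j % 2) / d_model)) if j % 2 == 0
        else math.cos(pos / 10000 ** ((j - j % 2) / d_model))
        for j in range(d_model)
    ]


def shift(pe, k, d_model):
    """Compute the encoding of pos + k from the encoding of pos alone.

    Each (sin, cos) pair is rotated by the angle k * w, where w is the
    pair's frequency. The rotation depends on k but not on pos.
    """
    shifted = []
    for j in range(0, d_model, 2):
        w = 1 / 10000 ** (j / d_model)
        s, c = pe[j], pe[j + 1]
        shifted.append(s * math.cos(k * w) + c * math.sin(k * w))  # sin(w * (pos + k))
        shifted.append(c * math.cos(k * w) - s * math.sin(k * w))  # cos(w * (pos + k))
    return shifted


if __name__ == "__main__":
    d_model, pos, k = 20, 7, 5
    direct = encode(pos + k, d_model)
    linear = shift(encode(pos, d_model), k, d_model)
    # The largest difference should be at floating point noise level.
    print(max(abs(a - b) for a, b in zip(direct, linear)))
```

Note that `shift` never sees `pos` itself, only the encoded vector, which is the property the authors hoped the model could exploit.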

Positional encoding visualization (vectors for positions pos 0 to pos 3)
Figure 2: Positional Encoding visualization, Designed based on “Attention Is All You Need”, NeurIPS 2017 Paper

Values

We calculate the value at each index of the vector using the formula for that index. It’s worth noticing that the $\cos$ function at index $2i+1$ uses the same even exponent $2i$ as the $\sin$ function at index $2i$, so to calculate values for the 0th and 1st indexes we’re going to use $\sin(\frac{pos}{10000^{0/50}}) = \sin(pos)$ and $\cos(\frac{pos}{10000^{0/50}}) = \cos(pos)$. That’s why values for the 0th and 1st indexes depend only on $pos$ instead of both $pos$ and $d_{model}$. This changes from the 2nd index onward because the exponent is no longer equal to 0, so the whole divisor $10000^{2i/50}$ is larger than 1.
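To make the divisor concrete, here is a small sketch that prints it for the first few indexes with $d_{model} = 50$. The helper name `divisor` is hypothetical.

```python
def divisor(index, d_model):
    """Divisor 10000^(2i / d_model) used at a given index of the vector."""
    two_i = index - (index % 2)  # indexes 2i and 2i+1 share the same exponent
    return 10000 ** (two_i / d_model)


if __name__ == "__main__":
    # Indexes 0 and 1 divide pos by 1; later pairs divide by something larger.
    for index in range(6):
        print(index, round(divisor(index, d_model=50), 3))
```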

Dimension dependency

If you switch to the second step, you can compare how the $PE$ values change depending on $d_{model}$.

PE values with different embedding dimension
Figure 3: PE values comparison with different dimensions (d), Source: Positional encoding visualization

The period of the first two indexes does not change with $d_{model}$, but the period of further indexes (2nd and greater) widens as $d_{model}$ decreases. This might be obvious, but it’s still good to see the difference.
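The same comparison can be made without the figure. The sketch below evaluates single encoding values for $d_{model} = 50$ and $d_{model} = 20$ at one arbitrarily chosen position; `pe_value` is a hypothetical helper.

```python
import math


def pe_value(pos, index, d_model):
    """Single encoding value at `index` for position `pos`."""
    angle = pos / 10000 ** ((index - index % 2) / d_model)
    return math.sin(angle) if index % 2 == 0 else math.cos(angle)


if __name__ == "__main__":
    pos = 5  # arbitrary position picked for illustration
    for index in (0, 1, 2, 3, 10, 11):
        print(index,
              round(pe_value(pos, index, 50), 4),
              round(pe_value(pos, index, 20), 4))
```

The first two rows are identical for both dimensions, and the remaining rows differ because the smaller $d_{model}$ gives a larger divisor at the same index.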

Function periods

When we plot $PE$ values for the first 20 $pos$ vectors we get a result like this:

Positional encodings for 20 positions
Figure 4: Positional Encoding values for first 20 positions, Generated with the use of Tensorflow - Positional encoding code

This plot is generated from one of Tensorflow’s Tutorials and you can run it with the help of Google Colab directly from their website. As you can see, lower dimensions of the position vector have a very short wavelength (distance between identical points). The function at index 6 ($2i = 6$ in the formula) has a wavelength of around 19 ($2\pi \cdot 10^{12/25} \approx 19$ for $d_{model} = 50$).
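The wavelength follows directly from the formula: $\sin(pos / c)$ repeats every $2\pi c$ positions, where $c$ is the divisor $10000^{2i/d_{model}}$. A minimal sketch (the helper name `wavelength` is my own) for $d_{model} = 50$:

```python
import math


def wavelength(index, d_model):
    """Number of positions after which the value at `index` repeats."""
    return 2 * math.pi * 10000 ** ((index - index % 2) / d_model)


if __name__ == "__main__":
    for index in (0, 2, 4, 6, 8):
        print(index, round(wavelength(index, d_model=50), 2))
```

At index 6 this comes out at roughly 19 positions, so the first 20 positions in the plot cover about one full period of that column.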

We know that the periods increase with the index. When the index approaches $d_{model}$ (the end of the vector), you need a lot of $pos$ vectors to cover the whole function period.

Positional encodings values for large indexes
Figure 5: Function values for further indexes, Source: Positional encoding visualization
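To put a number on "a lot", the sketch below computes the period of the last sin index for $d_{model} = 50$ and compares how much the first and the last sin index move over the first 20 positions. The helper names are hypothetical.

```python
import math

D_MODEL = 50


def wavelength(index, d_model=D_MODEL):
    """Number of positions after which the value at `index` repeats."""
    return 2 * math.pi * 10000 ** ((index - index % 2) / d_model)


def value_range(index, positions, d_model=D_MODEL):
    """Spread (max - min) of the encoding value at `index` over the given positions."""
    scale = 10000 ** ((index - index % 2) / d_model)
    fn = math.sin if index % 2 == 0 else math.cos
    values = [fn(pos / scale) for pos in positions]
    return max(values) - min(values)


if __name__ == "__main__":
    last = D_MODEL - 2  # last index that uses sin
    # Positions needed to see one full period at the end of the vector.
    print(math.ceil(wavelength(last)))
    # Spread over the first 20 positions: first index vs. last sin index.
    print(value_range(0, range(20)), value_range(last, range(20)))
```

The period at the end of the vector is in the tens of thousands of positions, while over 20 positions the value there moves by far less than 0.01.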

The values of the first 20 positions at the higher indexes are almost constant. You can see the same thing in Fig. 4, where the color of columns 30-50 barely changes. To see that change we have to plot the values for tens of thousands of positions:

Positional encodings for whole period of last index
Figure 6: Positional Encoding periods for further indexes, Generated with the use of Tensorflow - Positional encoding code

Warning: This plot is visually misleading. Because it tries to draw 40k+ values on 670px (height), it cannot show the correct value of anything with a wavelength smaller than 1px. That’s why anything prior to column 24 is visually incorrect, even though the right values were used to generate this plot.

Conclusions

Positional embeddings are there to give a transformer knowledge about the position of the input vectors. They are added (not concatenated) to corresponding input vectors. Encoding depends on three values:

  • $pos$ - position of the vector
  • $i$ - index within the vector
  • $d_{model}$ - dimension of the input

The value is calculated alternately with the help of the periodic functions ($\sin$ and $\cos$), and the wavelength of those functions increases with the index within the vector. Values at indexes closer to the top of the vector (lower indexes) change quickly, while those further away require a lot of positions to change their value (large periods).

This is just one way of doing positional encoding. Current SOTA models often have positional encodings trained along with the model instead of using predefined functions. The authors even mentioned that option in the paper but didn’t notice a meaningful difference in the results:

We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

References: