
Words are represented as word embeddings (vectors of numerical values). Context helps determine which meaning a word should take on, and this is where similarity comes in.
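A minimal sketch of the idea, using made-up 2-dimensional embeddings (the words and numbers are illustrative only, not from any trained model):

```python
import numpy as np

# Hypothetical 2-D word embeddings (illustrative values only).
# Similar words should end up with similar vectors.
embeddings = {
    "apple":  np.array([2.0, 3.0]),
    "orange": np.array([1.0, 4.0]),
    "phone":  np.array([4.0, 0.5]),
}
```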
Think of a 2D graph with x and y axes, where each word embedding is a point. The dot product multiplies two word vectors in matrix form (the first vector times the transpose of the second), e.g. [2, 3] · [1, 4] = (2 × 1) + (3 × 4) = 14.
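A quick check of that arithmetic (the vectors [2, 3] and [1, 4] are just the example numbers above):

```python
import numpy as np

a = np.array([2.0, 3.0])   # embedding of the first word
b = np.array([1.0, 4.0])   # embedding of the second word

dot = a @ b                # (2 * 1) + (3 * 4) = 14.0
print(dot)                 # 14.0
```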
For cosine similarity, each point on the graph is traced back to the origin, the angle between the two vectors is found using arctan, and that angle is passed to the cosine function to give the similarity value.
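A sketch of that angle-based view, assuming the same two example vectors; the result matches the usual dot-product formula for cosine similarity:

```python
import numpy as np

a = np.array([2.0, 3.0])
b = np.array([1.0, 4.0])

# Trace each point back to the origin and get its angle with arctan.
angle_a = np.arctan2(a[1], a[0])
angle_b = np.arctan2(b[1], b[0])

# Cosine of the angle between the two vectors = their similarity.
cos_sim_angle = np.cos(angle_b - angle_a)

# Same value via the standard formula: dot product / (product of lengths).
cos_sim_dot = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cos_sim_angle, 4), round(cos_sim_dot, 4))  # both ≈ 0.9417
```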
The dot product is then divided by the square root of the vector dimension, e.g. 14 / sqrt(2) for 2-dimensional vectors. This scaling keeps the dot products from growing too large, which would otherwise push the softmax into regions with extremely small gradients.
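A minimal sketch of the scaling step, reusing the dot product of 14 from above with d_k = 2 (the dimension of the example vectors):

```python
import numpy as np

d_k = 2                      # dimension of the key/query vectors
dot = 14.0                   # unscaled dot product from the earlier example

scaled = dot / np.sqrt(d_k)  # 14 / sqrt(2) ≈ 9.899
print(scaled)
```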
After the similarity step, the scores are normalized/scaled down so we are not working with extremely large numbers; the softmax activation function turns each row of scores into weights that sum to 1.
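A sketch of the normalization step with softmax, applied to a row of made-up similarity scores:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([9.9, 2.1, 0.5])   # hypothetical scaled similarity scores
weights = softmax(scores)
print(weights, weights.sum())        # roughly [0.9995, 0.0004, 0.0001], sums to 1.0
```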
Keys and Queries: transform the embedding into one that is best for calculating similarities.
Values: the embedding best suited for finding the next word. The similarity weights computed from the keys and queries are multiplied with the value embeddings.
Why move the words to a different embedding?
The first (keys and queries) embedding gives information about how similar two words are.
The second one (values) captures when two words could appear in the same context, which is what makes it useful for generating the next word. A toy sketch of the whole single-head computation is shown below.
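Putting the pieces together, here is a minimal single-head attention sketch; the projection matrices W_q, W_k, W_v and the random toy embeddings are made-up stand-ins for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

X = rng.normal(size=(4, 2))      # 4 words, 2-D embeddings (toy input)
W_q = rng.normal(size=(2, 2))    # projects embeddings into the "query" space
W_k = rng.normal(size=(2, 2))    # projects embeddings into the "key" space
W_v = rng.normal(size=(2, 2))    # projects embeddings into the "value" space

Q, K, V = X @ W_q, X @ W_k, X @ W_v
d_k = K.shape[-1]

scores = Q @ K.T / np.sqrt(d_k)  # scaled dot-product similarities
weights = softmax(scores)        # each row sums to 1
output = weights @ V             # context-aware embeddings, one per word
print(output.shape)              # (4, 2)
```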
Many heads are used (n times): the single-head attention procedure is simply repeated n times in parallel.
For example, with 3 heads that each produce a 2-dimensional embedding, concatenating them gives 6 dimensions.
A final linear layer then transforms the concatenated result back down to a dimension that can actually be used: the most useful head outputs are scaled up, the least useful are scaled down, and an optimal combined embedding is produced.
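A sketch of multi-head attention under the same toy assumptions: 3 heads of dimension 2 are concatenated into 6 dimensions and then projected back down by an output matrix W_o (all weights are random stand-ins for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
n_heads, d_model, d_head = 3, 2, 2

X = rng.normal(size=(4, d_model))          # 4 words, 2-D embeddings

heads = []
for _ in range(n_heads):                   # run single-head attention n times
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

concat = np.concatenate(heads, axis=-1)    # shape (4, 6): 3 heads * 2 dims
W_o = rng.normal(size=(n_heads * d_head, d_model))
output = concat @ W_o                      # projected back down to (4, 2)
print(concat.shape, output.shape)          # (4, 6) (4, 2)
```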
Starting with the encoder:
Input Embedding:
Positional Encoding:
Self-Attention:
Multi-Head Attention: $$ Multihead(Q, K, V) = Concat(head_{1}, \dots, head_{n})W^{O} $$ $$ head_{i} = Attention(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V}) $$
Layer Normalization:
For the decoder:
Masked Multi-Head: