<aside> 💡

Attention allows Neural Networks to focus on the most relevant parts of the input data while making predictions

</aside>

Transformers

Transformers are essentially Neural Network architectures used for Machine Learning tasks. They fundamentally contain three parts:

image.png

Word Embeddings

Word embeddings convert words, pieces of words, and symbols (collectively called tokens) into numbers. This is done because Neural Networks only accept numbers as input values.
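
As a minimal sketch (the vocabulary, token ids, and embedding size below are made up for illustration), an embedding layer in PyTorch is just a learnable lookup table from token ids to vectors:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary mapping tokens to ids (illustrative only).
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}

# One learnable vector per token id; embedding_dim=4 is an arbitrary choice here.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

token_ids = torch.tensor([vocab[t] for t in ["the", "cat", "sat"]])
vectors = embedding(token_ids)   # shape: (3, 4) -> one 4-d vector per token
print(vectors.shape)
```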

Positional Encoding

Positional encoding adds information about each token's position in the sequence, so the model can keep track of word order even though the tokens are processed in parallel.
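
One common choice (the sinusoidal scheme from the original Transformer paper; assumed here, since this note does not specify which encoding it uses) assigns each position a unique pattern of sines and cosines that is simply added to the token embeddings:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding as in 'Attention Is All You Need'."""
    positions = torch.arange(seq_len).unsqueeze(1)      # (seq_len, 1)
    dims = torch.arange(0, d_model, 2)                  # even embedding dimensions
    freqs = torch.pow(10000.0, -dims / d_model)         # 1 / 10000^(2i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)          # even dims get sine
    pe[:, 1::2] = torch.cos(positions * freqs)          # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=3, d_model=4)
# The encoding is added to the word embeddings:
# vectors_with_position = vectors + pe
```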

Attention

Attention is how we keep track of the relationships among words so that associations are maintained. Self-attention, for instance, calculates the similarity between each word and every other word (including itself); those similarity scores are then used to determine how the Transformer encodes each word.
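
A minimal single-head sketch of this idea in PyTorch (not the attached notebook; the dimensions and random input are placeholders) looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal single-head self-attention (scaled dot-product)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)  # query projection
        self.W_k = nn.Linear(d_model, d_model, bias=False)  # key projection
        self.W_v = nn.Linear(d_model, d_model, bias=False)  # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        # Similarity of every token with every other token, scaled by sqrt(d_k).
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)   # each row sums to 1
        return weights @ v                    # weighted mix of the value vectors

x = torch.randn(3, 4)              # 3 tokens, 4-dimensional embeddings (illustrative)
out = SelfAttention(d_model=4)(x)
print(out.shape)                   # torch.Size([3, 4])
```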

Example

Here’s an implementation of Self Attention in PyTorch

Self_Attention_in_PyTroch.ipynb

Now, let’s move on to the math behind Self Attention

Matrix Math
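
In matrix form, the queries $Q$, keys $K$, and values $V$ are computed from the token embeddings with learned weight matrices, and the standard scaled dot-product self-attention is:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

where $d_k$ is the dimension of the key vectors, used to keep the dot products from growing too large before the softmax.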

Masked Self-Attention