<aside> 💡
Attention allows Neural Networks to focus on the most relevant parts of the input data while making predictions
</aside>
Transformers are essentially Neural Network architectures used for performing Machine Learning tasks. They fundamentally contain three parts:
1. **Word Embedding:** converts words, bits of words, and symbols (collectively called tokens) into numbers, because Neural Networks only accept numbers as input values (see the sketch after this list).
2. **Positional Encoding:** keeps track of word order in the input sequence.
3. **Attention:** keeps track of the relationships among the words so that associations are maintained. Self-attention, for instance, calculates the similarity of each word with every other word, and these similarities are then used to determine how the Transformer encodes each word.
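To make the first two parts concrete, here is a minimal PyTorch sketch of turning token IDs into embeddings and adding sinusoidal positional encodings. The sizes and variable names are illustrative assumptions, not taken from the notebook linked below.

```python
import math
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the original notes).
vocab_size, d_model, seq_len = 1000, 64, 10

# 1. Word Embedding: map integer token IDs to d_model-dimensional vectors.
embedding = nn.Embedding(vocab_size, d_model)
token_ids = torch.randint(0, vocab_size, (1, seq_len))  # a batch of one sequence
x = embedding(token_ids)                                 # shape: (1, seq_len, d_model)

# 2. Positional Encoding (sinusoidal): even dimensions get sine,
#    odd dimensions get cosine of scaled positions.
position = torch.arange(seq_len).unsqueeze(1)            # (seq_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Add positional information to the embeddings so word order is preserved.
x = x + pe.unsqueeze(0)                                  # still (1, seq_len, d_model)
```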
Here’s an implementation of Self Attention in PyTorch
Self_Attention_in_PyTroch.ipynb
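The notebook itself isn't reproduced here, so as a stand-in, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. The class name, layer sizes, and variable names are my own assumptions and not necessarily those used in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal single-head scaled dot-product self-attention."""

    def __init__(self, d_model: int):
        super().__init__()
        # Learned projections that produce queries, keys, and values.
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)

        # Similarity of every token with every other token, scaled by sqrt(d_model).
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)  # (batch, seq_len, seq_len)

        # Softmax turns similarities into attention weights that sum to 1 per token.
        weights = F.softmax(scores, dim=-1)

        # Each output token is a weighted mix of all value vectors.
        return weights @ v                                       # (batch, seq_len, d_model)

# Usage with an embedded-and-position-encoded tensor like the one sketched above:
attn = SelfAttention(d_model=64)
out = attn(torch.randn(1, 10, 64))
print(out.shape)  # torch.Size([1, 10, 64])
```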
Now, let’s move on to the math behind Self Attention