# Attention

**Source:** [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)

**Definition:** Attention is a method for determining how important each token in a sequence is relative to the other tokens in that sequence.

## Standard Scaled Dot-Product Self-Attention

Consider the following matrices:

- $Q$ - Query matrix
  - This represents what we **want** to get from the other vectors.
  - We want to gather information from the other vectors because it is useful in the current context.
- $K$ - Key matrix
  - This matrix represents the information each vector offers.
  - The key can be thought of as indexing / labeling the information afforded by the vector.
- $V$ - Value matrix
  - This matrix represents the content each vector contributes.

We then have $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

### Breakdown

1. $QK^T$
   - This gives us a matrix of dot products between every query and every key, which measures their similarity.
   - This is useful because it describes how relevant each key is to the queried information.
2. $\frac{QK^T}{\sqrt{d_k}}$
   - $d_k$ is the dimension of each key vector (the number of elements in each key). Dividing by $\sqrt{d_k}$ prevents the dot products from becoming too large as $d_k$ grows, which would otherwise push the softmax into regions with very small gradients.
3. $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$
   - This converts the similarity scores (scaled dot products) into attention weights, describing how much attention each query gives to each key. The softmax is applied row-wise, so the weights for each query sum to 1.
4. $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
   - This computes, for each query, the sum of the value vectors weighted by the attention weights.
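
The steps of the breakdown can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `softmax` and `attention`, the list-of-rows representation, and the toy numbers are all assumptions made for this example. It uses only the standard library so that each step stays visible; a real implementation would use a tensor library.

```python
import math


def softmax(row):
    """Softmax over one row of scores (shifted by the max for numerical stability)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on matrices stored as lists of row vectors.

    Q has n_q rows of length d_k, K has n_k rows of length d_k,
    and V has n_k rows of length d_v. Returns (output, weights).
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)

    # Steps 1 and 2: scores[i][j] = (q_i . k_j) / sqrt(d_k), i.e. QK^T scaled.
    scores = [
        [sum(q_x * k_x for q_x, k_x in zip(q, k)) / scale for k in K]
        for q in Q
    ]

    # Step 3: row-wise softmax, so each query's weights sum to 1.
    weights = [softmax(row) for row in scores]

    # Step 4: each output row is the weighted sum of the value vectors.
    d_v = len(V[0])
    output = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(d_v)]
        for w_row in weights
    ]
    return output, weights


# Toy inputs (made-up numbers): 2 queries, 3 keys/values, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
for w_row, o_row in zip(weights, output):
    print([round(w, 3) for w in w_row], [round(o, 3) for o in o_row])
```

The sketch takes $Q$, $K$, and $V$ as given. In self-attention all three are computed from the same input sequence, a step that is not shown here.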