# Attention

**Source:** [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)

**Definition:** Attention is a method for determining how important each token in a sequence is relative to the other tokens in that sequence.

## Standard Scaled Dot-Product Self-Attention

Consider the following matrices:

- $Q$ - Query matrix
    - This represents what we **want** to get from the other vectors.
        - Basically, we want to gather some type of information from the other vectors because that information is useful in the current context.
- $K$ - Key matrix
    - This matrix represents the information each vector offers.
        - The key can be thought of as indexing / labeling the information afforded by the vector.
- $V$ - Value matrix
    - This matrix represents the content each vector contributes.

We then have $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

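The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `softmax` and `attention`, the list-of-rows matrix layout, and the toy inputs are all assumptions made for this example. It assumes $Q$ is $n_q \times d_k$, $K$ is $n_k \times d_k$, and $V$ is $n_k \times d_v$.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for matrices given as lists of rows.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v. Returns an n_q x d_v matrix.
    """
    scale = math.sqrt(len(K[0]))  # sqrt(d_k)
    d_v = len(V[0])
    output = []
    for q in Q:
        # Scaled similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Attention weights for this query; they sum to 1.
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return output

# Hypothetical toy input: 2 tokens with d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

In this toy input the first query matches the first key most closely, so the first output row lands nearer to the first value vector than to the second, and the reverse holds for the second query.
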
### Break Down

1. $QK^T$
    - This gives us a matrix of the dot products between every query and every key, which measure their similarity.
        - This is useful because it describes how relevant each key is to the queried information.
2. $\frac{QK^T}{\sqrt{d_k}}$
    - $d_k$ is the dimension of each key (and query) vector. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large as $d_k$ increases, which would otherwise push the softmax into regions with very small gradients.
3. $\text{softmax}(\frac{QK^T}{\sqrt{d_k}})$
    - This converts our similarity scores (dot products) to attention weights, describing how much attention to give to each key. The softmax is applied row-wise, so each query's weights over the keys sum to 1.
4. $\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$
    - This computes, for each query, the weighted sum of the value vectors, using the attention weights as the weights.
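
The four steps above can be traced by hand for a single query. The sketch below uses made-up numbers (one query, three keys, $d_k = 4$) chosen only so the intermediate values are easy to check. All variable names are hypothetical. It also computes the softmax of the unscaled scores for comparison.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: one query attending over three keys, d_k = 4.
q = [1.0, 1.0, 1.0, 1.0]
K = [[1.0, 1.0, 1.0, 1.0],
     [1.0, 0.0, 1.0, 0.0],
     [-1.0, -1.0, -1.0, -1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
d_k = len(q)

# Step 1: raw dot products (one row of QK^T).
scores = [sum(a * b for a, b in zip(q, k)) for k in K]  # [4.0, 2.0, -4.0]

# Step 2: divide by sqrt(d_k) = 2.
scaled = [s / math.sqrt(d_k) for s in scores]  # [2.0, 1.0, -2.0]

# Step 3: softmax turns the scaled scores into weights that sum to 1.
weights = softmax(scaled)

# Step 4: weighted sum of the value vectors.
output = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

# For comparison only: softmax without the scaling step.
unscaled_weights = softmax(scores)
```

Here `unscaled_weights` puts more of its mass on the best-matching key than `weights` does, which illustrates why the scaling in step 2 matters: larger raw scores make the softmax output more peaked.
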