Advantage:
Encoder:
$h_t = f(x_t, h_{t-1})$
$c = q(\{h_1, \dots, h_{T_x}\})$
Sutskever et al. (2014) used an LSTM as $f$ and $q(\{h_1, \dots, h_T\}) = h_T$.
Decoder:
$p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$
$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$
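Below is a minimal NumPy sketch of this vanilla encoder-decoder, assuming a plain tanh RNN for both $f$ and $g$ and $q(\{h_1,\dots,h_{T_x}\}) = h_{T_x}$ as in Sutskever et al.; all weight names and dimensions are illustrative, not taken from any paper's code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(0)
d, vocab = 8, 20                                     # toy hidden size and vocabulary size
Wx, Wh = rng.normal(size=(d, d)), rng.normal(size=(d, d))          # encoder weights
Wy, Ws, Wc = (rng.normal(size=(d, vocab)), rng.normal(size=(d, d)),
              rng.normal(size=(d, d)))               # decoder weights
Wo = rng.normal(size=(vocab, d))                     # output projection

# Encoder: h_t = f(x_t, h_{t-1});  c = q({h_1..h_Tx}) = h_Tx
xs = rng.normal(size=(5, d))                         # a toy source sequence of 5 vectors
h = np.zeros(d)
for x_t in xs:
    h = np.tanh(Wx @ x_t + Wh @ h)
c = h                                                # one fixed context vector for the whole target

# Decoder: s_t from (s_{t-1}, y_{t-1}, c);  p(y_t | y_<t, c) = g(y_{t-1}, s_t, c)
s, y_prev = np.zeros(d), np.zeros(vocab)
for t in range(3):                                   # generate three output steps greedily
    s = np.tanh(Wy @ y_prev + Ws @ s + Wc @ c)
    p = softmax(Wo @ s)                              # g reduced to a softmax over the new state
    y_prev = np.eye(vocab)[p.argmax()]               # feed the chosen word back at the next step
    print(t, int(p.argmax()))
```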
Decoder (with attention):
Each conditional probability:
$p(y_i \mid \{y_1, \dots, y_{i-1}\}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)$
$s_i = f(s_{i-1}, y_{i-1}, c_i)$
Context vector $c_i$:
$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$
$e_{ij} = a(s_{i-1}, h_j)$
In [1], $a$ is computed by:
$a(s_{i-1}, h_j) = v^{\top} \tanh(W_a s_{i-1} + U_a h_j)$
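A minimal NumPy sketch of one decoder step of this additive attention; $W_a$, $U_a$, $v$ and the dimensions are illustrative placeholders rather than the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, d_enc, d_dec, d_att = 6, 8, 8, 10          # toy dimensions
H = rng.normal(size=(Tx, d_enc))               # encoder annotations h_1 .. h_Tx
s_prev = rng.normal(size=d_dec)                # previous decoder state s_{i-1}
Wa = rng.normal(size=(d_att, d_dec))
Ua = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)

# e_ij = v^T tanh(Wa s_{i-1} + Ua h_j)
e = np.array([v @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])

# alpha_ij = softmax of e_ij over the source positions j
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# c_i = sum_j alpha_ij h_j  -> context vector for decoder step i
c_i = alpha @ H
print(alpha.round(3), c_i.shape)
```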
Hard attention concentrates on a very small region, whereas soft attention spreads its weights over a comparatively diffuse area.
Four ways to compute the alignment function (Luong et al. [3]):
$\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} \bar{h}_s \qquad \textit{dot}$
$\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} W_a \bar{h}_s \qquad \textit{general}$
$\mathrm{score}(h_t, \bar{h}_s) = v_a^{\top} \tanh(W_a [h_t ; \bar{h}_s]) \qquad \textit{concat}$
$a_t = \mathrm{softmax}(W_a h_t) \qquad \textit{location}$
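As a rough NumPy sketch of these score functions (weight shapes and the toy source length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, Tx = 8, 5                                       # toy hidden size and source length
h_t, h_s = rng.normal(size=d), rng.normal(size=d)  # current target state, one source state
Wa_general = rng.normal(size=(d, d))
Wa_concat = rng.normal(size=(d, 2 * d))
v_a = rng.normal(size=d)
Wa_location = rng.normal(size=(Tx, d))

score_dot = h_t @ h_s                                                    # dot
score_general = h_t @ Wa_general @ h_s                                   # general
score_concat = v_a @ np.tanh(Wa_concat @ np.concatenate([h_t, h_s]))     # concat

a_t = np.exp(Wa_location @ h_t)                                          # location:
a_t /= a_t.sum()                                                         # softmax(Wa h_t)

print(score_dot, score_general, score_concat, a_t.round(3))
```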
Summary:
Simplified version of attention:
Here, $a(h_t) = \tanh(W_{hc} h_t + b_{hc})$
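A rough NumPy sketch of this simplified attention, assuming the vector-valued $a(h_t)$ is reduced to a scalar score per time step with an extra illustrative projection $w$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8
H = rng.normal(size=(T, d))                     # annotations h_1 .. h_T
W_hc = rng.normal(size=(d, d))
b_hc = rng.normal(size=d)
w = rng.normal(size=d)                          # illustrative projection to a scalar score

# e_t = a(h_t) = tanh(W_hc h_t + b_hc), reduced to one score per time step
e = np.array([w @ np.tanh(W_hc @ h_t + b_hc) for h_t in H])
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                            # softmax over time

c = alpha @ H                                   # a single context vector for the whole sequence
print(alpha.round(3), c.shape)
```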
word level attention:
sentence level attention:
inner attention mechanism:
The annotation $h_t$ is first passed to a dense layer. An alignment coefficient $\alpha_t$ is then derived by comparing the output $u_t$ of the dense layer with a trainable context vector $u$ (initialized randomly) and normalizing with a softmax. The attentional vector $s$ is finally obtained as a weighted sum of the annotations.
The score can in theory be any alignment function; a straightforward choice is the dot product. The context vector can be interpreted as a representation of the optimal word, on average. When faced with a new example, the model uses this knowledge to decide which word it should pay attention to. During training, through backpropagation, the model updates the context vector, i.e., it adjusts its internal representation of what the optimal word is.
Note: The context vector in the definition of inner-attention above has nothing to do with the context vector used in seq2seq attention!
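A minimal NumPy sketch of this inner-attention mechanism (dense layer, trainable context vector $u$, softmax, weighted sum); shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 7, 8
H = rng.normal(size=(T, d))          # annotations h_t from the encoder
W = rng.normal(size=(d, d))          # dense layer weights
b = rng.normal(size=d)
u = rng.normal(size=d)               # trainable context vector, randomly initialized

U = np.tanh(H @ W.T + b)             # u_t: the dense layer applied to each annotation
e = U @ u                            # compare each u_t with the context vector (dot score)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                 # softmax over time steps

s = alpha @ H                        # attentional vector: weighted sum of the annotations
print(alpha.round(3), s.shape)
```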
Self-attention is the case K = V = Q: given an input sentence, every word computes attention against all of the words in that same sentence. The goal is to learn the dependencies between words inside the sentence and to capture its internal structure.
The essence of the attention function can be described as a mapping from a query to a series of key-value pairs.
Think of the constituent elements of the Source as a series of <Key, Value> pairs. Given some element Query from the Target, we compute the similarity or relevance between the Query and each Key to obtain a weight coefficient for that Key's Value, and then take the weighted sum of the Values to get the final attention value. In essence, the attention mechanism is therefore a weighted sum over the Values of the elements in the Source, with the Query and the Keys used to compute the weight coefficients of the corresponding Values.
If we abstract over most current methods, the computation of attention can be summarized in three stages: first, compute the similarity or relevance between the Query and each Key; second, normalize the raw scores from the first stage; third, take the weighted sum of the Values using the resulting weight coefficients.
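These three stages map directly onto dot-product attention. Below is a NumPy sketch in which a sentence attends to itself (Q, K and V are all projections of the same sequence), including the $1/\sqrt{d_k}$ scaling used in the Transformer [7]; dimensions are toy values.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 16, 8
X = rng.normal(size=(T, d_model))          # one sentence: T token vectors

# Self-attention: Q, K and V are all projections of the same sequence X
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d_k)            # stage 1: similarity of every query with every key
alpha = softmax(scores, axis=-1)           # stage 2: normalize the raw scores
out = alpha @ V                            # stage 3: weighted sum of the values
print(alpha.shape, out.shape)              # (T, T) attention map, (T, d_k) outputs
```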
Paper:
[1] 《Neural Machine Translation by Jointly Learning to Align and Translate》
[2] 《Show, Attend and Tell: Neural Image Caption Generation with Visual Attention》
[3] 《Effective Approaches to Attention-based Neural Machine Translation》
[4] 《Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems》
[5] 《Hierarchical Attention Networks for Document Classification》
[6] 《Notes on Deep Learning for NLP》
[7] 《Attention Is All You Need》
Blog:
Implement:
(The existing Keras layers provide examples of how to implement almost anything. Never hesitate to read the source code!)
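As a starting point, here is a minimal custom Keras layer sketching the inner-attention described above (dense layer, trainable context vector, softmax-weighted sum); the class name and details are illustrative, not taken from any of the papers' official code.

```python
import tensorflow as tf

class SimpleAttention(tf.keras.layers.Layer):
    """Inner attention over time steps: score each annotation against a
    trainable context vector, softmax, then return the weighted sum."""

    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(d, d), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(d,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(d,), initializer="glorot_uniform")

    def call(self, h):                                 # h: (batch, T, d)
        u_t = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)   # dense layer per step
        e = tf.tensordot(u_t, self.u, axes=1)          # (batch, T) alignment scores
        alpha = tf.nn.softmax(e, axis=-1)
        return tf.reduce_sum(h * tf.expand_dims(alpha, -1), axis=1)   # (batch, d)

# Usage: attach it after any layer that returns full sequences, e.g.
#   x = tf.keras.layers.GRU(64, return_sequences=True)(inputs)
#   doc_vector = SimpleAttention()(x)
```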