A Summary of the Attention Structures Encountered in NLP, with Code Reproductions

Published: 2023/12/8


Contents

Preface
Bahdanau Attention
Luong Attention
Self-Attention, Multi-Head Attention
Location Sensitive Attention
Forms of Attention
- Soft attention, global attention, dynamic attention
- Hard attention
- Local Attention (half soft, half hard)
- Concatenation-based Attention
- Static attention
- Multi-layer Attention
Closing remarks

Preface

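Github: the code for this article lives in the project "NLP相關Paper筆記和代碼復現" (NLP paper notes and code reproductions).
Note: each write-up distills and records the ideas, structure, strengths and weaknesses, and content of the relevant papers and materials. Sources are credited where quoted; if any quotation infringes your rights, please let me know and it will be removed.
When reposting, please credit: DengBoCong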

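In the familiar encoder-decoder architecture, both sides are usually RNNs such as GRU or LSTM. The encoder RNN summarizes the input sentence into its final hidden vector, which then initializes the decoder's hidden state, and the decoder unrolls from there to produce the text in the target language. This design has well-known problems. Besides the long-range vanishing-gradient issue, it is unrealistic to expect a single fixed-length vector to retain all the useful information in a long sentence, so quality drops markedly as the input sequence grows longer. This is where Attention comes in. Put simply, Attention assigns weight coefficients over the input sequence so that the useful information at every position is retained; the model is no longer limited to one fixed-length hidden vector, and long-range information is not lost.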

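This article collects the Attention variants I have run into so far in dialogue and speech work, with an explanation, a summary, the paper, and a code reproduction for each. Only the key points of each mechanism are covered; see the reference material or the original papers for the details. The list is not long: it covers the mechanisms that have mattered most in my own study so far (a few less common ones are mentioned briefly at the end), and I will keep updating it.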

Bahdanau Attention

Paper Link

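Bahdanau Attention is one of the pioneering attention mechanisms. It comes from the paper "Neural Machine Translation by Jointly Learning to Align and Translate". The word "Align" in the title refers to adjusting, while the model trains, the weights that directly determine the scores.

[Figure: model structure from the paper]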

The formulas are:
$c_t = \sum_{j=1}^{T_x}a_{tj}h_j$
$a_{tj}=\frac{\exp(e_{tj})}{\sum_{k=1}^{T_x}\exp(e_{tk})}$
$e_{tj}=V_a^T\tanh(W_a[s_{t-1};h_j])$

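Here $c_t$ is the context vector at decoder step $t$, and $e_{tj}$ measures how strongly the encoder hidden state $h_j$ at step $j$ influences the decoder hidden state $s_t$ at step $t$ (it is computed from the previous decoder state $s_{t-1}$). The softmax in the second formula then normalizes $e_{tj}$ into the probability $a_{tj}$.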
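
To make the first two formulas concrete, here is a minimal sketch in plain Python (standard library only, so the arithmetic stays visible). The toy encoder states and scores are made-up values chosen so the result is easy to check by hand; the scores $e_{tj}$ are taken as given rather than computed from a trained network.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, encoder_states):
    """Return (weights a_tj, context c_t) for a single decoder step t."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical toy values: three encoder states h_1..h_3 and their scores e_t1..e_t3.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, math.log(2.0)]

weights, context = context_vector(scores, encoder_states)
print(weights)  # approximately [0.25, 0.25, 0.5]
print(context)  # approximately [0.75, 0.75]
```

The third state has twice the unnormalized weight of the other two, so it receives half of the probability mass and dominates the weighted sum.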

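The paper presents Attention within a Seq2seq architecture, so a few points are worth noting:

- The encoder processes the sequence with a bidirectional RNN, and the last hidden state of the backward RNN is used to initialize the decoder's hidden state.
- The attention layer computes its scores with the additive/concat method.
- The input to the decoder at the next time step is the concatenation of the word generated at the previous decoder time step (or the ground truth) and the context vector of the current time step.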

[Figure: a clearer diagram of the structure]

Reproduction code (using TensorFlow 2). Note that the implementation below needs to be adapted to the specific model before it is used in practice:

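    import tensorflow as tf


    # Written against the TF 2.x functional API (Keras 2), where tf ops accept symbolic inputs.
    def bahdanau_attention(hidden_dim: int, units: int) -> tf.keras.Model:
        """
        :param hidden_dim: size of the encoder/decoder hidden state
        :param units: number of units in the fully connected layers
        """
        query = tf.keras.Input(shape=(hidden_dim,))
        values = tf.keras.Input(shape=(None, hidden_dim))
        V = tf.keras.layers.Dense(1)
        W1 = tf.keras.layers.Dense(units)
        W2 = tf.keras.layers.Dense(units)
        # query is the previous decoder state. The first decoder state is the last
        # hidden state of the encoder's backward RNN mentioned above, which serves
        # as the initial hidden state of the decoder RNN.
        # values holds the encoder hidden state at every time step, so query needs
        # an extra time axis before the two can be combined.
        hidden_with_time_axis = tf.expand_dims(query, 1)
        score = V(tf.nn.tanh(W1(values) + W2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        # c_t is the weighted sum over time steps, so sum (not mean) along axis 1.
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return tf.keras.Model(inputs=[query, values], outputs=[context_vector, attention_weights])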

Luong Attention

Paper Link

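The paper is "Effective Approaches to Attention-based Neural Machine Translation". It builds on Bahdanau Attention but uses a simpler architecture, and it studies two simple and effective classes of attention: a global approach that always attends to all source words, and a local approach that looks at only a subset of source words at a time.

[Figure: model structure from the paper]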

The formulas are:
$a_t(s)=align(h_t,\bar{h}_s)=\frac{\exp(score(h_t, \bar{h}_s))}{\sum_{s'}\exp(score(h_t, \bar{h}_{s'}))}$
$score(h_t, \bar{h}_s)=\begin{cases} h_t^T\bar{h}_s & dot \\ h_t^TW_a\bar{h}_s & general \\ v_a^T\tanh(W_a[h_t;\bar{h}_s]) & concat \end{cases}$

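This paper also presents its method within a Seq2Seq architecture. Points worth noting:

- The encoder is a stack of LSTM layers, drawn as two layers in the paper's figures (the reported experiments use deeper, four-layer stacks). The decoder has the same structure and uses the encoder's final hidden state as its initial hidden state.
- The hidden vectors used for the Attention computation are those of the top LSTM layer in the stack.
- The score functions tried in the paper are: (1) additive/concat, (2) dot product, (3) location-based, (4) "general".
- The concatenation of the decoder output at the current time step and the context vector of the current time step is fed into a feed-forward layer, which produces the decoder's final output for that time step.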
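
The three content-based score functions and the final combination step can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names (`score_dot`, `score_general`, `score_concat`, `attentional_hidden`) and all numeric values are hypothetical, and the weight matrices are hand-picked rather than learned.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def score_dot(h_t, h_s):
    """dot: h_t^T h_s"""
    return dot(h_t, h_s)


def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s"""
    return dot(h_t, matvec(W_a, h_s))


def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a^T tanh(W_a [h_t; h_s]); list addition builds [h_t; h_s]."""
    hidden = [math.tanh(x) for x in matvec(W_a, h_t + h_s)]
    return dot(v_a, hidden)


def attentional_hidden(c_t, h_t, W_c):
    """Final combination step: tanh(W_c [c_t; h_t])."""
    return [math.tanh(x) for x in matvec(W_c, c_t + h_t)]


# Hypothetical toy values.
h_t = [1.0, 2.0]    # current decoder hidden state
h_s = [3.0, -1.0]   # one encoder hidden state
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]  # shape (2, 4)
v_a = [1.0, 1.0]
W_c = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]        # shape (2, 4)

s_dot = score_dot(h_t, h_s)
s_general = score_general(h_t, h_s, identity)
s_concat = score_concat(h_t, h_s, W_concat, v_a)
h_tilde = attentional_hidden([0.0, 5.0], h_t, W_c)
```

With `W_a` set to the identity, the general score reduces to the dot score, which shows that dot is the special case of general with no learned interaction between dimensions.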

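[Figure: a clearer diagram of the structure]

The structure closely resembles Bahdanau Attention. The differences lie in how the score is computed and in how the decoder finally combines its state with the context vector.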

Reproduction code (using TensorFlow 2). Note that the implementation below needs to be adapted to the specific model before it is used in practice:

    import tensorflow as tf


    def luong_attention_concat(hidden_dim: int, units: int) -> tf.keras.Model:
        """
        :param hidden_dim: size of the encoder/decoder hidden state
        :param units: number of units in the fully connected layers
        """
        query = tf.keras.Input(shape=(hidden_dim,))
        values = tf.keras.Input(shape=(None, hidden_dim))
        W1 = tf.keras.layers.Dense(units)
        W2 = tf.keras.layers.Dense(units)
        V = tf.keras.layers.Dense(1)
        # query is the current decoder hidden state h_t; values holds the encoder
        # hidden state at every time step, so query needs an extra time axis.
        hidden_with_time_axis = tf.expand_dims(query, 1)
        # W_a[h_t; h_s] written as W1 h_t + W2 h_s, which is equivalent to one
        # Dense layer applied to the concatenation.
        scores = V(tf.nn.tanh(W1(hidden_with_time_axis) + W2(values)))
        attention_weights = tf.nn.softmax(scores, axis=1)
        # Weighted sum over time steps: (batch, seq_len, 1) * (batch, seq_len, hidden_dim).
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return tf.keras.Model(inputs=[query, values], outputs=[attention_weights, context_vector])


    def luong_attention_dot(query: tf.Tensor, value: tf.Tensor) -> tf.Tensor:
        """
        :param query: current decoder hidden state, shape (batch, hidden_dim)
        :param value: encoder output, shape (batch, seq_len, hidden_dim)
        :return: context vector, shape (batch, hidden_dim)
        """
        hidden_with_time_axis = tf.expand_dims(query, 1)
        scores = tf.matmul(hidden_with_time_axis, value, transpose_b=True)  # (batch, 1, seq_len)
        # Normalize over the encoder time steps, which are on the last axis here.
        attention_weights = tf.nn.softmax(scores, axis=-1)
        context_vector = tf.matmul(attention_weights, value)  # (batch, 1, hidden_dim)
        return tf.squeeze(context_vector, axis=1)

Self-Attention、Multi-Head Attention

Link

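The Transformer uses Self-Attention and Multi-Head Attention. In self-attention, the three matrices Q (Query), K (Key) and V (Value) all come from the same input. First the dot product of Q and K is computed; to keep the result from growing too large, it is divided by the scaling factor $\sqrt{d_k}$, where $d_k$ is the dimension of a single query or key vector. A softmax then normalizes the result into a probability distribution, which is multiplied by V to obtain the weighted-sum representation. Multi-head Attention uses several queries to attend to the same source several times, each query focusing on a different part of it, which amounts to repeating single-layer attention several times (in the Transformer, each head gets its own learned projections of Q, K and V).

[Figure: structure of scaled dot-product attention and of multi-head attention]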

The formulas are:
$Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$
$head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)$
$MultiHead(Q,K,V)=Concat(head_1,...,head_h)W^O$
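
A small numeric sketch shows why the logits are divided by $\sqrt{d_k}$. The vectors below are made-up values with $d_k = 64$, and `attention_weights` is a hypothetical helper name: without scaling, the softmax saturates and almost all the weight lands on one key; with scaling, the distribution stays softer.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys, scale=True):
    """Softmax over q . k for every key, optionally divided by sqrt(d_k)."""
    d_k = len(q)
    logits = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if scale:
        logits = [x / math.sqrt(d_k) for x in logits]
    return softmax(logits)


# Hypothetical toy values: one query and two keys of dimension 64.
d_k = 64
q = [1.0] * d_k
keys = [[1.0] * d_k, [0.5] * d_k]

unscaled = attention_weights(q, keys, scale=False)  # logits 64 and 32
scaled = attention_weights(q, keys, scale=True)     # logits 8 and 4
print(unscaled[1])  # on the order of 1e-14: the second key is effectively ignored
print(scaled[1])    # roughly 0.018: small, but still contributes
```

A saturated softmax also has tiny gradients, which is the usual motivation given for the scaling factor.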

Reproduction code (using TensorFlow 2). Note that the implementation below needs to be adapted to the specific model before it is used in practice:

    import tensorflow as tf


    def scaled_dot_product_attention(q: tf.Tensor, k: tf.Tensor, v: tf.Tensor, mask: tf.Tensor = None):
        """Compute the attention weights.

        q, k, v must have matching leading dimensions.
        k, v must have matching penultimate dimensions, i.e. seq_len_k = seq_len_v.
        The mask has different shapes depending on its type (padding or look-ahead),
        but it must be broadcastable for the addition.

        Args:
            q: query, shape == (..., seq_len_q, depth)
            k: key, shape == (..., seq_len_k, depth)
            v: value, shape == (..., seq_len_v, depth_v)
            mask: float tensor broadcastable to (..., seq_len_q, seq_len_k). Defaults to None.

        Returns:
            output, attention weights
        """
        matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

        # Scale matmul_qk.
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        # Add the mask to the scaled tensor.
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        # softmax normalizes over the last axis (seq_len_k), so the weights sum to 1.
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

        output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
        return output, attention_weights
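
The multi-head part of the formulas can be sketched in plain Python as follows. This is a simplified illustration under a loud assumption: the learned projections for each head and the output projection $W^O$ are all taken to be the identity, so each head simply works on its own slice of the feature dimension. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(logits)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) for d in range(len(V[0]))])
    return out


def split_heads(X, num_heads):
    """Split every row of X into num_heads equal slices; returns one matrix per head."""
    depth = len(X[0]) // num_heads
    return [[row[h * depth:(h + 1) * depth] for row in X] for h in range(num_heads)]


def multi_head_attention(Q, K, V, num_heads):
    # Assumption: per-head projections and W^O are identity (omitted here).
    heads = [
        attention(q, k, v)
        for q, k, v in zip(split_heads(Q, num_heads),
                           split_heads(K, num_heads),
                           split_heads(V, num_heads))
    ]
    # Concat(head_1, ..., head_h): join the per-head outputs row by row.
    return [sum((head[i] for head in heads), []) for i in range(len(Q))]


# Self-attention on a hypothetical 3-token input with 4 features and 2 heads.
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(X, X, X, num_heads=2)

# With a single key/value pair, every head must return that value unchanged.
single = multi_head_attention([[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], num_heads=2)
```

Because each head computes its own softmax over its own slice, the two heads can place their weight on different tokens for the same query position.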

Location Sensitive Attention

Link
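Tacotron2, a speech-synthesis model, uses Location Sensitive Attention: attention that is sensitive to position because location features are added, which makes it a hybrid attention mechanism (see the final section). The original paper points out that content-based Attention can track and capture the absolute position within the input sequence, but its performance degrades rapidly on longer utterances. To address this, the authors feed auxiliary convolutional features into the attention mechanism; these features are extracted by convolving the attention weights from the previous step.

[Figure: structure of Location Sensitive Attention]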

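The formula is:
$e_{ij}=score(s_i,ca_{i-1}, h_j)=v_a^T\tanh(Ws_i+Vh_j+Uf_{i,j}+b)$

Here $s_i$ is the current decoder hidden state rather than the previous one, and the bias $b$ is initialized to $0$. The location feature $f_i$ is obtained by convolving the cumulative attention weights $ca_i$:
$f_i=F*ca_{i-1}$
$ca_i=\sum_{j=1}^{i-1}a_j$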
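
The two location-feature formulas can be illustrated with a small plain-Python sketch. The alignment values and the filter are made up for the example, and the helper names are hypothetical; in the real model $F$ is a bank of learned filters rather than one hand-picked kernel.

```python
def cumulative_weights(alignments):
    """Element-wise running sum of the attention weights from earlier decoder steps."""
    total = [0.0] * len(alignments[0])
    for a in alignments:
        total = [t + x for t, x in zip(total, a)]
    return total


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length.

    Assumes an odd kernel length.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


# Hypothetical attention weights from two earlier decoder steps over 4 encoder positions.
previous_alignments = [
    [0.9, 0.1, 0.0, 0.0],
    [0.2, 0.7, 0.1, 0.0],
]
ca = cumulative_weights(previous_alignments)   # approximately [1.1, 0.8, 0.1, 0.0]

# A hand-picked filter that reads the cumulative weight one position to the right.
f = conv1d_same(ca, kernel=[0.0, 0.0, 1.0])    # approximately [0.8, 0.1, 0.0, 0.0]
```

The cumulative weights tell the decoder which encoder positions have already been covered, and a convolution over them gives each position a local view of that coverage, which is what lets the attention move forward consistently.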

Reproduction code (using TensorFlow 2). Note that the implementation below needs to be adapted to the specific model before it is used in practice:

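    import tensorflow as tf


    class LocationLayer(tf.keras.layers.Layer):
        # Not defined in the original snippet. Assumed here: a 1-D convolution over the
        # stacked attention weights followed by a bias-free projection to attention_dim.
        def __init__(self, attention_n_filters, attention_kernel_size, attention_dim):
            super(LocationLayer, self).__init__()
            self.location_conv = tf.keras.layers.Conv1D(
                attention_n_filters, attention_kernel_size, padding="same", use_bias=False)
            self.location_dense = tf.keras.layers.Dense(attention_dim, use_bias=False)

        def call(self, attention_weights_cat):
            # (batch, seq_len, channels) -> (batch, seq_len, attention_dim)
            return self.location_dense(self.location_conv(attention_weights_cat))


    class Attention(tf.keras.layers.Layer):
        def __init__(self, attention_dim, attention_filters, attention_kernel):
            super(Attention, self).__init__()
            self.attention_dim = attention_dim
            self.attention_location_n_filters = attention_filters
            self.attention_location_kernel_size = attention_kernel
            # No activation on the projections: tanh is applied once, after the
            # three terms are summed, as in the score formula.
            self.query_layer = tf.keras.layers.Dense(self.attention_dim, use_bias=False)
            self.memory_layer = tf.keras.layers.Dense(self.attention_dim, use_bias=False)
            self.V = tf.keras.layers.Dense(1, use_bias=False)
            self.location_layer = LocationLayer(self.attention_location_n_filters,
                                                self.attention_location_kernel_size,
                                                self.attention_dim)
            self.score_mask_value = -float("inf")

        def get_alignment_energies(self, query, memory, attention_weights_cat):
            processed_query = self.query_layer(tf.expand_dims(query, axis=1))
            processed_memory = self.memory_layer(memory)
            # (batch, channels, seq_len) -> (batch, seq_len, channels) for Conv1D
            attention_weights_cat = tf.transpose(attention_weights_cat, (0, 2, 1))
            processed_attention_weights = self.location_layer(attention_weights_cat)
            energies = tf.squeeze(self.V(tf.nn.tanh(
                processed_query + processed_attention_weights + processed_memory)), -1)
            return energies

        def call(self, attention_hidden_state, memory, attention_weights_cat):
            alignment = self.get_alignment_energies(attention_hidden_state, memory, attention_weights_cat)
            attention_weights = tf.nn.softmax(alignment, axis=1)
            attention_context = tf.expand_dims(attention_weights, 1)
            attention_context = tf.matmul(attention_context, memory)
            attention_context = tf.squeeze(attention_context, axis=1)
            return attention_context, attention_weights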

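Forms of Attention

For a summary of the forms Attention takes and the ways it gathers information, see the article "Attention用于NLP的一些小结" (a short summary of Attention in NLP). Below I list papers related to each form. I have not studied all of them closely; they are collected here for reference, and I will read them carefully when time allows or when I need the corresponding Attention.

Soft attention, global attention, dynamic attention

This is the most common form. A weight (probability) is computed for every key, so each key has its own weight and the computation is global (hence the name Global Attention). It is a principled approach, since the content of every key is taken into account before weighting, but the computational cost can be relatively high.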

Hard attention

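This form pinpoints exactly one key and ignores the rest, which is equivalent to giving that key probability 1 and every other key probability 0. It is therefore demanding on alignment: the model has to get it right in one step, and a wrong alignment has a large impact. In addition, because the selection is not differentiable, training usually relies on reinforcement learning (or on a relaxation such as Gumbel-softmax).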
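
As a rough illustration, the sketch below contrasts a hard selection with a Gumbel-softmax relaxation in plain Python. Two assumptions to keep in mind: the hard variant here simply takes the arg-max (a stochastic version would sample a key from the softmax distribution instead), and the function names and toy scores are hypothetical.

```python
import math
import random


def hard_attention(scores, values):
    """Select the single highest-scoring value: weight 1 for it, 0 for all others."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    weights = [1.0 if i == best else 0.0 for i in range(len(scores))]
    return weights, values[best]


def gumbel_softmax(scores, temperature, rng):
    """Relaxed one-hot sample: softmax((scores + Gumbel noise) / temperature)."""
    noisy = []
    for s in scores:
        # Clamp the uniform sample away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((s + gumbel) / temperature)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy scores for three keys.
scores = [0.1, 2.0, -1.0]
hard_weights, picked = hard_attention(scores, ["a", "b", "c"])
soft_sample = gumbel_softmax(scores, temperature=0.5, rng=random.Random(0))
```

Lower temperatures push the relaxed sample closer to a one-hot vector while keeping it differentiable with respect to the scores, which is why it can stand in for the hard selection during training.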

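Local Attention (half soft, half hard)

This form is a compromise between the two above: attention is computed over a window. A position is first located in the Hard manner, a window is taken around that point, and Attention is then computed in the Soft manner inside that small region.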
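
A minimal sketch of the windowed computation, in plain Python, following the Gaussian-weighted variant described in the Luong paper (softmax inside the window, multiplied by a Gaussian centered on the chosen position with $\sigma = D/2$). The function name and the toy inputs are hypothetical, and the center position is passed in directly rather than predicted by a network.

```python
import math


def local_attention_weights(scores, center, half_width):
    """Soft attention restricted to [center - D, center + D], biased toward the center."""
    lo = max(0, int(center) - half_width)
    hi = min(len(scores) - 1, int(center) + half_width)
    window = range(lo, hi + 1)

    # Softmax over the scores inside the window only.
    m = max(scores[s] for s in window)
    exps = {s: math.exp(scores[s] - m) for s in window}
    total = sum(exps.values())

    sigma = half_width / 2.0
    weights = [0.0] * len(scores)  # positions outside the window keep weight 0
    for s in window:
        gaussian = math.exp(-((s - center) ** 2) / (2 * sigma ** 2))
        weights[s] = exps[s] / total * gaussian
    return weights


# Hypothetical example: 7 source positions with equal scores, window of radius 2 around position 3.
weights = local_attention_weights([0.0] * 7, center=3, half_width=2)
```

With equal scores the softmax gives 0.2 to each of the five positions in the window, and the Gaussian then leaves the center at 0.2 while shrinking its neighbors, so positions near the chosen point dominate.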

A Context-aware Attention Network for Interactive Question Answering
Dynamic Attention Deep Model for Article Recommendation by Learning Human Editors’ Demonstration

Concatenation-based Attention

Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention
Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks
Enhancing Recurrent Neural Networks with Positional Attention for Question Answering
Learning to Generate Rock Descriptions from Multivariate Well Logs with Hierarchical Attention
Reasoning about Entailment with Neural Attention

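Static attention

A single $s_t$ shared across the whole output sentence is enough for the attention. Typically the first and last hidden-state outputs of a BiLSTM are concatenated to serve as $s_t$.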

Teaching Machines to Read and Comprehend
Supervised Sequence Labelling with Recurrent Neural Networks

Multi-layer Attention

A Context-aware Attention Network for Interactive Question Answering
Learning to Generate Rock Descriptions from Multivariate Well Logs with Hierarchical Attention
Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention
Leveraging Contextual Sentence Relations for Extractive Summarization Using a Neural Attention Model

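Closing remarks

Many variants have appeared since Attention was first proposed, but the classics remain Bahdanau Attention and Luong Attention, and many later mechanisms are refinements of these two. Anyone who has studied Attention will notice that what matters most is how the score is computed. The different score functions are summarized below: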

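Content-based attention:
$e_{ij}=score(s_{i-1}, h_j)=v_a^T\tanh(W_as_{i-1}+U_ah_j)$
Here $s_{i-1}$ is the decoder output at the previous time step (the decoder hidden state), $h_j$ is the $j$-th encoder hidden state, and $v_a$, $W_a$ and $U_a$ are trainable parameter tensors. Because $U_ah_j$ does not depend on the decoding step $i$, it can be computed once in advance. Content-based attention connects each output to the corresponding input elements regardless of their position.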
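Location-based attention:
$e_{ij}=score(a_{i-1}, h_j)=v_a^T\tanh(Wh_j+Uf_{i,j})$
Here $a_{i-1}$ is the previous attention weights and $f_{i,j}$ is the location feature obtained by convolving them, $f_i=F*a_{i-1}$; $v_a$, $W$, $U$ and $F$ are trainable parameters. Location-based attention only cares about the positions of the sequence elements and the distances between them. It tends to skip or shorten silences, because it does not look at the content of the input.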
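Hybrid attention:
$e_{ij}=score(s_{i-1},a_{i-1}, h_j)=v_a^T\tanh(Ws_{i-1}+Vh_j+Uf_{i,j})$
As the name suggests, hybrid attention combines the two mechanisms above. Here $s_{i-1}$ is the previous decoder hidden state, $a_{i-1}$ is the previous attention weights, and $h_j$ is the $j$-th encoder hidden state. Adding a bias $b$ gives the final score function:
$e_{ij}=v_a^T\tanh(Ws_{i-1}+Vh_j+Uf_{i,j}+b)$
Here $v_a$, $W$, $V$, $U$ and $b$ are trainable parameters, $s_{i-1}$ is the decoder hidden state at the previous time step, $h_j$ is the current encoder hidden state, and $f_{i,j}$ is the location feature obtained by convolving the previous attention weights $a_{i-1}$, $f_i=F*a_{i-1}$. Hybrid attention can take both the content and the position of the input elements into account.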
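
Two properties of the content-based score mentioned above can be checked with a short plain-Python sketch: $U_ah_j$ can be computed once and reused at every decoder step, and reordering the encoder states only reorders the scores. All matrices and vectors are hypothetical toy values, and `content_scores` is an illustrative helper name.

```python
import math


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def content_scores(s_prev, projected_states, W_a, v_a):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), with U_a h_j passed in precomputed."""
    ws = matvec(W_a, s_prev)
    return [
        sum(v * math.tanh(a + b) for v, a, b in zip(v_a, ws, uh))
        for uh in projected_states
    ]


# Hypothetical toy parameters and encoder states.
W_a = [[0.5, -0.2], [0.1, 0.3]]
U_a = [[1.0, 0.0], [0.0, -1.0]]
v_a = [1.0, 0.5]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# U_a h_j does not depend on the decoder step, so it is computed once ...
projected = [matvec(U_a, h) for h in encoder_states]

# ... and reused for every decoder state.
scores_step1 = content_scores([0.2, 0.1], projected, W_a, v_a)
scores_step2 = content_scores([-0.3, 0.4], projected, W_a, v_a)

# Reversing the encoder states reverses the scores and changes nothing else.
scores_reversed = content_scores([0.2, 0.1], projected[::-1], W_a, v_a)
```

The location and hybrid variants break this symmetry on purpose: once $Uf_{i,j}$ is added, the score of a position also depends on where attention was placed in earlier steps.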
