Natural Language Inference: Using Attention
In the previous section, we introduced the natural language inference task and the SNLI dataset. In view of many models that are based on complex and deep architectures, Parikh et al. proposed to address natural language inference with attention mechanisms and called it a "decomposable attention model" [Parikh et al., 2016]. This results in a model without recurrent or convolutional layers, achieving the best result on the SNLI dataset at the time with far fewer parameters. In this section, we will describe and implement this attention-based method (with MLPs) for natural language inference, as depicted in Fig. 1.
Fig. 1. This section feeds pretrained GloVe to an architecture based
on attention and MLPs for natural language inference.
- The Model
Simpler than preserving the order of words in the premise and the hypothesis, we can just align words in one text sequence with every word in the other, and vice versa, then compare and aggregate such information to predict the logical relationship between the premise and the hypothesis. Similar to word alignment between source and target sentences in machine translation, word alignment between the premise and the hypothesis can be accomplished by attention mechanisms.
Fig. 2. Natural language inference using attention mechanisms.
Fig. 2 depicts the natural language inference method using attention mechanisms. At a high level, it consists of three jointly trained steps: attending, comparing, and aggregating. We will illustrate them step by step below.
from d2l import mxnet as d2l
import mxnet as mx
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()
1.1. Attending
The first step is to align words in one text sequence with each word in the other sequence. Suppose that the premise is "i do need sleep" and the hypothesis is "i am tired". Due to their semantic similarity, we may align "i" in the hypothesis with "i" in the premise, and align "tired" in the hypothesis with "sleep" in the premise. Likewise, we may align "i" in the premise with "i" in the hypothesis, and align "need" and "sleep" in the premise with "tired" in the hypothesis. Note that such alignment is soft using weighted average, where ideally large weights are associated with the words to be aligned. For ease of demonstration, Fig. 2 shows such alignment in a hard way.
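To make "soft" alignment concrete, here is a tiny illustration (my own toy numbers, not part of the model): softmax turns a row of unnormalized alignment scores into weights that sum to one, and these weights serve as the coefficients of the weighted average. The mlp function defined next provides the transformation that the attending step applies to word embeddings before computing such scores.

# Hypothetical alignment scores between 2 premise words and 3 hypothesis words
e_toy = np.array([[2.0, 0.1, 0.3],
                  [0.2, 1.5, 1.8]])
weights = npx.softmax(e_toy)  # normalize over the hypothesis words (last axis)
print(weights)                # each row sums to 1: a soft, not hard, alignment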
def mlp(num_hiddens, flatten):
    net = nn.Sequential()
    net.add(nn.Dropout(0.2))
    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))
    net.add(nn.Dropout(0.2))
    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))
    return net
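As a quick shape check (a sketch added here, with arbitrary sizes): with flatten=False the MLP is applied position-wise and keeps the word axis, while with flatten=True the input is flattened before the first dense layer. This distinction matters because the attending and comparing steps use flatten=False, whereas the aggregating step uses flatten=True.

X_toy = np.random.uniform(size=(2, 5, 4))  # (batch_size, #words, embed_size)
net_seq = mlp(num_hiddens=8, flatten=False)
net_seq.initialize()
print(net_seq(X_toy).shape)   # (2, 5, 8): one hidden vector per word
net_flat = mlp(num_hiddens=8, flatten=True)
net_flat.initialize()
print(net_flat(X_toy).shape)  # (2, 8): one hidden vector per example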
class Attend(nn.Block):
    def __init__(self, num_hiddens, **kwargs):
        super(Attend, self).__init__(**kwargs)
        self.f = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B):
        # Shape of A/B: (batch_size, #words in sequence A/B, embed_size)
        # Shape of f_A/f_B: (batch_size, #words in sequence A/B, num_hiddens)
        f_A = self.f(A)
        f_B = self.f(B)
        # Shape of e: (batch_size, #words in sequence A, #words in sequence B)
        e = npx.batch_dot(f_A, f_B, transpose_b=True)
        # Shape of beta: (batch_size, #words in sequence A, embed_size), where
        # sequence B is softly aligned with each word (axis 1 of beta) in
        # sequence A
        beta = npx.batch_dot(npx.softmax(e), B)
        # Shape of alpha: (batch_size, #words in sequence B, embed_size),
        # where sequence A is softly aligned with each word (axis 1 of alpha)
        # in sequence B
        alpha = npx.batch_dot(npx.softmax(e.transpose(0, 2, 1)), A)
        return beta, alpha
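Below is a small shape check of Attend with random inputs (the sizes are my own choice): beta softly aligns sequence B with each word in A, and alpha softly aligns A with each word in B.

embed_size, num_hiddens = 100, 200
A_toy = np.random.uniform(size=(4, 10, embed_size))  # 4 premises, 10 words each
B_toy = np.random.uniform(size=(4, 12, embed_size))  # 4 hypotheses, 12 words each
attend_toy = Attend(num_hiddens)
attend_toy.initialize()
beta_toy, alpha_toy = attend_toy(A_toy, B_toy)
print(beta_toy.shape, alpha_toy.shape)  # (4, 10, 100) (4, 12, 100)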
1.2. Comparing
In the next step, we compare a word in one sequence with the other sequence that is softly aligned with that word. Note that in soft alignment, all the words from one sequence, though probably with different attention weights, will be compared with a word in the other sequence. For ease of demonstration, Fig. 2 pairs words with aligned words in a hard way. For example, suppose that the attending step determines that "need" and "sleep" in the premise are both aligned with "tired" in the hypothesis; then the pair "tired--need sleep" will be compared.
In the comparing step, we feed the concatenation (operator [·,·]) of words from one sequence and aligned words from the other sequence into a function g (a multilayer perceptron):
class Compare(nn.Block):
    def __init__(self, num_hiddens, **kwargs):
        super(Compare, self).__init__(**kwargs)
        self.g = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B, beta, alpha):
        V_A = self.g(np.concatenate([A, beta], axis=2))
        V_B = self.g(np.concatenate([B, alpha], axis=2))
        return V_A, V_B
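Continuing the shape check above (same toy tensors), the comparing step produces one comparison vector per word in each sequence:

compare_toy = Compare(num_hiddens)
compare_toy.initialize()
V_A_toy, V_B_toy = compare_toy(A_toy, B_toy, beta_toy, alpha_toy)
print(V_A_toy.shape, V_B_toy.shape)  # (4, 10, 200) (4, 12, 200)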
1.3. Aggregating
Now we have two sets of comparison vectors. In the last step, we aggregate this information to infer the logical relationship: we sum up each set over the word axis, then feed the concatenation of both summaries into a multilayer perceptron h to obtain the classification result.
class Aggregate(nn.Block):
    def __init__(self, num_hiddens, num_outputs, **kwargs):
        super(Aggregate, self).__init__(**kwargs)
        self.h = mlp(num_hiddens=num_hiddens, flatten=True)
        self.h.add(nn.Dense(num_outputs))

    def forward(self, V_A, V_B):
        # Sum up both sets of comparison vectors
        V_A = V_A.sum(axis=1)
        V_B = V_B.sum(axis=1)
        # Feed the concatenation of both summarization results into an MLP
        Y_hat = self.h(np.concatenate([V_A, V_B], axis=1))
        return Y_hat
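Completing the toy shape check, aggregation sums the comparison vectors over the word axis and maps the concatenated summaries to the three output classes:

aggregate_toy = Aggregate(num_hiddens, 3)
aggregate_toy.initialize()
print(aggregate_toy(V_A_toy, V_B_toy).shape)  # (4, 3)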
1.4. Putting All Things Together
By putting the attending, comparing, and aggregating steps together, we define the decomposable attention model to jointly train these three steps.
class DecomposableAttention(nn.Block):
    def __init__(self, vocab, embed_size, num_hiddens, **kwargs):
        super(DecomposableAttention, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_size)
        self.attend = Attend(num_hiddens)
        self.compare = Compare(num_hiddens)
        # There are 3 possible outputs: entailment, contradiction, and neutral
        self.aggregate = Aggregate(num_hiddens, 3)

    def forward(self, X):
        premises, hypotheses = X
        A = self.embedding(premises)
        B = self.embedding(hypotheses)
        beta, alpha = self.attend(A, B)
        V_A, V_B = self.compare(A, B, beta, alpha)
        Y_hat = self.aggregate(V_A, V_B)
        return Y_hat
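As a final sanity sketch before training on real data (using a made-up stand-in vocabulary), the full model maps a minibatch of (premises, hypotheses) token indices to three scores per example:

toy_vocab = ['<unk>', 'i', 'need', 'sleep', 'am', 'tired']  # stand-in only
net_toy = DecomposableAttention(toy_vocab, embed_size=100, num_hiddens=200)
net_toy.initialize()
premises_toy = np.array([[1, 2, 3]])    # "i need sleep"
hypotheses_toy = np.array([[1, 4, 5]])  # "i am tired"
print(net_toy([premises_toy, hypotheses_toy]).shape)  # (1, 3)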
- Training and Evaluating the Model
Now we will train and evaluate the defined decomposable attention model on the SNLI dataset. We begin by reading the dataset.
2.1. Reading the Dataset
We download and read the SNLI dataset. The batch size and sequence length are set to 256 and 50, respectively.
batch_size, num_steps = 256, 50
train_iter, test_iter, vocab = d2l.load_data_snli(batch_size, num_steps)
read 549367 examples
read 9824 examples
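To confirm the minibatch layout (as far as I understand load_data_snli, each batch yields the padded premise and hypothesis index arrays plus the labels), we can peek at one batch:

for X, Y in train_iter:
    print(X[0].shape, X[1].shape, Y.shape)  # expected: (256, 50) (256, 50) (256,)
    break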
2.2. Creating the Model
The following creates the decomposable attention model and initializes its token embedding layer with pretrained 100-dimensional GloVe vectors (glove.6b.100d).
embed_size, num_hiddens, ctx = 100, 200, d2l.try_all_gpus()
net = DecomposableAttention(vocab, embed_size, num_hiddens)
net.initialize(init.Xavier(), ctx=ctx)
glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
net.embedding.weight.set_data(embeds)
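As a quick check (added here), the copied matrix should contain one 100-dimensional GloVe vector per vocabulary token; freezing these vectors is an optional variation rather than part of the recipe above.

print(embeds.shape)  # (len(vocab), 100)
# Optionally (not done in this section), the pretrained vectors could be
# frozen so that they are not updated during training:
# net.embedding.collect_params().setattr('grad_req', 'null')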
2.3. Training and Evaluating the Model
In contrast to the split_batch function, which takes a single input such as a text sequence (or an image), here we define a split_batch_multi_inputs function to take multiple inputs, such as premises and hypotheses, in a minibatch.
#@save
def split_batch_multi_inputs(X, y, ctx_list):
    """Split multi-input X and y into multiple devices specified by ctx."""
    X = list(zip(*[gluon.utils.split_and_load(
        feature, ctx_list, even_split=False) for feature in X]))
    return (X, gluon.utils.split_and_load(y, ctx_list, even_split=False))
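As a small illustration (my own sketch), splitting one minibatch across the devices in ctx yields one (premise, hypothesis) shard and one label shard per device:

for X, Y in train_iter:
    X_shards, Y_shards = split_batch_multi_inputs(X, Y, ctx)
    print(len(X_shards), len(Y_shards))  # one shard per device in ctx
    break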
Now we can train and evaluate the model on the SNLI dataset.
lr, num_epochs = 0.001, 4
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, ctx,
split_batch_multi_inputs)
loss 0.511, train acc 0.798, test acc 0.823
9402.9 examples/sec on [gpu(0), gpu(1)]
2.4. Using the Model
Finally, we define the prediction function to output the logical relationship between a pair of premise and hypothesis.
#@save
def predict_snli(net, vocab, premise, hypothesis):
    premise = np.array(vocab[premise], ctx=d2l.try_gpu())
    hypothesis = np.array(vocab[hypothesis], ctx=d2l.try_gpu())
    label = np.argmax(net([premise.reshape((1, -1)),
                           hypothesis.reshape((1, -1))]), axis=1)
    return 'entailment' if label == 0 else 'contradiction' if label == 1 \
        else 'neutral'
With the trained model, we can obtain the natural language inference result for a sample pair of sentences.
predict_snli(net, vocab, ['he', 'is', 'good', '.'], ['he', 'is', 'bad', '.'])
'contradiction'
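Any other tokenized premise/hypothesis pair can be queried the same way; the pair below is my own example, and the label it returns depends on the trained weights.

predict_snli(net, vocab, ['a', 'man', 'is', 'sleeping', '.'],
             ['a', 'man', 'is', 'running', '.'])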
- Summary
· The decomposable attention model consists of three steps for predicting the logical relationships between premises and hypotheses: attending, comparing, and aggregating.
· With attention mechanisms, we can align words in one text sequence to every word in the other, and vice versa. Such alignment is soft using weighted average, where ideally large weights are associated with the words to be aligned.
· The decomposition trick leads to a more desirable linear complexity than quadratic complexity when computing attention weights.
· We can use pretrained word embeddings as the input representation for downstream natural language processing tasks such as natural language inference.