Natural Language Inference: Using Attention
In the previous section, we introduced the natural language inference task and the SNLI dataset. In view of many models that are based on complex and deep architectures, Parikh et al. proposed to address natural language inference with attention mechanisms and called it a "decomposable attention model" [Parikh et al., 2016]. This results in a model without recurrent or convolutional layers, achieving the best result on the SNLI dataset at the time with far fewer parameters. In this section, we will describe and implement this attention-based method (with MLPs) for natural language inference, as depicted in Fig. 1.
Fig. 1. This section feeds pretrained GloVe to an architecture based
on attention and MLPs for natural language inference.
- The Model
Simpler than preserving the order of words in the premise and the hypothesis, we can just align words in one text sequence with every word in the other, and vice versa, then compare and aggregate such information to predict the logical relationship between the premise and the hypothesis. Similar to word alignment between source and target sentences in machine translation, word alignment between the premise and the hypothesis can be accomplished by attention mechanisms.
Fig. 2. Natural language inference using attention mechanisms.
Fig. 2 depicts the natural language inference method using attention mechanisms. At a high level, it consists of three jointly trained steps: attending, comparing, and aggregating. We will illustrate them step by step below.
from d2l import mxnet as d2l
import mxnet as mx
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()
1.1. Attending
The first step is to align words in one text sequence with each word in the other sequence. Suppose that the premise is "i do need sleep" and the hypothesis is "i am tired". Due to their semantic similarity, we may align "i" in the hypothesis with "i" in the premise, and align "tired" in the hypothesis with "sleep" in the premise. Likewise, we may align "i" in the premise with "i" in the hypothesis, and align "need" and "sleep" in the premise with "tired" in the hypothesis. Note that such alignment is soft using weighted average, where ideally large weights are associated with the words to be aligned. For ease of demonstration, Fig. 2 shows such alignment in a hard way.
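To make "soft" alignment concrete, here is a tiny illustration (my own toy numbers, not part of the model): softmax turns a row of unnormalized alignment scores into weights that sum to one, and these weights serve as the coefficients of the weighted average. The mlp function defined next provides the transformation that the attending step applies to word embeddings before computing such scores.

# Hypothetical alignment scores between 2 premise words and 3 hypothesis words
e_toy = np.array([[2.0, 0.1, 0.3],
                  [0.2, 1.5, 1.8]])
weights = npx.softmax(e_toy)  # normalize over the hypothesis words (last axis)
print(weights)                # each row sums to 1: a soft, not hard, alignment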
def mlp(num_hiddens, flatten):
    net = nn.Sequential()
    net.add(nn.Dropout(0.2))
    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))
    net.add(nn.Dropout(0.2))
    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))
    return net
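As a quick shape check (a sketch added here, with arbitrary sizes): with flatten=False the MLP is applied position-wise and keeps the word axis, while with flatten=True the input is flattened before the first dense layer. This distinction matters because the attending and comparing steps use flatten=False, whereas the aggregating step uses flatten=True.

X_toy = np.random.uniform(size=(2, 5, 4))  # (batch_size, #words, embed_size)
net_seq = mlp(num_hiddens=8, flatten=False)
net_seq.initialize()
print(net_seq(X_toy).shape)   # (2, 5, 8): one hidden vector per word
net_flat = mlp(num_hiddens=8, flatten=True)
net_flat.initialize()
print(net_flat(X_toy).shape)  # (2, 8): one hidden vector per example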
class Attend(nn.Block):
    def __init__(self, num_hiddens, **kwargs):
        super(Attend, self).__init__(**kwargs)
        self.f = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B):
        # Shape of A/B: (batch_size, #words in sequence A/B, embed_size)
        # Shape of f_A/f_B: (batch_size, #words in sequence A/B, num_hiddens)
        f_A = self.f(A)
        f_B = self.f(B)
        # Shape of e: (batch_size, #words in sequence A, #words in sequence B)
        e = npx.batch_dot(f_A, f_B, transpose_b=True)
        # Shape of beta: (batch_size, #words in sequence A, embed_size), where
        # sequence B is softly aligned with each word (axis 1 of beta) in
        # sequence A
        beta = npx.batch_dot(npx.softmax(e), B)
        # Shape of alpha: (batch_size, #words in sequence B, embed_size),
        # where sequence A is softly aligned with each word (axis 1 of alpha)
        # in sequence B
        alpha = npx.batch_dot(npx.softmax(e.transpose(0, 2, 1)), A)
        return beta, alpha
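Below is a small shape check of Attend with random inputs (the sizes are my own choice): beta softly aligns sequence B with each word in A, and alpha softly aligns A with each word in B.

embed_size, num_hiddens = 100, 200
A_toy = np.random.uniform(size=(4, 10, embed_size))  # 4 premises, 10 words each
B_toy = np.random.uniform(size=(4, 12, embed_size))  # 4 hypotheses, 12 words each
attend_toy = Attend(num_hiddens)
attend_toy.initialize()
beta_toy, alpha_toy = attend_toy(A_toy, B_toy)
print(beta_toy.shape, alpha_toy.shape)  # (4, 10, 100) (4, 12, 100)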
1.2. Comparing
In the next step, we compare a word in one sequence with the other sequence that is softly aligned with that word. Note that in soft alignment, all the words from one sequence, though probably with different attention weights, will be compared with a word in the other sequence. For ease of demonstration, Fig. 2 pairs words with aligned words in a hard way. For example, suppose that the attending step determines that "need" and "sleep" in the premise are both aligned with "tired" in the hypothesis; then the pair "tired--need sleep" will be compared.
In the comparing step, we feed the concatenation (operator [·,·]) of words from one sequence and aligned words from the other sequence into a function g (a multilayer perceptron):
class Compare(nn.Block):
    def __init__(self, num_hiddens, **kwargs):
        super(Compare, self).__init__(**kwargs)
        self.g = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B, beta, alpha):
        V_A = self.g(np.concatenate([A, beta], axis=2))
        V_B = self.g(np.concatenate([B, alpha], axis=2))
        return V_A, V_B
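Continuing the shape check above (same toy tensors), the comparing step produces one comparison vector per word in each sequence:

compare_toy = Compare(num_hiddens)
compare_toy.initialize()
V_A_toy, V_B_toy = compare_toy(A_toy, B_toy, beta_toy, alpha_toy)
print(V_A_toy.shape, V_B_toy.shape)  # (4, 10, 200) (4, 12, 200)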
1.3. Aggregating
Now we have two sets of comparison vectors. In the last step, we aggregate this information to infer the logical relationship: we sum up each set over the word axis, then feed the concatenation of both summaries into a multilayer perceptron h to obtain the classification result.
class Aggregate(nn.Block):
    def __init__(self, num_hiddens, num_outputs, **kwargs):
        super(Aggregate, self).__init__(**kwargs)
        self.h = mlp(num_hiddens=num_hiddens, flatten=True)
        self.h.add(nn.Dense(num_outputs))

    def forward(self, V_A, V_B):
        # Sum up both sets of comparison vectors
        V_A = V_A.sum(axis=1)
        V_B = V_B.sum(axis=1)
        # Feed the concatenation of both summarization results into an MLP
        Y_hat = self.h(np.concatenate([V_A, V_B], axis=1))
        return Y_hat
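Completing the toy shape check, aggregation sums the comparison vectors over the word axis and maps the concatenated summaries to the three output classes:

aggregate_toy = Aggregate(num_hiddens, 3)
aggregate_toy.initialize()
print(aggregate_toy(V_A_toy, V_B_toy).shape)  # (4, 3)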
1.4. Putting All Things Together
By putting the attending, comparing, and aggregating steps together, we define the decomposable attention model to jointly train these three steps.
class DecomposableAttention(nn.Block):
    def __init__(self, vocab, embed_size, num_hiddens, **kwargs):
        super(DecomposableAttention, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_size)
        self.attend = Attend(num_hiddens)
        self.compare = Compare(num_hiddens)
        # There are 3 possible outputs: entailment, contradiction, and neutral
        self.aggregate = Aggregate(num_hiddens, 3)

    def forward(self, X):
        premises, hypotheses = X
        A = self.embedding(premises)
        B = self.embedding(hypotheses)
        beta, alpha = self.attend(A, B)
        V_A, V_B = self.compare(A, B, beta, alpha)
        Y_hat = self.aggregate(V_A, V_B)
        return Y_hat
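As a final sanity sketch before training on real data (using a made-up stand-in vocabulary), the full model maps a minibatch of (premises, hypotheses) token indices to three scores per example:

toy_vocab = ['<unk>', 'i', 'need', 'sleep', 'am', 'tired']  # stand-in only
net_toy = DecomposableAttention(toy_vocab, embed_size=100, num_hiddens=200)
net_toy.initialize()
premises_toy = np.array([[1, 2, 3]])    # "i need sleep"
hypotheses_toy = np.array([[1, 4, 5]])  # "i am tired"
print(net_toy([premises_toy, hypotheses_toy]).shape)  # (1, 3)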
- Training and Evaluating the Model
Now we will train and evaluate the defined decomposable attention model on the SNLI dataset. We begin by reading the dataset.
2.1. Reading the Dataset
We download and read the SNLI dataset. The batch size and sequence length are set to 256 and 50, respectively.
batch_size, num_steps = 256, 50
train_iter, test_iter, vocab = d2l.load_data_snli(batch_size, num_steps)
read 549367 examples
read 9824 examples
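To confirm the minibatch layout (as far as I understand load_data_snli, each batch yields the padded premise and hypothesis index arrays plus the labels), we can peek at one batch:

for X, Y in train_iter:
    print(X[0].shape, X[1].shape, Y.shape)  # expected: (256, 50) (256, 50) (256,)
    break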
2.2. Creating the Model
The following creates the decomposable attention model and initializes its token embedding layer with pretrained 100-dimensional GloVe vectors (glove.6b.100d).
embed_size, num_hiddens, ctx = 100, 200, d2l.try_all_gpus()
net = DecomposableAttention(vocab, embed_size, num_hiddens)
net.initialize(init.Xavier(), ctx=ctx)
glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
net.embedding.weight.set_data(embeds)
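As a quick check (added here), the copied matrix should contain one 100-dimensional GloVe vector per vocabulary token; freezing these vectors is an optional variation rather than part of the recipe above.

print(embeds.shape)  # (len(vocab), 100)
# Optionally (not done in this section), the pretrained vectors could be
# frozen so that they are not updated during training:
# net.embedding.collect_params().setattr('grad_req', 'null')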
2.3. Training and Evaluating the Model
In contrast to the split_batch function, which takes a single input such as a text sequence (or an image), here we define a split_batch_multi_inputs function to take multiple inputs, such as premises and hypotheses, in a minibatch.
#@save
def split_batch_multi_inputs(X, y, ctx_list):
    """Split multi-input X and y into multiple devices specified by ctx."""
    X = list(zip(*[gluon.utils.split_and_load(
        feature, ctx_list, even_split=False) for feature in X]))
    return (X, gluon.utils.split_and_load(y, ctx_list, even_split=False))
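As a small illustration (my own sketch), splitting one minibatch across the devices in ctx yields one (premise, hypothesis) shard and one label shard per device:

for X, Y in train_iter:
    X_shards, Y_shards = split_batch_multi_inputs(X, Y, ctx)
    print(len(X_shards), len(Y_shards))  # one shard per device in ctx
    break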
Now we can train and evaluate the model on the SNLI dataset.
lr, num_epochs = 0.001, 4
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, ctx,
split_batch_multi_inputs)
loss 0.511, train acc 0.798, test acc 0.823
9402.9 examples/sec on [gpu(0), gpu(1)]
2.4. Using the Model
Finally, we define the prediction function to output the logical relationship between a pair of premise and hypothesis.
#@save
def predict_snli(net, vocab, premise, hypothesis):
    premise = np.array(vocab[premise], ctx=d2l.try_gpu())
    hypothesis = np.array(vocab[hypothesis], ctx=d2l.try_gpu())
    label = np.argmax(net([premise.reshape((1, -1)),
                           hypothesis.reshape((1, -1))]), axis=1)
    return 'entailment' if label == 0 else 'contradiction' if label == 1 \
        else 'neutral'
With the trained model, we can obtain the natural language inference result for a sample pair of sentences.
predict_snli(net, vocab, ['he', 'is', 'good', '.'], ['he', 'is', 'bad', '.'])
'contradiction'
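Any other tokenized premise/hypothesis pair can be queried the same way; the pair below is my own example, and the label it returns depends on the trained weights.

predict_snli(net, vocab, ['a', 'man', 'is', 'sleeping', '.'],
             ['a', 'man', 'is', 'running', '.'])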
- Summary
· The decomposable attention model consists of three steps for predicting the logical relationships between premises and hypotheses: attending, comparing, and aggregating.
· With attention mechanisms, we can align words in one text sequence to every word in the other, and vice versa. Such alignment is soft using weighted average, where ideally large weights are associated with the words to be aligned.
· The decomposition trick leads to a more desirable linear complexity than quadratic complexity when computing attention weights.
· We can use pretrained word embeddings as the input representation for downstream natural language processing tasks such as natural language inference.