[Spark NLP] Chapter 15: Chatbots

Table of Contents

Problem Statement and Constraints
Planning the Project
Designing the Solution
Implementing the Solution
Testing and Measuring the Solution
Business Metrics
Model-Centric Metrics
Review
Conclusion


When we discussed language models, we showed how to generate text. Building a chatbot is similar, except that we are modeling an exchange. This can make our requirements more complicated, or actually simpler, depending on how we want to approach the problem.

In this chapter we will talk about some of the ways we can model this, and then we will build a program that uses a generative model to take input and generate a response. First, let's talk about what a discourse is.

Morphology and syntax tell us how morphemes are combined into words, and how words are combined into phrases and sentences. Combining sentences into larger acts of language is not as easy to model. There is a notion of inappropriate combinations of sentences. Let's look at some examples:

I went to the doctor, yesterday. It is just a sprained ankle.
I went to the doctor, yesterday. Mosquitoes have 47 teeth.

In the first example, the second sentence is clearly related to the first. From the two sentences, combined with common sense, we can infer that the speaker went to the doctor because of an ankle problem, which turned out to be a sprain. The second example does not make sense. From a linguistic point of view, sentences are generated from concepts and then encoded into words and phrases. The concepts expressed by a sequence of sentences are connected, so the sentences should be tied together by related concepts. This is true whether there is one speaker or several in the conversation.

The pragmatics of the discourse is important to understanding how to model it. If we are modeling a customer service exchange, the range of responses may be restricted. These restricted types of responses are often called intents. When building a customer service chatbot, this greatly reduces the potential complexity. If we are modeling general conversation, the problem becomes much harder. Language models learn what can occur in a sequence, but they cannot learn to generate concepts. So our options are either to build some model that simulates likely sequences, or to find a way to cheat.

We can cheat by building canned responses for unrecognized intents. For example, if a user makes a statement that our simple model does not expect, we can have it respond, "Sorry, I don't understand." If we are logging the conversations, we can use the exchanges that fell back to canned responses to expand the set of intents we cover.
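As a toy illustration (not part of this chapter's project; the intents and the keyword rule are made up), a canned-response fallback can be as simple as:

# A minimal sketch of intent matching with a canned fallback.
# The intents and the naive keyword rule are hypothetical examples;
# a real system would use a trained intent classifier.
INTENTS = {
    'refund': 'I can help you with a refund. What is your order number?',
    'hours': 'We are open 9am-5pm, Monday through Friday.',
}
FALLBACK = "Sorry, I don't understand."

def respond(utterance):
    tokens = utterance.lower().split()
    for intent, canned in INTENTS.items():
        if intent in tokens:
            return canned
    # Unrecognized utterances get the canned fallback; logging them
    # lets us turn common misses into new intents later.
    return FALLBACK

print(respond('I want a refund'))   # refund intent
print(respond('Do you sell hats?')) # falls back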

In the example we cover here, we will build a program that purely models the whole discourse. Essentially, it is a language model; the difference is in how we use it.

This chapter differs from the previous chapters in that it does not use Spark. Spark is great for processing large amounts of data in batch, but it is not well suited to interactive applications. Also, recurrent neural networks can take a long time to train on large amounts of data, so in this chapter we work with a small data set. If you have the right hardware, you can change the NLTK processing to use Spark NLP.

Problem Statement and Constraints

We will build a story-building tool. The idea is to help someone write an original story similar to the stories in Grimms' Fairy Tales. This model will be much more complex than our previous language models, in the sense that it contains many more parameters. The program will be a script that asks for a sentence and generates a new one. The user then takes that sentence, modifies and corrects it, and enters it.

  • 我們試圖解決的問題是什么?

    我們需要一個系統來推薦故事中的下一個句子。我們還必須認識到文本生成技術的局限性。我們需要讓用戶參與循環。所以我們需要一個可以生成相關文本的模型和一個可以讓我們查看輸出的系統。

  • What constraints are there?

    First, we need a model with two notions of context: the previous sentence and the current sentence. We do not need to worry much about performance, because the program will be interacting with a human. This may seem counterintuitive, since most interactive systems need rather low latency. However, if you consider what this program produces, waiting one to three seconds for a response is not unreasonable.

  • How do we solve the problem with the constraints?

    We will build a neural network for generating text, specifically an RNN, as described in Chapters 4 and 8. We could learn word embeddings in this model, but instead we will use prebuilt embeddings, which will help us train the model more quickly.

Planning the Project

Most of the work in this project will go into developing the model. Once we have a model, we will build a simple script that we can use to write our own Grimm-style fairy tale. Once the script is developed, the model could potentially be used to power a Twitter bot or a Slackbot.

In a real production environment for text generation, we would want to monitor the quality of the generated text. That would let us improve the generated text over time by developing more focused training data.

Designing the Solution

If you recall our language model, we used the following layers:

1. Input
2. Embedding
3. LSTM
4. Dense output

We input fixed-size windows of characters and predict the next character. Now we need to find a way to take larger sections of text into account. There are a couple of options.

Many RNN architectures include a layer for learning word embeddings. That would only give us more parameters to learn, so we will use a pretrained GloVe model instead. Also, we will build this model at the token level, instead of at the character level as before.

We could make the window size much larger than the average sentence. That has the benefit of keeping the same model architecture. The drawback is that our LSTM layer would have to maintain information over long distances. Instead, we can use an architecture of the kind used for machine translation.

Let's first consider the concatenation approach.

1. Context input
2. Context LSTM
3. Current input
4. Current LSTM
5. Concatenate 2 and 4
6. Dense output

The current input will be windows over the sentence, so for every window of a given sentence we will use the same context vector. The benefit of this approach is that it can scale to multiple context sentences. The drawback is that the model must learn to balance information from near and far. A Keras sketch of this variant is shown below.

Now let's consider the stateful approach.

1. Context input
2. Context LSTM
3. Current input
4. Current LSTM, initialized with the state of 2
5. Dense output

This helps make training easier by reducing the influence of the previous sentence. It is a double-edged sword, however, because the context then gives us less information. We will use this approach.

Implementing the Solution

Let's start with the imports. This chapter relies on Keras.

from collections import Counter
import pickle as pkl

import nltk
import numpy as np
import pandas as pd

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, CuDNNLSTM
from keras.layers.merge import Concatenate
import keras.utils as ku
import keras.preprocessing as kp

import tensorflow as tf

np.random.seed(1)
tf.set_random_seed(2)

Let's also define some special tokens for the beginnings and ends of sentences, and for unknown tokens.

START = '>'
END = '###'
UNK = '???'

Now we can load the data. We need to replace some special characters.

with open('grimms_fairytales.txt', encoding='UTF-8') as fp:
    text = fp.read()

text = text\
    .replace('\t', ' ')\
    .replace('“', '"')\
    .replace('”', '"')\
    .replace('‘', "'")\
    .replace('’', "'")

Now we can process our text into tokenized sentences.

sentences = nltk.tokenize.sent_tokenize(text)
sentences = [s.strip() for s in sentences]
sentences = [
    [t.lower() for t in nltk.tokenize.wordpunct_tokenize(s)]
    for s in sentences
]

word_counts = Counter([t for s in sentences for t in s])
word_counts = pd.Series(word_counts)
vocab = [START, END, UNK] + list(sorted(word_counts.index))

We need to define some hyperparameters for our model:

    • dim is the size of the token embeddings
    • w is the size of the window we will use
    • max_len is the sentence length we will use
    • units is the size of the state vectors for the LSTMs

dim = 50
w = 10
max_len = int(np.quantile([len(s) for s in sentences], 0.95))
units = 200

Now let's load the GloVe embeddings.

glove = {}
with open('glove.6B/glove.6B.50d.txt', encoding='utf-8') as fp:
    for line in fp:
        token, embedding = line.split(maxsplit=1)
        if token in vocab:
            embedding = np.fromstring(embedding, 'f', sep=' ')
            glove[token] = embedding

# START, END, and UNK do not occur in GloVe, so give them random
# vectors (an assumed step; without it the embedding lookups below
# would fail for the special tokens).
for token in [START, END, UNK]:
    glove[token] = np.random.normal(size=(dim,))

vocab = list(sorted(glove.keys()))
vocab_size = len(vocab)

We also need lookups for the one-hot-encoded outputs.

i2t = dict(enumerate(vocab))
t2i = {t: i for i, t in i2t.items()}

token_oh = ku.to_categorical(np.arange(vocab_size))
token_oh = {t: token_oh[i, :] for t, i in t2i.items()}

Now we can define some utility functions.

We need to pad the ends of sentences; otherwise, we cannot learn from the last words of a sentence.

def pad_sentence(sentence, length):
    sentence = sentence[:length]
    if len(sentence) < length:
        sentence += [END] * (length - len(sentence))
    return sentence

We also need to convert sentences into matrices.

def sent2mat(sentence, embedding):
    mat = [embedding.get(t, embedding[UNK]) for t in sentence]
    return np.array(mat)

We need a function that converts a sequence into a sequence of sliding windows; a quick sanity check of these three utilities follows the next code block.

def slide_seq(seq, w):
    window = []
    target = []
    for i in range(len(seq) - w - 1):
        window.append(seq[i:i+w])
        target.append(seq[i+w])
    return window, target
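As a quick sanity check (illustrative only; the toy inputs are made up, and it assumes the toy tokens appear in the GloVe vocabulary):

# Illustrative only: a toy sentence to sanity-check the utilities.
toy = ['the', 'fox', 'ran', 'home']
print(pad_sentence(toy, 6))
# ['the', 'fox', 'ran', 'home', '###', '###']
print(sent2mat(toy, glove).shape)
# (4, 50) -- one 50-dimensional GloVe vector per token
print(slide_seq(['a', 'b', 'c', 'd', 'e'], 2))
# ([['a', 'b'], ['b', 'c']], ['c', 'd'])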

Now we can build our input matrices. We will have two: one for the context and one for the current sentence.

Xc = []
Xi = []
Y = []

for i in range(len(sentences) - 1):
    context_sentence = pad_sentence(sentences[i], max_len)
    xc = sent2mat(context_sentence, glove)
    input_sentence = [START] * (w-1) + sentences[i+1] + [END] * (w-1)
    for window, target in zip(*slide_seq(input_sentence, w)):
        xi = sent2mat(window, glove)
        y = token_oh.get(target, token_oh[UNK])
        Xc.append(np.copy(xc))
        Xi.append(xi)
        Y.append(y)

Xc = np.array(Xc)
Xi = np.array(Xi)
Y = np.array(Y)

print('context sentence: ', xc.shape)
print('input sentence: ', xi.shape)
print('target sentence: ', y.shape)

context sentence:  (42, 50)
input sentence:  (10, 50)
target sentence:  (4407,)

Now let's build our model.

input_c = Input(shape=(max_len, dim,), dtype='float32')
lstm_c, h, c = LSTM(units, return_state=True)(input_c)

input_i = Input(shape=(w, dim,), dtype='float32')
lstm_i = LSTM(units)(input_i, initial_state=[h, c])

out = Dense(vocab_size, activation='softmax')(lstm_i)

model = Model(inputs=[input_c, input_i], outputs=[out])
print(model.summary())

Model: "model_1"
__________________________________________________________________________
Layer (type)           Output Shape          Param #   Connected to
==========================================================================
input_1 (InputLayer)   (None, 42, 50)        0
__________________________________________________________________________
input_2 (InputLayer)   (None, 10, 50)        0
__________________________________________________________________________
lstm_1 (LSTM)          [(None, 200), (None,  200800    input_1[0][0]
__________________________________________________________________________
lstm_2 (LSTM)          (None, 200)           200800    input_2[0][0]
                                                       lstm_1[0][1]
                                                       lstm_1[0][2]
__________________________________________________________________________
dense_1 (Dense)        (None, 4407)          885807    lstm_2[0][0]
==========================================================================
Total params: 1,287,407
Trainable params: 1,287,407
Non-trainable params: 0
__________________________________________________________________________
None

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

Now we can train our model. Depending on your hardware, this can take about four minutes per epoch on a CPU. This is our most complicated model yet, with nearly 1.3 million parameters. A minimal fit call is shown below, followed by the training log.
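The fit call itself did not survive in the scraped text; a minimal call consistent with the data built above and the log below would be (the batch size is an assumption):

# Assumed training call: 10 epochs matches the log below; the batch
# size is a guess, as the original call was lost.
model.fit([Xc, Xi], Y, epochs=10, batch_size=128)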

Epoch 1/10
145061/145061 [==============================] - 241s 2ms/step - loss: 3.7840 - accuracy: 0.3894
...
Epoch 10/10
145061/145061 [==============================] - 244s 2ms/step - loss: 1.8933 - accuracy: 0.5645

Once we have trained the model, we can try generating some sentences. This function takes a context sentence and an input sentence; we can simply supply a single word to start with. The function appends tokens to the input sentence until the END token is generated or the maximum allowed length is reached.

def generate_sentence(context_sentence, input_sentence, max_len=100):
    context_sentence = [
        t.lower()
        for t in nltk.tokenize.wordpunct_tokenize(context_sentence)]
    context_sentence = pad_sentence(context_sentence, max_len)
    context_vector = sent2mat(context_sentence, glove)
    input_sentence = [
        t.lower()
        for t in nltk.tokenize.wordpunct_tokenize(input_sentence)]
    input_sentence = [START] * (w-1) + input_sentence
    input_sentence = input_sentence[:w]
    output_sentence = input_sentence
    input_vector = sent2mat(input_sentence, glove)
    predicted_vector = model.predict([[context_vector], [input_vector]])
    predicted_token = i2t[np.argmax(predicted_vector)]
    output_sentence.append(predicted_token)
    i = 0
    while predicted_token != END and i < max_len:
        input_sentence = input_sentence[1:w] + [predicted_token]
        input_vector = sent2mat(input_sentence, glove)
        predicted_vector = model.predict([[context_vector], [input_vector]])
        predicted_token = i2t[np.argmax(predicted_vector)]
        output_sentence.append(predicted_token)
        i += 1
    return output_sentence

Because we need to supply the first word of each new sentence, we can simply sample from the first words found in the corpus. Let's save the distribution of first words as JSON, along with the embeddings and vocabulary the script will need.

first_words = Counter([s[0] for s in sentences])
first_words = pd.Series(first_words)
first_words = first_words / first_words.sum()
first_words.to_json('grimm-first-words.json')

with open('glove-dict.pkl', 'wb') as out:
    pkl.dump(glove, out)

with open('vocab.pkl', 'wb') as out:
    pkl.dump(i2t, out)

Let's see what gets generated without human intervention.

context_sentence = '''
In old times, when wishing was having, there lived a King whose
daughters were all beautiful, but the youngest was so beautiful that the
sun itself, which has seen so much, was astonished whenever it shone in
her face.
'''.strip().replace('\n', ' ')

input_sentence = np.random.choice(first_words.index, p=first_words)

for _ in range(10):
    print(context_sentence, END)
    output_sentence = generate_sentence(context_sentence, input_sentence, max_len)
    output_sentence = ' '.join(output_sentence[w-1:-1])
    context_sentence = output_sentence
    input_sentence = np.random.choice(first_words.index, p=first_words)
print(output_sentence, END)

In old times, when wishing was having, there lived a King whose daughters were all beautiful, but the youngest was so beautiful that the sun itself, which has seen so much, was astonished whenever it shone in her face. ###
" what do you desire ??? ###
the king ' s son , however , was still beautiful , and a little chair there ' s blood and so that she is alive ??? ###
the king ' s son , however , was still beautiful , and the king ' s daughter was only of silver , and the king ' s son came to the forest , and the king ' s son seated himself on the leg , and said , " i will go to church , and you shall be have lost my life ??? ###
" what are you saying ??? ###
cannon - maiden , and the king ' s daughter was only a looker - boy . ###
but the king ' s daughter was humble , and said , " you are not afraid ??? ###
then the king said , " i will go with you ??? ###
" i will go with you ??? ###
he was now to go with a long time , and the bird threw in the path , and the strong of them were on their of candles and bale - plants . ###
then the king said , " i will go with you ??? ###

This model is not going to pass the Turing test anytime soon, which is why we need a human in the loop. Let's build our script. First, let's save our model.

model.save('grimm-model')

Our script needs access to some of our utility functions, as well as the hyperparameters, such as dim and w.

%%writefile fairywriter.py
"""
This script helps you generate a fairy tale.
"""

import pickle as pkl

import nltk
import numpy as np
import pandas as pd

from keras.models import load_model
import keras.utils as ku
import keras.preprocessing as kp
import tensorflow as tf

START = '>'
END = '###'
UNK = '???'

FINISH_CMDS = ['finish', 'f']
BACK_CMDS = ['back', 'b']
QUIT_CMDS = ['quit', 'q']
CMD_PROMPT = ' | '.join(','.join(c) for c in [FINISH_CMDS, BACK_CMDS, QUIT_CMDS])
QUIT_PROMPT = '"{}" to quit'.format('" or "'.join(QUIT_CMDS))
ENDING = ['THE END']

def pad_sentence(sentence, length):
    sentence = sentence[:length]
    if len(sentence) < length:
        sentence += [END] * (length - len(sentence))
    return sentence

def sent2mat(sentence, embedding):
    mat = [embedding.get(t, embedding[UNK]) for t in sentence]
    return np.array(mat)

def generate_sentence(context_sentence, input_sentence, vocab, max_len=100,
                      hparams=(42, 50, 10)):
    max_len, dim, w = hparams
    context_sentence = [
        t.lower()
        for t in nltk.tokenize.wordpunct_tokenize(context_sentence)]
    context_sentence = pad_sentence(context_sentence, max_len)
    context_vector = sent2mat(context_sentence, glove)
    input_sentence = [
        t.lower()
        for t in nltk.tokenize.wordpunct_tokenize(input_sentence)]
    input_sentence = [START] * (w-1) + input_sentence
    input_sentence = input_sentence[:w]
    output_sentence = input_sentence
    input_vector = sent2mat(input_sentence, glove)
    predicted_vector = model.predict([[context_vector], [input_vector]])
    predicted_token = vocab[np.argmax(predicted_vector)]
    output_sentence.append(predicted_token)
    i = 0
    while predicted_token != END and i < max_len:
        input_sentence = input_sentence[1:w] + [predicted_token]
        input_vector = sent2mat(input_sentence, glove)
        predicted_vector = model.predict([[context_vector], [input_vector]])
        predicted_token = vocab[np.argmax(predicted_vector)]
        output_sentence.append(predicted_token)
        i += 1
    return output_sentence

if __name__ == '__main__':
    model = load_model('grimm-model')
    (_, max_len, dim), (_, w, _) = model.get_input_shape_at(0)
    hparams = (max_len, dim, w)
    first_words = pd.read_json('grimm-first-words.json', typ='series')
    with open('glove-dict.pkl', 'rb') as fp:
        glove = pkl.load(fp)
    with open('vocab.pkl', 'rb') as fp:
        vocab = pkl.load(fp)
    print("Let's write a story!")
    title = input('Give me a title ({}) '.format(QUIT_PROMPT))
    story = [title]
    context_sentence = title
    input_sentence = np.random.choice(first_words.index, p=first_words)
    if title.lower() in QUIT_CMDS:
        exit()
    print(CMD_PROMPT)
    while True:
        input_sentence = np.random.choice(first_words.index, p=first_words)
        generated = generate_sentence(context_sentence, input_sentence,
                                      vocab, hparams=hparams)
        generated = ' '.join(generated)
        ### The model creates a suggested sentence
        print('Suggestion:', generated)
        ### The user replies with the sentence they want to add.
        ### They can modify the suggested sentence or write their own.
        ### This is what will be used to make the next suggestion.
        sentence = input('Sentence: ')
        if sentence.lower() in QUIT_CMDS:
            story = []
            break
        elif sentence.lower() in FINISH_CMDS:
            story.append(np.random.choice(ENDING))
            break
        elif sentence.lower() in BACK_CMDS:
            if len(story) == 1:
                print('You are at the beginning')
            else:  # only drop the last sentence if there is one to drop
                story = story[:-1]
            context_sentence = story[-1]
            continue
        else:
            story.append(sentence)
            context_sentence = sentence
    print('\n'.join(story))
    print('exiting...')

Let's run our script. I used it by reading the suggestions and adding elements from them to the next line. A more sophisticated model might be able to generate sentences you could edit and add directly, but this model is not quite there.

%run fairywriter.py
Let's write a story!
Give me a title ("quit" or "q" to quit) The Wolf Goes Home
finish,f | back,b | quit,q
Suggestion: > > > > > > > > > and when they had walked for the time , and the king ' s son seated himself on the leg , and said , " i will go to church , and you shall be have lost my life ??? ###
Sentence: There was once a prince who got lost in the woods on the way to a church.
Suggestion: > > > > > > > > > she was called hans , and as the king ' s daughter , who was so beautiful than the children , who was called clever elsie . ###
Sentence: The prince was called Hans, and he was more handsome than the boys.
Suggestion: > > > > > > > > > no one will do not know what to say , but i have been compelled to you ??? ###
Sentence: The Wolf came along and asked, "does no one know where are?"
Suggestion: > > > > > > > > > there was once a man who had a daughter who had three daughters , and he had a child and went , the king ' s daughter , and said , " you are growing and thou now , i will go and fetch
Sentence: The Wolf had three daughters, and he said to the prince, "I will help you return home if you take one of my daughters as your betrothed."
Suggestion: > > > > > > > > > but the king ' s daughter was humble , and said , " you are not afraid ??? ###
Sentence: The prince asked, "are you not afraid that she will be killed as soon as we return home?"
Suggestion: > > > > > > > > > i will go and fetch the golden horse ??? ###
Sentence: The Wolf said, "I will go and fetch a golden horse as dowry."
Suggestion: > > > > > > > > > one day , the king ' s daughter , who was a witch , and lived in a great forest , and the clouds of earth , and in the evening , came to the glass mountain , and the king ' s son
Sentence: The Wolf went to find the forest witch that she might conjure a golden horse.
Suggestion: > > > > > > > > > when the king ' s daughter , however , was sitting on a chair , and sang and reproached , and said , " you are not to be my wife , and i will take you to take care of your ??? ###
Sentence: The witch reproached the wolf saying, "you come and ask me such a favor with no gift yourself?"
Suggestion: > > > > > > > > > then the king said , " i will go with you ??? ###
Sentence: So the wolf said, "if you grant me this favor, I will be your servant."
Suggestion: > > > > > > > > > he was now to go with a long time , and the other will be polluted , and we will leave you ??? ###
Sentence: f
The Wolf Goes Home
There was once a prince who got lost in the woods on the way to a church.
The prince was called Hans, and he was more handsome than the boys.
The Wolf came along and asked, "does no one know where are?"
The Wolf had three daughters, and he said to the prince, "I will help you return home if you take one of my daughters as your betrothed."
The prince asked, "are you not afraid that she will be killed as soon as we return home?"
The Wolf said, "I will go and fetch a golden horse as dowry."
The Wolf went to find the forest witch that she might conjure a golden horse.
The witch reproached the wolf saying, "you come and ask me such a favor with no gift yourself?"
So the wolf said, "if you grant me this favor, I will be your servant."
THE END
exiting...

You can train for additional epochs to get better suggestions, but beware of overfitting. If you overfit this model, it will produce worse results when you give it context and input that it does not recognize.

Now that we have a model we can interact with, the next step is to integrate it with a chatbot system. Most systems require some server that serves the model; the specifics depend on your chatbot platform. A minimal sketch of such a service follows.
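As an illustration only (Flask, the route, and the payload shape are assumptions, not part of the chapter), a small HTTP service wrapping generate_sentence might look like this. It assumes the model, glove, vocab, hparams, and generate_sentence from fairywriter.py have been loaded the same way the script's main block loads them:

# Illustrative model-serving sketch; the framework and endpoint are
# assumptions. model, glove, vocab, hparams, and generate_sentence are
# assumed to be loaded as in fairywriter.py's main block.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/suggest', methods=['POST'])
def suggest():
    payload = request.get_json()
    context = payload['context']        # previous accepted sentence
    first_word = payload['first_word']  # seed token for the new sentence
    tokens = generate_sentence(context, first_word, vocab, hparams=hparams)
    return jsonify({'suggestion': ' '.join(tokens)})

if __name__ == '__main__':
    app.run(port=5000)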

Testing and Measuring the Solution

Measuring a chatbot, more than most applications, depends on the ultimate purpose of the product. Let's consider the different kinds of metrics we would use for measuring.

Business Metrics

If you are building a chatbot to support customer service, the business metrics will be focused on customer experience. If you are building a chatbot for entertainment purposes, as is the case here, there are no obvious business metrics. However, if an entertainment chatbot is used for marketing, you can use marketing metrics.

Model-Centric Metrics

It is hard to measure live interactions the same way the model was measured in training. In training we know the "right" answer, but because of the interactive nature of the model there is no unambiguously correct answer in production. To measure the live model, you need to manually label conversations.
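For example (an illustration, not from the chapter; the data is made up), once transcripts are hand-labeled, one simple proxy metric is how often users accept a suggestion verbatim:

# Illustrative metric over hand-labeled transcripts; the rows are made up.
import pandas as pd

# One row per model suggestion: 'accepted' records whether the user
# kept the suggestion verbatim rather than rewriting it.
labeled = pd.DataFrame({
    'suggestion': ['then the king said , " i will go with you',
                   'there was once a man who had a daughter'],
    'accepted': [False, True],
})
print('acceptance rate: {:.0%}'.format(labeled['accepted'].mean()))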

Now let's talk about infrastructure.

Review

When reviewing the chatbot, you need the normal reviews required by any project. The additional requirement is to put the chatbot in front of people who can act as proxies for your actual users. As with any application that requires user interaction, user testing is central.

Conclusion

In this chapter we learned how to build a model for an interactive application. There are many different kinds of chatbots. The example we looked at here is based on a language model, but we could also build one around recommendations. It all depends on what kind of interactions you expect. In our case, we were entering and receiving full sentences. If your application has a restricted set of responses, your task becomes easier.
