

Natural Language Processing: Sentiment Classification of Online Shopping Product Reviews

Published: 2023/11/27

Contents

1. Project background

2. Dataset

3. Data preprocessing

4. SVM-based sentiment classification model

5. Unsupervised classification with doc2vec


Natural Language Processing (NLP) provides the core tools for text analysis and mining to enterprises and developers. It aims to help users process text efficiently, and it has been widely applied in e-commerce, entertainment, legal, public security, finance, healthcare, power, and other industries, with good results.

1. Project Background

In any industry, user evaluations of a product matter greatly. User reviews can be used to determine users' sentiment polarity.

Take online shopping, currently the most common scenario: for buyers, consulting reviews supports better purchase decisions; for sellers, classifying reviews by sentiment and clustering the text to surface the frequently mentioned strengths and weaknesses of a product helps improve it further.

This case study focuses on determining the sentiment polarity of product reviews. The figure below shows some reviews of a phone on an e-commerce platform:

2. Dataset

The dataset contains reviews of a particular phone model, with 2 attributes and 8,186 samples in total.

Use pandas' read_excel function to read the xls dataset file. Note that the file encoding must be set to gb18030. The code is as follows:

# Read in the dataset
import pandas as pd

# note: the encoding argument to read_excel is accepted by older pandas/xlrd versions
data = pd.read_excel("data.xls", encoding='gb18030')
print(data.head())

The first few rows of the dataset look like this:

Check basic information about the dataset: its shape, its column names, and the number of samples in each class. The code is as follows:

# Size of the dataset
print(data.shape)

# Column names
print(data.columns.values)

# Number of records in each class
print(data['Class'].value_counts())

The output is as follows:

(8186, 2)
array([u'Comment', u'Class'], dtype=object)
 1    3042
-1    2657
 0    2487
Name: Class, dtype: int64

3. Data Preprocessing

Now we need to convert the text in the Comment column into a numeric matrix representation, that is, map the text into a feature space.
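As a toy illustration of what this mapping means (the real pipeline below uses sklearn's vectorizers; the two segmented sentences here are made up), a bag-of-words sketch in plain Python:

```python
# Minimal bag-of-words: each segmented document becomes a vector of
# word counts over the shared vocabulary. Toy data only.
docs = ["手机 很 好", "手机 很 差"]

vocab = sorted({w for d in docs for w in d.split()})
vectors = [[d.split().count(w) for w in vocab] for d in docs]

print(vocab)     # shared vocabulary
print(vectors)   # one count vector per document
```

Each document becomes a row in a matrix whose columns are vocabulary terms; the vectorizers used later produce the same structure, only sparse and weighted.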

First, segment the Chinese text with jieba, which uses an HMM model. Import the libraries as follows:

# Import the Chinese word segmentation library jieba
import jieba
import numpy as np

Next, segment the text of every sample in the dataset. If a missing value is encountered, fill it with the neutral placeholder "還行 一般吧" ("okay, so-so"). The code is as follows:

cutted = []
for row in data.values:
    try:
        raw_words = " ".join(jieba.cut(row[0]))
        cutted.append(raw_words)
    except AttributeError:
        # Non-string (missing) comment: fill with a neutral placeholder
        print(row[0])
        cutted.append(u"還行 一般吧")

cutted_array = np.array(cutted)

# Build a new DataFrame whose Comment field holds the segmented text
data_cutted = pd.DataFrame({
    'Comment': cutted_array,
    'Class': data['Class']
})

Read back and inspect the preprocessed data:

print(data_cutted.head())

A sample of the resulting dataset looks like this:

To get a more intuitive view of the high-frequency words, we visualize the text with the third-party library wordcloud. Import it as follows:

# Import the third-party library wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

Build a WordCloud object for the positive, neutral, and negative review texts, and draw a word cloud for each. The positive-review word cloud is produced as follows:

# Positive reviews
wc = WordCloud(font_path='Courier.ttf')
# join with spaces so words from adjacent reviews are not fused together
wc.generate(' '.join(data_cutted['Comment'][data_cutted['Class'] == 1]))
plt.axis('off')
plt.imshow(wc)
plt.show()

The positive-review word cloud:

The neutral-review word cloud is produced the same way:

# Neutral reviews
wc = WordCloud(font_path='Courier.ttf')
wc.generate(' '.join(data_cutted['Comment'][data_cutted['Class'] == 0]))
plt.axis('off')
plt.imshow(wc)
plt.show()

The neutral-review word cloud:

The negative-review word cloud code:

# Negative reviews
wc = WordCloud(font_path='Courier.ttf')
wc.generate(' '.join(data_cutted['Comment'][data_cutted['Class'] == -1]))
plt.axis('off')
plt.imshow(wc)
plt.show()

The negative-review word cloud:

Judging from the word-frequency picture the clouds give, words such as "手機" (phone), "就是" (just), "屏幕" (screen), and "收到" (received) are useless for separating the classes and would only introduce bias. We therefore collect these uninformative words into the stop-word file stopwords.txt. The code is as follows:

# Read in the stop-word file
import codecs

with codecs.open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = [item.strip() for item in f]

for item in stopwords[0:200]:
    print(item)

The first stop words printed look like this:

Use jieba's extract_tags function to compute the TOP20 keywords of the positive, neutral, and negative review texts.

# Set the stop-word file so keyword extraction filters out stop words
import jieba.analyse

jieba.analyse.set_stop_words('stopwords.txt')

Positive-review keyword analysis:

# Positive-review keywords
keywords_pos = jieba.analyse.extract_tags(
    ' '.join(data_cutted['Comment'][data_cutted['Class'] == 1]), topK=20)
for item in keywords_pos:
    print(item)

The TOP20 positive keywords:

不錯 正品 贈品 五分 發貨 東西 滿意 機子 喜歡 收到 很漂亮 充電 好評 很快 賣家 速度 評價 流暢 快遞 物流

Neutral-review keyword analysis:

# Neutral-review keywords
keywords_med = jieba.analyse.extract_tags(
    ' '.join(data_cutted['Comment'][data_cutted['Class'] == 0]), topK=20)
for item in keywords_med:
    print(item)

The TOP20 neutral keywords:

充電 不錯 發熱 外觀 感覺 電池 機子 問題 贈品 有點 無線 發燙 換貨 軟件 快遞 安卓 內存 退貨 知道 售后

Negative-review keyword analysis:

# Negative-review keywords
keywords_neg = jieba.analyse.extract_tags(
    ' '.join(data_cutted['Comment'][data_cutted['Class'] == -1]), topK=20)
for item in keywords_neg:
    print(item)

The TOP20 negative keywords:

差評 售后 垃圾 贈品 退貨 問題 換貨 充電 降價 發票 充電器 東西 剛買 發熱 無線 機子 死機 收到 質量 15

With these steps done, preprocessing of the dataset is complete. In Chinese text and sentiment analysis, preprocessing mainly means word segmentation: only a segmented text dataset can be vectorized in the next step and meet the input requirements of a model.

4. SVM-Based Sentiment Classification Model

The segmented text dataset must be vectorized before it can be fed into a classification model.

We use sklearn to implement the vectorization: remove stop words, then map each review into the feature space via tf-idf.

Here tf (term frequency) is the number of times a term occurs in a given review; df (document frequency) is the number of reviews containing the term; N is the total number of reviews. A logarithm is used to damp the influence of large tf and df values, giving the familiar weight tf-idf = tf × log(N / df).
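A minimal sketch of this weighting in plain Python (sklearn's TfidfVectorizer, used below, adds smoothing and normalization on top of this basic formula; the toy reviews here are made up):

```python
import math

# Toy corpus of three segmented "reviews"
docs = [["手机", "不错"], ["手机", "很差"], ["不错", "不错"]]
N = len(docs)

# df: number of reviews containing each term
df = {}
for d in docs:
    for term in set(d):
        df[term] = df.get(term, 0) + 1

def tfidf(term, doc):
    tf = doc.count(term)                # term frequency within one review
    return tf * math.log(N / df[term])  # log damps terms that appear everywhere

# "手机" occurs in 2 of the 3 reviews, so its weight in review 0 is log(3/2)
print(round(tfidf("手机", docs[0]), 4))
```

A term that appears in every review gets weight tf × log(1) = 0, which is exactly the damping effect described above.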

We implement the SVM directly with sklearn. Here we try several SVM variants: SVC with a linear kernel, LinearSVC, and SGDClassifier.

For convenience, we create a text sentiment classifier class, CommentClassifier, to encapsulate the modeling process:

  • __init__ is the class initializer; its parameters classifier_type and vector_type select the classification model and the vectorization method, respectively.
  • fit() performs vectorization and model fitting.

The code is as follows:

# Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# SVM models
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# Train/test split and cross-validation
# (in sklearn >= 0.20 these live in sklearn.model_selection,
#  not the old sklearn.cross_validation module)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# Evaluation metrics
from sklearn import metrics


# Text sentiment classifier class: CommentClassifier
class CommentClassifier:
    def __init__(self, classifier_type, vector_type):
        self.classifier_type = classifier_type  # classifier: one of three SVM variants
        self.vector_type = vector_type          # vectorizer: 0/1 counts, TF, or TF-IDF

    def fit(self, train_x, train_y, max_df):
        list_text = list(train_x)
        # Vectorization method: 0 - 0/1 counts, 1 - TF, 2 - TF-IDF
        # (max_df must be passed as a keyword, otherwise it is taken as
        #  the vectorizer's first positional parameter)
        if self.vector_type == 0:
            self.vectorizer = CountVectorizer(max_df=max_df, stop_words=stopwords,
                                              ngram_range=(1, 3)).fit(list_text)
        elif self.vector_type == 1:
            self.vectorizer = TfidfVectorizer(max_df=max_df, stop_words=stopwords,
                                              ngram_range=(1, 3), use_idf=False).fit(list_text)
        else:
            self.vectorizer = TfidfVectorizer(max_df=max_df, stop_words=stopwords,
                                              ngram_range=(1, 3)).fit(list_text)
        self.array_trainx = self.vectorizer.transform(list_text)
        self.array_trainy = train_y
        # Classifier: 1 - SVC, 2 - LinearSVC, 3 - SGDClassifier
        if self.classifier_type == 1:
            self.model = SVC(kernel='linear', gamma=10 ** -5, C=1).fit(
                self.array_trainx, self.array_trainy)
        elif self.classifier_type == 2:
            self.model = LinearSVC().fit(self.array_trainx, self.array_trainy)
        else:
            self.model = SGDClassifier().fit(self.array_trainx, self.array_trainy)

    def predict_value(self, test_x):
        list_text = list(test_x)
        self.array_testx = self.vectorizer.transform(list_text)
        array_predict = self.model.predict(self.array_testx)
        return array_predict

    def predict_proba(self, test_x):
        # note: requires a probabilistic model, e.g. SVC(probability=True)
        list_text = list(test_x)
        self.array_testx = self.vectorizer.transform(list_text)
        array_score = self.model.predict_proba(self.array_testx)
        return array_score
  • Use train_test_split() to split the data into a training set (80%) and a test set (20%).
  • Build value lists for the two parameters classifier_type and vector_type, enumerating every combination of vectorization method and classification model.
  • For each combination, print the evaluation results: the confusion matrix and a report containing the Precision, Recall, and F1-score metrics.

The code is as follows:

# Split into training and test sets
train_x, test_x, train_y, test_y = train_test_split(
    data_cutted['Comment'].ravel().astype('U'),
    data_cutted['Class'].ravel(),
    test_size=0.2, random_state=4)

classifier_list = [1, 2, 3]
vector_list = [0, 1, 2]

for classifier_type in classifier_list:
    for vector_type in vector_list:
        commentCls = CommentClassifier(classifier_type, vector_type)
        # max_df is set to 0.98
        commentCls.fit(train_x, train_y, 0.98)
        value_result = commentCls.predict_value(test_x)
        print(classifier_type, vector_type)
        print('classification report')
        print(metrics.classification_report(test_y, value_result, labels=[-1, 0, 1]))
        print('confusion matrix')
        print(metrics.confusion_matrix(test_y, value_result, labels=[-1, 0, 1]))

The output is as follows:

1 0
classification report
             precision    recall  f1-score   support

         -1       0.68      0.62      0.65       519
          0       0.55      0.49      0.52       485
          1       0.75      0.86      0.80       634

avg / total       0.67      0.68      0.67      1638

confusion matrix
[[324 130  65]
 [131 236 118]
 [ 24  64 546]]

1 1
classification report
             precision    recall  f1-score   support

         -1       0.71      0.74      0.72       519
          0       0.58      0.54      0.56       485
          1       0.84      0.85      0.85       634

avg / total       0.72      0.72      0.72      1638

confusion matrix
[[385 109  25]
 [145 263  77]
 [ 15  80 539]]

1 2
classification report
             precision    recall  f1-score   support

         -1       0.70      0.74      0.72       519
          0       0.58      0.52      0.55       485
          1       0.84      0.86      0.85       634

avg / total       0.72      0.72      0.72      1638

confusion matrix
[[386 106  27]
 [151 254  80]
 [ 14  76 544]]

2 0
classification report
             precision    recall  f1-score   support

         -1       0.70      0.62      0.66       519
          0       0.56      0.51      0.54       485
          1       0.76      0.88      0.82       634

avg / total       0.68      0.69      0.68      1638

confusion matrix
[[320 135  64]
 [122 248 115]
 [ 16  57 561]]

2 1
classification report
             precision    recall  f1-score   support

         -1       0.69      0.73      0.71       519
          0       0.61      0.48      0.54       485
          1       0.81      0.91      0.86       634

avg / total       0.71      0.73      0.72      1638

confusion matrix
[[377 108  34]
 [154 233  98]
 [ 12  44 578]]

2 2
classification report
             precision    recall  f1-score   support

         -1       0.70      0.74      0.72       519
          0       0.61      0.50      0.55       485
          1       0.83      0.91      0.87       634

avg / total       0.72      0.73      0.73      1638

confusion matrix
[[383 108  28]
 [154 241  90]
 [ 13  43 578]]

3 0
classification report
             precision    recall  f1-score   support

         -1       0.69      0.69      0.69       519
          0       0.58      0.47      0.52       485
          1       0.79      0.90      0.84       634

avg / total       0.70      0.71      0.70      1638

confusion matrix
[[359 118  42]
 [148 228 109]
 [ 14  47 573]]

3 1
classification report
             precision    recall  f1-score   support

         -1       0.70      0.74      0.72       519
          0       0.60      0.49      0.54       485
          1       0.81      0.88      0.84       634

avg / total       0.71      0.72      0.71      1638

confusion matrix
[[386  96  37]
 [152 240  93]
 [ 13  66 555]]

3 2
classification report
             precision    recall  f1-score   support

         -1       0.65      0.75      0.69       519
          0       0.63      0.49      0.55       485
          1       0.83      0.86      0.85       634

avg / total       0.71      0.72      0.71      1638

confusion matrix
[[389  98  32]
 [169 236  80]
 [ 45  41 548]]

Judging from the results, the combination of tf-idf vectorization with the LinearSVC model works best, with an F1-score of 0.73.

The confusion matrices show that most misclassifications involve the neutral and negative reviews. We can therefore drop the neutral reviews from the original dataset. The code is as follows:

data_bi = data_cutted[data_cutted['Class'] != 0]
print(data_bi['Class'].value_counts())

The output:

 1    3042
-1    2658
Name: Class, dtype: int64

Run the classification models again and inspect the results:

1 0
classification report
             precision    recall  f1-score   support

         -1       0.90      0.79      0.84       537
          1       0.83      0.92      0.87       603

avg / total       0.86      0.86      0.86      1140

confusion matrix
[[425 112]
 [ 48 555]]

1 1
classification report
             precision    recall  f1-score   support

         -1       0.87      0.92      0.90       537
          1       0.93      0.88      0.90       603

avg / total       0.90      0.90      0.90      1140

confusion matrix
[[496  41]
 [ 71 532]]

1 2
classification report
             precision    recall  f1-score   support

         -1       0.88      0.93      0.90       537
          1       0.93      0.88      0.91       603

avg / total       0.90      0.90      0.90      1140

confusion matrix
[[497  40]
 [ 70 533]]

2 0
classification report
             precision    recall  f1-score   support

         -1       0.90      0.80      0.85       537
          1       0.84      0.92      0.88       603

avg / total       0.87      0.86      0.86      1140

confusion matrix
[[431 106]
 [ 48 555]]

2 1
classification report
             precision    recall  f1-score   support

         -1       0.92      0.91      0.91       537
          1       0.92      0.93      0.92       603

avg / total       0.92      0.92      0.92      1140

confusion matrix
[[486  51]
 [ 43 560]]

2 2
classification report
             precision    recall  f1-score   support

         -1       0.93      0.91      0.92       537
          1       0.92      0.94      0.93       603

avg / total       0.92      0.92      0.92      1140

confusion matrix
[[488  49]
 [ 39 564]]

3 0
classification report
             precision    recall  f1-score   support

         -1       0.92      0.82      0.87       537
          1       0.86      0.94      0.90       603

avg / total       0.89      0.88      0.88      1140

confusion matrix
[[443  94]
 [ 38 565]]

3 1
classification report
             precision    recall  f1-score   support

         -1       0.92      0.91      0.91       537
          1       0.92      0.93      0.92       603

avg / total       0.92      0.92      0.92      1140

confusion matrix
[[486  51]
 [ 41 562]]

3 2
classification report
             precision    recall  f1-score   support

         -1       0.88      0.93      0.90       537
          1       0.93      0.89      0.91       603

avg / total       0.91      0.91      0.91      1140

confusion matrix
[[497  40]
 [ 67 536]]

After removing the neutral reviews, every combination of vectorizer and classifier improves markedly. This also shows that the models can effectively separate the positive reviews.

The dataset suffers from inaccurate labeling, concentrated in the neutral reviews. People generally leave a positive review unless something is wrong; a neutral rating already signals dissatisfaction, so its text tends toward negative sentiment. Reviews are also highly subjective, and many reviews that read as negative are labeled neutral in the dataset. Splitting reviews into positive/neutral/negative is therefore not fully objective: the boundary between neutral and negative is blurry, which is why the recognition rate is hard to improve.

5. Unsupervised Classification with doc2vec

word2vec, an open-source text vectorization tool, can learn deeper feature representations of text. Its word vectors support arithmetic:

w2v(woman) - w2v(man) + w2v(king) ≈ w2v(queen)
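A toy illustration of this arithmetic with hand-made 2-D vectors (real word2vec embeddings are learned and typically 100-300 dimensional; in gensim the same query is `model.wv.most_similar(positive=['woman', 'king'], negative=['man'])`):

```python
# Hand-made 2-D "word vectors": dimension 0 ~ royalty, dimension 1 ~ maleness.
w2v = {
    'king':  [1.0, 1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, 0.0],
    'queen': [1.0, 0.0],
}

# w2v(woman) - w2v(man) + w2v(king)
result = [w - m + k for w, m, k in zip(w2v['woman'], w2v['man'], w2v['king'])]

# Nearest word in the vocabulary by Euclidean distance
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

print(min(w2v, key=lambda word: dist(w2v[word], result)))  # queen
```

Subtracting `man` removes the "maleness" component and adding `king` adds "royalty", landing the result on `queen`; learned embeddings exhibit the same behavior only approximately.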

doc2vec, built on word2vec, represents each document as a vector, and the cosine distance between two document vectors measures their similarity. We can therefore compute the distance from any review to an extremely positive review, and from that review to an extremely negative one.
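The cosine similarity used here can be sketched in a few lines (gensim's `docvecs.similarity`, used later in this section, computes the same quantity on the learned document vectors; these toy vectors are made up):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| |b|): 1 means same direction, -1 opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]    # same direction as doc_a
doc_c = [-1.0, -2.0, 0.0]  # opposite direction

print(round(cosine_similarity(doc_a, doc_b), 6))   # 1.0
print(round(cosine_similarity(doc_a, doc_c), 6))   # -1.0
```

Because cosine similarity depends only on direction, not magnitude, a short review and a long review with the same wording profile score as similar.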

In this dataset, the two anchor reviews are:

  • Extremely positive review: 快 就是 手感 滿意 也好 喜歡 也 流暢 很 服務態度 實用 超快 挺快 用著 速度 禮品 也不錯 非常好 挺好 感覺 才來 還行 好看 也快 不錯的 送了 非常不錯 超級 贊 好多東西 很實用 各方面 挺好的 很多 漂亮 配件 還不錯 也多 特意 慢 滿分 好用 非常漂亮......
  • Extremely negative review: 不多說 上當 差差 剛用 服務差 一點也不 不要 簡直 還是去 實體店 大家 保證 不肯 生氣 開發票 磨損 后悔 印記 網 什么破 爛爛 左邊 失效 太 騙 掉價 走下坡路 不說了 徹底 三星手機 自營 幾次 真心 別的 看完 簡單說 機會 這是 生氣了 觸動 縫隙 沖動了 失望......

We use the third-party library gensim to implement the doc2vec model.

The code is as follows:

import pandas as pd
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

train_x = data_bi['Comment'].ravel()
train_y = data_bi['Class'].ravel()

# Tag each document in train_x with a "TRAIN_i" label
def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(TaggedDocument(v.split(" "), [label]))
    return labelized

train_x = labelizeReviews(train_x, "TRAIN")

# Build the Doc2Vec model
# (this uses the old gensim API; in gensim >= 4.0 the parameters are
#  vector_size/epochs, and train() requires total_examples and epochs)
size = 300
all_data = []
all_data.extend(train_x)
model = Doc2Vec(min_count=1, window=8, size=size, sample=1e-4,
                negative=5, hs=0, iter=5, workers=8)
model.build_vocab(all_data)

# Train for 10 epochs
for epoch in range(10):
    model.train(train_x)

# pos / neg store, for each review, the cosine similarity to the extremely
# positive review (TRAIN_0) and to the extremely negative review (TRAIN_1)
pos = []
neg = []
for i in range(0, len(train_x)):
    pos.append(model.docvecs.similarity("TRAIN_0", "TRAIN_{}".format(i)))
    neg.append(model.docvecs.similarity("TRAIN_1", "TRAIN_{}".format(i)))

# Write pos and neg back into the data as the fields PosSim and NegSim
data_bi[u'PosSim'] = pos
data_bi[u'NegSim'] = neg

The training log looks like this:

2017-05-27 14:30:28,393 : INFO : collecting all words and their counts
2017-05-27 14:30:28,394 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-05-27 14:30:28,593 : INFO : collected 10545 word types and 5700 unique tags from a corpus of 5700 examples and 482148 words
2017-05-27 14:30:28,595 : INFO : Loading a fresh vocabulary
2017-05-27 14:30:28,649 : INFO : min_count=1 retains 10545 unique words (100% of original 10545, drops 0)
2017-05-27 14:30:28,650 : INFO : min_count=1 leaves 482148 word corpus (100% of original 482148, drops 0)
2017-05-27 14:30:28,705 : INFO : deleting the raw counts dictionary of 10545 items
2017-05-27 14:30:28,706 : INFO : sample=0.0001 downsamples 217 most-common words
2017-05-27 14:30:28,707 : INFO : downsampling leaves estimated 108356 word corpus (22.5% of prior 482148)
2017-05-27 14:30:28,709 : INFO : estimated required memory for 10545 words and 300 dimensions: 38560500 bytes
2017-05-27 14:30:28,784 : INFO : resetting layer weights
2017-05-27 14:30:29,120 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:29,121 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:30,176 : INFO : PROGRESS: at 10.24% examples, 72316 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:31,211 : INFO : PROGRESS: at 29.96% examples, 91057 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:32,218 : INFO : PROGRESS: at 66.30% examples, 126742 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:33,231 : INFO : PROGRESS: at 86.00% examples, 122698 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:33,571 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:33,573 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:33,605 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:33,647 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:33,678 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:33,696 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:33,711 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:33,722 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:33,724 : INFO : training on 2410740 raw words (570332 effective words) took 4.6s, 124032 effective words/s
2017-05-27 14:30:33,727 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:33,731 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:34,753 : INFO : PROGRESS: at 36.38% examples, 212225 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:35,762 : INFO : PROGRESS: at 75.24% examples, 216859 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:36,243 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:36,244 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:36,264 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:36,306 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:36,311 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:36,320 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:36,330 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:36,336 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:36,338 : INFO : training on 2410740 raw words (570008 effective words) took 2.6s, 219523 effective words/s
2017-05-27 14:30:36,339 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:36,341 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:37,353 : INFO : PROGRESS: at 28.23% examples, 177496 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:38,372 : INFO : PROGRESS: at 66.30% examples, 193880 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:39,061 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:39,062 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:39,074 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:39,115 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:39,122 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:39,132 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:39,147 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:39,154 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:39,155 : INFO : training on 2410740 raw words (570746 effective words) took 2.8s, 203312 effective words/s
2017-05-27 14:30:39,158 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:39,159 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:40,168 : INFO : PROGRESS: at 37.74% examples, 222816 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:41,177 : INFO : PROGRESS: at 77.55% examples, 223202 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:41,605 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:41,610 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:41,614 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:41,645 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:41,670 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:41,674 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:41,682 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:41,690 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:41,692 : INFO : training on 2410740 raw words (569889 effective words) took 2.5s, 225457 effective words/s
2017-05-27 14:30:41,694 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:41,696 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:42,712 : INFO : PROGRESS: at 29.16% examples, 183182 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:43,754 : INFO : PROGRESS: at 69.96% examples, 203560 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:44,804 : INFO : PROGRESS: at 91.97% examples, 173787 words/s, in_qsize 14, out_qsize 0
2017-05-27 14:30:44,973 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:44,989 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:45,028 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:45,061 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:45,097 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:45,101 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:45,121 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:45,125 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:45,128 : INFO : training on 2410740 raw words (569903 effective words) took 3.4s, 166370 effective words/s
2017-05-27 14:30:45,131 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:45,132 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:46,152 : INFO : PROGRESS: at 11.26% examples, 79348 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:47,153 : INFO : PROGRESS: at 27.52% examples, 85992 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:48,166 : INFO : PROGRESS: at 66.47% examples, 130273 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:49,061 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:49,076 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:49,088 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:49,123 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:49,144 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:49,147 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:49,152 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:49,159 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:49,160 : INFO : training on 2410740 raw words (570333 effective words) took 4.0s, 141860 effective words/s
2017-05-27 14:30:49,161 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:49,163 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:50,185 : INFO : PROGRESS: at 31.78% examples, 193530 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:51,244 : INFO : PROGRESS: at 48.51% examples, 141817 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:52,278 : INFO : PROGRESS: at 69.96% examples, 134399 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:52,918 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:52,936 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:52,945 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:52,976 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:52,979 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:52,984 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:52,995 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:52,998 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:52,999 : INFO : training on 2410740 raw words (570031 effective words) took 3.8s, 148864 effective words/s
2017-05-27 14:30:53,000 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:53,002 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:54,024 : INFO : PROGRESS: at 34.48% examples, 202424 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:55,035 : INFO : PROGRESS: at 68.58% examples, 201499 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:56,010 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:56,017 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:56,048 : INFO : PROGRESS: at 96.89% examples, 183861 words/s, in_qsize 5, out_qsize 1
2017-05-27 14:30:56,049 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:56,071 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:56,084 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:56,099 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:56,101 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:56,104 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:56,104 : INFO : training on 2410740 raw words (570328 effective words) took 3.1s, 184129 effective words/s
2017-05-27 14:30:56,105 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:56,107 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:57,134 : INFO : PROGRESS: at 33.13% examples, 197730 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:58,140 : INFO : PROGRESS: at 69.96% examples, 206423 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:58,876 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:58,883 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:58,889 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:58,937 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:58,949 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:58,953 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:58,960 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:58,967 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:58,968 : INFO : training on 2410740 raw words (570312 effective words) took 2.9s, 199922 effective words/s
2017-05-27 14:30:58,969 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:58,970 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:59,991 : INFO : PROGRESS: at 32.86% examples, 198045 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:31:00,993 : INFO : PROGRESS: at 68.23% examples, 201443 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:31:01,881 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:31:01,888 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:31:01,907 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:31:01,922 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:31:01,941 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:31:01,948 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:31:01,955 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:31:01,961 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:31:01,962 : INFO : training on 2410740 raw words (570826 effective words) took 3.0s, 191072 effective words/s

Finally, visualize the classification result:

from matplotlib import pyplot as plt

label = data_bi['Class'].ravel()
values = data_bi[['PosSim', 'NegSim']].values

plt.scatter(values[:, 0], values[:, 1], c=label, alpha=0.4)
plt.show()

The result:

The plot shows that positive and negative reviews can roughly be separated by a straight line (blue = negative, red = positive).

This approach is completely different from the traditional one: it uses no word-frequency or sentiment-lexicon features. Its advantages are:

  • It maps the dataset into a very low-dimensional space: just two dimensions.
  • It is an unsupervised method and requires no labeling of the original training data.
  • It generalizes: the same method works in other domains; it suffices to find one extremely positive and one extremely negative example in that domain, convert them and all the data to be classified into vectors with doc2vec, and compute the distances.
