Text Classification (2): Selecting Feature Words to Build the Dictionary
001 Common feature word extraction methods
tf-idf http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
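For orientation, the TF-IDF definition described in the linked article comes down to term frequency times inverse document frequency. A minimal sketch, where the function and argument names are mine and the numbers are made up:

from math import log

# made-up helper, not from the original post: plain TF-IDF for one term in one document
def tfidf(term_count, words_in_doc, total_docs, docs_with_term):
    tf = term_count / words_in_doc                 # how often the term appears in this document
    idf = log(total_docs / (1 + docs_with_term))   # rarer across the corpus -> larger weight
    return tf * idf

print(tfidf(3, 100, 1000, 50))   # ~0.089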
chi (chi-square): https://blog.csdn.net/hubin232/article/details/81272126 [a relatively recent write-up]
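The textbook chi-square statistic for feature selection works on a 2x2 contingency table per (word, category) pair. The sketch below is my own reference implementation of that standard formula; note that the code later in this post uses a term-frequency-weighted variant of it.

def chi_square(a, b, c, d):
    # a: docs in the category that contain the word
    # b: docs outside the category that contain the word
    # c: docs in the category that do not contain the word
    # d: docs outside the category that do not contain the word
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(chi_square(40, 10, 60, 890))   # a large value -> the word is informative for the category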
In sklearn, what gets computed is a tf-idf vector for each document; max_features simply sorts all terms by overall frequency in descending order and keeps only the top max_features terms. Having also studied https://github.com/chenfei0328/BayesProject beforehand, my thinking got stuck on building one big matrix over all the documents and then reducing its dimensionality; only after a pointer from someone more experienced did I change direction.
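As a point of comparison, this is roughly what the sklearn route described above looks like; the corpus and the max_features value here are made up purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the house price rose again",
          "the team won the match",
          "the market price fell today"]
vectorizer = TfidfVectorizer(max_features=5000)   # keep only the 5000 most frequent terms overall
X = vectorizer.fit_transform(corpus)              # one TF-IDF row vector per document
print(X.shape)                                    # (number of documents, number of kept terms)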
002 Implementation
The rough idea of our approach, worked through step by step below: count word frequencies per category, compute a weight for each word, then pick the top-weighted words as features.
(The teacher mentioned that some people also reduce dimensionality with CHI * TF-IDF; other feature selection methods such as IG (information gain) work too.)
While writing this, the counts kept coming out wrong: every category ended up with identical numbers. It turned out I had mixed up copying and referencing dicts.
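The pitfall is the classic shared-reference problem with mutable dicts, which is why wordInfo.copy() and featureDict.copy() appear in the code below. A minimal illustration, with made-up category names rather than the project code:

template = {'tf': 0, 'idf': 1}

shared = {}
for cate in ('sports', 'finance'):          # hypothetical category names
    shared.setdefault(cate, template)       # every category points at the SAME dict object
shared['sports']['tf'] += 5
print(shared['finance']['tf'])              # 5 -- the other category changed as well

template = {'tf': 0, 'idf': 1}              # fresh template for the correct version
separate = {}
for cate in ('sports', 'finance'):
    separate.setdefault(cate, template.copy())   # .copy() gives each category its own dict
separate['sports']['tf'] += 5
print(separate['finance']['tf'])            # 0 -- as intended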
Counting word frequencies
每類文檔的數目懶得貼了
# Count word frequencies and document counts; term frequency is kept so CHI can use it later.
# wordDict = {word: {category: {'tf': occurrences, 'idf': documents containing the word}},
#             'total': {category: total word count of that category}}
import json
import os
from collections import Counter
import time

def count_words(cut_path):
    cate = os.listdir(cut_path)
    wordDict = {'total': {}}
    for i, category in enumerate(cate):
        print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()), '>',
              '=' * 40 + '[' + category + ']' + '=' * 40)
        wordInfo = {'tf': 0, 'idf': 1}
        wordsLen = 0
        file_path = cut_path + category + '/'
        file_list = os.listdir(file_path)
        for j, file_name in enumerate(file_list):
            full_path = file_path + file_name
            with open(full_path, "r", encoding='utf-8') as f:
                content = f.read()
            words = content.split()
            wordsLen += len(words)
            wordCounter = Counter(words)
            for index, wordTuple in enumerate(wordCounter.most_common(len(wordCounter))):
                (word, count) = wordTuple
                wordDict.setdefault(word, {})
                wordDict[word].setdefault(category, wordInfo.copy())   # .copy(), or every word shares one dict
                wordDict[word][category]['tf'] += count
                wordDict[word][category]['idf'] += 1
        # also leaves a stray top-level key wordDict[category] = wordsLen
        wordDict['total'][category] = wordDict.setdefault(category, wordsLen)
    wordDict = dict(wordDict)
    fname = 'C:/lyr/DM/feature_reduction/wordDict.json'
    with open(fname, 'w') as fp:
        json.dump(wordDict, fp)
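If I read the code right, wordDict ends up shaped roughly like this; the category names and numbers are invented, and the stray top-level key left behind by the setdefault call may be the "problem" mentioned a bit further down:

# hypothetical shape of wordDict after count_words(); all values are made up
{
    'total': {'sports': 120000, 'finance': 98000},   # total word count per category
    '比賽':  {'sports': {'tf': 900, 'idf': 513}},     # term freq and doc freq per category
    '股市':  {'finance': {'tf': 350, 'idf': 202}},
    'sports': 120000,   # stray key added by wordDict.setdefault(category, wordsLen)
    'finance': 98000,
}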
Computing each word's weight
The dictionary built above seems to have a problem, so to keep things running I simply ignore the exceptions; the result ends up only 1 entry shorter than before.
from math import log
import json

# ddict is the wordDict built above (loaded back from wordDict.json); docLen holds the
# number of documents per category, with docLen['All'] the grand total -- those counts
# are not pasted here (see above).
weightDict = {}
featureDict = {'chi': 0, 'tfidf': 0}
# weightDict = {word: {category: {chi: ..., tfidf: ...}}} -- compute both in one pass
def gen_weightDict():
    for (word, value) in ddict.items():
        if word != 'total':
            weightDict.setdefault(word, {})
            totalCount = 0   # A+B: total occurrences of this word
            try:
                for (cate, times) in value.items():
                    totalCount += times['tf']
            except Exception as ex:
                print(word, value)   # the stray category keys end up here
            try:
                for (cate, times) in value.items():
                    weightDict[word].setdefault(cate, featureDict.copy())
                    # multiply/divide by 100 to keep the values from getting too large or too small !?
                    weightDict[word][cate]['tfidf'] = (times['tf'] / ddict['total'][cate]) * 100 \
                        * log(docLen[cate] / (1 + times['idf']))
                    # times['idf'] plays the role of A
                    not_in_Class = totalCount - times['idf']          # roughly B
                    not_has_word = docLen[cate] - times['idf']        # roughly C
                    d = docLen['All'] - totalCount - not_has_word     # roughly D
                    chi = pow((times['idf'] * d - not_in_Class * not_has_word), 2) \
                        / (100 * totalCount * (d + not_has_word))
                    weightDict[word][cate]['chi'] = chi
            except Exception as ex:
                pass
    fname = 'C:/lyr/DM/feature_reduction/weigthDict.json'
    with open(fname, 'w') as fp:
        json.dump(weightDict, fp)

Then I changed how the dict is organized, otherwise it wouldn't sort.
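The reorganization step itself isn't shown in the post. Based on how new_weigthDict is used below, a sketch of what it might look like, with 'both' assumed to be the CHI * TF-IDF product mentioned at the top; this is my assumption, not the original code:

# sketch only: flip weightDict from {word: {cate: {...}}} to {cate: {word: {...}}}
# 'both' = chi * tfidf is an assumption based on the CHI*TF-IDF idea mentioned earlier
new_weigthDict = {}
for word, cates in weightDict.items():
    for cate, w in cates.items():
        new_weigthDict.setdefault(cate, {})
        new_weigthDict[cate][word] = {'chi': w['chi'],
                                      'tfidf': w['tfidf'],
                                      'both': w['chi'] * w['tfidf']}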
Selecting the feature words
我,每類取了8000
resultDict = {'chi': {}, 'tfidf': {}, 'both': {}}   # store the three kinds of results
maxCount = 8000

def gen_featureDict(maxCount):
    for cate, words in new_weigthDict.items():
        print(cate)
        chiMix = dict(sorted(words.items(), key=lambda x: x[1]['chi'], reverse=True)[:maxCount])
        tdidfMix = dict(sorted(words.items(), key=lambda x: x[1]['tfidf'], reverse=True)[:maxCount])
        bothMix = dict(sorted(words.items(), key=lambda x: x[1]['both'], reverse=True)[:maxCount])
        resultDict['chi'].setdefault(cate, {})
        resultDict['tfidf'].setdefault(cate, {})
        resultDict['both'].setdefault(cate, {})
        chiSum, tfidfSum, bothSum = 0, 0, 0
        # keep only the relevant score in each result; the *Mix dicts still carry all three
        for word, weight in chiMix.items():
            chiSum += weight['chi']    # it fits in a float, but the sums still feel huge
            resultDict['chi'][cate].setdefault(word, weight['chi'])
        for word, weight in tdidfMix.items():
            tfidfSum += weight['tfidf']
            resultDict['tfidf'][cate].setdefault(word, weight['tfidf'])
        for word, weight in bothMix.items():
            bothSum += weight['both']
            resultDict['both'][cate].setdefault(word, weight['both'])
        # normalize into 0-1, otherwise multiplying raw chi values later blows up
        for word, weight in chiMix.items():
            resultDict['chi'][cate][word] /= chiSum
        for word, weight in tdidfMix.items():
            resultDict['tfidf'][cate][word] /= tfidfSum
        for word, weight in bothMix.items():
            resultDict['both'][cate][word] /= bothSum

gen_featureDict(maxCount)

Summary
To recap: count term and document frequencies per category, weight every word with CHI and TF-IDF (and their product), then keep the top 8,000 words of each category, normalized to 0-1, as the feature dictionary.