Text Classification (2): Selecting Feature Words to Build the Dictionary
001 Common feature word extraction methods
tf-idf http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
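For orientation, the TF-IDF definition described in the linked article comes down to term frequency times inverse document frequency. A minimal sketch, where the function and argument names are mine and the numbers are made up:

from math import log

# made-up helper, not from the original post: plain TF-IDF for one term in one document
def tfidf(term_count, words_in_doc, total_docs, docs_with_term):
    tf = term_count / words_in_doc                 # how often the term appears in this document
    idf = log(total_docs / (1 + docs_with_term))   # rarer across the corpus -> larger weight
    return tf * idf

print(tfidf(3, 100, 1000, 50))   # ~0.089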
chi (chi-square): https://blog.csdn.net/hubin232/article/details/81272126 [a relatively recent write-up]
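The textbook chi-square statistic for feature selection works on a 2x2 contingency table per (word, category) pair. The sketch below is my own reference implementation of that standard formula; note that the code later in this post uses a term-frequency-weighted variant of it.

def chi_square(a, b, c, d):
    # a: docs in the category that contain the word
    # b: docs outside the category that contain the word
    # c: docs in the category that do not contain the word
    # d: docs outside the category that do not contain the word
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(chi_square(40, 10, 60, 890))   # a large value -> the word is informative for the category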
In sklearn, what gets computed is a tf-idf vector for each document; max_features simply sorts all terms by overall frequency in descending order and keeps only the top max_features terms. Having also studied https://github.com/chenfei0328/BayesProject beforehand, my thinking got stuck on building one big matrix over all the documents and then reducing its dimensionality; only after a pointer from someone more experienced did I change direction.
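As a point of comparison, this is roughly what the sklearn route described above looks like; the corpus and the max_features value here are made up purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the house price rose again",
          "the team won the match",
          "the market price fell today"]
vectorizer = TfidfVectorizer(max_features=5000)   # keep only the 5000 most frequent terms overall
X = vectorizer.fit_transform(corpus)              # one TF-IDF row vector per document
print(X.shape)                                    # (number of documents, number of kept terms)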
002 Implementation
The rough idea of our approach, worked through step by step below: count word frequencies per category, compute a weight for each word, then pick the top-weighted words as features.
(The teacher mentioned that some people also reduce dimensionality with CHI * TF-IDF; other feature selection methods such as IG (information gain) work too.)
While writing this, the counts kept coming out wrong: every category ended up with identical numbers. It turned out I had mixed up copying and referencing dicts.
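The pitfall is the classic shared-reference problem with mutable dicts, which is why wordInfo.copy() and featureDict.copy() appear in the code below. A minimal illustration, with made-up category names rather than the project code:

template = {'tf': 0, 'idf': 1}

shared = {}
for cate in ('sports', 'finance'):          # hypothetical category names
    shared.setdefault(cate, template)       # every category points at the SAME dict object
shared['sports']['tf'] += 5
print(shared['finance']['tf'])              # 5 -- the other category changed as well

template = {'tf': 0, 'idf': 1}              # fresh template for the correct version
separate = {}
for cate in ('sports', 'finance'):
    separate.setdefault(cate, template.copy())   # .copy() gives each category its own dict
separate['sports']['tf'] += 5
print(separate['finance']['tf'])            # 0 -- as intended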
Counting word frequencies
每類文檔的數目懶得貼了
# Count word frequencies and document counts; term frequency is kept so CHI can use it later.
# wordDict = {word: {category: {'tf': occurrences, 'idf': documents containing the word}},
#             'total': {category: total word count of that category}}
import json
import os
from collections import Counter
import time

def count_words(cut_path):
    cate = os.listdir(cut_path)
    wordDict = {'total': {}}
    for i, category in enumerate(cate):
        print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()), '>',
              '=' * 40 + '[' + category + ']' + '=' * 40)
        wordInfo = {'tf': 0, 'idf': 1}
        wordsLen = 0
        file_path = cut_path + category + '/'
        file_list = os.listdir(file_path)
        for j, file_name in enumerate(file_list):
            full_path = file_path + file_name
            with open(full_path, "r", encoding='utf-8') as f:
                content = f.read()
            words = content.split()
            wordsLen += len(words)
            wordCounter = Counter(words)
            for index, wordTuple in enumerate(wordCounter.most_common(len(wordCounter))):
                (word, count) = wordTuple
                wordDict.setdefault(word, {})
                wordDict[word].setdefault(category, wordInfo.copy())   # .copy(), or every word shares one dict
                wordDict[word][category]['tf'] += count
                wordDict[word][category]['idf'] += 1
        # also leaves a stray top-level key wordDict[category] = wordsLen
        wordDict['total'][category] = wordDict.setdefault(category, wordsLen)
    wordDict = dict(wordDict)
    fname = 'C:/lyr/DM/feature_reduction/wordDict.json'
    with open(fname, 'w') as fp:
        json.dump(wordDict, fp)
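If I read the code right, wordDict ends up shaped roughly like this; the category names and numbers are invented, and the stray top-level key left behind by the setdefault call may be the "problem" mentioned a bit further down:

# hypothetical shape of wordDict after count_words(); all values are made up
{
    'total': {'sports': 120000, 'finance': 98000},   # total word count per category
    '比賽':  {'sports': {'tf': 900, 'idf': 513}},     # term freq and doc freq per category
    '股市':  {'finance': {'tf': 350, 'idf': 202}},
    'sports': 120000,   # stray key added by wordDict.setdefault(category, wordsLen)
    'finance': 98000,
}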
Computing each word's weight
The dictionary built above seems to have a problem, so to keep things running I simply ignore the exceptions; the result ends up only 1 entry shorter than before.
from math import log
import json

# ddict is the wordDict built above (loaded back from wordDict.json); docLen holds the
# number of documents per category, with docLen['All'] the grand total -- those counts
# are not pasted here (see above).
weightDict = {}
featureDict = {'chi': 0, 'tfidf': 0}
# weightDict = {word: {category: {chi: ..., tfidf: ...}}} -- compute both in one pass
def gen_weightDict():
    for (word, value) in ddict.items():
        if word != 'total':
            weightDict.setdefault(word, {})
            totalCount = 0   # A+B: total occurrences of this word
            try:
                for (cate, times) in value.items():
                    totalCount += times['tf']
            except Exception as ex:
                print(word, value)   # the stray category keys end up here
            try:
                for (cate, times) in value.items():
                    weightDict[word].setdefault(cate, featureDict.copy())
                    # multiply/divide by 100 to keep the values from getting too large or too small !?
                    weightDict[word][cate]['tfidf'] = (times['tf'] / ddict['total'][cate]) * 100 \
                        * log(docLen[cate] / (1 + times['idf']))
                    # times['idf'] plays the role of A
                    not_in_Class = totalCount - times['idf']          # roughly B
                    not_has_word = docLen[cate] - times['idf']        # roughly C
                    d = docLen['All'] - totalCount - not_has_word     # roughly D
                    chi = pow((times['idf'] * d - not_in_Class * not_has_word), 2) \
                        / (100 * totalCount * (d + not_has_word))
                    weightDict[word][cate]['chi'] = chi
            except Exception as ex:
                pass
    fname = 'C:/lyr/DM/feature_reduction/weigthDict.json'
    with open(fname, 'w') as fp:
        json.dump(weightDict, fp)

Then I changed how the dict is organized, otherwise it wouldn't sort.
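The reorganization step itself isn't shown in the post. Based on how new_weigthDict is used below, a sketch of what it might look like, with 'both' assumed to be the CHI * TF-IDF product mentioned at the top; this is my assumption, not the original code:

# sketch only: flip weightDict from {word: {cate: {...}}} to {cate: {word: {...}}}
# 'both' = chi * tfidf is an assumption based on the CHI*TF-IDF idea mentioned earlier
new_weigthDict = {}
for word, cates in weightDict.items():
    for cate, w in cates.items():
        new_weigthDict.setdefault(cate, {})
        new_weigthDict[cate][word] = {'chi': w['chi'],
                                      'tfidf': w['tfidf'],
                                      'both': w['chi'] * w['tfidf']}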
Selecting the feature words
我,每類取了8000
resultDict = {'chi': {}, 'tfidf': {}, 'both': {}}   # store the three kinds of results
maxCount = 8000

def gen_featureDict(maxCount):
    for cate, words in new_weigthDict.items():
        print(cate)
        chiMix = dict(sorted(words.items(), key=lambda x: x[1]['chi'], reverse=True)[:maxCount])
        tdidfMix = dict(sorted(words.items(), key=lambda x: x[1]['tfidf'], reverse=True)[:maxCount])
        bothMix = dict(sorted(words.items(), key=lambda x: x[1]['both'], reverse=True)[:maxCount])
        resultDict['chi'].setdefault(cate, {})
        resultDict['tfidf'].setdefault(cate, {})
        resultDict['both'].setdefault(cate, {})
        chiSum, tfidfSum, bothSum = 0, 0, 0
        # keep only the relevant score in each result; the *Mix dicts still carry all three
        for word, weight in chiMix.items():
            chiSum += weight['chi']    # it fits in a float, but the sums still feel huge
            resultDict['chi'][cate].setdefault(word, weight['chi'])
        for word, weight in tdidfMix.items():
            tfidfSum += weight['tfidf']
            resultDict['tfidf'][cate].setdefault(word, weight['tfidf'])
        for word, weight in bothMix.items():
            bothSum += weight['both']
            resultDict['both'][cate].setdefault(word, weight['both'])
        # normalize into 0-1, otherwise multiplying raw chi values later blows up
        for word, weight in chiMix.items():
            resultDict['chi'][cate][word] /= chiSum
        for word, weight in tdidfMix.items():
            resultDict['tfidf'][cate][word] /= tfidfSum
        for word, weight in bothMix.items():
            resultDict['both'][cate][word] /= bothSum

gen_featureDict(maxCount)

Summary
To recap: count term and document frequencies per category, weight every word with CHI and TF-IDF (and their product), then keep the top 8,000 words of each category, normalized to 0-1, as the feature dictionary.