Programming Collective Intelligence (《集体智慧编程》), Chapter 6

Published: 2024/9/30

1. P126 code
To define the thresholds, modify the initialization method to add a new instance variable to classifier:

def __init__(self, getfeatures):
    classifier.__init__(self, getfeatures)
    self.thresholds = {}

When making this change, add only the last line directly to the __init__() defined in the classifier class; the line before it is not needed.
The modified __init__() is:

class classifier:
    def __init__(self, getfeatures, filename=None):
        # counts of feature/category combinations
        self.fc = {}
        # counts of documents in each category
        self.cc = {}
        self.getfeatures = getfeatures
        # classifier.__init__(self, getfeatures)
        self.thresholds = {}
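
A minimal sketch of the two placements, for comparison. The call classifier.__init__(self, getfeatures) is only meaningful inside a subclass's __init__; inside classifier itself, adding the last line is enough, which is the change described above. The subclass name thresholdclassifier and the helpers setthreshold and getthreshold, including the default of 1.0, are assumptions made for illustration.

```python
class classifier:
    def __init__(self, getfeatures, filename=None):
        self.fc = {}
        self.cc = {}
        self.getfeatures = getfeatures
        # Placement A: the instance variable lives in the base class.
        self.thresholds = {}

    # Hypothetical helpers, shown only to exercise self.thresholds.
    def setthreshold(self, cat, t):
        self.thresholds[cat] = t

    def getthreshold(self, cat):
        # Assumed default of 1.0 when no threshold was set for cat.
        return self.thresholds.get(cat, 1.0)


class thresholdclassifier(classifier):  # hypothetical subclass
    def __init__(self, getfeatures):
        # Placement B: a subclass must call the parent initializer first,
        # otherwise fc, cc and getfeatures would never be created.
        classifier.__init__(self, getfeatures)
        self.thresholds = {}


base = classifier(str.split)
sub = thresholdclassifier(str.split)
base.setthreshold('bad', 3.0)
```

Both base and sub end up with their own thresholds dictionary, so either placement should work as long as the parent initializer is called only from a subclass.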

2. P131
When entering the code to verify it, typing

>>> reload(docclass)
<module 'docclass' from 'docclass.py'>
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')

raises an error.

The correct approach is to re-create c1 after reloading the file; the error then goes away, as shown below:

>>> reload(docclass)
<module 'docclass' from 'docclass.py'>
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "docclass.py", line 94, in classify
    probs[cat] = self.prob(item, cat)
AttributeError: fisherclassifier instance has no attribute 'prob'
>>> c1 = docclass.fisherclassifier(docclass.getwords)
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')
'good'
>>> c1.classify('quick money')
'bad'
>>> c1.setminimum('bad', 0.8)
>>> c1.classify('quick money')
'good'
>>> c1.setminimum('good', 0.4)
>>> c1.classify('quick money')
'good'
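
The behavior above is consistent with how reloading works: the module's code is executed again and the class name is rebound to a new class object, while instances created earlier still refer to the old class. The following is a minimal Python 3 sketch that simulates a reload by re-executing source text in a module namespace. The module name docclass_demo and both source strings are made up for illustration and are not the real docclass.py.

```python
import types

# Version 1 of the module: the class has no setminimum() yet.
SOURCE_V1 = """
class fisherclassifier:
    def classify(self, item):
        return 'unknown'
"""

# Version 2, as if the file had been edited before reloading.
SOURCE_V2 = """
class fisherclassifier:
    def __init__(self):
        self.minimums = {}

    def setminimum(self, cat, minimum):
        self.minimums[cat] = minimum

    def classify(self, item):
        return 'unknown'
"""

docclass = types.ModuleType('docclass_demo')
exec(SOURCE_V1, docclass.__dict__)       # initial import
c1 = docclass.fisherclassifier()         # instance of the OLD class

exec(SOURCE_V2, docclass.__dict__)       # simulated reload: same namespace, new class

# The old instance does not pick up the new method.
stale_has_method = hasattr(c1, 'setminimum')
stale_is_current = isinstance(c1, docclass.fisherclassifier)

# Re-creating the instance binds it to the new class.
c1 = docclass.fisherclassifier()
fresh_has_method = hasattr(c1, 'setminimum')
```

In the Python 2 session shown above, reload is a builtin; in Python 3 the equivalent call is importlib.reload, and the same need to re-create existing instances applies.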

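3. P128
In the normalization on this page, the formula given in the text is:
cprob = clf/(clf+nclf)
but the program uses

p = clf / (freqsum)

I believe clf is already included when nclf is computed, so it does not need to be added again for the normalization. The formula in the text should therefore be changed to:
cprob = clf/nclf
Of course, whether or not clf is added does not affect the final result: it only changes the probability values, not the ranking.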
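
The two normalizations can be compared side by side in a small sketch. It assumes, as the note does, that nclf is the total over all categories (freqsum in the program). The function names and the per-category frequencies are hypothetical values chosen for illustration.

```python
def cprob_code(fprobs, cat):
    """Normalization as in the program: clf / freqsum."""
    clf = fprobs[cat]
    if clf == 0:
        return 0.0
    freqsum = sum(fprobs.values())  # already includes clf
    return clf / freqsum


def cprob_double(fprobs, cat):
    """Variant that adds clf a second time: clf / (clf + freqsum)."""
    clf = fprobs[cat]
    if clf == 0:
        return 0.0
    freqsum = sum(fprobs.values())
    return clf / (clf + freqsum)


# Hypothetical Pr(feature | category) values for one feature.
fprobs = {'good': 0.25, 'bad': 0.75}

code_values = {c: cprob_code(fprobs, c) for c in fprobs}
double_values = {c: cprob_double(fprobs, c) for c in fprobs}

# Categories ordered from most to least likely under each formula.
code_ranking = sorted(fprobs, key=code_values.get, reverse=True)
double_ranking = sorted(fprobs, key=double_values.get, reverse=True)
```

With these inputs the values from cprob_code sum to 1 across the categories, while those from cprob_double do not; the ordering of the categories for this single feature is the same under both.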
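
4. P129
The sentence "documents containing the word 'casino' have a 0.9 probability of being spam" is wrong. By my calculation, the probability that a document containing the word 'casino' is spam should be 1.0.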
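
The calculation can be sketched as follows, assuming 'casino' occurs only in documents labeled 'bad'. The five training documents below are illustrative stand-ins for the chapter's sample training data and are not taken from the notes above; getwords, fprob and cprob are simplified re-implementations written for this sketch.

```python
import re
from collections import defaultdict


def getwords(doc):
    # Unique lowercase words between 3 and 19 characters long.
    return {w.lower() for w in re.split(r'\W+', doc) if 2 < len(w) < 20}


# Illustrative training set (an assumption, see the lead-in).
training = [
    ('Nobody owns the water.', 'good'),
    ('the quick rabbit jumps fences', 'good'),
    ('buy pharmaceuticals now', 'bad'),
    ('make quick money at the online casino', 'bad'),
    ('the quick brown fox jumps', 'good'),
]

fc = defaultdict(lambda: defaultdict(int))  # feature -> category -> count
cc = defaultdict(int)                       # category -> document count

for doc, cat in training:
    cc[cat] += 1
    for word in getwords(doc):
        fc[word][cat] += 1


def fprob(f, cat):
    # Fraction of documents in cat that contain feature f.
    return fc[f][cat] / cc[cat]


def cprob(f, cat):
    clf = fprob(f, cat)
    if clf == 0:
        return 0.0
    freqsum = sum(fprob(f, c) for c in cc)
    return clf / freqsum


# 'casino' is in 1 of 2 'bad' documents and 0 of 3 'good' documents:
# 0.5 / (0.5 + 0.0) = 1.0
casino_bad = cprob('casino', 'bad')
quick_bad = cprob('quick', 'bad')
```

Under this assumption the unweighted probability for 'casino' comes out as 1.0 because the word never appears in a 'good' document, whereas a word such as 'quick' that appears in both categories gets a value strictly between 0 and 1.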

5. P137
The code in the book:

def entryfeatures(entry):
    splitter = re.compile('\\W*')
    f = {}
    # get the words in the title and tag them
    titlewords = [s.lower() for s in splitter.split(entry['title']) if len(s) > 2 and len(s) < 20]
    for w in titlewords: f['Title: ' + w] = 1
    # get the words in the summary
    summarywords = [s.lower() for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]
    # count uppercase words
    uc = 0
    for i in range(len(summarywords)):
        w = summarywords[i]
        f[w] = 1
        if w.isupper(): uc += 1
        # words from the summary as features
        if i < len(summarywords) - 1:
            twowords = ' '.join(summarywords[i : i + 1])
            f[twowords] = 1
    # keep the article's creator and publisher names whole
    f['Publisher: ' + entry['publisher']] = 1
    # UPPERCASE is a virtual word that flags too many uppercase words
    if float(uc) / len(summarywords) > 0.3: f['UPPERCASE'] = 1
    return f

The count of uppercase words relies on the summarywords variable extracted earlier. However, where summarywords is extracted,

summarywords = [s.lower() for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]

the lower() call has already converted every word in summarywords to lowercase, so counting uppercase words afterwards is meaningless. I therefore think the line should be changed to

summarywords = [s for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]
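
A minimal Python 3 sketch of the function with this change applied. Beyond the fix described above, it makes a few further changes that are assumptions of this sketch and not part of the note: the splitter uses \W+ so that Python 3.7+ does not split between every character, word pairs are taken with a slice of two words, features are lowercased when stored, and an empty summary is guarded against. The sample entry is made up.

```python
import re


def entryfeatures(entry):
    # Assumption: \W+ instead of \W*, which can match the empty string.
    splitter = re.compile(r'\W+')
    f = {}

    # Title words are lowercased and tagged.
    titlewords = [s.lower() for s in splitter.split(entry['title'])
                  if 2 < len(s) < 20]
    for w in titlewords:
        f['Title: ' + w] = 1

    # Keep the original case so isupper() can still detect uppercase words.
    summarywords = [s for s in splitter.split(entry['summary'])
                    if 2 < len(s) < 20]

    uc = 0
    for i, w in enumerate(summarywords):
        f[w.lower()] = 1            # assumption: store features in lowercase
        if w.isupper():
            uc += 1
        if i < len(summarywords) - 1:
            # Assumption: a two-word slice, so this is really a word pair.
            f[' '.join(summarywords[i:i + 2]).lower()] = 1

    f['Publisher: ' + entry['publisher']] = 1

    # Guard against an empty summary before dividing.
    if summarywords and uc / len(summarywords) > 0.3:
        f['UPPERCASE'] = 1
    return f


# Lowercasing first makes the uppercase test always fail.
lowered_is_upper = 'BUY'.lower().isupper()

entry = {'title': 'Great Deals',
         'summary': 'BUY CHEAP pills NOW today',
         'publisher': 'example'}
features = entryfeatures(entry)
```

In the sample entry three of the five summary words are uppercase, so the UPPERCASE feature is set; with the lowercased list from the book's version, uc would stay at 0.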

Please excuse my poor English.
