Python data mining (Jianshu)_[Learning Data Mining with Python] - Chapter 1: Getting Started with Data Mining
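1. Introduction to data mining (omitted)
2. Using Python and IPython Notebook
2.1. Installing Python
2.2. Installing IPython
2.3. Installing scikit-learn
scikit-learn is a machine learning library written in Python. It contains a large number of machine learning algorithms, datasets, utilities, and frameworks. It builds on Python's scientific computing stack, in which numpy, scipy, and related libraries are optimized for data processing tasks, so scikit-learn is fast and scalable, which makes it practical for data mining.
scikit-learn can be installed with the pip tool that ships with Python 3; if NumPy and SciPy are not installed yet, they are installed along with it. The install command is: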
pip install scikit-learn
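3. Affinity analysis example
3.1 What is affinity analysis
Affinity analysis determines how closely related samples (objects) are, based on the similarity between them. Typical applications include:
(1) Offering varied services or targeted advertising to website users.
(2) Selling users related extras alongside a film or product recommended to them.
(3) Finding people who are related, based on their genes.
......
How can affinity be measured?
(1) Count how often two products are sold together, or the proportion of customers who buy product 2 after buying product 1.
(2) Compute the similarity between two individuals.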
......
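As a rough illustration of the first measurement method, the sketch below counts how often pairs of products appear in the same purchase and what share of bread buyers also bought milk. The `transactions` data and the variable names are made up for this example; they are not part of the book's dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each is the set of products in one purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "apples"},
    {"milk", "apples"},
    {"bread", "apples"},
]

# Count how often each unordered pair of products appears together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Share of bread buyers who also bought milk.
bread_buyers = [b for b in transactions if "bread" in b]
bread_then_milk = sum("milk" in b for b in bread_buyers) / len(bread_buyers)

print(pair_counts[("bread", "milk")])   # co-purchase count for bread and milk
print(round(bread_then_milk, 3))        # proportion of bread buyers who bought milk
```

Sorting each basket before calling `combinations` keeps the pair keys in a consistent order, so ("bread", "milk") and ("milk", "bread") are never counted separately.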
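3.2 Product recommendations
When retail moved from physical stores to the web, much of the work that used to be done by hand had to be automated before a business could scale. Up-selling means pitching an additional product to a customer who is already buying something. Product recommendation, once done by sales staff, can now be done with data mining; it is highly profitable and has helped drive the e-commerce revolution.
Let's look at a simple recommendation service, based on the idea that two products people often bought together in the past are likely to be bought together again.
As an introductory data mining example, we want to find rules of the following form:
If a person buys product X, they are likely to buy product Y as well.
3.3 Loading the dataset with NumPy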
import numpy as np
dataset_filename = 'affinity_dataset.txt'
X = np.loadtxt(dataset_filename)
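3.4 Implementing a simple ranking of rules
There are many ways to judge how good a rule is; two common ones are support and confidence.
Support is the number of times a rule holds in the dataset; sometimes it needs to be normalized.
Support measures how often a given rule holds, while confidence measures how accurate the rule is: of all the cases where the premise applies, the proportion in which the conclusion also holds. To compute it, count the occurrences of the rule, then divide by the number of cases with the same premise.
The support and confidence of the rule "if a customer buys apples, they also buy bananas":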
num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:  # This person bought apples
        num_apple_purchases += 1
        # Try moving the print below into the loop to see the difference
print('{0} people bought apples'.format(num_apple_purchases))
Likewise, checking whether sample[4] is 1 tells us whether the customer also bought bananas, which is what we need to compute support and confidence.
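A minimal worked example of that calculation, assuming a tiny hand-written dataset (`X_small` is invented for illustration and stands in for the real `X`; column 3 is apples and column 4 is bananas, as in the notes):

```python
# Hypothetical stand-in for X: columns are bread, milk, cheese, apples, bananas.
X_small = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
]

num_apple_purchases = 0
rule_valid = 0
for sample in X_small:
    if sample[3] == 1:            # bought apples
        num_apple_purchases += 1
        if sample[4] == 1:        # also bought bananas
            rule_valid += 1

support = rule_valid                              # times the rule held
confidence = rule_valid / num_apple_purchases     # share of apple buyers it held for
print(support, confidence)
```

Here four customers bought apples and three of them also bought bananas, so the support is 3 and the confidence is 0.75.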
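We need these statistics for every rule in the dataset. First we build dictionaries for the two cases, rule valid and rule invalid. Each key is a (premise, conclusion) tuple whose elements are the feature indices in the feature list, not the feature names; for example, "if a customer buys apples, they also buy bananas" is written (3, 4). If an individual matches both the premise and the conclusion, the rule holds for that individual; if they match the premise but not the conclusion, the rule is invalid for them.
To compute confidence and support for all rules, first create a few dictionaries to hold the results. We use defaultdict, which returns a default value when a key is missing. We need to count three things: valid rules, invalid rules, and the number of occurrences of each premise.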
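To make the default-value behaviour concrete, here is a small sketch comparing `defaultdict(int)` with a plain dict. The names `counts`, `plain`, and `raised` are only for illustration.

```python
from collections import defaultdict

counts = defaultdict(int)   # missing keys start at int() == 0
counts[(3, 4)] += 1         # no KeyError even though (3, 4) was never set
counts[(3, 4)] += 1

plain = {}
try:
    plain[(3, 4)] += 1      # a plain dict raises KeyError here
    raised = False
except KeyError:
    raised = True

print(counts[(3, 4)], counts[(0, 1)], raised)
```

Note that merely reading `counts[(0, 1)]` inserts that key with the value 0, which is harmless for counting but worth knowing when iterating over the keys later.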
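from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurances = defaultdict(int)
n_samples, n_features = X.shape
# Process each sample and each of its features in turn. The first feature is the rule's premise: the customer bought some product
for sample in X:
    for premise in range(n_features):
        # Check whether the sample meets the premise; if not, move on to the next premise
        if sample[premise] == 0: continue
        # The premise holds (value is 1), so count one more occurrence of it
        num_occurances[premise] += 1
        for conclusion in range(n_features):
            # Skip rules whose premise and conclusion are the same, such as "if a customer buys apples, they also buy apples"; they are useless
            if premise == conclusion: continue
            # If the rule holds for this sample, count it in valid_rules (keyed by the (premise, conclusion) tuple); otherwise count it in invalid_rules
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1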
[ch1_affinity_create]
X = np.zeros((100, 5), dtype='bool')
# dtype can be changed, e.g. to int or float
X.shape[0]
# index 0 is the number of rows, 1 the number of columns
# arrays are indexed the same way as lists
np.savetxt("affinity_dataset.txt", X, fmt='%d')
# parameters (excerpt from the np.savetxt docstring):
# fmt : str or sequence of strs, optional
#     A single format (%10.5f), a sequence of formats, or a
#     multi-format string, e.g. 'Iteration %d -- %10.5f', in which
#     case `delimiter` is ignored. For complex `X`, the legal options
#     for `fmt` are:
# create a random float in [0, 1)
a = np.random.random()
print(X[:5].astype(int))  # np.int was removed in NumPy 1.24; the builtin int is equivalent
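The array above is all zeros, so every count computed from it would be zero. The sketch below shows one possible way to fill a 100 x 5 purchase matrix with random 0/1 values and round-trip it through a text file. The purchase probability of 0.4, the seed, and the file name are assumptions made for this example, not the book's generation procedure.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(14)   # the seed 14 is an arbitrary choice
# Each entry is 1 with an assumed probability of 0.4, otherwise 0.
X_random = (rng.random((100, 5)) < 0.4).astype(int)

# Save and reload through a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "affinity_dataset_sketch.txt")  # hypothetical file name
    np.savetxt(path, X_random, fmt="%d")
    X_loaded = np.loadtxt(path)

print(X_loaded.shape)
print(X_random[:5])
```

`np.loadtxt` returns floats by default, so `X_loaded` holds 0.0 and 1.0; comparisons such as `sample[3] == 1` still work as expected.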
[ch1_affinity]
n_samples, n_features = X.shape
print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
# this print style follows the pattern print('{0},{1}'.format(a, b))
# count the people who bought apples
num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:
        num_apple_purchases += 1
print('{0} people bought apples'.format(num_apple_purchases))
####################################################################################################
## bought 3 and 4 (rule valid) vs. bought 3 but not 4 (rule invalid)
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        if sample[4] == 1:
            # This person bought both Apples and Bananas
            rule_valid += 1
        else:
            # This person bought Apples, but not Bananas
            rule_invalid += 1
print("{0} cases of the rule being valid were discovered".format(rule_valid))
print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
####################################################################################################
## same count again (samples that did not buy 3 are skipped)
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:
        if sample[4] == 1:
            rule_valid += 1
        else:
            rule_invalid += 1
print('{0} rule_valid'.format(rule_valid))
print('{0} rule_invalid'.format(rule_invalid))
####################################################################################################
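The rule is: if someone bought apples, they probably bought bananas too.
The rule is invalid when someone bought apples but did not buy bananas.
# support and confidence are not defined in these notes; the next two lines assume the definitions given in section 3.4
support = rule_valid  # number of times the rule was found to hold
confidence = rule_valid / num_apple_purchases  # share of apple buyers who also bought bananas
print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
# Confidence can be thought of as a percentage using the following:
print("As a percentage, that is {0:.1f}%.".format(100 * confidence))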
####################################################################################################
from collections import defaultdict
# Now compute for all possible rules
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
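The nested loops above can be cross-checked with a matrix product. This is an alternative formulation, not the book's method: for a 0/1 matrix, `X.T @ X` gives the number of samples that bought both items of every pair, and its diagonal gives the per-item purchase counts. `X_small` is invented for the example, and the sketch assumes every item was bought at least once so that no division by zero occurs.

```python
import numpy as np

# Hypothetical 0/1 purchase matrix: bread, milk, cheese, apples, bananas.
X_small = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
])

# co_counts[p, c] = number of samples that bought both p and c.
co_counts = X_small.T @ X_small
# The diagonal holds how many samples bought each item (the premise counts).
premise_counts = np.diag(co_counts)

# confidence_matrix[p, c] = co_counts[p, c] / premise_counts[p]
confidence_matrix = co_counts / premise_counts[:, None]

print(co_counts[3, 4])            # support of apples -> bananas
print(confidence_matrix[3, 4])    # confidence of apples -> bananas
```

The diagonal of `confidence_matrix` is always 1 and corresponds to the useless X -> X rules that the loop version skips.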
####################################################################################################
from collections import defaultdict
rule_valid = defaultdict(int)
rule_invalid = defaultdict(int)
num_premise = defaultdict(int)
n_features = X.shape[1]
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        if sample[premise] == 1:
            num_premise[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion: continue
            if sample[conclusion] == 1:
                rule_valid[(premise, conclusion)] += 1
            else:
                rule_invalid[(premise, conclusion)] += 1
support = rule_valid
confidence = defaultdict(float)
for premise, conclusion in rule_valid.keys():
    confidence[(premise, conclusion)] = rule_valid[(premise, conclusion)] / num_premise[premise]
####################################################################################################
# feature names, in column order
features = ["bread", "milk", "cheese", "apples", "bananas"]
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
####################################################################################################
# define the feature names once, outside the loop
features = ["bread", "milk", "cheese", "apples", "bananas"]
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print('If someone buys {0} then they may buy {1}'.format(premise_name, conclusion_name))
    print('confidence is {0:.3f}'.format(confidence[(premise, conclusion)]))
    print('support is {0}'.format(support[(premise, conclusion)]))
# pprint prints nested data structures in a neat, readable layout
from pprint import pprint
pprint(list(support.items()))
# example (note: "import pprint" rebinds the name pprint to the module)
import pprint
data = ("test", [1, 2, 3, 'test', 4, 5], "This is a string!",
        {'age': 23, 'gender': 'F'})
print(data)
pprint.pprint(data)
Note
# [Python: lists of dicts: the itemgetter function: sorting a list by one or more dict fields](https://www.cnblogs.com/baxianhua/p/8182627.html)
from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
# [Using sorted vs. sort and reversed vs. reverse in Python](https://www.cnblogs.com/shengguorui/p/10863988.html)
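The same `itemgetter(1)` pattern can rank rules by confidence and print the best few. The dictionaries below are hand-written placeholders for the computed `support` and `confidence` results, so the numbers are illustrative only.

```python
from operator import itemgetter

# Hypothetical results keyed by (premise, conclusion) index pairs.
support = {(3, 4): 3, (4, 3): 3, (0, 1): 2, (2, 3): 2}
confidence = {(3, 4): 0.75, (4, 3): 0.75, (0, 1): 0.667, (2, 3): 1.0}
features = ["bread", "milk", "cheese", "apples", "bananas"]

# itemgetter(1) picks the value out of each (key, value) pair.
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)

# Print the three rules with the highest confidence.
for (premise, conclusion), conf in sorted_confidence[:3]:
    print("{0} -> {1}: confidence {2:.3f}, support {3}".format(
        features[premise], features[conclusion], conf,
        support[(premise, conclusion)]))
```

`sorted` returns a new list and leaves the dictionary untouched, whereas `list.sort` sorts a list in place; that is the distinction the second link above discusses.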
[OneR]
[ch1_oner_application]
# description of the iris dataset
print(dataset.DESCR)
The data must be discretized before the OneR algorithm can classify it
# Compute the mean for each attribute
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)  # assert: raises AssertionError if the condition is false
X_d = np.array(X >= attribute_means, dtype='int')
# X.mean(axis=0) averages down the rows, giving one mean per column
# assert 1==1  # condition is true, execution continues
# assert 1==2  # condition is false, an exception is raised
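A tiny worked example of this mean-threshold discretization, using a made-up 3 x 2 array instead of the iris data:

```python
import numpy as np

# Hypothetical continuous features: 3 samples, 2 attributes.
X_small = np.array([
    [5.0, 3.0],
    [6.0, 2.0],
    [7.0, 4.0],
])

attribute_means = X_small.mean(axis=0)   # one mean per column
assert attribute_means.shape == (2,)

# 1 where the value is at or above its column mean, 0 otherwise.
X_d_small = np.array(X_small >= attribute_means, dtype='int')

print(attribute_means)
print(X_d_small)
```

The column means are 6.0 and 3.0, so the first sample becomes [0, 1], the second [1, 0], and the third [1, 1].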
# sklearn has removed cross_validation; its contents were moved into model_selection
# replace sklearn.cross_validation with sklearn.model_selection
## original (the import below fails on current scikit-learn)
# Now, we split into a training and test set
from sklearn.cross_validation import train_test_split
# Set the random state to the same number to get the same results as in the book
random_state = 14
X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)
print("There are {} training samples".format(y_train.shape))
print("There are {} testing samples".format(y_test.shape))
##new
# Now, we split into a training and test set
from sklearn.model_selection import train_test_split
# Set the random state to the same number to get the same results as in the book
random_state = 14
X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)
print("There are {} training samples".format(y_train.shape))
print("There are {} testing samples".format(y_test.shape))
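# train_X, test_X, train_y, test_y = train_test_split(train_data, train_target, test_size=0.3, random_state=5)
# train_test_split() randomly splits the samples into a training set and a test set; you could also split by slicing manually
# Advantage: the split is random and objective, which reduces human bias
# test_size: the proportion of samples used for testing; if an integer, the absolute number of test samples
# zip() takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns an iterator over these tuples (Python 2 returned a list), so wrap it in list() to see the contents.
# If the iterables differ in length, the result is as long as the shortest one. With the * operator, zip can also unpack a list of tuples back into separate sequences.
>>> a = [1,2,3]
>>> b = [4,5,6]
>>> c = [4,5,6,7,8]
>>> zipped = list(zip(a, b))  # pack into a list of tuples
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> list(zip(a, c))  # length matches the shortest list
[(1, 4), (2, 5), (3, 6)]
>>> list(zip(*zipped))  # the reverse of zip: *zipped unpacks the pairs, giving the original rows back
[(1, 2, 3), (4, 5, 6)]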
class_counts = defaultdict(int)
# Iterate through each sample and count the frequency of each class/value pair
for sample, y in zip(X, y_true):
    if sample[feature] == value:
        class_counts[y] += 1
a = zip(X, y)
for b in a:
    print(b)  # each item is a (sample, label) tuple
a = zip(X, y)
for b, c in a:
    print(b)  # the sample (feature row)
a = zip(X, y)
for b, c in a:
    print(c)  # the label
error = sum([class_count for class_value, class_count in class_counts.items()
             if class_value != most_frequent_class])
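The fragments above (`class_counts`, `most_frequent_class`, `error`) fit together roughly as follows. This is a sketch of a OneR-style trainer on plain lists, not the notes' exact code: the function names `train_feature_value` and `train_on_feature` and the toy data are assumptions chosen for this example.

```python
from collections import defaultdict
from operator import itemgetter


def train_feature_value(X, y_true, feature, value):
    """Return (most frequent class, error count) for samples where X[i][feature] == value."""
    class_counts = defaultdict(int)
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Most frequent class among the matching samples.
    most_frequent_class = max(class_counts.items(), key=itemgetter(1))[0]
    # Every matching sample of another class counts as an error.
    error = sum(count for cls, count in class_counts.items()
                if cls != most_frequent_class)
    return most_frequent_class, error


def train_on_feature(X, y_true, feature):
    """Build a value -> class predictor for one feature, plus its total error."""
    values = set(sample[feature] for sample in X)
    predictors = {}
    total_error = 0
    for value in values:
        cls, error = train_feature_value(X, y_true, feature, value)
        predictors[value] = cls
        total_error += error
    return predictors, total_error


# Hypothetical discretized data: 5 samples, 2 binary features, 2 classes.
X_toy = [[0, 1], [0, 0], [1, 1], [1, 0], [1, 1]]
y_toy = [0, 0, 1, 1, 0]

# OneR keeps the single feature whose rule makes the fewest errors.
results = {f: train_on_feature(X_toy, y_toy, f) for f in range(2)}
best_feature = min(results, key=lambda f: results[f][1])
print(best_feature, results[best_feature])
```

On this toy data feature 0 misclassifies one sample and feature 1 misclassifies two, so feature 0 is selected with the rule {0: 0, 1: 1}.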