Python AI, machine learning, classification with the k-nearest neighbors algorithm. Kaggle case study: Facebook V: Predicting Check Ins

Published: 2024/4/30 python 28 豆豆


Problem statement

Facebook and Kaggle are launching a machine learning engineering competition for 2016.
Trail blaze your way to the top of the leaderboard to earn an opportunity at interviewing for one of the 10+ open roles as a software engineer, working on world class machine learning problems.

The goal of this competition is to predict which place a person would like to check in to.
For the purposes of this competition, Facebook created an artificial world consisting of more than 100,000 places located in a 10 km by 10 km square.
For a given set of coordinates, your task is to return a ranked list of the most likely places.
Data was fabricated to resemble location signals coming from mobile devices, giving you a flavor of what it takes to work with real data complicated by inaccurate and noisy values.
Inconsistent and erroneous location data can disrupt experience for services like Facebook Check In.
We highly encourage competitors to be active on Kaggle Scripts.
Your work there will be thoughtfully included in the decision making process.
Please note: You must compete as an individual in recruiting competitions.
You may only use the data provided to make your predictions.

Data

In this competition, you are going to predict which business a user is checking into based on their location, accuracy, and timestamp.

The train and test dataset are split based on time, and the public/private leaderboard in the test data are split randomly.
There is no concept of a person in this dataset.
All the row_id's are events, not people.
Note: Some of the columns, such as time and accuracy, are intentionally left vague in their definitions.
Please consider them as part of the challenge.

File descriptions

train.csv, test.csv
row_id: id of the check-in event
x y: coordinates
accuracy: location accuracy
time: timestamp
place_id: id of the business, this is the target you are predicting
sample_submission.csv - a sample submission file in the correct format with random predictions

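Dataset download

Analysis

Features: the x and y coordinates, the location accuracy, and the timestamp.
Target: the id of the check-in place.
Processing:

1. The coordinates satisfy 0<x<10 and 0<y<10. Because the data set is large, restrict x and y to a smaller window to save time.
2. Process the timestamp (year, month, day, week, hour, minute, second) and use the parts as new features.
3. There are thousands to tens of thousands of classes, so delete the places that have fewer than a specified number of check-ins.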

Reading the data

data = pd.read_csv("./facebook-v-predicting-check-ins/train.csv")

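Processing the data

1. Narrow the range of the data set with DataFrame.query()

# 1. Shrink the data: keep only the rows inside a small x/y window
# (copy so that new columns can be added later without a pandas warning)
data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75").copy()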
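To see what the filter keeps, here is a minimal sketch on a toy table. The values are made up for illustration; only the column names follow train.csv. It also shows the equivalent boolean-mask form, which may be easier to read for some.

```python
import pandas as pd

# Toy check-in table (made-up values, column names as in train.csv)
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "x": [1.10, 1.30, 1.20, 0.90],
    "y": [2.60, 2.60, 2.80, 2.70],
})

# Keep only rows with 1.0 < x < 1.25 and 2.5 < y < 2.75
window = toy.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")

# Equivalent filter written as a boolean mask
mask = (toy["x"] > 1.0) & (toy["x"] < 1.25) & (toy["y"] > 2.5) & (toy["y"] < 2.75)
same = toy[mask]

print(window)
```

Only the first row falls inside the window; the other three are outside it on x or y.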

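2. Process the date data with pd.to_datetime and pd.DatetimeIndex

# Convert the time column to datetimes (treating the values as seconds)
time_value = pd.to_datetime(data['time'], unit='s')
print(time_value)

3. Add the split-out date features

4. Delete the date data that is no longer needed

# Wrap the datetimes in a DatetimeIndex so the date parts can be read as attributes
time_value = pd.DatetimeIndex(time_value)
# Build some features
data['day'] = time_value.day
data['hour'] = time_value.hour
data['weekday'] = time_value.weekday
# Drop the original timestamp feature
data = data.drop(['time'], axis=1)
print(data)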
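The conversion can be checked on a few hand-picked timestamps. This is a small sketch with made-up values; it assumes, as the code above does, that the time column counts seconds, which the competition deliberately leaves undefined.

```python
import pandas as pd

# Made-up timestamps: 0 s, 1 hour, and 1 day + 1 hour after 1970-01-01 00:00
toy = pd.DataFrame({"time": [0, 3600, 90000]})

# Assumption: the unit is seconds
time_value = pd.DatetimeIndex(pd.to_datetime(toy["time"], unit="s"))

toy["day"] = time_value.day          # day of the month
toy["hour"] = time_value.hour        # hour of the day, 0-23
toy["weekday"] = time_value.weekday  # Monday=0 ... Sunday=6

toy = toy.drop(["time"], axis=1)
print(toy)
```

1970-01-01 was a Thursday, so the first two rows get weekday 3 and the third gets weekday 4.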

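After this processing, the data set is much smaller.

5. Delete places with fewer than n check-ins

Count the rows per place_id, keep the places whose count is above the threshold, then filter the data (np is numpy):

place_count = data.groupby('place_id').aggregate(np.count_nonzero)
tf = place_count[place_count.row_id > 3].reset_index()
data = data[data['place_id'].isin(tf.place_id)]

The same step with count(), as used in the complete code:

# Delete the places with too few check-ins (here, 3 or fewer)
place_count = data.groupby('place_id').count()
tf = place_count[place_count.row_id > 3].reset_index()
data = data[data['place_id'].isin(tf.place_id)]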
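A toy run makes the effect of the threshold concrete. The place ids below are hypothetical. The sketch also shows an alternative that is not in the original article: computing the group size per row with transform, which avoids the intermediate table.

```python
import pandas as pd

# Hypothetical check-ins: place 10 has 4 rows, place 20 has 2, place 30 has 1
toy = pd.DataFrame({
    "row_id": range(7),
    "place_id": [10, 10, 10, 10, 20, 20, 30],
})

# Same steps as above: count rows per place, keep places with more than 3
place_count = toy.groupby("place_id").count()
tf = place_count[place_count.row_id > 3].reset_index()
kept = toy[toy["place_id"].isin(tf.place_id)]

# Alternative (assumption, not the author's code): per-row group size
sizes = toy.groupby("place_id")["row_id"].transform("count")
kept_alt = toy[sizes > 3]

print(kept)
```

Only the four rows of place 10 survive; both variants select the same rows.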

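6. Standardization

# Feature engineering (standardization)
std = StandardScaler()
# Standardize the feature values of the training set and the test set.
# The scaler is fitted on the training set only and then reused on the test set.
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)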
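Standardization matters for k-nearest neighbors because the algorithm compares distances, so a feature with a large numeric range would otherwise dominate. The following sketch uses made-up numbers to show what fit_transform and transform do: the training columns end up with mean 0 and standard deviation 1, and the test rows are scaled with the statistics learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales
x_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_test = np.array([[2.0, 400.0]])

std = StandardScaler()
x_train_std = std.fit_transform(x_train)  # learn mean/std from the training set
x_test_std = std.transform(x_test)        # reuse the training mean/std

print(std.mean_)                  # per-column training means
print(x_train_std.mean(axis=0))   # close to 0
print(x_train_std.std(axis=0))    # close to 1
print(x_test_std)
```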

Prediction

# Run the algorithm; n_neighbors is a hyperparameter
knn = KNeighborsClassifier(n_neighbors=5)
# fit() predict() score()
knn.fit(x_train, y_train)
# Get the predictions
y_predict = knn.predict(x_test)
print("Predicted check-in places:", y_predict)
# Get the accuracy
print("Prediction accuracy:", knn.score(x_test, y_test))
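The competition asks for a ranked list of the most likely places, while predict() returns only a single label. One possible way to get a ranking from the same classifier is predict_proba. The sketch below is an illustration on synthetic clusters, with hypothetical place ids; it is not part of the original article.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: four well-separated clusters, one hypothetical place id each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0]])
place_ids = np.array([101, 202, 303, 404])
x = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
y = np.repeat(place_ids, 20)

knn = KNeighborsClassifier(n_neighbors=5).fit(x, y)

# Share of the 5 neighbors per class, one row per query point
proba = knn.predict_proba(np.array([[0.2, 0.1], [4.8, 5.1]]))

# Sort classes by descending share and keep the first three
top3 = knn.classes_[np.argsort(-proba, axis=1)[:, :3]]
print(top3)
```

The first column of top3 is the same label predict() would return; the remaining columns are the runners-up.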

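The accuracy is only about 40%, which is a bit low, so let's optimize a little by also removing the row_id column from the features:

# Drop row_id from x (not from data, which would bring the place_id target back into the features)
x = x.drop(['row_id'], axis=1)

Fine, for better or worse that is a passing result.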
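Another knob that could be tuned is n_neighbors itself. The sketch below is not something the article does; it shows, on synthetic data from make_blobs, how scikit-learn's GridSearchCV can try several values with cross-validation. The candidate values 3, 5 and 10 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the check-in features and place ids
x, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Try each candidate n_neighbors with 3-fold cross-validation on the training set
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 10]}, cv=3)
search.fit(x_train, y_train)

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(x_test, y_test)
print("best n_neighbors:", best_k)
print("test accuracy:", test_accuracy)
```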

Complete code

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd


def knncls():
    """
    Predict user check-in places with k-nearest neighbors.
    :return: None
    """
    # Read the data
    data = pd.read_csv("./facebook-v-predicting-check-ins/train.csv")
    # print(data.head(10))

    # Process the data
    # 1. Shrink the data: keep only the rows inside a small x/y window
    # (copy so that new columns can be added without a pandas warning)
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75").copy()

    # Convert the time column to datetimes (treating the values as seconds)
    time_value = pd.to_datetime(data['time'], unit='s')
    # print(time_value)

    # Wrap the datetimes in a DatetimeIndex so the date parts can be read as attributes
    time_value = pd.DatetimeIndex(time_value)

    # Build some features
    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday

    # Drop the original timestamp feature
    data = data.drop(['time'], axis=1)
    # print(data)

    # Delete the places with too few check-ins (here, 3 or fewer)
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]

    # Separate the target from the features
    y = data['place_id']
    x = data.drop(['place_id'], axis=1)
    # Drop row_id from x (not from data, which would bring place_id back into the features)
    x = x.drop(['row_id'], axis=1)

    # Split the data into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # Feature engineering (standardization)
    std = StandardScaler()
    # Fit on the training set, then apply the same scaling to the test set
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)

    # Run the algorithm; n_neighbors is a hyperparameter
    knn = KNeighborsClassifier(n_neighbors=5)
    # fit() predict() score()
    knn.fit(x_train, y_train)

    # Get the predictions
    y_predict = knn.predict(x_test)
    print("Predicted check-in places:", y_predict)

    # Get the accuracy
    print("Prediction accuracy:", knn.score(x_test, y_test))

    return None


if __name__ == "__main__":
    knncls()

Workflow summary

1. Process the data set

2. Split the data set

3. Standardize the data set

4. Run the estimator workflow to classify and predict
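The four steps above can also be sketched compactly with a scikit-learn pipeline, which bundles the scaler and the classifier so that the scaler is always fitted on the training data only. This is an illustrative variant on synthetic data, not the author's code; make_blobs stands in for the processed check-in table.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1 (stand-in): synthetic features and labels instead of the processed CSV
x, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Step 2: split into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Steps 3 and 4: standardize, then classify with k-nearest neighbors
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(x_train, y_train)

accuracy = model.score(x_test, y_test)
print("accuracy:", accuracy)
```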

——————————————————————————————————————————

Update, 2019-7-17

Many people have asked for the data set, so I am putting it here. Help yourself.

Link: https://pan.baidu.com/s/1ZT39BIG8LjJ3F6GYfcbfPw
Extraction code: hoxm