[Data Competition] Shared-Bike Demand Forecasting with an LSTM Model

Published: 2025/3/12

WeChat official account: 尤而小屋
Author: Peter
Editor: Peter

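Today's post is a new Kaggle data-analysis case study: forecasting London bike-share demand with a long short-term memory (LSTM) network. Two things stand out:

Advanced visualization: the exploratory analysis is done with seaborn. The charts are polished, cover many dimensions, and lead to clear conclusions.
An LSTM model: demand is modeled with a long short-term memory network, which the author uses to make the forecast more useful as a reference.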

This solution is ranked third.

If you are interested, the original notebook is available here:

https://www.kaggle.com/yashgoyal401/advanced-visualizations-and-predictions-with-lstm/notebook

A similar article:

https://www.kaggle.com/geometrein/helsinki-city-bike-network-analysis

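Steps in this article

The original notebook follows these main steps: data overview, feature engineering, exploratory data analysis (EDA), preprocessing, model building, demand forecasting, and model evaluation.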

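The LSTM model

The focus of this article is the LSTM model. LSTM is a recurrent neural network architecture that is well suited to processing and predicting time series in which the important events are separated by relatively long intervals and delays.

My own understanding is limited, so for a detailed explanation of how the model works, see the following book and articles:

1. Book: "Long Short-Term Memory Networks with Python" by Jason Brownlee, an Australian machine learning specialist

2. Zhihu article: https://zhuanlan.zhihu.com/p/24018768

3. Bilibili: search for Mu Li's lectures on LSTM

Once I understand it well enough, I will write an article on how LSTM works. Let's keep learning together!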

Data

Import libraries

import pandas as pd
import numpy as np

# seaborn visualization
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(context="notebook", style="darkgrid", palette="deep", font="sans-serif", font_scale=1, color_codes=True)

# suppress warnings
import warnings
warnings.filterwarnings("ignore")

Read the data
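The loading code is not reproduced in the text. A minimal sketch, assuming the Kaggle data is a CSV file with the ten columns listed in the next section. The two inline rows are an illustrative sample only, and `sample_csv` is a hypothetical stand-in for the path of the downloaded file:

```python
import io
import pandas as pd

# Illustrative two-row sample with the same ten columns as the dataset.
sample_csv = io.StringIO(
    "timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season\n"
    "2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0\n"
    "2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0\n"
)

# In practice, pass the path of the downloaded CSV instead of sample_csv.
data = pd.read_csv(sample_csv)

print(data.shape)   # (rows, columns)
print(data.dtypes)  # timestamp is plain text until it is converted to datetime
```

Reading from an in-memory buffer keeps the sketch runnable without the file. `pd.read_csv` accepts a path in exactly the same position.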

Basic information:

# 1. Size of the data
data.shape
(17414, 10)

# 2. Column data types
data.dtypes
timestamp        object
cnt               int64
t1              float64
t2              float64
hum             float64
wind_speed      float64
weather_code    float64
is_holiday      float64
is_weekend      float64
season          float64
dtype: object

The data contains no missing values:

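Field descriptions

What each field in the data means:

timestamp: timestamp field used to group the data
cnt: count of new bike shares
t1: actual temperature in °C
t2: "feels like" (perceived) temperature in °C
hum: humidity in percent
wind_speed: wind speed in km/h
weather_code: weather category (the specific values are listed at the end of the figure below)
is_holiday: boolean field; 1 = holiday, 0 = non-holiday
is_weekend: boolean field; 1 if the day falls on a weekend
season: meteorological season category: 0 = spring; 1 = summer; 2 = autumn; 3 = winter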

TensorFlow basics

Checking the TensorFlow version and GPU information:

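Feature engineering

The feature engineering in this article is implemented as follows:

Data info

A DataFrame's info() output shows the column names, non-null counts, data types, and other basic information.

Handling the time field

The time-related field in the raw data is processed in three steps:

1. Convert the timestamp to a datetime type

2. Make it the index

Use the set_index method to turn the timestamp column into the index.

3. Extract the hour, day of month, day of week, and month

Extract these time-related features, then check the shape of the data.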
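The three steps above can be sketched as follows. The column names `hour`, `day_of_month`, `day_of_week`, and `month` are the ones the plotting code later in the article uses. The three-row frame is a made-up stand-in for the real data, and the exact calls in the original notebook may differ:

```python
import pandas as pd

# Tiny stand-in frame; the real data has 17414 hourly rows and 10 columns.
data = pd.DataFrame({
    "timestamp": ["2015-01-04 00:00:00", "2015-01-04 01:00:00", "2015-01-05 08:00:00"],
    "cnt": [182, 138, 1200],
})

# 1. Parse the strings into datetimes
data["timestamp"] = pd.to_datetime(data["timestamp"])

# 2. Use the timestamp as the index
data = data.set_index("timestamp")

# 3. Derive calendar features from the DatetimeIndex
data["hour"] = data.index.hour
data["day_of_month"] = data.index.day
data["day_of_week"] = data.index.dayofweek  # Monday=0 ... Sunday=6
data["month"] = data.index.month

print(data.shape)
```

On the real data, moving timestamp into the index leaves 9 columns, and the 4 derived features bring the total to 13, which matches the shapes printed in the train/test split later on.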

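Exploratory data analysis (EDA)

Checking for null values

The method I usually use to check whether a dataset contains null values:

The notebook uses a heatmap instead. The plot shows nothing, which indicates that the data contains no null values.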
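Both checks rest on the boolean mask returned by `isnull()`. A minimal sketch on a made-up frame with one deliberately missing value (the exact heatmap call in the notebook is not shown, so the one in the comment is an assumption):

```python
import numpy as np
import pandas as pd

# Stand-in frame with one missing value, for illustration only.
df = pd.DataFrame({"cnt": [182, 138, 134], "t1": [3.0, np.nan, 2.5]})

# Number of missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# True if any value anywhere in the frame is missing
has_missing = bool(df.isnull().values.any())
print(has_missing)

# Heatmap variant (assumed form): sns.heatmap(df.isnull())
# A mask that is False everywhere renders as one flat color.
```

The per-column count is usually the quicker check. The heatmap is mainly useful for seeing where missing values cluster.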

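Demand over time

How total demand (cnt) changes over time:

plt.figure(figsize=(15,6))
sns.lineplot(
    data=data,      # input data
    x=data.index,   # time
    y=data.cnt      # demand
)
plt.xticks(rotation=90)

The figure above shows how demand changes across the whole date range.

Resampling by month

pandas resamples time series with the resample method; the frequency can be daily, weekly, monthly, and so on.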
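The code that builds `df_by_month` is not shown in the text. A minimal sketch on synthetic hourly data, assuming the monthly values are sums (the notebook may aggregate differently). The `"MS"` alias labels each month by its first day and is used here because it behaves the same across pandas versions:

```python
import numpy as np
import pandas as pd

# Synthetic hourly series covering January and February 2015 (59 days).
idx = pd.date_range("2015-01-01", periods=59 * 24, freq=pd.Timedelta(hours=1))
data = pd.DataFrame({"cnt": np.ones(len(idx), dtype=int)}, index=idx)

# Group rows by calendar month and sum the hourly counts.
df_by_month = data.resample("MS").sum()

print(df_by_month)
```

With one rental per hour, each monthly total equals the number of hours in that month, which makes the grouping easy to verify by eye.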

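Monthly demand over time:

plt.figure(figsize=(16,6))
sns.lineplot(data=df_by_month, x=df_by_month.index, y=df_by_month.cnt, color="red")
plt.xticks(rotation=90)
plt.show()

Three conclusions can be drawn from the plot:

Demand rises from the start of the year through July and August

It peaks around August

After August demand starts to fall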

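Demand by hour

plt.figure(figsize=(16,6))
sns.pointplot(
    data=data,     # data
    x=data.hour,   # hour
    y=data.cnt,    # demand
    color="red"    # color
)
plt.show()

Demand by month

plt.figure(figsize=(16,6))
sns.pointplot(data=data, x=data.month, y=data.cnt, color="red")
plt.show()

A clear conclusion: July is the peak month for demand.

By day of week

plt.figure(figsize=(16,6))
sns.pointplot(data=data, x=data.day_of_week, y=data.cnt, color="black")
plt.show()

From the plot:

Demand from Monday to Friday is clearly higher than on the two weekend days;
it is already trending down by Friday.

By day of month

plt.figure(figsize=(16,6))
sns.lineplot(
    data=data,
    x=data.day_of_month,  # day of the month
    y=data.cnt,           # demand
    color="r"
)
plt.show()

Three conclusions:

Demand rises gradually over the first 10 days
The middle 10 days show small fluctuations
The last 10 days fluctuate more and trend downward

Visualizations across multiple dimensions

1. Hour, split by holiday

plt.figure(figsize=(16,6))
sns.pointplot(
    data=data,
    x=data.hour,          # group by hour
    y=data.cnt,
    hue=data.is_holiday   # split by holiday flag
)
plt.show()

The plot above shows:

On non-holidays (is_holiday=0): usage peaks at 8:00 and again at 17:00-18:00, which are exactly the commuting hours
On holidays (is_holiday=1): the real peak is at 14:00-15:00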
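By default `sns.pointplot` draws the mean of `y` for each `x` category (with a confidence interval), so the peaks read off the plot can be cross-checked numerically with a groupby. A minimal sketch on made-up counts, using the column names from the article:

```python
import pandas as pd

# Synthetic rows; the real frame has one row per hour.
data = pd.DataFrame({
    "hour":       [8, 8, 14, 14, 8, 8, 14, 14],
    "is_holiday": [0, 0, 0, 0, 1, 1, 1, 1],
    "cnt":        [3000, 3400, 1200, 1000, 300, 500, 1500, 1700],
})

# Mean demand for each (hour, is_holiday) pair, one column per holiday flag
hourly = data.groupby(["hour", "is_holiday"])["cnt"].mean().unstack("is_holiday")
print(hourly)

# Hour with the highest mean demand in each group
peak_hours = hourly.idxmax()
print(peak_hours)
```

The same pattern works for any of the `hue=` splits used in this section by swapping `is_holiday` for `is_weekend`, `season`, or `weather_code`.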

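2. Month, split by holiday

plt.figure(figsize=(16,6))
sns.pointplot(data=data, x=data.month, y=data.cnt, hue=data.is_holiday)
plt.show()

On non-holidays, usage peaks in July.

3. By season

plt.figure(figsize=(16,6))
sns.pointplot(
    data=data,
    y=data.cnt,
    x=data.month,
    hue=data.season,  # split by season
)
plt.show()

From the plot above: the summer months (June, July, and August) are when demand is highest.

4. Season + holiday

plt.figure(figsize=(16,6))
# count the records in each group
sns.countplot(data=data, x=data.season, hue=data.is_holiday)
plt.show()

Note that countplot counts rows, so this plot shows the number of hourly records in each group rather than the number of rentals. Across the four seasons, the non-holiday counts for seasons 1 and 2 are slightly higher than for seasons 0 and 3, while the holiday records show some presence in seasons 0-3.

5. Weekend + hour

plt.figure(figsize=(16,6))
sns.lineplot(
    data=data,
    x=data.hour,          # hour
    y=data.cnt,
    hue=data.is_weekend   # split by weekend flag
)
plt.show()

Non-weekend (0): the peaks are again at 7:00-8:00 and 17:00-18:00
Weekend (1): the peak is at 14:00-15:00

This matches the holiday result above.

6. Season + hour

plt.figure(figsize=(16,6))
sns.pointplot(
    data=data,
    x=data.hour,
    y=data.cnt,
    hue=data.season  # split by season
)
plt.show()

Hourly demand by season: the overall shape is much the same in every season, with a morning peak around 8:00 and a second peak at 17:00-18:00, when people leave work.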

Weather factors

Humidity and demand

How demand changes with humidity:

plt.figure(figsize=(16,6))
sns.pointplot(data=data, x=data.hum, y=data.cnt, color="black")
plt.xticks(rotation=90)
plt.show()

The plot shows that overall demand trends downward as humidity increases.

Wind speed and demand

plt.figure(figsize=(16,6))
sns.pointplot(data=data, x=data.wind_speed, y=data.cnt)
plt.xticks(rotation=90)
plt.show()

The effect of wind speed on demand:

There is a local peak at a wind speed of 25.5
Demand drops when the wind speed is either high or low

Weather conditions (weather_code)

plt.figure(figsize=(16,6))
sns.pointplot(data=data, x=data.weather_code, y=data.cnt)
plt.xticks(rotation=90)
plt.show()

Conclusion: demand is highest under scattered clouds (weather_code=2).

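Weather + hour

plt.figure(figsize=(16,6))
sns.pointplot(
    data=data,
    x=data.hour,
    y=data.cnt,
    hue=data.weather_code  # split by weather
)
plt.show()

From the plot above: the weather has little effect on the shape of the hourly demand curve. Demand is still highest during the morning and evening rush hours, which suggests that commuting is almost unaffected by the weather.

Day of week + weather

plt.figure(figsize=(16,6))
sns.countplot(
    data=data,
    x=data.day_of_week,      # day of the week
    hue=data.weather_code,   # weather condition
    palette="viridis"
)
plt.legend(loc="best")  # legend position
plt.show()

This is again a countplot, so the bars count hourly records for each weather code on each weekday rather than rentals. From the plot above:

On every day of the week, code=1 has the largest count

Monday to Friday: the counts follow code=1 > 2 > 3 > 7 > 4

Saturday and Sunday: the weather appears to matter less; apart from code=1, the gaps between the other weather conditions also narrow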

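Box plots

A box plot shows how a set of values is distributed.

By hour

plt.figure(figsize=(16,6))
sns.boxplot(
    data=data,
    x=data.hour,  # hour
    y=data.cnt
)
plt.show()

The distributions highlight two important time windows: 7:00-8:00 in the morning and 17:00-18:00 in the afternoon.

By day of week

plt.figure(figsize=(16,6))
sns.boxplot(data=data, x=data["day_of_week"], y=data.cnt)
plt.show()

In the box plot by day of week, Wednesday shows something of a usage peak.

By day of month

plt.figure(figsize=(16,6))
sns.boxplot(data=data, x=data["day_of_month"], y=data.cnt)
plt.show()

By day of month, the 9th shows a peak.

By month

plt.figure(figsize=(16,6))
sns.boxplot(data=data, x=data["month"], y=data.cnt)
plt.show()

Clearly visible: demand peaks in July and August and is lower in the months on either side.

Holiday + day of month

# day of month, split by holiday flag
plt.figure(figsize=(16,6))
sns.boxplot(data=data, x=data["day_of_month"], y=data.cnt, hue=data["is_holiday"])
plt.show()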

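Data preprocessing

Modeling starts here. The first task is data preprocessing, which covers two things:

Splitting the dataset
Normalizing and standardizing the data

Splitting the data

Split the dataset in a 9:1 ratio:

# module for splitting the dataset
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.1, random_state=0)
print(train.shape)
print(test.shape)

# ------
(15672, 13)
(1742, 13)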
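Note that `train_test_split` shuffles rows by default, so the split above does not keep the rows in time order. For sequence models a chronological split is a common alternative. A minimal sketch on synthetic data, assuming the frame is sorted by its timestamp index (this is not what the original notebook does):

```python
import numpy as np
import pandas as pd

# Synthetic frame: 100 hourly rows with a made-up target.
idx = pd.date_range("2015-01-01", periods=100, freq=pd.Timedelta(hours=1))
data = pd.DataFrame({"cnt": np.arange(100)}, index=idx)

# Keep the first 90% of the timeline for training and the last 10% for testing.
split = int(len(data) * 0.9)
train, test = data.iloc[:split].copy(), data.iloc[split:].copy()

print(train.shape, test.shape)
print(train.index.max() < test.index.min())  # every test row is later in time
```

Passing `shuffle=False` to `train_test_split` gives the same kind of ordered split without manual slicing.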

Data normalization

from sklearn.preprocessing import MinMaxScaler
# instantiate the scaler
scaler = MinMaxScaler()

# fit on a subset of the columns
num_col = ['t1', 't2', 'hum', 'wind_speed']
trans_1 = scaler.fit(train[num_col].to_numpy())
# transform the training set
train.loc[:, num_col] = trans_1.transform(train[num_col].to_numpy())
# transform the test set
test.loc[:, num_col] = trans_1.transform(test[num_col].to_numpy())

# normalize the label cnt
cnt_scaler = MinMaxScaler()
# fit
trans_2 = cnt_scaler.fit(train[["cnt"]])
# transform
train["cnt"] = trans_2.transform(train[["cnt"]])
test["cnt"] = trans_2.transform(test[["cnt"]])
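With default settings, `MinMaxScaler` maps each value to (x - min) / (max - min) using the minimum and maximum seen during `fit`. A hand-rolled NumPy sketch of that arithmetic and its inverse, on made-up numbers (the helper names `scale` and `unscale` are hypothetical):

```python
import numpy as np

train_cnt = np.array([0.0, 500.0, 2000.0])   # synthetic training targets
test_cnt = np.array([1000.0, 2500.0])        # synthetic test targets

# "fit": learn the minimum and maximum from the training data only
lo, hi = train_cnt.min(), train_cnt.max()

def scale(x):
    """Map values to [0, 1] relative to the training range."""
    return (x - lo) / (hi - lo)

def unscale(z):
    """Inverse of scale: convert back to the original units."""
    return z * (hi - lo) + lo

scaled_train = scale(train_cnt)   # 0, 0.25, 1
scaled_test = scale(test_cnt)     # 0.5, 1.25 (can fall outside [0, 1])
restored = unscale(scaled_test)   # back to 1000, 2500

print(scaled_train, scaled_test, restored)
```

Fitting on the training set alone keeps test information out of the scaler, but test values above the training maximum then scale to more than 1. Since the output layer defined later in the article uses a sigmoid, whose range is (0, 1), such values cannot be reached by the model's predictions.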

Training and test sets

Each sample is a window of 24 consecutive rows, and its label is the cnt value of the row that follows the window.

# progress bar
from tqdm.notebook import tqdm
tqdm.pandas()  # registers progress_apply on pandas objects (not used below)

def prepare_data(X, y, time_steps=1):
    Xs = []
    Ys = []
    for i in tqdm(range(len(X) - time_steps)):
        a = X.iloc[i:(i + time_steps)].to_numpy()
        Xs.append(a)
        Ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(Ys)

steps = 24
X_train, y_train = prepare_data(train, train.cnt, time_steps=steps)
X_test, y_test = prepare_data(test, test.cnt, time_steps=steps)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
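The printed shapes are not reproduced in the text. The windowing logic can be checked on a small synthetic array. This sketch re-implements the same loop with plain NumPy (the function name `make_windows` is hypothetical):

```python
import numpy as np

def make_windows(features, target, time_steps):
    """Return (n - time_steps) windows and the target value right after each window."""
    xs, ys = [], []
    for i in range(len(features) - time_steps):
        xs.append(features[i:i + time_steps])
        ys.append(target[i + time_steps])
    return np.array(xs), np.array(ys)

# 30 synthetic rows with 13 columns, the column count after feature engineering
features = np.arange(30 * 13, dtype=float).reshape(30, 13)
target = features[:, 0]

X, y = make_windows(features, target, time_steps=24)
print(X.shape)  # (6, 24, 13)
print(y.shape)  # (6,)
```

By the same arithmetic, the (15672, 13) training frame and the (1742, 13) test frame with steps = 24 yield X_train of shape (15648, 24, 13) and X_test of shape (1718, 24, 13).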

LSTM modeling

Import libraries

Import the required libraries before building the model:

# 1. import the required libraries
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Bidirectional

# 2. instantiate the model and add the layers
model = Sequential()
model.add(Bidirectional(LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2]))))
model.add(Dropout(0.2))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="mse")

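Fitting the model

Pass in the training data and fit the model. The fit call itself appears in the full source listing near the end of this article.

MSE versus epoch

Look at how the mean squared error changes across epochs:

plt.plot(prepared_model.history["loss"], label="loss")
plt.plot(prepared_model.history["val_loss"], label="val_loss")
# legend position
plt.legend(loc="best")
# axis labels
plt.xlabel("No. Of Epochs")
plt.ylabel("mse score")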

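Demand forecasting

Generate actual and predicted values

The inverse_transform function converts the normalized data back to the original scale.

pred = model.predict(X_test)  # predict on the test set
y_test_inv = cnt_scaler.inverse_transform(y_test.reshape(-1, 1))  # actual values, original scale
pred_inv = cnt_scaler.inverse_transform(pred)  # predictions, original scale
pred_inv

Plot comparison

Plot the rescaled test-set values against the model's predictions:

plt.figure(figsize=(16,6))
# test set: actual values
plt.plot(y_test_inv.flatten(), marker=".", label="actual")
# model predictions
plt.plot(pred_inv.flatten(), marker=".", label="prediction", color="r")
# legend position
plt.legend(loc="best")
plt.show()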

Building the comparison table

Put the actual and predicted test-set values side by side so that they can be evaluated with two metrics:

1. The method in the original notebook (more complicated than necessary, in my opinion):

# the original approach is more complicated than it needs to be
y_test_actual = cnt_scaler.inverse_transform(y_test.reshape(-1, 1))
y_test_pred = cnt_scaler.inverse_transform(pred)

arr_1 = np.array(y_test_actual)
arr_2 = np.array(y_test_pred)

actual = pd.DataFrame(data=arr_1.flatten(), columns=["actual"])
predicted = pd.DataFrame(data=arr_2.flatten(), columns=["predicted"])

final = pd.concat([actual, predicted], axis=1)
final.head()

2. My own method

y_test_actual = cnt_scaler.inverse_transform(y_test.reshape(-1, 1))
y_test_pred = cnt_scaler.inverse_transform(pred)
final = pd.DataFrame({"actual": y_test_actual.flatten(), "pred": y_test_pred.flatten()})
final.head()

Model evaluation

Evaluate the model with RMSE and r2_score:

# RMSE, r2_score
from sklearn.metrics import mean_squared_error, r2_score
rmse = np.sqrt(mean_squared_error(final.actual, final.pred))
r2 = r2_score(final.actual, final.pred)

print("rmse is : ", rmse)
print("-------")
print("r2_score is : ", r2)

# result
rmse is :  1308.7482342002293
-------
r2_score is :  -0.3951062293743659
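To help read these two numbers: RMSE is the square root of the mean squared error, and R² equals 1 - SS_res / SS_tot, so it is 0 for a model that always predicts the mean of the actual values and negative for anything worse than that. A minimal NumPy sketch on made-up values (the helper name `rmse_r2` is hypothetical):

```python
import numpy as np

def rmse_r2(actual, pred):
    """Compute RMSE and R^2 from their definitions."""
    rmse = np.sqrt(np.mean((actual - pred) ** 2))
    ss_res = np.sum((actual - pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
    return rmse, 1.0 - ss_res / ss_tot

actual = np.array([100.0, 200.0, 300.0, 400.0])      # synthetic values

# Baseline: always predict the mean of the actual values
rmse_mean, r2_mean = rmse_r2(actual, np.full(4, actual.mean()))

# A worse predictor: the actual values in reverse order
rmse_bad, r2_bad = rmse_r2(actual, actual[::-1])

print(rmse_mean, r2_mean)
print(rmse_bad, r2_bad)
```

Read this way, the r2_score of about -0.395 reported above means that on this test set the predictions have a larger squared error than a constant prediction equal to the test-set mean.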

The author then plots actual against predicted values once more:

plt.figure(figsize=(16,6))
# plot actual and predicted values
plt.plot(final.actual, marker=".", label="Actual label")
plt.plot(final.pred, marker=".", label="predicted label")
# legend position
plt.legend(loc="best")
plt.show()

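Open question

I have one question: apart from the colors, what is the difference between the two plots below? I read through the whole source, and the data and code behind the two plots are the same. The author also wrote two paragraphs:

Note that our model is predicting only one point in the future. That being said, it is doing very well. Although our model can’t really capture the extreme values it does a good job of predicting (understanding) the general pattern.

In short: the model forecasts only one step ahead, and the author considers that it follows the general pattern well even though it does not capture the extreme values.

AS you can see that I have used Bidirectional LSTM to train our model and Our model is working quite well. Our model is capable to capture the trend and not capturing the Extreme values which is a really good thing. SO, we can say that the overall perfomance is good.

In short: the author trained a bidirectional LSTM, finds that it captures the trend but not the extreme values, and concludes that the overall performance is good.

The full modeling source follows for reference and study. Comments on the question above are welcome: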

# split the dataset
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.1, random_state=0)

# normalize the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# normalize the 4 independent variables
num_colu = ['t1', 't2', 'hum', 'wind_speed']
trans_1 = scaler.fit(train[num_colu].to_numpy())
train.loc[:, num_colu] = trans_1.transform(train[num_colu].to_numpy())
test.loc[:, num_colu] = trans_1.transform(test[num_colu].to_numpy())
# normalize the dependent variable
cnt_scaler = MinMaxScaler()
trans_2 = cnt_scaler.fit(train[["cnt"]])
train["cnt"] = trans_2.transform(train[["cnt"]])
test["cnt"] = trans_2.transform(test[["cnt"]])

# X_train, y_train, X_test, y_test come from prepare_data(...) as shown earlier

# import the modeling libraries and instantiate
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Bidirectional
# build the sequential model
model = Sequential()
model.add(Bidirectional(LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2]))))
model.add(Dropout(0.2))
model.add(Dense(1, activation="sigmoid"))    # activation function
model.compile(optimizer="adam", loss="mse")  # optimizer and loss function

with tf.device('/GPU:0'):
    prepared_model = model.fit(X_train, y_train, batch_size=32, epochs=100,
                               validation_data=(X_test, y_test))

# compare the two losses
plt.plot(prepared_model.history["loss"], label="loss")
plt.plot(prepared_model.history["val_loss"], label="val_loss")
plt.legend(loc="best")
plt.xlabel("No. Of Epochs")
plt.ylabel("mse score")

# predict on the test set
pred = model.predict(X_test)
# convert cnt back to the original scale
y_test_inv = cnt_scaler.inverse_transform(y_test.reshape(-1, 1))
pred_inv = cnt_scaler.inverse_transform(pred)

# plot 1
plt.figure(figsize=(16,6))
plt.plot(y_test_inv.flatten(), marker=".", label="actual")
plt.plot(pred_inv.flatten(), marker=".", label="prediction", color="r")

# convert cnt back to the original scale
y_test_actual = cnt_scaler.inverse_transform(y_test.reshape(-1, 1))
y_test_pred = cnt_scaler.inverse_transform(pred)

# convert to arrays
arr_1 = np.array(y_test_actual)
arr_2 = np.array(y_test_pred)

# build pandas DataFrames and combine them
actual = pd.DataFrame(data=arr_1.flatten(), columns=["actual"])
predicted = pd.DataFrame(data=arr_2.flatten(), columns=["predicted"])
final = pd.concat([actual, predicted], axis=1)

# evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
rmse = np.sqrt(mean_squared_error(final.actual, final.predicted))
r2 = r2_score(final.actual, final.predicted)
print("rmse is : {}\nr2 is : {}".format(rmse, r2))

# plot 2
plt.figure(figsize=(16,6))
plt.plot(final.actual, label="Actual data")
plt.plot(final.predicted, label="predicted values")
plt.legend(loc="best")

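To get the data used in this article, follow the WeChat official account 尤而小屋 and reply 自行車.

Alternatively, download it from Baidu Cloud: https://pan.baidu.com/s/1x_ZkXQJIrgyjkJ7Sko8lmA

Extraction code: igoc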
