Exploratory Analysis of Mobike

Published: 2024/1/1


An Exploratory Analysis of Mobike Bike-Sharing Data

Data Overview

Github

git init
git clone github_link

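With the growth of the sharing economy, bike sharing has emerged. Transport is one of the four basic needs of daily life (clothing, food, housing, and transport), so it is worth examining how it develops under this new economic model and what problems arise.
This project uses Mobike data: 102,361 Mobike order records with the following variables: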

Column	Description

orderid	Order ID
bikeid	Bike ID
userid	User ID
start_time	Ride start time
end_time	Ride end time
start_location_x	Start longitude
start_location_y	Start latitude
end_location_x	End longitude
end_location_y	End latitude
track	Track points

# Import libraries
import os
import warnings

import folium
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from folium import plugins
from IPython.display import HTML

warnings.filterwarnings("ignore")

os.listdir('Mobike Data')

['.DS_Store', 'MOBIKE 樣本數據說明(data_description).pdf', 'Mobike_location_heatmap.html', 'mobike_shanghai_sample_updated.csv']

Data Assessment and Cleaning

# Load the data
mobike_df = pd.read_csv(os.path.join("Mobike Data", 'mobike_shanghai_sample_updated.csv'))

# Inspect and assess the data
mobike_df.head(5)

	orderid	bikeid	userid	start_time	start_location_x	start_location_y	end_time	end_location_x	end_location_y	track
0	78387	158357	10080	2016-08-20 06:57	121.348	31.389	2016-08-20 07:04	121.357	31.388	121.347,31.392#121.348,31.389#121.349,31.390#1...
1	891333	92776	6605	2016-08-29 19:09	121.508	31.279	2016-08-29 19:31	121.489	31.271	121.489,31.270#121.489,31.271#121.490,31.270#1...
2	1106623	152045	8876	2016-08-13 16:17	121.383	31.254	2016-08-13 16:36	121.405	31.248	121.381,31.251#121.382,31.251#121.382,31.252#1...
3	1389484	196259	10648	2016-08-23 21:34	121.484	31.320	2016-08-23 21:43	121.471	31.325	121.471,31.325#121.472,31.325#121.473,31.324#1...
4	188537	78208	11735	2016-08-16 07:32	121.407	31.292	2016-08-16 07:41	121.418	31.288	121.407,31.291#121.407,31.292#121.408,31.291#1...

mobike_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102361 entries, 0 to 102360
Data columns (total 10 columns):
orderid             102361 non-null int64
bikeid              102361 non-null int64
userid              102361 non-null int64
start_time          102361 non-null object
start_location_x    102361 non-null float64
start_location_y    102361 non-null float64
end_time            102361 non-null object
end_location_x      102361 non-null float64
end_location_y      102361 non-null float64
track               102361 non-null object
dtypes: float64(4), int64(3), object(3)
memory usage: 6.6+ MB

# Fix the data types
mobike_df.orderid = mobike_df.orderid.astype(str)
mobike_df.bikeid = mobike_df.bikeid.astype(str)
mobike_df.userid = mobike_df.userid.astype(str)
mobike_df.start_time = pd.to_datetime(mobike_df.start_time)
mobike_df.end_time = pd.to_datetime(mobike_df.end_time)

# Parse a list of "lng,lat" strings into (lng, lat) float tuples
def get_loc(value):
    loc_list = []
    for item in value:
        loc = tuple([float(i) for i in item.split(',')])
        loc_list.append(loc)
    return loc_list

# Strip stray backslashes (regex=False so the backslash is matched literally)
mobike_df.track = mobike_df.track.str.replace('\\', '', regex=False)
# Split each track string into its points
mobike_df.track = mobike_df.track.str.split('#').apply(get_loc)

# Riding time in minutes, from the ride start and end times
mobike_df['riding_time'] = (mobike_df.end_time - mobike_df.start_time).apply(lambda x: x.total_seconds())/60

# Check for outliers
mobike_df.describe()

	start_location_x	start_location_y	end_location_x	end_location_y	riding_time
count	102361.000000	102361.000000	102361.000000	102361.000000	102361.000000
mean	121.454144	31.251740	121.453736	31.252029	17.195162
std	0.060862	0.057358	0.061577	0.057740	34.049919
min	121.173000	30.842000	120.486000	30.841000	1.000000
25%	121.415000	31.212000	121.414000	31.212000	7.000000
50%	121.456000	31.260000	121.456000	31.261000	12.000000
75%	121.497000	31.294000	121.497000	31.294000	20.000000
max	121.970000	31.450000	121.971000	31.477000	4725.000000

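riding_time contains outliers: the maximum is 4,725 minutes (more than three days), far above the 75th percentile of 20 minutes and implausible for a single ride.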

# Explore the distribution of riding time
plt.figure(figsize=(8,6))
colorful = sns.color_palette('Paired')
plt.hist(data=mobike_df, x='riding_time', color=colorful[3]);

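Riding time is heavily right-skewed with a small number of very large outliers, so the axis is log-transformed to show the distribution in finer detail.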

# Plot riding time on a log-scaled axis
plt.figure(figsize=(8,6))
colorful = sns.color_palette('Paired')
bin_edge = 10**np.arange(0, np.log10(mobike_df.riding_time.max())+0.05, 0.05)
plt.hist(data=mobike_df, x='riding_time', bins=bin_edge, color=colorful[3])
plt.xscale('log')
plt.xticks(ticks=[1,3,10,30,100,300,500,1000,2000,4000], labels=[1,3,10,30,100,300,500,1000,2000,4000]);

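Single rides mostly last 5-30 minutes, and the overall distribution is heavily right-skewed. The dataset contains outliers; rides of up to 200 minutes are treated as valid. The outliers may come from data-entry errors, or from users forgetting to lock the bike after a ride so that the recorded duration is too long.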

# Keep only the records with a valid riding time
mobike_df = mobike_df[mobike_df.riding_time <= 200]
mobike_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102230 entries, 0 to 102360
Data columns (total 11 columns):
orderid             102230 non-null object
bikeid              102230 non-null object
userid              102230 non-null object
start_time          102230 non-null datetime64[ns]
start_location_x    102230 non-null float64
start_location_y    102230 non-null float64
end_time            102230 non-null datetime64[ns]
end_location_x      102230 non-null float64
end_location_y      102230 non-null float64
track               102230 non-null object
riding_time         102230 non-null float64
dtypes: datetime64[ns](2), float64(5), object(4)
memory usage: 7.8+ MB

mobike_df.head(2)

	orderid	bikeid	userid	start_time	start_location_x	start_location_y	end_time	end_location_x	end_location_y	track	riding_time
0	78387	158357	10080	2016-08-20 06:57:00	121.348	31.389	2016-08-20 07:04:00	121.357	31.388	[(121.347, 31.392), (121.348, 31.389), (121.34...	7.0
1	891333	92776	6605	2016-08-29 19:09:00	121.508	31.279	2016-08-29 19:31:00	121.489	31.271	[(121.489, 31.27), (121.489, 31.271), (121.49,...	22.0

mobike_df.to_csv('mobike_df_edit.csv', index=False)

from operator import itemgetter

# Reorder the track points according to the direction from start to end
def reorder_track(start_point, end_point, track):
    lgt1, lat1 = start_point
    lgt2, lat2 = end_point
    # Drop the start and end points from the track if present.
    # Note: list.remove mutates the list stored in the DataFrame.
    try:
        track.remove(start_point)
        track.remove(end_point)
    except ValueError:
        pass
    if np.abs(lgt1 - lgt2) > np.abs(lat1 - lat2):
        # Mostly east-west movement: longitude is the primary sort key
        if lgt1 < lgt2:  # to east
            if lat1 < lat2:  # to north
                ordered_track = [start_point] + sorted(track, key=itemgetter(0,1)) + [end_point]
            else:  # to south
                s = sorted(track, key=itemgetter(1), reverse=True)
                ordered_track = [start_point] + sorted(s, key=itemgetter(0)) + [end_point]
        else:  # to west
            if lat1 < lat2:  # to north
                s = sorted(track, key=itemgetter(1))
                ordered_track = [start_point] + sorted(s, key=itemgetter(0), reverse=True) + [end_point]
            else:  # to south
                ordered_track = [start_point] + sorted(track, key=itemgetter(0,1), reverse=True) + [end_point]
    else:
        # Mostly north-south movement: latitude is the primary sort key
        if lgt1 < lgt2:  # to east
            if lat1 < lat2:  # to north
                ordered_track = [start_point] + sorted(track, key=itemgetter(1,0)) + [end_point]
            else:  # to south
                s = sorted(track, key=itemgetter(0))
                ordered_track = [start_point] + sorted(s, key=itemgetter(1), reverse=True) + [end_point]
        else:  # to west; also covers equal longitudes, which the original elif left unassigned
            if lat1 < lat2:  # to north
                s = sorted(track, key=itemgetter(0), reverse=True)
                ordered_track = [start_point] + sorted(s, key=itemgetter(1)) + [end_point]
            else:  # to south
                ordered_track = [start_point] + sorted(track, key=itemgetter(1,0), reverse=True) + [end_point]
    return ordered_track

mobike_df['start_point'] = [(st_lgt, st_lat) for st_lgt, st_lat in zip(mobike_df.start_location_x, mobike_df.start_location_y)]
mobike_df['end_point'] = [(end_lgt, end_lat) for end_lgt, end_lat in zip(mobike_df.end_location_x, mobike_df.end_location_y)]
mobike_df2 = mobike_df.copy()

# Reorder the riding tracks
mobike_df2['new_trace'] = [reorder_track(st_point, ed_point, track) for st_point, ed_point, track in zip(mobike_df2.start_point, mobike_df2.end_point, mobike_df2.track)]

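Data Structure and Summary

From the assessment and cleaning above:

Most variables in this dataset are numeric or can be treated as such: the start and end coordinates, the track points, the riding time, and the ride start and end times (stored as datetimes). The order, user, and bike IDs are strings;
Because riding time contains a small number of outliers, the cleaned dataset drops 131 of the 102,361 source records, about 0.13% of the source data, which has essentially no effect on the analysis.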
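The share of removed records can be checked directly from the two row counts that `mobike_df.info()` reports before and after filtering. This is only a small arithmetic sketch; the variable names are illustrative and not part of the original notebook.

```python
# Row counts reported by DataFrame.info() before and after the riding_time filter
rows_before = 102361
rows_after = 102230

removed = rows_before - rows_after
removed_share = removed / rows_before

print(f"removed {removed} rows ({removed_share:.2%} of the source data)")
```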

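Exploration Overview

This exploration focuses on three aspects of bike usage across the districts of Shanghai: the geographic distribution of usage, user preferences, and user value. The main measures are order count, riding time, and location. The approach is as follows:

The 5W and RFM methods are used for the exploratory analysis, and the results are presented visually:

5W (analyzes user behavior and geographic distribution)
- WHAT
  - What is the bike reuse rate?
- WHEN
  - How do order count and riding time develop over time?
  - How are order count and riding time distributed across the days of the week?
  - Are there clear peak and off-peak periods within a day, and how do order count and riding time change and differ?
- WHERE
  - Which locations are high-frequency areas for bike usage?
  - Which locations do riding tracks pass through most often?
  - Which routes are the most popular?
  - Which locations have a high bike reuse rate?
- WHO
  - What is the user reuse rate?
RFM (analyzes user value)
- R: date of the most recent ride
- F: overall riding frequency
- M: total riding time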

Univariate Analysis

# Bike reuse rate: share of bikes with at least two orders
bike_reuse_ratio = sum(mobike_df2.bikeid.value_counts() >= 2)/mobike_df2.bikeid.value_counts().size
bike_reuse_ratio

0.21356777558357384

# User reuse rate: share of users with at least two orders
user_reuse_ratio = sum(mobike_df2.userid.value_counts() >= 2)/mobike_df2.userid.value_counts().size
user_reuse_ratio

0.9292229329542763

The bike reuse rate is fairly low (about 21.4%), while the user reuse rate is very high at 92.9%.
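To make the definition concrete, here is a minimal sketch of the same reuse-rate calculation on a toy series. The helper name `reuse_ratio` and the toy IDs are hypothetical and only illustrate the formula: the number of distinct IDs with at least two orders divided by the number of distinct IDs.

```python
import pandas as pd

def reuse_ratio(ids: pd.Series) -> float:
    """Share of distinct ids that appear in at least two orders."""
    counts = ids.value_counts()
    return (counts >= 2).sum() / counts.size

# Hypothetical toy orders: bike "a" is used three times, "b" and "c" once each
toy_bikes = pd.Series(["a", "a", "a", "b", "c"])
toy_ratio = reuse_ratio(toy_bikes)
print(toy_ratio)  # one of three bikes is reused
```

Because the rate counts IDs rather than orders, a single heavily used bike raises it no more than a bike used exactly twice.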

# Explore the distribution of the filtered riding time, first on a linear scale
plt.figure(figsize=(8,6))
colorful = sns.color_palette('Paired')
plt.hist(data=mobike_df2, x='riding_time', color=colorful[2])
plt.xlabel('Riding_Time(m)')
plt.ylabel('Frequency');

# The data are heavily right-skewed, so log-scale the riding_time axis
plt.figure(figsize=(8,6))
bin_edge = 10**np.arange(0, np.log10(mobike_df2.riding_time.max())+0.01, 0.01)
plt.hist(data=mobike_df2, x='riding_time', bins=bin_edge, color=colorful[2])
plt.xscale('log')
plt.xticks(ticks=[1,3,5,10,30,50,70,100,150,200], labels=[1,3,5,10,30,50,70,100,150,200])
plt.xlabel('Riding_Time(m)')
plt.ylabel('Frequency');

Riding times are most concentrated in the 5-10 minute range.

# Explore the distribution over date and time. To view rides continuously across all of August,
# convert the datetimes to continuous floating-point values
from datetime import datetime
import time

datetime_value = [time.mktime(datetime(s_datetime.year, s_datetime.month, s_datetime.day, s_datetime.hour, s_datetime.minute, s_datetime.second).timetuple()) for s_datetime in mobike_df2.start_time]
bin_edge = np.arange(np.min(datetime_value), np.max(datetime_value)+10000, 10000)

# Distribution of total orders over time
# (sns.distplot is deprecated since seaborn 0.11; sns.histplot is its replacement)
fig, axes = plt.subplots(figsize=(15,6))
colorful = sns.color_palette('Paired')
sns.distplot(datetime_value, bins=bin_edge, hist_kws={'color':colorful[2], 'alpha':1}, kde=False, ax=axes)
axes.set_xscale('log')
axes.set_xlabel('Datetime_Value(D)')
axes.set_ylabel('Frequency');

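In the datetime distribution, the order count rises in waves. The plot shows 31 waves, one per day, and each day has a roughly bimodal shape, presumably the morning and evening rush hours; the troughs between days correspond to late night and early morning.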

# Explore the distribution of start-point longitude and latitude
fig, axes = plt.subplots(1, 2, figsize=(16,6))
bin_edge1 = np.arange(mobike_df2.start_location_x.min(), mobike_df2.start_location_x.max()+0.005, 0.005)
sns.distplot(mobike_df2.start_location_x, bins=bin_edge1, kde=False, hist_kws={'alpha':1, 'color':colorful[2]}, ax=axes[0])
bin_edge2 = np.arange(mobike_df2.start_location_y.min(), mobike_df2.start_location_y.max()+0.005, 0.005)
sns.distplot(mobike_df2.start_location_y, bins=bin_edge2, kde=False, hist_kws={'alpha':1, 'color':colorful[2]}, ax=axes[1]);

# Explore the distribution of end-point longitude and latitude
fig, axes = plt.subplots(1, 2, figsize=(16,6))
bin_edge3 = np.arange(mobike_df2.end_location_x.min(), mobike_df2.end_location_x.max()+0.005, 0.005)
sns.distplot(mobike_df2.end_location_x, bins=bin_edge3, kde=False, hist_kws={'alpha':1, 'color':colorful[2]}, ax=axes[0])
bin_edge4 = np.arange(mobike_df2.end_location_y.min(), mobike_df2.end_location_y.max()+0.005, 0.005)
# Use the end-latitude bins (bin_edge4), not the start-latitude bins
sns.distplot(mobike_df2.end_location_y, bins=bin_edge4, kde=False, hist_kws={'alpha':1, 'color':colorful[2]}, ax=axes[1]);

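The end-point distribution shows longitude concentrated between 121.4 and 121.5 and latitude between 31.2 and 31.3, both left-skewed, with a small number of routes heading southwest.

Overall, start and end points cluster in essentially the same areas.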

Bivariate Analysis

# To see how variables change over time, derive day, hour, and day_name from the start time
# so the data can be analyzed from several angles
mobike_df2['day'] = mobike_df2.start_time.dt.day
mobike_df2['hour'] = mobike_df2.start_time.dt.hour
mobike_df2['day_name'] = mobike_df2.start_time.dt.day_name()

# Trend of total orders over the days of the month
plt.figure(figsize=(15,6))
day_count = mobike_df2.start_time.dt.day.value_counts().sort_index()
plt.plot(day_count.index, day_count.values, marker='o', color=colorful[3])
plt.xticks(ticks=range(1,32,1), labels=range(1,32,1))
plt.xlabel('Day', fontsize=10)
plt.ylabel('Order_Count', fontsize=10);

Order volume grows as the month progresses, and the growth accelerates. This confirms the earlier observation that order counts trend upward over time.

cat_dtype = pd.api.types.CategoricalDtype(categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], ordered=True)
mobike_df2.day_name = mobike_df2.day_name.astype(cat_dtype)

# Check whether total and average order counts favor particular days of the week
fig, axes = plt.subplots(1, 2, figsize=(12,6))
base_color = sns.color_palette()[0]
sns.countplot(data=mobike_df2, x='day_name', color=base_color, ax=axes[0])
axes[0].set_ylabel('Total_order_count')
group_day = mobike_df2.groupby(['day','day_name']).size().reset_index(name='order_count', level=1)
sns.barplot(data=group_day, x='day_name', y='order_count', color=base_color, ax=axes[1])
axes[1].set_ylabel('Avg_order_count')
fig.autofmt_xdate();

group_day.groupby('day_name').size().sort_values(ascending=False)

day_name
Wednesday    5
Tuesday      5
Monday       5
Sunday       4
Saturday     4
Friday       4
Thursday     4
dtype: int64

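By day of the week, Monday, Tuesday, and Wednesday have higher total order counts; in average order count the days of the week differ little and the distribution is fairly even.
The reason for the gap in totals: in August 2016, Monday, Tuesday, and Wednesday each occur five times, one more than the other days. Once this is accounted for, the differences across days of the week are small.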
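The weekday imbalance can be confirmed from the calendar alone, without the order data. The sketch below counts how often each weekday occurs in August 2016, assuming the English default for `day_name()`; the variable names are illustrative.

```python
import pandas as pd

# Every calendar day of August 2016
august_days = pd.Series(pd.date_range("2016-08-01", "2016-08-31", freq="D"))

# Number of occurrences of each weekday in the month
weekday_counts = august_days.dt.day_name().value_counts()
print(weekday_counts)
```

Dividing total orders per weekday by these counts gives the per-day average that the right-hand bar chart shows.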

# Check whether total orders differ strongly across the hours of a day, i.e. whether there are peaks and troughs
plt.figure(figsize=(15,6))
hour_count = mobike_df2.start_time.dt.hour.value_counts().sort_index()
plt.plot(hour_count.index, hour_count.values, marker='o')
plt.xticks(ticks=range(0,24,1), labels=range(0,24,1))
plt.xlabel('Hour', fontsize=10)
plt.ylabel('Total_order_Count', fontsize=10);

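Order volume varies strongly over the day. The plot shows clear morning and evening peaks: volume is high from 7:00 to 9:00, with the morning peak around 8:00, and again from 17:00 to 20:00, with the peak around 18:00. The evening peak has more orders than the morning peak, and volume bottoms out at night and in the early morning (0:00-5:00).
Operators should therefore deploy or repair bikes during low-usage periods, such as 23:00-5:00 or 9:00-15:00, so that enough bikes in good condition are available during peak hours.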

# Distribution of total riding time by day of the month
plt.figure(figsize=(15,6))
riding_time_total = mobike_df2.groupby('day')['riding_time'].sum()
plt.plot(riding_time_total.index, riding_time_total, marker='o');
plt.xticks(ticks=range(1,32,1), labels=range(1,32,1))
plt.xlabel('Day', fontsize=10)
plt.ylabel('total riding time(min)'.title(), fontsize=10);

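Total riding time by day follows essentially the same pattern as total orders by day: as order volume grows over the month, total riding time grows with it, and the growth accelerates.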

# Total riding time depends on order volume, so look at the average riding time for each day of August
plt.figure(figsize=(15,6))
riding_time_avg = mobike_df2.groupby('day')['riding_time'].mean()
riding_time_sem = mobike_df2.groupby('day')['riding_time'].sem()
plt.errorbar(x=riding_time_avg.index, y=riding_time_avg, yerr=riding_time_sem);
plt.xticks(ticks=range(1,32,1), labels=range(1,32,1))
plt.ylim(bottom=0)
plt.xlabel('Day', fontsize=10)
plt.ylabel('avg of riding time(min)'.title(), fontsize=10);

In average riding time by day, Saturdays and Sundays (non-working days) are slightly longer than the other days.

# Total riding time for each hour of the day
plt.figure(figsize=(15,6))
hour_riding_time = mobike_df2.groupby('hour')['riding_time'].agg('sum')
plt.plot(hour_riding_time.index, hour_riding_time.values, marker='o')
plt.xticks(ticks=range(0,24,1), labels=range(0,24,1))
plt.xlabel('Hour', fontsize=10)
plt.ylabel('total riding time'.title(), fontsize=10);

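Total riding time by hour has essentially the same shape as order volume by hour, with pronounced morning and evening peaks;
however, the gap between the morning and evening peaks is larger for riding time than for order volume.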

# Total riding time again depends on order volume, so look at the average riding time for each hour
plt.figure(figsize=(15,6))
riding_time_avg2 = mobike_df2.groupby('hour')['riding_time'].mean()
riding_time_sem2 = mobike_df2.groupby('hour')['riding_time'].sem()
plt.errorbar(x=riding_time_avg2.index, y=riding_time_avg2, yerr=riding_time_sem2);
plt.xticks(ticks=range(0,24,1), labels=range(0,24,1))
plt.ylim(bottom=0)
plt.xlabel('Hour', fontsize=10)
plt.ylabel('avg of riding time(min)'.title(), fontsize=10);

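From the plot above:

It explains why total riding time differs so much between the morning and evening peaks even though order volume differs little: the average riding time in the evening is longer than in the morning;
Between 0:00 and 5:00 the average riding time is longer than during most of the daytime, but the variation between individual rides in that period is large, so a few large outliers may be involved;
This suggests that users have more time, or are willing to spend more time, riding in the evening, for example after work, whereas mornings are tighter, so even the same distance takes less time than in the evening.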

# Average riding time for each hour of each day of the week
day_hour_time = mobike_df2.groupby(['day_name','hour'])['riding_time'].mean().reset_index(name='riding_time_avg')
day_hour_time_pivot = pd.pivot_table(data=day_hour_time, index='day_name', columns='hour', values='riding_time_avg')
plt.figure(figsize=(16,8))
sns.heatmap(data=day_hour_time_pivot, cmap='BuGn', cbar_kws={'label':'riding_time'});

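From the plot above:

Users have more time to ride on Saturdays and Sundays: compared with the other days, the average riding time on weekends is higher in every hour;
Across the hours of the day, the pattern matches the hourly riding-time distribution above: evening and night rides are longer than daytime rides, and this is more pronounced on Saturdays and Sundays.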

Location-Based Exploration

# Combine longitude and latitude to explore the full coordinate points
plt.figure(figsize=(8,5))
bins_x = np.arange(121.17, 121.97+0.1, 0.01)
bins_y = np.arange(30.84, 31.45+0.1, 0.01)
plt.hist2d(data=mobike_df2, x='start_location_x', y='start_location_y', bins=[bins_x, bins_y], cmap='viridis_r', cmin=0.5);
plt.xlim((121.2,121.7))
plt.ylim((31.0,31.4))
plt.xlabel('start_point_lgt')  # the x axis holds longitude
plt.ylabel('start_point_lat')  # the y axis holds latitude
plt.colorbar();

plt.figure(figsize=(8,5))
bins_x = np.arange(121.17, 121.97+0.1, 0.01)
bins_y = np.arange(30.84, 31.45+0.1, 0.01)
plt.hist2d(data=mobike_df2, x='end_location_x', y='end_location_y', bins=[bins_x, bins_y], cmap='viridis_r', cmin=0.5);
plt.xlim((121.2,121.7))
plt.ylim((31.0,31.4))
plt.xlabel('end_point_lgt')  # the x axis holds longitude
plt.ylabel('end_point_lat')  # the y axis holds latitude
plt.colorbar();

Start and end points cluster in essentially the same area, mainly longitude 121.4-121.55 and latitude 31.2-31.35.

# Plot the location data on a map
st_locs = mobike_df2[['start_location_y','start_location_x']].values.tolist()

# Show all start points on the map as a heat map
m = folium.Map(location=[31.22,121.48], control_scale=True, zoom_start=10)
m.add_child(plugins.HeatMap(st_locs, radius=7, gradient={.4: 'blue', .65: 'lime', 1: 'yellow'}))
m.save('st_loc_heatmap.html')
HTML('<iframe src=st_loc_heatmap.html width=700 height=450></iframe>')

A heat map with more detailed geographic information was produced with the Baidu Maps API; see Mobike Data/Mobike_location_heatmap.html.

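From the plot above and the HTML heat map:

By coverage, ride start and end points, and therefore bike usage, are concentrated in Hongkou, Huangpu, and Jing'an districts, while usage is lower in Putuo, Changning, Xuhui, and Pudong New Area, which are farther from the center of Shanghai.
By urban feature, high-usage locations are mainly university areas, stadiums, large shopping districts, dense residential areas, and major traffic junctions (interchanges and transport hubs):
- For example: the university area around Tongji and Fudan, Jiangwan Stadium, the Gonghe interchange on the Inner Ring, and the Hongqiao hub on the Middle Ring
- The main streets in Huangpu District named after Chinese cities, and People's Square
Along the metro, Lines 1, 3, 11, and 13 are high-usage corridors.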

# Explore the riding tracks
data = pd.DataFrame(mobike_df2.new_trace.tolist()).stack().reset_index().rename(columns={'level_0':'order', 'level_1':'trace_order', 0:'location'})
data['lng'] = data.location.apply(lambda x: x[0])
data['lat'] = data.location.apply(lambda x: x[1])

# Export the trace data to draw a path map in Tableau
data.to_csv('./Mobike_trace.csv', index=False)

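Below is the riding track map drawn in Tableau:

Tableau Public-Mobike Trace Chart

The track map shows that:

Rides are mainly short-distance; long rides with many track points are relatively rare
The main activity area matches the heat map above: the central districts, such as Hongkou, Huangpu, and Yangpu
There are many riding tracks along the Huangpu riverside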
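The original notebook judges ride length visually from the track map. As a rough numeric complement, the straight-line distance between a ride's start and end point can be estimated with the haversine formula. This is only a sketch under stated assumptions: `haversine_km` is a hypothetical helper, the Earth radius of 6371 km is a common mean value, and the straight-line distance underestimates the path actually ridden.

```python
import math

def haversine_km(lng1, lat1, lng2, lat2, radius_km=6371.0):
    """Great-circle distance in km between two (longitude, latitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Start and end point of the first order in the sample rows (orderid 78387)
d = haversine_km(121.348, 31.389, 121.357, 31.388)
print(f"{d:.2f} km")  # well under one kilometre for this 7-minute ride
```

Applied to every row's start and end coordinates, the same function would give a distance column that could be summarized like riding_time.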

User-Based Exploration

The RFM model is used to explore and measure each user's customer value and revenue potential.

mobike_df2.head(2)

	orderid	bikeid	userid	start_time	start_location_x	start_location_y	end_time	end_location_x	end_location_y	track	riding_time	start_point	end_point	new_trace	day	hour	day_name
0	78387	158357	10080	2016-08-20 06:57:00	121.348	31.389	2016-08-20 07:04:00	121.357	31.388	[(121.347, 31.392), (121.349, 31.39), (121.35,...	7.0	(121.348, 31.389)	(121.357, 31.388)	[(121.348, 31.389), (121.347, 31.392), (121.34...	20	6	Saturday
1	891333	92776	6605	2016-08-29 19:09:00	121.508	31.279	2016-08-29 19:31:00	121.489	31.271	[(121.489, 31.27), (121.49, 31.27), (121.49, 3...	22.0	(121.508, 31.279)	(121.489, 31.271)	[(121.508, 31.279), (121.507, 31.279), (121.50...	29	19	Monday

# Assume the current date is 2016-09-01
from datetime import datetime

mobike_df2['user_recently'] = (datetime(2016,9,1) - mobike_df2.start_time).dt.days
R = mobike_df2.groupby('userid')['user_recently'].min()
M = mobike_df2.groupby('userid')['riding_time'].sum()
F = mobike_df2.userid.value_counts()

fig, axes = plt.subplots(1, 3, figsize=(18,6))
sns.distplot(R, kde=False, hist_kws={'alpha': 1}, ax=axes[0], color=colorful[2])
axes[0].set_xlabel('recent_day_diff')
sns.distplot(M, kde=False, hist_kws={'alpha': 1}, ax=axes[1], color=colorful[2])
axes[1].set_xlabel('riding_time_total')
sns.distplot(F.values, kde=False, hist_kws={'alpha': 1}, ax=axes[2], color=colorful[2])
axes[2].set_xlabel('order_count');

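From the plots above:

The dataset covers only the orders of August. The recent_day_diff plot shows that a large share of users rode very recently, and the earlier finding that order volume grows over the month suggests that, as Mobike continues to expand, the number of new users is growing at an increasing rate.
A single user's cumulative riding time in August is concentrated between 50 and 100 minutes, and the cumulative ride count is mostly 2-8 rides, with a few users above 20.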

grouped_user = mobike_df2.groupby('userid')

# Apply a different aggregation to each column
user_rfm = grouped_user.agg({'user_recently':'min', 'riding_time':'sum', 'orderid':'count'})
user_rfm.rename(columns={'user_recently':'days_diff', 'riding_time':'riding_time_total', 'orderid':'order_count'}, inplace=True)

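Based on the distributions of these three features, each feature is binned into value ranges with a corresponding grade (maximum 5 points each), as follows: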

RFM	Data Bin	Grade

Recency (days)	0-5	5
Recency (days)	6-11	4
Recency (days)	12-17	3
Recency (days)	18-23	2
Recency (days)	24-30	1
Frequency (orders)	>=15	5
Frequency (orders)	12-14	4
Frequency (orders)	10-11	3
Frequency (orders)	6-9	2
Frequency (orders)	0-5	1
Monetary (total riding time, min)	>=300	5
Monetary (total riding time, min)	200-299	4
Monetary (total riding time, min)	100-199	3
Monetary (total riding time, min)	50-99	2
Monetary (total riding time, min)	0-49	1

# right=False makes every bin left-closed, e.g. [0, 6) covers 0-5 days.
# The last edges (31, 26, 512) are hard-coded upper bounds for this dataset.
user_rfm['r_grade'] = pd.cut(user_rfm.days_diff, bins=[0,6,12,18,24,31], labels=[5,4,3,2,1], right=False, include_lowest=True)
user_rfm['f_grade'] = pd.cut(user_rfm.order_count, bins=[0,6,10,12,15,26], labels=[1,2,3,4,5], right=False, include_lowest=True)
user_rfm['m_grade'] = pd.cut(user_rfm.riding_time_total, bins=[0,50,100,200,300,512], labels=[1,2,3,4,5], right=False, include_lowest=True)
user_rfm.r_grade = user_rfm.r_grade.astype(pd.api.types.CategoricalDtype(categories=[1,2,3,4,5], ordered=True))
user_rfm['r_value'] = user_rfm.r_grade.astype('int')
user_rfm['f_value'] = user_rfm.f_grade.astype('int')
user_rfm['m_value'] = user_rfm.m_grade.astype('int')
user_rfm.reset_index(inplace=True)

# Mean of each of the three grades
r_mean = user_rfm.r_value.mean()
f_mean = user_rfm.f_value.mean()
m_mean = user_rfm.m_value.mean()
r_mean, f_mean, m_mean

(4.596600331674958, 1.7670575692963753, 2.269426676143094)
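The bin boundaries are easy to get wrong, so the following sketch applies the frequency bins to a few hypothetical order counts that sit exactly on the edges. It also shows one consequence of the hard-coded upper edge: a value at or above 26 falls outside every bin and becomes missing. Using `float("inf")` as the last edge is a suggested alternative, not what the original notebook does.

```python
import pandas as pd

bins = [0, 6, 10, 12, 15, 26]
labels = [1, 2, 3, 4, 5]

# Hypothetical order counts: boundary values plus one above the last edge
order_count = pd.Series([5, 6, 9, 10, 15, 25, 26])
f_grade = pd.cut(order_count, bins=bins, labels=labels, right=False)

in_range = f_grade.iloc[:-1].astype(int).tolist()
out_of_range_is_missing = pd.isna(f_grade.iloc[-1])
print(in_range, out_of_range_is_missing)

# An open-ended top bin keeps very active users in grade 5
open_bins = [0, 6, 10, 12, 15, float("inf")]
f_grade_open = pd.cut(order_count, bins=open_bins, labels=labels, right=False)
top_grade = int(f_grade_open.iloc[-1])
```

With the fixed edges, the later `astype('int')` conversion would fail on a missing grade, so the edges need to cover the full range of the data.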

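Each user's three grades are compared with the three means, and customer value is classified by whether each grade is above or below its mean: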

def get_custermer_value(r, f, m):
    # r, f, m are integer grades and the means are non-integers,
    # so exactly one branch always matches
    if r > r_mean and f > f_mean and m > m_mean:
        user_tag = 'High-value customer'
    elif r < r_mean and f > f_mean and m > m_mean:
        user_tag = 'Key win-back customer'
    elif r > r_mean and f < f_mean and m > m_mean:
        user_tag = 'Key growth customer'
    elif r < r_mean and f < f_mean and m > m_mean:
        user_tag = 'Key retention customer'
    elif r > r_mean and f > f_mean and m < m_mean:
        user_tag = 'Potential customer'
    elif r > r_mean and f < f_mean and m < m_mean:
        user_tag = 'New customer'
    elif r < r_mean and f > f_mean and m < m_mean:
        user_tag = 'General maintenance customer'
    elif r < r_mean and f < f_mean and m < m_mean:
        user_tag = 'Churned customer'
    return user_tag

user_rfm['user_value'] = user_rfm[['r_value','f_value','m_value']].apply(lambda x: get_custermer_value(x.r_value, x.f_value, x.m_value), axis=1)

# Font settings, only needed when labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

# Count the users in each customer segment
value_counts = user_rfm.user_value.value_counts()
fig, ax = plt.subplots(figsize=(8,8))
sns.barplot(x=value_counts.index, y=value_counts.values, color=colorful[2])
plt.xticks(rotation=30);

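The largest segments are high-value customers and new customers: high-value customers have the largest share, with new customers slightly lower in second place;
Churned customers also make up a noticeable share, so the company should analyze churn further, looking at causes such as user experience, product quality, number of competitors, or public relations;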
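The bar chart shows counts, while the discussion is in terms of shares. A minimal sketch of turning segment counts into percentages with `value_counts(normalize=True)` follows; the toy labels are hypothetical stand-ins for the `user_value` column and do not reproduce the real proportions.

```python
import pandas as pd

# Hypothetical segment labels standing in for user_rfm.user_value
toy_user_value = pd.Series(
    ["High-value customer"] * 4
    + ["New customer"] * 3
    + ["Churned customer"] * 2
    + ["Potential customer"] * 1
)

# Share of each segment in percent, largest first
segment_share = toy_user_value.value_counts(normalize=True).mul(100).round(1)
print(segment_share)
```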

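Conclusions

Combining the 5W method with the date and time dimensions gives the following conclusions:

The bike reuse rate is 21.36%, while the user reuse rate is 92.92%. Transport is a rigid need among the four basics of daily life, and the high user reuse rate reflects not only strong demand but also users' initial acceptance of the product, or a small number of competitors. The low bike reuse rate may be because far more bikes have been deployed than demand requires, producing oversupply, or because many bikes are damaged; further analysis with the relevant data is needed.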

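Analysis by date and time:

Over the month, total orders and total riding time both grow rapidly;
Within a day, total orders show clear morning and evening peaks: volume is high from 7:00 to 9:00, with the morning peak around 8:00, and from 17:00 to 20:00, with the daily peak at 18:00. Operators should therefore deploy or repair bikes in off-peak periods, such as 23:00-5:00 or 10:00-15:00, so that enough bikes in good condition are available during peak hours;
Total riding time is driven by order volume and follows essentially the same distribution over time. In average riding time, however, evening rides (after 18:00) are longer than daytime rides, which suggests that users have more time, or are willing to spend more time, riding in the evening (for example after work), while mornings are tighter;
Comparing working and non-working days, the average riding time on Saturdays and Sundays is longer than on working days;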

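Analysis by location:

By administrative district, bike usage is concentrated in Hongkou, Huangpu, and Jing'an, while usage in Putuo, Changning, Xuhui, and Pudong New Area is lower;
By urban feature, bike usage is concentrated in university areas, stadiums, large shopping districts, dense residential areas, and major traffic junctions (interchanges and transport hubs);
- The university area around Tongji and Fudan, Jiangwan Stadium, the Gonghe interchange on the Inner Ring, and the Hongqiao hub on the Middle Ring
- The main streets in Huangpu District named after Chinese cities, and People's Square
Along the metro: Lines 1, 3, 11, and 13 are high-usage corridors;
From the riding tracks:
- Rides are mainly short-distance; long rides with many track points are relatively rare;
- The main activity area matches the heat map above: the central districts, such as Hongkou, Huangpu, and Yangpu;
- There are many riding tracks along the Huangpu riverside;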

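The RFM model was used to explore user value, with the following conclusions:

Analysis by user:

As Mobike continues to expand in the market, the number of new users is growing rapidly;
Most rides last 5-10 minutes, and a single user's ride count in August is concentrated at 3-7 rides, with a few users above 20;
High-value customers, new customers, churned customers, and potential customers have the largest shares. High-value customers rank first at 31.7%, and new customers are second at 28.1% of users. This again supports the view that Mobike is in a growth or rapid-expansion phase: to capture market traffic and raise brand awareness quickly, the company is deploying large numbers of bikes, and new users are growing fast;
Churned customers also have a large share, ranking third, so the company should analyze churn further, looking at causes such as user experience, product quality, number of competitors, or public relations.
Since high-value customers have the largest share, the company can build on this advantage with operating measures tailored to each customer segment, such as discounted monthly passes, occasional free-ride coupons, or partnerships that let users redeem riding mileage for gifts;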
