
[Data Analysis] Data Analysis Master Competition 3: Clustering Analysis of Automotive Products

Published: 2023/12/31, 豆豆


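Competition Overview

Background

Competition Data

1. Viewing the Data

Categorical variables

Numeric variables

2. Data Processing

Handling categorical features

LabelEncoder

one-hot

Feature normalization

PCA dimensionality reduction

3. K-means Clustering

Choosing k with the elbow method

Visualizing the clustering result

Choosing k with the silhouette coefficient

4. Analyzing the Clustering Result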

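Competition Overview

This teaching competition is the third event in a data analysis series organized by data scientist Dr. Chen: clustering analysis of automotive products.

The task is set in the context of competitor analysis. Clustering the data assigns each car to a group, so that for any given model its competitors can be found in the same cluster. The task encourages learners to use the model data to build car profiles and to support product positioning and competitor analysis with data.

Background

Competition Data

Data source: car_price.csv, which contains 26 fields for 205 car models.

1. Viewing the Data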

import time

import numpy as np  # used later for cumulative sums
import pandas as pd
import matplotlib.pyplot as plt

car_price = pd.read_csv("./car_price.csv")
car_price.head()
car_price.info()
# car_price.duplicated().sum()

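The features fall into 3 groups (the number before each name is its column position):

Group 1: car identifiers

1 car_ID: car ID
3 CarName: car name

Group 2: categorical variables (10)

2 symboling: insurance risk rating
4 fueltype: fuel type
5 aspiration: engine aspiration type
6 doornumber: number of doors
7 carbody: body style
8 drivewheel: driven wheels
9 enginelocation: engine location
15 enginetype: engine type
16 cylindernumber: number of cylinders
18 fuelsystem: fuel system

Group 3: continuous numeric variables (14)

10 wheelbase: wheelbase
11 carlength: car length
12 carwidth: car width
13 carheight: car height
14 curbweight: curb weight (weight of the car without occupants or cargo)
17 enginesize: engine size
19 boreratio: ratio of cylinder cross-sectional area to stroke
20 stroke: engine stroke
21 compressionratio: compression ratio
22 horsepower: horsepower
23 peakrpm: engine speed at peak power
24 citympg: city mileage (miles per gallon)
25 highwaympg: highway mileage (miles per gallon)
26 price (dependent variable): price

Categorical variables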

# Column names of the categorical variables
cate_columns = ['symboling', 'fueltype', 'aspiration', 'doornumber', 'carbody',
                'drivewheel', 'enginelocation', 'enginetype', 'fuelsystem', 'cylindernumber']

# Print the distinct values of each categorical variable
for i in cate_columns:
    print(i)
    print(set(car_price[i]))

Output:

symboling
{0, 1, 2, 3, -2, -1}
fueltype
{'gas', 'diesel'}
aspiration
{'std', 'turbo'}
doornumber
{'two', 'four'}
carbody
{'convertible', 'hatchback', 'wagon', 'sedan', 'hardtop'}
drivewheel
{'4wd', 'fwd', 'rwd'}
enginelocation
{'rear', 'front'}
enginetype
{'ohcv', 'ohcf', 'dohc', 'ohc', 'l', 'rotor', 'dohcv'}
fuelsystem
{'idi', 'mfi', '4bbl', '2bbl', 'mpfi', 'spfi', '1bbl', 'spdi'}
cylindernumber
{'eight', 'six', 'five', 'two', 'four', 'three', 'twelve'}

Numeric variables

# Keep the feature columns (everything except 'car_ID' and 'CarName')
car_df = car_price.drop(['car_ID', 'CarName'], axis=1)

# Descriptive statistics of the numeric columns, as a first check for outliers
car_df.describe()

# Column names of the continuous numeric variables
num_cols = car_df.columns.drop(cate_columns)
print(num_cols)

# Box plot of each continuous numeric variable, to check for outliers
import seaborn as sns

fig = plt.figure(figsize=(12, 8))
i = 1
for col in num_cols:
    ax = fig.add_subplot(3, 5, i)
    sns.boxplot(data=car_df[col], ax=ax)
    i = i + 1
    plt.title(col)
plt.subplots_adjust(wspace=0.4, hspace=0.3)
plt.show()

# Correlation coefficients between the numeric features
# numeric_only=True skips the string columns (required on pandas 2.x, available since 1.5)
df_corr = car_df.corr(numeric_only=True)
df_corr['price'].sort_values(ascending=False)

Output:

price               1.000000
enginesize          0.874145
curbweight          0.835305
horsepower          0.808139
carwidth            0.759325
carlength           0.682920
wheelbase           0.577816
boreratio           0.553173
carheight           0.119336
stroke              0.079443
compressionratio    0.067984
symboling          -0.079978
peakrpm            -0.085267
citympg            -0.685751
highwaympg         -0.697599
Name: price, dtype: float64

# Heatmap of the correlation matrix
f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(df_corr, square=True, vmax=0.8)
plt.show()

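2. Data Processing

cylindernumber

# Convert the cylinder count from words to integers.
# Note: the data also contains 'two', which this mapping does not cover, so those rows
# keep the string value and the column stays dtype object; add 'two': 2 to make it fully numeric.
# Note: car_df was created earlier as a separate copy, so this change only affects car_price.
car_price['cylindernumber'] = car_price.cylindernumber.replace(
    {'three': 3, 'four': 4, 'five': 5, 'six': 6, 'eight': 8, 'twelve': 12})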

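CarName

# Inspect the distinct car names
print(car_price['CarName'].drop_duplicates())

# Extract the brand: the first word of the car name
carBrand = car_price['CarName'].str.split(expand=True)[0]
print(set(carBrand))

# Assumption: the original does not show this step. The brand is stored on car_price here
# because later steps group by a 'carBrand' column. The later output lists brand 'volkswagen'
# for 'vokswagen rabbit', so misspelled brands were presumably also normalized in a step not shown.
car_price['carBrand'] = carBrand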

Building the new feature carSize from carlength

# The descriptive statistics above show car lengths between 141.1 and 208.1 inches;
# split that range into 6 size classes
bins = [min(car_df.carlength) - 0.01, 145.67, 169.29, 181.10, 192.91, 200.79,
        max(car_df.carlength) + 0.01]
label = ['A00', 'A0', 'A', 'B', 'C', 'D']
carSize = pd.cut(car_df.carlength, bins, labels=label)
print(carSize)

# Add the size class to both data sets
car_price['carSize'] = carSize
car_df['carSize'] = carSize

# Drop carlength, which carSize now replaces
features = car_df.drop(['carlength'], axis=1)
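
To make the binning rule concrete, here is a minimal standard-library sketch of what the `pd.cut` call does with its default right-closed intervals. The function name `size_class` is illustrative, and the outer edges assume the stated length range of 141.1 to 208.1 inches padded by 0.01.

```python
from bisect import bisect_left

# Bin edges and labels copied from the pd.cut call; the first and last edge
# assume the stated range 141.1 to 208.1 inches, padded by 0.01.
EDGES = [141.09, 145.67, 169.29, 181.10, 192.91, 200.79, 208.11]
LABELS = ["A00", "A0", "A", "B", "C", "D"]


def size_class(car_length):
    """Return the size class of a car length, using right-closed bins (lo, hi]."""
    if not EDGES[0] < car_length <= EDGES[-1]:
        raise ValueError(f"length {car_length} is outside the binned range")
    # bisect_left finds the first edge >= car_length, so a value equal to an
    # edge lands in the bin that ends at that edge.
    return LABELS[bisect_left(EDGES, car_length) - 1]


print(size_class(171.7))  # length of 'vokswagen rabbit' in the data
print(size_class(166.3))
print(size_class(202.6))
```

A length of 171.7 falls between 169.29 and 181.10, so the car lands in class A.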

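Handling categorical features

Categorical values that carry an order are mapped to integers. Values with no order (different values simply denote different categories) are one-hot encoded.

LabelEncoder

# Map the ordered categorical variable carSize to integers
features1 = features.copy()

# Note: LabelEncoder assigns codes in sorted (alphabetical) label order, i.e.
# A, A0, A00, B, C, D, which is not the size order A00, A0, A, B, C, D.
from sklearn.preprocessing import LabelEncoder
carSize1 = LabelEncoder().fit_transform(features1['carSize'])
features1['carSize'] = carSize1
carSize1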
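
Because the goal stated above is an encoding that preserves order, it is worth checking what an alphabetical encoder does with these labels. The sketch below uses only the standard library and contrasts sorted-label codes (the way `LabelEncoder` assigns them) with an explicit size-ordered mapping; `SIZE_ORDER` and both dictionary names are illustrative.

```python
sizes = ["A", "A00", "C", "A0", "B", "D"]

# Codes assigned in sorted label order, as an alphabetical label encoder does.
alphabetical = {label: code for code, label in enumerate(sorted(set(sizes)))}

# An explicit mapping that follows the intended size order instead.
SIZE_ORDER = ["A00", "A0", "A", "B", "C", "D"]
ordinal = {label: code for code, label in enumerate(SIZE_ORDER)}

print(alphabetical)
print(ordinal)
```

With the alphabetical codes, the smallest class A00 is encoded above A and A0. Since `pd.cut` with labels returns an ordered categorical, `features1['carSize'].cat.codes` would be one way to obtain the size-ordered codes directly in pandas; note that switching the encoding would change the downstream PCA and clustering results reported in this article.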

one-hot

# Categorical features whose values have no order can be one-hot encoded
cate = features1.select_dtypes(include='object').columns
print(cate)

features1 = features1.join(pd.get_dummies(features1[cate])).drop(cate, axis=1)
features1.head()
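
As a small illustration of what one-hot encoding produces, the following standard-library sketch builds one 0/1 indicator column per distinct value. The helper `one_hot` is hypothetical and only mirrors the idea behind `pd.get_dummies`, not its full behavior.

```python
def one_hot(values):
    """Return (column_names, rows) with one 0/1 indicator column per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


columns, rows = one_hot(["gas", "diesel", "gas"])
print(columns)
print(rows)
```

Each row has exactly one 1, so no artificial ordering between 'gas' and 'diesel' is introduced.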

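Feature normalization

Each raw feature has to be normalized separately. Suppose feature A ranges over [-1000, 1000] and feature B over [-1, 1].
In logistic regression with w1*x1 + w2*x2, x1 is so large that x2 has almost no influence. Distance-based methods such as K-means are affected in the same way, because the feature with the largest scale dominates the distance.
Every feature is therefore normalized on its own.

Normalizing continuous features:

1. Standardization (mean 0, variance 1)

2. Min-max normalization (to [0, 1])

3. x' = (2x - max - min) / (max - min), a linear rescaling to [-1, 1]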
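
The three scalings listed above can be sketched in a few lines of standard-library Python. The function names are illustrative, and the sketch assumes each feature has at least two distinct values (otherwise the denominators are zero).

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to mean 0 and variance 1 (population standard deviation)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def min_max(values):
    """Rescale linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def scale_symmetric(values):
    """Rescale linearly to [-1, 1] using x' = (2x - max - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(2 * v - hi - lo) / (hi - lo) for v in values]


feature_a = [-1000.0, 0.0, 500.0, 1000.0]
print(min_max(feature_a))
print(scale_symmetric(feature_a))
print(standardize(feature_a))
```

After any of the three, feature A no longer dwarfs a feature that already lies in [-1, 1].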

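Discrete (categorical) features:

After one-hot encoding, each resulting dimension can be treated as a continuous feature and normalized in the same way as the continuous features, for example to [-1, 1] or to mean 0 and variance 1.

Since the categorical features have already been label encoded or one-hot encoded, they can be treated as continuous, so all features are normalized together.

# Normalize all features to [0, 1]
from sklearn import preprocessing

features1 = preprocessing.MinMaxScaler().fit_transform(features1)
features1 = pd.DataFrame(features1)
features1.head()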

PCA dimensionality reduction

# PCA on the data set, keeping 99.99% of the variance
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=0.9999)  # to keep 90% of the variance instead, pass 0.9
features2 = pca.fit_transform(features1)

# Share of variance explained by each principal component
ratio = pca.explained_variance_ratio_
print('Explained variance ratio per component:', ratio)

# Number of components after the reduction
print('Number of components:', len(ratio))

# Cumulative explained variance ratio (np.cumsum returns the running sum of an array)
cum_ratio = np.cumsum(ratio)
print('Cumulative explained variance ratio:', cum_ratio)

Output:

Explained variance ratio per component: [2.34835648e-01 1.89291914e-01 1.11193502e-01 6.41024136e-02
 5.90453139e-02 4.54763783e-02 4.21689429e-02 3.65477617e-02
 2.97528000e-02 2.24095237e-02 1.98458305e-02 1.95803021e-02
 1.70780800e-02 1.47611074e-02 1.32208566e-02 1.19093756e-02
 9.01434709e-03 8.74908243e-03 7.28321292e-03 6.65001057e-03
 5.68867886e-03 4.89870846e-03 4.50894857e-03 3.81422315e-03
 3.45197486e-03 2.23759951e-03 2.14676779e-03 1.84529725e-03
 1.56025958e-03 1.22067828e-03 1.12126257e-03 1.03278716e-03
 8.30359553e-04 6.87972243e-04 5.63679041e-04 4.64609849e-04
 3.33065301e-04 2.76366954e-04 1.67241531e-04 1.07861538e-04
 7.49681455e-05]
Number of components: 41
Cumulative explained variance ratio: [0.23483565 0.42412756 0.53532106 0.59942348 0.65846879 0.70394517
 0.74611411 0.78266187 0.81241467 0.8348242  0.85467003 0.87425033
 0.89132841 0.90608952 0.91931037 0.93121975 0.9402341  0.94898318
 0.95626639 0.9629164  0.96860508 0.97350379 0.97801274 0.98182696
 0.98527894 0.98751654 0.9896633  0.9915086  0.99306886 0.99428954
 0.9954108  0.99644359 0.99727395 0.99796192 0.9985256  0.99899021
 0.99932327 0.99959964 0.99976688 0.99987474 0.99994971]

# Bar chart of the per-component variance ratio, with the cumulative ratio as a line
plt.figure(figsize=(8, 6))
X = range(1, len(ratio) + 1)
Y = ratio
plt.bar(X, Y, edgecolor='black')
plt.plot(X, Y, 'r.-')
plt.plot(X, cum_ratio, 'b.-')
plt.ylabel('explained_variance_ratio')
plt.xlabel('PCA')
plt.show()

# Keep 8 principal components
pca = PCA(n_components=8)
features3 = pca.fit_transform(features1)

# Total variance explained by the kept components
print(sum(pca.explained_variance_ratio_))  # 0.7826618733273734
features3
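
The link between a variance target and the number of components can be made explicit. This standard-library sketch returns the smallest number of leading components whose cumulative ratio reaches a target; `components_for` is a hypothetical helper, and the input is the first eight ratios from the output above rounded to four decimals.

```python
from bisect import bisect_left
from itertools import accumulate


def components_for(ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = list(accumulate(ratios))
    if target > cumulative[-1]:
        raise ValueError("target exceeds the total explained variance")
    # cumulative is increasing, so a binary search finds the first index that reaches target
    return bisect_left(cumulative, target) + 1


# First eight explained variance ratios from the output above, rounded to four decimals.
ratios = [0.2348, 0.1893, 0.1112, 0.0641, 0.0590, 0.0455, 0.0422, 0.0365]
print(components_for(ratios, 0.50))
print(components_for(ratios, 0.78))
```

Read this way, keeping 8 components corresponds to retaining roughly 78% of the variance, which matches the printed sum of about 0.7827.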

3. K-means Clustering

Choosing k with the elbow method

# Elbow method: within-cluster sum of squared errors (SSE)
# Fit K-means for each k, record the SSE, then plot SSE against k
from sklearn.cluster import KMeans

sse = []
for i in range(1, 15):
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=0)
    km.fit(features3)
    sse.append(km.inertia_)

plt.plot(range(1, 15), sse, marker='*')
plt.xlabel('n_clusters')
plt.ylabel('distortions')
plt.title("The Elbow Method")
plt.show()
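
For readers who want to see what the SSE on the y-axis measures, here is a minimal standard-library sketch on made-up 2D points (the data and the helper name `within_cluster_sse` are illustrative, not taken from the car data). It computes the sum of squared distances from each point to its assigned center, which is the quantity scikit-learn exposes as `inertia_`.

```python
from math import dist


def within_cluster_sse(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    return sum(dist(p, centers[k]) ** 2 for p, k in zip(points, labels))


# Two well-separated groups of two points each (toy data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: a single center at the overall mean.
sse_k1 = within_cluster_sse(points, [0, 0, 0, 0], [(5.0, 1.0)])
# k = 2: one center per group.
sse_k2 = within_cluster_sse(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(sse_k1, sse_k2)
```

The SSE drops sharply from k = 1 to k = 2 because the second center captures the real group structure; adding centers beyond that would only shave off small amounts, which is the "elbow" the plot is meant to reveal.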

Clustering with 5 clusters

# K-means clustering with k=5
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300, random_state=0)
kmeans.fit(features3)
lab = kmeans.predict(features3)
print(lab)

Visualizing the clustering result

# 2D scatter plot of the clusters on the first two principal components
plt.figure(figsize=(8, 8))
plt.scatter(features3[:, 0], features3[:, 1], c=lab)
# Label every point with its car_ID
for ii in range(len(features3)):
    plt.text(features3[ii, 0], features3[ii, 1], s=car_price.car_ID[ii])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means PCA')
plt.show()

# 3D scatter plot of the clusters on the first three principal components
from mpl_toolkits.mplot3d import Axes3D
plt.figure(figsize=(8, 8))
ax = plt.subplot(111, projection='3d')
ax.scatter(features3[:, 0], features3[:, 1], features3[:, 2], c=lab)
# Rotate the view so the clusters are easier to tell apart
ax.view_init(30, 45)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()

Choosing k with the silhouette coefficient

# Silhouette plot and 3D scatter plot for each candidate k
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples, silhouette_score
from mpl_toolkits.mplot3d import Axes3D

for n_clusters in range(2, 9):
    fig = plt.figure(figsize=(12, 6))
    ax1 = fig.add_subplot(121)
    ax2 = fig.add_subplot(122, projection='3d')
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(features3) + (n_clusters + 1) * 10])

    km = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10, max_iter=300, random_state=0)
    y_km = km.fit_predict(features3)
    silhouette_avg = silhouette_score(features3, y_km)
    print('n_cluster=', n_clusters, 'The average silhouette_score is :', silhouette_avg)

    # Left panel: sorted silhouette values of each cluster, stacked vertically
    silhouette_vals = silhouette_samples(features3, y_km, metric='euclidean')
    y_ax_lower = 10
    for i in range(n_clusters):
        c_silhouette_vals = silhouette_vals[y_km == i]
        c_silhouette_vals.sort()
        cluster_i = c_silhouette_vals.shape[0]
        y_ax_upper = y_ax_lower + cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(range(y_ax_lower, y_ax_upper), 0, c_silhouette_vals,
                          edgecolor='none', color=color)
        ax1.text(-0.05, y_ax_lower + 0.5 * cluster_i, str(i))
        y_ax_lower = y_ax_upper + 10
    ax1.set_title('The silhouette plot for the various clusters')
    ax1.set_xlabel('The silhouette coefficient values')
    ax1.set_ylabel('Cluster label')
    # Dashed red line: average silhouette score over all samples
    ax1.axvline(x=silhouette_avg, color='red', linestyle='--')
    ax1.set_yticks([])
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1.0])

    # Right panel: samples and cluster centers on the first three principal components
    colors = cm.nipy_spectral(y_km.astype(float) / n_clusters)
    ax2.scatter(features3[:, 0], features3[:, 1], features3[:, 2],
                marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')
    centers = km.cluster_centers_
    ax2.scatter(centers[:, 0], centers[:, 1], centers[:, 2],
                marker='o', c='white', alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], c[2], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    ax2.view_init(30, 45)

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters), fontsize=14, fontweight='bold')
    plt.show()
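
As background for reading the silhouette plots, the following standard-library sketch computes per-sample silhouette values on toy 2D points. For a sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the value is (b - a) / max(a, b). The helper `silhouette` and the toy points are illustrative; the sketch assumes at least two clusters and assigns 0 to single-member clusters.

```python
from math import dist


def silhouette(points, labels):
    """Per-sample silhouette values s = (b - a) / max(a, b)."""
    clusters = {}
    for p, k in zip(points, labels):
        clusters.setdefault(k, []).append(p)

    scores = []
    for p, k in zip(points, labels):
        own = clusters[k]
        if len(own) == 1:
            scores.append(0.0)  # convention for a cluster with a single member
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it adds nothing to the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for j, other in clusters.items() if j != k)
        scores.append((b - a) / max(a, b))
    return scores


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette(points, [0, 1, 0, 1])   # groups cut across the geometry

print(good)
print(bad)
```

Values near 1 mean a sample sits well inside its cluster, values near 0 mean it lies close to a boundary, and negative values suggest it may have been assigned to the wrong cluster.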

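Reading the silhouette plots together with the 3D scatter plots: when k is too small, separate clusters get merged; when k is too large, some clusters get split into several.

With k=2, each cluster is large and many samples have coefficients close to 0, which means many samples lie near a boundary. Separate clusters have been merged and the model is poor.

With k=3, most samples in cluster '0' have silhouette coefficients below the average silhouette score, and a few have coefficients below 0 that tend toward -1, which suggests those samples may have been assigned to the wrong cluster.

With k=4, most samples in cluster '0' have silhouette coefficients below the average score and close to 0, so they lie near a boundary. This cluster would probably be better split into 2 separate clusters.

With k=7 or 8, some clusters are split into several whose centers are very close together, which gives a very poor model.

With k=5 or 6, most samples extend beyond the dashed line and the clusters look good in both cases. By score, k=6 ranks above k=5. With k=5, cluster '3' is very large; with k=6, the clusters are more evenly sized.

In summary, both k=5 and k=6 give an acceptable model. k=6 is chosen because its clusters are more balanced.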

# Re-run the clustering with k=6
kmeans = KMeans(n_clusters=6, init='k-means++', n_init=10, max_iter=300, random_state=0)
y_pred = kmeans.fit_predict(features3)
print(y_pred)

# Attach the cluster label to a copy of the original data
car_df_km = car_price.copy()
car_df_km['km_result'] = y_pred

Output:

[4 4 4 1 5 3 5 5 5 0 4 5 4 5 5 5 4 5 3 3 1 3 3 0 1 1 1 0 1 0 3 3 3 3 3 1 1
 3 3 1 1 1 3 1 3 1 3 5 5 4 3 3 3 1 1 4 4 4 4 3 1 3 1 2 1 5 2 2 2 2 2 5 4 5
 4 0 3 3 3 0 0 3 0 0 0 1 1 1 1 3 2 3 1 1 3 3 1 1 3 1 1 5 5 5 4 4 4 5 2 5 2
 5 2 5 2 5 2 5 3 0 1 1 1 1 0 4 4 4 4 4 1 3 3 1 3 1 0 5 3 3 3 1 1 1 1 5 1 1
 1 5 3 3 1 1 1 1 1 1 2 2 1 1 1 3 3 4 4 4 4 4 4 4 4 1 2 1 1 1 4 4 5 5 2 3 2
 1 1 2 1 3 3 5 2 1 5 5 5 5 5 5 5 5 5 2 5]

4. Analyzing the Clustering Result

# Number of car models in each cluster
car_df_km.groupby('km_result')['car_ID'].count()

Output:

km_result
0    13
1    59
2    20
3    43
4    31
5    39
Name: car_ID, dtype: int64

# Show all columns and all rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Number of models per brand within each cluster
car_df_km.groupby(by=['km_result', 'carBrand'])['car_ID'].count()

# Number of models per cluster for each brand
car_df_km.groupby(by=['carBrand', 'km_result'])['car_ID'].count()

# Find the cluster of the car whose name contains 'vokswagen' (spelled this way in the data)
df = car_df_km.loc[:, ['car_ID', 'CarName', 'carBrand', 'km_result']]
print(df.loc[df['CarName'].str.contains("vokswagen")])
# The match is 'vokswagen rabbit', car_ID 183, assigned to cluster 2.

# Competitors of 'vokswagen rabbit': every model in cluster 2
df.loc[df['km_result'] == 2]

# Competitors of the volkswagen brand within the clusters where it appears
li = [1, 2, 3, 5]  # the volkswagen brand appears in clusters 1, 2, 3 and 5
df_volk = df[df['km_result'].isin(li)].sort_values(by=['km_result', 'carBrand'])
df_volk
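
The core idea of the lookup, that a car's competitors are the other cars sharing its cluster label, can be sketched without pandas. The helper `competitors` is hypothetical; the five car IDs and their labels are read off the k=6 label array printed earlier (car_ID n sits at position n - 1).

```python
def competitors(car_ids, labels, target_id):
    """Return the other car IDs that share a cluster with target_id."""
    cluster = labels[car_ids.index(target_id)]
    return [cid for cid, k in zip(car_ids, labels) if k == cluster and cid != target_id]


# A small excerpt of the k=6 result: car_ID 1 is in cluster 4, the others in cluster 2.
car_ids = [1, 183, 185, 188, 193]
labels = [4, 2, 2, 2, 2]

print(competitors(car_ids, labels, 183))
```

For car_ID 183 this returns the three other Volkswagen models in the excerpt; on the full data it would return every other member of cluster 2.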

Extracting the competitors of 'vokswagen rabbit' from the full data

df0 = car_df_km.loc[car_df_km['km_result'] == 2]
df0.head()
df0_1 = df0.drop(['car_ID', 'CarName', 'km_result'], axis=1)

# Distribution of every feature for the cars in cluster 2
fig = plt.figure(figsize=(20, 20))
i = 1
for c in df0_1.columns:
    ax = fig.add_subplot(7, 4, i)
    if pd.api.types.is_numeric_dtype(df0_1[c]):
        # Numeric variable: histogram
        sns.histplot(df0_1[c], ax=ax)
    else:
        # Categorical variable: bar chart of value counts
        # (recent seaborn versions require x= and y= as keyword arguments)
        counts = df0_1[c].value_counts()
        sns.barplot(x=counts.index, y=counts.values, ax=ax)
    i = i + 1
    plt.xlabel('')
    plt.title(c)
plt.subplots_adjust(top=1.2)
plt.show()

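Categorical variables that take only one value in this cluster:
fueltype: {'diesel'}; enginelocation: {'front'}; fuelsystem: {'idi'}

These shared characteristics can be ignored in the competitor analysis.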

# Pivot the cluster 2 data by size class, body style, brand and car name,
# so the cars can be compared within each size class
df2 = df0.pivot_table(index=['carSize', 'carbody', 'carBrand', 'CarName'])
df2

Output (index values repeated on every row for readability):

carSize	carbody	carBrand	CarName	boreratio	car_ID	carheight	carlength	carwidth	citympg	compressionratio	curbweight	enginesize	highwaympg	horsepower	km_result	peakrpm	price	stroke	symboling	wheelbase
A0	hatchback	toyota	toyota corolla	3.27	160	52.8	166.3	64.4	38	22.5	2275	110	47	56	2	4500	7788.0	3.35	0	95.7
A0	sedan	nissan	nissan gt-r	2.99	91	54.5	165.3	63.8	45	21.9	2017	103	50	55	2	4800	7099.0	3.47	1	94.5
A0	sedan	toyota	toyota corona	3.27	159	53.0	166.3	64.4	34	22.5	2275	110	36	56	2	4500	7898.0	3.35	0	95.7
A	sedan	mazda	mazda glc deluxe	3.39	64	55.5	177.8	66.5	36	22.7	2443	122	42	64	2	4650	10795.0	3.39	0	98.8
A	sedan	mazda	mazda rx-7 gs	3.43	67	54.4	175.0	66.1	31	22.0	2700	134	39	72	2	4200	18344.0	3.64	0	104.9
A	sedan	toyota	toyota celica gt	3.27	175	54.9	175.6	66.5	30	22.5	2480	110	33	73	2	4500	10698.0	3.35	-1	102.4
A	sedan	volkswagen	vokswagen rabbit	3.01	183	55.7	171.7	65.5	37	23.0	2261	97	46	52	2	4800	7775.0	3.40	2	97.3
A	sedan	volkswagen	volkswagen model 111	3.01	185	55.7	171.7	65.5	37	23.0	2264	97	46	52	2	4800	7995.0	3.40	2	97.3
A	sedan	volkswagen	volkswagen rabbit custom	3.01	193	55.1	180.2	66.9	33	23.0	2579	97	38	68	2	4500	13845.0	3.40	0	100.4
A	sedan	volkswagen	volkswagen super beetle	3.01	188	55.7	171.7	65.5	37	23.0	2319	97	42	68	2	4500	9495.0	3.40	2	97.3
B	hardtop	buick	buick century	3.58	70	54.9	187.5	70.3	22	21.5	3495	183	25	123	2	4350	28176.0	3.64	0	106.7
B	sedan	buick	buick electra 225 custom	3.58	68	56.5	190.9	70.3	22	21.5	3515	183	25	123	2	4350	25552.0	3.64	-1	110.0
B	sedan	peugeot	peugeot 304	3.70	109	56.7	186.7	68.4	28	21.0	3197	152	33	95	2	4150	13200.0	3.52	0	107.9
B	sedan	peugeot	peugeot 504	3.70	117	56.7	186.7	68.4	28	21.0	3252	152	33	95	2	4150	17950.0	3.52	0	107.9
B	sedan	peugeot	peugeot 604sl	3.70	113	56.7	186.7	68.4	28	21.0	3252	152	33	95	2	4150	16900.0	3.52	0	107.9
B	sedan	volvo	volvo 246	3.01	204	55.5	188.8	68.9	26	23.0	3217	145	27	106	2	4800	22470.0	3.40	-1	109.1
B	wagon	buick	buick century luxus (sw)	3.58	69	58.7	190.9	70.3	22	21.5	3750	183	25	123	2	4350	28248.0	3.64	-1	110.0
C	wagon	peugeot	peugeot 504	3.70	111	58.7	198.9	68.4	25	21.0	3430	152	25	95	2	4150	13860.0	3.52	0	114.2
C	wagon	peugeot	peugeot 505s turbo diesel	3.70	115	58.7	198.9	68.4	25	21.0	3485	152	25	95	2	4150	17075.0	3.52	0	114.2
D	sedan	buick	buick skyhawk	3.58	71	56.3	202.6	71.7	22	21.5	3770	183	25	123	2	4350	31600.0	3.64	-1	115.6

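Cluster 2 contains the following size classes: A0 (small), A (compact), B (mid-size), C (upper mid-size) and D (luxury).
The car with car_ID 183, vokswagen rabbit, is an A-class compact, so its most direct competitors are the other cars among the 7 in class A.

# Extract the A-class cars in cluster 2
df0_A = df0.loc[df0['carSize'] == 'A']
df0_A

# Categorical variables of the A-class cars in cluster 2
cate_col = df0_A.select_dtypes(include='object').columns
df3 = df0_A[cate_col]
df3

# Pivot the features of the A-class cars in cluster 2
df4 = df0_A.pivot_table(index=['carBrand', 'CarName', 'doornumber', 'aspiration', 'drivewheel'])
df4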

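The 7 A-class cars, including 'vokswagen rabbit', all have 4 cylinders. According to the table above, stroke ranges from 3.35 to 3.64, peak rpm from 4200 to 4800, compression ratio from 22.0 to 23.0, car width from 65.5 to 66.9, car height from 54.4 to 55.7, and bore ratio from 3.01 to 3.43. On all of these the cars are fairly similar.

Car buyers typically look at: size class (carSize), brand (carBrand), power (horsepower), safety (symboling), fuel consumption (citympg, highwaympg), interior space (wheelbase), body (carbody, curbweight), and so on.

The following key features are used to examine how 'vokswagen rabbit' differs from its competitors:

Basic information: 'carBrand', 'doornumber', 'curbweight'

Fuel consumption: 'highwaympg', 'citympg'

Safety: 'symboling'

Chassis: 'drivewheel'

Power: 'aspiration', 'enginesize', 'horsepower'

Interior space: 'wheelbase'

Price: 'price'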

# Fuel consumption ('citympg', 'highwaympg')
lab = df0_A['CarName']

fig, ax = plt.subplots(figsize=(10, 8))
# Highway mileage is drawn first; the shorter city bar is drawn on top of it
ax.barh(range(len(lab)), df0_A['highwaympg'], tick_label=lab, color='red')
ax.barh(range(len(lab)), df0_A['citympg'], tick_label=lab, color='blue')

# Annotate each bar with its value
for i, (highway, city) in enumerate(zip(df0_A['highwaympg'], df0_A['citympg'])):
    ax.text(highway, i, highway, ha='right')
    ax.text(city, i, city, ha='right')

plt.legend(('highwaympg', 'citympg'), loc='upper right')
plt.title('miles per gallon')
plt.show()

# The other 6 features
lab = df0_A['CarName']
colors = ['yellow', 'blue', 'green', 'red', 'gray', 'tan', 'darkviolet']
col2 = ['symboling', 'wheelbase', 'enginesize', 'horsepower', 'curbweight', 'price']
data = df0_A[col2]

fig = plt.figure(figsize=(10, 8))
i = 1
for c in data.columns:
    ax = fig.add_subplot(3, 2, i)
    plt.barh(range(len(lab)), data[c], tick_label=lab, color=colors)
    # Annotate each bar with its value
    for y, x in enumerate(data[c].values):
        plt.text(x, y, "%s" % x)
    i = i + 1
    plt.xlabel('')
    plt.title(c)
plt.subplots_adjust(top=1.2, wspace=0.7)
plt.show()

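From the bar charts above, 'vokswagen rabbit' compares with its competitors as follows:

Safety: its insurance risk rating is 2, which is riskier than the Mazda and Toyota models.

Interior space: its wheelbase is the smallest.

Power: its engine size and horsepower are both the smallest.

Weight: its curb weight is the smallest.

Price: its price is the lowest.

In summary, compared with the other A-class competitors in cluster 2, 'vokswagen rabbit' has:

Weaknesses: lower safety rating, less interior space, less power

Strengths: light body, low fuel consumption, low price (very good value among similarly equipped cars)

Design: two-door sedan (three-box body)

Product positioning: "an economical, practical A-class compact sedan for city commuting"

Recommendations for sales and promotion: (1) excellent value among similarly equipped models; (2) low fuel consumption, which makes city commuting cheap; (3) a compact body that is easy to park; (4) a distinctive two-door design.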

