# Revenue and Forecasting: Building a Linear Regression Model to Predict Income Levels
## 1. Load the Data

Feature meanings: (omitted in the source.)
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import statsmodels.api as sm  # a powerful statistics package: regression, time-series analysis, hypothesis tests, etc.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_table('C:/Users/lb/Desktop/test/earndata3.txt', sep='\t', engine="python", encoding='utf-8')
data.columns.values
data.head()

# # Rename the columns
# data.rename(columns={'類(lèi)型': 'type', '盈利率': 'profit', '付費(fèi)率': 'pay', '活躍率': 'active',
#                      '收入': 'income', '觸達(dá)比例': 'touch', '轉(zhuǎn)化比例': 'conves', '新增比例': 'new',
#                      '運(yùn)營(yíng)費(fèi)用占比': 'operate', '服務(wù)費(fèi)用占比': 'servicce'}, inplace=True)
# data

# plt.rcParams controls figure details such as size, line style, and width.
# Scatter plot of two columns:
plt.rcParams['font.sans-serif'] = ['SimHei']   # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False     # display the minus sign correctly
plt.scatter(data['feature35'], data['收入'])
plt.xlabel('feature35')
plt.ylabel('收入')
plt.show()
```

## 2. Missing Values and How to Fill Them
```python
# Check for missing values
na_num = pd.isna(data).sum()
print(na_num)

# Fill missing values with fillna():
# df['taixin'] = df['taixin'].fillna(df['taixin'].mean())     # mean
# df['taixin'] = df['taixin'].fillna(df['taixin'].mode()[0])  # mode
# df['taixin'] = df['taixin'].interpolate()                   # interpolation
```

Outliers are usually inspected with a box plot:
```python
# Check for outliers
plt.boxplot(data['feature1'])
plt.show()
```
Seaborn extends matplotlib: it is a data-visualization library with a higher-level API that is more convenient and flexible to use. It covers:

1. Histograms and density plots
2. Bar charts and heat maps
3. Styling options for figures
4. Color-palette utilities
A Pearson correlation coefficient below 0.4 indicates a weak relationship between two variables; a coefficient between 0.4 and 0.6 indicates a moderate relationship; a coefficient above 0.6 indicates a strong relationship.
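A quick way to inspect these correlations is a seaborn heat map of the Pearson correlation matrix. A minimal sketch, assuming `data` holds the numeric feature columns loaded above:

```python
import seaborn as sns

# Pearson correlation matrix of all numeric columns (pandas' default method)
corr = data.corr()

# Heat map of the coefficients; red/blue encode the sign
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='RdBu_r', vmin=-1, vmax=1)
plt.show()
```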
Export the data.
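The steps below operate on `train_x`, `test_x`, `train_y`, and `test_y`, but the split itself is not shown in the original. A minimal sketch with `train_test_split`, assuming `收入` is the target column and an illustrative 70/30 split:

```python
from sklearn.model_selection import train_test_split

# Assumption: '收入' (income) is the target; every other column is a feature.
X = data.drop(columns=['收入'])
y = data['收入']
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=420)
```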
## 3. Reset the Indices to Start from 0
```python
# Reset the indices
for i in [train_x, test_x]:
    i.index = range(i.shape[0])
train_x.head()
```
Import the `linear_model` module, then create a model with `linear_model.LinearRegression`. The constructor takes several parameters (run `help(linear_model.LinearRegression)` to see them):

```python
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
```

1. `fit_intercept`: bool. Whether to compute the intercept. Defaults to True; if the data is already centered, you can set it to False.
2. `normalize`: bool. Whether to standardize (center) the data. Defaults to False; True applies the standardization, but this is rarely used. You can apply `StandardScaler` yourself instead, as in the sketch after this list.
3. `copy_X`: bool. Whether to copy the X data. Defaults to True; if False, X may be overwritten by the centering step.
4. `n_jobs`: int. How many cores to use for the computation. Defaults to 1; -1 uses all cores.
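A minimal sketch of standardizing manually with `StandardScaler`, assuming the `train_x`/`test_x` split used throughout:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_x_scaled = scaler.fit_transform(train_x)  # learn mean/std on the training set only
test_x_scaled = scaler.transform(test_x)        # apply the same statistics to the test set
```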
```python
# Build the linear regression model
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(train_x, train_y)
display(model.intercept_)  # the model's intercept
display(model.coef_)       # the model's coefficients
[*zip(train_x.columns, model.coef_)]
```

Output:

```
[('feature1', -1.306876920532666),
 ('feature2', -2.135359164945528),
 ('feature3', 0.3922366623031011),
 ('feature4', -0.4006529864411556),
 ('feature5', -0.800071692310333),
 ('feature6', -0.04600005945256574),
 ('feature7', -0.11265174384663018),
 ('feature8', 0.10045483109131019),
 ('feature9', -0.32704175764925286),
 ('feature10', 1.2434839400292559),
 ('feature11', 1.5832422299388336),
 ('feature12', -0.09226719574100156),
 ('feature13', -2.459151124321635),
 ('feature14', 1.9779923876600112),
 ('feature15', 0.024525225949885543),
 ('feature16', 0.00918442799938769),
 ('feature17', 0.006352020349345027),
 ('feature18', 1.6985207539315572),
 ('feature19', -4.9467995989630396e-05),
 ('feature20', -0.022589825867085835),
 ('feature21', 0.09947572093574145),
 ('feature22', -1.0843768519460424),
 ('feature23', -0.000538610562770904),
 ('feature24', 0.007229716791249371),
 ('feature25', 0.0011119599539866497),
 ('feature26', 0.23842187221124864),
 ('feature27', 0.026069170729882317),
 ('feature28', 0.00691494578802669),
 ('feature29', -0.0449676591248237),
 ('feature30', 0.0011027808324655089),
 ('feature31', -1.151930096574248),
 ('feature32', 0.001446787798073345),
 ('feature33', 0.012505109738047488),
 ('feature34', 0.3162910511343061),
 ('feature35', -0.4609002081574919),
 ('feature36', -0.03493518878291976),
 ('feature37', 0.000816129764761191),
 ('feature38', 1.4467629087041338),
 ('feature39', 0.038077869864662946),
 ('feature40', 2.4660343230998505e-05)]
```
## 4. Predicted vs. Actual Plot
```python
# Predicted vs. actual
pre_train = model.predict(train_x)
# plt.plot(range(len(pre_train)), sorted(pre_train), label='yuce')
# plt.plot(range(len(train_y)), sorted(train_y), label='shiji')
plt.plot(range(len(pre_train)), pre_train, label='yuce')
plt.plot(range(len(train_y)), train_y, label='shiji')
plt.legend()
plt.show()
```

## 5. Sort the Predictions Before Plotting

The unsorted plot above is hard to read; sorting both series makes the comparison clearer:
```python
# Predicted vs. actual, sorted
pre_train = model.predict(train_x)
plt.plot(range(len(pre_train)), sorted(pre_train), label='yuce')
plt.plot(range(len(train_y)), sorted(train_y), label='shiji')
# plt.plot(range(len(pre_train)), pre_train, label='yuce')
# plt.plot(range(len(train_y)), train_y, label='shiji')
plt.legend()
plt.show()
```

## 6. Model Evaluation
```python
from sklearn.metrics import mean_squared_error as MSE
MSE(train_y, pre_train)

# Cross-validation; a negative score here represents a loss
from sklearn.model_selection import cross_val_score
cross_val_score(model, train_x, train_y, cv=10, scoring="r2").mean()
```
Compute the MSE, then the R² score on the test set:

```python
# Compute the MSE via sklearn
from sklearn.metrics import mean_squared_error
mean_squared_error(train_y, pre_train)

# Compute R²
pre_y = model.predict(test_x)
from sklearn.metrics import r2_score
score = r2_score(test_y, pre_y)  # first argument: true values; second: predictions
score
```
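For reference, the mean squared error being computed here is

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ is the true value and $\hat{y}_i$ the prediction.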
## 7. Multicollinearity
Multicollinearity is a linear relationship among the independent variables themselves. When it is present, the significance tests on the coefficients become unreliable and the OLS estimates are distorted. It is usually detected with the variance inflation factor (VIF): a VIF above 10 indicates collinearity. If multicollinearity exists, you can drop variables or switch to a different model (e.g., LASSO).
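For each feature $j$, the VIF comes from regressing that feature on all the others and taking

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

so a feature that the other features can predict almost perfectly gets a very large VIF.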
`dmatrices` (from patsy) combines the features into the design matrices:
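The original does not show the VIF computation itself; a minimal sketch with `patsy.dmatrices` and statsmodels, assuming the feature columns are named `feature1` through `feature40` and the target is `收入`, might look like this:

```python
import pandas as pd
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Build the design matrices: target on the left, all features on the right.
features = ' + '.join(f'feature{i}' for i in range(1, 41))
y, X = dmatrices(f'收入 ~ {features}', data=data, return_type='dataframe')

# One VIF per column of the design matrix (including the intercept).
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['features'] = X.columns
print(vif)
```

This produces a table like the one below.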
```
    VIF Factor   features
0     8.244820  Intercept
1     1.159042   feature1
2     1.833402   feature2
3     1.062635   feature3
4     1.171333   feature4
5     1.024516   feature5
6     4.241079   feature6
7     9.112658   feature7
8     4.629342   feature8
9     4.295007   feature9
10    3.085991  feature10
11    1.106177  feature11
12    1.083720  feature12
13    1.073559  feature13
14    1.013818  feature14
15    1.070490  feature15
16    1.498616  feature16
17    1.248611  feature17
18    1.007976  feature18
19    1.095409  feature19
20    3.652704  feature20
21    3.985798  feature21
22    6.668252  feature22
23    3.120996  feature23
24    1.020012  feature24
25    2.170860  feature25
26    2.018375  feature26
27    1.926179  feature27
28    1.982204  feature28
29    1.492437  feature29
30    1.275029  feature30
31    5.298781  feature31
32    2.267253  feature32
33    2.655571  feature33
34    3.786994  feature34
35    6.317910  feature35
36    1.039711  feature36
37    3.315587  feature37
38    3.098461  feature38
39    1.313444  feature39
40    1.276520  feature40
```
Outliers are judged by the studentized residuals, using an absolute value of 2 as the cutoff: compute the proportion of outliers first, then drop them (a sketch follows).
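The original omits the code for this step; a minimal sketch with statsmodels, assuming the `train_x`/`train_y` split from above, could look like this:

```python
import numpy as np
import statsmodels.api as sm

# Fit an OLS model and pull out the (internally) studentized residuals.
ols_model = sm.OLS(train_y, sm.add_constant(train_x)).fit()
resid_stud = ols_model.get_influence().resid_studentized_internal

# Proportion of observations with |studentized residual| > 2
outlier_mask = np.abs(resid_stud) > 2
print('outlier ratio:', outlier_mask.mean())

# Drop the outliers (assumes the indices were reset to 0..n-1 earlier)
train_x_clean = train_x[~outlier_mask]
train_y_clean = train_y[~outlier_mask]
```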
`sklearn.preprocessing.PolynomialFeatures` constructs new features from existing ones; `degree` controls the order of the polynomial. Pass the data in as columns: a single feature usually arrives as a 1-D row, so reshape (transpose) it into a column first.
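A minimal sketch (the choice of `feature1` here is just for illustration):

```python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# A single feature is 1-D, so reshape it into a column vector first.
x = np.array(data['feature1']).reshape(-1, 1)

# degree=2 produces the bias column, the feature itself, and its square.
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
print(x_poly[:5])
```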
## 8. LASSO
The most important parameter of `Lasso` is `alpha` (float, optional, default 1.0). When `alpha` is 0 the algorithm is equivalent to ordinary least squares, which `LinearRegression` already implements, so setting `alpha` to 0 is not recommended.
```python
from sklearn.linear_model import Lasso

model5 = Lasso(alpha=1.0)
model5.fit(train_x, train_y)
pre_y5 = model5.predict(test_x)
score = r2_score(test_y, pre_y5)
score
```
## 9. Ridge Regression

```python
class sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True,
                                 max_iter=None, tol=0.001, solver='auto', random_state=None)
```

- `alpha`: regularization strength; larger values mean stronger regularization. Defaults to 1.0; tune this.
- `fit_intercept`: whether to compute the model's intercept. Defaults to True (compute it).
- `normalize`: when an intercept is computed and this is True, X is normalized before the regression. If you want standardization instead, leave `normalize=False` and standardize yourself.
- `copy_X`: defaults to True, copying X; otherwise X may be overwritten during computation.
- `max_iter`: maximum number of iterations for the conjugate-gradient solver.
- `tol`: float; the precision of the solution.
- `solver`: which solver to use, e.g. `svd` (singular value decomposition) or `lsqr` (least squares); for ridge regression it usually does not need tuning.
```python
# Ridge regression
from sklearn.linear_model import Ridge

model6 = Ridge(alpha=1.0)
model6.fit(train_x, train_y)
pre_y6 = model6.predict(test_x)
score = r2_score(test_y, pre_y6)
score
```

## 10. LassoCV
```python
# LassoCV
from sklearn.linear_model import LassoCV

alpha = np.logspace(-10, -2, 200, base=10)
lasso_ = LassoCV(alphas=alpha, cv=10).fit(train_x, train_y)
lasso_.mse_path_  # the MSE for every alpha on every CV fold
```
Best parameter:
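A small sketch of reading the selected `alpha` back out of the fitted `LassoCV` object (`lasso_` from above):

```python
# alpha chosen by cross-validation
print('best alpha:', lasso_.alpha_)

# R² of the refitted model on the test set
print('test R2:', lasso_.score(test_x, test_y))
```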
## Linear Regression on Python's Built-in Boston Housing Data
The Boston housing dataset comes from a 1978 U.S. economics journal. It contains the prices and attributes of a number of Boston houses; each record has 14 fields, including the crime rate, whether the house borders the river, and the average number of rooms, with the last field being the median house price.
```python
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn.datasets as datasets

# Load the data
boston_dataset = datasets.load_boston()
X_full = boston_dataset.data
Y_full = boston_dataset.target
boston = pd.DataFrame(X_full)
boston.columns = boston_dataset.feature_names
boston['PRICE'] = Y_full
print(boston.head())  # first few rows

# Data distribution
plt.scatter(boston.CHAS, boston.PRICE)
plt.xlabel('CHAS')
plt.ylabel('PRICE')
plt.show()

import seaborn as sns
sns.set()
sns.pairplot(boston)

# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, x_test, y_train, y_true = train_test_split(X_full, Y_full, test_size=0.2)

# Build the models
from sklearn.linear_model import LinearRegression  # linear regression
from sklearn.linear_model import Ridge             # ridge regression
from sklearn.linear_model import Lasso             # LASSO regression
from sklearn.linear_model import ElasticNet

linear = LinearRegression()
ridge = Ridge()
lasso = Lasso()
elasticnet = ElasticNet()

# Train the models
linear.fit(X_train, y_train)
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)
elasticnet.fit(X_train, y_train)

# Predict
y_pre_linear = linear.predict(x_test)
y_pre_ridge = ridge.predict(x_test)
y_pre_lasso = lasso.predict(x_test)
y_pre_elasticnet = elasticnet.predict(x_test)

# Compute R² scores
from sklearn.metrics import r2_score
linear_score = r2_score(y_true, y_pre_linear)
ridge_score = r2_score(y_true, y_pre_ridge)
lasso_score = r2_score(y_true, y_pre_lasso)
elasticnet_score = r2_score(y_true, y_pre_elasticnet)
display(linear_score, ridge_score, lasso_score, elasticnet_score)

# Compare predicted vs. true
# Linear
plt.plot(y_true, label='true')
plt.plot(y_pre_linear, label='linear')
plt.legend()

# Ridge
plt.plot(y_true, label='true')
plt.plot(y_pre_ridge, label='ridge')
plt.legend()

# Lasso
plt.plot(y_true, label='true')
plt.plot(y_pre_lasso, label='lasso')
plt.legend()

# ElasticNet
plt.plot(y_true, label='true')
plt.plot(y_pre_elasticnet, label='elasticnet')
plt.legend()

if __name__ == "__main__":
    pass
```