

Spark MLlib Regression: Decision Trees

Published: 2025/3/17


(1) Decision tree concepts

1. Comparing the decision tree algorithms (ID3, C4.5, CART):

  1. When choosing the split attribute at the root and at internal nodes, ID3 uses information gain as its criterion. The drawback of information gain is that it is biased toward attributes with many distinct values, which in some cases carry little useful information.

  2. ID3 can only build decision trees over datasets whose attributes are all discrete, whereas C4.5 and CART can handle both discrete and continuous attributes.
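To make the bias of information gain concrete, here is a small standalone Python sketch of information gain versus C4.5's gain ratio. The toy rows, the "id"-like attribute, and the labels are all invented for illustration; an attribute with one distinct value per record gets the maximum possible information gain, and gain ratio penalizes it by the split entropy.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Information gain of splitting `rows` (list of dicts) on `attr`."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def gain_ratio(rows, attr, labels):
    """C4.5's gain ratio: information gain normalized by split entropy."""
    split_info = entropy([r[attr] for r in rows])
    return info_gain(rows, attr, labels) / split_info if split_info else 0.0

# Toy data: "id" is unique per row, "windy" is a genuine binary attribute.
rows = [{"id": i, "windy": i % 2} for i in range(4)]
labels = [0, 1, 1, 1]

# "id" achieves the maximum gain (it memorizes each row) ...
print(round(info_gain(rows, "id", labels), 4))     # -> 0.8113
# ... but gain ratio divides by its large split entropy (log2(4) = 2).
print(round(gain_ratio(rows, "id", labels), 4))    # -> 0.4056
```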

2. A worked C4.5 example (see http://m.blog.csdn.net/article/details?id=44726921)

C4.5 post-pruning strategy: mainly pessimistic pruning; see http://www.cnblogs.com/zhangchaoyang/articles/2842490.html
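Pessimistic pruning can be sketched roughly as follows. This is a common textbook formulation (Quinlan's 0.5-per-leaf continuity correction plus a one-standard-error comparison); the exact rule in the linked article may differ, and the function names and numbers below are illustrative only.

```python
import math

def pessimistic_error(errors, n_leaves):
    """Pessimistic training-error estimate: observed errors plus a
    0.5 continuity correction per leaf."""
    return errors + 0.5 * n_leaves

def should_prune(subtree_errors, n_leaves, n_samples, leaf_errors):
    """Prune the subtree to a single leaf if the leaf's pessimistic error
    does not exceed the subtree's pessimistic error plus one standard error."""
    e_subtree = pessimistic_error(subtree_errors, n_leaves)
    p = e_subtree / n_samples
    std = math.sqrt(n_samples * p * (1 - p))
    e_leaf = pessimistic_error(leaf_errors, 1)
    return e_leaf <= e_subtree + std

# A subtree with 4 leaves misclassifies 7 of 16 samples; collapsed to a
# single leaf it would misclassify 8. The pessimistic test favors pruning.
print(should_prune(7, 4, 16, 8))  # -> True
```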

(2) Decision tree regression with Spark MLlib

1. Dataset source and description: see http://www.cnblogs.com/ksWorld/p/6891664.html

2. Implementation:

  2.1 Building the input data:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val file_bike = "hour_nohead.csv"
// Parse each CSV row: columns 2 .. length-4 are the features, the last column is the label
val file_tree = sc.textFile(file_bike).map(_.split(",")).map { x =>
  val feature = x.slice(2, x.length - 3).map(_.toDouble)
  val label = x(x.length - 1).toDouble
  LabeledPoint(label, Vectors.dense(feature))
}
println(file_tree.first())

// Empty map: treat every feature as continuous
val categoricalFeaturesInfo = Map[Int, Int]()
// impurity = "variance", maxDepth = 5, maxBins = 32
val model_DT = DecisionTree.trainRegressor(file_tree, categoricalFeaturesInfo, "variance", 5, 32)

  2.2 Model evaluation metrics (MSE, MAE, RMSLE)

val predict_vs_train = file_tree.map { point =>
  (model_DT.predict(point.features), point.label)
  /* point => (math.exp(model_DT.predict(point.features)), math.exp(point.label)) */
}
predict_vs_train.take(5).foreach(println(_))

// MSE: mean squared error
val mse = predict_vs_train.map(x => math.pow(x._1 - x._2, 2)).mean()
// MAE: mean absolute error
val mae = predict_vs_train.map(x => math.abs(x._1 - x._2)).mean()
// RMSLE: root mean squared log error
val rmsle = math.sqrt(predict_vs_train.map(x =>
  math.pow(math.log(x._1 + 1) - math.log(x._2 + 1), 2)).mean())
println(s"mse is $mse and mae is $mae and rmsle is $rmsle")
// mse is 11611.485999495755 and mae is 71.15018786490428 and rmsle is 0.6251152586960916
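For readers who want to sanity-check the three metrics outside Spark, here is a small standalone Python sketch of the same formulas on toy numbers (not the bike-sharing data). The +1 inside RMSLE keeps the logarithm defined when a prediction or label is zero.

```python
import math

def mse(pred, actual):
    """Mean squared error."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred)

def mae(pred, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

def rmsle(pred, actual):
    """Root mean squared log error; the +1 handles zero values."""
    return math.sqrt(sum((math.log(p + 1) - math.log(a + 1)) ** 2
                         for p, a in zip(pred, actual)) / len(pred))

pred = [100.0, 50.0, 10.0]
actual = [90.0, 60.0, 8.0]
print(mse(pred, actual))            # mean of 100, 100, 4 -> 68.0
print(round(mae(pred, actual), 3))  # mean of 10, 10, 2 -> 7.333
print(round(rmsle(pred, actual), 4))
```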

(3) Improving model performance and tuning parameters

1. Transforming the target (applying a log transform to the target values): change the following lines

LabeledPoint(math.log(label), Vectors.dense(feature))

and

val predict_vs_train = file_tree.map { point =>
  /* point => (model_DT.predict(point.features), point.label) */
  (math.exp(model_DT.predict(point.features)), math.exp(point.label))
}
// Result: mse is 14781.575988339053 and mae is 76.41310991122032 and rmsle is 0.6405996100717035

The decision tree's performance actually degrades after this transformation.

2. Tuning the model parameters

  1. Building the training and test sets

val file_tree = sc.textFile(file_bike).map(_.split(",")).map { x =>
  val feature = x.slice(2, x.length - 3).map(_.toDouble)
  val label = x(x.length - 1).toDouble
  LabeledPoint(label, Vectors.dense(feature))
  /* LabeledPoint(math.log(label), Vectors.dense(feature)) */
}
// 80/20 train/test split with a fixed seed
val tree_orgin = file_tree.randomSplit(Array(0.8, 0.2), 11L)
val tree_train = tree_orgin(0)
val tree_test = tree_orgin(1)
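As a rough standalone analogue of what `randomSplit(Array(0.8, 0.2), 11L)` does, the sketch below assigns each record to a part with probability proportional to its weight. This only loosely mirrors Spark's per-record Bernoulli sampling; the function name and data are illustrative, and the resulting split sizes are approximate, not exact.

```python
import random

def random_split(data, weights, seed):
    """Rough analogue of Spark's RDD.randomSplit: assign each record to a
    part with probability proportional to its weight, using a fixed seed."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for record in data:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:
                parts[i].append(record)
                break
    return parts

train, test = random_split(list(range(1000)), [0.8, 0.2], seed=11)
print(len(train), len(test))  # roughly 800 / 200
```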

  2. Tuning the tree depth

val categoricalFeaturesInfo = Map[Int, Int]()

// Train one model per tree depth and evaluate RMSLE on the test set
val Deep_Results = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = DecisionTree.trainRegressor(tree_train, categoricalFeaturesInfo, "variance", param, 32)
  val scoreAndLabels = tree_test.map { point =>
    (model.predict(point.features), point.label)
  }
  val rmsle = math.sqrt(scoreAndLabels.map(x => math.pow(math.log(x._1) - math.log(x._2), 2)).mean)
  (s"$param lambda", rmsle)
}
// Print the RMSLE for each depth
Deep_Results.foreach { case (param, rmsl) => println(f"$param, rmsle = ${rmsl}") }
/*
1 lambda, rmsle = 1.0763369409492645
2 lambda, rmsle = 0.9735820606349874
3 lambda, rmsle = 0.8786984993014815
4 lambda, rmsle = 0.8052113493915528
5 lambda, rmsle = 0.7014036913077335
10 lambda, rmsle = 0.44747906135994925
20 lambda, rmsle = 0.4769214752638845
*/

  Deeper trees overfit; judging from these results, the best tree depth for this dataset is around 10.

  3. Tuning the number of bins

// Train one model per maxBins value (tree depth fixed at 10)
val ClassNum_Results = Seq(2, 4, 8, 16, 32, 64, 100).map { param =>
  val model = DecisionTree.trainRegressor(tree_train, categoricalFeaturesInfo, "variance", 10, param)
  val scoreAndLabels = tree_test.map { point =>
    (model.predict(point.features), point.label)
  }
  val rmsle = math.sqrt(scoreAndLabels.map(x => math.pow(math.log(x._1) - math.log(x._2), 2)).mean)
  (s"$param lambda", rmsle)
}
// Print the RMSLE for each bin count
ClassNum_Results.foreach { case (param, rmsl) => println(f"$param, rmsle = ${rmsl}") }
/*
2 lambda, rmsle = 1.2995002615220668
4 lambda, rmsle = 0.7682777577495858
8 lambda, rmsle = 0.6615110909041817
16 lambda, rmsle = 0.4981237727958235
32 lambda, rmsle = 0.44747906135994925
64 lambda, rmsle = 0.4487531073836407
100 lambda, rmsle = 0.4487531073836407
*/

  More bins make the model more complex and can help when the feature dimensionality is high. Beyond a certain point, however, additional bins contribute little, and overfitting can even degrade test-set performance. For this dataset, around 32 bins appears to be the right choice.
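The maxBins parameter caps how many candidate split thresholds are considered per continuous feature. The sketch below is a loose, standalone illustration of quantile-style binning, not MLlib's actual implementation; the function name and data are invented for the example.

```python
def candidate_splits(values, max_bins):
    """Quantile-style candidate thresholds for one continuous feature,
    loosely mirroring what maxBins controls in a decision tree trainer."""
    s = sorted(values)
    n = len(s)
    if max_bins >= n:
        # Few enough distinct values: every value is a candidate
        return sorted(set(s))
    step = n / max_bins
    # Pick (max_bins - 1) approximately equally spaced quantile points
    return sorted({s[int(i * step)] for i in range(1, max_bins)})

values = list(range(100))
print(len(candidate_splits(values, 4)))   # -> 3 thresholds, i.e. 4 bins
print(len(candidate_splits(values, 32)))  # -> 31 thresholds: finer splits
```

With only 2 bins the tree can test a single coarse threshold per feature, which matches the poor RMSLE at the top of the table; past 32 bins the candidate thresholds barely change, which is why the 64- and 100-bin results are identical.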


Reposted from: https://www.cnblogs.com/ksWorld/p/6899594.html
