Machine Learning on Spark: Statistics Basics (Part 1)

Published: 2024/1/23

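Main contents

This article introduces the statistics classes in the org.apache.spark.mllib.stat package and its subpackages. The stat package contains the following classes and objects:

[Figure: classes and objects in the org.apache.spark.mllib.stat package]

The following topics are covered in detail:

Column-wise summary statistics of a matrix

Kernel density estimation

Hypothesis testing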

1. Column-wise summary statistics of a matrix

Column statistics are computed per column of the matrix: for each column you obtain its maximum, minimum, mean, and other summary statistics.

package cn.ml.stat

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

object StatisticsDemo extends App {
  val sparkConf = new SparkConf().setAppName("StatisticsDemo").setMaster("spark://sparkmaster:7077")
  val sc = new SparkContext(sparkConf)

  // Each inner array is one row of the matrix, converted to a dense vector.
  val rdd1 = sc.parallelize(Array(
    Array(1.0, 2.0, 3.0, 4.0),
    Array(2.0, 3.0, 4.0, 5.0),
    Array(3.0, 4.0, 5.0, 6.0)
  )).map(f => Vectors.dense(f))

  // Part one of this series obtained a MultivariateStatisticalSummary with
  //   var mss: MultivariateStatisticalSummary = rowMatrix.computeColumnSummaryStatistics()
  // Here the summary is obtained through Statistics instead. Both routes share the
  // same implementation and ultimately return a MultivariateOnlineSummarizer
  // instance (that class is covered in the next section).
  // The source of Statistics.colStats is:
  //   def colStats(X: RDD[Vector]): MultivariateStatisticalSummary = {
  //     new RowMatrix(X).computeColumnSummaryStatistics()
  //   }
  // so Statistics.colStats simply calls RowMatrix.computeColumnSummaryStatistics.
  val mss: MultivariateStatisticalSummary = Statistics.colStats(rdd1)

  // The values below therefore match those obtained in part one
  // by calling computeColumnSummaryStatistics directly.
  println(mss.max)
  println(mss.min)
  println(mss.normL1)
  // normL2 and the other statistics are read the same way
}
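To make "column-wise" concrete, here is a minimal plain-Scala sketch that computes the same kinds of statistics for the three rows used above, without a Spark cluster. It is not how Spark implements colStats; the object name ColumnStatsSketch and the helper perColumn are made up for this illustration.

```scala
// Plain-Scala illustration only; ColumnStatsSketch and perColumn are hypothetical names.
object ColumnStatsSketch {
  // The same three rows as rdd1, held in a local Seq instead of an RDD.
  val rows: Seq[Seq[Double]] = Seq(
    Seq(1.0, 2.0, 3.0, 4.0),
    Seq(2.0, 3.0, 4.0, 5.0),
    Seq(3.0, 4.0, 5.0, 6.0)
  )

  // transpose turns the list of rows into a list of columns.
  val columns: Seq[Seq[Double]] = rows.transpose

  // Apply one reducer to every column.
  def perColumn(f: Seq[Double] => Double): Seq[Double] = columns.map(f)

  val colMax: Seq[Double] = perColumn(c => c.reduce((a, b) => math.max(a, b)))
  val colMin: Seq[Double] = perColumn(c => c.reduce((a, b) => math.min(a, b)))
  val colMean: Seq[Double] = perColumn(c => c.sum / c.size)
  val colNormL1: Seq[Double] = perColumn(c => c.map(x => math.abs(x)).sum)

  def main(args: Array[String]): Unit = {
    println(s"max    = ${colMax.mkString(", ")}")    // 3.0, 4.0, 5.0, 6.0
    println(s"min    = ${colMin.mkString(", ")}")    // 1.0, 2.0, 3.0, 4.0
    println(s"mean   = ${colMean.mkString(", ")}")   // 2.0, 3.0, 4.0, 5.0
    println(s"normL1 = ${colNormL1.mkString(", ")}") // 6.0, 9.0, 12.0, 15.0
  }
}
```

Each result has one entry per column, which is the shape of the vectors returned by mss.max, mss.min and mss.normL1.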

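2. Kernel density estimation

Kernel density estimation (KDE) plays an important role in statistics. It is a non-parametric method for estimating the probability density function of a random variable. Let (x1, x2, …, xn) be n independent and identically distributed samples. The estimate of their probability density function is defined as:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

Here K(·) is called the kernel and h is the bandwidth, a smoothing parameter greater than 0. For more detail see https://en.wikipedia.org/wiki/Kernel_density_estimation
Many kernel functions exist, but Spark implements only the Gaussian kernel:

$$K(u) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)$$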

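import org.apache.spark.mllib.stat.KernelDensity

val sample = sc.parallelize(Seq(0.0, 1.0, 4.0, 4.0))
val kernelDensity = new KernelDensity()
  .setSample(sample)   // sample used for the density estimate
  .setBandwidth(3.0)   // bandwidth; for the Gaussian kernel this is the standard deviation

// Estimate the probability density at the given points
// densities: Array[Double] =
//   Array(0.07464879256673691, 0.1113106036883375, 0.08485447240456075)
val densities = kernelDensity.estimate(Array(-1.0, 2.0, 5.0))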
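The densities above can be checked by hand. A Gaussian KDE averages one normal density per sample point, each centred on that point with the bandwidth as its standard deviation. The following plain-Scala sketch applies that definition to the same sample. It is an illustration under that assumption, not Spark's source code, and the names GaussianKdeSketch, normalPdf and estimate are made up for the example.

```scala
// Plain-Scala illustration only; all names here are hypothetical.
object GaussianKdeSketch {
  // Density of a normal distribution with the given mean and standard deviation at x.
  def normalPdf(x: Double, mean: Double, sd: Double): Double = {
    val z = (x - mean) / sd
    math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.Pi))
  }

  // For each query point, average the Gaussian bumps centred on the sample points.
  def estimate(sample: Seq[Double], bandwidth: Double, points: Seq[Double]): Seq[Double] = {
    require(sample.nonEmpty, "sample must not be empty")
    require(bandwidth > 0.0, "bandwidth must be positive")
    points.map(x => sample.map(xi => normalPdf(x, xi, bandwidth)).sum / sample.size)
  }

  def main(args: Array[String]): Unit = {
    val densities = estimate(Seq(0.0, 1.0, 4.0, 4.0), 3.0, Seq(-1.0, 2.0, 5.0))
    // Agrees with the Spark output quoted above to within rounding:
    // roughly 0.0746, 0.1113, 0.0849
    println(densities.mkString(", "))
  }
}
```

A larger bandwidth widens every bump and smooths the estimate, while a smaller one makes the estimate follow the individual sample points more closely.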

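3. Hypothesis testing

Hypothesis testing is used in statistics to draw inferences about a population from a sample under a stated hypothesis, and then to accept or reject that hypothesis. There are many hypothesis-testing methods; for details see http://baike.baidu.com/link?url=f3DhyOL_9OLVupNkCk82fdOhYOvYKzTWSVNyJqDNBD2hqr1nSlxmqpMiStqnWgNrW3ni9U_kZgy2GA5_8kSAHa. At the time of writing, Spark provides only Pearson's chi-squared (χ²) test, which was derived by the statistician Pearson. In theory, the statistic obtained by squaring the difference between the observed count (fo) and the expected count (fe), dividing by the expected count, and summing over all categories approximately follows a chi-squared distribution.

The chi-squared test has two main applications: the goodness-of-fit test and the independence test. The goodness-of-fit test checks whether observed counts agree with expected counts, and applies to count data classified by a single factor. The independence test checks whether two or more factors, each with several categories, are associated or independent (see http://en.wikipedia.org/wiki/Chi-squared_test). In Spark, the goodness-of-fit test takes a Vector as input and the independence test takes a Matrix. An independence test on RDD[LabeledPoint] is also supported. The corresponding methods are: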

// Independence test on labeled feature vectors (LabeledPoint); returns Array[ChiSqTestResult].
// Only the PEARSON method (the chi-squared test) is currently supported.
/**
 * Conduct Pearson's independence test for each feature against the label across the input RDD.
 * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
 * the independence test.
 * Returns an array containing the ChiSquaredTestResult for every feature against the label.
 */
def chiSquaredFeatures(data: RDD[LabeledPoint],
    methodName: String = PEARSON.name): Array[ChiSqTestResult]

// Goodness-of-fit test on a Vector.
// Only the PEARSON method (the chi-squared test) is currently supported.
/**
 * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
 * Uniform distribution is assumed when `expected` is not passed in.
 */
def chiSquared(observed: Vector,
    expected: Vector = Vectors.dense(Array[Double]()),
    methodName: String = PEARSON.name): ChiSqTestResult

// Independence test; the input must be a Matrix.
// Only the PEARSON method (the chi-squared test) is currently supported.
/**
 * Pearson's independence test on the input contingency matrix.
 * TODO: optimize for SparseMatrix when it becomes supported.
 */
def chiSquaredMatrix(counts: Matrix, methodName: String = PEARSON.name): ChiSqTestResult
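As a rough illustration of what the independence test on a contingency matrix computes, the plain-Scala sketch below derives the expected count of each cell from the row and column totals and then sums (fo − fe)² / fe over all cells. It follows the textbook definition rather than Spark's source. The object and method names are made up, and the counts in main are hypothetical.

```scala
// Plain-Scala illustration only; names and sample counts are hypothetical.
object ChiSqIndependenceSketch {
  /** Returns (statistic, degrees of freedom) for a contingency table given as rows of counts. */
  def independence(table: Seq[Seq[Double]]): (Double, Int) = {
    require(table.nonEmpty && table.forall(_.size == table.head.size),
      "table must be non-empty and rectangular")
    val rowSums = table.map(_.sum)
    val colSums = table.transpose.map(_.sum)
    val total = rowSums.sum
    require(rowSums.forall(_ > 0.0) && colSums.forall(_ > 0.0),
      "every row and column needs a positive total")

    val statistic = (for {
      (row, i) <- table.zipWithIndex
      (observed, j) <- row.zipWithIndex
    } yield {
      // Expected count of cell (i, j) if the two factors were independent.
      val expected = rowSums(i) * colSums(j) / total
      val diff = observed - expected
      diff * diff / expected
    }).sum

    val degreesOfFreedom = (table.size - 1) * (table.head.size - 1)
    (statistic, degreesOfFreedom)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical 2x2 table in which the two factors are perfectly associated.
    val (statistic, dof) = independence(Seq(Seq(10.0, 0.0), Seq(0.0, 10.0)))
    println(s"statistic = $statistic, degrees of freedom = $dof") // 20.0 and 1
  }
}
```

A statistic near 0 means the observed table is close to what independence predicts. The larger the statistic for a given number of degrees of freedom, the stronger the evidence of association.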

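Suppose there are two plots of land, and we use the following data to test whether the proportion of red flowers is the same on both:
Plot 1: red flowers 1000, blue flowers 1856
Plot 2: red flowers 400, blue flowers 560

The code is as follows: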

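import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val land1 = Vectors.dense(1000.0, 1856.0)
val land2 = Vectors.dense(400.0, 560.0)
// Goodness-of-fit test: land1 holds the observed counts, land2 the expected ones
val c1 = Statistics.chiSqTest(land1, land2)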

Result:

c1: org.apache.spark.mllib.stat.test.ChiSqTestResult = Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 52.0048019207683
pValue = 5.536682223805656E-13
Very strong presumption against null hypothesis: observed follows the same distribution as expected..

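Judging from the result, the p-value (about 5.5E-13) is far below any commonly used significance level, so the null hypothesis that the observed counts follow the same distribution as the expected counts is rejected. The two plots therefore do not show the same proportion of red flowers (about 35% on plot 1 versus about 42% on plot 2).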
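The reported statistic can be reproduced by hand. The sketch below assumes, as the Spark API documentation describes for the Vector form of chiSqTest, that the expected vector is rescaled so its sum matches the observed sum before the statistic is computed. It is a plain-Scala illustration, not Spark's code, with made-up object and method names. It does not compute the p-value, which needs the chi-squared distribution function.

```scala
// Plain-Scala illustration only; names are hypothetical.
object ChiSqGoodnessOfFitSketch {
  /** Returns (statistic, degrees of freedom) for observed counts against expected frequencies. */
  def goodnessOfFit(observed: Seq[Double], expected: Seq[Double]): (Double, Int) = {
    require(observed.size == expected.size && observed.size > 1,
      "observed and expected need the same size, with at least two categories")
    require(observed.forall(_ >= 0.0) && expected.forall(_ > 0.0),
      "observed must be non-negative and expected must be positive")

    // Assumption: rescale expected so that both vectors have the same total.
    val scale = observed.sum / expected.sum
    val statistic = observed.zip(expected).map { case (o, e) =>
      val scaled = e * scale
      (o - scaled) * (o - scaled) / scaled
    }.sum

    (statistic, observed.size - 1)
  }

  def main(args: Array[String]): Unit = {
    val (statistic, dof) = goodnessOfFit(Seq(1000.0, 1856.0), Seq(400.0, 560.0))
    // Expected counts after rescaling are 1190 and 1666, so the statistic is
    // 190^2 / 1190 + 190^2 / 1666, about 52.0048, with 1 degree of freedom.
    println(s"statistic = $statistic, degrees of freedom = $dof")
  }
}
```

Because this form treats the second plot as a fixed expected distribution, a comparison of two samples may be better served by the independence test on the 2×2 table of counts.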
