Clustering
This page describes clustering algorithms in MLlib. The guide for clustering in the RDD-based API also has relevant information about these algorithms.
Table of Contents
- K-means
- Latent Dirichlet allocation (LDA)
- Bisecting k-means
- Gaussian Mixture Model (GMM)
- Power Iteration Clustering (PIC)
K-means
k-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called kmeans||.
KMeans is implemented as an Estimator and generates a KMeansModel as the base model.
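The parallelized kmeans|| initialization and a few other knobs are exposed as setters on KMeans. Below is a minimal sketch of the commonly tuned parameters; the values shown are illustrative, not recommendations:

import org.apache.spark.ml.clustering.KMeans

// A sketch of commonly tuned KMeans parameters (illustrative values).
val kmeansSketch = new KMeans()
  .setK(2)                  // number of clusters
  .setInitMode("k-means||") // "k-means||" (default) or "random"
  .setInitSteps(2)          // steps of the k-means|| initialization
  .setMaxIter(20)           // maximum number of iterations
  .setTol(1e-4)             // convergence tolerance
  .setSeed(1L)              // fixed seed for reproducibility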

Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Make predictions
val predictions = model.transform(dataset)

// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()

val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala” in the Spark repo.
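The ClusteringEvaluator above computes the silhouette with the squared Euclidean distance by default. A small sketch, assuming the cosine measure is wanted instead and reusing the predictions DataFrame from the example:

import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Sketch: silhouette under the cosine measure instead of the default
// "squaredEuclidean".
val cosineEvaluator = new ClusteringEvaluator()
  .setDistanceMeasure("cosine")
val cosineSilhouette = cosineEvaluator.evaluate(predictions)
println(s"Silhouette with cosine distance = $cosineSilhouette")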
Latent Dirichlet allocation (LDA)
LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates a LDAModel as the base model. Expert users may cast a LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.
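A minimal sketch of that cast, assuming the model is trained with the EM optimizer (the default optimizer is "online", for which the cast would fail) on a dataset like the one loaded in the example below:

import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA}

// Sketch: train with the EM optimizer so fit() yields a distributed model.
val emLDA = new LDA().setK(10).setMaxIter(10).setOptimizer("em")
val emModel = emLDA.fit(dataset)

// The cast is only valid for models produced by the EM optimizer.
val distributedModel = emModel.asInstanceOf[DistributedLDAModel]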
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.LDA

// Loads data.
val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_lda_libsvm_data.txt")

// Trains an LDA model.
val lda = new LDA().setK(10).setMaxIter(10)
val model = lda.fit(dataset)

val ll = model.logLikelihood(dataset)
val lp = model.logPerplexity(dataset)
println(s"The lower bound on the log likelihood of the entire corpus: $ll")
println(s"The upper bound on perplexity: $lp")

// Describe topics.
val topics = model.describeTopics(3)
println("The topics described by their top-weighted terms:")
topics.show(false)

// Shows the result.
val transformed = model.transform(dataset)
transformed.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/LDAExample.scala” in the Spark repo.
Bisecting k-means
Bisecting k-means is a kind of hierarchical clustering using a divisive (or “top-down”) approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering.
BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.
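One parameter specific to the bisecting variant controls when a cluster stops being split. A brief sketch with illustrative values:

import org.apache.spark.ml.clustering.BisectingKMeans

// Sketch: stop splitting clusters below a minimum size. A value >= 1.0 is
// interpreted as a point count, a value < 1.0 as a fraction of all points.
val bkmSketch = new BisectingKMeans()
  .setK(4)
  .setMinDivisibleClusterSize(1.0)
  .setSeed(1)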
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains a bisecting k-means model.
val bkm = new BisectingKMeans().setK(2).setSeed(1)
val model = bkm.fit(dataset)

// Make predictions
val predictions = model.transform(dataset)

// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()

val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Shows the result.
println("Cluster Centers: ")
val centers = model.clusterCenters
centers.foreach(println)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala” in the Spark repo.
Gaussian Mixture Model (GMM)
A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.
GaussianMixture is implemented as an Estimator and generates a GaussianMixtureModel as the base model.
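In notation, the fitted density is the standard weighted sum of k Gaussian components,

p(\mathbf{x}) = \sum_{i=1}^{k} w_i \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad \sum_{i=1}^{k} w_i = 1,

where the weights w_i, means \boldsymbol{\mu}_i, and covariances \boldsymbol{\Sigma}_i correspond to model.weights(i), model.gaussians(i).mean, and model.gaussians(i).cov in the example below.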

Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.GaussianMixture

// Loads data
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains a Gaussian mixture model
val gmm = new GaussianMixture()
.setK(2)
val model = gmm.fit(dataset)

// Output the parameters of the fitted mixture model.
for (i <- 0 until model.getK) {
  println(s"Gaussian $i:\nweight=${model.weights(i)}\n" +
    s"mu=${model.gaussians(i).mean}\nsigma=\n${model.gaussians(i).cov}\n")
}
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala” in the Spark repo.
Power Iteration Clustering (PIC)
Power Iteration Clustering (PIC) is a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
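Concretely, with affinity matrix A and degree matrix D, the method repeats a normalized power step on the row-normalized matrix W = D^{-1} A (a sketch of the update from the Lin and Cohen paper),

v^{(t+1)} = \frac{W\, v^{(t)}}{\lVert W\, v^{(t)} \rVert_1},

stops the iteration early, and clusters the entries of the resulting low-dimensional embedding v with k-means.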
spark.ml’s PowerIterationClustering implementation takes the following parameters:
- k: the number of clusters to create
- initMode: param for the initialization algorithm
- maxIter: param for the maximum number of iterations
- srcCol: param for the name of the input column for source vertex IDs
- dstCol: param for the name of the input column for destination vertex IDs
- weightCol: param for the weight column name
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.PowerIterationClustering

val dataset = spark.createDataFrame(Seq(
(0L, 1L, 1.0),
(0L, 2L, 1.0),
(1L, 2L, 1.0),
(3L, 4L, 1.0),
(4L, 0L, 0.1)
)).toDF("src", "dst", "weight")

val model = new PowerIterationClustering()
  .setK(2)
  .setMaxIter(20)
  .setInitMode("degree")
  .setWeightCol("weight")

val prediction = model.assignClusters(dataset).select(“id”, “cluster”)

// Shows the cluster assignment
prediction.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala” in the Spark repo.
