Data Streams Mining
1- Introduction
With the pace of technological change at its peak, Silicon Valley keeps introducing new challenges that must be tackled in new and efficient ways. Continuous research is being carried out to improve existing tools, techniques, and algorithms to maximize their efficiency. Streaming data has remained a challenge for decades; plenty of stream-based algorithms have been introduced, yet researchers are still struggling to achieve better results and performance. As the saying goes, once water from a fire hose starts hitting your face, your chances of measuring it decrease rapidly. This is due to the torrential nature of streams, which has introduced new challenges in analyzing and mining them efficiently. Stream analysis has been made somewhat easier by a few new tools introduced to the market recently; these tools follow different approaches and algorithms that are being improved continuously. However, when it comes to mining data streams, it is not possible to store and iterate over them the way traditional mining algorithms do, due to their continuous, high-speed, and unbounded nature.
Due to the irregularity and variation of the arriving data, memory management has become the main challenge. Applications like sensor networks cannot afford mining algorithms with a high memory cost. Similarly, time management, data preprocessing techniques, and the choice of data structure are considered among the main challenges of stream mining algorithms. Summarization techniques derived from statistics are therefore used to deal with the memory limitation, and techniques from computational theory are used to design time- and space-efficient algorithms. Another challenge is the consumption of available resources; to cope with it, resource-aware mining has been introduced, which ensures that an algorithm consumes the available resources judiciously.
Since a data stream is seen only once, it must be mined in a single pass, which requires extremely fast algorithms that avoid problems like data sampling and shredding. Such algorithms should also be able to run on data streams partitioned across many distributed processing units in parallel settings. Infinite, high-volume data streams are produced by many online and offline real-time applications and systems, and their update rate is time-dependent. Because of this volume and speed, a special mechanism is required to extract knowledge from streaming data.
Many stream mining algorithms have been developed and proposed by the machine learning, statistical, and theoretical computer science communities. The question is: how do we know which algorithm best deals with the current challenges mentioned above, and what is still needed in the market? This document intends to answer these questions. As this research topic is quite vast, deciding on the best algorithm is not straightforward. We have compared the most recently published stream mining algorithms across three categories: classification, clustering, and frequent itemset mining, the last being a category of algorithms used to find statistics about streaming data.
2- Classification
The classification task is to decide the proper label for any given record from a dataset. It is a part of supervised learning: the algorithm learns patterns and important features from a set of labeled data (ground truths), resulting in a model that is then used for classification. There are various metrics used to rate the performance of a model. For example, accuracy focuses on maximizing the number of correct labels, while specificity focuses on minimizing the mislabelling of the negative class. A few factors are crucial in deciding which metrics to use for a classification task, such as the label distribution and the purpose of the task itself.
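As a quick illustration of these two metrics, here is a minimal sketch in Python computed from the four cells of a binary confusion matrix (the function and variable names are our own):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # Fraction of all records that received the correct label.
    return (tp + tn) / (tp + tn + fp + fn)

def specificity(tn: int, fp: int) -> float:
    # Fraction of actual negatives correctly labeled negative, i.e. the
    # metric to watch when mislabelling the negative class is costly.
    return tn / (tn + fp)
```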
There are also several types of classification algorithms, such as decision trees, logistic regression, neural networks, and Naive Bayes. In this work, we focus on decision trees.
In a decision tree, the learning algorithm constructs a tree-like model in which each internal node tests a splitting attribute and each leaf holds a predicted label. For every item, the decision tree sorts the item according to the splitting attributes down to the leaf containing the predicted label.
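Prediction is then a simple walk from the root to a leaf. A minimal sketch, assuming categorical attributes and hypothetical node fields:

```python
def predict(node, record):
    # Follow the branch matching the record's value for each splitting
    # attribute until a leaf is reached; the leaf stores the label.
    while not node.is_leaf:
        node = node.children[record[node.split_attribute]]
    return node.label
```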
2.1 Hoeffding Trees
Currently, decision tree algorithms such as ID3 and C4.5 build trees from large amounts of data by recursively selecting the best attribute to split on, using metrics such as entropy, information gain, and Gini. However, these algorithms are not suitable when the training data cannot fit in memory.
There exist a few incremental learning methods in which the learning system, instead of fitting the entire dataset in memory at once, learns continuously from the stream of data. However, such models have been found to lack the correctness guarantees of batch learning on the same amount of data.
Domingos and Hulten [1] formulated a decision tree algorithm called the Hoeffding Tree. With a Hoeffding Tree, the records or training instances themselves are not saved in memory; only the tree nodes and their statistics are stored. Furthermore, the most interesting property of this tree is that its correctness converges to that of trees built by a batch learning algorithm, given sufficiently massive data.
The training method for this tree is simple: for each sample, sort it down to the corresponding leaf and update that leaf's statistics.
Two conditions must be fulfilled for a leaf to be split:
1. There exists impurity in the leaf node, i.e. not every record sorted to the leaf has the same class.
2. The difference in the evaluation function between the best attribute and the second-best attribute, denoted ΔG, is greater than ε, where ε is the Hoeffding bound:

ε = √( R² · ln(1/δ) / (2n) )

where R is the range of the attribute's evaluation function, δ (provided by the user) is the desired probability of the sample estimate not being within ε of the true value, and n is the number of samples collected at that node.
In the paper, it is rigorously proven that the error of this tree is bounded via the Hoeffding inequality. Another excellent property of this tree is that to reduce the error bound exponentially, the sample size only needs to grow linearly.
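To make the split test concrete, here is a minimal sketch in Python (assuming, for instance, information gain as the evaluation function; the function names are our own):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    # With probability at least 1 - delta, the true mean of a random
    # variable with range R lies within epsilon of the mean of n samples.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    # Split once the observed gap between the two best attributes exceeds
    # what could plausibly be explained by sampling noise.
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```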
2.2 VFDT Algorithm
Domingos and Hulten further introduced a refinement of the Hoeffding Tree called VFDT (Very Fast Decision Tree). The main idea is the same as the Hoeffding Tree.
The refinements are:
• Ties: VFDT introduces an extra parameter τ. It is used when the gap ΔG between the best and the second-best attribute is too small to separate them with confidence, i.e. the two are effectively tied (see the sketch after this list).
• G computation: Another parameter, nmin, denotes the minimum number of new samples that must arrive before G is recomputed. The computation of G can thus be deferred instead of being performed every time a new sample arrives, which reduces the overall time spent on frequent calculations of G.
• Memory: VFDT introduces a mechanism for pruning the least promising leaves from the tree whenever the maximum available memory is fully utilized. The criterion used to decide whether a leaf is pruned is the product of pl, the probability that a random example reaches that leaf, and el, its observed error rate. The leaf with the lowest value of this criterion is considered the least promising and is deactivated.
• Dropping Poor Attributes: Another performance boost comes from dropping attributes that are deemed unpromising early. If the difference between an attribute's evaluation value and that of the best attribute is greater than ε, that attribute can be dropped. However, the paper does not explain the exact parameters or conditions under which an attribute is dropped.
• Initialization: VFDT can be bootstrapped with a tree built by an existing memory-based decision tree learner, allowing it to reach the same accuracy with a smaller number of examples. No detailed algorithm is provided, however.
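Returning to the first two refinements, a minimal sketch of how they modify the split test of Section 2.1 (τ and nmin are VFDT's parameters; the bookkeeping class and names are our own):

```python
def vfdt_should_split(best_gain: float, second_gain: float,
                      value_range: float, delta: float, n: int,
                      tau: float) -> bool:
    eps = hoeffding_bound(value_range, delta, n)
    gap = best_gain - second_gain
    # Normal Hoeffding test, plus the tie rule: once eps has shrunk below
    # tau, the two best attributes are so close that either will do.
    return gap > eps or eps < tau

class LeafStats:
    def __init__(self, n_min: int):
        self.n_min = n_min
        self.samples_since_eval = 0

    def observe(self, sample) -> bool:
        # Defer the (expensive) recomputation of G until n_min new
        # samples have accumulated at this leaf.
        self.samples_since_eval += 1
        if self.samples_since_eval < self.n_min:
            return False
        self.samples_since_eval = 0
        return True  # caller should now recompute G and run the split test
```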
2.3 Hoeffding Adaptive Trees
One of the fallacies in data mining is the assumption that the distribution of the data remains stationary. This is often not the case: consider supermarket retail data, which can change rapidly with each season. This phenomenon is called concept drift.
As a solution, Bifet et al [2] proposed sliding-window and adaptivity-based enhancements to the Hoeffding Tree. The algorithm for building such a tree is based on the authors' previous work, ADWIN (Adaptive Windowing), a parameter-free algorithm for detecting and estimating changes in a data stream.
An adaptive learning algorithm needs to be able to decide three things:
• What needs to be remembered?
• When is the correct time to update the model?
• How should the model be updated?
Therefore, a procedure is needed that can predict and detect changes in the data distribution. Here, this role is served by the ADWIN algorithm mentioned above.
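ADWIN's core idea is to keep a window of recent performance values and shrink it whenever two sub-windows exhibit significantly different means. A heavily simplified sketch follows; real ADWIN checks every split point and uses exponential histograms for memory efficiency, while this version checks only the midpoint:

```python
import math
from collections import deque

class SimpleADWIN:
    def __init__(self, delta: float = 0.002):
        self.delta = delta
        self.window = deque()

    def _threshold(self, n0: int, n1: int) -> float:
        # Hoeffding-style cut threshold on the difference of the two means.
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))

    def add(self, x: float) -> bool:
        self.window.append(x)
        n = len(self.window)
        if n < 10:
            return False
        half = n // 2
        data = list(self.window)
        older, newer = data[:half], data[half:]
        gap = abs(sum(older) / len(older) - sum(newer) / len(newer))
        if gap > self._threshold(len(older), len(newer)):
            for _ in range(half):   # drop the stale half of the window
                self.window.popleft()
            return True             # change detected
        return False
```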
The main idea of the Hoeffding Adaptive Tree is that, aside from the main tree, alternative trees are created as well. An alternative tree is created as soon as a distribution change is detected in the data stream, and it replaces the main tree once there is evidence that it is substantially more accurate. Ultimately, the change and adaptation of trees happen automatically, judged from the timing and nature of the data rather than from prior knowledge supplied by the user. Note that the algorithm still retains, in principle, the procedure of building and splitting the tree according to the Hoeffding bound, similar to VFDT.
In their experiments, the authors mainly compared the Hoeffding Adaptive Tree with CVFDT (Concept-adapting Very Fast Decision Tree). CVFDT was formulated by the same authors as VFDT; it is basically VFDT with an attempt to handle concept drift. In terms of error rate on a synthetically generated dataset with a massive concept change, the Hoeffding Adaptive Tree reached a lower error rate more quickly than CVFDT, i.e. it adapted faster, and it also halved memory consumption. The drawback is that it consumes the most time, four times more than CVFDT.
3 Clustering
Clustering partitions a given set of objects into groups called clusters, such that each group contains similar objects and is clearly different from the other groups. Grouping similar objects with the goal of simplifying the data, so that a cluster can be replaced by one or a few representatives, is the core of the clustering process. Clustering algorithms are the tools used to cluster high-volume datasets. We have selected three recent clustering algorithms and compared them with others based on performance criteria: efficient creation of clusters, the capability to handle a large number of clusters, and the chosen data structure.
3.1 Stream KM++ Clustering Algorithm
The Stream KM++ clustering algorithm is based on the ideas of k-means++ and Lloyd's algorithm (also called the k-means algorithm) [3]. Lloyd's algorithm is one of the most famous clustering algorithms: given a fixed set of centers, each point is assigned to its nearest center, and the mean of the points assigned to a center is taken as the best center for that cluster. k-means++ serves as a seeding method for Lloyd's algorithm; it gives good practical results and guarantees a quality solution. Neither algorithm is suitable for data streams, as both require random access to the input data.
Stream KM++ computes a small weighted representative sample of the data points (known as a coreset) via a non-uniform sampling approach in one pass; it then runs k-means++ on the computed sample, and in a second pass greedily assigns points to the center of the nearest cluster. Non-uniform sampling is a time-consuming task, and the use of coreset trees has decreased this time significantly. A coreset tree is a binary tree associated with a hierarchical divisive clustering of a given set of points: one starts with a single cluster containing the whole point set and successively partitions existing clusters into two sub-clusters, such that the points in one sub-cluster are far from the points in the other. The overall construction is based on the merge-and-reduce technique: whenever two samples with the same number of input points are present, their union is taken in the merge phase and a new, smaller sample is produced in the reduce phase using a coreset tree [4].
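The bucket mechanics of merge-and-reduce can be sketched as follows. Note that the reduction step below is a stand-in: real Stream KM++ builds a coreset tree and samples non-uniformly, while we simply subsample for brevity:

```python
import random

def reduce_points(points: list, m: int) -> list:
    # Placeholder reduction: compress a merged sample back down to size m.
    # Stream KM++ would build a coreset tree here instead.
    return random.sample(points, m)

class MergeAndReduce:
    def __init__(self, m: int):
        self.m = m
        self.buckets = []   # buckets[i] summarizes roughly 2^i * m points

    def insert(self, batch: list) -> None:
        # batch is a list of m raw points read from the stream.
        sample, i = batch, 0
        while i < len(self.buckets) and self.buckets[i] is not None:
            # Two samples of equal rank exist: merge them, then reduce.
            sample = reduce_points(self.buckets[i] + sample, self.m)
            self.buckets[i] = None
            i += 1
        if i == len(self.buckets):
            self.buckets.append(None)
        self.buckets[i] = sample
```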
In comparison with BIRCH (a top-down hierarchical clustering algorithm), Stream KM++ is slower, but in terms of the sum of squared errors it computes a better solution, by up to a factor of 2, and it does not require trial-and-error parameter adjustment. The quality of the StreamLS algorithm is comparable to Stream KM++, but the running time of Stream KM++ scales much better with the number of cluster centers. Stream KM++ is faster on large datasets and computes solutions on a par with k-means++.
3.2 Adaptive Clustering for Dynamic IoT Data Streams
A dynamic environment such as the IoT, where the distribution of data streams changes over time, requires a clustering algorithm that can adapt to the flowing data. Many stream clustering algorithms depend on parameterization to find the number of clusters in a data stream, yet determining the number of clusters in unknown flowing data is one of the key tasks of clustering. To deal with this problem, an adaptive clustering method is introduced by D. Puschmann, P. Barnaghi, and R. Tafazolli in [5], designed specifically for IoT stream data. The method updates the cluster centroids upon detecting a change in the data stream by analyzing its distribution. It creates dynamic clusters and assigns data to them by investigating the data distribution at a given time instance, and it specializes in adapting to the data drifts of data streams. Data drift describes real concept drift (explained in 2.3) caused by changes in the streaming data. The method uses the data distribution and a measurement of cluster quality to detect the number of categories inherent in the data stream; this works without prior knowledge about the data and thus discovers the inherent categories.
A set of experiments on synthesized data and on live data from an intelligent traffic scenario is reported in [5]. The experiments were performed with both the adaptive and a non-adaptive clustering algorithm, and the results were compared on a cluster quality metric (the silhouette coefficient). The results show that the adaptive clustering method produces clusters of 12.2 percent better quality than the non-adaptive one.
In comparison with the Stream KM++ algorithm explained in 3.1, it can be concluded that Stream KM++ is not designed for evolving data streams.
3.3 PERCH: An Online Hierarchical Algorithm for Extreme Clustering
The number of applications requiring clustering algorithms keeps increasing, and their requirements are changing due to the rapidly growing data they contain. Such modern clustering applications need algorithms that scale efficiently with data size and complexity.
While many of the currently available clustering algorithms can handle large, high-dimensional datasets, very few can handle datasets with many clusters. The same holds for stream mining clustering algorithms. Since streaming data can contain many clusters, the problem of having a large number of data points spread over many clusters is known as the extreme clustering problem. The PERCH (Purity Enhancing Rotations for Cluster Hierarchies) algorithm scales mildly with both N (the number of data points) and K (the number of clusters), and thus addresses the extreme clustering problem. Researchers at the University of Massachusetts Amherst published it in April 2017.
The algorithm maintains a large tree data structure efficiently. The tree is constructed and grown incrementally over the incoming data points by routing each point to a leaf, while quality is maintained via rotation operations. The choice of a rich tree data structure provides an efficient (logarithmic-time) search that scales to large datasets, along with multiple clusterings that can be extracted at various resolutions. Such greedy incremental clustering procedures introduce some errors, which the rotation operations can recover from.
It is claimed in [6] that the algorithm constructs a tree with perfect dendrogram purity, regardless of the number of data points and without knowledge of the number of clusters; this is achieved through recursive rotation procedures. To achieve scalability, the paper introduces another type of rotation operation that encourages balance, together with an approximation that enables faster point insertion. The algorithm also has a leaf-collapsing mode to cope with limited memory, i.e. when the dataset does not fit in main memory (as with data streams). In this mode the algorithm expects an additional parameter, an upper bound on the number of leaves in the cluster tree; once the balance rotations are performed, a COLLAPSE procedure is invoked that merges leaves as necessary to meet the bound.
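The greedy insertion routing at the heart of the algorithm can be sketched as follows. This is a simplification: the node fields are hypothetical, and the purity- and balance-restoring rotations performed after each insertion are omitted:

```python
def route_to_leaf(root, x, dist):
    # Descend from the root, always following the child whose stored
    # representative is closest to the incoming point x.
    node = root
    while node.children:
        node = min(node.children,
                   key=lambda child: dist(child.representative, x))
    return node   # the new point is inserted as a sibling of this leaf
```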
In comparison with other online and multi-pass tree-building algorithms, PERCH achieved the highest dendrogram purity in addition to being efficient, and it is competitive with all other scalable clustering algorithms. Among algorithms with or without a tree as a compact data structure, PERCH scales best with the number of clusters K. BIRCH, by contrast, is based on top-down hierarchical clustering methods in which the leaves of each internal node are represented by mean and variance statistics; these statistics are used to insert points greedily, and no rotation operations are performed. BIRCH has been shown to construct the worst clusterings among its competitors. The comparison with Stream KM++ shows that coreset construction is an expensive operation that does not scale to the extreme clustering problem where K is very large.
The authors of [6] applied the PERCH algorithm to a variety of real-world datasets and showed it to be correct and efficient.
4 Frequent Itemset Mining
Frequent itemset mining refers to mining patterns or items that appear frequently in a dataset. Formally, assume there exists a set I comprising n distinct items {i1, i2, ..., in}. A subset X ⊆ I is called a pattern. The source of the data to be mined is transactions; if a pattern is a subset of a transaction t, i.e. X ⊆ t, then X is said to occur in t. The basic metric of frequent itemset mining is called support: the support of a pattern is the number of transactions in which that pattern occurs. For a natural number min_sup, given as a parameter, any pattern whose support is greater than or equal to min_sup is called a frequent pattern.
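These definitions take only a few lines of Python, using sets for transactions (the toy database is our own example):

```python
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]

def support(pattern: set, database: list) -> int:
    # Number of transactions in which the pattern occurs (is a subset).
    return sum(1 for t in database if pattern <= t)

min_sup = 2
assert support({"a", "b"}, transactions) == 2       # frequent
assert support({"a", "b", "c"}, transactions) == 1  # not frequent
```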
One of the most famous data structures for frequent itemset mining is the FP-Tree [7]. However, the FP-Tree requires scanning the item database multiple times, which is very costly for fast-moving data streams. An ideal stream algorithm should have a one-pass property to function optimally.
A commonly recurring feature in data stream mining is the use of window models. According to Jin et al, there are three types of window models [8]; a sketch of the last two follows the list.
1. Landmark window: the focus is on finding frequent itemsets from a starting time a to a time point b. Consequently, if a is set to 1, the mining algorithm mines the entire data stream.
2. Sliding window: given a time point b and a window length a, the algorithm mines items from time b - a + 1 to b. In other words, it only considers the items currently inside the window.
3. Damped window model: more weight is given to newly arrived items. This can be done simply by assigning a decay rate to the itemsets.
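A minimal sketch of the sliding and damped models (the eager decay loop is for clarity; practical implementations usually decay lazily using timestamps):

```python
from collections import defaultdict, deque

# Sliding window: only the most recent `a` transactions matter.
a = 1000
window = deque(maxlen=a)   # appending evicts the oldest transaction

# Damped window: every count fades by a constant factor per arrival.
decay = 0.999
counts = defaultdict(float)

def observe(transaction: set) -> None:
    window.append(transaction)
    for item in list(counts):
        counts[item] *= decay      # older occurrences weigh less
    for item in transaction:
        counts[item] += 1.0
```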
4.1 Mining Maximal Frequent Itemsets in Data Streams Based on FP-Tree
This work [9] introduces a new algorithm, FpMFI-DS, which is an improvement of the FpMFI (frequent pattern tree for maximal frequent itemsets) [10] algorithm for data streams. FpMFI itself is an algorithm that compresses the FP-Tree and checks superset patterns optimally.
FpMFI-DS is designed to store the transactions of a landmark window or a sliding window. The consequence of adopting windows for mining is that the FP-Tree must be updated when a transaction falls out of the window. This is done via a tidlist, a list of transaction IDs together with a pointer to the last node of each transaction in the tree. Another important detail of FpMFI-DS stems from the one-pass requirement: instead of ordering the items in the FP-Tree by frequency, they are ordered lexicographically.
A further improvement in FpMFI-DS is the introduction of a new technique called ESEquivPS. In the authors' experiments, it reduces the size of the search space by about 30%.
4.2 Mining Maximal Frequent Patterns in Transactional Databases and Dynamic Data Streams: A Spark-Based Approach
In this work [11], Karim et al describe how to build a tree-like structure to mine maximal frequent patterns effectively. Maximal frequent patterns are patterns with a maximal number of items, that is, frequent patterns that have no frequent superset pattern.
For example, assume that our transaction database contains three patterns AB, BC, and ABD with supports of 7, 5, and 3. If we set the minimum support to 2, all of them are frequent patterns. However, AB is not a maximal frequent pattern, since it is a subset of ABD, which is itself frequent.
The authors utilize prime numbers for faster computation and lower memory consumption. The idea is that each distinct item in the database is represented by a distinct prime number, and a transaction is represented by the product of the primes representing its items, called its Transaction Value (TV). This formulation has a few interesting properties (illustrated in the sketch after the list):
1. A huge number of possible distinct items: there are 105,097,565 primes that fit in a 32-bit integer, so theoretically around 100 million different items can be represented. However, computing the Transaction Value may cause integer overflow, so a class like BigInteger is used.
2. No two different transactions have the same Transaction Value: since the Transaction Value is a product of primes, it follows from unique factorization that the mapping between itemsets and Transaction Values is bijective.
3. The greatest common divisor yields the common items: if δ is the GCD of the Transaction Values of two transactions α and β, the items common to both transactions can be obtained by factoring δ.
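A small sketch of the encoding and the GCD property (the four-item alphabet is our own toy example):

```python
from math import gcd

PRIMES = {"A": 2, "B": 3, "C": 5, "D": 7}   # item -> distinct prime

def transaction_value(items: set) -> int:
    tv = 1
    for item in items:
        tv *= PRIMES[item]
    return tv

def common_items(tv_a: int, tv_b: int) -> set:
    # Factor gcd(tv_a, tv_b) back into items: each prime dividing the
    # GCD corresponds to an item present in both transactions.
    g, shared = gcd(tv_a, tv_b), set()
    for item, p in PRIMES.items():
        if g % p == 0:
            shared.add(item)
    return shared

assert common_items(transaction_value({"A", "B", "D"}),
                    transaction_value({"B", "C", "D"})) == {"B", "D"}
```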
From the Transaction Values, a tree-like structure called the ASP-tree is constructed. Inside this structure, each Transaction Value and its count are preserved. Furthermore, the tree maintains the following invariants:
1. Every node α is a direct or indirect descendant of all nodes whose TV is a multiple of the TV of α.
2. The count of each node is the total support of the transaction represented by its TV.
The authors also introduce the MFPAS algorithm to generate the maximal frequent itemsets from the ASP-tree. The algorithm scans the tree bottom-up and performs the necessary pruning to obtain the relevant Transaction Values, which are then decoded back into actual item lists. Interestingly, all the information needed to find the frequent itemsets is available in the tree, with no need to scan the database.
The procedure is suitable for either a batch or a data stream environment, and the authors include a Spark implementation. They also show that the difference between the batch and stream settings lies only in using the correct Spark API, i.e. the Spark Streaming API when working on stream data, while the proposed algorithm itself remains intact.
5 Summary Table
6 Conclusion
In this report, we have surveyed recent streaming data algorithms. Each algorithm is explained briefly, along with its key points and comparisons with other algorithms of the same class. At the end, we presented a summary table with the crux of all the algorithms explained. We found that recently introduced algorithms have solved the data problems (e.g. concept drift, data shredding, and sampling) and a few of the main challenges (e.g. memory limitation and data structure choice) that were considered drawbacks of algorithms a few years back. As the wheel of advancement has no destination, we expect further evolution in data stream mining algorithms, opening research lines for further developments.
Source: https://medium.com/swlh/data-streams-mining-c5012ff1b4c1
總結(jié)
以上是生活随笔為你收集整理的数据挖掘流程_数据流挖掘的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 数据库逻辑删除的sql语句_通过数据库的
- 下一篇: 域嵌套太深_pyspark如何修改嵌套结