The Data Analyst's Toolkit: Models


You've cleaned up your data and done some exploratory data analysis. Now what? As data analysts we have a lot of tools in our toolkit, but just as a screwdriver can be used to hammer in a nail without being the best tool for the job, not every tool fits every task. Our tools are models, or if you prefer the mathematical term, algorithms. They allow us to make sense of the data we have collected and to make predictions.

There are three basic types of models, depending on the type of data. For continuous numerical data we have a variety of regression techniques. These are our screwdrivers and wrenches. Fairly simple to understand and use, they bring data together to fit them to some sort of line or multidimensional plane. For categorical or discrete data, we have clustering and classification models. These are our saws and knives. They separate the data into different pieces of like versus unlike. With so many choices, it may be difficult to know which tool to use under which circumstance. So, let’s look at each in turn.

Numerical regression models seek to find the best line to fit continuous numerical data. They can be linear, in which the dependent variable (usually called y) is fit to one or more independent variables using some type of polynomial function. Nonlinear regression is used to fit one or more independent variables to a logarithmic, exponential, or sigmoid function.

Linear regressions include (a short code sketch follows the list):

1) Single Linear Regression: one independent variable fit to a basic line:

  • y = mx + b, where m is the slope of the line and b is the value of y at x=0

2) Multiple Linear Regression: two or more independent variables fit to a first-order plane (or hyperplane):

  • y = mx + nz + c, where m and n are the slopes of the plane in the x and z directions, and c is the value of y at x=z=0

3) Polynomial Regression: single and multiple linear regression are actually special cases of polynomial regression (of order 1); more generally, one or more independent variables are fit to a polynomial of order greater than 1:

  • y = m0 + m1x + m2x^2 + m3x^3 + …

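All three forms above can be fit with ordinary least squares. Below is a minimal sketch using scikit-learn on made-up data; the library choice, variable names, and the degree-3 polynomial are illustrative assumptions, not part of the original article. Polynomial regression is handled by expanding x into its powers and then fitting an ordinary linear model to the expanded features.

```python
# A minimal sketch of single, multiple, and polynomial regression with
# scikit-learn, on made-up data. Names and constants are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Single linear regression: y = mx + b
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x[:, 0] + 5.0 + rng.normal(0, 1, size=100)
single = LinearRegression().fit(x, y)
print("slope m:", single.coef_[0], "intercept b:", single.intercept_)

# Multiple linear regression: y = mx + nz + c
xz = rng.uniform(0, 10, size=(100, 2))
y2 = 3.0 * xz[:, 0] - 2.0 * xz[:, 1] + 1.0 + rng.normal(0, 1, size=100)
multi = LinearRegression().fit(xz, y2)
print("slopes m, n:", multi.coef_, "intercept c:", multi.intercept_)

# Polynomial regression: y = m0 + m1x + m2x^2 + m3x^3
# Expand x into [x, x^2, x^3], then fit an ordinary linear model to it.
x_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)
poly = LinearRegression().fit(x_poly, y)
print("coefficients m1..m3:", poly.coef_, "intercept m0:", poly.intercept_)
```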

Nonlinear regressions include (see the code sketch after the list):

1) Logarithmic Regression

  • y = a log(x) or y = b ln(x)

2) Exponential Regression

  • y = e^x + b

3) Sigmoidal Regression: uses functions that create an S-curve, such as the logistic function or the hyperbolic tangent

  • y = a / (1 + e^(-bx)) or y = c tanh(dx) + f

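One way to fit curves like these is least-squares curve fitting, for example with scipy's curve_fit. The sketch below is a minimal illustration on made-up data; the particular exponential form (y = a·e^(bx)) and the logistic form of the sigmoid are common conventions assumed here rather than prescriptions from this article.

```python
# A minimal sketch of nonlinear regression with scipy.optimize.curve_fit
# on made-up data: logarithmic, exponential, and logistic (sigmoid) fits.
import numpy as np
from scipy.optimize import curve_fit

def log_model(x, a):
    return a * np.log(x)                     # y = a ln(x)

def exp_model(x, a, b):
    return a * np.exp(b * x)                 # y = a e^(bx)

def logistic_model(x, a, b, c):
    return a / (1.0 + np.exp(-b * (x - c)))  # S-shaped curve

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
y_log = 2.5 * np.log(x) + rng.normal(0, 0.1, x.size)
y_exp = 0.5 * np.exp(0.3 * x) + rng.normal(0, 0.1, x.size)
y_sig = 4.0 / (1.0 + np.exp(-1.2 * (x - 5.0))) + rng.normal(0, 0.1, x.size)

log_params, _ = curve_fit(log_model, x, y_log)
exp_params, _ = curve_fit(exp_model, x, y_exp, p0=(1.0, 0.1))
sig_params, _ = curve_fit(logistic_model, x, y_sig, p0=(1.0, 1.0, 5.0))
print("log fit a:", log_params)
print("exp fit a, b:", exp_params)
print("logistic fit a, b, c:", sig_params)
```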

In each of these cases, a line (or plane) is fit to continuous data. Note that it is also possible to split up your data into sections and fit different lines to each section. There are various techniques that you can use to determine the best fit line, but that is for another article.

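As a small preview of those techniques, one common yardstick is the R² score, which measures how much of the variation in y a fitted model explains. Here is a minimal sketch comparing a straight line against a cubic polynomial on deliberately curved, made-up data; both the data and the choice of R² are illustrative assumptions.

```python
# A minimal sketch of comparing a straight-line fit against a cubic
# polynomial fit on the same made-up data, using the R^2 score
# (one of several goodness-of-fit measures).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=(200, 1))
y = 0.5 * x[:, 0] ** 2 - 2.0 * x[:, 0] + rng.normal(0, 1, size=200)

line = LinearRegression().fit(x, y)
x_cubic = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)
cubic = LinearRegression().fit(x_cubic, y)

print("R^2, straight line:", r2_score(y, line.predict(x)))
print("R^2, cubic polynomial:", r2_score(y, cubic.predict(x_cubic)))
```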

What if you don’t have continuous data? What if you have only two or three discrete values: yes/no, for instance, or small/medium/large? Or perhaps twenty options, but each is apparently independent of the other. From a business standpoint, you may be asking about which customers are likely to default on a loan, or determining the demographics of customers purchasing a particular product. In these cases you would find it difficult to fit a linear or nonlinear regression to your data. Instead we have other types of tools that sort data rather than fit it: classification models and clustering models. While similar, the chief difference is that with classification models, you already have predefined classes into which you sort your data. For clustering models, the data is sorted into like categories, without knowing what those categories are ahead of time. (Note that these models can also be used on continuous data, but you will need to bin the continuous data into discrete units.) While regressions fit a line to the data, classification and clustering draw lines or planes between the data, separating them into categories of like vs unlike.

Classification Models include (a code sketch follows the list):

  • Decision trees: Here, the data is first split into two groups by a boolean test, True or False. At each subsequent branch another test is applied, until the remaining data can no longer be usefully separated and each leaf holds data of (mostly) a single class. This technique can get cumbersome once you go beyond a handful of branches.

  • Random Forest: An ensemble of decision trees, each trained on a random subset of the data (and of the features); their individual predictions are combined by vote, which makes the result less prone to overfitting than any single tree.

  • K-Nearest Neighbors (KNN): In this classification technique, a new data point is assigned the most common class among the K labeled points nearest to it. Despite the similar name, it differs from K-Means Clustering (below): KNN is supervised, working from data that is already labeled, and the analyst chooses K, the number of neighbors to consult.

  • Logistic Regression: The name sounds like this should be similar to logarithmic regression, but it is actually entirely different. In fact, it isn’t even a regression, but a classification algorithm. It is used to determine the probability of success or failure, or the probability of one outcome over another.

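To make the four classifiers above concrete, here is a minimal sketch using scikit-learn. The built-in iris data set, the 70/30 train/test split, and parameter choices such as K = 5 are assumptions of convenience, not recommendations from the article.

```python
# A minimal sketch of the four classification models above, trained on
# scikit-learn's built-in iris data and scored on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)                  # learn from labeled data
    accuracy = model.score(X_test, y_test)       # check on unseen data
    print(f"{name}: accuracy = {accuracy:.2f}")
```

Note that accuracy is measured on the held-out test set rather than on the training data, which is what tells you whether the model generalizes.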

Clustering Models include (see the code sketch after the list):

  • Hierarchical clustering: Generally used with smaller data sets, as it quickly becomes unwieldy with too much data. It starts with the entire data set in a single cluster and, with each iteration, splits it into more clusters, until every point is in its own cluster or the assignments no longer change. Similar to a decision tree, except that you do not know the categories ahead of time. Usually shown on a dendrogram.

  • Agglomerative clustering: A special case of hierarchical clustering, but beginning from the bottom up. Each data point begins in its own cluster, then with each iteration, data are linked together into clusters that are similar. Like hierarchical clustering, this works best with smaller data sets, because of space and time limitations.

  • K-means: A method of partitioning observations into k clusters, where the data within each cluster is more closely related to one another than the data outside the clusters. It is done iteratively, so that at each round, the location of each cluster center changes until all points have been assigned to a cluster and the clusters no longer change. K-means clustering can be used with both large and small data sets. It works best with sets of data that can form into roughly spherical sets.

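Here is a minimal sketch of agglomerative (bottom-up hierarchical) clustering and K-means clustering, again with scikit-learn. The blob-shaped synthetic data and the choice of three clusters are illustrative assumptions; a full hierarchical analysis would normally also plot a dendrogram (for example with scipy), which is omitted here for brevity.

```python
# A minimal sketch of agglomerative (bottom-up hierarchical) clustering
# and K-means clustering on synthetic, roughly spherical blobs.
# No labels are used: clustering is unsupervised.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

agglo = AgglomerativeClustering(n_clusters=3)
agglo_labels = agglo.fit_predict(X)              # merge points bottom-up

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("k-means cluster centers:\n", kmeans.cluster_centers_)
print("first ten k-means assignments:      ", kmeans.labels_[:10])
print("first ten agglomerative assignments:", agglo_labels[:10])
```

Because there are no labels, the cluster numbers themselves are arbitrary; what matters is which points end up grouped together.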

Classification and clustering models can be used with numeric data, or with non-numeric data that has been encoded as numbers. Textual data with a limited number of discrete values can be label encoded, converting each value to an arbitrary integer that carries no numeric meaning. For example, with three clothing sizes, Small, Medium, and Large, you could encode Small as 1, Medium as 2, and Large as 3. These are merely labels, so in this case 1 + 2 != 3. To avoid implying an order or magnitude, the categories are often one-hot encoded instead, with one binary indicator column per category.

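The distinction is easier to see in code. Below is a minimal sketch with pandas; the clothing-size column and the integer mapping are the illustrative example from above, and pd.get_dummies is one common way to one-hot encode.

```python
# A minimal sketch contrasting label encoding with one-hot encoding
# for the clothing-size example, using pandas.
import pandas as pd

sizes = pd.DataFrame({"size": ["Small", "Medium", "Large", "Medium"]})

# Label encoding: each category becomes an arbitrary integer.
# The numbers are only labels, so Small (1) + Medium (2) != Large (3).
sizes["size_label"] = sizes["size"].map({"Small": 1, "Medium": 2, "Large": 3})

# One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(sizes["size"], prefix="size")

print(sizes.join(one_hot))
```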

Like the regression models above, these models can be used both to describe your current set of data and to make predictions about new data. Using machine learning, you can train these models on sets of data you already know in order to predict the data that you do not know. The mechanics of that are beyond this article, but there are many great resources on machine learning.

Conclusion:

We have many tools for modeling data in our data analyst toolkit. Regression models are the screwdrivers and wrenches of our kit, pulling continuous data together and fitting it to some sort of line or plane in one or more dimensions. Classification and clustering models are our saws and knives, cutting apart the data and separating it into groups or clusters of like versus unlike. These are our most basic models in our toolkit, and it is important to understand when we can use one type of model or another, and which is the best model for our data.

For Further Learning:

For a great background in data science, try Confident Data Skills, by Kirill Eremenko, a data scientist out of Australia who heads SuperDataScience. You can check out his online courses on Udemy as well. He is very enthusiastic about data science, and his courses are well organized and easy to follow.

For a really in-depth look at the mathematics behind these models and other machine learning models, look at Machine Learning: A Concise Introduction, by Steven Knox. Steve is the head of data analytics at the NSA, and a former colleague of mine. His book won an award for the best prose in a textbook, and is straightforward and easy to follow, with a depth of mathematical rigor that most data analysts tend to gloss over.

For a great online course, try IBM's data science track on Coursera, a series of nine courses that use Python for data science and cover everything from the basics of data analysis up through machine learning models. It is especially well done, with lots of labs, assignments, and projects, including a final capstone project to complete the data science certificate.

And, of course, there is the data science section of Medium, which offers a wide variety of data science topics from beginner to advanced, and has been a wealth of information for me as a career changer.

About me: I am a lifelong user of data, originally as an environmental engineer, then (surprisingly) in the field of ministry. Having left that world, I have relearned old data analysis techniques and picked up a wealth of new tools to become a freelance data analyst. You can find me on LinkedIn.

Translated from: https://medium.com/swlh/the-data-analysts-toolkit-models-81aae3611f65
