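Machine Learning Multivariate Regression Algorithms: How to Identify the Right Independent Variables for Machine Learning Supervised Algorithms?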

Published: 2023/12/15

A well-known acronym in computer science, which I learned in my school days, is GIGO: garbage in, garbage out. It means that if we feed inappropriate, junk data to computer programs and algorithms, we get junk and incorrect results.

Machine learning algorithms are much like us human beings in this respect. Broadly, they work in two phases: learning and predicting. The learning environment and parameters should resemble the conditions under which predictions will be made later. An algorithm trained on an unbiased data sample, in which the combinations of input variable values truly reflect the full population, is well equipped to make accurate predictions.

One of the cornerstones of a successful supervised machine learning algorithm is selecting the right set of independent variables for the learning phase. In this article, I will discuss a structured approach to selecting the independent variables to feed the algorithm. We do not want to overfeed it with redundant, highly correlated data (multicollinearity), which complicates the model without increasing prediction accuracy. In fact, overfeeding data can sometimes decrease prediction accuracy. On the other hand, we need to make sure that the model is not oversimplified and reflects the true complexity of the problem.
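The point about redundant inputs can be illustrated with a small sketch. It uses synthetic data rather than the stock prices from this article, and the names `x1`, `x2`, `y` and the helper `r_squared` are made up for the illustration. It fits an ordinary least squares model once with a single feature and once with an added near-duplicate of that feature, then compares the in-sample fit.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 500

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # near-duplicate of x1 (synthetic)
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)


def r_squared(features, target):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    total = target - target.mean()
    return 1.0 - (residuals @ residuals) / (total @ total)


r2_one = r_squared(x1.reshape(-1, 1), y)
r2_both = r_squared(np.column_stack([x1, x2]), y)
print(f"R^2 with x1 only:   {r2_one:.4f}")
print(f"R^2 with x1 and x2: {r2_both:.4f}")
```

The second model is more complex, yet its fit is almost unchanged, because the extra column carries nearly the same information as the first.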

Objective

We want to build a model to predict the stock price of the company ASML. We have downloaded the last 20 years of stock price data for a few of ASML's customers and competitors, along with index values. We are not sure which of these data series to include when building the ASML stock prediction model.

Sample Data File

I have written a small function, which I can call from different programs, to download the stock prices for the last 20 years.

"""Filename - GetStockData.py: downloads stock prices from 1 Jan 2000 until the current date."""
import datetime as dt
import pandas as pd
import pandas_datareader.data as web
import numpy as np


def stockdata(ticker):
    start = dt.datetime(2000, 1, 1)  # start of the date range
    end = dt.datetime.now()  # current date as the end of the range
    stock = web.DataReader(ticker, "yahoo", start, end)

    # Save the downloaded prices to one Excel file per ticker
    name = str(ticker) + ".xlsx"
    stock.to_excel(name)

The function stockdata() is called from another program with ticker symbols to download the data.

"""Filename - stockdownload.py"""
import GetStockData

# Ticker symbols to download
tickers = ["MU", "ASML", "TSM", "QCOM", "UMC", "^SOX", "INTC", "^IXIC"]
for ticker in tickers:
    GetStockData.stockdata(ticker)

Please note that GetStockData.py and stockdownload.py must be placed in the same directory for the import to succeed.

Step 1- The first step is to think of all the variables that may influence the dependent variable. At this step, I suggest that you do not constrain your thinking: brain dump every variable that comes to mind.

Step 2- The next step is to collect/download the data for the prospective independent variables for analysis.

I have formatted and collated the downloaded data into one Excel file, “StockData.xlsx”.
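The article does not show how the per-ticker files were collated, so the following is only one possible sketch. It assumes that each downloaded table has a "Close" price column and a "Date" index (the article does not say which price column was used), and it uses small in-memory frames with made-up values in place of the real Excel files.

```python
import pandas as pd

# Synthetic stand-ins for two downloaded files; the real files would be read
# with pd.read_excel("<ticker>.xlsx", index_col="Date") instead.
dates = pd.date_range("2020-01-01", periods=3, freq="D", name="Date")
downloads = {
    "ASML": pd.DataFrame({"Close": [300.0, 302.5, 301.0]}, index=dates),
    "TSM": pd.DataFrame({"Close": [58.0, 58.4, 57.9]}, index=dates),
}

# Keep one price column per ticker and align all series on the date index.
# The "Close" column name is an assumption, not something the article states.
collated = pd.concat(
    {ticker: frame["Close"] for ticker, frame in downloads.items()},
    axis=1,
)
print(collated)
```

The dictionary keys become the column names, so the result has one column per ticker, which matches the layout the later steps expect. Writing it out with `collated.to_excel("StockData.xlsx")` would then produce a single file.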

[Figure: stock price data for the last 20 years, from Jan 2000 until Aug 2020]

Step 3- We will import the pandas, matplotlib, seaborn and statsmodels packages, which we are going to use for our analysis.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor

Step 4- Read the full sample data Excel file into a pandas DataFrame called “data”. Further, we will replace the index with the Date column.

data=pd.read_excel("StockData.xlsx")
data.set_index("Date", inplace= True)

I will not focus on preliminary data quality checks (blank values, outliers, etc.) and the respective correction approaches in this article; I assume that the data series contain no such discrepancies.
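Although such checks are outside the scope of this article, a minimal sketch of what they might look like is given here for orientation. The table `sample` is synthetic, and dropping incomplete rows is just one simple option among several, not a recommendation from the article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the collated price table, with one blank value.
sample = pd.DataFrame(
    {"ASML": [300.0, np.nan, 301.0, 305.0], "TSM": [58.0, 58.4, 58.4, 59.1]},
    index=pd.date_range("2020-01-01", periods=4, freq="D", name="Date"),
)

missing_per_column = sample.isna().sum()           # blank cells per column
duplicate_dates = sample.index.duplicated().sum()  # repeated dates in the index
cleaned = sample.dropna()                          # drop rows with any blank value

print(missing_per_column)
print("duplicate dates:", duplicate_dates)
print("rows before/after:", len(sample), len(cleaned))
```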

Step 5- One of the best places to start understanding the relationship between the independent variables is the correlation between them. In the code below, a heatmap of the correlations is plotted using the .corr method in pandas.

sns.heatmap(data.corr(), annot=True, cmap="YlGnBu")
plt.show()

The correlation heatmap gives us a visual depiction of the relationships between the variables. We do not want a set of independent variables that all have more or less the same relationship with the dependent variable. For example, TSM and the NASDAQ index have correlation coefficients of 0.99 and 0.97 with ASML, respectively. Including both TSM and NASDAQ may not improve the prediction accuracy, as they have a similar relationship with the dependent variable, the ASML stock price.
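Reading pairs off a heatmap gets tedious as the number of columns grows, so the same information can also be listed programmatically. The sketch below uses synthetic columns `A`, `B` and `C` rather than the stock data, and the 0.9 cut-off is an illustrative assumption, not a value taken from this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=200).cumsum()  # a shared trend
frame = pd.DataFrame(
    {
        "A": base,
        "B": base + rng.normal(scale=0.1, size=200),  # nearly identical to A
        "C": rng.normal(size=200),                    # unrelated noise
    }
)

corr = frame.corr()
threshold = 0.9  # illustrative cut-off (assumption)

# Keep each pair once by looking only above the diagonal.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

Each entry in `high_pairs` is a pair of columns that move almost in lockstep and is therefore a candidate for the closer VIF check.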

Step 6- Before we start dropping the redundant independent variables, let us check the variance inflation factor (VIF) among the independent variables. VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. For independent variable i, VIF_i = 1 / (1 - R_i^2), where R_i^2 is the coefficient of determination obtained by regressing variable i on all the other independent variables. I encourage you to read the Wikipedia page on the variance inflation factor to gain a good understanding of it.
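To make the definition concrete, here is a minimal sketch that computes the VIF by hand: each variable is regressed on all the others, and the VIF is 1 / (1 - R^2) of that auxiliary regression. The function name `vif_with_intercept` and the synthetic columns are made up for the illustration. The sketch includes an intercept in every auxiliary regression, which is the textbook form; a library call that is given the raw columns without a constant column may return different numbers.

```python
import numpy as np


def vif_with_intercept(X):
    """VIF of each column of X, using 1 / (1 - R^2) from an auxiliary OLS fit."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    result = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        centered = target - target.mean()
        r2 = 1.0 - (residuals @ residuals) / (centered @ centered)
        result.append(1.0 / (1.0 - r2))
    return np.array(result)


rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # strongly collinear with a
c = rng.normal(size=300)                 # independent of a and b
vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)
```

The two collinear columns get a large VIF, while the independent column stays close to 1, the smallest value the index can take.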

In the code below, we calculate the VIF of each independent variable and print it. We create a new DataFrame without the ASML historical stock prices, as our aim is to determine the VIF among the potential independent variables.

X=data.drop(["ASML"], axis=1)
vif = pd.DataFrame()
vif["features"] = X.columns
vif["vif_Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)

As a common rule of thumb, we should aim for a VIF of less than 10 for the independent variables. We saw in the heatmap earlier that TSM and NASDAQ have similar correlation coefficients with ASML, and the same is reflected in their high VIF values.

Based on our understanding from the heatmap and the VIF results, let us drop NASDAQ (which has the highest VIF) as a candidate independent variable for our model and re-evaluate the VIF.

X=data.drop(["ASML","NASDAQ"], axis=1)
vif = pd.DataFrame()
vif["features"] = X.columns
vif["vif_Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)

We can see that on removing NASDAQ, the VIF of a few other potential independent variables also decreased.

Selecting the right combination of independent variables takes a bit of experience, along with trial-and-error VIF checks on different combinations. TSMC is a leading semiconductor foundry and, as a customer of ASML, has a strong influence on ASML's business. Considering this, I will retain TSM, drop “INTC” and “PHLX” instead, and re-evaluate the VIF for the remaining variables.

As we can see, after two iterations the VIF of every remaining variable is less than 10. We have removed the variables with multicollinearity and identified the list of independent variables that are relevant for predicting the stock price of ASML.
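The selection in this article mixes VIF values with domain judgment. For comparison, a purely mechanical variant is sketched here: it repeatedly drops the column with the highest VIF until every remaining VIF is below the limit. This is a simplification and not the procedure followed in the article; the helpers `vif_series` and `prune_by_vif` and the synthetic columns `X1` to `X4` are hypothetical.

```python
import numpy as np
import pandas as pd


def vif_series(frame):
    """VIF per column, from auxiliary OLS fits that include an intercept."""
    values = frame.to_numpy(dtype=float)
    out = {}
    for i, name in enumerate(frame.columns):
        target = values[:, i]
        others = np.delete(values, i, axis=1)
        design = np.column_stack([np.ones(len(frame)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        centered = target - target.mean()
        r2 = 1.0 - (residuals @ residuals) / (centered @ centered)
        out[name] = 1.0 / (1.0 - r2)
    return pd.Series(out)


def prune_by_vif(frame, limit=10.0):
    """Drop the highest-VIF column repeatedly until every VIF is below limit."""
    kept = frame.copy()
    while kept.shape[1] > 1:
        vifs = vif_series(kept)
        if vifs.max() < limit:
            break
        kept = kept.drop(columns=[vifs.idxmax()])
    return kept


rng = np.random.default_rng(3)
trend = rng.normal(size=400)
candidates = pd.DataFrame(
    {
        "X1": trend,
        "X2": trend + rng.normal(scale=0.05, size=400),  # near copy of X1
        "X3": rng.normal(size=400),
        "X4": rng.normal(size=400),
    }
)
selected = prune_by_vif(candidates)
print(list(selected.columns))
```

A loop like this cannot know that one of two collinear variables matters more to the business, which is why the manual choice of which variable to keep still has a place.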

I hope you will find the approach explained in this article helpful when selecting the right set of independent variables for your machine learning models.

If you like this article, you may also like Machine Learning and Supply Chain Management: Hands-on Series.

Disclaimer: This article is written for educational purposes only. Do not make any actual stock purchase, sale or other financial transaction based on the independent variables identified in this article.

Translated from: https://towardsdatascience.com/how-to-identify-the-right-independent-variables-for-machine-learning-supervised-algorithms-439986562d32
