Explaining the assumptions of regression models in R: model assumptions explained

Published: 2023/12/15

Ever heard of model assumptions? What are they, and why are they important? A model is a simplified version of reality, and machine learning models are no different. To create a model we need to make assumptions, and if these assumptions are not verified and met, we may get into trouble.

Every (machine learning) model has a different set of assumptions. We make assumptions about the data, about the relationships between variables, and about the model we build from that data. Most of these assumptions can actually be verified, so one thing you will always want to do is ask whether they have been. Some assumptions matter only for drawing conclusions about relationships (e.g. a 1-degree increase in temperature is associated with a 4% increase in ice-cream sales), while others also matter for predicting outcomes (we predict ice-cream sales of x tomorrow).

Let’s go through the assumptions made for the simplest model out there: linear regression.
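
As a reference point for the checks discussed in this article, here is a minimal sketch of a one-variable linear regression in Python, using only the standard library. The data and the helper names `fit_line` and `residuals` are illustrative assumptions, not part of the original article; the "error terms" referred to throughout are the residuals computed here.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Error terms: observed value minus fitted value."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical data: daily temperature (degrees) and ice-cream sales (units).
temperature = [18, 20, 22, 24, 26, 28, 30]
sales = [205, 214, 230, 238, 251, 262, 270]

intercept, slope = fit_line(temperature, sales)
errors = residuals(temperature, sales, intercept, slope)
print(f"sales ~ {intercept:.1f} + {slope:.2f} * temperature")
```

The slope is the estimated change in sales per degree; the list `errors` is what the assumptions below are checked against.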

Assumption 1: fixed regressors

What this actually means is that we assume that the variables (input data) are not random variables but fixed numbers and that if we rerun the experiment (we collect the data again in the same manner), we expect the same results.

The opposite of a fixed regressor is a random (or stochastic) regressor, which is typically treated as data sampled from a wider population. If that is the case, you can only draw conclusions ‘conditional’ on the data: the conclusions are the same, but they hold only for this dataset, and you cannot generalize beyond it.

The verdict: If your data is (representative of) the population, you are good. Otherwise, try to collect representative data, or only draw conclusions about the data the model was built on.

For business readers: If you have data on all your customers and want to predict the behavior of new customers, you are fine as long as you are targeting a similar type of customer. If not, you may end up with totally wrong recommendations or conclusions about these new customers, and lose them before you have won them. So ask about the representativeness of the dataset.

So ask about the representativeness of the dataset. If the data is representative of the population, you are good.

Assumption 2: random disturbances, zero mean

We assume that the errors around our model are random and average out to zero over all observations. This is something you can actually check.

The verdict: Take the average of all your error terms and check whether it is statistically significantly different from zero. If yes → you may want to adjust your model and include more terms.
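
The zero-mean check can be sketched as a one-sample t-statistic on the errors. Note that for an ordinary least squares fit that includes an intercept, the in-sample residuals average to zero by construction, so this sketch assumes the check is run on held-out predictions. The data, the function name `mean_error_t_stat` and the |t| > 2 rule of thumb are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, stdev

def mean_error_t_stat(actual, predicted):
    """Return the mean error (actual - predicted) and its t-statistic
    for the null hypothesis that the true mean error is zero."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    avg = mean(errors)
    return avg, avg / (stdev(errors) / sqrt(n))

# Hypothetical held-out observations and the model's predictions for them.
actual = [102.0, 98.5, 110.2, 95.1, 104.8, 99.9, 107.3, 101.4]
predicted = [105.1, 102.3, 113.0, 99.2, 108.5, 103.0, 111.1, 104.6]

avg_error, t_stat = mean_error_t_stat(actual, predicted)

# Rough rule of thumb (an assumption of this sketch): |t| > 2 suggests the
# mean error is unlikely to be zero. A proper test compares t with a
# Student-t critical value for n - 1 degrees of freedom.
biased = abs(t_stat) > 2

# With errors defined as actual - predicted, a negative mean means the
# model's predictions are systematically too high.
print(f"mean error = {avg_error:.2f}, t = {t_stat:.1f}, biased = {biased}")
```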

For business readers: You want your model to predict the right thing. If this condition is not met, you are systematically under- or overestimating. For example, if your error term is on average 3.5, you are on average overestimating by 3.5. That is not a good thing if you are predicting stock prices and making automatic trading decisions. So ask for the average of the error terms.

Ask for the average of the error terms, to understand whether you are over- or underestimating. If the average is about 0, you are good.

Assumption 3: homoscedasticity

The variances of the disturbances exist and are equal. In other words, we expect the error in the model to be of similar size for all data points; this is sometimes referred to as homogeneity of variance. This only applies if the relationship we are looking at is linear at all levels.

For example, take the relationship between income and spending on travel. The spread will be much smaller for lower incomes than for higher incomes, simply because a higher income offers more choice in what to spend. The result is that your model gets ‘pulled’ in the wrong direction (because it assumes the spread is equal everywhere and tries to reduce the error), and the higher-income data points influence the model much more than the lower-income data points.

In addition, this will influence the ability to make conclusions on the significance of your parameters.

The verdict: If you want to use your model for inference, test for homoscedasticity. If you find that the spread of your error terms is not equal → scale (one of) your variable(s) or use weighted least squares (WLS).
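
A crude way to look for unequal spread, followed by a WLS refit, might look like the sketch below. It compares the residual variance in the lower and upper half of the income range (loosely in the spirit of a Goldfeld-Quandt test, not a formal implementation of it) and then refits with weights 1/x², which assumes the error standard deviation grows in proportion to x. The data and all names are hypothetical.

```python
from statistics import pvariance

def fit_line(x, y, weights=None):
    """(Weighted) least squares fit of y = intercept + slope * x."""
    w = weights or [1.0] * len(x)
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: income and travel spending (both in thousands), built
# so that the deviation from the line 0.5 + 0.2 * income grows with income.
income = [float(v) for v in range(1, 21)]
spending = [0.5 + 0.2 * v + (-1) ** i * 0.05 * v for i, v in enumerate(income)]

# Step 1: ordinary fit, then compare residual spread for low vs high incomes.
intercept, slope = fit_line(income, spending)
errors = [y - (intercept + slope * x) for x, y in zip(income, spending)]
half = len(errors) // 2
variance_ratio = pvariance(errors[half:]) / pvariance(errors[:half])

# Step 2: refit with weights 1 / x**2 so noisy high-income points count less.
wls_intercept, wls_slope = fit_line(income, spending, [1 / x ** 2 for x in income])

print(f"variance ratio (high/low income): {variance_ratio:.1f}")
print(f"OLS slope {slope:.3f}, WLS slope {wls_slope:.3f}")
```

A variance ratio far above 1 indicates that the spread grows with income; the choice of weights is a modelling decision and should follow the pattern actually seen in the residuals.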

For business readers: You want the error terms to have homogeneous variance; otherwise, some of your data points may have too large an influence on the model and distort the picture for the rest of the data points. It is not that big of an issue for prediction, though: your model will still predict the right thing. So if prediction is all you care about, this is one to let slip.

If you just want to predict, let this one slip. If you want to infer on relations, better make a change.

Assumption 4: no correlation

The error terms are uncorrelated. If they weren’t, there would actually be potential to improve the model. What it means is that if there is a correlation in the error terms, there is still “explanatory” power that is available. The result of the violation of this assumption is a bias in the coefficients of your model. These coefficients “absorb” the information from the error terms.

The verdict: If you want to use your model for inference, test for correlation in your error terms. If you find correlation → add more variables.
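
One common check for correlated error terms is the Durbin-Watson statistic, sketched below. It only detects first-order serial correlation and assumes the observations have a meaningful order (for example, time). Values near 2 suggest no such correlation, values toward 0 suggest positive correlation, and values toward 4 suggest negative correlation. The residual series here are hypothetical.

```python
def durbin_watson(errors):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared errors."""
    diffs = sum((errors[t] - errors[t - 1]) ** 2 for t in range(1, len(errors)))
    return diffs / sum(e ** 2 for e in errors)

# Hypothetical residuals, listed in the order the data was collected.
drifting = [1.0, 1.2, 0.9, 0.7, 0.2, -0.3, -0.8, -1.1, -1.0, -0.8]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(f"drifting residuals:    d = {durbin_watson(drifting):.2f}")
print(f"alternating residuals: d = {durbin_watson(alternating):.2f}")
```

The drifting series gives a value well below 2 (neighbouring errors resemble each other), which is the pattern that hints at a missing explanatory variable.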

For business readers: If you are interested in drawing conclusions about relationships, correlation in the error terms is a no-go. Correlation in the error terms also tells you there is potential to improve the model and generate better predictions.

If correlation is present, you need to improve the model: your predictions will get better and your inference will make sense.

Assumption 5: constant parameters

The parameters that you are estimating with the model are fixed, unknown numbers. For starters, if they were known, there would be no need for a model. The reason we assume they are fixed is that we want to rule out changes over time, where time refers to the period in which the data was collected. If there are changes over time, we may need to include two different parameters or use only the most recent sample of the data.

An example of a violation: the data was collected by asking customers how much money they have paid into their pension fund, but the yearly maximum contribution was changed last year, so that suddenly a few thousand more can be paid in. In this case your parameters aren’t constant, and you need to account for that.

The verdict: Can you safely say that the data at hand was produced by one and the same process, which hasn’t changed over time? → Then you are good. If not → you will want to adjust your model and allow new variables to enter.
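
A simple, informal way to probe parameter stability is to fit the same model separately on data from before and after a suspected change and compare the estimates. The sketch below does this for the pension example; the numbers are made up, and a formal comparison (such as a Chow test) would be needed to judge significance.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: income and yearly pension contribution (in thousands),
# collected before and after the maximum contribution was raised.
income = [30, 40, 50, 60, 70]
contribution_before = [3.1, 3.9, 5.0, 6.1, 6.9]
contribution_after = [4.5, 6.1, 7.4, 9.0, 10.6]

_, slope_before = fit_line(income, contribution_before)
_, slope_after = fit_line(income, contribution_after)
relative_change = (slope_after - slope_before) / slope_before

print(f"slope before: {slope_before:.3f}, after: {slope_after:.3f}")
print(f"relative change: {relative_change:.0%}")
```

A large shift between the two periods suggests that pooling them under one fixed parameter is not appropriate, for instance without a period indicator in the model.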

For business readers: The key here is that the data was produced by the same process. Has the data collection changed over time? If it has, the conclusions drawn about relationships between the variables will not hold, and predictions on new incoming data may be under- or overestimated.

Has the data collection changed over time? Then adjust the model; otherwise you risk over- or underestimating when you predict on new incoming data.

Assumption 6: linear model

The relationship between the different variables is a linear relationship. If this weren’t the case and the relationship were non-linear, you could not estimate a linear model that fits your data properly. Therefore, when you are creating a linear model, you need to assume linearity. Consider, for example, the number of people on the streets as a function of temperature. This is not a linear relationship, and if you treated it that way, you would predict many people on the streets at 50 degrees Celsius.

The verdict: Test for linearity (scatterplots do the trick), and if the relationship isn’t linear → transform your variables or go for a different model.
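
The effect of forcing a straight line onto a non-linear relationship, and of transforming a variable instead, can be sketched with the temperature example. The data is synthetic: the number of people on the streets is assumed to peak at a comfortable 20 degrees and fall off on either side, and the transform (temperature - 20)² assumes that peak is known. Both are illustrative assumptions.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def r_squared(x, y):
    """Share of the variance in y explained by a straight line in x."""
    intercept, slope = fit_line(x, y)
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic data: temperature in degrees Celsius and people on the streets.
temperature = list(range(-10, 51, 5))
people = [1000 - (t - 20) ** 2 for t in temperature]

# Straight line on the raw variable.
intercept, slope = fit_line(temperature, people)
predicted_at_50 = intercept + slope * 50
r2_linear = r_squared(temperature, people)

# Straight line on a transformed variable: squared distance from 20 degrees.
distance_sq = [(t - 20) ** 2 for t in temperature]
r2_transformed = r_squared(distance_sq, people)

print(f"linear fit predicts {predicted_at_50:.0f} people at 50 degrees "
      f"(the data says {people[-1]})")
print(f"R^2 raw: {r2_linear:.2f}, R^2 transformed: {r2_transformed:.2f}")
```

The raw linear fit explains essentially none of the variation and badly overpredicts at 50 degrees, while the same linear model on the transformed variable fits this synthetic data exactly.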

For business readers: This type of model dictates the structure between what we try to predict and what goes into the model. If that structure (in this case linearity) isn’t met, the model is meaningless. You can reason about whether the relationship is expected to be linear. If it isn’t, and the tests tell you the relationship isn’t linear → this is a no-go, and the model needs adjustment both for drawing conclusions about the relationship and for prediction.

If the model is linear, but the relationship isn’t, you can forget about inference as well as prediction.

Assumption 7: normality

This assumption says that the error terms are normally distributed. We want to verify this because we want to be able to run significance tests and define our confidence intervals.

The verdict: Plot your error terms and check whether they are normally distributed. If they are not → check your linearity assumption again.
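
Besides plotting, a numeric check on the shape of the error distribution is possible. The sketch below computes the Jarque-Bera statistic from the skewness and excess kurtosis of the errors and compares it with 5.99, the 5% critical value of a chi-square distribution with 2 degrees of freedom. The test is only asymptotically valid, so treat it as a rough indication on small samples. Both residual series are synthetic and built from normal quantiles purely for illustration.

```python
from math import exp
from statistics import NormalDist, mean

def jarque_bera(errors):
    """Jarque-Bera statistic: n / 6 * (skewness^2 + excess_kurtosis^2 / 4)."""
    n = len(errors)
    centre = mean(errors)
    m2 = sum((e - centre) ** 2 for e in errors) / n
    m3 = sum((e - centre) ** 3 for e in errors) / n
    m4 = sum((e - centre) ** 4 for e in errors) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

# Synthetic residuals: evenly spaced quantiles of a standard normal
# (bell-shaped), and the same values exponentiated (strongly right-skewed).
n = 50
bell_shaped = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [exp(v) for v in bell_shaped]

CRITICAL_5_PERCENT = 5.99  # chi-square, 2 degrees of freedom

for name, series in [("bell-shaped", bell_shaped), ("skewed", skewed)]:
    jb = jarque_bera(series)
    verdict = "reject normality" if jb > CRITICAL_5_PERCENT else "no evidence against normality"
    print(f"{name}: JB = {jb:.2f} -> {verdict}")
```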

For business readers: This assumption lets us say something about how sure we are about the estimated values in our model. If it is not met, we cannot draw conclusions about relationships, although we can still predict.

Without this assumption, we cannot say how sure we are about our estimated parameters. We can still predict on new data.

Inspired by: “Econometric Methods with Applications in Business and Economics” by Christiaan Heij, Paul de Boer, Philip Hans Franses, Teun Kloek and Herman K. van Dijk

About me: I am an Analytics Consultant and Director of Studies for “AI Management” at a local business school. I am on a mission to help organizations generate business value with AI and create an environment in which Data Scientists can thrive.

Translated from: https://towardsdatascience.com/model-assumptions-explained-2c7bb7607f1c
