Multiple Linear Regression with Interactions Unveiled by Genetic Programming
We all have had some sort of experience with linear regression. It's one of the most widely used regression techniques. Why? Because it is simple to explain and easy to implement. But what happens when you have more than one variable? How can you deal with this increased complexity and still use an easy-to-understand regression like this? And what happens if the system is even more complicated? Let's imagine you have an interaction between two variables.
Here is where multiple linear regression kicks in, and we will see how to deal with interactions using some handy libraries in Python. Finally, we will tackle the same problem with symbolic regression and enjoy the benefits that come with it!
If you want a refresher on linear regression, there are plenty of resources available, and I also wrote a brief introduction with coding. What about symbolic regression? In this article we will be using gplearn. See its documentation for more information or, if you like, see my other article about how to use it with complex functions in Python here.
Data preparation
We will explore two use cases of regression. In the first case we will just have four variables (x1 to x4) which simply add up, plus some predetermined interactions: x1*x2, x3*x2 and x4*x2.
Note that in our dataset "out_df" we don't have the interaction terms. What we will do is try to discover those relationships with our tools. This is how the variables look when we plot them with seaborn, using x4 as hue (figure 1):
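The code that builds this dataset isn't shown in this excerpt, so here is a minimal sketch of how it could look. The sample size, the uniform distributions for x1 to x3 and the binary x4 (so it works naturally as a hue) are my assumptions, not necessarily what was used originally; the coefficients follow the formula reported later in the text.

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)
n = 1000
x1, x2, x3 = rng.uniform(0, 10, (3, n))   # continuous predictors (assumed ranges)
x4 = rng.integers(0, 2, n)                # binary predictor, convenient as hue

# Additive terms plus the predetermined first-order interactions x1*x2, x3*x2 and x4*x2
y_true = x1 + 0.5*x2 + 2*x3 + x4 + x1*x2 - x3*x2 + x4*x2

# out_df holds only the raw variables, no interaction columns
out_df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'x4': x4})
sns.pairplot(pd.concat([out_df, pd.Series(y_true, name='y')], axis=1), hue='x4')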
Figure 1: 1st order interactions: dataframe variables pairplot
Figure 2: 2nd order interactions: dataframe pairplot

The y of the second case (figure 2) is given by:
y_true = x1 + x2 + x3 + x4 + (x1*x2)*x2 - x3*x2 + x4*x2*x3*x2 + x1**2

Pretty complex scenario!
Case 1: Multiple Linear Regression
The first step is to get a better understanding of the relationships, so we will try our standard approach and fit a multiple linear regression to this dataset. We will be using statsmodels for that. In figure 3 we have the OLS regression results.
import statsmodels.api as sm

Xb = sm.add_constant(out_df[['x1','x2','x3','x4']])
mod = sm.OLS(y_true, Xb)
res = mod.fit()
res.summary()

Figure 3: Fit summary for statsmodels.
Ouch, this is clearly not the result we were hoping for. R2 is just 0.567 and, moreover, I am surprised to see that the p-values for x1 and x4 are incredibly high. We need a different strategy.
Polynomial Features
What we can do is import PolynomialFeatures from sklearn, which will generate polynomial and interaction features. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a², ab, b²].
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(interaction_only=True)
X_tr = poly.fit_transform(Xb)
Xt = pd.concat([Xb, pd.DataFrame(X_tr, columns=poly.get_feature_names()).drop(['1','x0','x1','x2','x3','x4'], 1)], 1)
With interaction_only=True only interaction features are produced: features that are products of at most "degree" distinct input features (so not x[1]**2, x[0]*x[2]**3, etc.). The default degree parameter is 2.
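As a quick illustration of what this flag changes (a toy example of mine, not from the original text), here is the transform of a single two-dimensional sample [a, b] = [2, 3] with and without interaction_only:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

sample = np.array([[2., 3.]])
full = PolynomialFeatures(degree=2)                          # [1, a, b, a^2, ab, b^2]
inter = PolynomialFeatures(degree=2, interaction_only=True)  # [1, a, b, ab]
print(full.fit_transform(sample))    # [[1. 2. 3. 4. 6. 9.]]
print(inter.fit_transform(sample))   # [[1. 2. 3. 6.]]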
Running the same code as before, but using Xt now, yields the results below.
mod = sm.OLS(y_true, Xt)
res = mod.fit()
res.summary()

Figure 4: statsmodels regression result with interactions.
Now R2 in Figure 4 is 1, which is perfect. Too perfect to be good? In fact there are a lot of interaction terms in the summary statistics, some that we were not even aware of. Our equation is of the kind: y = x1 + 0.5*x2 + 2*x3 + x4 + x1*x2 - x3*x2 + x4*x2. So our fit introduces interactions that we didn't explicitly use in our function. Even if we remove the terms with high p-values, we are left with a complex scenario. This might be a problem for generalization. We can exploit genetic programming to give us some advice here.
Genetic Programming: GPlearn
With genetic programming we are basically telling the system to do its best to find relationships in our data in an analytical form. If you have read the other tutorial, some of the functions I will call here will be clearer. What we basically want to do is import SymbolicRegressor from gplearn.genetic, and we will use sympy to pretty-print our equations. While we are at it, we will also import the RandomForest and DecisionTree regressors to compare the results between all those tools later on. Below is the code to get it working:
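The full setup was embedded as a gist in the original post; below is a minimal sketch of what it could look like. The hyperparameter values, the exact function set and the entries of the converter dictionary are my assumptions (chosen close to the gplearn documentation examples), not necessarily the ones used originally.

from gplearn.genetic import SymbolicRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sympy import sympify

# Map gplearn's function names to operations sympy understands, so the evolved
# program can be turned into a readable analytical formula.
converter = {
    'add': lambda x, y: x + y,
    'sub': lambda x, y: x - y,
    'mul': lambda x, y: x * y,
    'div': lambda x, y: x / y,
    'neg': lambda x: -x,
    'sqrt': lambda x: x**0.5,
}

# Hold out 30% of the data so every model is judged on unseen samples only.
X_train, X_test, y_train, y_test = train_test_split(
    out_df[['x1', 'x2', 'x3', 'x4']], y_true, test_size=0.30, random_state=42)

function_set = ['add', 'sub', 'mul', 'div']
est_gp = SymbolicRegressor(population_size=5000,
                           generations=40,
                           function_set=function_set,
                           stopping_criteria=0.01,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0,
                           feature_names=['x1', 'x2', 'x3', 'x4'])
est_gp.fit(X_train, y_train)
print('R2:', est_gp.score(X_test, y_test))

# Convert the best program found into a sympy expression for readability.
sympify(str(est_gp._program), locals=converter)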
The converter dictionary is there to help us map each gplearn function to its corresponding Python operation so that sympy can do its work. We also do a train/test split of our data so that we compare our predictions on the test data alone. We define a function set using standard functions from gplearn. At the 40th generation the code stops and we see that R2 is almost 1, while the formula generated is now pretty easy to read.
Figure 5: gplearn results

If you compare it with the formula we actually used, you will see that it is a close match; refactoring, our formula becomes:
y = -x3*(x2 - 2) + x2*(x1 + x4 + 0.5) + x1 + x4
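As a quick sanity check (my own, not in the original post), expanding the refactored expression with sympy gives back the generating equation of case 1:

from sympy import symbols, expand

x1, x2, x3, x4 = symbols('x1 x2 x3 x4')
refactored = -x3*(x2 - 2) + x2*(x1 + x4 + 0.5) + x1 + x4
print(expand(refactored))
# x1*x2 + x1 + 0.5*x2 - x2*x3 + x2*x4 + 2*x3 + x4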
All algorithms performed well on this task; here are the R2 scores:
statsmodels OLS with polynomial features: 1.0
random forest: 0.9964436147653762
decision tree: 0.9939005077996459
gplearn regression: 0.9999946996993035
Case 2: 2nd order interactions
In this case the relationship is more complex as the interaction order is increased:
X = np.column_stack((x1, x2, x3, x4))
y_true = x1 + x2 + x3 + x4 + (x1*x2)*x2 - x3*x2 + x4*x2*x3*x2 + x1**2
out_df['y'] = y_true

We do basically the same steps as in the first case, but here we already start with polynomial features:
poly = PolynomialFeatures(interaction_only=True)
X_tr = poly.fit_transform(out_df.drop('y',1))
Xt = pd.concat([out_df.drop('y',1), pd.DataFrame(X_tr, columns=poly.get_feature_names()).drop(['1','x0','x1','x2','x3'], 1)], 1)
Xt = sm.add_constant(Xt)
mod = sm.OLS(y_true, Xt)
res = mod.fit()
res.summary()

Figure 6: statsmodels summary for case 2
In this scenario our approach is not rewarding anymore. It is clear that we don't have the correct predictors in our dataset. We could use PolynomialFeatures to investigate higher orders of interaction, but the dimensionality would likely increase too much and we would be left with little more knowledge than before. Besides, if you had a real dataset and you did not know the formula of the target, would you increase the interaction order? I guess not!
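To see why blindly raising the order is unattractive, here is a small illustration (hypothetical sizes, not from the original text) of how fast the number of generated features grows with the degree and with the number of predictors:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

for n_vars in (4, 10, 20):
    for degree in (2, 3):
        X_demo = np.zeros((1, n_vars))
        n_feats = PolynomialFeatures(degree=degree).fit_transform(X_demo).shape[1]
        print(f'{n_vars} variables, degree {degree}: {n_feats} features')
# 20 variables at degree 3 already produce 1771 features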
In the code below we again fit and predict our dataset with the decision tree and random forest algorithms, but we also employ gplearn.
X_train, X_test, y_train, y_test = train_test_split(out_df.drop('y',1), y_true, test_size=0.30, random_state=42)

est_tree = DecisionTreeRegressor(max_depth=5)
est_tree.fit(X_train, y_train)
est_rf = RandomForestRegressor(n_estimators=100, max_depth=5)
est_rf.fit(X_train, y_train)
est_gp.fit(X_train, y_train)

y_gp = est_gp.predict(X_test)
score_gp = est_gp.score(X_test, y_test)
y_tree = est_tree.predict(X_test)
score_tree = est_tree.score(X_test, y_test)
y_rf = est_rf.predict(X_test)
score_rf = est_rf.score(X_test, y_test)
y_sm = res.predict(Xt)

print('R2:', est_gp.score(X_test, y_test))
next_e = sympify(str(est_gp._program), locals=converter)
next_e
The result is incredible: again, after 40 generations we are left with an incredibly high R2 and, even better, a simple analytical equation.
Figure 7: last generation, R2 and analytical formula.

The original formula is:

y = x1 + x2 + x3 + x4 + (x1*x2)*x2 - x3*x2 + x4*x2*x3*x2 + x1**2
So we see that there are indeed differences in the terms which involve x1 and its interactions, while the terms which don't depend on it are recovered perfectly. Nevertheless, compared with the PolynomialFeatures approach, we're dealing with a much less complicated formula here.
What is the error of the different systems? Well, for gplearn it is incredibly low compared with the others. In figure 8 the error versus the actual y is reported for each method. While the x axis is shared, you can notice how different the y axes become: the maximum error with gplearn is around 4, while the other methods can show spikes up to 1000.
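The plotting code for figure 8 isn't included in this excerpt; a minimal sketch of how such per-method error panels could be drawn (assuming the test-set predictions y_gp, y_tree, y_rf and targets y_test from the code above) is:

import matplotlib.pyplot as plt

predictions = {'gplearn': y_gp, 'decision tree': y_tree, 'random forest': y_rf}
fig, axes = plt.subplots(len(predictions), 1, sharex=True, figsize=(6, 9))
for ax, (name, y_pred) in zip(axes, predictions.items()):
    ax.scatter(y_test, y_pred - y_test, s=5)   # error against the actual y
    ax.set_ylabel(f'{name} error')
axes[-1].set_xlabel('y')
plt.show()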
Figure 8: Error plots for the different methods as a function of y. The legend shows the method.

Conclusion
In the first part of this article we saw how to deal with multiple linear regression in the presence of interactions. We used statsmodels OLS for multiple linear regression and sklearn's PolynomialFeatures to generate the interactions. We then approached the same problem with a different class of algorithm, namely genetic programming, which is easy to import and implement and gives an analytical expression.
In the second part we saw that when things get messy, we are left with some uncertainty using standard tools, even those from traditional machine learning. However, this class of problems is easier to tackle with gplearn: with this library we obtained an analytical formula for our problem directly.
Original article: https://towardsdatascience.com/multiple-linear-regression-with-interactions-unveiled-by-genetic-programming-4cc325ac1b65