How to Build a Regression Model in Python

Published: 2023/12/15

DATA SCIENCE

Whether you are an aspiring or a veteran data scientist, this article is for you! In this article, we will build a simple regression model in Python. To spice things up a bit, we will not use the widely popular and ubiquitous Boston Housing dataset; instead, we will use a simple bioinformatics dataset. In particular, we will use the Delaney solubility dataset, which represents an important physicochemical property in computational drug discovery.

The aspiring data scientist will find the step-by-step tutorial particularly accessible, while the veteran data scientist may want a new, challenging dataset on which to try out a state-of-the-art machine learning algorithm or workflow.

1. What Are We Building Today?

A regression model! And we are going to use Python to do that. While we’re at it, we are going to use a bioinformatics dataset (technically, it’s a cheminformatics dataset) for the model building.

In particular, we are going to predict the LogS value, which is the logarithm of the aqueous solubility of small molecules. The aqueous solubility value is a relative measure of how readily a molecule dissolves in water. It is an important physicochemical property of effective drugs.

What better way to get acquainted with the concept of what we are building today than a cartoon illustration!

Cartoon illustration of the schematic workflow of machine learning model building for the cheminformatics dataset, where the target response variable is predicted as a function of input molecular features. Technically, this procedure is known as quantitative structure-activity relationship (QSAR). (Drawn by Chanin Nantasenamat)

2. Delaney Solubility Dataset

2.1. Data Understanding

As the name implies, the Delaney solubility dataset comprises the aqueous solubility values, along with the corresponding chemical structures, for a set of 1,144 molecules. For those outside the field of biology, there are some terms that we will spend some time clarifying.

Molecules, sometimes referred to as small molecules or compounds, are chemical entities made up of atoms. Let’s use an analogy here and think of atoms as Lego blocks, where 1 atom is 1 Lego block. When we use several Lego blocks to build something, whether it be a house, a car or some abstract entity, such constructed entities are comparable to molecules. Thus, we can refer to the specific arrangement and connectivity of the atoms that form a molecule as the chemical structure.

Analogy of the construction of molecules to Lego blocks. This yellow house is from Lego 10703 Creative Builder Box. (Drawn by Chanin Nantasenamat)

So how does each of the entities that you are building differ? Well, they differ by the spatial connectivity of the blocks (i.e. how the individual blocks are connected). In chemical terms, molecules differ by their chemical structures. Thus, if you alter the connectivity of the blocks, you effectively alter the entity that you are building. For molecules, if atom types (e.g. carbon, oxygen, nitrogen, sulfur, phosphorus, fluorine, chlorine, etc.) or groups of atoms (e.g. hydroxy, methoxy, carboxy, ether, etc.) are altered, then the molecule is also altered and consequently becomes a new chemical entity (i.e. a new molecule is produced).

Cartoon illustration of a molecular model. Red, blue, dark gray and white represent oxygen, nitrogen, carbon and hydrogen atoms, while the light gray connections between the atoms are the bonds. Each atom is comparable to a Lego block. The constructed molecule shown above is comparable to a constructed Lego entity (such as the yellow house shown above in this article). (Drawn by Chanin Nantasenamat)

To become an effective drug, a molecule needs to be taken up and distributed in the human body, and this property is directly governed by its aqueous solubility. Solubility is an important property that researchers take into consideration in the design and development of therapeutic drugs. Thus, a potent drug that is unable to reach its desired target owing to poor solubility would be a poor drug candidate.

2.2. Retrieving the Dataset

The aqueous solubility dataset compiled by Delaney in the research paper entitled ESOL: Estimating Aqueous Solubility Directly from Molecular Structure is available as a Supplementary file. For your convenience, we have also downloaded the entire Delaney solubility dataset and made it available on the Data Professor GitHub.

Preview of the raw version of the Delaney solubility dataset. The full version is available on the Data Professor GitHub.

CODE PRACTICE

Let’s get started, shall we?

Fire up Google Colab or your Jupyter Notebook and run the following code cells.

CODE EXPLANATION

Let’s now go over what each code cell means.

The first code cell:

As the code literally says, we import the pandas library as pd.

The second code cell:

Assigns the URL where the Delaney solubility dataset resides to the delaney_url variable.
Reads in the Delaney solubility dataset via the pd.read_csv() function and assigns the resulting dataframe to the delaney_df variable.
Calls the delaney_df variable to return the output value, which essentially prints out a dataframe containing the following 4 columns:

Compound ID: Names of the compounds.

measured log(solubility:mol/L): The experimental aqueous solubility values as reported in the original research article by Delaney.

ESOL predicted log(solubility:mol/L): Predicted aqueous solubility values as reported in the original research article by Delaney.

SMILES: A 1-dimensional encoding of the chemical structure information.
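
The two cells described above can be sketched roughly as follows. This is a minimal sketch and not the author's exact cells: so that it runs without network access, a small in-memory CSV (the hypothetical raw_csv, filled with placeholder solubility values that are not taken from the dataset) stands in for the file behind delaney_url. In the article's workflow, pd.read_csv() would receive the URL string directly.

```python
import io

import pandas as pd

# Stand-in for the raw Delaney CSV. The column names follow the article;
# the numeric values are placeholders, NOT real measurements.
raw_csv = io.StringIO(
    "Compound ID,measured log(solubility:mol/L),"
    "ESOL predicted log(solubility:mol/L),SMILES\n"
    "benzene,-1.0,-1.1,c1ccccc1\n"
    "toluene,-2.0,-2.1,Cc1ccccc1\n"
    "ethanol,1.0,0.9,CCO\n"
)

# pd.read_csv() accepts a file-like object here; it also accepts a URL string.
delaney_df = pd.read_csv(raw_csv)
print(delaney_df)
```

In a notebook, ending the cell with the bare name delaney_df renders the dataframe as a table instead of calling print().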

2.3. Calculating the Molecular Descriptors

A point to note is that the above dataset, as originally provided by the author, is not yet usable out of the box. In particular, we have to use the SMILES notation to calculate the molecular descriptors via the rdkit Python library, as demonstrated step by step in a previous Medium article (How to Use Machine Learning for Drug Discovery).

It should be noted that the SMILES notation is a one-dimensional depiction of the chemical structure information of the molecules. Molecular descriptors are quantitative or qualitative descriptions of the unique physicochemical properties of molecules.

Let’s think of molecular descriptors as a way to uniquely represent molecules in a numerical form that machine learning algorithms can learn from, make predictions with and use to provide useful knowledge on the structure-activity relationship. As previously noted, the specific arrangement and connectivity of atoms produce different chemical structures, which consequently dictate the resulting activity. This notion is known as the structure-activity relationship.

The processed version of the dataset, containing the calculated molecular descriptors along with their corresponding response variable (logS), is shown below. This processed dataset is now ready to be used for machine learning model building, whereby the first 4 variables can be used as the X variables and the logS variable can be used as the Y variable.

Preview of the processed version of the Delaney solubility dataset. Essentially, the SMILES notation from the raw version was used as input to compute the 4 molecular descriptors, as described in detail in a previous Medium article and YouTube video. The full version is available on the Data Professor GitHub.

A quick description of the 4 molecular descriptors and the response variable is provided below:

cLogP: Octanol-water partition coefficient

MW: Molecular weight

RB: Number of rotatable bonds

AP: Aromatic proportion = number of aromatic atoms / total number of heavy atoms

LogS: Log of the aqueous solubility
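
To make the aromatic proportion formula concrete, here is a small worked example. It is only an arithmetic sketch: aromatic_proportion() is a hypothetical helper, and the atom counts are entered by hand, whereas the article computes the descriptors with rdkit.

```python
def aromatic_proportion(aromatic_atoms: int, heavy_atoms: int) -> float:
    """Return AP = number of aromatic atoms / total number of heavy atoms."""
    if heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    return aromatic_atoms / heavy_atoms


# Toluene (SMILES: Cc1ccccc1) has 7 heavy atoms, 6 of which are aromatic ring carbons.
ap_toluene = aromatic_proportion(aromatic_atoms=6, heavy_atoms=7)

# Ethanol (SMILES: CCO) has 3 heavy atoms and no aromatic atoms.
ap_ethanol = aromatic_proportion(aromatic_atoms=0, heavy_atoms=3)

print(round(ap_toluene, 3), ap_ethanol)
```

Heavy atoms are all atoms other than hydrogen, so a fully aromatic molecule such as benzene has an AP of 1 and a molecule without aromatic rings has an AP of 0.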

CODE PRACTICE

Let’s continue by reading in the CSV file that contains the calculated molecular descriptors.

CODE EXPLANATION

Let’s now go over what the code cells mean.

Assigns the URL where the Delaney solubility dataset (with calculated descriptors) resides to the delaney_url variable.
Reads in the Delaney solubility dataset (with calculated descriptors) via the pd.read_csv() function and assigns the resulting dataframe to the delaney_descriptors_df variable.
Calls the delaney_descriptors_df variable to return the output value, which essentially prints out a dataframe containing the following 5 columns:

MolLogP

MolWt

NumRotatableBonds

AromaticProportion

logS

The first 4 columns are molecular descriptors computed using the rdkit Python library. The fifth column is the response variable logS.

3. Data Preparation

3.1. Separating the data as X and Y variables

In building a machine learning model using the scikit-learn library, we need to separate the dataset into the input features (the X variables) and the target response variable (the Y variable).

CODE PRACTICE

Follow along and implement the following 2 code cells to separate the dataset contained in the delaney_descriptors_df dataframe into X and Y subsets.

CODE EXPLANATION

Let’s take a look at the 2 code cells.

First code cell:

Here we use the drop() function to specifically ‘drop’ the logS variable (which is the Y variable; we will deal with it in the next code cell). As a result, the 4 remaining variables are assigned to the X dataframe. In particular, we apply the drop() function to the delaney_descriptors_df dataframe, as in delaney_descriptors_df.drop('logS', axis=1), where the first input argument is the specific column that we want to drop and the second input argument, axis=1, specifies that the first input argument is a column.

Second code cell:

Here we select a single column (the ‘logS’ column) from the delaney_descriptors_df dataframe via delaney_descriptors_df.logS and assign it to the Y variable.
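
The two cells described above can be sketched as follows. To keep the sketch self-contained, a tiny dataframe with the article's 5 column names stands in for the real delaney_descriptors_df; its values are illustrative placeholders and not rows from the dataset.

```python
import pandas as pd

# Illustrative stand-in for the processed dataset (placeholder values only).
delaney_descriptors_df = pd.DataFrame({
    "MolLogP": [2.6, 1.7, -0.1],
    "MolWt": [167.9, 78.1, 46.1],
    "NumRotatableBonds": [0.0, 0.0, 0.0],
    "AromaticProportion": [0.0, 1.0, 0.0],
    "logS": [-2.2, -1.6, 1.1],
})

# First cell: everything except the response column becomes the X dataframe.
X = delaney_descriptors_df.drop('logS', axis=1)

# Second cell: the response column alone becomes the Y series.
Y = delaney_descriptors_df.logS

print(X.shape, Y.shape)
```

Note that drop() returns a new dataframe by default, so delaney_descriptors_df itself still contains the logS column afterwards.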

3.2. Data splitting

In evaluating the model performance, the standard practice is to split the dataset into 2 (or more) partitions. Here we use an 80/20 split ratio, whereby the 80% subset is used as the train set and the 20% subset as the test set. As scikit-learn requires that the data be further separated into their X and Y components, the train_test_split() function can readily perform the above-mentioned task.

CODE PRACTICE

Let’s implement the following 2 code cells.

CODE EXPLANATION

Let’s take a look at what the code is doing.

First code cell:

Here we import the train_test_split function from the scikit-learn library.

Second code cell:

We start by defining the names of the 4 variables that the train_test_split() function will generate: X_train, X_test, Y_train and Y_test. The first 2 correspond to the X dataframes for the train and test sets, while the last 2 correspond to the Y variables for the train and test sets.
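
A minimal sketch of the 80/20 split follows. The X and Y below are synthetic random data that only mimic the shape of the real descriptors, and the random_state argument is an assumption added here for reproducibility; the article does not state whether one was used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X (4 descriptors) and Y (logS); not real data.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(10, 4)),
    columns=["MolLogP", "MolWt", "NumRotatableBonds", "AromaticProportion"],
)
Y = pd.Series(rng.normal(size=10), name="logS")

# test_size=0.2 keeps 20% of the rows for the test set and 80% for the train set.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```

With 10 rows, this leaves 8 rows in the train set and 2 rows in the test set; rows of X and Y stay aligned because both are split with the same shuffled indices.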

4. Linear Regression Model

Now comes the fun part: let’s build a regression model.

4.1. Training a linear regression model

CODE PRACTICE

Here, we use the LinearRegression() estimator from scikit-learn to build a model using ordinary least squares linear regression.

CODE EXPLANATION

Let’s see what the code is doing.

First code cell:

Here we import linear_model from the scikit-learn library.

Second code cell:

We create a linear_model.LinearRegression() object and assign it to the model variable.
A model is built using the command model.fit(X_train, Y_train), whereby the model.fit() function takes X_train and Y_train as input arguments to build or train a model. In particular, X_train contains the input features while Y_train contains the response variable (logS).
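
The training step can be sketched as follows. The training data here are synthetic and noise-free, generated from a known linear relationship chosen purely for illustration, so that it is easy to check that ordinary least squares recovers the coefficients.

```python
import numpy as np
from sklearn import linear_model

# Synthetic training data: Y = 1 + 2*x0 - 3*x1, with no noise (illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))
Y_train = 1.0 + 2.0 * X_train[:, 0] - 3.0 * X_train[:, 1]

# Create the estimator, then fit it on the training data.
model = linear_model.LinearRegression()
model.fit(X_train, Y_train)

print(model.coef_, model.intercept_)
```

Because the synthetic data contain no noise, the fitted coefficients and intercept match the generating values up to floating-point error; on the real dataset they would only be estimates.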

4.2. Apply trained model to predict logS from the training and test set

As mentioned above, model.fit() trains the model, and the resulting trained model is stored in the model variable.

CODE PRACTICE

We will now apply the trained model to make predictions on the training set (X_train).

We will now apply the trained model to make predictions on the test set (X_test).

CODE EXPLANATION

Let’s proceed to the explanation.

The following explanation covers only the training set (X_train), as the exact same concept applies to the test set (X_test) after the following simple tweaks:

Replace X_train by X_test
Replace Y_train by Y_test
Replace Y_pred_train by Y_pred_test

Everything else is exactly the same.

First code cell:

Predictions of the logS values are made by calling model.predict() with X_train as the input argument, that is, by running the command model.predict(X_train). The resulting predicted values are assigned to the Y_pred_train variable.

Second code cell:

Model performance metrics are now printed.

Regression coefficient values are obtained from model.coef_.
The y-intercept value is obtained from model.intercept_.
The mean squared error (MSE) is computed using the mean_squared_error() function with Y_train and Y_pred_train as input arguments, that is, mean_squared_error(Y_train, Y_pred_train).
The coefficient of determination (also known as R2) is computed using the r2_score() function with Y_train and Y_pred_train as input arguments, that is, r2_score(Y_train, Y_pred_train).
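
The prediction and metric cells can be sketched as follows for the training set. The data are synthetic (a made-up linear relationship plus noise), and the sketch imports mean_squared_error and r2_score from sklearn.metrics, which is where scikit-learn provides them; the exact print wording is an assumption.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic training data with 4 features and some noise (illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 4))
true_coef = np.array([-0.7, -0.5, 0.3, -0.4])  # arbitrary values
Y_train = 0.3 + X_train @ true_coef + rng.normal(scale=0.1, size=40)

model = linear_model.LinearRegression()
model.fit(X_train, Y_train)

# First cell: predict on the data the model was trained on.
Y_pred_train = model.predict(X_train)

# Second cell: report the fitted parameters and the performance metrics.
mse = mean_squared_error(Y_train, Y_pred_train)
r2 = r2_score(Y_train, Y_pred_train)
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Mean squared error (MSE): %.2f' % mse)
print('Coefficient of determination (R^2): %.2f' % r2)
```

Both metric functions take the true values first and the predicted values second. Repeating the same cells with X_test and Y_test gives the test-set performance, which is the more honest estimate of how the model generalizes.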

4.3. Printing out the Regression Equation

The equation of a linear regression model is actually the model itself: you can plug in the input feature values and the equation returns the target response value (LogS).

CODE PRACTICE

Let’s now print out the regression model equation.

CODE EXPLANATION

First code cell:

All the components of the regression model equation are derived from the model variable. The y-intercept and the regression coefficients for LogP, MW, RB and AP are provided in model.intercept_, model.coef_[0], model.coef_[1], model.coef_[2] and model.coef_[3], respectively.

Second code cell:

Here we put together the components and print out the equation via the print() function.
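
One way to assemble the equation string is sketched below. The model is fitted on synthetic, noise-free data with made-up coefficients, and the formatting choices (2 decimal places, an explicit sign in front of each coefficient) are assumptions; the author's original cells may format the string differently.

```python
import numpy as np
from sklearn import linear_model

# Synthetic, noise-free data with made-up coefficients (illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 4))
Y_train = (0.5 - 0.7 * X_train[:, 0] - 0.01 * X_train[:, 1]
           + 0.02 * X_train[:, 2] - 0.4 * X_train[:, 3])

model = linear_model.LinearRegression().fit(X_train, Y_train)

# First cell: pull the components out of the fitted model.
yintercept = '%.2f' % model.intercept_
LogP = '%+.2f LogP' % model.coef_[0]
MW = '%+.2f MW' % model.coef_[1]
RB = '%+.2f RB' % model.coef_[2]
AP = '%+.2f AP' % model.coef_[3]

# Second cell: put the components together and print the equation.
equation = 'LogS = ' + ' '.join([yintercept, LogP, MW, RB, AP])
print(equation)
```

The '%+.2f' format always writes the sign of each coefficient, which avoids awkward output such as "+ -0.70" when a coefficient is negative.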

5. Scatter Plot of experimental vs. predicted LogS

We will now visualize the relative distribution of the experimental versus predicted LogS by means of a scatter plot. Such a plot allows us to quickly see the model performance.

CODE PRACTICE

In the forthcoming examples, I will show you how to lay out the 2 sub-plots in two different ways: (1) a vertical plot and (2) a horizontal plot.

CODE EXPLANATION

Let’s now take a look at the underlying code for implementing the vertical and horizontal plots. Here, I provide 2 options for you to choose from: laying out this multi-plot figure vertically or horizontally.

Import libraries

Both start by importing the necessary libraries, namely matplotlib and numpy. Most of the code uses matplotlib to create the plot, while the numpy library is used here to add a trend line.

Define figure size

Next, we specify the figure dimensions (the width and height of the figure) via plt.figure(figsize=(5,11)) for the vertical plot and plt.figure(figsize=(11,5)) for the horizontal plot. In particular, (5,11) tells matplotlib that the figure for the vertical plot should be 5 inches wide and 11 inches tall, while the inverse is used for the horizontal plot.

Define placeholders for the sub-plots

We tell matplotlib that we want 2 rows and 1 column, so the layout is that of a vertical plot. This is specified by plt.subplot(2, 1, 1), where the input arguments 2, 1, 1 refer to 2 rows, 1 column and the index of the particular sub-plot that the code underneath it creates. In other words, think of the plt.subplot() function as a way of structuring the plot by creating placeholders for the various sub-plots that the figure contains. The second sub-plot of the vertical plot is specified by the value of 2 in the third input argument of the plt.subplot() function, as in plt.subplot(2, 1, 2).

By applying the same concept, the horizontal plot is structured to have 1 row and 2 columns via plt.subplot(1, 2, 1) and plt.subplot(1, 2, 2), which house the 2 sub-plots.

Creating the scatter plot

Now that the general structure of the figure is in place, let’s add the data visualizations. The data points are added using the plt.scatter() function, as in plt.scatter(x=Y_train, y=Y_pred_train, c="#7CAE00", alpha=0.3), where x refers to the data column to use for the x axis, y refers to the data column to use for the y axis, c refers to the color to use for the scattered data points and alpha refers to the alpha transparency level (how translucent the scattered data points should be; the lower the number, the more transparent they become).

Adding the trend line

Next, we use the np.polyfit() and np.poly1d() functions from numpy together with the plt.plot() function from matplotlib to create the trend line.

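# Add trendline
# https://stackoverflow.com/questions/26447191/how-to-add-trendline-in-python-matplotlib-dot-scatter-graphs
z = np.polyfit(Y_train, Y_pred_train, 1)  # fit a straight line (degree 1)
p = np.poly1d(z)                          # wrap the coefficients as a callable polynomial
plt.plot(Y_train, p(Y_train), "#F8766D")  # draw the line over the x values used for the fit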

Adding the x and y axes labels

To add labels for the x and y axes, we use the plt.xlabel() and plt.ylabel() functions. Note that for the vertical plot, we omit the x-axis label for the top sub-plot (Why? Because it is redundant with the x-axis label for the bottom sub-plot).

Saving the figure

Finally, we save the constructed figure to file using the plt.savefig() function from matplotlib, specifying the file name as the input argument. Lastly, finish off with plt.show().

plt.savefig('plot_vertical_logS.png')
plt.savefig('plot_vertical_logS.pdf')
plt.show()
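
Putting the pieces together, the vertical layout might look like the sketch below. Several details are assumptions made for illustration: the experimental and predicted values are synthetic, the color of the test-set points and the axis label text are invented here, the Agg backend is selected so the script runs without a display, and the figure is saved to an in-memory buffer instead of the file names used above.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic experimental vs. predicted values (illustration only).
rng = np.random.default_rng(0)
Y_train = rng.normal(loc=-3.0, scale=2.0, size=80)
Y_pred_train = Y_train + rng.normal(scale=1.0, size=80)
Y_test = rng.normal(loc=-3.0, scale=2.0, size=20)
Y_pred_test = Y_test + rng.normal(scale=1.0, size=20)

# 5 inches wide, 11 inches tall: the vertical layout.
fig = plt.figure(figsize=(5, 11))

# Top sub-plot (row 1 of 2): training set.
plt.subplot(2, 1, 1)
plt.scatter(x=Y_train, y=Y_pred_train, c="#7CAE00", alpha=0.3)
z = np.polyfit(Y_train, Y_pred_train, 1)
p = np.poly1d(z)
plt.plot(Y_train, p(Y_train), color="#F8766D")
plt.ylabel('Predicted LogS')  # x label omitted: shared with the bottom sub-plot

# Bottom sub-plot (row 2 of 2): test set.
plt.subplot(2, 1, 2)
plt.scatter(x=Y_test, y=Y_pred_test, c="#619CFF", alpha=0.3)
z = np.polyfit(Y_test, Y_pred_test, 1)
p = np.poly1d(z)
plt.plot(Y_test, p(Y_test), color="#F8766D")
plt.ylabel('Predicted LogS')
plt.xlabel('Experimental LogS')

# Save as PNG; pass a file name instead of the buffer to write to disk.
buffer = io.BytesIO()
plt.savefig(buffer, format='png')
```

For the horizontal layout, swap the figure size to figsize=(11, 5) and use plt.subplot(1, 2, 1) and plt.subplot(1, 2, 2); in that case each sub-plot needs its own x-axis label.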

VISUAL EXPLANATION

The above section provides a text-based explanation. In this section, we do the same with a visual explanation that uses color highlights to distinguish the different components of the plot.

Visual explanation on creating a scatter plot. Here we color highlight the specific lines of code and their corresponding plot component. (Drawn by Chanin Nantasenamat)

Need Your Feedback

As an educator, I love to hear how I can improve my content. Please let me know in the comments whether:

the visual illustration is helpful for understanding how the code works,

the visual illustration is redundant and not necessary, OR whether

the visual illustration complements the text-based explanation to help understand how the code works.

About Me

I work full-time as an Associate Professor of Bioinformatics and Head of Data Mining and Biomedical Informatics at a research university in Thailand. In my after-work hours, I’m a YouTuber (AKA the Data Professor) making online videos about data science. In all tutorial videos that I make, I also share Jupyter notebooks on GitHub (Data Professor GitHub page).

Connect with Me on Social Network

YouTube: http://youtube.com/dataprofessor/
Website: http://dataprofessor.org/ (Under construction)
LinkedIn: https://www.linkedin.com/company/dataprofessor/
Twitter: https://twitter.com/thedataprof
FaceBook: http://facebook.com/dataprofessor/
GitHub: https://github.com/dataprofessor/
Instagram: https://www.instagram.com/data.professor/

Translated from: https://towardsdatascience.com/how-to-build-a-regression-model-in-python-9a10685c7f09
