逻辑回归 python_深入研究Python的逻辑回归
邏輯回歸 python
Classification techniques are an essential part of machine learning and data science applications. Approximately 70% of problems in machine learning are classification problems. There are lots of classification problems that are available, but the logistics regression is the most common way and is a useful regression method for solving the binary classification(0–1) problem,Another category of classification is multi-class classification, which takes care of the issues where several classes are present in the outcome variable. For example, IRIS dataset a very famous example of multi-class classification. Other examples are classifying articles ( require NLP skills ).
分類技術(shù)是機(jī)器學(xué)習(xí)和數(shù)據(jù)科學(xué)應(yīng)用程序的重要組成部分。 機(jī)器學(xué)習(xí)中大約70%的問題是分類問題。 可用的分類問題很多,但是物流回歸是最常用的方法,是解決二元分類(0-1)問題的一種有用的回歸方法。另一類分類是多分類,需要注意結(jié)果變量中存在幾個(gè)類別的問題。 例如,IRIS數(shù)據(jù)集是非常著名的多類分類示例。 其他示例是對(duì)文章進(jìn)行分類(需要NLP技能)。
Logistic Regression can be used for multiple classification problems such as credit cards approvalls ,Diabetes prediction and given a customer will purchase a particular product or will they churn another competitor, whether the user will click on a given advertisement link or not, and so much more.
Logistic回歸可用于多種分類問題,例如信用卡審批,糖尿病預(yù)測以及給定客戶將購買特定產(chǎn)品還是會(huì)吸引其他競爭對(duì)手(無論用戶是否會(huì)點(diǎn)擊給定的廣告鏈接)等等。 。
Logistic Regression is one of the most famous or simple way and commonly used Machine Learning algorithms for binary (0–1) classification. It is easy to code and can be used as the baseline for any binary classification problem. Its basic fundamental concepts are also constructive in deep learning. Logistic regression describes and estimates the relationship between one dependent binary variable and independent variables.
Logistic回歸是最著名或最簡單的方法之一,也是用于二進(jìn)制(0-1)分類的機(jī)器學(xué)習(xí)算法。 它易于編碼,可以用作任何二進(jìn)制分類問題的基準(zhǔn)。 它的基本基本概念在深度學(xué)習(xí)中也具有建設(shè)性。 Logistic回歸描述和估計(jì)一個(gè)因變量和自變量之間的關(guān)系。
To break down to you,here is what we gonna be learning in this tutorial.
總結(jié)一下,這是我們將在本教程中學(xué)習(xí)的內(nèi)容。
Introduction To Logistic Regression.
Logistic回歸概論。
Linear Regression Vs Logistic Regression.
線性回歸與邏輯回歸。
- Maximum Likelihood Estimation Vs. Ordinary Least Square Method 最大似然估計(jì)與 普通最小二乘法
- Logistic Regression under the hood? 引擎蓋下的邏輯回歸?
Logistic Regression in Scikit-learn.
Scikit學(xué)習(xí)中的邏輯回歸 。
- Confusion Matrix for Model Evaluation. 用于模型評(píng)估的混淆矩陣。
- Advantages over Disadvantages. 優(yōu)勢(shì)勝過劣勢(shì)。
邏輯回歸: (Logistic Regression:)
Logistic regression is a statistical learning method for predicting two-class(0–1). the target variable is dichotomous in nature. Dichotomous means there are only two possible classes. For example, it can be used for cancer detection problems. the logistic regression output probabilities and based on that output we can predict the corresponding classes.
Logistic回歸是一種用于預(yù)測兩級(jí)(0-1)的統(tǒng)計(jì)學(xué)習(xí)方法。 目標(biāo)變量本質(zhì)上是二分法的。 二分法意味著只有兩種可能的類別。 例如,它可以用于癌癥檢測問題。 邏輯回歸輸出的概率,并基于該輸出,我們可以預(yù)測相應(yīng)的類。
It is a special case of linear regression where the target variable is categorical in nature. It uses a log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence of a binary event utilizing a logit function.
這是線性回歸的一種特殊情況,其中目標(biāo)變量本質(zhì)上是分類的。 它使用幾率對(duì)數(shù)作為因變量。 Logistic回歸利用logit函數(shù)預(yù)測二進(jìn)制事件的發(fā)生概率。
Linear Regression Equation:
線性回歸方程:
In statistics, econometrics and machine learning, a linear regression model is a regression model that seeks to establish a linear relationship between a variable, called explained, and one or more variables, called explanatory,It is also referred to as a linear model or a linear regression model.
在統(tǒng)計(jì),計(jì)量經(jīng)濟(jì)學(xué)和機(jī)器學(xué)習(xí)中,線性回歸模型是一種回歸模型,旨在建立一個(gè)變量(稱為解釋變量)和一個(gè)或多個(gè)變量(稱為解釋變量)之間的線性關(guān)系,也稱為線性模型或線性模型。線性回歸模型。
We consider the model for individual i. For each individual, the explained variable is written as a linear function of the explanatory variables.
我們考慮個(gè)人i的模型。 對(duì)于每個(gè)人,將解釋變量寫為解釋變量的線性函數(shù)。
where yi and the xi,j are fixed and εi represents the error.
其中yi和xi , j是固定的,而εi表示誤差。
Sigmoid Function:
乙狀結(jié)腸功能:
The sigmoid function, also called logistic function gives an ‘S’ shaped curve that can take any real-valued number and map it into a value between 0 and 1( probability). If the curve goes to positive infinity, y predicted will become 1, and if the curve goes to negative infinity, y predicted will become 0. If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES( by default in logistic regression), and if it is less than 0.5, we can classify it as 0 or NO. The output cannot For example: If the output is 0.75, we can say in terms of probability as: There is a 75 percent chance that patient will suffer from cancer.
S形函數(shù) (也稱為邏輯函數(shù))給出了一個(gè)“ S”形曲線,該曲線可以采用任何實(shí)數(shù)值,并將其映射為0到1(概率)之間的值。 如果曲線變?yōu)檎裏o窮大,則y預(yù)測將變?yōu)?,如果曲線變?yōu)樨?fù)無窮大,則y預(yù)測將變?yōu)?。如果S型函數(shù)的輸出大于0.5,我們可以將結(jié)果分類為1或YES。 (在邏輯回歸中默認(rèn)情況下),如果小于0.5,我們可以將其分類為0或NO。 輸出不能為例如:如果輸出為0.75,就概率而言,我們可以這樣說:患者有75%的機(jī)會(huì)患上癌癥。
Properties of Logistic Regression:
Logistic回歸的屬性:
The dependent variable in logistic regression follows Bernoulli Distribution.
邏輯回歸中的因變量遵循伯努利分布。
- Estimation is done through maximum likelihood. 估計(jì)是通過最大似然來完成的。
- No R Square, Model fitness is calculated through Concordance, KS-Statistics. 沒有R平方,模型適用性通過Concordance,KS-Statistics計(jì)算。
線性回歸與邏輯回歸: (Linear Regression Vs Logistic Regression:)
Linear regression gives you a continuous output, but logistic regression provides a constant output. An example of the continuous output is house price and stock price. Example’s of the discrete output is predicting whether a patient has cancer or not, predicting whether the customer will churn. Linear regression is estimated using Ordinary Least Squares (OLS) while logistic regression is estimated using Maximum Likelihood Estimation (MLE) approach.
線性回歸可提供連續(xù)的輸出,而邏輯回歸可提供恒定的輸出。 連續(xù)輸出的一個(gè)例子是房屋價(jià)格和股票價(jià)格。 離散輸出的示例正在預(yù)測患者是否患有癌癥,并預(yù)測客戶是否會(huì)流失。 使用普通最小二乘(OLS)估計(jì)線性回歸,而使用最大似然估計(jì)(MLE)方法估計(jì)邏輯回歸。
最大似然估計(jì)與 最小二乘法: (Maximum Likelihood Estimation Vs. Least Square Method:)
In statistics, maximum likelihood estimate (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.
在統(tǒng)計(jì)中,最大似然估計(jì)(MLE)是通過最大化似然函數(shù)來估計(jì)概率分布參數(shù)的方法,因此在假定的統(tǒng)計(jì)模型下,觀察到的數(shù)據(jù)最有可能。 參數(shù)空間中使似然函數(shù)最大化的點(diǎn)稱為最大似然估計(jì)。 最大似然邏輯既直觀又靈活,因此該方法已成為統(tǒng)計(jì)推斷的主要手段。
Ordinary Least squares estimates are computed by fitting a regression line on given data points that has the minimum sum of the squared deviations (least square error). Both are used to estimate the parameters of a linear regression model. MLE assumes a joint probability mass function, while OLS doesn’t require any stochastic assumptions for minimizing distance.
普通最小二乘估計(jì)值是通過在給定的數(shù)據(jù)點(diǎn)上擬合回歸線來計(jì)算的,該數(shù)據(jù)點(diǎn)具有平方差的最小和(最小二乘誤差)。 兩者都用于估計(jì)線性回歸模型的參數(shù)。 MLE假設(shè)一個(gè)聯(lián)合概率質(zhì)量函數(shù),而OLS不需要任何隨機(jī)假設(shè)來最小化距離。
fitting a regression line on given data points在給定數(shù)據(jù)點(diǎn)上擬合回歸線Logistic回歸的類型: (Types of Logistic Regression:)
Multinomial Logistic Regression: The target variable has three or more nominal categories such as predicting the type of Wine.
多項(xiàng)邏輯回歸 :目標(biāo)變量具有三個(gè)或更多名義類別,例如預(yù)測酒的類型。
Binary Logistic Regression: The target variable has only two possible outcomes such as Spam or Not Spam.
二進(jìn)制Logistic回歸 :目標(biāo)變量只有兩個(gè)可能的結(jié)果,例如垃圾郵件或非垃圾郵件。
multi-label Logistic Regression: the target variable has three or more ordinal categories such as restaurant or product rating from 1 to 5.
多標(biāo)簽Logistic回歸 :目標(biāo)變量具有三個(gè)或多個(gè)序數(shù)類別,例如餐廳或產(chǎn)品等級(jí)從1到5。
Scikit學(xué)習(xí)中的模型構(gòu)建: (Model building in Scikit-learn:)
Scikit-learn is a free Python library for machine learning. It has been developed by many contributors, particularly in the academic world, by French higher education and research institutes such as Inria. It includes functions for estimating random forests, logistic regressions, classification algorithms, and support vector machines. It is designed to harmonize with other free Python libraries, including NumPy and SciPy.
Scikit-learn是用于機(jī)器學(xué)習(xí)的免費(fèi)Python庫。 它是由許多貢獻(xiàn)者開發(fā)的,特別是在學(xué)術(shù)界,是由法國高等教育和研究所(例如Inria)開發(fā)的。 它包括用于估計(jì)隨機(jī)森林的功能,邏輯回歸,分類算法和支持向量機(jī)。 它旨在與其他免費(fèi)的Python庫(包括NumPy和SciPy)保持一致 。
datasets: https://www.kaggle.com/uciml/pima-indians-diabetes-database
數(shù)據(jù)集: https : //www.kaggle.com/uciml/pima-indians-diabetes-database
now Let’s build the diabetes prediction model,Here, you are going to predict diabetes using Logistic Regression Classifier,first load the required Pima Indian Diabetes.
現(xiàn)在,我們建立糖尿病預(yù)測模型,在這里,您將使用Logistic回歸分類器預(yù)測糖尿病,首先加載所需的Pima印度糖尿病。
python codepython代碼 the first five rows of the dataframe數(shù)據(jù)框的前五行選擇功能: (Selecting Feature:)
Here, you need to divide the given columns into two types of variables dependent(or target variable) and independent variable(or feature variables or predictors variables).
在這里,您需要將給定的列分為因變量(或目標(biāo)變量)和自變量(或特征變量或預(yù)測變量)兩種類型。
python codepython代碼分割資料: (Splitting Data:)
To understand model performance, dividing the dataset into a training set and a test set is a good strategy.
為了了解模型的性能,將數(shù)據(jù)集分為訓(xùn)練集和測試集是一個(gè)很好的策略。
Let’s split dataset by using function train_test_split(). You need to pass 3 parameters features, target, and test_set size. Additionally, you can use random_state to select records randomly.
讓我們使用函數(shù)train_test_split()拆分?jǐn)?shù)據(jù)集。 您需要傳遞3個(gè)參數(shù)功能,目標(biāo)和test_set大小。 此外,您可以使用random_state隨機(jī)選擇記錄。
python codepython代碼Here, the Dataset is broken into two parts in a ratio of 75:25. It means 75% data will be used for model training and 25% for model testing.
在這里,數(shù)據(jù)集按75:25的比例分為兩部分。 這意味著75%的數(shù)據(jù)將用于模型訓(xùn)練,而25%的數(shù)據(jù)將用于模型測試。
模型開發(fā)和預(yù)測: (Model Development and Prediction:)
First, import the Logistic Regression module and create a Logistic Regression classifier object using LogisticRegression() function.
首先,導(dǎo)入Logistic回歸模塊,并使用LogisticRegression()函數(shù)創(chuàng)建一個(gè)Logistic回歸分類器對(duì)象。
Then, fit your model on the train set using fit() and perform prediction on the test set using predict().
然后,使用fit()將模型擬合到訓(xùn)練集上,并使用predict()對(duì)測試集執(zhí)行預(yù)測。
python codepython代碼使用混淆矩陣的模型評(píng)估: (Model Evaluation using Confusion Matrix:)
A confusion matrix is a table that is used to evaluate the performance of a classification model. You can also visualize the performance of an algorithm. The fundamental of a confusion matrix is the number of correct and incorrect predictions are summed up class-wise.
混淆矩陣是用于評(píng)估分類模型的性能的表。 您還可以可視化算法的性能。 混淆矩陣的基本原理是按類別匯總正確和錯(cuò)誤預(yù)測的數(shù)量。
Confusion Matrix混淆矩陣 python codepython代碼 output matrix輸出矩陣Here, you can see the confusion matrix in the form of the array object. The dimension of this matrix is 2*2 because this model is binary classification. You have two classes 0 and 1. Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions. In the output, 119 and 36 are actual predictions, and 26 and 11 are incorrect predictions.
在這里,您可以看到數(shù)組對(duì)象形式的混淆矩陣。 該矩陣的維數(shù)為2 * 2,因?yàn)樵撃P褪嵌M(jìn)制分類。 您有兩個(gè)類別0和1。對(duì)角線值表示準(zhǔn)確的預(yù)測,而非對(duì)角線元素則表示不正確的預(yù)測。 在輸出中,119和36是實(shí)際預(yù)測,而26和11是不正確的預(yù)測。
使用熱圖可視化混淆矩陣: (Visualizing Confusion Matrix using Heatmap:)
A heatmap is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.
熱圖是一種數(shù)據(jù)可視化技術(shù),可將現(xiàn)象的大小顯示為二維顏色。 顏色的變化可能是色相或強(qiáng)度 ,從而為讀者提供了有關(guān)現(xiàn)象如何聚集或隨空間變化的明顯視覺提示。
Here, you will visualize the confusion matrix using Heatmap.
在這里,您將使用Heatmap可視化混淆矩陣。
python codepython代碼 python codepython代碼 output輸出混淆矩陣評(píng)估指標(biāo): (Confusion Matrix Evaluation Metrics:)
Let’s evaluate the model using model evaluation metrics such as accuracy, precision, and recall.
讓我們使用模型評(píng)估指標(biāo)(例如準(zhǔn)確性,準(zhǔn)確性和召回率)評(píng)估模型。
python codepython代碼Well, you got a classification rate of 80%, considered as good accuracy.
好吧,您的分類率為80%,被認(rèn)為是不錯(cuò)的準(zhǔn)確性。
Precision: Precision is about being precise, i.e., how accurate your model is. In other words, you can say, when a model makes a prediction, how often it is correct. In your prediction case, when your Logistic Regression model predicted patients are going to suffer from diabetes, that patients have 76% of the time.
精度 :精度是指精度,即模型的精度。 換句話說,您可以說,當(dāng)模型做出預(yù)測時(shí),預(yù)測正確的頻率是多少。 在您的預(yù)測案例中,當(dāng)您的Logistic回歸模型預(yù)測患者將患有糖尿病時(shí),該患者有76%的時(shí)間。
Recall: If there are patients who have diabetes in the test set and your Logistic Regression model can identify it 58% of the time.
回想一下 :如果測試集中有糖尿病患者,并且您的Logistic回歸模型可以在58%的時(shí)間內(nèi)識(shí)別出糖尿病。
優(yōu)點(diǎn): (Advantages:)
Because of its efficient and straightforward nature, doesn’t require high computation power, easy to implement, easily interpretable, used widely by data analyst and scientist. Also, it doesn’t require scaling of features. Logistic regression provides a probability score for observations.
由于其高效而直接的特性,它不需要高計(jì)算能力,易于實(shí)現(xiàn),易于解釋的方法,并被數(shù)據(jù)分析師和科學(xué)家廣泛使用。 而且,它不需要縮放功能。 Logistic回歸為觀察提供了概率分?jǐn)?shù)。
缺點(diǎn): (Disadvantages:)
Logistic regression is not able to handle a large number of categorical features/variables. It is vulnerable to overfitting. Also, can’t solve the non-linear problem with the logistic regression that is why it requires a transformation of non-linear features. Logistic regression will not perform well with independent variables that are not correlated to the target variable and are very similar or correlated to each other.
Logistic回歸無法處理大量分類特征/變量。 它很容易過擬合。 此外,無法通過邏輯回歸來解決非線性問題,這就是為什么它需要轉(zhuǎn)換非線性特征的原因。 如果邏輯變量與目標(biāo)變量不相關(guān),非常相似或彼此相關(guān),則邏輯回歸將無法很好地執(zhí)行。
結(jié)論: (Conclusion:)
this tutorial, you covered a lot of details about Logistic Regression. You have learned what the logistic regression is, how to build respective models, how to visualize results and some of the theoretical background information. Also, you covered some basic concepts such as the sigmoid function, maximum likelihood, confusion matrix, with that said see you guys in the next article and don’t forget to keep learning.
在本教程中,您涵蓋了有關(guān)Logistic回歸的許多詳細(xì)信息。 您已經(jīng)了解了邏輯回歸是什么,如何建立各自的模型,如何可視化結(jié)果以及一些理論背景信息。 另外,您還介紹了一些基本概念,例如S形函數(shù),最大似然,混淆矩陣,并說在下一篇文章中與大家見面, 不要忘記繼續(xù)學(xué)習(xí)。
翻譯自: https://medium.com/analytics-vidhya/dive-into-logistic-regression-with-python-48911f37f8ee
邏輯回歸 python
總結(jié)
以上是生活随笔為你收集整理的逻辑回归 python_深入研究Python的逻辑回归的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 提高机器学习质量的想法_如何提高机器学习
- 下一篇: cayenne:用于随机模拟的Pytho