Using Gradient Boosting Machines for Classification in R

Published: 2023/12/15

Background

Purpose of analysis:


Understand the factors driving student success so that Open University can allocate resources to improve student success


Description of dataset:


The Open University Learning Analytics Dataset is a publicly available dataset containing data about courses, students and their interactions with the VLE (virtual learning environment) for seven selected courses (modules).


As the unique identifiers across all the data tables were student ID, module and course description, data was aggregated at this level. The following variables were used in the analysis:


  • id_student: unique identifier/primary key
  • code_module: categorical
  • code_presentation: categorical
  • gender: categorical
  • region: categorical
  • highest education: categorical
  • imd band: categorical
  • age band: categorical
  • num of previous attempts: numerical
  • studied credits: categorical
  • disability: categorical
  • final result: categorical
  • sum weighted score: numerical
  • average module length: numerical
  • average submission duration: numerical
  • average proportion content accessed: numerical
  • average date registration: numerical
  • trimmed assessment type: categorical

Methodology

Data transformation

  • Combine datasets based on unique identifier (student ID, course and code_description)
  • Aggregate variables at unique identifier level
  • Update nominal variable types from character to factor
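The steps above can be sketched in dplyr as follows. The actual pipeline was built in Alteryx, so the table and column names below are illustrative assumptions, not the real schema:

```r
# A minimal sketch of the three transformation steps, assuming toy data
library(dplyr)

student_info <- data.frame(
  id_student = c(1, 2), code_module = c("AAA", "BBB"),
  code_presentation = c("2013B", "2013J"), gender = c("M", "F"))

assessments <- data.frame(
  id_student = c(1, 1, 2), code_module = c("AAA", "AAA", "BBB"),
  code_presentation = c("2013B", "2013B", "2013J"),
  submission_duration = c(10, 20, 5))

combined <- student_info %>%
  # 1. combine datasets on the unique identifier
  inner_join(assessments,
             by = c("id_student", "code_module", "code_presentation")) %>%
  # 2. aggregate variables at unique-identifier level
  group_by(id_student, code_module, code_presentation, gender) %>%
  summarise(avg_submission_duration = mean(submission_duration),
            .groups = "drop") %>%
  # 3. update nominal variables from character to factor
  mutate(across(where(is.character), as.factor))
```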

Predictors:


  • code module
  • code presentation (changed to 1 = B and 2 = J)
  • gender, region, highest education, IMD band, age band, number of previous attempts, studied credits
  • average submission duration (averaged as a single code presentation can have multiple assessments with varying submission durations)
  • average module length
  • average proportion content accessed (based on the sum of clicks on “ou” content and resources/quiz/glossary divided by the total sum of clicks per code description)
  • average date registration
  • trimmed assessment type (as a single code description can have multiple assessments and multiple of the same type of assessment. It is important to determine whether the types of assessments per course and description drive student success)

Excluded variables: Student ID and sum weighted score


Reason for exclusion: Student ID is identifying information; sum weighted score is correlated with final result.

Define “success” target variable


  • Response variable: Success (“Pass” or “Distinction” = “Yes”, “Fail” or “Withdrawn” = “No”)

Reason for not using “final result” as the success factor: there is a disproportionate number of “Pass” records within the dataset, reducing the model's accuracy at predicting “Fail”, “Withdrawn” and “Distinction”. As such, the response variable was binarised.
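The binarisation step is a one-liner in base R; the vector below just enumerates the four final_result levels named above:

```r
# Collapse the four final_result levels into a binary success variable
final_result <- c("Pass", "Distinction", "Fail", "Withdrawn")
success <- factor(ifelse(final_result %in% c("Pass", "Distinction"),
                         "Yes", "No"))
```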

  • Check for nulls, missing values, and outliers for each variable that will be input into the model

Why do we check for missing data? If a large proportion of the data is missing/null, the sample is not representative enough to provide accurate results. This is not the case when the data is missing for a valid reason.
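A quick null/missing check in base R might look like this (toy columns, not the actual dataset):

```r
# Toy data frame with deliberate gaps
df <- data.frame(imd_band        = c("20-40%", NA, "0-20%"),
                 studied_credits = c(60, 120, NA))

na_counts      <- colSums(is.na(df))                      # NAs per column
na_proportions <- sapply(df, function(x) mean(is.na(x)))  # share missing
```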

Data set-up

  • Split data into training and test sets for nominal and binary predictors to test model accuracy

Why do we split the data into training and test sets? The model is trained on the training set. Model accuracy is then measured on the test set to determine how well the model predicts success versus non-success on data it has not “seen”.
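A simple split in base R could look like this (the 80/20 ratio and the seed are assumptions, not stated in the original analysis):

```r
set.seed(42)                               # reproducible split
n <- 100                                   # pretend we have 100 observations
train_idx <- sample(n, size = 0.8 * n)     # 80% of row indices for training
test_idx  <- setdiff(seq_len(n), train_idx)  # remaining 20% for testing
```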

Data modelling

  • Analytical method: gradient boosting machine (GBM)
  • Alternate methods — Distributed Random Forest (DRF) and Generalized Linear Model (GLMNET) — could have been used, but the GBM's accuracy was good enough.
  • Check model accuracy using the confusion matrix (i.e. the proportion predicted correctly) and the area under the curve (how well we did compared to random chance).

Output

  • Top predictors (by variable importance score)

Data ETL

All data exploration, transformation, and loading of the final dataset was done in Alteryx, to avoid writing code and to test the tool's functionality for basic data transformation steps that would typically be carried out in R, such as joins, mutate and group by.

First, I joined three datasets — assessments.csv, studentAssessments.csv and courses.csv — by assessment ID (primary key).

Next, I engineered two features: weighted_score and submission_duration. Dates are usually meaningless unless they are transformed into useful features. Course description only had two categories, which I converted to 1 and 2 for ease of reference.


The next step was to add data on student interactions with the virtual learning environment (VLE). Some feature engineering was done for ease of analysis, such as creating a new variable called activity_type_sum that reduces the categories of activity type to two broad categories: content access and browsing. The reason for doing this is that granular categories only result in more features and reduce the number of observations per category. The number of clicks was summed by the activity type feature. The proportion of total activity that is browsing related and content access related was also calculated. This is a good way to create a feature that is relative to another feature and scaled by total activity, ensuring that all students are represented on a similar scale by their activity type.
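The click-proportion feature described above might be sketched in dplyr like this (illustrative data and names; the real work was done in Alteryx):

```r
library(dplyr)

# Toy click log: one row per (student, activity category), clicks pre-summed
clicks <- data.frame(
  id_student        = c(1, 1, 2, 2),
  activity_type_sum = c("content_access", "browsing", "content_access", "browsing"),
  sum_click         = c(40, 10, 15, 5))

# Scale each category's clicks by the student's total activity
proportions <- clicks %>%
  group_by(id_student) %>%
  mutate(proportion = sum_click / sum(sum_click)) %>%
  ungroup()
```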

Block 1 was joined to Block 2 using student_id, code_module and code_presentation as the primary key. The resulting output is shown below.


The above output — Block 3 — was joined with student registration data using student_id, code_module, and code_presentation to bring across the date_registration field.

The date_unregistered field was ignored as it had a lot of missing values. Moreover, students with a populated date_unregistered field have Withdrawn as the value for their final_result, which is our target/response variable. The date_unregistered field therefore appears to be a proxy measure for final_result, so it makes sense to exclude it from the analysis.

As shown above, for a given id_student, code_module, and code_presentation, the module_presentation length, proportion_content and date_registration values are repeated. As we want unique records, we can aggregate the data as follows:

  • Summarise weighted score using total sum
  • Average submission duration
  • Average module presentation (you can also use other aggregates such as minimum, maximum, and median)
  • Average of proportion_content_access
  • Average of date_registration

Data is now at student_id, code_module, code_presentation and assessment_type level; however, the target variable — final_result — is at student_id, code_module and code_presentation level. Hence, this data will need to be further aggregated.


Let’s look at student info first. A unique record here is id_student, code_module, code_presentation. So, we will need to go back a step and summarise at the student_id, code_module and code_presentation level to represent all assessments taken by an individual. We will still use the previous summary formulas.

By doing this we have 8 unique assessment types that a student can take for a given code module and code presentation. Assessment types are not repeated (trimmed only), so if a student took 3 TMAs this is not reflected, as shown below.

A variable could be created to count the number of assessments per assessment type but it would contain lots of missing values as not all assessments have all three types of assessments. Now, we are ready to join to the student info data with output shown below.


Now, we have 18 columns. We have been told that a presentation may differ if presented in February vs. October. We will assume that it does not differ year on year (i.e. 2013B is the same as 2014B). As such, we will recode code_presentation as a binary variable: 1 for B and 2 for J.
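The recoding itself is a one-liner; the month letter is the last character of code_presentation:

```r
code_presentation <- c("2013B", "2014B", "2013J", "2014J")
# B (February) -> 1, J (October) -> 2; the year is ignored by assumption
presentation_code <- ifelse(grepl("B$", code_presentation), 1, 2)
```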

The final output is shown below.


It is finally time for some data exploration.


Exploratory Data Analysis

Categorical variables can be represented with bar charts where the y-axis is the frequency of the occurrence of a given category. For example, in the chart below we can see that the most frequently taken code module is FFF followed by BBB. There are seven unique code modules with no missing values.

Data can also be summarised numerically, using a five-point summary for continuous variables and the mode for categorical variables, as shown below.
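In base R, the five-point summary and the mode can be computed as below (toy values, not the actual dataset):

```r
studied_credits <- c(60, 60, 120, 90, 60)
region <- c("Scotland", "Wales", "Scotland", "London", "Scotland")

five_point  <- fivenum(studied_credits)  # min, lower hinge, median, upper hinge, max
region_mode <- names(which.max(table(region)))  # most frequent category
```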

An insight we can draw from the summary below is that the most common student is a Scottish male, without a disability, with an imd_band between 20 and 40%, and with Pass as the final result.

Now we can move towards modelling the dataset.


Machine Learning Model

We have been asked to assist Open University in better understanding student success.

  • We will assume that student success is measured via the final result, where Pass and Distinction are indicators of “success” and Withdrawn and Fail are indicators of “non-success”.
  • For the independent variables, we will use all variables from the previous table except for weighted_score, because weighted score determines the final result for a given student. It is therefore highly correlated (multicollinear) with the final result and is excluded.
  • Student ID is identifying information and will not be used as a predictor.

GBM (Gradient Boosted Model) was used as the model of choice. This type of model creates a series of weak learners (shallow trees) where each new tree tries to improve on the error rate of the previous tree. The final model is the ensemble with the lowest error rate. It is an ensemble machine learning method, as several trees are created to produce the final results. However, unlike in randomForest, these trees are created in series rather than in parallel. Furthermore, the trees are not independent: each depends on the previous tree's error rate, trying harder to improve prediction for the more difficult cases. The contribution of each tree is controlled by a parameter called the learning rate.

The model was run with 500 rounds (500 trees) with minimum and maximum tree depths of 4. Typically, it is not good to have very deep trees, as this can lead to overfitting: the algorithm tries to explain every observation in the dataset, increasing the depth of the tree and producing leaves that contain only a very small number of observations fitting the given rule.
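The full code lives in the repository linked at the end. As a hedged sketch only, a fit like the one described might look as follows with the h2o package; the library choice, frame names and column names here are assumptions, not confirmed by the article:

```r
library(h2o)
h2o.init()

train_hex <- as.h2o(train)   # training split from earlier
test_hex  <- as.h2o(test)

gbm_model <- h2o.gbm(
  y = "success",                     # binarised target
  x = setdiff(colnames(train_hex),
              c("success", "id_student", "weighted_score")),
  training_frame = train_hex,
  ntrees    = 500,                   # 500 boosting rounds
  max_depth = 4,                     # shallow trees to limit overfitting
  seed      = 42)

perf <- h2o.performance(gbm_model, newdata = test_hex)
h2o.auc(perf)               # area under the ROC curve
h2o.confusionMatrix(perf)   # per-class and overall error rates
h2o.varimp(gbm_model)       # variable importance table
```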

We can see from the above output that the model has an RMSE (root-mean-squared error) value of 0.55, which is quite high. It is particularly bad at predicting Distinction and Fail, which may be due to the imbalance in the dataset: we know from our exploratory data analysis that Pass is the most common final result.

To counteract this imbalance issue, the target variable was redefined as “success” (distinction and pass) and “failure” (fail and withdrawn). It is common to combine categories to deal with imbalanced datasets. Other ways are to undersample (i.e. reduce the number of instances for the most frequent class) or oversample (i.e. create artificial observations for the non-frequent classes).


The model was re-run with the following output. Here we can see that the mean per-class error has dropped significantly. The Area Under the Curve (AUC) is another accuracy metric that tells you how well the model classifies cases correctly (i.e. maximising the true positive rate (TPR)). The higher the AUC, the more accurate the model. As the AUC is measured between 0 and 1, an AUC of 0.87 is pretty good.

Another metric that is commonly used in classification problems is the F1 score, which is the harmonic mean of precision and recall. Both metrics aim to maximise the TPR while minimising either the false negative rate (recall) or the false positive rate (precision). A true positive is when a success is correctly classified as a success. A false negative is when a success is labelled as a failure. A false positive is when a failure is labelled as a success. For the F1 score to be high, both precision and recall need to be high.
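As a worked example with assumed counts (not the actual model's confusion matrix):

```r
tp <- 80; fp <- 15; fn <- 10   # assumed true/false positive and false negative counts

precision <- tp / (tp + fp)    # high precision = few false positives
recall    <- tp / (tp + fn)    # high recall = few false negatives
f1 <- 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```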

The confusion matrix indicates an overall error rate of 17.11%, which is mainly driven by how good the model is at classifying successes. The model is not so good at classifying failures, with an error rate of 39.07%. Again, this may be due to “Pass” being overrepresented in the data. Thus, the results should be treated with caution and the model re-run with a more balanced dataset.

Now, let’s look at the top predictors of success or failure by looking at the variable importance list.


  • The top 3 variables are code module, trimmed_assessment_type and average submission duration.
  • The bottom 3 variables for predicting whether a successful outcome will be reached for a given student are disability status, average date of registration and gender.
  • Note: as code module and code presentation are part of the unique identifier, they should have been excluded from the analysis. However, as presentations in February and October may differ for some courses, both variables were kept in the model. It is possible that excluding these variables may increase accuracy or make other variables more “important”.

Now, let’s visualise information on the top predictors to better understand the model. The stacked bar plot below shows the proportion of records by course module and final_result. We can deduce that students are more likely to be successful in completing AAA, EEE and GGG courses than other courses.

  • From the above table, we can see that there is a 100% success rate if an exam is the only assessment for a given course and presentation.
  • If only computer-marked assessments (CMAs) make up the course component, there is a very high failure/withdrawal rate. It would be interesting to investigate why having a CMA as part of a presentation's assessment leads to a decrease in success rate.

The histograms above show the average submission duration by success and failure.


It appears that when students are successful, they are more likely to submit their assignment within 10 days (+/-) of the assessment submission date.


Wrapping up

Machine learning was used to quickly identify top contributors to student success.


Recommendations for model improvement include:


  • Working with a balanced dataset
  • Including proxy measures for resource allocation within the dataset
  • Adding a count of the number of assessments by type per course and presentation as a feature
  • Removing categorical variables that are associated with each other (i.e. using a chi-squared test of independence)

Hopefully, you now have a better understanding of utilising GBM for a classification problem, the pitfalls of a classification problem (i.e. imbalanced dataset) and the use of various accuracy metrics.


All R code is provided in my git repository: https://github.com/shedoesdatascience/openlearning

Originally published at https://towardsdatascience.com/using-gradient-boosting-machines-for-classification-in-r-b22b2f8ec1f1
