
Kaggle Graduate Admissions (Part 1)


From my daily browsing of Kaggle:

https://www.kaggle.com/mohansacharya/graduate-admissions




This appears to be a very well-known dataset. Its columns and value ranges:

  • GRE Score (290 to 340)
  • TOEFL Score (92 to 120)
  • University Rating (1 to 5)
  • Statement of Purpose (SOP) (1 to 5)
  • Letter of Recommendation (LOR) Strength (1 to 5)
  • Undergraduate CGPA (6.8 to 9.92)
  • Research Experience (0 or 1)
  • Chance of Admit (0.34 to 0.97)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import sys
import os

df = pd.read_csv("../input/Admission_Predict.csv", sep=",")
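As a quick check, the ranges listed above can be reproduced from summary statistics; a minimal sketch using the df just loaded:

# min, mean, and max of every numeric column; these should reproduce the ranges above
print(df.describe().loc[["min", "mean", "max"]])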


The three most important features for Master's admission: CGPA, GRE score, and TOEFL score.

The three least important features: Research, LOR, and SOP.
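These rankings follow from the correlations with the target. A minimal sketch (assuming the target column is named "Chance of Admit", as in the code further below) that sorts them directly:

# correlation of each feature with the admission chance, strongest first
corr_with_target = df.corr()["Chance of Admit"].drop("Chance of Admit")
print(corr_with_target.sort_values(ascending=False))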

Correlation matrix:

fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df.corr(), ax=ax, annot=True, linewidths=0.05, fmt='.2f', cmap="magma")
plt.show()


However, most candidates in the dataset have research experience.

As a result, Research turns out to be a relatively unimportant feature for the chance of admission.

print("Not Having Research:",len(df[df.Research == 0])) print("Having Research:",len(df[df.Research == 1])) y = np.array([len(df[df.Research == 0]),len(df[df.Research == 1])]) x = ["Not Having Research","Having Research"] plt.bar(x,y) plt.title("Research Experience") plt.xlabel("Canditates") plt.ylabel("Frequency") plt.show()


In the data, the lowest TOEFL score is 92 and the highest is 120, with a mean of 107.41.

y = np.array([df["TOEFL Score"].min(),df["TOEFL Score"].mean(),df["TOEFL Score"].max()]) x = ["Worst","Average","Best"] plt.bar(x,y) plt.title("TOEFL Scores") plt.xlabel("Level") plt.ylabel("TOEFL Score") plt.show()


GRE scores:

This histogram shows the frequency distribution of GRE scores.

The density is concentrated between 310 and 330. Scoring above this range is a good way for a candidate to stand out.

df["GRE Score"].plot(kind = 'hist',bins = 200,figsize = (6,6)) plt.title("GRE Scores") plt.xlabel("GRE Score") plt.ylabel("Frequency") plt.show()


CGPA scores by university rating:

As the university rating rises, CGPA scores rise with it.

plt.scatter(df["University Rating"],df.CGPA) plt.title("CGPA Scores for University Ratings") plt.xlabel("University Rating") plt.ylabel("CGPA") plt.show()


Individuals with high GRE scores usually also have high CGPA scores.

plt.scatter(df["GRE Score"],df.CGPA) plt.title("CGPA for GRE Scores") plt.xlabel("GRE Score") plt.ylabel("CGPA") plt.show()

Candidates with a CGPA of at least 8.5 also tend to have high GRE and TOEFL scores:

df[df.CGPA >= 8.5].plot(kind='scatter', x='GRE Score', y='TOEFL Score', color="red")
plt.xlabel("GRE Score")
plt.ylabel("TOEFL Score")
plt.title("CGPA >= 8.5")
plt.grid(True)
plt.show()


Candidates graduating from higher-rated universities are more likely to be admitted.

s = df[df["Chance of Admit"] >= 0.75]["University Rating"].value_counts().head(5) plt.title("University Ratings of Candidates with an 75% acceptance chance") s.plot(kind='bar',figsize=(20, 10)) plt.xlabel("University Rating") plt.ylabel("Candidates") plt.show()


Candidates with high CGPA scores usually also have high SOP scores.

plt.scatter(df["CGPA"],df.SOP) plt.xlabel("CGPA") plt.ylabel("SOP") plt.title("SOP for CGPA") plt.show()


Candidates with high GRE scores usually also have high SOP scores.

plt.scatter(df["GRE Score"],df["SOP"]) plt.xlabel("GRE Score") plt.ylabel("SOP") plt.title("SOP for GRE Score") plt.show()

That covers the exploratory data analysis; now we move on to training models.

First, drop the serial-number column (the first column).

# reading the dataset
df = pd.read_csv("../input/Admission_Predict.csv", sep=",")

# keep the serial numbers; they may be needed in the future
serialNo = df["Serial No."].values
df.drop(["Serial No."], axis=1, inplace=True)

y = df["Chance of Admit"].values
x = df.drop(["Chance of Admit"], axis=1)

# separating train (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

縮放到固定范圍(0-1)

# normalization
from sklearn.preprocessing import MinMaxScaler
scalerX = MinMaxScaler(feature_range=(0, 1))
x_train[x_train.columns] = scalerX.fit_transform(x_train[x_train.columns])
x_test[x_test.columns] = scalerX.transform(x_test[x_test.columns])
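As a quick sanity check (a sketch, assuming the scaled frames above): every training column should now span exactly [0, 1], while test columns may fall slightly outside, because they are scaled with the training set's min and max.

# global min/max across the scaled training columns; expected output: 0.0 1.0
print(x_train.min().min(), x_train.max().max())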

Linear regression

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)
y_head_lr = lr.predict(x_test)

print("real value of y_test[1]: " + str(y_test[1]) + " -> the predict: " + str(lr.predict(x_test.iloc[[1], :])))
print("real value of y_test[2]: " + str(y_test[2]) + " -> the predict: " + str(lr.predict(x_test.iloc[[2], :])))

from sklearn.metrics import r2_score
print("r_square score: ", r2_score(y_test, y_head_lr))

y_head_lr_train = lr.predict(x_train)
print("r_square score (train dataset): ", r2_score(y_train, y_head_lr_train))

real value of y_test[1]: 0.68 -> the predict: [0.72368741]
real value of y_test[2]: 0.9 -> the predict: [0.93536809]
r_square score: 0.821208259148699
r_square score (train dataset): 0.7951946003191085
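Because all features were scaled to the same 0-1 range, the magnitudes of the linear model's coefficients give a rough sense of each feature's influence. A minimal sketch, assuming lr and x from the cells above:

# coefficients of the fitted linear model, paired with feature names
coefs = pd.Series(lr.coef_, index=x.columns).sort_values(ascending=False)
print(coefs)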

Random forest

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(x_train, y_train)
y_head_rfr = rfr.predict(x_test)

from sklearn.metrics import r2_score
print("r_square score: ", r2_score(y_test, y_head_rfr))
print("real value of y_test[1]: " + str(y_test[1]) + " -> the predict: " + str(rfr.predict(x_test.iloc[[1], :])))
print("real value of y_test[2]: " + str(y_test[2]) + " -> the predict: " + str(rfr.predict(x_test.iloc[[2], :])))

y_head_rf_train = rfr.predict(x_train)
print("r_square score (train dataset): ", r2_score(y_train, y_head_rf_train))

r_square score: 0.8074111823415694
real value of y_test[1]: 0.68 -> the predict: [0.7249]
real value of y_test[2]: 0.9 -> the predict: [0.9407]
r_square score (train dataset): 0.9634880602889714
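The fitted random forest offers its own feature-importance estimate, which can be compared against the correlation-based ranking from earlier. A minimal sketch, assuming rfr and x from above:

# impurity-based feature importances from the fitted random forest
importances = pd.Series(rfr.feature_importances_, index=x.columns).sort_values(ascending=False)
print(importances)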

Decision tree

from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(random_state=42)
dtr.fit(x_train, y_train)
y_head_dtr = dtr.predict(x_test)

from sklearn.metrics import r2_score
print("r_square score: ", r2_score(y_test, y_head_dtr))
print("real value of y_test[1]: " + str(y_test[1]) + " -> the predict: " + str(dtr.predict(x_test.iloc[[1], :])))
print("real value of y_test[2]: " + str(y_test[2]) + " -> the predict: " + str(dtr.predict(x_test.iloc[[2], :])))

y_head_dtr_train = dtr.predict(x_train)
print("r_square score (train dataset): ", r2_score(y_train, y_head_dtr_train))

r_square score: 0.6262105228127393
real value of y_test[1]: 0.68 -> the predict: [0.73]
real value of y_test[2]: 0.9 -> the predict: [0.94]
r_square score (train dataset): 1.0
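A train R² of 1.0 against a test R² of 0.63 shows the unconstrained tree simply memorizes the training set. One common remedy is to cap the tree depth; a sketch below, where max_depth=4 is an illustrative value, not a tuned one:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# a shallower tree cannot memorize the training set, so it should generalize better
dtr_capped = DecisionTreeRegressor(max_depth=4, random_state=42)
dtr_capped.fit(x_train, y_train)
print("r_square score (max_depth=4):", r2_score(y_test, dtr_capped.predict(x_test)))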

Linear regression and random forest regression both outperform decision tree regression; the bar chart below compares their test R² scores.
y = np.array([r2_score(y_test, y_head_lr), r2_score(y_test, y_head_rfr), r2_score(y_test, y_head_dtr)])
x = ["LinearRegression", "RandomForestReg.", "DecisionTreeReg."]
plt.bar(x, y)
plt.title("Comparison of Regression Algorithms")
plt.xlabel("Regressor")
plt.ylabel("r2_score")
plt.show()


Visualizing the three algorithms' predictions against the true values:

red = plt.scatter(np.arange(0, 80, 5), y_head_lr[0:80:5], color="red")
green = plt.scatter(np.arange(0, 80, 5), y_head_rfr[0:80:5], color="green")
blue = plt.scatter(np.arange(0, 80, 5), y_head_dtr[0:80:5], color="blue")
black = plt.scatter(np.arange(0, 80, 5), y_test[0:80:5], color="black")
plt.title("Comparison of Regression Algorithms")
plt.xlabel("Index of Candidate")
plt.ylabel("Chance of Admit")
plt.legend((red, green, blue, black), ('LR', 'RFR', 'DTR', 'REAL'))
plt.show()

Summary

On this dataset, linear regression posts the best test R² (0.82), random forest is close behind (0.81), and the unconstrained decision tree trails at 0.63 while fitting the training set perfectly, a clear sign of overfitting.
