Hung-yi Lee ML Homework 1

Published: 2025/3/21


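Task description

train.csv

test.csv

Task objective

Input: 9 hours of data with 18 features (AMB_TEMP, CH4, CO, NMHC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR)

Output: the PM2.5 value at the 10th hour

Model: linear regression

Solution

Data processing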

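"""Load the data"""
import sys
import pandas as pd
import numpy as np

# Read train.csv; it contains Traditional Chinese text, so it is encoded as big5
data = pd.read_csv('/Users/zhucan/Desktop/李宏毅深度學習作業/第一次作業/train.csv', encoding = 'big5')
# Show the first 5 rows (the default for head())
print(data.head())
data.shape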

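Result:

data.shape  # (4320, 27)
# Drop the first three columns; the hourly values start at the fourth column
data = data.iloc[:, 3:]
# Replace the string 'NR' in the rainfall rows with the number 0
data[data == 'NR'] = 0
# Convert the DataFrame to a NumPy array
raw_data = data.to_numpy()
raw_data

Result:

The shape is now (4320, 24).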
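As a small illustration of the 'NR' replacement step, here is a minimal sketch on a toy array. The array `toy` and its values are made up for the example and only stand in for two rows of `raw_data`; the boolean-mask assignment mirrors the `data[data == 'NR'] = 0` line above.

```python
import numpy as np

# Toy stand-in (hypothetical values): one numeric row and one rainfall row with 'NR'
toy = np.array([["14", "13.5", "12"],
                ["NR", "0.2", "NR"]], dtype=object)

cleaned = toy.copy()
cleaned[cleaned == "NR"] = 0      # boolean-mask assignment, as in the article
cleaned = cleaned.astype(float)   # numeric strings and the inserted 0 become floats
print(cleaned)
```

Converting to float right after the replacement makes any leftover non-numeric string fail loudly instead of silently staying in the array.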

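Feature extraction

The data is split into 12 months; each month has 18 rows × 480 columns (18 features, 20 days × 24 hours).

Within each month, every 10 consecutive hours form one sample: the first 9 hours are used to predict the PM2.5 of the 10th hour. The first 9 hours go into x and the PM2.5 value of the 10th hour goes into y. The window has size 10, starts at the first hour, and slides right one hour at a time, so each month yields 480 − 10 + 1 = 471 samples.

Each 18×9 block is flattened into a row vector and stored as one row of x. With 471 samples per month there are 12×471 vectors in total, so x has 12×471 rows and 18×9 columns.

The target values go into y, which has 12 (months) × 471 (samples) rows and 1 column.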
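The window arithmetic above can be checked with a short sketch. It assumes a synthetic month array `month` (a hypothetical stand-in for one entry of `month_data`) and uses NumPy's `sliding_window_view` rather than explicit loops; the names `x_month` and `y_month` are illustrative only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_features, n_hours, window = 18, 480, 9

# Hypothetical stand-in for one month: entry [f, h] = 1000 * f + h
month = np.arange(n_features)[:, None] * 1000.0 + np.arange(n_hours)[None, :]

# All length-9 windows along the hour axis: shape (18, 472, 9)
windows = sliding_window_view(month, window, axis=1)
# Keep only windows followed by a 10th hour to predict: the first 471
windows = windows[:, : n_hours - window, :]
# Reorder to (471, 18, 9), then flatten each sample to 18 * 9 values
x_month = windows.transpose(1, 0, 2).reshape(n_hours - window, -1)
# Target: row 9 (PM2.5) at the hour right after each window
y_month = month[9, window:].reshape(-1, 1)

print(x_month.shape, y_month.shape)
```

Row i of `x_month` holds hours i to i+8 of every feature, in the same feature-major order as `reshape(1, -1)` on an 18×9 slice.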

# Regroup the rows: for each month, place its 20 days side by side to get an 18 x 480 array
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]
    month_data[month] = sample

x = np.empty([12 * 471, 18 * 9], dtype = float)
y = np.empty([12 * 471, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            # The last 9 hours of a month have no 10th hour left to predict
            if day == 19 and hour > 14:
                continue
            # Flattened 18 * 9 feature vector (9 hours for each of the 18 features)
            x[month * 471 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour : day * 24 + hour + 9].reshape(1, -1)
            # Target value: PM2.5 (row 9) at the 10th hour
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]
print(x)
print(y)

Result:

[[14. 14. 14. ... 2. 2. 0.5]
 [14. 14. 13. ... 2. 0.5 0.3]
 [14. 13. 12. ... 0.5 0.3 0.8]
 ...
 [17. 18. 19. ... 1.1 1.4 1.3]
 [18. 19. 18. ... 1.4 1.3 1.6]
 [19. 18. 17. ... 1.3 1.6 1.8]]
[[30.]
 [41.]
 [44.]
 ...
 [17.]
 [24.]
 [29.]]

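Standardization (Normalization)

mean_x = np.mean(x, axis = 0)  # 18 * 9 values: mean of each column
std_x = np.std(x, axis = 0)    # 18 * 9 values: standard deviation of each column
for i in range(len(x)):          # 12 * 471 rows
    for j in range(len(x[0])):   # 18 * 9 columns
        # Leave columns with zero standard deviation unchanged
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
x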

Result:
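The element-wise standardization loop can also be written with NumPy broadcasting. The following is a minimal sketch on synthetic data, not the article's arrays: `x_demo`, `mean_demo`, `std_demo` and `x_norm` are hypothetical names, and one column is made constant on purpose to exercise the zero-standard-deviation case.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for x (assumption): 100 samples, 4 features
x_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
x_demo[:, 3] = 7.0  # constant column, so its standard deviation is 0

mean_demo = x_demo.mean(axis=0)   # per-column mean
std_demo = x_demo.std(axis=0)     # per-column standard deviation
nonzero = std_demo != 0           # columns that can safely be divided

x_norm = x_demo.copy()
# Broadcasting subtracts and divides column by column in one step;
# columns with zero standard deviation are left untouched, as in the loop version
x_norm[:, nonzero] = (x_demo[:, nonzero] - mean_demo[nonzero]) / std_demo[nonzero]

print(x_norm.mean(axis=0), x_norm.std(axis=0))
```

On the full 5652 × 162 array this avoids roughly 900,000 Python-level iterations, and `mean_demo` and `std_demo` stay available for normalizing test data with the same statistics.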

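Splitting the data

The training data is split into a training set, train_set, and a validation set, validation. train_set is used for training; validation takes no part in training and is used only for validation.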

import math

# First 80% of the rows for training, last 20% for validation
x_train_set = x[: math.floor(len(x) * 0.8), :]  # math.floor rounds down
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8): , :]
y_validation = y[math.floor(len(y) * 0.8): , :]
print(x_train_set)
print(y_train_set)
print(x_validation)
print(y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))

Result:

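Training

Difference from the figure above: the loss in the code below is the root mean square error (RMSE).

Because the model has a constant term b, one extra dimension (dim) is added: y = wx + b is rewritten as a single product, y = [b w]·[1 x] (the code prepends the column of ones, so b is the first weight). The eps term is a tiny value that keeps the Adagrad denominator from being 0.

Each dimension (dim) has its own gradient and its own weight w, which are learned over repeated iterations (iter_time). Finally, the trained model (the weights w) is saved as a .npy file.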
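For reference, the quantities described above can be written out as follows. This is a sketch derived from the training code in this section; the symbols $\tilde{X}$, $g_t$, $G_t$ and $\eta$ are notation introduced here, not taken from the original post. Note that the gradient used in the code is that of the sum of squared errors, not of the RMSE that is printed.

```latex
% Design matrix with the column of ones prepended, N = 12 \times 471 samples
\tilde{X} = \begin{bmatrix} \mathbf{1} & X \end{bmatrix} \in \mathbb{R}^{N \times (18 \cdot 9 + 1)},
\qquad
w = \begin{bmatrix} b \\ w_1 \\ \vdots \\ w_{162} \end{bmatrix}

% Loss that is printed every 100 iterations (RMSE)
L(w) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left( \tilde{x}_n^{\top} w - y_n \right)^2 }

% Gradient used for the update: gradient of the sum of squared errors
g_t = 2\, \tilde{X}^{\top} \left( \tilde{X} w_t - y \right)

% Adagrad: accumulate squared gradients per dimension, then scale the step
G_t = G_{t-1} + g_t \odot g_t,
\qquad
w_{t+1} = w_t - \eta \, \frac{g_t}{\sqrt{G_t + \epsilon}}
```

Here $\eta$ corresponds to learning_rate, $G_t$ to adagrad, and $\epsilon$ to eps; the division and square root are element-wise.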

dim = 18 * 9 + 1
w = np.zeros([dim, 1])  # the first weight is the bias b, because the column of ones is prepended to x
x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)
learning_rate = 100
iter_time = 1000
adagrad = np.zeros([dim, 1])  # running sum of squared gradients, one entry per dimension
eps = 0.0000000001
for t in range(iter_time):
    # RMSE; np.power(a, b) raises a to the power b
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2)) / 471 / 12)
    if (t % 100 == 0):
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y)  # dim * 1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)
w

Result:

0:27.071214829194115
100:33.78905859777454
200:19.91375129819709
300:13.531068193689686
400:10.645466158446165
500:9.27735345547506
600:8.518042045956497
700:8.014061987588416
800:7.636756824775686
900:7.33656374037112
array([[ 2.13740269e+01],
       [ 3.58888909e+00],
       [ 4.56386323e+00],
       [ 2.16307023e+00],
       [-6.58545223e+00],
       [-3.38885580e+01],
       [ 3.22235518e+01],
       ...
       [-5.57512471e-01],
       [ 8.76239582e-02],
       [ 3.02594902e-01],
       [-4.23463160e-01],
       [ 4.89922051e-01]])
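The training loop above fits on all of x, so the validation split defined earlier is never actually used. Below is a minimal sketch of how one might train on the training portion only and report RMSE on the held-out portion. It runs on synthetic data rather than the PM2.5 set, and the helper `rmse`, the arrays `x_syn` and `y_syn`, and the hyperparameter values are all assumptions made for the example.

```python
import math
import numpy as np

def rmse(features, targets, weights):
    """Root mean square error of the linear model features @ weights."""
    residual = features @ weights - targets
    return float(np.sqrt(np.mean(residual ** 2)))

# Synthetic stand-in data (assumption): y = x @ true_w + 3 + small noise
rng = np.random.default_rng(0)
n, d = 500, 5
x_syn = rng.normal(size=(n, d))
true_w = np.arange(1, d + 1, dtype=float).reshape(-1, 1)
y_syn = x_syn @ true_w + 3.0 + rng.normal(scale=0.1, size=(n, 1))
x_syn = np.concatenate((np.ones((n, 1)), x_syn), axis=1)  # bias column first

# Same 80/20 split by position as in the article
split = math.floor(n * 0.8)
x_tr, y_tr = x_syn[:split], y_syn[:split]
x_val, y_val = x_syn[split:], y_syn[split:]

# Adagrad updates computed from the training portion only
w_syn = np.zeros((d + 1, 1))
adagrad_acc = np.zeros((d + 1, 1))
learning_rate, eps = 1.0, 1e-10
for t in range(2000):
    gradient = 2 * x_tr.T @ (x_tr @ w_syn - y_tr)
    adagrad_acc += gradient ** 2
    w_syn = w_syn - learning_rate * gradient / np.sqrt(adagrad_acc + eps)

train_rmse = rmse(x_tr, y_tr, w_syn)
val_rmse = rmse(x_val, y_val, w_syn)
print("train RMSE:", train_rmse, "validation RMSE:", val_rmse)
```

A validation RMSE well above the training RMSE would suggest overfitting, which is worth checking before adding more features such as quadratic terms.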

Prediction

# Read the test data test.csv
testdata = pd.read_csv('/Users/zhucan/Desktop/李宏毅深度學習作業/第一次作業/test.csv', header = None, encoding = 'big5')
# Drop the first two columns; the values start at the third column
test_data = testdata.iloc[:, 2:]
# Replace the string 'NR' in the rainfall rows with the number 0
test_data[test_data == 'NR'] = 0
# Convert the DataFrame to a NumPy array
test_data = test_data.to_numpy()
# Reshape the test data into 240 samples of dimension 18 * 9 + 1
test_x = np.empty([240, 18*9], dtype = float)
for i in range(240):
    test_x[i, :] = test_data[18 * i: 18* (i + 1), :].reshape(1, -1)
# Standardize with the mean and standard deviation of the training data
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
# Prepend the column of ones for the bias term
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
test_x

Result:

ans_y = np.dot(test_x, w)
ans_y

Result:

Out: array([[ 5.17496040e+00],
       [ 1.83062143e+01],
       [ 2.04912181e+01],
       [ 1.15239429e+01],
       [ 2.66160568e+01],
       ...,
       [ 4.12665445e+01],
       [ 6.90278920e+01],
       [ 4.03462492e+01],
       [ 1.43137440e+01],
       [ 1.57707266e+01]])

Modified code (adding quadratic terms)

# Training set
# x now needs 18 * 9 * 2 columns: the original features plus their squares
x = np.empty([12 * 471, 18 * 9 * 2], dtype = float)
y = np.empty([12 * 471, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            x1 = month_data[month][:, day * 24 + hour: day * 24 + hour + 9].reshape(1, -1)  # vector dim: 18*9
            x[month * 471 + day * 24 + hour, :18 * 9] = x1
            # The quadratic terms of x are added here
            x[month * 471 + day * 24 + hour, 18 * 9: 18 * 9 * 2] = np.power(x1, 2)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]  # value

# Test set
testdata = pd.read_csv('./test.csv', header = None, encoding = 'big5')
test_data = testdata.iloc[:, 2:]
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x1 = np.empty([240, 18*9], dtype = float)
test_x = np.empty([240, 18*9*2], dtype = float)
for i in range(240):
    test_x1 = test_data[18 * i: 18 * (i + 1), :].reshape(1, -1).astype(float)
    # The quadratic terms of the test x are added here in the same way
    test_x[i, : 18 * 9] = test_x1
    test_x[i, 18 * 9:] = np.power(test_x1, 2)
# mean_x and std_x must be recomputed from the extended x (18 * 9 * 2 entries) before this step
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
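The quadratic expansion can also be applied to a whole feature matrix at once. The sketch below uses a hypothetical helper, `add_quadratic_terms`, on a toy array; it is one possible way to express the same idea, not the article's code. After the expansion, the per-column mean and standard deviation have twice as many entries, and the weight vector needs 18 * 9 * 2 + 1 dimensions.

```python
import numpy as np

def add_quadratic_terms(features):
    """Append the element-wise square of every column: (n, d) -> (n, 2d)."""
    return np.concatenate((features, np.power(features, 2)), axis=1)

# Toy stand-in for the unnormalized feature matrix (hypothetical values)
x_small = np.array([[1.0, -2.0],
                    [3.0, 0.5]])
x_quad = add_quadratic_terms(x_small)

# Statistics must be recomputed over all 2d columns before standardizing
mean_quad = x_quad.mean(axis=0)
std_quad = x_quad.std(axis=0)

print(x_quad)
print(mean_quad.shape, std_quad.shape)
```

Applying the same helper to the training and the test features keeps the column order identical, which the shared mean and standard deviation rely on.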
