Research on the Impact of Sentiment Data on an LSTM Stock Prediction Model
Author: 丁紀(jì)翔
Published: 06/28/2021
Abstract: This work investigates the effect of structured sentiment features on an LSTM stock prediction model. Pandas is used to preprocess the given data (loading, cleaning and preparation, wrangling, time-series handling, aggregation, etc.) [1]. With NLTK and the Loughran-McDonald (LM) financial word lists, sentiment analysis is applied to the unstructured text, and the resulting structured features are merged into the purely technical stock indicators. Correlations among the stock indicators are analyzed to reduce dimensionality. A Keras-based LSTM model with MSE as the error metric is then used to predict the closing price (Close). The conclusion is that, when training samples are sufficient, adding the sentiment features moderately improves prediction accuracy.
Experiment brief:
Design a method for predicting stock prices and demonstrate its effectiveness with a concrete example.
All of the provided data must be used; the data need cleaning and the features should be used in combination. Additional resources or data may be supplemented.
Description of the provided data:
全標(biāo)題 (full titles)
a) Analysis articles about various companies published on a stock platform
b) 標(biāo)題: article title
c) 字段1_鏈接_鏈接: URL of the original article
d) ABOUT: the companies the article is about, given as ticker abbreviations, multiple companies separated by commas
e) TIME: publication time of the article
f) AUTHOR: author
g) COMMENTS: number of comments on the article at collection time
摘要 (abstracts)
a) Abstracts of the analysis articles published on the stock platform, corresponding to the entries in 全標(biāo)題
b) 標(biāo)題: article title
c) 字段2: publication time of the article
d) 字段5: the companies the article is about and the companies it mentions;
   i. About: the target companies, extracted as upper-case abbreviations, multiple companies separated by commas
   ii. include: other companies mentioned, extracted as upper-case abbreviations, multiple companies separated by commas
e) 字段1: full text of the abstract
回帖 (replies)
a) Replies posted by users under each article
b) Title: title of each article; an empty title takes the nearest non-empty title below it
c) Content: full text of the reply
論壇 (forum)
a) Posts made by users on each company's forum page
b) 字段1: author
c) 字段2: posting date
d) 字段3: post content
e) 字段4_鏈接: URL of the specific company page
股票價(jià)格 (stock prices)
a) Stock prices of each company on trading days
b) PERMNO: company identifier
c) date: date
d) TICKER: company abbreviation
e) COMNAM: full company name
f) BIDLO: lowest price
g) ASKHI: highest price
h) PRC: closing price
i) VOL: trading volume
j) OPENPRC: opening price
Contents
- Research on the Impact of Sentiment Data on an LSTM Stock Prediction Model
- 1 LSTM
- 1.1 What is LSTM?
- 1.2 Why LSTM?
- 2 Deep-Learning Terminology
- 2.1 Why use more than one epoch?
- 2.2 Batch and Batch_Size
- 2.3 Iterations
- 2.4 Why not shuffle?
- 3 Experiment
- 3.1 Library imports
- 3.2 Pandas core settings
- 3.3 Data loading, cleaning and preparation, wrangling, time-series handling
- 3.3.1 股票價(jià)格.csv
- 3.3.2 論壇.csv
- 3.3.3 全標(biāo)題.xlsx
- 3.3.4 摘要.xlsx
- 3.3.5 回帖
- 3.4 Sentiment analysis
- 3.4.1 Sentiment-analysis approach
- 3.4.2 Word-list import and extra stop words
- 3.4.3 Function definitions
- 3.4.4 Sentiment-analysis processing
- 3.4.5 Aggregating the sentiment features
- 3.5 * Correlation analysis of stock indicators with sentiment data merged in
- 3.5.1 Data join
- 3.5.2 pairplot
- 3.5.3 Correlation analysis of the stock indicators
- 3.6 LSTM prediction on sentiment-enriched stock data
- 3.6.1 Time-series-to-supervised function definition
- 3.6.2 Normalizing the sentiment-enriched stock data
- 3.6.3 Building the supervised dataset from the time series
- 3.6.4 Train/validation split
- 3.6.5 Building the LSTM model with Keras
- 3.6.5 (I) Reshaping the LSTM input X
- 3.6.5 (II) Building the LSTM model and plotting the loss
- 3.6.6 Prediction and inverse normalization
- 3.6.7 Model evaluation
- 3.7 Comparison experiment: predicting from technical indicators only
- 3.7.1 Comparison workflow (generic helper functions)
- 3.7.2 Comparison results
- 3.7.3 Comparison conclusions
- 3.8 Supplementary comparison: predicting with an enlarged AAPL technical-indicator sample
- 3.8.1 Data acquisition
- 3.8.2 Data processing
- 3.8.3 Prediction
- 3.8.4 Result analysis
- 3.9 Full-year 2018 prediction on sentiment-enriched stock data
- 3.9.1 Aggregating the sentiment features
- 3.9.2 Prediction
- 3.9.3 Result analysis
- 4. Conclusions and Summary
- 5. References
Core idea: use an LSTM model to solve the time-series prediction problem on stock data, and use NLTK to analyze the sentiment of the text.
Underlying assumption: history tends to repeat itself. This work rests on the assumption that stock behavior is not completely random but is constrained by regularities of human psychology: faced with similar situations, market participants react similarly based on past experience, so future price movements can be predicted from historical data. Among the technical indicators, the closing price is the most important: it is the price at the end of the day and the reference point for the next day's open, linking two consecutive trading days. [2]
Influencing factors: besides the basic technical indicators, the stock price is closely related to investor mood and to the sentiment of stock-analysis articles.
Analysis method: combine the technical indicators with the sentiment of the investing public [3]. The AAPL stock is selected and its price, i.e. the closing price, is predicted. LSTM models are built separately on samples containing only technical indicators and on samples containing both technical indicators and sentiment features, using MSE (mean squared error) as the loss function, and the two sets of predictions are evaluated against each other.
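For reference, the MSE used throughout as both the training loss and the evaluation metric is

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2,$$

where $y_i$ is the true closing price and $\hat{y}_i$ the predicted closing price of the $i$-th validation day.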
1 LSTM
1.1 What is LSTM?
LSTM (Long Short-Term Memory) networks (Hochreiter & Schmidhuber, 1997) are a special kind of RNN that can learn long-range dependencies and retain information over long histories.
1.2 Why LSTM?
A Deep Neural Network (DNN) takes several inputs, learns a weighted combination of them, and passes the result through an activation function to produce an output, but it does not handle time-series data well. A Recurrent Neural Network (RNN) handles sequential information better, yet it cannot remember long time spans, and standard RNNs are also hard to train: given the initial conditions, convergence is difficult.
LSTM addresses these shortcomings of the RNN. Compared with a plain RNN, an LSTM adds a forget gate, which selectively discards parts of the state passed in from the previous step and keeps the important information to remember, i.e. "forget the unimportant, remember the important". This mitigates the vanishing- and exploding-gradient problems that RNNs suffer from on long sequences and gives better performance on long-sequence training. For these reasons I chose LSTM as the training model for the stock time series.
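For reference, the standard LSTM cell equations (textbook formulation, not specific to this experiment) make the gating explicit; $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$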
2 Deep-Learning Terminology

| Term | Meaning |
| --- | --- |
| Epoch | One complete pass of the entire training set through the model, also called "one generation of training"; it comprises a forward pass and a backward pass |
| Batch | A small subset of the training set used for one backward-pass update of the model weights; this subset is called "a batch of data" |
| Iteration | The process of updating the parameters once with one batch of data, called "one iteration" |
[Source1] https://www.jianshu.com/p/22c50ded4cf7?from=groupmessage
2.1 Why use more than one epoch?
Passing the complete dataset through the network once is not enough; it must be passed through many times. As the number of epochs grows, the number of weight updates grows, and the fitted curve moves from underfitting toward overfitting.
After each epoch the samples are normally shuffled before the next round of training (shuffling is not used in this experiment).
The appropriate number of epochs differs from dataset to dataset.
2.2 Batch and Batch_Size
Most deep-learning frameworks today use mini-batch gradient descent: the data are split into batches of Batch_Size samples each, the weights are updated batch by batch, and the samples within a batch jointly determine the direction of that gradient step.
$$\text{Number of Batches} = \frac{\text{Training Set Size}}{\text{Batch Size}}$$
Mini-batch gradient descent avoids the high computational cost and slow speed of batch gradient descent on large datasets, as well as the noisiness and poor convergence of stochastic gradient descent.
[Source2] https://blog.csdn.net/dancing_power/article/details/97015723
2.3 Iterations
One iteration performs one forward pass and one backward pass. The forward pass produces the prediction y from the features X; the backward pass updates the parameters (weights) according to the given loss function.
$$\text{Number of Iterations per Epoch} = \text{Number of Batches}$$
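As a concrete check with the numbers used later in this experiment (88 training rows, batch_size=30, epochs=50), a small illustrative helper:

```python
import math

def training_schedule(n_samples: int, batch_size: int, epochs: int):
    """Return batches per epoch and total weight updates for a given setup."""
    batches_per_epoch = math.ceil(n_samples / batch_size)  # the last batch may be smaller
    total_iterations = batches_per_epoch * epochs           # one update per batch
    return batches_per_epoch, total_iterations

# 88 training rows, batch_size=30, epochs=50 -> 3 batches per epoch, 150 updates in total
print(training_schedule(88, 30, 50))
```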
2.4 Why not shuffle?
Shuffling normally prevents the order in which the data are fed in from biasing training, adds randomness, and improves generalization.
For this stock-price prediction task, however, the LSTM model must respect the time dimension, so shuffle=False is used and the batches are consumed in chronological order to update the parameters.
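In practice the order has to be preserved wherever it could be changed: in the train/validation split (shown below on a tiny synthetic series) and likewise in Keras, where model.fit shuffles batches by default unless shuffle=False is passed. A minimal, self-contained sketch of the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny synthetic "time series" stand-in: 10 ordered samples, 3 features each
X = np.arange(30, dtype=np.float32).reshape(10, 3)
y = np.arange(10, dtype=np.float32)

# shuffle=False keeps chronological order: the last 30% of days form the validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
print(y_train, y_test)  # [0. ... 6.] vs [7. 8. 9.] -> no future days leak into training
```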
3 Experiment
All of the experiments below predict and analyze the stock of Apple, Inc. (AAPL).
```python
CORPORATIONABBR = 'AAPL'
```
3.1 Library imports
```python
# Core data-analysis libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Time-series handling
from datetime import datetime
from dateutil.parser import parse as dt_parse
# Regular expressions
import re
# os
from os import listdir
# NLTK natural-language processing
import nltk
from nltk.corpus import stopwords
# seaborn pair-plot matrix
from seaborn import pairplot
# sklearn normalization and train/test split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Keras LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
# sklearn MSE
from sklearn.metrics import mean_squared_error
```
3.2 Pandas core settings
```python
# Maximum displayed rows, columns, and column width for pandas
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 999)
pd.set_option('display.max_colwidth', 50)
```
3.3 Data loading, cleaning and preparation, wrangling, time-series handling
3.3.1 股票價(jià)格.csv
```python
sharePrices = pd.read_csv('股票價(jià)格.csv')
sharePrices
```

| PERMNO | date | TICKER | COMNAM | BIDLO | ASKHI | PRC | VOL | OPENPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10026 | 20180702 | JJSF | J & J SNACK FOODS CORP | 150.70000 | 153.27499 | 152.92000 | 100388.0 | 152.17999 |
| 10026 | 20180703 | JJSF | J & J SNACK FOODS CORP | 151.35001 | 153.73000 | 153.32001 | 55547.0 | 153.67000 |
| 10026 | 20180705 | JJSF | J & J SNACK FOODS CORP | 152.46001 | 156.00000 | 155.81000 | 199370.0 | 153.95000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 93436 | 20181227 | TSLA | TESLA INC | 301.50000 | 322.17169 | 316.13000 | 8575133.0 | 319.84000 |
| 93436 | 20181228 | TSLA | TESLA INC | 318.41000 | 336.23999 | 333.87000 | 9938992.0 | 323.10001 |
| 93436 | 20181231 | TSLA | TESLA INC | 325.26001 | 339.20999 | 332.79999 | 6302338.0 | 337.79001 |
941518 rows × 9 columns
Index filtering: select the rows whose TICKER (company abbreviation) is AAPL.

```python
sharePricesAAPL = sharePrices[sharePrices['TICKER']==CORPORATIONABBR]
```
Dropping columns: PERMNO (company identifier), COMNAM (full company name), and TICKER (abbreviation) are no longer needed, so these three columns are deleted.

```python
sharePricesAAPL.drop(['PERMNO', 'COMNAM', 'TICKER'], axis=1, inplace=True)
```
Dtype check: make sure the relevant columns are floats.

```python
sharePricesAAPL.info()
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 126 entries, 163028 to 163153
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     126 non-null    int64
 1   BIDLO    126 non-null    float64
 2   ASKHI    126 non-null    float64
 3   PRC      126 non-null    float64
 4   VOL      126 non-null    float64
 5   OPENPRC  126 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 6.9 KB
```
Index check: verify that the date column contains no duplicates.

```python
sharePricesAAPL['date'].is_unique
```
```
True
```
Time series: convert date to a datetime index and sort the data by it in ascending order.
```python
# Convert the date column to datetime
sharePricesAAPL['date'] = sharePricesAAPL['date'].apply(lambda dt: datetime.strptime(str(dt), '%Y%m%d'))
# Use date as the index
sharePricesAAPL.set_index('date', inplace=True)
# Sort by date in ascending order
sharePricesAAPL.sort_values(by='date', inplace=True, ascending=True)
```

| BIDLO | ASKHI | PRC | VOL | OPENPRC |
| --- | --- | --- | --- | --- |
| 183.42000 | 187.30 | 187.17999 | 17612113.0 | 183.82001 |
| 183.53999 | 187.95 | 183.92000 | 13909764.0 | 187.78999 |
| 184.28000 | 186.41 | 185.39999 | 16592763.0 | 185.25999 |
| ... | ... | ... | ... | ... |
| 150.07001 | 156.77 | 156.14999 | 53117005.0 | 155.84000 |
| 154.55000 | 158.52 | 156.23000 | 42291347.0 | 157.50000 |
| 156.48000 | 159.36 | 157.74001 | 35003466.0 | 158.53000 |
126 rows × 5 columns
Missing values: the per-column missing ratio of the AAPL technical-indicator data is checked, and nothing is missing. If there were gaps, rows with missing BIDLO (low), ASKHI (high), PRC (close), or VOL (volume) could simply be dropped, while a missing OPENPRC (open) could be filled by Lagrange interpolation.
In fact, later inspection of 股票價(jià)格.csv shows that the missing entries all fall on the same rows, so df.dropna(), which drops any row containing a missing value, would be sufficient.
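For completeness, a minimal sketch of the Lagrange-interpolation idea for a hypothetical missing OPENPRC value (not needed for this dataset; scipy and the nearest trading days are used as interpolation nodes, and the helper name is my own):

```python
import numpy as np
from scipy.interpolate import lagrange

def fill_open_lagrange(values: np.ndarray, k: int = 2) -> np.ndarray:
    """Fill NaNs with a Lagrange polynomial through the k nearest valid
    neighbours on each side (illustrative sketch only)."""
    filled = values.astype(float).copy()
    for pos in np.where(np.isnan(filled))[0]:
        neighbours = [i for i in range(max(0, pos - k), min(len(filled), pos + k + 1))
                      if i != pos and not np.isnan(filled[i])]
        if len(neighbours) >= 2:
            poly = lagrange(np.array(neighbours, dtype=float), filled[neighbours])
            filled[pos] = float(poly(pos))
    return filled

# hypothetical usage:
# sharePricesAAPL['OPENPRC'] = fill_open_lagrange(sharePricesAAPL['OPENPRC'].values)
```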
```python
sharePricesAAPL.isnull().mean()
```
```
BIDLO      0.0
ASKHI      0.0
PRC        0.0
VOL        0.0
OPENPRC    0.0
dtype: float64
```
Column renaming and reordering: rename the columns for later convenience, mapping BIDLO→low, ASKHI→high, PRC→close, VOL→vol, OPENPRC→open, and reorder them as open, high, low, vol, close.
```python
# rename
AAPL_newIndex = {'BIDLO': 'low', 'ASKHI': 'high', 'PRC': 'close', 'VOL': 'vol', 'OPENPRC': 'open'}
sharePricesAAPL.rename(columns=AAPL_newIndex, inplace=True)
# reindex
AAPL_newColOrder = ['open', 'high', 'low', 'vol', 'close']
sharePricesAAPL = sharePricesAAPL.reindex(columns=AAPL_newColOrder)
```
Outlier check: no anomalies are found.
```python
sharePricesAAPL.describe()
```

|  | open | high | low | vol | close |
| --- | --- | --- | --- | --- | --- |
| count | 126.000000 | 126.000000 | 126.000000 | 1.260000e+02 | 126.000000 |
| mean | 201.247420 | 203.380885 | 198.893344 | 3.510172e+07 | 201.106033 |
| std | 21.368524 | 21.499932 | 21.596966 | 1.577876e+07 | 21.663971 |
| ... | ... | ... | ... | ... | ... |
| 50% | 207.320000 | 209.375000 | 205.785150 | 3.234006e+07 | 207.760005 |
| 75% | 219.155000 | 222.172503 | 216.798175 | 4.188390e+07 | 219.602500 |
| max | 230.780000 | 233.470000 | 229.780000 | 9.624355e+07 | 232.070010 |
8 rows × 5 columns
Data storage: the processed data are saved as AAPL股票價(jià)格.csv in the folder 補(bǔ)充數(shù)據(jù)1925102007 for later use.

```python
sharePricesAAPL.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv')
```
3.3.2 論壇.csv
| 字段1 | 字段2 | 字段3 | 字段4_鏈接 |
| --- | --- | --- | --- |
| ComputerBlue | 31-Dec-18 | Let's create a small spec POS portfolio $COTY ... | https://seekingalpha.com/symbol/COTY |
| Darren McCammon | 31-Dec-18 | $RICK "Now that we've reported results, we'll ... | https://seekingalpha.com/symbol/RICK |
| Jonathan Cooper | 31-Dec-18 | Do any $APHA shareholders support the $GGB tak... | https://seekingalpha.com/symbol/APHA |
| ... | ... | ... | ... |
| Power Hedge | 1-Jan-18 | USD Expected to Collapse in 2018 https://goo.g... | https://goo.gl/RG1CDd |
| Norman Tweed | 1-Jan-18 | Happy New Year everyone! I'm adding to $MORL @... | https://seekingalpha.com/symbol/MORL |
| User 40986305 | 1-Jan-18 | Jamie Diamond says Trump is most pro business ... | NaN |
25117 rows × 4 columns
Missing values: drop the rows where 字段4 (the company page URL) is missing.

```python
forum = pd.read_csv('論壇.csv')
forum.dropna(inplace=True)
```
String operations and regex: in 字段4 (URL), the text after seekingalpha.com/symbol/ is the company abbreviation. Pandas string operations and a regular expression are used to extract it; rows where extraction fails are dropped, and the content of 字段4 is replaced by the abbreviation.

```python
forum_regExp = re.compile(r'seekingalpha\.com/symbol/([A-Z]+)')

def forumAbbr(link):
    # Return the company abbreviation if found, otherwise a missing value
    res = forum_regExp.search(link)
    return np.NAN if res is None else res.group(1)

forum['字段4_鏈接'] = forum['字段4_鏈接'].apply(forumAbbr)
```
Index filtering: keep all comments whose company abbreviation is AAPL.
Dropping columns: 字段1 (author name) is not needed and is deleted.
Index rebuild: rename 字段3 (post content) to remark.
Time series: convert 字段2 to a datetime index named date and sort in ascending order.

```python
# Index filtering
forum = forum[forum['字段4_鏈接']==CORPORATIONABBR]
# Drop unused columns
forum.drop(['字段1', '字段4_鏈接'], axis=1, inplace=True)
# Rename columns
AAPL_newIndex_forum = {'字段2': 'date', '字段3': 'remark'}
forum.rename(columns=AAPL_newIndex_forum, inplace=True)
# Time series
forum['date'] = forum['date'].apply(lambda dt: datetime.strptime(str(dt), '%d-%b-%y'))
```
Filtering URLs out of the posts: some posts contain web addresses; they are removed with a regular expression so that they do not interfere with the later sentiment analysis.

```python
forum_regExp_linkFilter = re.compile(
    r'(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?')
forum['remark'] = forum['remark'].apply(lambda x: forum_regExp_linkFilter.sub('', x))
forum
```

| date | remark |
| --- | --- |
| 2018-12-26 | Many Chinese companies are encouraging their e... |
| 2018-12-21 | This Week in Germany 🇩🇪 | Apple Smashed 📱 $AAP... |
| 2018-12-21 | $AAPL gets hit with another partial ban in Ger... |
| ... | ... |
| 2018-01-05 | $AAPL. Claims by GHH is 200 billion repatriati... |
| 2018-01-03 | $AAPL Barclays says battery replacement could ... |
| 2018-01-02 | 2018 will be the year for $AAPL to hit the 1 t... |
330 rows × 2 columns
Note also that AAPL should be added as a stop word for the later sentiment analysis.
Data storage: saved as 補(bǔ)充數(shù)據(jù)1925102007/AAPL論壇.csv.

```python
# Save
forum.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL論壇.csv', index=False)
```
3.3.3 全標(biāo)題.xlsx
| 標(biāo)題 | 字段1_鏈接_鏈接 | ABOUT | TIME | AUTHOR | COMMENTS | Unnamed: 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Micron Technology: Insanely Cheap Stock Given ... | https://seekingalpha.com/article/4230920-micro... | MU | Dec. 31, 2018, 7:57 PM | Ruerd Heeg | 75 Comments | NaN |
| Molson Coors Seems Attractive At These Valuations | https://seekingalpha.com/article/4230922-molso... | TAP | Dec. 31, 2018, 7:44 PM | Sanjit Deepalam | 16 Comments | NaN |
| Gerdau: The Brazilian Play On U.S. Steel | https://seekingalpha.com/article/4230917-gerda... | GGB | Dec. 31, 2018, 7:10 PM | Shannon Bruce | 1 Comment | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| Big Changes For Centurylink, AT&T And Verizon ... | https://seekingalpha.com/article/4134687-big-c... | CTL, T, VZ | Jan. 1, 2018, 5:38 AM | EconDad | 32 Comments | NaN |
| UPS: If The Founders Were Alive Today | https://seekingalpha.com/article/4134684-ups-f... | UPS | Jan. 1, 2018, 5:11 AM | Roger Gaebel | 15 Comments | NaN |
| U.S. Silica - Buying The Dip Of This Booming C... | https://seekingalpha.com/article/4134664-u-s-s... | SLCA | Jan. 1, 2018, 12:20 AM | The Value Investor | 27 Comments | NaN |
17928 rows × 7 columns
Index filtering: keep all title rows whose ABOUT is AAPL.
Dropping columns: delete 字段1_鏈接_鏈接, ABOUT, AUTHOR, COMMENTS, and Unnamed: 6.
Index rebuild: rename 標(biāo)題→title and TIME→date.
Time series: convert date to a datetime index and sort in ascending order.
Data storage: saved as 補(bǔ)充數(shù)據(jù)1925102007/AAPL全標(biāo)題.csv.

```python
allTitles = pd.read_excel('全標(biāo)題.xlsx')
# Index filtering
allTitles = allTitles[allTitles['ABOUT']==CORPORATIONABBR]
# Drop unused columns
allTitles.drop(['字段1_鏈接_鏈接','ABOUT','AUTHOR','COMMENTS','Unnamed: 6'], axis=1, inplace=True)
# Rename columns
AAPL_newIndex_allTitles = {'標(biāo)題': 'title', 'TIME': 'date'}
allTitles.rename(columns=AAPL_newIndex_allTitles, inplace=True)
# Time-series handling
# The date formats vary, so dateutil's parse is used to recognize them
allTitles['date'] = allTitles['date'].apply(lambda dt: dt_parse(dt))
# Use date as the index
allTitles.set_index('date', inplace=True)
# Sort by date ascending
allTitles.sort_values(by='date', inplace=True, ascending=True)
# Save
allTitles.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL全標(biāo)題.csv')
allTitles
```

| title |
| --- |
| Apple Ia Above A 'Golden Cross' And Has A Posi... |
| Apple Cash: What Would Warren Buffett Say? |
| Apple's iPhone Battery Replacement Could Consu... |
| ... |
| Will Apple Beat Its Guidance? |
| How Much Stock Could Apple Have Repurchased In... |
| Will Apple Get Its Mojo Back? |
204 rows × 1 columns
3.3.4 摘要.xlsx
| 標(biāo)題 | 字段2 | 字段5 | 字段1 |
| --- | --- | --- | --- |
| HealthEquity: Strong Growth May Be Slowing Hea... | Apr. 1, 2019 10:46 PM ET | About: HealthEquity, Inc. (HQY) | SummaryHealthEquity's revenue and earnings hav... |
| Valero May Rally Up To 40% Within The Next 12 ... | Apr. 1, 2019 10:38 PM ET | About: Valero Energy Corporation (VLO) | SummaryValero is ideally positioned to benefit... |
| Apple Makes A China Move | Apr. 1, 2019 7:21 PM ET | About: Apple Inc. (AAPL) | SummaryCompany cuts prices on many key product... |
| ... | ... | ... | ... |
| Rubicon Technology: A Promising Net-Net Cash-B... | Jul. 24, 2018 2:16 PM ET | About: Rubicon Technology, Inc. (RBCN) | SummaryRubicon is trading well below likely li... |
| Stamps.com: A Cash Machine | Jul. 24, 2018 1:57 PM ET | About: Stamps.com Inc. (STMP) | SummaryThe Momentum Growth Quotient for the co... |
| Can Heineken Turn The 'Mallya Drama' In Its Ow... | Jul. 24, 2018 1:24 PM ET | About: Heineken N.V. (HEINY), Includes: BUD,... | SummaryMallya, United Breweries' chairman, can... |
10131 rows × 4 columns
After inspection, 摘要.xlsx has no missing values. Only 標(biāo)題 (title) and 字段1 (the abstract text) are needed; the other columns are dropped. The columns are renamed 標(biāo)題→title and 字段1→abstract.

```python
abstracts = pd.read_excel('摘要.xlsx')
abstracts.drop(['字段2', '字段5'], axis=1, inplace=True)
newIndex_abstracts = {'標(biāo)題': 'title', '字段1': 'abstract'}
abstracts.rename(columns=newIndex_abstracts, inplace=True)
```
Intersection: the rows whose title matches AAPL全標(biāo)題.csv are the abstracts of articles about AAPL; only those are kept.

```python
abstracts = abstracts.merge(allTitles, on=['title'], how='inner')
```
Saving: stored as 補(bǔ)充數(shù)據(jù)1925102007/AAPL摘要.csv.

```python
abstracts.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL摘要.csv', index=False)
abstracts
```

| title | abstract |
| --- | --- |
| Will Apple Get Its Mojo Back? | SummaryApple has been resting on a reputation ... |
| How Much Stock Could Apple Have Repurchased In... | SummaryApple's stock plummeted from $227.26 to... |
| Will Apple Beat Its Guidance? | SummaryApple has sold fewer iPhones, which gen... |
| ... | ... |
| Apple: Still The Ultimate Value Growth Stock T... | SummaryApple reported superb earnings on Tuesd... |
| Apple In 2023 | SummaryWhere can the iPhone go from here?The A... |
| Apple's Real Value Today | SummaryApple has reached new highs this week.W... |
86 rows × 2 columns
3.3.5 回帖
```python
pd.read_excel('回帖/SA_Comment_Page131-153.xlsx')
```

| 字段 | 標(biāo)題1 |
| --- | --- |
| you should all switch to instagram | NaN |
| Long Facebook and Instagram. They will recover... | NaN |
| Personally, I think people will be buying FB a... | NaN |
| ... | ... |
| Thank you for the article.If you really think ... | Qiwi: The Current Sell-Off Was Too Emotional |
| Isn't WRK much better investment than PKG? Thanks | NaN |
| GuruFocus is also showing a Priotroski score o... | Packaging Corporation Of America: Target Retur... |
19971 rows × 2 columns
```python
pd.read_csv('回帖/SA_Comment_Page181-255(1).csv')
```

| 字段1 | 標(biāo)題 |
| --- | --- |
| I bought at $95 and holding strong. Glad I did... | NaN |
| The price rally you are referring to is not be... | Michael Kors: Potential For Further Upside Ahead |
| only a concern if you own it.... | NaN |
| ... | ... |
| What can Enron Musk do legally to boost balan... | NaN |
| The last two weeks feels like a short squeeze.... | NaN |
| " Tesla is no longer a growth or value proposi... | NaN |
20000 rows × 2 columns
Column renaming: rename the reply-content column to content and the title column to title (note that the .csv and .xlsx files use different column names).
Missing values: per the data description, an empty title takes the nearest non-empty title below it, so missing values are back-filled with df.fillna(method='bfill').
Reading the data files: os.listdir() returns the file names in the folder; every file ending in .xlsx or .csv is a data file and is read, after which the renaming and missing-value handling above are applied.
Reply filtering: iterate over all data files, keep every reply whose title appears in AAPL全標(biāo)題.csv, check for missing values, and save the result to 補(bǔ)充數(shù)據(jù)1925102007/AAPL回帖.csv.

```python
# Read the data files
repliesFiles = listdir('回帖')
allAALPReplies = []
newIndex_replies_csv = {'字段1': 'content', '標(biāo)題': 'title'}
newIndex_replies_xlsx = {'字段': 'content', '標(biāo)題1': 'title'}
# Walk through every reply file and collect the replies related to AAPL
for file in repliesFiles:
    path = '回帖/' + file
    if file.endswith('.csv'):
        replies = pd.read_csv(path)
        newIndex_replies = newIndex_replies_csv
    elif file.endswith('.xlsx'):
        replies = pd.read_excel(path)
        newIndex_replies = newIndex_replies_xlsx
    else:
        print('Wrong file format,', file)
        break
    # Rename columns
    replies.rename(columns=newIndex_replies, inplace=True)
    # Back-fill missing titles
    replies.fillna(method='bfill', inplace=True)
    # Keep only the replies whose title matches an AAPL article
    allAALPReplies.extend(replies.merge(allTitles, on=['title'], how='inner').values)

# All replies matching AAPL article titles
allAALPReplies = pd.DataFrame(allAALPReplies, columns=['content', 'title'])
# Save
allAALPReplies.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL回帖.csv', index=False)
# Show
allAALPReplies
```

| content | title |
| --- | --- |
| Understood. But let me ask you. 64GB of pics i... | iPhone XR And XS May Be Apple's Most Profitabl... |
| Just upgraded from 6 to XS, 256G. Love it. I'l... | iPhone XR And XS May Be Apple's Most Profitabl... |
| Yup, AAPL will grow profits 20% per year despi... | iPhone XR And XS May Be Apple's Most Profitabl... |
| ... | ... |
| With all due respect, never have paid for and ... | Gain Exposure To Apple Through Berkshire Hathaway |
| This one's easy - own both! | Gain Exposure To Apple Through Berkshire Hathaway |
| No Thanks! I like my divys,and splits too much... | Gain Exposure To Apple Through Berkshire Hathaway |
4506 rows × 2 columns
3.4 Sentiment analysis
Third-party NLP library: NLTK (Natural Language Toolkit)
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries.
After installing nltk, the corresponding corpora must be downloaded with nltk.download(). Because the download is slow, I installed the nltk_data package directly; the core data files are included in the supplementary folder.
To improve the efficiency and accuracy of the sentiment analysis, the stop words ['!', ',', '.', '?', '-s', '-ly', '</s> ', 's', 'AAPL', 'apple', '$', '%'] are also added to NLTK's English stop-word list (done below by list concatenation).
[Source3] http://www.nltk.org
Financial sentiment lexicon: LM (Loughran-McDonald) sentiment word lists, 2018 edition.
[Loughran-McDonald Sentiment Word Lists](https://sraf.nd.edu/textual-analysis/resources/#LM Sentiment Word Lists) is an Excel file containing each of the LM sentiment words by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining).
Lexicon path: /補(bǔ)充數(shù)據(jù)1925102007/LoughranMcDonald_SentimentWordLists_2018.xlsx
[Source4] https://sraf.nd.edu/textual-analysis/resources
3.4.1 Sentiment-analysis approach
- Tokenization: tokenize the text (here, the comment data) with NLTK
- Stop words: remove stop words
- Structuring: with the Positive and Negative sheets of the LM lexicon, compute pos and neg values as structured features of the unstructured text (i.e. the shares of positive and negative words among a comment's tokens)
- Aggregation: aggregate the features and resample them by business day (stocks trade on business days)
$$pos = \frac{\text{Number of Positive Words}}{\text{Total Words}}$$

$$neg = \frac{\text{Number of Negative Words}}{\text{Total Words}}$$
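A tiny self-contained illustration of these ratios (hypothetical sentence and three-word stand-in lexicons; the real computation in 3.4.3 uses NLTK tokenization, stop-word removal, and the full LM lists):

```python
sentence = "strong growth despite weak guidance"
pos_words = {"STRONG", "GROWTH"}           # stand-in for the LM Positive sheet
neg_words = {"WEAK", "LOSS", "DECLINE"}    # stand-in for the LM Negative sheet

tokens = [w.upper() for w in sentence.split()]            # 5 tokens
pos = sum(t in pos_words for t in tokens) / len(tokens)   # 2/5 = 0.4
neg = sum(t in neg_words for t in tokens) / len(tokens)   # 1/5 = 0.2
print(pos, neg)
```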
3.4.2 Word-list import and extra stop words
```python
# Import the word lists
wordListsPath = '補(bǔ)充數(shù)據(jù)1925102007/LoughranMcDonald_SentimentWordLists_2018.xlsx'
posWords = pd.read_excel(wordListsPath, header=None, sheet_name='Positive').iloc[:,0].values
negWords = pd.read_excel(wordListsPath, header=None, sheet_name='Negative').iloc[:,0].values

# Extra stop words
extraStopwords = ['!', ',', '.', '?', '-s', '-ly', '</s> ', 's', 'AAPL', 'apple', '$', '%']
stopWs = stopwords.words('english') + extraStopwords
```
3.4.3 Function definitions
```python
def structComment(sentence, posW, negW, stopW):
    """Structure one comment.
    :param sentence: the comment to structure
    :param posW: positive word list
    :param negW: negative word list
    :param stopW: stop words
    :return: (pos, neg), the shares of positive and negative words among the
             comment's tokens after stop-word removal
    """
    # Tokenize
    tokenizer = nltk.word_tokenize(sentence)
    # Remove stop words
    tokenizer = [w.upper() for w in tokenizer if w.lower() not in stopW]
    # Positive words
    posWs = [w for w in tokenizer if w in posW]
    # Negative words
    negWs = [w for w in tokenizer if w in negW]
    # Token count
    len_token = len(tokenizer)
    # Guard against an empty token list (zero denominator)
    if len_token <= 0:
        return 0, 0
    else:
        return len(posWs)/len_token, len(negWs)/len_token


def NLProcessing(fileName, colName):
    """NLP helper: structure the text column colName of the .csv file fileName
    (looked up in 補(bǔ)充數(shù)據(jù)1925102007/) and save the result.
    :param fileName: file name, searched in 補(bǔ)充數(shù)據(jù)1925102007/
    :param colName: the text column to structure
    :return: DataFrame with new pos and neg columns
    """
    pathNLP = '補(bǔ)充數(shù)據(jù)1925102007/' + fileName + '.csv'
    data = pd.read_csv(pathNLP)
    # Build the structured pos and neg columns
    posAndneg = [structComment(st, posWords, negWords, stopWs) for st in data[colName].values]
    # DataFrame of pos and neg
    posAndneg = pd.DataFrame(posAndneg, columns=['pos', 'neg'])
    # Concatenate along the column axis
    data = pd.concat([data, posAndneg], axis=1)
    # Drop the raw text column
    data.drop([colName], axis=1, inplace=True)
    # Save the structured data
    data.to_csv(pathNLP)
    return data
```
3.4.4 Sentiment-analysis processing
```python
# AAPL論壇.csv
forum = NLProcessing('AAPL論壇', 'remark')
# AAPL摘要.csv
abstracts = NLProcessing('AAPL摘要', 'abstract')
# AAPL回帖.csv
allAALPReplies = NLProcessing('AAPL回帖', 'content')
```
3.4.5 Aggregating the sentiment features
Once the structured data with a title column (AAPL回帖.csv and AAPL摘要.csv) are available, the replies and abstracts are first concatenated along the row axis with concat; the result is then merged with AAPL全標(biāo)題.csv (allTitles) on title using an outer merge, and the now-useless title column is dropped. The structured forum data are then concatenated (along the row axis) with the result of the previous step. Finally the data are resampled by business day and the daily means of the pos and neg features are computed.
```python
# Concatenate abstracts and allAALPReplies
allEssaysComment = pd.concat([abstracts, allAALPReplies], ignore_index=True)
# Merge with the titles table
allEssaysComment = allTitles.merge(allEssaysComment, how='outer', on='title')
# Drop rows with missing values
allEssaysComment.dropna(inplace=True)
# Drop the title column
allEssaysComment.drop('title', axis=1, inplace=True)
# Concatenate with the structured forum data
allEssaysComment = pd.concat([allEssaysComment, forum], ignore_index=True)
# Drop rows where both pos and neg are 0
allEssaysComment = allEssaysComment[(allEssaysComment['pos'] + allEssaysComment['neg']) > 0]

# Use date as a datetime index
allEssaysComment['date'] = pd.to_datetime(allEssaysComment['date'])
allEssaysComment.set_index('date', inplace=True)
# Resample by business day, average pos and neg, and fill absent days with 0
allEssaysComment = allEssaysComment.resample('B').mean()
allEssaysComment.fillna(0, inplace=True)
# Save
allEssaysComment.to_csv('補(bǔ)充數(shù)據(jù)1925102007/allPosAndNeg.csv')
# Show
allEssaysComment
```

| pos | neg |
| --- | --- |
| 0.041667 | 0.043478 |
| 0.000000 | 0.000000 |
| 0.000000 | 0.090909 |
| ... | ... |
| 0.000000 | 0.000000 |
| 0.000000 | 0.000000 |
| 0.090909 | 0.090909 |
254 rows × 2 columns
3.5 * Correlation analysis of stock indicators with sentiment data merged in
Method: use seaborn's pairplot to draw pairwise scatter plots (with histograms on the diagonal) of the indicators in AAPL股票價(jià)格.csv (sharePricesAAPL), to explore the relationships between the different indicators.
Goal: analyze the relationships among the stock indicators and find any strongly linearly correlated ones that can be removed, reducing the LSTM training cost.
pairplot documentation: http://seaborn.pydata.org/generated/seaborn.pairplot.html
3.5.1 Data join
Merge the time-indexed sentiment data obtained in 3.4.5 (allPosAndNeg.csv) with AAPL股票價(jià)格.csv (sharePricesAAPL) on date.
During the join it turns out that the comment data span the entire AAPL price data, so missing values are not a concern.
```python
# Read the files
sharePricesAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv')
allPosAndNeg = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/allPosAndNeg.csv')
# Merge
sharePricesAAPLwithEmotion = sharePricesAAPL.merge(allPosAndNeg, how='inner', on='date')
# Turn date into a datetime index
sharePricesAAPLwithEmotion['date'] = pd.DatetimeIndex(sharePricesAAPLwithEmotion['date'])
sharePricesAAPLwithEmotion.set_index('date', inplace=True)
# reindex
AAPL_newColOrder_emotionPrices = ['open', 'high', 'low', 'vol', 'pos', 'neg', 'close']
sharePricesAAPLwithEmotion = sharePricesAAPLwithEmotion.reindex(columns=AAPL_newColOrder_emotionPrices)
# Save
sharePricesAAPLwithEmotion.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格融合情感.csv')
```
3.5.2 pairplot
The essential OHLC technical indicators are kept, and correlation plots are drawn for the remaining vol, pos, and neg columns.
During experimentation I also drew the pair-plot grid of the OHLC indicators themselves; they show high linear correlation with one another.
```python
# Parameters:
# data: pandas.DataFrame [Tidy (long-form) dataframe where each column is a variable and each row is an observation.]
# diag_kind: {'auto', 'hist', 'kde', None} [Kind of plot for the diagonal subplots.]
# kind: {'scatter', 'kde', 'hist', 'reg'} [Kind of plot to make.]
fig1 = pairplot(sharePricesAAPLwithEmotion[['vol', 'pos', 'neg']], diag_kind='hist', kind='reg')
# Save fig1 to 補(bǔ)充數(shù)據(jù)1925102007/
fig1.savefig('補(bǔ)充數(shù)據(jù)1925102007/fig1_a_Grid_of_Axes.png')
```
3.5.3 Correlation analysis of the stock indicators
Looking at the resulting Fig1 (a grid of axes), the indicators vol, pos, and neg are only weakly linearly correlated with one another, so all three are kept as LSTM input features.
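A quick numeric cross-check of the same conclusion (not part of the original notebook) is to print the Pearson correlation matrix directly:

```python
import pandas as pd

# Pairwise Pearson correlations between volume and the two sentiment features;
# values close to 0 support keeping all three columns as LSTM inputs.
merged = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格融合情感.csv')
print(merged[['vol', 'pos', 'neg']].corr())
```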
3.6 LSTM prediction on sentiment-enriched stock data
Libraries used: Keras, scikit-learn, TensorFlow [4]
Prediction target: close (closing price)
Borrowed function: series_to_supervised(data, n_in=1, n_out=1, dropnan=True)
Source: Time Series Forecasting With Python
Purpose: frame a time series as a supervised learning dataset, i.e. convert a univariate or multivariate time series into a supervised-learning dataset.
Arguments:
data: Sequence of observations as a list or NumPy array.
n_in: Number of lag observations as input (X).
n_out: Number of observations as output (y).
dropnan: Boolean whether or not to drop rows with NaN values.
# Since the LSTM already has memory, the n_in and n_out parameters are left at their default of 1 (i.e. a current-state column [t-1] and a next-state column [t] are constructed).
Returns:
Pandas DataFrame of series framed for supervised learning.
3.6.1 Time-series-to-supervised function definition
```python
def series_to_supervised(data, n_in=1):
    # Default parameters
    n_out = 1
    dropnan = True
    # The function is slightly adapted; data must be a time-series stock dataset whose last column is close (the column to predict)
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # Drop the irrelevant next-state [t] columns, keeping only close[t] and the previous-state [t-1] feature columns
    agg.drop(agg.columns[[x for x in range(data.shape[1], 2*data.shape[1]-1)]], axis=1, inplace=True)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
```
3.6.2 Normalizing the sentiment-enriched stock data
```python
# Read the data
sharePricesAAPLwithEmotion = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格融合情感.csv',
                                         parse_dates=['date'], index_col='date').values
# Create the scaler
# feature_range keeps its default (0, 1)
scaler = MinMaxScaler()
# Fit the scaler
scaler = scaler.fit(sharePricesAAPLwithEmotion)
# Normalize
sharePricesAAPLwithEmotion = scaler.fit_transform(sharePricesAAPLwithEmotion)
# Show the first rows
sharePricesAAPLwithEmotion[:5,:]
```
```
array([[0.4316836 , 0.43640137, 0.44272148, 0.06118638, 0.        ,
        0.        , 0.47336914],
       [0.47972885, 0.44433594, 0.44416384, 0.01698249, 0.        ,
        0.        , 0.4351243 ],
       [0.44911044, 0.42553711, 0.45305926, 0.04901593, 0.        ,
        0.        , 0.45248692],
       [0.4510469 , 0.45024426, 0.46411828, 0.05954544, 0.        ,
        0.        , 0.4826372 ],
       [0.50042364, 0.47766101, 0.51340305, 0.08659896, 0.        ,
        0.        , 0.51325663]])
```
3.6.3 Building the supervised dataset from the time series
```python
# Build the supervised dataset with series_to_supervised
sharePricesAAPLwithEmotion = series_to_supervised(sharePricesAAPLwithEmotion)
sharePricesAAPLwithEmotion
```

| var1(t-1) | var2(t-1) | var3(t-1) | var4(t-1) | var5(t-1) | var6(t-1) | var7(t-1) | var7(t) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.431684 | 0.436401 | 0.442721 | 0.061186 | 0.0 | 0.0 | 0.473369 | 0.435124 |
| 0.479729 | 0.444336 | 0.444164 | 0.016982 | 0.0 | 0.0 | 0.435124 | 0.452487 |
| 0.449110 | 0.425537 | 0.453059 | 0.049016 | 0.0 | 0.0 | 0.452487 | 0.482637 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 0.148251 | 0.128906 | 0.104700 | 0.624252 | 0.0 | 0.0 | 0.117316 | 0.045753 |
| 0.105410 | 0.080688 | 0.036543 | 0.994059 | 0.0 | 0.0 | 0.045753 | 0.000000 |
| 0.000000 | 0.000000 | 0.000000 | 0.295643 | 0.0 | 0.0 | 0.000000 | 0.121305 |
122 rows × 8 columns
3.6.4 Train/validation split
```python
# The ndarray dtype must be float32 (default is float64), otherwise feeding the LSTM model later raises an error
sharePricesAAPLwithEmotion = sharePricesAAPLwithEmotion.values.astype(np.float32)
# Training : validation = 7 : 3
X_train, X_test, y_train, y_test = train_test_split(
    sharePricesAAPLwithEmotion[:,:-1], sharePricesAAPLwithEmotion[:,-1],
    test_size=0.3, shuffle=False)
```
3.6.5 Building the LSTM model with Keras
Reference documentation:
Keras core: Dense and Dropout
Keras Activation relu
Keras Losses mean_squared_error
Keras Optimizer adam
Keras LSTM Layers
Keras Sequential Model
3.6.5 (I) Reshaping the LSTM input X
The LSTM input has shape **[samples, timesteps, features]**:
samples: number of samples
timesteps: number of time steps
features (input_dim): dimensionality at each time step
Reshape X_train and X_test:
```python
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
```
3.6.5 (II) Building the LSTM model and plotting the loss
- Build a Sequential model
- Add an LSTM layer (64 hidden-layer neurons, 1 output neuron, with the input_shape argument specified as required for the first layer of a stacked LSTM model) as the regression model
- Set the Dropout rate per training step to 0.4
- Add a Dense (fully connected) layer with an output dimension (units) of 1 and relu (rectified linear unit) activation
- Compile the Sequential model with MSE (mean squared error) as the loss and adam as the optimizer
- Train with epochs=50 and batch_size=30
- Plot the training loss (a reconstruction of these steps is sketched after this list)
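The notebook calls two helpers, LSTMModelGenerate and drawLossGraph, whose bodies are not reproduced in the text above; the sketch below reconstructs them from the bullet points (layer sizes, loss, optimizer, epochs and batch size as listed), so details such as the exact layer stacking and the figure layout are my assumptions rather than the original code:

```python
from matplotlib import pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def LSTMModelGenerate(X_train, X_test, y_train, y_test):
    """Build and train the LSTM regressor described above; return (history, model)."""
    model = Sequential()
    # 64 hidden units; input_shape = (timesteps, features) of the reshaped 3D input
    model.add(LSTM(64, input_shape=(X_train.shape[1], X_train.shape[2])))
    # drop 40% of the units during training
    model.add(Dropout(0.4))
    # single output neuron predicting the normalized close, relu activation
    model.add(Dense(1, activation='relu'))
    # MSE loss with the adam optimizer
    model.compile(loss='mean_squared_error', optimizer='adam')
    history = model.fit(X_train, y_train, epochs=50, batch_size=30,
                        validation_data=(X_test, y_test), shuffle=False)
    return history, model

def drawLossGraph(history, title, num):
    """Plot the training/validation loss curves and save the figure (assumed layout)."""
    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'], label='validation')
    plt.title('Fig' + num + '. ' + title)
    plt.xlabel('epoch')
    plt.ylabel('MSE loss')
    plt.legend()
    plt.savefig('補(bǔ)充數(shù)據(jù)1925102007/fig' + num + '_' + title.replace(' ', '_') + '.png',
                dpi=400, bbox_inches='tight')
    plt.show()

# Assumed invocation for this section, mirroring the calls shown in Section 3.7
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
drawLossGraph(history, title='LSTM Loss Graph for Stock Prices with Emotions', num='2')
```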
(Figure: Fig2. LSTM Loss Graph for Stock Prices with Emotions)
Loss-curve analysis:
Fig2 (the LSTM loss for the sentiment-enriched stock prices) shows that the MSE decreases as the number of epochs grows and levels off (converges) after roughly 30 epochs.
3.6.6 Prediction and inverse normalization
```python
# Only the target column needs to be inverse-normalized,
# so instead of inverse_transform a per-column inverse function is defined
def inverse_transform_col(_scaler, y, n_col):
    """Inverse-normalize a single column.
    :param _scaler: the fitted sklearn scaler
    :param y: the column to inverse-normalize
    :param n_col: the column index y had when the scaler was fitted
    :return: the inverse-normalized y
    """
    y = y.copy()
    y -= _scaler.min_[n_col]
    y /= _scaler.scale_[n_col]
    return y

# Prediction-plot helper
def predictGraph(yTrain, yPredict, yTest, timelabels, title, num):
    """Plot the prediction results.
    :param yTrain: training-set target values
    :param yPredict: predictions on the validation set
    :param yTest: true values of the validation set
    :param timelabels: x-axis tick labels
    :param title: figure title
    :param num: figure number
    :return: None
    """
    len_yTrain = yTrain.shape[0]
    len_y = len_yTrain + yPredict.shape[0]
    # True curve
    plt.plot(np.concatenate([yTrain, yTest]), color='r', label='sample')
    # Predicted curve
    plt.plot([x for x in range(len_yTrain, len_y)], yPredict, color='g', label='predict')
    # Title and axis labels
    plt.title('Fig' + num + '. ' + title)
    plt.xlabel('date')
    plt.ylabel('close')
    plt.legend()
    # Ticks and tick labels
    xticks = [0, len_yTrain, len_y - 1]
    xtick_labels = [timelabels[x] for x in xticks]
    plt.xticks(ticks=xticks, labels=xtick_labels, rotation=30)
    # Save to 補(bǔ)充數(shù)據(jù)1925102007/
    savingPath = '補(bǔ)充數(shù)據(jù)1925102007/fig' + num + '_' + title.replace(' ', '_') + '.png'
    plt.savefig(savingPath, dpi=400, bbox_inches='tight')
    # Show
    plt.show()

# Predict each day's close from the previous day's indicators in X_test
# Note: the array returned by predict must be flattened to shape=(n_samples,)
y_predict = model.predict(X_test)[:,0]

# Inverse normalization
# Re-read AAPL股票價(jià)格融合情感.csv
sharePricesAAPLwithEmotion = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格融合情感.csv')
col_n = sharePricesAAPLwithEmotion.shape[1] - 2
# Inverse-normalize the predictions
inv_yPredict = inverse_transform_col(scaler, y_predict, col_n)
# Inverse-normalize the true values
inv_yTest = inverse_transform_col(scaler, y_test, col_n)
# Inverse-normalize the training targets (to plot the full curve)
inv_yTrain = inverse_transform_col(scaler, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest,
             timelabels=sharePricesAAPLwithEmotion['date'].values,
             title='Prediction Graph of Stock Prices with Emotions', num='3')
```
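The per-column inversion above relies on how sklearn's MinMaxScaler stores its parameters: for every column $j$ the forward transform is $x^{\text{scaled}}_j = x_j \cdot \texttt{scale\_}[j] + \texttt{min\_}[j]$, so the inverse applied to the close column is

$$x_j = \frac{x^{\text{scaled}}_j - \texttt{min\_}[j]}{\texttt{scale\_}[j]},$$

which is exactly what inverse_transform_col computes.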
3.6.7 Model evaluation
Error metric: MSE
```python
# sklearn.metrics.mean_squared_error(y_true, y_pred)
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on the sentiment-enriched stock data:', mse)
```
```
MSE of the prediction on the sentiment-enriched stock data: 160.42007
```
Analysis:
Fig3 shows that the predictions of the LSTM trained on the sentiment-enriched data (green curve) follow the same overall trend as the true values (the latter part of the red curve): when the true value falls or rises, the prediction follows. The fit is good at the start of the prediction window, but the gap between predicted and true values grows as time goes on.
3.7 Comparison experiment: predicting from technical indicators only
For comparison, 補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv is loaded and the same procedure as above is applied to the purely technical stock data without sentiment features.
(The steps are essentially identical, so they are not annotated in detail.)
3.7.1 Comparison workflow (generic helper functions)
```python
def formatData(sharePricesData):
    """Prepare the sample data.
    :param sharePricesData: the sample data
    :return: X_train, X_test, y_train, y_test, scaler
    """
    # Normalize
    _scaler = MinMaxScaler()
    _scaler = _scaler.fit(sharePricesData)
    sharePricesData = _scaler.fit_transform(sharePricesData)
    # Build the supervised dataset
    sharePricesData = series_to_supervised(sharePricesData)
    # dtype float32
    sharePricesData = sharePricesData.values.astype(np.float32)
    # Train/validation split
    _X_train, _X_test, _y_train, _y_test = train_test_split(
        sharePricesData[:,:-1], sharePricesData[:,-1], test_size=0.3, shuffle=False)
    # reshape input
    _X_train = _X_train.reshape((_X_train.shape[0], 1, _X_train.shape[1]))
    _X_test = _X_test.reshape((_X_test.shape[0], 1, _X_test.shape[1]))
    return _X_train, _X_test, _y_train, _y_test, _scaler


def invTransformMulti(_scaler, _y_predict, _y_test, _y_train, _col_n):
    # Inverse-normalize predictions, true values, and training targets in one call
    _inv_yPredict = inverse_transform_col(_scaler, _y_predict, _col_n)
    _inv_yTest = inverse_transform_col(_scaler, _y_test, _col_n)
    _inv_yTrain = inverse_transform_col(_scaler, _y_train, _col_n)
    return _inv_yPredict, _inv_yTest, _inv_yTrain


# Read the data
sharePricesAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv',
                              parse_dates=['date'], index_col='date').values
# Standardized data preparation
X_train, X_test, y_train, y_test, scaler = formatData(sharePricesAAPL)
# Build and train the model
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
# Plot the loss
drawLossGraph(history, title='LSTM Loss Graph for Stock Prices without Emotions', num='4')
# Predict
y_predict = model.predict(X_test)[:,0]
# Inverse normalization
sharePricesAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv')
col_n = sharePricesAAPL.shape[1] - 2
inv_yPredict, inv_yTest, inv_yTrain = invTransformMulti(scaler, y_predict, y_test, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest, timelabels=sharePricesAAPL['date'].values,
             title='Prediction Graph of Stock Prices without Emotions', num='5')
# Mean squared error
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on the technical-indicator-only stock data:', mse)
```
```
MSE of the prediction on the technical-indicator-only stock data: 142.50227
```
3.7.2 Comparison results
Comparing Fig3 and Fig5 (with and without sentiment):
- Mean squared error: judged by error alone, the LSTM prediction of close from the purely technical indicators (sentiment removed) beats the prediction on the sentiment-enriched data; the technical-only prediction is more accurate and, overall, closer to the true values.
MSE (with sentiment features) = 160.42007
MSE (technical indicators only) = 142.50227
- Curve behavior: the prediction curve that includes sentiment information is clearly more responsive than the one without. In Fig3 (with sentiment) the predicted curve rises and falls with the true curve, reproducing its changes (including sudden moves) fairly completely, whereas in Fig5 (technical only) the predicted curve follows the fluctuations of the true curve far less.
Fig3. Prediction Graph of Stock Prices with Emotions
Fig5. Prediction Graph of Stock Prices without Emotions
3.7.3 Comparison conclusions
With the data at hand, the technical-indicator-only prediction is more accurate overall, while the sentiment-enriched prediction is more responsive locally. The results are essentially in line with expectations.
They suggest that stock price movements are not a patternless random walk but are closely related to investor sentiment. Incorporating the sentiment of the investing public from online forums into stock prediction helps to judge upcoming rises and falls, which in turn helps to identify better entry and exit points and to assess investment risk. Sentiment information can thus support investors and analysts in making better decisions in quantitative investing.
3.8 Supplementary comparison: predicting with an enlarged AAPL technical-indicator sample
During the data join it became apparent that the provided 補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv does not cover the whole span of the comment data (allPosAndNeg.csv).
Moreover, the sample is small: after the 7:3 train/validation split, the training set has only 88 rows.
Therefore full-year 2018 AAPL trading-day technical data provided by the 英為財(cái)情 (Investing.com) quote site are used to predict the closing price (close) with the same method, for comparison with the experiment in Section 3.7.
In fact:
AAPL股票價(jià)格.csv covers 2018-07-02 through 2018-12-31, while
allPosAndNeg.csv covers 2018-01-05 through 2018-12-31.
3.8.1 Data acquisition
Download the last five years of AAPL technical stock data from the AAPL quote page and store them as 補(bǔ)充數(shù)據(jù)1925102007/AAPLHistoricalData_5years.csv.
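As a scriptable alternative to the manual download (not used in the original experiment), the same daily OHLCV history could be pulled with the third-party yfinance package; the ticker and output path below are my assumptions, and yfinance column names and price adjustments differ from the CSV used in the next subsection:

```python
import yfinance as yf

# Daily OHLCV for AAPL over 2018 (hypothetical alternative data source)
aapl_2018 = yf.download('AAPL', start='2018-01-01', end='2019-01-01', progress=False)
aapl_2018.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018_yfinance.csv')
```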
3.8.2 Data processing
```python
# Read the data
allYearAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPLHistoricalData_5years.csv',
                          parse_dates=['Date'], index_col='Date')
# Slice the datetime index
allYearAAPL = allYearAAPL['2018-12-31':'2018-01-01']
# Sort
allYearAAPL.sort_index(inplace=True)
# Show
allYearAAPL
```

| Close/Last | Volume | Open | High | Low |
| --- | --- | --- | --- | --- |
| $43.065 | 101602160 | $42.54 | $43.075 | $42.315 |
| $43.0575 | 117844160 | $43.1325 | $43.6375 | $42.99 |
| $43.2575 | 89370600 | $43.135 | $43.3675 | $43.02 |
| ... | ... | ... | ... | ... |
| $39.0375 | 206435400 | $38.96 | $39.1925 | $37.5175 |
| $39.0575 | 166962400 | $39.375 | $39.63 | $38.6375 |
| $39.435 | 137997560 | $39.6325 | $39.84 | $39.12 |
251 rows × 5 columns
```python
# Strip the leading $ with pandas string slicing and convert the price columns to float
allYearAAPL[['Close/Last', 'Open', 'High', 'Low']] = allYearAAPL[
    ['Close/Last', 'Open', 'High', 'Low']].apply(lambda x: (x.str[1:]).astype(np.float32))
# reindex
allAAPL_newColOrder = ['Open', 'High', 'Low', 'Volume', 'Close/Last']
allYearAAPL = allYearAAPL.reindex(columns=allAAPL_newColOrder)
# Save as AAPL2018allYearData.csv
allYearAAPL.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData.csv')
# Show
allYearAAPL
```

| Open | High | Low | Volume | Close/Last |
| --- | --- | --- | --- | --- |
| 42.540001 | 43.075001 | 42.314999 | 101602160 | 43.064999 |
| 43.132500 | 43.637501 | 42.990002 | 117844160 | 43.057499 |
| 43.134998 | 43.367500 | 43.020000 | 89370600 | 43.257500 |
| ... | ... | ... | ... | ... |
| 38.959999 | 39.192501 | 37.517502 | 206435400 | 39.037498 |
| 39.375000 | 39.630001 | 38.637501 | 166962400 | 39.057499 |
| 39.632500 | 39.840000 | 39.119999 | 137997560 | 39.435001 |
251 rows × 5 columns
3.8.3 Prediction
```python
# Standardized data preparation
X_train, X_test, y_train, y_test, scaler = formatData(allYearAAPL)
# Build and train the model
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
# Plot the loss
drawLossGraph(history, title='LSTM Loss Graph for 2018 All Year AAPL Stock Prices', num='6')
# Predict
y_predict = model.predict(X_test)[:,0]
# Inverse normalization
allYearAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData.csv')
col_n = allYearAAPL.shape[1] - 2
inv_yPredict, inv_yTest, inv_yTrain = invTransformMulti(scaler, y_predict, y_test, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest, timelabels=allYearAAPL['Date'].values,
             title='Prediction Graph of 2018 All Year AAPL Stock Prices', num='7')
# Mean squared error
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on the 2018 full-year technical-only AAPL data:', mse)
```
3.8.4 Result analysis
Comparing Fig7 (Prediction Graph of 2018 All Year AAPL Stock Prices) and its MSE with the sentiment-free AAPL prediction of Section 3.7 shows that, after extending the time series from 2018-07-02~2018-12-31 to 2018-01-01~2018-12-31, the accuracy of the technical-only prediction improves dramatically and the LSTM fits very well.
This suggests that the low accuracy of Fig3 and Fig5 (the with- and without-sentiment predictions before the data were extended), and their growing deviation from the true values over time, stems from the small sample size, which left the LSTM under-trained. Next, the sentiment features are merged into the extended 2018 full-year AAPL data and the sentiment-enriched prediction is repeated to verify this inference.
3.9 Full-year 2018 prediction on sentiment-enriched stock data
3.9.1 Aggregating the sentiment features
```python
# Read the files
allYearAAPL_withEmos = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData.csv')
allPosAndNeg = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/allPosAndNeg.csv')
# Merge
allYearAAPL_withEmos = allYearAAPL_withEmos.merge(
    allPosAndNeg, how='inner', left_on='Date', right_on='date').drop('date', axis=1)
# Turn Date into a datetime index
allYearAAPL_withEmos['Date'] = pd.DatetimeIndex(allYearAAPL_withEmos['Date'])
allYearAAPL_withEmos.set_index('Date', inplace=True)
# reindex
allYearAAPLwithEmos_newColOrder = ['Open','High','Low','Volume','pos','neg','Close/Last']
allYearAAPL_withEmos = allYearAAPL_withEmos.reindex(columns=allYearAAPLwithEmos_newColOrder)
# Save
allYearAAPL_withEmos.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData_withEmos.csv')
# Show
allYearAAPL_withEmos
```

| Open | High | Low | Volume | pos | neg | Close/Last |
| --- | --- | --- | --- | --- | --- | --- |
| 43.3600 | 43.8425 | 43.2625 | 94359720 | 0.041667 | 0.043478 | 43.7500 |
| 43.5875 | 43.9025 | 43.4825 | 82095480 | 0.000000 | 0.000000 | 43.5875 |
| 43.6375 | 43.7650 | 43.3525 | 86128800 | 0.000000 | 0.090909 | 43.5825 |
| ... | ... | ... | ... | ... | ... | ... |
| 39.2150 | 39.5400 | 37.4075 | 381991600 | 0.000000 | 0.000000 | 37.6825 |
| 37.0375 | 37.8875 | 36.6475 | 148676920 | 0.000000 | 0.000000 | 36.7075 |
| 37.0750 | 39.3075 | 36.6800 | 232535400 | 0.090909 | 0.090909 | 39.2925 |
245 rows × 7 columns
3.9.2 Prediction
```python
# Standardized data preparation
X_train, X_test, y_train, y_test, scaler = formatData(allYearAAPL_withEmos)
# Build and train the model
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
# Plot the loss
drawLossGraph(history, title='LSTM Loss Graph for 2018 All Year AAPL Stock Prices with Emotions', num='8')
# Predict
y_predict = model.predict(X_test)[:,0]
# Inverse normalization
allYearAAPL_withEmos = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData_withEmos.csv')
col_n = allYearAAPL_withEmos.shape[1] - 2
inv_yPredict, inv_yTest, inv_yTrain = invTransformMulti(scaler, y_predict, y_test, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest, timelabels=allYearAAPL_withEmos['Date'].values,
             title='Prediction Graph of 2018 All Year AAPL Stock Prices with Emotions', num='9')
# Mean squared error
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on the 2018 full-year sentiment-enriched AAPL data:', mse)
```
```
MSE of the prediction on the 2018 full-year sentiment-enriched AAPL data: 1.5526791
```
3.9.3 Result analysis
Training loss: comparing Fig2 (LSTM Loss Graph for Stock Prices with Emotions) with Fig8 (LSTM Loss Graph for 2018 All Year AAPL Stock Prices with Emotions), the model trained on the full-year 2018 sentiment-enriched data converges after about 10 epochs, whereas training on the partial sentiment-enriched data needs about 20 epochs to converge. With more training samples, the LSTM needs fewer epochs for the loss to converge and fits better.
Prediction plots: Fig7 and Fig9 (the 2018 full-year predictions with technical indicators only and with sentiment features added) differ very little visually. Their MSE values, however, show that MSE (2018 full year with sentiment) < MSE (2018 full year technical only): once the sample is enlarged so that the comment-sentiment data cover the full span of the technical indicators, adding the sentiment features to the purely technical stock data improves the accuracy of the close prediction.
MSE (2018 full year, with sentiment features) = 1.5526791
MSE (2018 full year, technical indicators only) = 1.7402486
4. Conclusions and Summary
This experiment investigated the effect of structured sentiment features on an LSTM stock prediction model. Pandas was used to preprocess the given data (loading, cleaning and preparation, wrangling, time-series handling, aggregation, etc.) to ensure its usability. NLTK and the LM financial word lists were then used to run sentiment analysis on the unstructured text, and the resulting structured features were merged into the purely technical stock data. Correlations among the stock indicators were analyzed to reduce dimensionality and speed up training. A Keras-based LSTM with MSE as the error metric was used to predict the closing price (Close), on the partial stock data and on the full-year 2018 data, each with and without sentiment.
The results show that when the LSTM predicts the closing price from a small training sample, the predictions drift badly away from the true values over time regardless of whether sentiment is included, i.e. accuracy is low; adding sentiment makes the predictions more responsive, with rises and falls that better match the true values, at the cost of some accuracy. With sufficient training samples, however, accuracy improves dramatically, and adding the sentiment features increases the responsiveness appropriately, raising the overall accuracy further.
5. References
[1] Wes McKinney. Python for Data Analysis (利用Python進(jìn)行數(shù)據(jù)分析) [M]. 機(jī)械工業(yè)出版社, 2013.
[2] 洪志令, 吳梅紅. 股票大數(shù)據(jù)挖掘?qū)崙?zhàn)——股票分析篇 [M]. 清華大學(xué)出版社, 2020.
[3] 楊妥, 李萬(wàn)龍, 鄭山紅. 融合情感分析與SVM_LSTM模型的股票指數(shù)預(yù)測(cè) [J]. 軟件導(dǎo)刊, 2020(8): 14-18.
[4] Francesca Lazzeri. Machine Learning for Time Series Forecasting with Python [M]. Wiley, 2020.
Dataset download:
Baidu Netdisk: https://pan.baidu.com/s/1tC1AFx0kMHPUGobvqf47pg
HQU cloud: https://pan.hqu.edu.cn/share/a474d56c6b6557f7a7fd0e0eb7
Password: ued8