Research on the Impact of Sentiment Data on an LSTM Stock Prediction Model
Author: 丁紀(jì)翔
Published: 06/28/2021
Abstract: This work investigates the effect of structured sentiment features on an LSTM stock prediction model. Pandas is used to preprocess the given data (loading, cleaning and preparation, wrangling, time-series handling, aggregation, etc.) [1]. With NLTK and the Loughran-McDonald (LM) financial word lists, sentiment analysis is applied to the unstructured text, and the resulting structured features are merged into the purely technical stock indicators. Correlations among the stock indicators are analyzed to reduce dimensionality. A Keras-based LSTM model with MSE as the error metric is then used to predict the closing price (Close). The conclusion is that, when training samples are sufficient, adding the sentiment features moderately improves prediction accuracy.
Experiment brief:
Design a method for predicting stock prices and demonstrate its effectiveness with a concrete example.
All of the provided data must be used; the data need cleaning and the features should be used in combination. Additional resources or data may be supplemented.
Description of the provided data:
全標(biāo)題 (full titles)
a) Analysis articles about various companies published on a stock platform
b) 標(biāo)題: article title
c) 字段1_鏈接_鏈接: URL of the original article
d) ABOUT: the companies the article is about, given as ticker abbreviations, multiple companies separated by commas
e) TIME: publication time of the article
f) AUTHOR: author
g) COMMENTS: number of comments on the article at collection time
摘要 (abstracts)
a) Abstracts of the analysis articles published on the stock platform, corresponding to the entries in 全標(biāo)題
b) 標(biāo)題: article title
c) 字段2: publication time of the article
d) 字段5: the companies the article is about and the companies it mentions;
   i. About: the target companies, extracted as upper-case abbreviations, multiple companies separated by commas
   ii. include: other companies mentioned, extracted as upper-case abbreviations, multiple companies separated by commas
e) 字段1: full text of the abstract
回帖 (replies)
a) Replies posted by users under each article
b) Title: title of each article; an empty title takes the nearest non-empty title below it
c) Content: full text of the reply
論壇 (forum)
a) Posts made by users on each company's forum page
b) 字段1: author
c) 字段2: posting date
d) 字段3: post content
e) 字段4_鏈接: URL of the specific company page
股票價(jià)格 (stock prices)
a) Stock prices of each company on trading days
b) PERMNO: company identifier
c) date: date
d) TICKER: company abbreviation
e) COMNAM: full company name
f) BIDLO: lowest price
g) ASKHI: highest price
h) PRC: closing price
i) VOL: trading volume
j) OPENPRC: opening price
Contents
- Research on the Impact of Sentiment Data on an LSTM Stock Prediction Model
- 1 LSTM
- 1.1 What is LSTM?
- 1.2 Why LSTM?
- 2 Deep-Learning Terminology
- 2.1 Why use more than one epoch?
- 2.2 Batch and Batch_Size
- 2.3 Iterations
- 2.4 Why not shuffle?
- 3 Experiment
- 3.1 Library imports
- 3.2 Pandas core settings
- 3.3 Data loading, cleaning and preparation, wrangling, time-series handling
- 3.3.1 股票價(jià)格.csv
- 3.3.2 論壇.csv
- 3.3.3 全標(biāo)題.xlsx
- 3.3.4 摘要.xlsx
- 3.3.5 回帖
- 3.4 Sentiment analysis
- 3.4.1 Sentiment-analysis approach
- 3.4.2 Word-list import and extra stop words
- 3.4.3 Function definitions
- 3.4.4 Sentiment-analysis processing
- 3.4.5 Aggregating the sentiment features
- 3.5 * Correlation analysis of stock indicators with sentiment data merged in
- 3.5.1 Data join
- 3.5.2 pairplot
- 3.5.3 Correlation analysis of the stock indicators
- 3.6 LSTM prediction on sentiment-enriched stock data
- 3.6.1 Time-series-to-supervised function definition
- 3.6.2 Normalizing the sentiment-enriched stock data
- 3.6.3 Building the supervised dataset from the time series
- 3.6.4 Train/validation split
- 3.6.5 Building the LSTM model with Keras
- 3.6.5 (I) Reshaping the LSTM input X
- 3.6.5 (II) Building the LSTM model and plotting the loss
- 3.6.6 Prediction and inverse normalization
- 3.6.7 Model evaluation
- 3.7 Comparison experiment: predicting from technical indicators only
- 3.7.1 Comparison workflow (generic helper functions)
- 3.7.2 Comparison results
- 3.7.3 Comparison conclusions
- 3.8 Supplementary comparison: predicting with an enlarged AAPL technical-indicator sample
- 3.8.1 Data acquisition
- 3.8.2 Data processing
- 3.8.3 Prediction
- 3.8.4 Result analysis
- 3.9 Full-year 2018 prediction on sentiment-enriched stock data
- 3.9.1 Aggregating the sentiment features
- 3.9.2 Prediction
- 3.9.3 Result analysis
- 4. Conclusions and Summary
- 5. References
Core idea: use an LSTM model to solve the time-series prediction problem on stock data, and use NLTK to analyze the sentiment of the text.
Underlying assumption: history tends to repeat itself. This work rests on the assumption that stock behavior is not completely random but is constrained by regularities of human psychology: faced with similar situations, market participants react similarly based on past experience, so future price movements can be predicted from historical data. Among the technical indicators, the closing price is the most important: it is the price at the end of the day and the reference point for the next day's open, linking two consecutive trading days. [2]
Influencing factors: besides the basic technical indicators, the stock price is closely related to investor mood and to the sentiment of stock-analysis articles.
Analysis method: combine the technical indicators with the sentiment of the investing public [3]. The AAPL stock is selected and its price, i.e. the closing price, is predicted. LSTM models are built separately on samples containing only technical indicators and on samples containing both technical indicators and sentiment features, using MSE (mean squared error) as the loss function, and the two sets of predictions are evaluated against each other.
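For reference, the MSE used throughout as both the training loss and the evaluation metric is

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2,$$

where $y_i$ is the true closing price and $\hat{y}_i$ the predicted closing price of the $i$-th validation day.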
1 LSTM
1.1 What is LSTM?
LSTM (Long Short-Term Memory) networks (Hochreiter & Schmidhuber, 1997) are a special kind of RNN that can learn long-range dependencies and retain information over long histories.
1.2 Why LSTM?
A Deep Neural Network (DNN) takes several inputs, learns a weighted combination of them, and passes the result through an activation function to produce an output, but it does not handle time-series data well. A Recurrent Neural Network (RNN) handles sequential information better, yet it cannot remember long time spans, and standard RNNs are also hard to train: given the initial conditions, convergence is difficult.
LSTM addresses these shortcomings of the RNN. Compared with a plain RNN, an LSTM adds a forget gate, which selectively discards parts of the state passed in from the previous step and keeps the important information to remember, i.e. "forget the unimportant, remember the important". This mitigates the vanishing- and exploding-gradient problems that RNNs suffer from on long sequences and gives better performance on long-sequence training. For these reasons I chose LSTM as the training model for the stock time series.
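For reference, the standard LSTM cell equations (textbook formulation, not specific to this experiment) make the gating explicit; $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$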
2 Deep-Learning Terminology

| Term | Meaning |
| --- | --- |
| Epoch | One complete pass of the entire training set through the model, also called "one generation of training"; it comprises a forward pass and a backward pass |
| Batch | A small subset of the training set used for one backward-pass update of the model weights; this subset is called "a batch of data" |
| Iteration | The process of updating the parameters once with one batch of data, called "one iteration" |
[Source1] https://www.jianshu.com/p/22c50ded4cf7?from=groupmessage
2.1 Why use more than one epoch?
Passing the complete dataset through the network once is not enough; it must be passed through many times. As the number of epochs grows, the number of weight updates grows, and the fitted curve moves from underfitting toward overfitting.
After each epoch the samples are normally shuffled before the next round of training (shuffling is not used in this experiment).
The appropriate number of epochs differs from dataset to dataset.
2.2 Batch and Batch_Size
Most deep-learning frameworks today use mini-batch gradient descent: the data are split into batches of Batch_Size samples each, the weights are updated batch by batch, and the samples within a batch jointly determine the direction of that gradient step.
$$\text{Number of Batches} = \frac{\text{Training Set Size}}{\text{Batch Size}}$$
Mini-batch gradient descent avoids the high computational cost and slow speed of batch gradient descent on large datasets, as well as the noisiness and poor convergence of stochastic gradient descent.
[Source2] https://blog.csdn.net/dancing_power/article/details/97015723
2.3 Iterations
One iteration performs one forward pass and one backward pass. The forward pass produces the prediction y from the features X; the backward pass updates the parameters (weights) according to the given loss function.
$$\text{Number of Iterations per Epoch} = \text{Number of Batches}$$
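As a concrete check with the numbers used later in this experiment (88 training rows, batch_size=30, epochs=50), a small illustrative helper:

```python
import math

def training_schedule(n_samples: int, batch_size: int, epochs: int):
    """Return batches per epoch and total weight updates for a given setup."""
    batches_per_epoch = math.ceil(n_samples / batch_size)  # the last batch may be smaller
    total_iterations = batches_per_epoch * epochs           # one update per batch
    return batches_per_epoch, total_iterations

# 88 training rows, batch_size=30, epochs=50 -> 3 batches per epoch, 150 updates in total
print(training_schedule(88, 30, 50))
```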
2.4 Why not shuffle?
Shuffling normally prevents the order in which the data are fed in from biasing training, adds randomness, and improves generalization.
For this stock-price prediction task, however, the LSTM model must respect the time dimension, so shuffle=False is used and the batches are consumed in chronological order to update the parameters.
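In practice the order has to be preserved wherever it could be changed: in the train/validation split (shown below on a tiny synthetic series) and likewise in Keras, where model.fit shuffles batches by default unless shuffle=False is passed. A minimal, self-contained sketch of the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny synthetic "time series" stand-in: 10 ordered samples, 3 features each
X = np.arange(30, dtype=np.float32).reshape(10, 3)
y = np.arange(10, dtype=np.float32)

# shuffle=False keeps chronological order: the last 30% of days form the validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
print(y_train, y_test)  # [0. ... 6.] vs [7. 8. 9.] -> no future days leak into training
```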
3 Experiment
All of the experiments below predict and analyze the stock of Apple, Inc. (AAPL).
```python
CORPORATIONABBR = 'AAPL'
```
3.1 Library imports
```python
# Core data-analysis libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Time-series handling
from datetime import datetime
from dateutil.parser import parse as dt_parse
# Regular expressions
import re
# os
from os import listdir
# NLTK natural-language processing
import nltk
from nltk.corpus import stopwords
# seaborn pair-plot matrix
from seaborn import pairplot
# sklearn normalization and train/test split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Keras LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
# sklearn MSE
from sklearn.metrics import mean_squared_error
```
3.2 Pandas core settings
```python
# Maximum displayed rows, columns, and column width for pandas
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 999)
pd.set_option('display.max_colwidth', 50)
```
3.3 Data loading, cleaning and preparation, wrangling, time-series handling
3.3.1 股票價(jià)格.csv
```python
sharePrices = pd.read_csv('股票價(jià)格.csv')
sharePrices
```

| PERMNO | date | TICKER | COMNAM | BIDLO | ASKHI | PRC | VOL | OPENPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10026 | 20180702 | JJSF | J & J SNACK FOODS CORP | 150.70000 | 153.27499 | 152.92000 | 100388.0 | 152.17999 |
| 10026 | 20180703 | JJSF | J & J SNACK FOODS CORP | 151.35001 | 153.73000 | 153.32001 | 55547.0 | 153.67000 |
| 10026 | 20180705 | JJSF | J & J SNACK FOODS CORP | 152.46001 | 156.00000 | 155.81000 | 199370.0 | 153.95000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 93436 | 20181227 | TSLA | TESLA INC | 301.50000 | 322.17169 | 316.13000 | 8575133.0 | 319.84000 |
| 93436 | 20181228 | TSLA | TESLA INC | 318.41000 | 336.23999 | 333.87000 | 9938992.0 | 323.10001 |
| 93436 | 20181231 | TSLA | TESLA INC | 325.26001 | 339.20999 | 332.79999 | 6302338.0 | 337.79001 |
941518 rows × 9 columns
Index filtering: select the rows whose TICKER (company abbreviation) is AAPL.

```python
sharePricesAAPL = sharePrices[sharePrices['TICKER']==CORPORATIONABBR]
```
Dropping columns: PERMNO (company identifier), COMNAM (full company name), and TICKER (abbreviation) are no longer needed, so these three columns are deleted.

```python
sharePricesAAPL.drop(['PERMNO', 'COMNAM', 'TICKER'], axis=1, inplace=True)
```
Dtype check: make sure the relevant columns are floats.

```python
sharePricesAAPL.info()
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 126 entries, 163028 to 163153
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     126 non-null    int64
 1   BIDLO    126 non-null    float64
 2   ASKHI    126 non-null    float64
 3   PRC      126 non-null    float64
 4   VOL      126 non-null    float64
 5   OPENPRC  126 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 6.9 KB
```
Index check: verify that the date column contains no duplicates.

```python
sharePricesAAPL['date'].is_unique
```
```
True
```
Time series: convert date to a datetime index and sort the data by it in ascending order.
```python
# Convert the date column to datetime
sharePricesAAPL['date'] = sharePricesAAPL['date'].apply(lambda dt: datetime.strptime(str(dt), '%Y%m%d'))
# Use date as the index
sharePricesAAPL.set_index('date', inplace=True)
# Sort by date in ascending order
sharePricesAAPL.sort_values(by='date', inplace=True, ascending=True)
```

| BIDLO | ASKHI | PRC | VOL | OPENPRC |
| --- | --- | --- | --- | --- |
| 183.42000 | 187.30 | 187.17999 | 17612113.0 | 183.82001 |
| 183.53999 | 187.95 | 183.92000 | 13909764.0 | 187.78999 |
| 184.28000 | 186.41 | 185.39999 | 16592763.0 | 185.25999 |
| ... | ... | ... | ... | ... |
| 150.07001 | 156.77 | 156.14999 | 53117005.0 | 155.84000 |
| 154.55000 | 158.52 | 156.23000 | 42291347.0 | 157.50000 |
| 156.48000 | 159.36 | 157.74001 | 35003466.0 | 158.53000 |
126 rows × 5 columns
Missing values: the per-column missing ratio of the AAPL technical-indicator data is checked, and nothing is missing. If there were gaps, rows with missing BIDLO (low), ASKHI (high), PRC (close), or VOL (volume) could simply be dropped, while a missing OPENPRC (open) could be filled by Lagrange interpolation.
In fact, later inspection of 股票價(jià)格.csv shows that the missing entries all fall on the same rows, so df.dropna(), which drops any row containing a missing value, would be sufficient.
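For completeness, a minimal sketch of the Lagrange-interpolation idea for a hypothetical missing OPENPRC value (not needed for this dataset; scipy and the nearest trading days are used as interpolation nodes, and the helper name is my own):

```python
import numpy as np
from scipy.interpolate import lagrange

def fill_open_lagrange(values: np.ndarray, k: int = 2) -> np.ndarray:
    """Fill NaNs with a Lagrange polynomial through the k nearest valid
    neighbours on each side (illustrative sketch only)."""
    filled = values.astype(float).copy()
    for pos in np.where(np.isnan(filled))[0]:
        neighbours = [i for i in range(max(0, pos - k), min(len(filled), pos + k + 1))
                      if i != pos and not np.isnan(filled[i])]
        if len(neighbours) >= 2:
            poly = lagrange(np.array(neighbours, dtype=float), filled[neighbours])
            filled[pos] = float(poly(pos))
    return filled

# hypothetical usage:
# sharePricesAAPL['OPENPRC'] = fill_open_lagrange(sharePricesAAPL['OPENPRC'].values)
```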
```python
sharePricesAAPL.isnull().mean()
```
```
BIDLO      0.0
ASKHI      0.0
PRC        0.0
VOL        0.0
OPENPRC    0.0
dtype: float64
```
Column renaming and reordering: rename the columns for later convenience, mapping BIDLO→low, ASKHI→high, PRC→close, VOL→vol, OPENPRC→open, and reorder them as open, high, low, vol, close.
```python
# rename
AAPL_newIndex = {'BIDLO': 'low', 'ASKHI': 'high', 'PRC': 'close', 'VOL': 'vol', 'OPENPRC': 'open'}
sharePricesAAPL.rename(columns=AAPL_newIndex, inplace=True)
# reindex
AAPL_newColOrder = ['open', 'high', 'low', 'vol', 'close']
sharePricesAAPL = sharePricesAAPL.reindex(columns=AAPL_newColOrder)
```
Outlier check: no anomalies are found.
```python
sharePricesAAPL.describe()
```

|  | open | high | low | vol | close |
| --- | --- | --- | --- | --- | --- |
| count | 126.000000 | 126.000000 | 126.000000 | 1.260000e+02 | 126.000000 |
| mean | 201.247420 | 203.380885 | 198.893344 | 3.510172e+07 | 201.106033 |
| std | 21.368524 | 21.499932 | 21.596966 | 1.577876e+07 | 21.663971 |
| ... | ... | ... | ... | ... | ... |
| 50% | 207.320000 | 209.375000 | 205.785150 | 3.234006e+07 | 207.760005 |
| 75% | 219.155000 | 222.172503 | 216.798175 | 4.188390e+07 | 219.602500 |
| max | 230.780000 | 233.470000 | 229.780000 | 9.624355e+07 | 232.070010 |
8 rows × 5 columns
Data storage: the processed data are saved as AAPL股票價(jià)格.csv in the folder 補(bǔ)充數(shù)據(jù)1925102007 for later use.

```python
sharePricesAAPL.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv')
```
3.3.2 論壇.csv
| 字段1 | 字段2 | 字段3 | 字段4_鏈接 |
| --- | --- | --- | --- |
| ComputerBlue | 31-Dec-18 | Let's create a small spec POS portfolio $COTY ... | https://seekingalpha.com/symbol/COTY |
| Darren McCammon | 31-Dec-18 | $RICK "Now that we've reported results, we'll ... | https://seekingalpha.com/symbol/RICK |
| Jonathan Cooper | 31-Dec-18 | Do any $APHA shareholders support the $GGB tak... | https://seekingalpha.com/symbol/APHA |
| ... | ... | ... | ... |
| Power Hedge | 1-Jan-18 | USD Expected to Collapse in 2018 https://goo.g... | https://goo.gl/RG1CDd |
| Norman Tweed | 1-Jan-18 | Happy New Year everyone! I'm adding to $MORL @... | https://seekingalpha.com/symbol/MORL |
| User 40986305 | 1-Jan-18 | Jamie Diamond says Trump is most pro business ... | NaN |
25117 rows × 4 columns
Missing values: drop the rows where 字段4 (the company page URL) is missing.

```python
forum = pd.read_csv('論壇.csv')
forum.dropna(inplace=True)
```
String operations and regex: in 字段4 (URL), the text after seekingalpha.com/symbol/ is the company abbreviation. Pandas string operations and a regular expression are used to extract it; rows where extraction fails are dropped, and the content of 字段4 is replaced by the abbreviation.

```python
forum_regExp = re.compile(r'seekingalpha\.com/symbol/([A-Z]+)')

def forumAbbr(link):
    # Return the company abbreviation if found, otherwise a missing value
    res = forum_regExp.search(link)
    return np.NAN if res is None else res.group(1)

forum['字段4_鏈接'] = forum['字段4_鏈接'].apply(forumAbbr)
```
Index filtering: keep all comments whose company abbreviation is AAPL.
Dropping columns: 字段1 (author name) is not needed and is deleted.
Index rebuild: rename 字段3 (post content) to remark.
Time series: convert 字段2 to a datetime index named date and sort in ascending order.

```python
# Index filtering
forum = forum[forum['字段4_鏈接']==CORPORATIONABBR]
# Drop unused columns
forum.drop(['字段1', '字段4_鏈接'], axis=1, inplace=True)
# Rename columns
AAPL_newIndex_forum = {'字段2': 'date', '字段3': 'remark'}
forum.rename(columns=AAPL_newIndex_forum, inplace=True)
# Time series
forum['date'] = forum['date'].apply(lambda dt: datetime.strptime(str(dt), '%d-%b-%y'))
```
Filtering URLs out of the posts: some posts contain web addresses; they are removed with a regular expression so that they do not interfere with the later sentiment analysis.

```python
forum_regExp_linkFilter = re.compile(
    r'(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?')
forum['remark'] = forum['remark'].apply(lambda x: forum_regExp_linkFilter.sub('', x))
forum
```

| date | remark |
| --- | --- |
| 2018-12-26 | Many Chinese companies are encouraging their e... |
| 2018-12-21 | This Week in Germany 🇩🇪 | Apple Smashed 📱 $AAP... |
| 2018-12-21 | $AAPL gets hit with another partial ban in Ger... |
| ... | ... |
| 2018-01-05 | $AAPL. Claims by GHH is 200 billion repatriati... |
| 2018-01-03 | $AAPL Barclays says battery replacement could ... |
| 2018-01-02 | 2018 will be the year for $AAPL to hit the 1 t... |
330 rows × 2 columns
Note also that AAPL should be added as a stop word for the later sentiment analysis.
Data storage: saved as 補(bǔ)充數(shù)據(jù)1925102007/AAPL論壇.csv.

```python
# Save
forum.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL論壇.csv', index=False)
```
3.3.3 全標(biāo)題.xlsx
| 標(biāo)題 | 字段1_鏈接_鏈接 | ABOUT | TIME | AUTHOR | COMMENTS | Unnamed: 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Micron Technology: Insanely Cheap Stock Given ... | https://seekingalpha.com/article/4230920-micro... | MU | Dec. 31, 2018, 7:57 PM | Ruerd Heeg | 75 Comments | NaN |
| Molson Coors Seems Attractive At These Valuations | https://seekingalpha.com/article/4230922-molso... | TAP | Dec. 31, 2018, 7:44 PM | Sanjit Deepalam | 16 Comments | NaN |
| Gerdau: The Brazilian Play On U.S. Steel | https://seekingalpha.com/article/4230917-gerda... | GGB | Dec. 31, 2018, 7:10 PM | Shannon Bruce | 1 Comment | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| Big Changes For Centurylink, AT&T And Verizon ... | https://seekingalpha.com/article/4134687-big-c... | CTL, T, VZ | Jan. 1, 2018, 5:38 AM | EconDad | 32 Comments | NaN |
| UPS: If The Founders Were Alive Today | https://seekingalpha.com/article/4134684-ups-f... | UPS | Jan. 1, 2018, 5:11 AM | Roger Gaebel | 15 Comments | NaN |
| U.S. Silica - Buying The Dip Of This Booming C... | https://seekingalpha.com/article/4134664-u-s-s... | SLCA | Jan. 1, 2018, 12:20 AM | The Value Investor | 27 Comments | NaN |
17928 rows × 7 columns
Index filtering: keep all title rows whose ABOUT is AAPL.
Dropping columns: delete 字段1_鏈接_鏈接, ABOUT, AUTHOR, COMMENTS, and Unnamed: 6.
Index rebuild: rename 標(biāo)題→title and TIME→date.
Time series: convert date to a datetime index and sort in ascending order.
Data storage: saved as 補(bǔ)充數(shù)據(jù)1925102007/AAPL全標(biāo)題.csv.

```python
allTitles = pd.read_excel('全標(biāo)題.xlsx')
# Index filtering
allTitles = allTitles[allTitles['ABOUT']==CORPORATIONABBR]
# Drop unused columns
allTitles.drop(['字段1_鏈接_鏈接','ABOUT','AUTHOR','COMMENTS','Unnamed: 6'], axis=1, inplace=True)
# Rename columns
AAPL_newIndex_allTitles = {'標(biāo)題': 'title', 'TIME': 'date'}
allTitles.rename(columns=AAPL_newIndex_allTitles, inplace=True)
# Time-series handling
# The date formats vary, so dateutil's parse is used to recognize them
allTitles['date'] = allTitles['date'].apply(lambda dt: dt_parse(dt))
# Use date as the index
allTitles.set_index('date', inplace=True)
# Sort by date ascending
allTitles.sort_values(by='date', inplace=True, ascending=True)
# Save
allTitles.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL全標(biāo)題.csv')
allTitles
```

| title |
| --- |
| Apple Ia Above A 'Golden Cross' And Has A Posi... |
| Apple Cash: What Would Warren Buffett Say? |
| Apple's iPhone Battery Replacement Could Consu... |
| ... |
| Will Apple Beat Its Guidance? |
| How Much Stock Could Apple Have Repurchased In... |
| Will Apple Get Its Mojo Back? |
204 rows × 1 columns
3.3.4 摘要.xlsx
| 標(biāo)題 | 字段2 | 字段5 | 字段1 |
| --- | --- | --- | --- |
| HealthEquity: Strong Growth May Be Slowing Hea... | Apr. 1, 2019 10:46 PM ET | About: HealthEquity, Inc. (HQY) | SummaryHealthEquity's revenue and earnings hav... |
| Valero May Rally Up To 40% Within The Next 12 ... | Apr. 1, 2019 10:38 PM ET | About: Valero Energy Corporation (VLO) | SummaryValero is ideally positioned to benefit... |
| Apple Makes A China Move | Apr. 1, 2019 7:21 PM ET | About: Apple Inc. (AAPL) | SummaryCompany cuts prices on many key product... |
| ... | ... | ... | ... |
| Rubicon Technology: A Promising Net-Net Cash-B... | Jul. 24, 2018 2:16 PM ET | About: Rubicon Technology, Inc. (RBCN) | SummaryRubicon is trading well below likely li... |
| Stamps.com: A Cash Machine | Jul. 24, 2018 1:57 PM ET | About: Stamps.com Inc. (STMP) | SummaryThe Momentum Growth Quotient for the co... |
| Can Heineken Turn The 'Mallya Drama' In Its Ow... | Jul. 24, 2018 1:24 PM ET | About: Heineken N.V. (HEINY), Includes: BUD,... | SummaryMallya, United Breweries' chairman, can... |
10131 rows × 4 columns
After inspection, 摘要.xlsx has no missing values. Only 標(biāo)題 (title) and 字段1 (the abstract text) are needed; the other columns are dropped. The columns are renamed 標(biāo)題→title and 字段1→abstract.

```python
abstracts = pd.read_excel('摘要.xlsx')
abstracts.drop(['字段2', '字段5'], axis=1, inplace=True)
newIndex_abstracts = {'標(biāo)題': 'title', '字段1': 'abstract'}
abstracts.rename(columns=newIndex_abstracts, inplace=True)
```
Intersection: the rows whose title matches AAPL全標(biāo)題.csv are the abstracts of articles about AAPL; only those are kept.

```python
abstracts = abstracts.merge(allTitles, on=['title'], how='inner')
```
Saving: stored as 補(bǔ)充數(shù)據(jù)1925102007/AAPL摘要.csv.

```python
abstracts.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL摘要.csv', index=False)
abstracts
```

| title | abstract |
| --- | --- |
| Will Apple Get Its Mojo Back? | SummaryApple has been resting on a reputation ... |
| How Much Stock Could Apple Have Repurchased In... | SummaryApple's stock plummeted from $227.26 to... |
| Will Apple Beat Its Guidance? | SummaryApple has sold fewer iPhones, which gen... |
| ... | ... |
| Apple: Still The Ultimate Value Growth Stock T... | SummaryApple reported superb earnings on Tuesd... |
| Apple In 2023 | SummaryWhere can the iPhone go from here?The A... |
| Apple's Real Value Today | SummaryApple has reached new highs this week.W... |
86 rows × 2 columns
3.3.5 回帖
```python
pd.read_excel('回帖/SA_Comment_Page131-153.xlsx')
```

| 字段 | 標(biāo)題1 |
| --- | --- |
| you should all switch to instagram | NaN |
| Long Facebook and Instagram. They will recover... | NaN |
| Personally, I think people will be buying FB a... | NaN |
| ... | ... |
| Thank you for the article.If you really think ... | Qiwi: The Current Sell-Off Was Too Emotional |
| Isn't WRK much better investment than PKG? Thanks | NaN |
| GuruFocus is also showing a Priotroski score o... | Packaging Corporation Of America: Target Retur... |
19971 rows × 2 columns
```python
pd.read_csv('回帖/SA_Comment_Page181-255(1).csv')
```

| 字段1 | 標(biāo)題 |
| --- | --- |
| I bought at $95 and holding strong. Glad I did... | NaN |
| The price rally you are referring to is not be... | Michael Kors: Potential For Further Upside Ahead |
| only a concern if you own it.... | NaN |
| ... | ... |
| What can Enron Musk do legally to boost balan... | NaN |
| The last two weeks feels like a short squeeze.... | NaN |
| " Tesla is no longer a growth or value proposi... | NaN |
20000 rows × 2 columns
Column renaming: rename the reply-content column to content and the title column to title (note that the .csv and .xlsx files use different column names).
Missing values: per the data description, an empty title takes the nearest non-empty title below it, so missing values are back-filled with df.fillna(method='bfill').
Reading the data files: os.listdir() returns the file names in the folder; every file ending in .xlsx or .csv is a data file and is read, after which the renaming and missing-value handling above are applied.
Reply filtering: iterate over all data files, keep every reply whose title appears in AAPL全標(biāo)題.csv, check for missing values, and save the result to 補(bǔ)充數(shù)據(jù)1925102007/AAPL回帖.csv.

```python
# Read the data files
repliesFiles = listdir('回帖')
allAALPReplies = []
newIndex_replies_csv = {'字段1': 'content', '標(biāo)題': 'title'}
newIndex_replies_xlsx = {'字段': 'content', '標(biāo)題1': 'title'}
# Walk through every reply file and collect the replies related to AAPL
for file in repliesFiles:
    path = '回帖/' + file
    if file.endswith('.csv'):
        replies = pd.read_csv(path)
        newIndex_replies = newIndex_replies_csv
    elif file.endswith('.xlsx'):
        replies = pd.read_excel(path)
        newIndex_replies = newIndex_replies_xlsx
    else:
        print('Wrong file format,', file)
        break
    # Rename columns
    replies.rename(columns=newIndex_replies, inplace=True)
    # Back-fill missing titles
    replies.fillna(method='bfill', inplace=True)
    # Keep only the replies whose title matches an AAPL article
    allAALPReplies.extend(replies.merge(allTitles, on=['title'], how='inner').values)

# All replies matching AAPL article titles
allAALPReplies = pd.DataFrame(allAALPReplies, columns=['content', 'title'])
# Save
allAALPReplies.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL回帖.csv', index=False)
# Show
allAALPReplies
```

| content | title |
| --- | --- |
| Understood. But let me ask you. 64GB of pics i... | iPhone XR And XS May Be Apple's Most Profitabl... |
| Just upgraded from 6 to XS, 256G. Love it. I'l... | iPhone XR And XS May Be Apple's Most Profitabl... |
| Yup, AAPL will grow profits 20% per year despi... | iPhone XR And XS May Be Apple's Most Profitabl... |
| ... | ... |
| With all due respect, never have paid for and ... | Gain Exposure To Apple Through Berkshire Hathaway |
| This one's easy - own both! | Gain Exposure To Apple Through Berkshire Hathaway |
| No Thanks! I like my divys,and splits too much... | Gain Exposure To Apple Through Berkshire Hathaway |
4506 rows × 2 columns
3.4 Sentiment analysis
Third-party NLP library: NLTK (Natural Language Toolkit)
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries.
After installing nltk, the corresponding corpora must be downloaded with nltk.download(). Because the download is slow, I installed the nltk_data package directly; the core data files are included in the supplementary folder.
To improve the efficiency and accuracy of the sentiment analysis, the stop words ['!', ',', '.', '?', '-s', '-ly', '</s> ', 's', 'AAPL', 'apple', '$', '%'] are also added to NLTK's English stop-word list (done below by list concatenation).
[Source3] http://www.nltk.org
Financial sentiment lexicon: LM (Loughran-McDonald) sentiment word lists, 2018 edition.
[Loughran-McDonald Sentiment Word Lists](https://sraf.nd.edu/textual-analysis/resources/#LM Sentiment Word Lists) is an Excel file containing each of the LM sentiment words by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining).
Lexicon path: /補(bǔ)充數(shù)據(jù)1925102007/LoughranMcDonald_SentimentWordLists_2018.xlsx
[Source4] https://sraf.nd.edu/textual-analysis/resources
3.4.1 Sentiment-analysis approach
- Tokenization: tokenize the text (here, the comment data) with NLTK
- Stop words: remove stop words
- Structuring: with the Positive and Negative sheets of the LM lexicon, compute pos and neg values as structured features of the unstructured text (i.e. the shares of positive and negative words among a comment's tokens)
- Aggregation: aggregate the features and resample them by business day (stocks trade on business days)
$$pos = \frac{\text{Number of Positive Words}}{\text{Total Words}}$$

$$neg = \frac{\text{Number of Negative Words}}{\text{Total Words}}$$
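A tiny self-contained illustration of these ratios (hypothetical sentence and three-word stand-in lexicons; the real computation in 3.4.3 uses NLTK tokenization, stop-word removal, and the full LM lists):

```python
sentence = "strong growth despite weak guidance"
pos_words = {"STRONG", "GROWTH"}           # stand-in for the LM Positive sheet
neg_words = {"WEAK", "LOSS", "DECLINE"}    # stand-in for the LM Negative sheet

tokens = [w.upper() for w in sentence.split()]            # 5 tokens
pos = sum(t in pos_words for t in tokens) / len(tokens)   # 2/5 = 0.4
neg = sum(t in neg_words for t in tokens) / len(tokens)   # 1/5 = 0.2
print(pos, neg)
```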
3.4.2 Word-list import and extra stop words
```python
# Import the word lists
wordListsPath = '補(bǔ)充數(shù)據(jù)1925102007/LoughranMcDonald_SentimentWordLists_2018.xlsx'
posWords = pd.read_excel(wordListsPath, header=None, sheet_name='Positive').iloc[:,0].values
negWords = pd.read_excel(wordListsPath, header=None, sheet_name='Negative').iloc[:,0].values

# Extra stop words
extraStopwords = ['!', ',', '.', '?', '-s', '-ly', '</s> ', 's', 'AAPL', 'apple', '$', '%']
stopWs = stopwords.words('english') + extraStopwords
```
3.4.3 Function definitions
```python
def structComment(sentence, posW, negW, stopW):
    """Structure one comment.
    :param sentence: the comment to structure
    :param posW: positive word list
    :param negW: negative word list
    :param stopW: stop words
    :return: (pos, neg), the shares of positive and negative words among the
             comment's tokens after stop-word removal
    """
    # Tokenize
    tokenizer = nltk.word_tokenize(sentence)
    # Remove stop words
    tokenizer = [w.upper() for w in tokenizer if w.lower() not in stopW]
    # Positive words
    posWs = [w for w in tokenizer if w in posW]
    # Negative words
    negWs = [w for w in tokenizer if w in negW]
    # Token count
    len_token = len(tokenizer)
    # Guard against an empty token list (zero denominator)
    if len_token <= 0:
        return 0, 0
    else:
        return len(posWs)/len_token, len(negWs)/len_token


def NLProcessing(fileName, colName):
    """NLP helper: structure the text column colName of the .csv file fileName
    (looked up in 補(bǔ)充數(shù)據(jù)1925102007/) and save the result.
    :param fileName: file name, searched in 補(bǔ)充數(shù)據(jù)1925102007/
    :param colName: the text column to structure
    :return: DataFrame with new pos and neg columns
    """
    pathNLP = '補(bǔ)充數(shù)據(jù)1925102007/' + fileName + '.csv'
    data = pd.read_csv(pathNLP)
    # Build the structured pos and neg columns
    posAndneg = [structComment(st, posWords, negWords, stopWs) for st in data[colName].values]
    # DataFrame of pos and neg
    posAndneg = pd.DataFrame(posAndneg, columns=['pos', 'neg'])
    # Concatenate along the column axis
    data = pd.concat([data, posAndneg], axis=1)
    # Drop the raw text column
    data.drop([colName], axis=1, inplace=True)
    # Save the structured data
    data.to_csv(pathNLP)
    return data
```
3.4.4 Sentiment-analysis processing
```python
# AAPL論壇.csv
forum = NLProcessing('AAPL論壇', 'remark')
# AAPL摘要.csv
abstracts = NLProcessing('AAPL摘要', 'abstract')
# AAPL回帖.csv
allAALPReplies = NLProcessing('AAPL回帖', 'content')
```
3.4.5 Aggregating the sentiment features
Once the structured data with a title column (AAPL回帖.csv and AAPL摘要.csv) are available, the replies and abstracts are first concatenated along the row axis with concat; the result is then merged with AAPL全標(biāo)題.csv (allTitles) on title using an outer merge, and the now-useless title column is dropped. The structured forum data are then concatenated (along the row axis) with the result of the previous step. Finally the data are resampled by business day and the daily means of the pos and neg features are computed.
```python
# Concatenate abstracts and allAALPReplies
allEssaysComment = pd.concat([abstracts, allAALPReplies], ignore_index=True)
# Merge with the titles table
allEssaysComment = allTitles.merge(allEssaysComment, how='outer', on='title')
# Drop rows with missing values
allEssaysComment.dropna(inplace=True)
# Drop the title column
allEssaysComment.drop('title', axis=1, inplace=True)
# Concatenate with the structured forum data
allEssaysComment = pd.concat([allEssaysComment, forum], ignore_index=True)
# Drop rows where both pos and neg are 0
allEssaysComment = allEssaysComment[(allEssaysComment['pos'] + allEssaysComment['neg']) > 0]

# Use date as a datetime index
allEssaysComment['date'] = pd.to_datetime(allEssaysComment['date'])
allEssaysComment.set_index('date', inplace=True)
# Resample by business day, average pos and neg, and fill absent days with 0
allEssaysComment = allEssaysComment.resample('B').mean()
allEssaysComment.fillna(0, inplace=True)
# Save
allEssaysComment.to_csv('補(bǔ)充數(shù)據(jù)1925102007/allPosAndNeg.csv')
# Show
allEssaysComment
```

| pos | neg |
| --- | --- |
| 0.041667 | 0.043478 |
| 0.000000 | 0.000000 |
| 0.000000 | 0.090909 |
| ... | ... |
| 0.000000 | 0.000000 |
| 0.000000 | 0.000000 |
| 0.090909 | 0.090909 |
254 rows × 2 columns
3.5 * Correlation analysis of stock indicators with sentiment data merged in
Method: use seaborn's pairplot to draw pairwise scatter plots (with histograms on the diagonal) of the indicators in AAPL股票價(jià)格.csv (sharePricesAAPL), to explore the relationships between the different indicators.
Goal: analyze the relationships among the stock indicators and find any strongly linearly correlated ones that can be removed, reducing the LSTM training cost.
pairplot documentation: http://seaborn.pydata.org/generated/seaborn.pairplot.html
3.5.1 Data join
Merge the time-indexed sentiment data obtained in 3.4.5 (allPosAndNeg.csv) with AAPL股票價(jià)格.csv (sharePricesAAPL) on date.
During the join it turns out that the comment data span the entire AAPL price data, so missing values are not a concern.
```python
# Read the files
sharePricesAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv')
allPosAndNeg = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/allPosAndNeg.csv')
# Merge
sharePricesAAPLwithEmotion = sharePricesAAPL.merge(allPosAndNeg, how='inner', on='date')
# Turn date into a datetime index
sharePricesAAPLwithEmotion['date'] = pd.DatetimeIndex(sharePricesAAPLwithEmotion['date'])
sharePricesAAPLwithEmotion.set_index('date', inplace=True)
# reindex
AAPL_newColOrder_emotionPrices = ['open', 'high', 'low', 'vol', 'pos', 'neg', 'close']
sharePricesAAPLwithEmotion = sharePricesAAPLwithEmotion.reindex(columns=AAPL_newColOrder_emotionPrices)
# Save
sharePricesAAPLwithEmotion.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格融合情感.csv')
```
3.5.2 pairplot
The essential OHLC technical indicators are kept, and correlation plots are drawn for the remaining vol, pos, and neg columns.
During experimentation I also drew the pair-plot grid of the OHLC indicators themselves; they show high linear correlation with one another.
```python
# Parameters:
# data: pandas.DataFrame [Tidy (long-form) dataframe where each column is a variable and each row is an observation.]
# diag_kind: {'auto', 'hist', 'kde', None} [Kind of plot for the diagonal subplots.]
# kind: {'scatter', 'kde', 'hist', 'reg'} [Kind of plot to make.]
fig1 = pairplot(sharePricesAAPLwithEmotion[['vol', 'pos', 'neg']], diag_kind='hist', kind='reg')
# Save fig1 to 補(bǔ)充數(shù)據(jù)1925102007/
fig1.savefig('補(bǔ)充數(shù)據(jù)1925102007/fig1_a_Grid_of_Axes.png')
```
3.5.3 Correlation analysis of the stock indicators
Looking at the resulting Fig1 (a grid of axes), the indicators vol, pos, and neg are only weakly linearly correlated with one another, so all three are kept as LSTM input features.
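A quick numeric cross-check of the same conclusion (not part of the original notebook) is to print the Pearson correlation matrix directly:

```python
import pandas as pd

# Pairwise Pearson correlations between volume and the two sentiment features;
# values close to 0 support keeping all three columns as LSTM inputs.
merged = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格融合情感.csv')
print(merged[['vol', 'pos', 'neg']].corr())
```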
3.6 LSTM prediction on sentiment-enriched stock data
Libraries used: Keras, scikit-learn, TensorFlow [4]
Prediction target: close (closing price)
Borrowed function: series_to_supervised(data, n_in=1, n_out=1, dropnan=True)
Source: Time Series Forecasting With Python
Purpose: frame a time series as a supervised learning dataset, i.e. convert a univariate or multivariate time series into a supervised-learning dataset.
Arguments:
data: Sequence of observations as a list or NumPy array.
n_in: Number of lag observations as input (X).
n_out: Number of observations as output (y).
dropnan: Boolean whether or not to drop rows with NaN values.
# Since the LSTM already has memory, the n_in and n_out parameters are left at their default of 1 (i.e. a current-state column [t-1] and a next-state column [t] are constructed).
Returns:
Pandas DataFrame of series framed for supervised learning.
3.6.1 Time-series-to-supervised function definition
```python
def series_to_supervised(data, n_in=1):
    # Default parameters
    n_out = 1
    dropnan = True
    # The function is slightly adapted; data must be a time-series stock dataset whose last column is close (the column to predict)
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # Drop the irrelevant next-state [t] columns, keeping only close[t] and the previous-state [t-1] feature columns
    agg.drop(agg.columns[[x for x in range(data.shape[1], 2*data.shape[1]-1)]], axis=1, inplace=True)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
```
3.6.2 Normalizing the sentiment-enriched stock data
```python
# Read the data
sharePricesAAPLwithEmotion = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格融合情感.csv',
                                         parse_dates=['date'], index_col='date').values
# Create the scaler
# feature_range keeps its default (0, 1)
scaler = MinMaxScaler()
# Fit the scaler
scaler = scaler.fit(sharePricesAAPLwithEmotion)
# Normalize
sharePricesAAPLwithEmotion = scaler.fit_transform(sharePricesAAPLwithEmotion)
# Show the first rows
sharePricesAAPLwithEmotion[:5,:]
```
```
array([[0.4316836 , 0.43640137, 0.44272148, 0.06118638, 0.        ,
        0.        , 0.47336914],
       [0.47972885, 0.44433594, 0.44416384, 0.01698249, 0.        ,
        0.        , 0.4351243 ],
       [0.44911044, 0.42553711, 0.45305926, 0.04901593, 0.        ,
        0.        , 0.45248692],
       [0.4510469 , 0.45024426, 0.46411828, 0.05954544, 0.        ,
        0.        , 0.4826372 ],
       [0.50042364, 0.47766101, 0.51340305, 0.08659896, 0.        ,
        0.        , 0.51325663]])
```
3.6.3 Building the supervised dataset from the time series
```python
# Build the supervised dataset with series_to_supervised
sharePricesAAPLwithEmotion = series_to_supervised(sharePricesAAPLwithEmotion)
sharePricesAAPLwithEmotion
```

| var1(t-1) | var2(t-1) | var3(t-1) | var4(t-1) | var5(t-1) | var6(t-1) | var7(t-1) | var7(t) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.431684 | 0.436401 | 0.442721 | 0.061186 | 0.0 | 0.0 | 0.473369 | 0.435124 |
| 0.479729 | 0.444336 | 0.444164 | 0.016982 | 0.0 | 0.0 | 0.435124 | 0.452487 |
| 0.449110 | 0.425537 | 0.453059 | 0.049016 | 0.0 | 0.0 | 0.452487 | 0.482637 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 0.148251 | 0.128906 | 0.104700 | 0.624252 | 0.0 | 0.0 | 0.117316 | 0.045753 |
| 0.105410 | 0.080688 | 0.036543 | 0.994059 | 0.0 | 0.0 | 0.045753 | 0.000000 |
| 0.000000 | 0.000000 | 0.000000 | 0.295643 | 0.0 | 0.0 | 0.000000 | 0.121305 |
122 rows × 8 columns
3.6.4 Train/validation split
```python
# The ndarray dtype must be float32 (default is float64), otherwise feeding the LSTM model later raises an error
sharePricesAAPLwithEmotion = sharePricesAAPLwithEmotion.values.astype(np.float32)
# Training : validation = 7 : 3
X_train, X_test, y_train, y_test = train_test_split(
    sharePricesAAPLwithEmotion[:,:-1], sharePricesAAPLwithEmotion[:,-1],
    test_size=0.3, shuffle=False)
```
3.6.5 Building the LSTM model with Keras
Reference documentation:
Keras core: Dense and Dropout
Keras Activation relu
Keras Losses mean_squared_error
Keras Optimizer adam
Keras LSTM Layers
Keras Sequential Model
3.6.5 (I) Reshaping the LSTM input X
The LSTM input has shape **[samples, timesteps, features]**:
samples: number of samples
timesteps: number of time steps
features (input_dim): dimensionality at each time step
Reshape X_train and X_test:
```python
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
```
3.6.5 (II) Building the LSTM model and plotting the loss
- Build a Sequential model
- Add an LSTM layer (64 hidden-layer neurons, 1 output neuron, with the input_shape argument specified as required for the first layer of a stacked LSTM model) as the regression model
- Set the Dropout rate per training step to 0.4
- Add a Dense (fully connected) layer with an output dimension (units) of 1 and relu (rectified linear unit) activation
- Compile the Sequential model with MSE (mean squared error) as the loss and adam as the optimizer
- Train with epochs=50 and batch_size=30
- Plot the training loss (a reconstruction of these steps is sketched after this list)
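The notebook calls two helpers, LSTMModelGenerate and drawLossGraph, whose bodies are not reproduced in the text above; the sketch below reconstructs them from the bullet points (layer sizes, loss, optimizer, epochs and batch size as listed), so details such as the exact layer stacking and the figure layout are my assumptions rather than the original code:

```python
from matplotlib import pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def LSTMModelGenerate(X_train, X_test, y_train, y_test):
    """Build and train the LSTM regressor described above; return (history, model)."""
    model = Sequential()
    # 64 hidden units; input_shape = (timesteps, features) of the reshaped 3D input
    model.add(LSTM(64, input_shape=(X_train.shape[1], X_train.shape[2])))
    # drop 40% of the units during training
    model.add(Dropout(0.4))
    # single output neuron predicting the normalized close, relu activation
    model.add(Dense(1, activation='relu'))
    # MSE loss with the adam optimizer
    model.compile(loss='mean_squared_error', optimizer='adam')
    history = model.fit(X_train, y_train, epochs=50, batch_size=30,
                        validation_data=(X_test, y_test), shuffle=False)
    return history, model

def drawLossGraph(history, title, num):
    """Plot the training/validation loss curves and save the figure (assumed layout)."""
    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'], label='validation')
    plt.title('Fig' + num + '. ' + title)
    plt.xlabel('epoch')
    plt.ylabel('MSE loss')
    plt.legend()
    plt.savefig('補(bǔ)充數(shù)據(jù)1925102007/fig' + num + '_' + title.replace(' ', '_') + '.png',
                dpi=400, bbox_inches='tight')
    plt.show()

# Assumed invocation for this section, mirroring the calls shown in Section 3.7
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
drawLossGraph(history, title='LSTM Loss Graph for Stock Prices with Emotions', num='2')
```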
(Figure: Fig2. LSTM Loss Graph for Stock Prices with Emotions)
Loss-curve analysis:
Fig2 (the LSTM loss for the sentiment-enriched stock prices) shows that the MSE decreases as the number of epochs grows and levels off (converges) after roughly 30 epochs.
3.6.6 Prediction and inverse normalization
```python
# Only the target column needs to be inverse-normalized,
# so instead of inverse_transform a per-column inverse function is defined
def inverse_transform_col(_scaler, y, n_col):
    """Inverse-normalize a single column.
    :param _scaler: the fitted sklearn scaler
    :param y: the column to inverse-normalize
    :param n_col: the column index y had when the scaler was fitted
    :return: the inverse-normalized y
    """
    y = y.copy()
    y -= _scaler.min_[n_col]
    y /= _scaler.scale_[n_col]
    return y

# Prediction-plot helper
def predictGraph(yTrain, yPredict, yTest, timelabels, title, num):
    """Plot the prediction results.
    :param yTrain: training-set target values
    :param yPredict: predictions on the validation set
    :param yTest: true values of the validation set
    :param timelabels: x-axis tick labels
    :param title: figure title
    :param num: figure number
    :return: None
    """
    len_yTrain = yTrain.shape[0]
    len_y = len_yTrain + yPredict.shape[0]
    # True curve
    plt.plot(np.concatenate([yTrain, yTest]), color='r', label='sample')
    # Predicted curve
    plt.plot([x for x in range(len_yTrain, len_y)], yPredict, color='g', label='predict')
    # Title and axis labels
    plt.title('Fig' + num + '. ' + title)
    plt.xlabel('date')
    plt.ylabel('close')
    plt.legend()
    # Ticks and tick labels
    xticks = [0, len_yTrain, len_y - 1]
    xtick_labels = [timelabels[x] for x in xticks]
    plt.xticks(ticks=xticks, labels=xtick_labels, rotation=30)
    # Save to 補(bǔ)充數(shù)據(jù)1925102007/
    savingPath = '補(bǔ)充數(shù)據(jù)1925102007/fig' + num + '_' + title.replace(' ', '_') + '.png'
    plt.savefig(savingPath, dpi=400, bbox_inches='tight')
    # Show
    plt.show()

# Predict each day's close from the previous day's indicators in X_test
# Note: the array returned by predict must be flattened to shape=(n_samples,)
y_predict = model.predict(X_test)[:,0]

# Inverse normalization
# Re-read AAPL股票價(jià)格融合情感.csv
sharePricesAAPLwithEmotion = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格融合情感.csv')
col_n = sharePricesAAPLwithEmotion.shape[1] - 2
# Inverse-normalize the predictions
inv_yPredict = inverse_transform_col(scaler, y_predict, col_n)
# Inverse-normalize the true values
inv_yTest = inverse_transform_col(scaler, y_test, col_n)
# Inverse-normalize the training targets (to plot the full curve)
inv_yTrain = inverse_transform_col(scaler, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest,
             timelabels=sharePricesAAPLwithEmotion['date'].values,
             title='Prediction Graph of Stock Prices with Emotions', num='3')
```
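The per-column inversion above relies on how sklearn's MinMaxScaler stores its parameters: for every column $j$ the forward transform is $x^{\text{scaled}}_j = x_j \cdot \texttt{scale\_}[j] + \texttt{min\_}[j]$, so the inverse applied to the close column is

$$x_j = \frac{x^{\text{scaled}}_j - \texttt{min\_}[j]}{\texttt{scale\_}[j]},$$

which is exactly what inverse_transform_col computes.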
3.6.7 Model evaluation
Error metric: MSE
```python
# sklearn.metrics.mean_squared_error(y_true, y_pred)
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on the sentiment-enriched stock data:', mse)
```
```
MSE of the prediction on the sentiment-enriched stock data: 160.42007
```
Analysis:
Fig3 shows that the predictions of the LSTM trained on the sentiment-enriched data (green curve) follow the same overall trend as the true values (the latter part of the red curve): when the true value falls or rises, the prediction follows. The fit is good at the start of the prediction window, but the gap between predicted and true values grows as time goes on.
3.7 Comparison experiment: predicting from technical indicators only
For comparison, 補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv is loaded and the same procedure as above is applied to the purely technical stock data without sentiment features.
(The steps are essentially identical, so they are not annotated in detail.)
3.7.1 Comparison workflow (generic helper functions)
```python
def formatData(sharePricesData):
    """Prepare the sample data.
    :param sharePricesData: the sample data
    :return: X_train, X_test, y_train, y_test, scaler
    """
    # Normalize
    _scaler = MinMaxScaler()
    _scaler = _scaler.fit(sharePricesData)
    sharePricesData = _scaler.fit_transform(sharePricesData)
    # Build the supervised dataset
    sharePricesData = series_to_supervised(sharePricesData)
    # dtype float32
    sharePricesData = sharePricesData.values.astype(np.float32)
    # Train/validation split
    _X_train, _X_test, _y_train, _y_test = train_test_split(
        sharePricesData[:,:-1], sharePricesData[:,-1], test_size=0.3, shuffle=False)
    # reshape input
    _X_train = _X_train.reshape((_X_train.shape[0], 1, _X_train.shape[1]))
    _X_test = _X_test.reshape((_X_test.shape[0], 1, _X_test.shape[1]))
    return _X_train, _X_test, _y_train, _y_test, _scaler


def invTransformMulti(_scaler, _y_predict, _y_test, _y_train, _col_n):
    # Inverse-normalize predictions, true values, and training targets in one call
    _inv_yPredict = inverse_transform_col(_scaler, _y_predict, _col_n)
    _inv_yTest = inverse_transform_col(_scaler, _y_test, _col_n)
    _inv_yTrain = inverse_transform_col(_scaler, _y_train, _col_n)
    return _inv_yPredict, _inv_yTest, _inv_yTrain


# Read the data
sharePricesAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv',
                              parse_dates=['date'], index_col='date').values
# Standardized data preparation
X_train, X_test, y_train, y_test, scaler = formatData(sharePricesAAPL)
# Build and train the model
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
# Plot the loss
drawLossGraph(history, title='LSTM Loss Graph for Stock Prices without Emotions', num='4')
# Predict
y_predict = model.predict(X_test)[:,0]
# Inverse normalization
sharePricesAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv')
col_n = sharePricesAAPL.shape[1] - 2
inv_yPredict, inv_yTest, inv_yTrain = invTransformMulti(scaler, y_predict, y_test, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest, timelabels=sharePricesAAPL['date'].values,
             title='Prediction Graph of Stock Prices without Emotions', num='5')
# Mean squared error
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on the technical-indicator-only stock data:', mse)
```
```
MSE of the prediction on the technical-indicator-only stock data: 142.50227
```
3.7.2 Comparison results
Comparing Fig3 and Fig5 (with and without sentiment):
- Mean squared error: judged by error alone, the LSTM prediction of close from the purely technical indicators (sentiment removed) beats the prediction on the sentiment-enriched data; the technical-only prediction is more accurate and, overall, closer to the true values.
MSE (with sentiment features) = 160.42007
MSE (technical indicators only) = 142.50227
- Curve behavior: the prediction curve that includes sentiment information is clearly more responsive than the one without. In Fig3 (with sentiment) the predicted curve rises and falls with the true curve, reproducing its changes (including sudden moves) fairly completely, whereas in Fig5 (technical only) the predicted curve follows the fluctuations of the true curve far less.
Fig3. Prediction Graph of Stock Prices with Emotions
Fig5. Prediction Graph of Stock Prices without Emotions
3.7.3 Comparison conclusions
With the data at hand, the technical-indicator-only prediction is more accurate overall, while the sentiment-enriched prediction is more responsive locally. The results are essentially in line with expectations.
They suggest that stock price movements are not a patternless random walk but are closely related to investor sentiment. Incorporating the sentiment of the investing public from online forums into stock prediction helps to judge upcoming rises and falls, which in turn helps to identify better entry and exit points and to assess investment risk. Sentiment information can thus support investors and analysts in making better decisions in quantitative investing.
3.8 Supplementary comparison: predicting with an enlarged AAPL technical-indicator sample
During the data join it became apparent that the provided 補(bǔ)充數(shù)據(jù)1925102007/AAPL股票價(jià)格.csv does not cover the whole span of the comment data (allPosAndNeg.csv).
Moreover, the sample is small: after the 7:3 train/validation split, the training set has only 88 rows.
Therefore full-year 2018 AAPL trading-day technical data provided by the 英為財(cái)情 (Investing.com) quote site are used to predict the closing price (close) with the same method, for comparison with the experiment in Section 3.7.
In fact:
AAPL股票價(jià)格.csv covers 2018-07-02 through 2018-12-31, while
allPosAndNeg.csv covers 2018-01-05 through 2018-12-31.
3.8.1 Data acquisition
Download the last five years of AAPL technical stock data from the AAPL quote page and store them as 補(bǔ)充數(shù)據(jù)1925102007/AAPLHistoricalData_5years.csv.
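As a scriptable alternative to the manual download (not used in the original experiment), the same daily OHLCV history could be pulled with the third-party yfinance package; the ticker and output path below are my assumptions, and yfinance column names and price adjustments differ from the CSV used in the next subsection:

```python
import yfinance as yf

# Daily OHLCV for AAPL over 2018 (hypothetical alternative data source)
aapl_2018 = yf.download('AAPL', start='2018-01-01', end='2019-01-01', progress=False)
aapl_2018.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018_yfinance.csv')
```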
3.8.2 Data processing
```python
# Read the data
allYearAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPLHistoricalData_5years.csv',
                          parse_dates=['Date'], index_col='Date')
# Slice the datetime index
allYearAAPL = allYearAAPL['2018-12-31':'2018-01-01']
# Sort
allYearAAPL.sort_index(inplace=True)
# Show
allYearAAPL
```

| Close/Last | Volume | Open | High | Low |
| --- | --- | --- | --- | --- |
| $43.065 | 101602160 | $42.54 | $43.075 | $42.315 |
| $43.0575 | 117844160 | $43.1325 | $43.6375 | $42.99 |
| $43.2575 | 89370600 | $43.135 | $43.3675 | $43.02 |
| ... | ... | ... | ... | ... |
| $39.0375 | 206435400 | $38.96 | $39.1925 | $37.5175 |
| $39.0575 | 166962400 | $39.375 | $39.63 | $38.6375 |
| $39.435 | 137997560 | $39.6325 | $39.84 | $39.12 |
251 rows × 5 columns
```python
# Strip the leading $ with pandas string slicing and convert the price columns to float
allYearAAPL[['Close/Last', 'Open', 'High', 'Low']] = allYearAAPL[
    ['Close/Last', 'Open', 'High', 'Low']].apply(lambda x: (x.str[1:]).astype(np.float32))
# reindex
allAAPL_newColOrder = ['Open', 'High', 'Low', 'Volume', 'Close/Last']
allYearAAPL = allYearAAPL.reindex(columns=allAAPL_newColOrder)
# Save as AAPL2018allYearData.csv
allYearAAPL.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData.csv')
# Show
allYearAAPL
```

| Open | High | Low | Volume | Close/Last |
| --- | --- | --- | --- | --- |
| 42.540001 | 43.075001 | 42.314999 | 101602160 | 43.064999 |
| 43.132500 | 43.637501 | 42.990002 | 117844160 | 43.057499 |
| 43.134998 | 43.367500 | 43.020000 | 89370600 | 43.257500 |
| ... | ... | ... | ... | ... |
| 38.959999 | 39.192501 | 37.517502 | 206435400 | 39.037498 |
| 39.375000 | 39.630001 | 38.637501 | 166962400 | 39.057499 |
| 39.632500 | 39.840000 | 39.119999 | 137997560 | 39.435001 |
251 rows × 5 columns
3.8.3 Prediction
```python
# Standardized data preparation
X_train, X_test, y_train, y_test, scaler = formatData(allYearAAPL)
# Build and train the model
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
# Plot the loss
drawLossGraph(history, title='LSTM Loss Graph for 2018 All Year AAPL Stock Prices', num='6')
# Predict
y_predict = model.predict(X_test)[:,0]
# Inverse normalization
allYearAAPL = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData.csv')
col_n = allYearAAPL.shape[1] - 2
inv_yPredict, inv_yTest, inv_yTrain = invTransformMulti(scaler, y_predict, y_test, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest, timelabels=allYearAAPL['Date'].values,
             title='Prediction Graph of 2018 All Year AAPL Stock Prices', num='7')
# Mean squared error
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on the 2018 full-year technical-only AAPL data:', mse)
```
3.8.4 Result analysis
Comparing Fig7 (Prediction Graph of 2018 All Year AAPL Stock Prices) and its MSE with the sentiment-free AAPL prediction of Section 3.7 shows that, after extending the time series from 2018-07-02~2018-12-31 to 2018-01-01~2018-12-31, the accuracy of the technical-only prediction improves dramatically and the LSTM fits very well.
This suggests that the low accuracy of Fig3 and Fig5 (the with- and without-sentiment predictions before the data were extended), and their growing deviation from the true values over time, stems from the small sample size, which left the LSTM under-trained. Next, the sentiment features are merged into the extended 2018 full-year AAPL data and the sentiment-enriched prediction is repeated to verify this inference.
3.9 Full-year 2018 prediction on sentiment-enriched stock data
3.9.1 Aggregating the sentiment features
```python
# Read the files
allYearAAPL_withEmos = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData.csv')
allPosAndNeg = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/allPosAndNeg.csv')
# Merge
allYearAAPL_withEmos = allYearAAPL_withEmos.merge(
    allPosAndNeg, how='inner', left_on='Date', right_on='date').drop('date', axis=1)
# Turn Date into a datetime index
allYearAAPL_withEmos['Date'] = pd.DatetimeIndex(allYearAAPL_withEmos['Date'])
allYearAAPL_withEmos.set_index('Date', inplace=True)
# reindex
allYearAAPLwithEmos_newColOrder = ['Open','High','Low','Volume','pos','neg','Close/Last']
allYearAAPL_withEmos = allYearAAPL_withEmos.reindex(columns=allYearAAPLwithEmos_newColOrder)
# Save
allYearAAPL_withEmos.to_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData_withEmos.csv')
# Show
allYearAAPL_withEmos
```

| Open | High | Low | Volume | pos | neg | Close/Last |
| --- | --- | --- | --- | --- | --- | --- |
| 43.3600 | 43.8425 | 43.2625 | 94359720 | 0.041667 | 0.043478 | 43.7500 |
| 43.5875 | 43.9025 | 43.4825 | 82095480 | 0.000000 | 0.000000 | 43.5875 |
| 43.6375 | 43.7650 | 43.3525 | 86128800 | 0.000000 | 0.090909 | 43.5825 |
| ... | ... | ... | ... | ... | ... | ... |
| 39.2150 | 39.5400 | 37.4075 | 381991600 | 0.000000 | 0.000000 | 37.6825 |
| 37.0375 | 37.8875 | 36.6475 | 148676920 | 0.000000 | 0.000000 | 36.7075 |
| 37.0750 | 39.3075 | 36.6800 | 232535400 | 0.090909 | 0.090909 | 39.2925 |
245 rows × 7 columns
3.9.2 Prediction
```python
# Standardized data preparation
X_train, X_test, y_train, y_test, scaler = formatData(allYearAAPL_withEmos)
# Build and train the model
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
# Plot the loss
drawLossGraph(history, title='LSTM Loss Graph for 2018 All Year AAPL Stock Prices with Emotions', num='8')
# Predict
y_predict = model.predict(X_test)[:,0]
# Inverse normalization
allYearAAPL_withEmos = pd.read_csv('補(bǔ)充數(shù)據(jù)1925102007/AAPL2018allYearData_withEmos.csv')
col_n = allYearAAPL_withEmos.shape[1] - 2
inv_yPredict, inv_yTest, inv_yTrain = invTransformMulti(scaler, y_predict, y_test, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest, timelabels=allYearAAPL_withEmos['Date'].values,
             title='Prediction Graph of 2018 All Year AAPL Stock Prices with Emotions', num='9')
# Mean squared error
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on the 2018 full-year sentiment-enriched AAPL data:', mse)
```
```
MSE of the prediction on the 2018 full-year sentiment-enriched AAPL data: 1.5526791
```
3.9.3 Result analysis
Training loss: comparing Fig2 (LSTM Loss Graph for Stock Prices with Emotions) with Fig8 (LSTM Loss Graph for 2018 All Year AAPL Stock Prices with Emotions), the model trained on the full-year 2018 sentiment-enriched data converges after about 10 epochs, whereas training on the partial sentiment-enriched data needs about 20 epochs to converge. With more training samples, the LSTM needs fewer epochs for the loss to converge and fits better.
Prediction plots: Fig7 and Fig9 (the 2018 full-year predictions with technical indicators only and with sentiment features added) differ very little visually. Their MSE values, however, show that MSE (2018 full year with sentiment) < MSE (2018 full year technical only): once the sample is enlarged so that the comment-sentiment data cover the full span of the technical indicators, adding the sentiment features to the purely technical stock data improves the accuracy of the close prediction.
MSE (2018 full year, with sentiment features) = 1.5526791
MSE (2018 full year, technical indicators only) = 1.7402486
4. Conclusions and Summary
This experiment investigated the effect of structured sentiment features on an LSTM stock prediction model. Pandas was used to preprocess the given data (loading, cleaning and preparation, wrangling, time-series handling, aggregation, etc.) to ensure its usability. NLTK and the LM financial word lists were then used to run sentiment analysis on the unstructured text, and the resulting structured features were merged into the purely technical stock data. Correlations among the stock indicators were analyzed to reduce dimensionality and speed up training. A Keras-based LSTM with MSE as the error metric was used to predict the closing price (Close), on the partial stock data and on the full-year 2018 data, each with and without sentiment.
The results show that when the LSTM predicts the closing price from a small training sample, the predictions drift badly away from the true values over time regardless of whether sentiment is included, i.e. accuracy is low; adding sentiment makes the predictions more responsive, with rises and falls that better match the true values, at the cost of some accuracy. With sufficient training samples, however, accuracy improves dramatically, and adding the sentiment features increases the responsiveness appropriately, raising the overall accuracy further.
5. References
[1] Wes McKinney. Python for Data Analysis (利用Python進(jìn)行數(shù)據(jù)分析) [M]. 機(jī)械工業(yè)出版社, 2013.
[2] 洪志令, 吳梅紅. 股票大數(shù)據(jù)挖掘?qū)崙?zhàn)——股票分析篇 [M]. 清華大學(xué)出版社, 2020.
[3] 楊妥, 李萬(wàn)龍, 鄭山紅. 融合情感分析與SVM_LSTM模型的股票指數(shù)預(yù)測(cè) [J]. 軟件導(dǎo)刊, 2020(8): 14-18.
[4] Francesca Lazzeri. Machine Learning for Time Series Forecasting with Python [M]. Wiley, 2020.
Dataset download:
Baidu Netdisk: https://pan.baidu.com/s/1tC1AFx0kMHPUGobvqf47pg
HQU cloud: https://pan.hqu.edu.cn/share/a474d56c6b6557f7a7fd0e0eb7
Password: ued8