Anomaly Detection in Time Series Data Using Keras
Anomaly detection in time series data gives e-commerce and finance companies insight into the past and future of their data, helping them find actionable signals that take the form of anomalies.
Introduction
In this project, we'll build a model for anomaly detection in time series data using deep learning in Keras, with Python code. You should be familiar with deep learning, a sub-field of machine learning. Specifically, we'll design and train an LSTM autoencoder using the Keras API with TensorFlow 2 as the back-end. Along the way, you will also create interactive charts and plots with Plotly and Seaborn for data visualization, displaying the results within a Jupyter Notebook.
What is Time Series Data?
A time series is a sequence of numerical data points collected at successive points in time. Unlike cross-sectional data, it is a record of the value of a variable observed at different times.
Time series data can be found in business, science, and finance. A few examples of time series data are birth rates, GDP, the CPI (Consumer Price Index), blood pressure readings, global temperatures, population counts, and product metrics.
Time series data are very important for forecasting. They are used to understand past outcomes, predict future ones, shape business strategies, and more.
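As a minimal illustration, a time series can be represented in pandas as a column of values indexed by time. The dates and prices below are made up purely for demonstration:

import pandas as pd

# A toy daily time series: one observation per successive day (values are hypothetical)
dates = pd.date_range('2020-01-01', periods=5, freq='D')
series = pd.Series([101.2, 102.5, 101.9, 103.4, 104.0], index=dates, name='close')
print(series)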
What is Anomaly Detection in Time Series Data?
In the data mining field, anomaly detection is the identification of data points or events that do not follow an expected pattern. Anomaly detection helps identify unexpected behavior in data over time, so that businesses can devise strategies to deal with the situation. It also helps firms detect errors and fraud as they happen, or learn from past data that showed unusual behavior.
Applying machine learning to anomaly detection increases the speed of execution and helps companies find simple, effective approaches for detecting anomalies. Since machine learning algorithms can learn from data and make predictions, applying them to anomaly detection in time series data has a huge impact on performance. Anomaly detection in time series data has a wide variety of applications across different domains.
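To make the idea concrete before bringing in deep learning, here is a minimal rule-based sketch (a hypothetical baseline, not the method used in this project): flag any point whose value lies more than two standard deviations from the mean.

import numpy as np

# Hypothetical baseline: flag points more than 2 standard deviations from the mean
values = np.array([100.0, 101.0, 99.0, 100.0, 150.0, 101.0, 100.0])
z_scores = (values - values.mean()) / values.std()
anomalies = np.abs(z_scores) > 2
print(values[anomalies])  # [150.] -- the sudden spike is flagged

The deep learning approach later in this article replaces this hand-written rule with a reconstruction error learned from the data.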
What is an LSTM?
LSTM stands for Long Short-Term Memory, a type of artificial neural network in the Recurrent Neural Network (RNN) family. It processes sequential data, passing information along as it propagates through time. Its cell state allows the network to keep or forget information. This is only a brief introduction to LSTMs for convenience; if you want to learn more, plenty of material is available online.
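As a minimal sketch of how an LSTM layer consumes sequences in Keras (the shapes here are arbitrary, chosen only for illustration):

import numpy as np
import tensorflow as tf

# A batch of 4 sequences, each 30 time steps long with 1 feature per step
x = np.random.rand(4, 30, 1).astype('float32')

# An LSTM layer with 16 units summarizes each sequence into a 16-dimensional vector
lstm = tf.keras.layers.LSTM(16)
print(lstm(x).shape)  # (4, 16)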
Implementation
Datasets
As a case study, we are going to work with the S&P 500 index to detect and predict anomalies. By anomalies, I mean sudden price changes in the index.
What is the S&P 500 Index?
The S&P 500 is a stock market index that tracks the stock performance of 500 large-cap US companies listed on stock exchanges. The index represents the performance of the stock market by capturing the risks and returns of the biggest companies, and people in the finance industry consider it one of the best gauges of the US stock market.
In this project, we'll work with this data, captured from 1986 to 2018. The dataset is collected and stored on Kaggle, and I have downloaded it locally to my desktop.
We will be using Python and designing a deep learning model with the Keras API for anomaly detection in time series data. You should be familiar with TensorFlow and Keras and have an understanding of how neural networks work.
Step 1: Importing the Libraries
Here, we will be using the TensorFlow, NumPy, pandas, Matplotlib, Seaborn, and Plotly libraries for Python.
%matplotlib inline sets Matplotlib's backend to inline, so that the output of plotting commands is displayed inline within frontends like the Jupyter Notebook, directly below the code cell.
import numpy as np
import tensorflow as tf
import pandas as pd
pd.options.mode.chained_assignment = None
import seaborn as sns
from matplotlib.pylab import rcParams
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline

sns.set(style='whitegrid', palette='muted')
rcParams['figure.figsize'] = 14, 8
np.random.seed(1)
tf.random.set_seed(1)
print('Tensorflow version:', tf.__version__)
Step 2: Loading S&P 500 Index Data
Download the dataset from here: Click here
In the time series graph, dates (years) are on the X-axis and the closing price is on the Y-axis.
Now we'll read the dataset, which is a CSV file, using pd.read_csv (we imported pandas as pd). The parse_dates argument parses the date column into pandas datetime format. Let's view the first few rows using the head() function.
df = pd.read_csv('S&P_500_Index_Data.csv', parse_dates=['date'])
df.head()

Output:
Now let's check the shape of our dataset, which shows (8192, 2), i.e. 8192 entries and 2 columns.
df.shape

Now, let's see the closing price of the stock from 1986 to 2018. Here we use Plotly, specifically its graph_objects sub-module. We populate the figure using the add_trace() method, which lets us plot different chart types in the same figure. The Scatter mode is set to 'lines' for a line plot, the trace name is set to 'close' (the closing stock value), and then we update the figure layout.
fig = go.Figure()
fig.add_trace(go.Scatter(x=df.date, y=df.close, mode='lines', name='close'))
fig.update_layout(showlegend=True)
fig.show()
Output:
You will see the date and closing stock value when you hover your mouse over the plot. This is where anomaly detection comes into play: by exposing the outliers in the data, it can tell you when you should buy or sell the stock.
Task 3: Data Preprocessing
Data preprocessing is a very important task in any data mining process, as the raw data may be unclean: it may be missing attributes, or it may contain noise, incorrect values, or duplicates.
Here, we are going to standardize our target by removing the mean and scaling to unit variance. Before standardizing, let's split the dataset into training and test sets. We take 80% of the data frame for training and the remaining 20% for testing. The iloc method then assigns the rows from index 0 up to train_size to the training set and the rest to the test set.
train_size = int(len(df) * 0.8)
test_size = len(df) - train_size
train, test = df.iloc[0:train_size], df.iloc[train_size:len(df)]
print(train.shape, test.shape)
You can then see the output (6553, 2) (1639, 2), the sizes of the training and test sets respectively. Now, let's create an instance of StandardScaler, fit it on the training set, and then transform both the training and test sets.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler = scaler.fit(train[['close']])
train['close'] = scaler.transform(train[['close']])
test['close'] = scaler.transform(test[['close']])
The data standardization step is now complete.
Task 4: Create Training and Test Splits
Since this is time series data, we need to create subsequences before using the data to train our model. We create sequences with a specific number of time steps, 30 in our case. That means each sequence covers 30 days of historical data.
As required by the LSTM network, we need to reshape our input data into the shape (samples, time_steps, features). In our case, the number of features is 1, i.e. a single feature.
def create_dataset(X, y, time_steps=1):
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values
        Xs.append(v)
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)
Here, the lists are converted into NumPy arrays: each window of values from i to i + time_steps goes into the X array, and the value immediately after the window goes into the y array.
time_steps = 30

X_train, y_train = create_dataset(train[['close']], train.close, time_steps)
X_test, y_test = create_dataset(test[['close']], test.close, time_steps)
print(X_train.shape)
This prints (6523, 30, 1), i.e. 6523 sequences are available for training, each with 30 time steps and 1 feature.
Task 5: Build an LSTM Autoencoder
In this step, we are going to build an LSTM autoencoder network and look at its architecture and data flow. Here's how we will detect anomalies using an autoencoder.
First, we train the autoencoder on data with no anomalies; then we take a new data point and try to reconstruct it using the autoencoder.
If the reconstruction error for the new data point is above some threshold, we label that example as an anomaly.
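In pseudo-Python, the detection rule boils down to a few lines. This is just a sketch; model, sequences, and threshold stand in for the objects we build later in this project:

# Reconstruction-error rule (sketch): assumes `model`, `sequences`, `threshold` exist
reconstructed = model.predict(sequences)                     # shape (n, time_steps, 1)
errors = np.mean(np.abs(reconstructed - sequences), axis=1)  # mean absolute error per sequence
is_anomaly = errors > threshold                              # boolean flag per sequence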
In the first two lines below, we simply read the dimensions off the X_train array, whose shape is (6523, 30, 1).
timesteps = X_train.shape[1]
num_features = X_train.shape[2]

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, RepeatVector, TimeDistributed

model = Sequential([
    LSTM(128, input_shape=(timesteps, num_features)),
    Dropout(0.2),
    RepeatVector(timesteps),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    TimeDistributed(Dense(num_features))
])
model.compile(loss='mae', optimizer='adam')
model.summary()
Here, we have used the Sequential model from the Keras API. Each input sample is a 2D array of shape (time_steps, features) that is passed to the LSTM layer. The output of that layer is a feature vector summarizing the input sequence.
We have created one LSTM layer with 128 units. The input shape is (number of time_steps, number of features). Then we add Dropout regularization with a rate of 0.2. Since our decoder is also an LSTM, we need to repeat the encoded vector using RepeatVector, whose only purpose is to replicate the feature vector from the LSTM layer's output 30 times, once per time step. Our encoder is done here.
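If the role of RepeatVector seems abstract, this tiny standalone example shows what it does to tensor shapes (the numbers are arbitrary, chosen only for illustration):

import tensorflow as tf

v = tf.ones((2, 128))  # (batch, features): the encoder LSTM's output vector
repeated = tf.keras.layers.RepeatVector(3)(v)
print(repeated.shape)  # (2, 3, 128): the vector is copied once per time step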
Decoder Layer
We have now mirrored the encoder in reverse fashion, giving us the decoder. The TimeDistributed wrapper applies a Dense layer, with as many nodes as there are features, to each time step. Finally, the model is compiled with mean absolute error loss and the Adam optimizer, a gradient-descent-based optimizer. The model summary is shown below:
Summary of Model

Task 6: Train the Autoencoder
Now, let's create a Keras callback and use EarlyStopping so that we don't need to hard-code the number of epochs.
If our network doesn't improve for 3 consecutive epochs, i.e. the validation loss does not decrease, we stop the training process. That is the meaning of the patience argument.
Now let's fit the model to our training data. The number of epochs is set high to give the model plenty of chances to improve; EarlyStopping will halt training when it stops doing so. 10% of the data is set aside for validation, and the callback is passed in as es, our EarlyStopping instance.
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='min')

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=[es],
    shuffle=False
)
Task 7: Plot Metrics and Evaluate the Model
Now we'll plot the metrics, i.e. the training loss and validation loss, using Matplotlib.
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend();

In our plot, the validation loss is consistently lower than the training loss. This can happen because of the high dropout rate we used: dropout is active during training but disabled during validation. You can change the hyperparameters in Task 5 to optimize the model.
We still need to detect the anomalies in our test data, which starts with calculating the mean absolute error on the training data. First, let's get predictions on our training data; then we evaluate the model on our test data.
X_train_pred = model.predict(X_train)
train_mae_loss = pd.DataFrame(np.mean(np.abs(X_train_pred - X_train), axis=1), columns=['Error'])
model.evaluate(X_test, y_test)
Then the distribution of the training mean absolute error is plotted using Seaborn.
sns.distplot(train_mae_loss, bins=50, kde=True);

That shows output like:
Here, we can set the threshold to 0.65, since no training error value is larger than that.
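Rather than eyeballing the histogram, you could also derive the threshold programmatically. The following is a hypothetical alternative, not part of the original workflow:

# Two hypothetical ways to pick the threshold from the training errors
errors = train_mae_loss['Error'].values
threshold_max = errors.max()               # strict: no training sequence exceeds it
threshold_p99 = np.percentile(errors, 99)  # looser: tolerates the top 1% of training error
print(threshold_max, threshold_p99)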
Now, let's calculate the mean absolute error on the test set in the same way as for the training set, and then plot the distribution of the loss.
X_test_pred = model.predict(X_test)
test_mae_loss = np.mean(np.abs(X_test_pred - X_test), axis=1)
sns.distplot(test_mae_loss, bins=50, kde=True);

Task 8: Detect Anomalies in the S&P 500 Index Data
Now we are going to build a data frame containing the loss and anomaly values. We create a boolean column called anomaly, which tracks whether the input in the corresponding row is an anomaly, using the condition that its loss is greater than the threshold. Lastly, we keep the closing price.
THRESHOLD = 0.65

test_score_df = pd.DataFrame(test[time_steps:])
test_score_df['loss'] = test_mae_loss
test_score_df['threshold'] = THRESHOLD
test_score_df['anomaly'] = test_score_df.loss > test_score_df.threshold
test_score_df['close'] = test[time_steps:].close
Now, let's look at the first five entries of our data frame:
test_score_df.head()

and the last five entries:
test_score_df.tail()

which shows results like this:
Final Result
Now let's plot the test loss values and overlay a line for the threshold. First, we create an empty figure and then populate it with the add_trace() method, creating line plots using go.Scatter().
fig = go.Figure()
fig.add_trace(go.Scatter(x=test[time_steps:].date, y=test_score_df.loss, mode='lines', name='Test Loss'))
fig.add_trace(go.Scatter(x=test[time_steps:].date, y=test_score_df.threshold, mode='lines', name='Threshold'))
fig.update_layout(showlegend=True)
fig.show()
This shows a result like this:
The plot suggests that we are thresholding extreme values quite well: all the values above the horizontal orange line are classified as anomalies. That's it.
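If you want to go one step further, a sketch like the following could overlay the flagged points on the closing price itself, converted back to the original scale with the fitted scaler. This extra plot is an assumption on my part and is not part of the code above:

anomalies = test_score_df[test_score_df.anomaly]

fig = go.Figure()
# Closing price on the original dollar scale
fig.add_trace(go.Scatter(x=test[time_steps:].date,
                         y=scaler.inverse_transform(test[time_steps:][['close']]).flatten(),
                         mode='lines', name='Close Price'))
# Detected anomalies drawn as markers on top of the price line
fig.add_trace(go.Scatter(x=anomalies.date,
                         y=scaler.inverse_transform(anomalies[['close']]).flatten(),
                         mode='markers', name='Anomaly'))
fig.update_layout(showlegend=True)
fig.show()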
Thank you for reading.
The complete source code is available on my GitHub: Click Here
You can reach me at: https://www.linkedin.com/in/tekrajawasthi34456b162/
https://twitter.com/dotpyarmy
https://www.facebook.com/debuglife
Originally published at https://valueml.com on July 29, 2020.
Translated from: https://medium.com/swlh/anomaly-detection-in-time-series-data-using-keras-60763157cecc