Video Action Recognition Models Explained, with Hands-On Code

Published: 2024/3/13

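Abstract: This case study introduces the classic models of video action recognition and puts them into practice with code.

This article is shared from the Huawei Cloud community post "Video Action Recognition", by HWCloudAI.

Objectives

By working through this case study, you will:

Learn how to train and run inference with the C3D model, and how to run inference with the I3D model.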

Notes

This case study recommends TensorFlow-1.13.1 and must run on a GPU. See the "ModelArts JupyterLab Hardware Specification Guide" for how to switch hardware specifications.

If this is your first time using JupyterLab, see the "ModelArts JupyterLab User Guide" for how to use it.

If you run into errors while using JupyterLab, see "ModelArts JupyterLab Troubleshooting" for possible fixes.

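Steps

Case Overview

Video action recognition analyzes a short video clip and determines which action the people in it are performing. It is related to image recognition but differs from it: image recognition classifies a single static picture, whereas video action recognition must consider both the static content of each frame and the spatio-temporal relationships between frames. For example, a single picture of a person holding a half-open door cannot tell us whether the door is being opened or closed.

Research on video analysis is younger than research on image analysis, and harder. The first difficulty is compute: a video must be decomposed into frames for analysis, so the volume of data the model handles is very large. The second is temporal order: the order of actions matters, so the frames must be linked through time before a judgment is made. The model therefore has to account for the time dimension, and adding it greatly increases the number of parameters.

Thanks to public datasets such as PASCAL VOC, ImageNet, and MS COCO, the image domain has produced many classic models. Are there classic models for video analysis as well? There are, and this case study introduces the classic models of video action recognition and puts them into practice with code.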

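1. Preparing the Source Code and Data

This step prepares the source code and data the case study needs. The resources are stored in OBS; we download them locally with the ModelArts SDK and unpack them into the current directory. After unpacking, the current directory contains data, dataset_subset, and other directories and files, which hold the pretrained parameter files, the dataset, and the code files.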

import os
import moxing as mox

if not os.path.exists('videos'):
    mox.file.copy("obs://ai-course-common-26-bj4-v2/video/video.tar.gz", "./video.tar.gz")
    # Unpack the resource archive with tar
    os.system("tar xf ./video.tar.gz")
    # Delete the archive with rm
    os.system("rm ./video.tar.gz")

INFO:root:Using MoXing-v1.17.3-
INFO:root:Using OBS-Python-SDK-3.20.7

The previous lesson introduced HMDB51, UCF-101, and Kinetics as the three commonly used datasets for video action recognition. This case study uses a subset of UCF-101 as the demonstration dataset. Next, we play a video from UCF-101:

video_name = "./data/v_TaiChi_g01_c01.avi"

from IPython.display import clear_output, Image, display, HTML
import time
import cv2
import base64
import numpy as np

def arrayShow(img):
    # Encode the BGR frame as JPEG so that IPython can display it
    _, ret = cv2.imencode('.jpg', img)
    return Image(data=ret.tobytes())

cap = cv2.VideoCapture(video_name)
while True:
    try:
        clear_output(wait=True)
        ret, frame = cap.read()
        if ret:
            img = arrayShow(frame)
            display(img)
            time.sleep(0.05)
        else:
            break
    except KeyboardInterrupt:
        # Stop playback when the cell is interrupted
        break
cap.release()

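2. Introduction to Video Action Recognition Models

In the image domain, ImageNet is a large-scale image recognition dataset. Since 2010, a steady stream of image algorithms has been trained on it; deep learning models progressed from AlexNet to VGG-16 to ever more complex architectures, and their performance kept improving. For classifying images into a thousand categories, the error rates are as follows:

Models that perform well at image recognition can be reused for other tasks in the image domain: reusing the parameters of some layers improves training results. With ImageNet-based image models as a foundation, many models and tasks gained a better starting point for training, such as object detection, instance segmentation, face detection, and face recognition.

Can image models that train so well also help train video models? Yes. Research has shown that, in the video domain, reusing the architecture of an image model, or even its parameters, helps video model training considerably. But how can an image model's architecture be reused? We first need to understand how video classification differs from image classification. If a video is treated as a collection of images, with each frame as one image, a video classification task must consider not only what appears in each image but also the spatio-temporal relationships between images before it can classify the action.

To capture the spatio-temporal relationships between images, the I3D paper reviews three older video classification models and proposes a more effective one, Two-Stream Inflated 3D ConvNets (I3D for short). The four models are briefly introduced below; see the original paper for more detail.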

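Older Model 1: Convolutional Network + LSTM

This model uses a well-trained image model: a convolutional network performs feature extraction, pooling, and prediction on every frame, and an LSTM (long short-term memory) layer is added at the end of the model, as shown in the figure below. This lets the model take temporal structure into account and link features across frames to judge the action. Its drawbacks are that it captures only large movements and recognizes small movements poorly, and that training takes a long time because every frame of the video must pass through the network.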

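Older Model 2: 3D Convolutional Network

3D convolution is like 2D convolution, but brings temporal information into the convolution operation. Although this looks like a more natural way to process video, the extra kernel dimension increases the number of parameters and makes the model harder to train. This model does not reuse an image model; video data is fed directly into the 3D convolutional network for training.

Older Model 3: Two-Stream Network

The two streams of a Two-Stream network are a single RGB snapshot and a stack of 10 computed optical-flow frames. Both streams pass through an image convolutional network pretrained on ImageNet. Optical flow has two channels, vertical and horizontal, so the flow stream's input has twice as many channels as it has flow frames. The model performs very well in both training and testing.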

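Optical Flow Video

Since optical flow came up above, here is a short introduction. What is optical flow? The name sounds technical and unfamiliar, but we experience this visual phenomenon every day. On a high-speed train, the scenery outside the window rushes backward, and the faster the train goes, the more the scenery blurs into streaks. This visually perceived direction and speed of a target's motion is optical flow. Conceptually, optical flow is an observation of object motion: it finds correlations between adjacent frames to establish the correspondence between them, computes the motion of objects across adjacent frames, and obtains the instantaneous velocity of pixel motion. A raw video contains moving parts and a static background, and usually only the state of the moving parts needs to be judged; optical flow is the computed motion information of those moving parts.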
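To make the idea of matching adjacent frames concrete, here is a minimal sketch that recovers one global displacement between two frames by brute-force search. It is a deliberate simplification and not a dense optical-flow algorithm: real optical flow estimates a motion vector per pixel. The function name `estimate_global_shift` and the parameter `max_shift` are illustrative assumptions.

```python
import numpy as np

def estimate_global_shift(prev_frame, next_frame, max_shift=3):
    """Return the integer (dy, dx) that best maps prev_frame onto next_frame.

    Assumption: the whole frame moves by one shared displacement and wraps
    around at the borders (np.roll), which keeps this toy example exact.
    """
    best_shift, best_error = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(prev_frame, shift=(dy, dx), axis=(0, 1))
            # Sum of squared differences between the shifted and the next frame
            error = float(np.sum((shifted - next_frame) ** 2))
            if error < best_error:
                best_shift, best_error = (dy, dx), error
    return best_shift

rng = np.random.default_rng(0)
frame_a = rng.random((32, 32))
# Synthetic "next frame": everything moves 2 pixels down and 1 pixel left
frame_b = np.roll(frame_a, shift=(2, -1), axis=(0, 1))
shift = estimate_global_shift(frame_a, frame_b)
print(shift)  # (2, -1)
```

A dense method repeats this kind of matching locally, so that each pixel or small patch gets its own displacement vector.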

Below is an original video alongside its computed optical-flow video.

Original video

Optical-flow video

New Model: Two-Stream Inflated 3D ConvNets

The new model makes the following structural improvements:

Inflate 2D convolutions to 3D. A mature image classification model is used directly, except that the network's 2D $N × N$ filters and pooling kernels become $N × N × N$.
Initialize the 3D filters from the pretrained 2D filter parameters. The previous step reuses the image classification model's network; this step reuses its pretrained parameters as well. The 2D filter's parameters are copied N times along the third (time) dimension, and all parameter values are then divided by N.
Adjust the shape and size of the receptive field. The new model modifies the image classification model Inception-v1: the first two max-pooling layers use $1 × 3 × 3$ kernels with stride 1 in time, all other max-pooling layers keep symmetric kernels and strides, and the final average-pooling layer uses a $2 × 7 × 7$ kernel.
Keep the basic Two-Stream approach. A two-stream structure is still effective for capturing the spatio-temporal relationships between frames.
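The copy-and-rescale initialization described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the I3D implementation: the function name `inflate_2d_kernel` and the kernel layout `(k, k, c_in, c_out)` are chosen here for the example. The check at the end shows why the weights are divided by N: on a video that repeats one image in every frame, the inflated 3D filter responds exactly as the 2D filter does on that image.

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, time_depth):
    """Inflate a 2D kernel (k, k, c_in, c_out) to 3D (t, k, k, c_in, c_out)."""
    # Copy the 2D weights along a new leading time axis, then rescale by 1/t
    kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], time_depth, axis=0)
    return kernel_3d / time_depth

rng = np.random.default_rng(0)
kernel_2d = rng.standard_normal((3, 3, 3, 8))
kernel_3d = inflate_2d_kernel(kernel_2d, time_depth=3)

# One k x k x c_in image patch and its response under the 2D filter bank
patch = rng.standard_normal((3, 3, 3))
response_2d = np.tensordot(patch, kernel_2d, axes=3)        # shape (8,)

# The same patch repeated over 3 frames, filtered by the inflated kernel
video_patch = np.stack([patch] * 3)                          # shape (3, 3, 3, 3)
response_3d = np.tensordot(video_patch, kernel_3d, axes=4)   # shape (8,)

print(kernel_3d.shape, np.allclose(response_2d, response_3d))
```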

The overall structure of the new model is shown in the figure below:

So far we have covered the classic datasets and classic models for video action recognition. Next we run two of these models in code: the C3D model (a 3D convolutional network) and the I3D model (Two-Stream Inflated 3D ConvNets).

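C3D Model Structure

As explained under "Older Model 2: 3D Convolutional Network", a 3D convolutional network is a fairly natural way to process video. Its accuracy is limited and its computational cost is high, but its structure is simple: a very small network is enough to implement video action recognition. The figure below illustrates 3D convolution:

In (a), 2D convolution is applied to a single image. In (b), 2D convolution is applied to a video, treating multiple frames as multiple channels. In (c), 3D convolution is applied to a video, bringing temporal information into the input signal.

In (a) and (b), the output is a single 2D feature map, so whether or not the input carries temporal information, the output is a 2D feature map and 2D convolution loses the temporal information. Only 3D convolution preserves temporal information in its output. The same holds for 2D and 3D pooling.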
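The difference between cases (b) and (c) can be checked with simple shape arithmetic. The sketch below assumes an unpadded convolution with a 3-wide kernel on a 16-frame clip of 112 × 112 images; the helper name `conv_output_size` is made up for this example.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (input_size - kernel_size) // stride + 1

frames, height, width = 16, 112, 112

# (b) 2D convolution over a clip: the 16 frames are stacked as channels,
# so each filter sums over all of them and emits one 2D map.
out_2d = (conv_output_size(height, 3), conv_output_size(width, 3))

# (c) 3D convolution: the kernel also slides along time, so the output
# keeps a temporal axis.
out_3d = (conv_output_size(frames, 3),
          conv_output_size(height, 3),
          conv_output_size(width, 3))

print(out_2d)  # (110, 110)
print(out_3d)  # (14, 110, 110)
```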

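The figure below shows a variant of the C3D network (for the original description, see Section 2.2 of the I3D paper):

The C3D structure comprises 8 convolutional layers, 5 max-pooling layers, and 2 fully connected layers, followed by a softmax output layer.

All 3D convolution kernels are $3 × 3 × 3$ with stride 1. Training uses SGD with an initial learning rate of 0.003, halved every 150k iterations. Optimization stops at 1.9M iterations, about 13 epochs.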
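The step schedule described above is easy to write down. This is a small sketch with the stated numbers plugged in; the function name `c3d_learning_rate` and its parameter names are illustrative, not taken from any C3D code base.

```python
def c3d_learning_rate(iteration, base_lr=0.003, halve_every=150_000):
    """Step schedule: halve the learning rate every `halve_every` iterations."""
    return base_lr * 0.5 ** (iteration // halve_every)

for iteration in (0, 150_000, 300_000, 1_900_000):
    print(iteration, c3d_learning_rate(iteration))
```

By the final iteration the rate has been halved 12 times, since 1.9M // 150k = 12.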

During data processing, video clips are defined to have size $c × l × h × w$, where $c$ is the number of channels, $l$ is the number of frames, $h$ is the frame height, and $w$ is the frame width. 3D convolution and pooling kernels have size $d × k × k$, where $d$ is the temporal depth of the kernel and $k$ is its spatial size. The network takes video clips as input and predicts class labels. All video frames are resized to $128 × 171$, roughly half the resolution of the frames in UCF-101. Each video is split into non-overlapping 16-frame clips, which serve as the network's input. Finally the frames are cropped, so the input data has size $16 × 112 × 112$.
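Splitting a video into non-overlapping 16-frame clips amounts to stepping through the frame indices in strides of 16. The sketch below uses a hypothetical helper `split_into_clips`; dropping the trailing frames that do not fill a whole clip is an assumption, since the text does not say how the remainder is handled.

```python
def split_into_clips(num_frames, clip_len=16):
    """Return (start, end) frame index pairs for non-overlapping clips.

    Assumption: trailing frames that do not fill a whole clip are dropped.
    """
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]

clips = split_into_clips(num_frames=50)
print(clips)  # [(0, 16), (16, 32), (32, 48)]
```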

3. C3D Model Training

Next we train the C3D model. Training consists of data preprocessing and model training. The dataset used here is UCF-101. Because the C3D model takes individual video frames as input, we need to extract frames from the dataset's videos, that is, convert the videos to images, and then feed the image data into the model for training.

In this case study we randomly sampled part of the UCF-101 dataset to demonstrate training. Interested readers can download the full UCF-101 dataset and train on it.

UCF-101 download

The dataset is stored in the directory dataset_subset.

The following code uses the cv2 library to convert video files to image files.

import cv2
import os

# Location of the video dataset
video_path = './dataset_subset/'
# Location where the generated image dataset is saved
save_path = './dataset/'
# Create the output path if it does not exist
if not os.path.exists(save_path):
    os.mkdir(save_path)

# Get the list of actions
action_list = os.listdir(video_path)
# Iterate over all actions, skipping hidden entries
for action in action_list:
    if not action.startswith("."):
        if not os.path.exists(save_path + action):
            os.mkdir(save_path + action)
        video_list = os.listdir(video_path + action)
        # Iterate over all videos
        for video in video_list:
            prefix = video.split('.')[0]
            if not os.path.exists(os.path.join(save_path, action, prefix)):
                os.mkdir(os.path.join(save_path, action, prefix))
            save_name = os.path.join(save_path, action, prefix) + '/'
            video_name = video_path + action + '/' + video
            # Open the video file; cap is the capture handle
            cap = cv2.VideoCapture(video_name)
            # Total number of frames in the video (not the frame rate)
            frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            saved = 0
            for i in range(frame_count):
                ret, frame = cap.read()
                if ret:
                    # Write the frame to an image file
                    cv2.imwrite(save_name + str(10000 + saved) + '.jpg', frame)
                    saved += 1
            cap.release()

At this point, the frame-by-frame images converted from the videos have been saved, ready for model training.

4. Model Training

First, we build the model structure.

The C3D model structure was introduced earlier. Here we build the model with the Conv3D, MaxPool3D, and ZeroPadding3D layers provided by Keras.

from keras.layers import Dense, Dropout, Conv3D, Input, MaxPool3D, Flatten, Activation, ZeroPadding3D
from keras.regularizers import l2
from keras.models import Model, Sequential

# Input: 112x112 images, 16 frames, 3 channels
input_shape = (112, 112, 16, 3)
# Weight decay
weight_decay = 0.005
# Number of classes; the dataset is UCF-101, so 101
nb_classes = 101

# Build the model structure
inputs = Input(input_shape)
x = Conv3D(64, (3, 3, 3), strides=(1, 1, 1), padding='same', activation='relu', kernel_regularizer=l2(weight_decay))(inputs)
x = MaxPool3D((2, 2, 1), strides=(2, 2, 1), padding='same')(x)
x = Conv3D(128, (3, 3, 3), strides=(1, 1, 1), padding='same', activation='relu', kernel_regularizer=l2(weight_decay))(x)
x = MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same')(x)
x = Conv3D(128, (3, 3, 3), strides=(1, 1, 1), padding='same', activation='relu', kernel_regularizer=l2(weight_decay))(x)
x = MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same')(x)
x = Conv3D(256, (3, 3, 3), strides=(1, 1, 1), padding='same', activation='relu', kernel_regularizer=l2(weight_decay))(x)
x = MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same')(x)
x = Conv3D(256, (3, 3, 3), strides=(1, 1, 1), padding='same', activation='relu', kernel_regularizer=l2(weight_decay))(x)
x = MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same')(x)
x = Flatten()(x)
x = Dense(2048, activation='relu', kernel_regularizer=l2(weight_decay))(x)
x = Dropout(0.5)(x)
x = Dense(2048, activation='relu', kernel_regularizer=l2(weight_decay))(x)
x = Dropout(0.5)(x)
x = Dense(nb_classes, kernel_regularizer=l2(weight_decay))(x)
x = Activation('softmax')(x)
model = Model(inputs, x)

Using TensorFlow backend.

The summary() method provided by Keras prints the model structure, listing the layers together with each layer's output shape and parameter count.

model.summary()

(The output is long and is omitted here.)
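Because the summary output is not reproduced, the feature-map sizes can be worked out by hand instead. The sketch below assumes the usual rule for `padding='same'` (output length is the input length divided by the stride, rounded up) and uses the pooling strides from the model definition above; the helper name `same_pool` is made up for this example.

```python
import math

def same_pool(size, stride):
    """Output length of a pooling layer with padding='same'."""
    return math.ceil(size / stride)

# (height, width, frames) of the input. The 'same' convolutions with
# stride 1 leave these sizes unchanged, so only the pooling layers matter.
shape = (112, 112, 16)
pool_strides = [(2, 2, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]
for strides in pool_strides:
    shape = tuple(same_pool(s, st) for s, st in zip(shape, strides))
    print(shape)

# 256 filters in the last convolution feed the Flatten layer
flattened = shape[0] * shape[1] * shape[2] * 256
print(flattened)  # 4096
```

The first pooling layer leaves the 16 frames untouched, so temporal information is merged only gradually in the later layers.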

The model's input attribute shows the input shape, whose dimensions are (batch size, height, width, frames, channels).

model.input

<tf.Tensor 'input_1:0' shape=(?, 112, 112, 16, 3) dtype=float32>

Compared with an image model, the data has one extra dimension, frames, which reflects the role of temporal relationships in video analysis.

Next, we convert the image files into the data format needed for training.

# Import the required libraries
from keras.optimizers import SGD, Adam
from keras.utils import np_utils
import numpy as np
import random
import cv2
import matplotlib.pyplot as plt
# Custom callbacks
from schedules import onetenth_4_8_12

Parameter definitions

img_path = save_path  # Location of the image files
results_path = './results'  # Location where training results are saved
if not os.path.exists(results_path):
    os.mkdir(results_path)

Split the dataset: randomly draw 4/5 of the videos as the training set and use the rest as the validation set. The file information is stored in train_list and test_list, ready for training.

cates = os.listdir(img_path)
train_list = []
test_list = []
# Iterate over all action categories
for cate in cates:
    videos = os.listdir(os.path.join(img_path, cate))
    length = len(videos) // 5
    # Randomly sample 4/5 of the videos for the training set
    train = random.sample(videos, length * 4)
    train_list.extend(train)
    # Put the remaining videos in the validation set
    for video in videos:
        if video not in train:
            test_list.append(video)
print("Training set:")
print(train_list)
print("%d videos in total\n" % (len(train_list)))
print("Validation set:")
print(test_list)
print("%d videos in total" % (len(test_list)))

(The output is long and is omitted here.)

Next, we train the model.

First, define the data-loading method. process_data reads one batch of data: the image data for 16-frame clips together with their labels. When training images are read, they are randomly cropped and flipped for data augmentation.

def process_data(img_path, file_list, batch_size=16, train=True):
    batch = np.zeros((batch_size, 16, 112, 112, 3), dtype='float32')
    labels = np.zeros(batch_size, dtype='int')

    def read_classes():
        # Map each class name to its index as listed in classInd.txt
        path = "./classInd.txt"
        with open(path, "r") as f:
            lines = f.readlines()
        classes = {}
        for line in lines:
            c_id = line.split()[0]
            c_name = line.split()[1]
            classes[c_name] = c_id
        return classes

    classes_dict = read_classes()
    # Each video fills the same batch buffer in turn
    for file in file_list:
        cate = file.split("_")[1]
        img_list = os.listdir(os.path.join(img_path, cate, file))
        img_list.sort()
        for i in range(batch_size):
            path = os.path.join(img_path, cate, file)
            label = int(classes_dict[cate]) - 1
            symbol = len(img_list) // 16
            if train:
                # Random crop offsets
                crop_x = random.randint(0, 15)
                crop_y = random.randint(0, 58)
                # Random horizontal flip
                is_flip = random.randint(0, 1)
                # Read 16 frames as one clip
                for j in range(16):
                    img = img_list[symbol + j]
                    image = cv2.imread(path + '/' + img)
                    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                    image = cv2.resize(image, (171, 128))
                    if is_flip == 1:
                        image = cv2.flip(image, 1)
                    batch[i, j] = image[crop_x:crop_x + 112, crop_y:crop_y + 112, :]
                symbol -= 1
                if symbol < 0:
                    break
                labels[i] = label
            else:
                for j in range(16):
                    img = img_list[symbol + j]
                    image = cv2.imread(path + '/' + img)
                    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                    image = cv2.resize(image, (171, 128))
                    # Fixed central crop for validation
                    batch[i, j] = image[8:120, 30:142, :]
                symbol -= 1
                if symbol < 0:
                    break
                labels[i] = label
    return batch, labels

batch, labels = process_data(img_path, train_list)
print("Shape of each batch: %s" % (str(batch.shape)))
print("Shape of each label array: %s" % (str(labels.shape)))

Shape of each batch: (16, 16, 112, 112, 3)
Shape of each label array: (16,)
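The random crop and flip used for augmentation can also be applied to a whole clip at once with array slicing. The following is a minimal sketch rather than a drop-in replacement for process_data: the function name `augment_clip`, its `rng` parameter, and the `(frames, height, width, channels)` layout are assumptions made for the example. One crop window and one flip decision are shared by all 16 frames so that the clip stays temporally consistent.

```python
import numpy as np

def augment_clip(clip, crop_size=112, rng=None):
    """Randomly crop and horizontally flip a clip of shape (frames, h, w, c)."""
    if rng is None:
        rng = np.random.default_rng()
    _, height, width, _ = clip.shape
    # Pick one crop window for the whole clip (upper bound is exclusive)
    top = int(rng.integers(0, height - crop_size + 1))
    left = int(rng.integers(0, width - crop_size + 1))
    out = clip[:, top:top + crop_size, left:left + crop_size, :]
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]  # reverse the width axis = horizontal flip
    return np.ascontiguousarray(out)

clip = np.zeros((16, 128, 171, 3), dtype=np.float32)
augmented = augment_clip(clip, rng=np.random.default_rng(0))
print(augmented.shape)  # (16, 112, 112, 3)
```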

Define the data generators, which feed batches of data to the training function.

def generator_train_batch(train_list, batch_size, num_classes, img_path):
    while True:
        # Read one batch of training data
        x_train, x_labels = process_data(img_path, train_list, batch_size=batch_size, train=True)
        x = preprocess(x_train)
        # Convert to the format the model input expects
        y = np_utils.to_categorical(np.array(x_labels), num_classes)
        x = np.transpose(x, (0, 2, 3, 1, 4))
        yield x, y

def generator_val_batch(test_list, batch_size, num_classes, img_path):
    while True:
        # Read one batch of validation data
        y_test, y_labels = process_data(img_path, test_list, batch_size=batch_size, train=False)
        x = preprocess(y_test)
        # Convert to the format the model input expects
        x = np.transpose(x, (0, 2, 3, 1, 4))
        y = np_utils.to_categorical(np.array(y_labels), num_classes)
        yield x, y

Define the method preprocess, which standardizes the input images by subtracting a per-channel mean and dividing by a per-channel standard deviation.

def preprocess(inputs):
    # Subtract the per-channel mean
    inputs[..., 0] -= 99.9
    inputs[..., 1] -= 92.1
    inputs[..., 2] -= 82.6
    # Divide by the per-channel standard deviation
    inputs[..., 0] /= 65.8
    inputs[..., 1] /= 62.3
    inputs[..., 2] /= 60.3
    return inputs

# Training one epoch takes about 4 minutes
# Number of classes
num_classes = 101
# Batch size
batch_size = 4
# Number of epochs
epochs = 1
# Learning rate
lr = 0.005
# Optimizer
sgd = SGD(lr=lr, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
# Start training
history = model.fit_generator(
    generator_train_batch(train_list, batch_size, num_classes, img_path),
    steps_per_epoch=len(train_list) // batch_size,
    epochs=epochs,
    callbacks=[onetenth_4_8_12(lr)],
    validation_data=generator_val_batch(test_list, batch_size, num_classes, img_path),
    validation_steps=len(test_list) // batch_size,
    verbose=1)
# Save the trained weights
model.save_weights(os.path.join(results_path, 'weights_c3d.h5'))

Epoch 1/1
20/20 [==============================] - 442s 22s/step - loss: 28.7099 - acc: 0.9344 - val_loss: 27.7600 - val_acc: 1.0000
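The six in-place statements in preprocess above are a per-channel standardization, which NumPy broadcasting can express in one line. The sketch below is an alternative formulation under assumptions: the names `CHANNEL_MEAN`, `CHANNEL_STD`, and `standardize` are invented here, and unlike preprocess it returns a new array instead of modifying its argument.

```python
import numpy as np

# Per-channel statistics copied from preprocess() above
CHANNEL_MEAN = np.array([99.9, 92.1, 82.6], dtype=np.float32)
CHANNEL_STD = np.array([65.8, 62.3, 60.3], dtype=np.float32)

def standardize(batch):
    """Return (batch - mean) / std; the last axis must hold the 3 channels."""
    # The length-3 vectors broadcast across all leading axes
    return (batch.astype(np.float32) - CHANNEL_MEAN) / CHANNEL_STD

batch = np.full((2, 16, 112, 112, 3), 99.9, dtype=np.float32)
normalized = standardize(batch)
print(normalized[0, 0, 0, 0])
```

Leaving the input untouched makes it safe to call the function twice on the same array, which the in-place version is not.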

5. Model Testing

Next we test the trained model. We pick a video file from UCF-101 at random as test data, extract its frames, and pass every 16 frames to the model for one action prediction. The predicted action and its probability are drawn onto the frame and the video is played back.

First, import the required libraries.

from IPython.display import clear_output, Image, display, HTML
import time
import cv2
import base64
import numpy as np

Build the model structure and load the weights.

from models import c3d_model
model = c3d_model()
# Load the weights that were just trained
model.load_weights(os.path.join(results_path, 'weights_c3d.h5'), by_name=True)

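Define the function arrayShow, which encodes an image array as JPEG so that it can be displayed.

def arrayShow(img):
    _, ret = cv2.imencode('.jpg', img)
    return Image(data=ret.tobytes())

Preprocess the video and run prediction, draw the prediction onto each frame, and finally play the result.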

# Load all class names and indices
with open('./ucfTrainTestlist/classInd.txt', 'r') as f:
    class_names = f.readlines()

# Read the video file
video = './videos/v_Punch_g03_c01.avi'
cap = cv2.VideoCapture(video)
clip = []
# Feed the video frames into the model
while True:
    try:
        clear_output(wait=True)
        ret, frame = cap.read()
        if ret:
            tmp = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            clip.append(cv2.resize(tmp, (171, 128)))
            # Predict once 16 frames are buffered
            if len(clip) == 16:
                inputs = np.array(clip).astype(np.float32)
                inputs = np.expand_dims(inputs, axis=0)
                inputs[..., 0] -= 99.9
                inputs[..., 1] -= 92.1
                inputs[..., 2] -= 82.6
                inputs[..., 0] /= 65.8
                inputs[..., 1] /= 62.3
                inputs[..., 2] /= 60.3
                inputs = inputs[:, :, 8:120, 30:142, :]
                inputs = np.transpose(inputs, (0, 2, 3, 1, 4))
                # Get the prediction
                pred = model.predict(inputs)
                label = np.argmax(pred[0])
                # Draw the prediction onto the frame
                cv2.putText(frame, class_names[label].split(' ')[-1].strip(), (20, 20),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 1)
                cv2.putText(frame, "prob: %.4f" % pred[0][label], (20, 40),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 1)
                # Drop the oldest frame so the window slides by one
                clip.pop(0)
            # Play the annotated video
            lines, columns, _ = frame.shape
            frame = cv2.resize(frame, (int(columns), int(lines)))
            img = arrayShow(frame)
            display(img)
            time.sleep(0.02)
        else:
            break
    except KeyboardInterrupt:
        # Stop playback when the cell is interrupted
        break
cap.release()
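The loop above keeps a sliding window of the 16 most recent frames by appending to a list and popping the oldest entry after each prediction. The same buffering pattern can be sketched in isolation with `collections.deque`, which evicts the oldest item automatically. The generator name `sliding_clips` is hypothetical, and plain integers stand in for decoded frames.

```python
from collections import deque

def sliding_clips(frames, clip_len=16):
    """Yield the latest clip_len frames at every step once the buffer is full."""
    buffer = deque(maxlen=clip_len)  # the oldest frame is evicted automatically
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == clip_len:
            yield list(buffer)

# Stand-in for decoded frames: integers 0..19 instead of images
windows = list(sliding_clips(range(20)))
print(len(windows))                     # 5
print(windows[0][0], windows[-1][-1])   # 0 19
```

A video with n frames therefore yields n - 15 predictions, one for every frame from the 16th onward.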

6. I3D Model

We briefly introduced the I3D model earlier. The official I3D GitHub repository provides models pretrained on Kinetics along with prediction code. Next we try out how the I3D model makes predictions on a video.

First, import the required packages.

import numpy as np
import tensorflow as tf
import i3d

Define the parameters.

# Input image size
_IMAGE_SIZE = 224
# Number of frames in the video
_SAMPLE_VIDEO_FRAMES = 79
# The input consists of two parts: RGB and optical flow
# Both have been computed in advance
_SAMPLE_PATHS = {
    'rgb': 'data/v_CricketShot_g04_c01_rgb.npy',
    'flow': 'data/v_CricketShot_g04_c01_flow.npy',
}
# Several sets of pretrained weights are available
# The imagenet variants are inflated from 2D ImageNet weights; the others are pretrained on video data
_CHECKPOINT_PATHS = {
    'rgb': 'data/checkpoints/rgb_scratch/model.ckpt',
    'flow': 'data/checkpoints/flow_scratch/model.ckpt',
    'rgb_imagenet': 'data/checkpoints/rgb_imagenet/model.ckpt',
    'flow_imagenet': 'data/checkpoints/flow_imagenet/model.ckpt',
}
# Label map file
_LABEL_MAP_PATH = 'data/label_map.txt'
# Number of classes is 400
NUM_CLASSES = 400

Define the parameter:

imagenet_pretrained: if True, load the weights inflated from ImageNet; if False, load the weights pretrained on video data only (the scratch checkpoints).

imagenet_pretrained = True
# Load the action class names
kinetics_classes = [x.strip() for x in open(_LABEL_MAP_PATH)]
tf.logging.set_verbosity(tf.logging.INFO)

Build the RGB stream

rgb_input = tf.placeholder(tf.float32, shape=(1, _SAMPLE_VIDEO_FRAMES, _IMAGE_SIZE, _IMAGE_SIZE, 3))
with tf.variable_scope('RGB', reuse=tf.AUTO_REUSE):
    rgb_model = i3d.InceptionI3d(NUM_CLASSES, spatial_squeeze=True, final_endpoint='Logits')
    rgb_logits, _ = rgb_model(rgb_input, is_training=False, dropout_keep_prob=1.0)

rgb_variable_map = {}
for variable in tf.global_variables():
    if variable.name.split('/')[0] == 'RGB':
        rgb_variable_map[variable.name.replace(':0', '')] = variable
rgb_saver = tf.train.Saver(var_list=rgb_variable_map, reshape=True)

Build the optical-flow stream

flow_input = tf.placeholder(tf.float32, shape=(1, _SAMPLE_VIDEO_FRAMES, _IMAGE_SIZE, _IMAGE_SIZE, 2))
with tf.variable_scope('Flow', reuse=tf.AUTO_REUSE):
    flow_model = i3d.InceptionI3d(NUM_CLASSES, spatial_squeeze=True, final_endpoint='Logits')
    flow_logits, _ = flow_model(flow_input, is_training=False, dropout_keep_prob=1.0)

flow_variable_map = {}
for variable in tf.global_variables():
    if variable.name.split('/')[0] == 'Flow':
        flow_variable_map[variable.name.replace(':0', '')] = variable
flow_saver = tf.train.Saver(var_list=flow_variable_map, reshape=True)

Combine the two streams into the complete I3D model

model_logits = rgb_logits + flow_logits
model_predictions = tf.nn.softmax(model_logits)
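The fusion step adds the two streams' logits before the softmax, so a class that one stream favors weakly can still win if the other stream favors it strongly. Below is a small NumPy sketch of that late fusion with made-up logits for a 3-class problem; the `toy_` arrays and the `softmax` helper are illustrative and are not part of the I3D code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Made-up logits, one row per video; the RGB stream prefers class 0,
# the flow stream prefers class 1.
toy_rgb_logits = np.array([[2.0, 1.0, 0.1]])
toy_flow_logits = np.array([[0.5, 2.5, 0.2]])

fused = softmax(toy_rgb_logits + toy_flow_logits)
print(fused, fused.argmax(axis=-1))
```

Here the summed logits are [2.5, 3.5, 0.3], so the fused prediction is class 1 even though the RGB stream alone would pick class 0.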

Run the model to obtain the action prediction for the video.

The inputs are the precomputed RGB and optical-flow data defined earlier:

with tf.Session() as sess:
    feed_dict = {}
    if imagenet_pretrained:
        rgb_saver.restore(sess, _CHECKPOINT_PATHS['rgb_imagenet'])  # Load the RGB stream weights
    else:
        rgb_saver.restore(sess, _CHECKPOINT_PATHS['rgb'])
    tf.logging.info('RGB checkpoint restored')
    if imagenet_pretrained:
        flow_saver.restore(sess, _CHECKPOINT_PATHS['flow_imagenet'])  # Load the flow stream weights
    else:
        flow_saver.restore(sess, _CHECKPOINT_PATHS['flow'])
    tf.logging.info('Flow checkpoint restored')

    start_time = time.time()
    rgb_sample = np.load(_SAMPLE_PATHS['rgb'])  # Load the RGB stream input
    tf.logging.info('RGB data loaded, shape=%s', str(rgb_sample.shape))
    feed_dict[rgb_input] = rgb_sample
    flow_sample = np.load(_SAMPLE_PATHS['flow'])  # Load the flow stream input
    tf.logging.info('Flow data loaded, shape=%s', str(flow_sample.shape))
    feed_dict[flow_input] = flow_sample

    out_logits, out_predictions = sess.run([model_logits, model_predictions], feed_dict=feed_dict)
    out_logits = out_logits[0]
    out_predictions = out_predictions[0]
    sorted_indices = np.argsort(out_predictions)[::-1]
    print('Inference time in sec: %.3f' % float(time.time() - start_time))
    print('Norm of logits: %f' % np.linalg.norm(out_logits))
    print('\nTop classes and probabilities')
    for index in sorted_indices[:20]:
        print(out_predictions[index], out_logits[index], kinetics_classes[index])

INFO:tensorflow:Restoring parameters from data/checkpoints/rgb_imagenet/model.ckpt
INFO:tensorflow:RGB checkpoint restored
INFO:tensorflow:Restoring parameters from data/checkpoints/flow_imagenet/model.ckpt
INFO:tensorflow:Flow checkpoint restored
INFO:tensorflow:RGB data loaded, shape=(1, 79, 224, 224, 3)
INFO:tensorflow:Flow data loaded, shape=(1, 79, 224, 224, 2)
Inference time in sec: 1.511
Norm of logits: 138.468643

Top classes and probabilities
1.0 41.813675 playing cricket
1.497162e-09 21.49398 hurling (sport)
3.8431236e-10 20.13411 catching or throwing baseball
1.549242e-10 19.22559 catching or throwing softball
1.1360187e-10 18.915354 hitting baseball
8.801105e-11 18.660116 playing tennis
2.4415466e-11 17.37787 playing kickball
1.153184e-11 16.627766 playing squash or racquetball
6.1318893e-12 15.996157 shooting goal (soccer)
4.391727e-12 15.662376 hammer throw
2.2134352e-12 14.9772005 golf putting
1.6307096e-12 14.67167 throwing discus
1.5456218e-12 14.618079 javelin throw
7.6690325e-13 13.917259 pumping fist
5.1929587e-13 13.527372 shot put
4.2681337e-13 13.331245 celebrating
2.7205462e-13 12.880901 applauding
1.8357015e-13 12.487494 throwing ball
1.6134511e-13 12.358444 dodgeball
1.1388395e-13 12.010078 tap dancing
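The probabilities and logits in this output can be checked against each other. Softmax probabilities are proportional to exp(logit), so the ratio of two probabilities depends only on the difference between their logits. The short check below copies the first two rows of the output; because the top probability is printed as 1.0, the ratio should be close to the second probability itself.

```python
import math

# Values copied from the first two rows of the output above
top_logit, second_logit = 41.813675, 21.49398
reported_second_prob = 1.497162e-09

# p_second / p_top = exp(second_logit - top_logit)
ratio = math.exp(second_logit - top_logit)
print(ratio)
```

A gap of about 20 between the two logits is what turns into a ratio of roughly one in a billion, which is why the model's top prediction looks so confident.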
