
Quickly Reading a 70,000-Image Dataset with Tensorflow!

Published: 2024/1/23


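1. Brief Overview

This article uses the well-known image database [THE MNIST DATABASE] as its image source. It contains 70,000 images of handwritten digits, each 28x28 pixels.

The images are split into a training set of 60,000 and a test set of 10,000, and 5,000 images of the training set are set aside for validation. The database is widely regarded as the "Hello World" of image processing, and countless studies have already been built around this dataset.

On first seeing the database, though, its appearance is bound to raise questions: from the outside we can see neither the images themselves nor any kind of index. It is simply four compressed archives, whose file names are listed later in this article.

The reason for packaging the database this way goes back to how a computer processes data and uses memory. An image, as humans perceive it, is a light signal received by the eye. When these colors are stored as data, there are two common formats, each with its own layout:

1. .jpeg: height, width, channels;

2. .png: height, width, channels, where the channels may include alpha.

(Note: images stored as .png can carry transparency information, which can be discarded when processing them.)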

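When such images are loaded into Python with a module such as opencv, the data arrive as an array of numbers (a NumPy array in the case of opencv). Every assignment of the form image = cv2.imread() that binds the data to an image object also loads the data into memory. Data held in memory can be accessed and processed very quickly, and only at this point is the data actually ready to be handed to an algorithm.

However, converting image files into in-memory data is not that fast: the position and color value of every pixel must first be decoded, and the values read from each image are then combined and placed in a buffer.

This process has no advantage for either moving or reading data, so we need to bring the data back to its most basic form: [binary].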
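To make the "height, width, channels" layout concrete, here is a minimal sketch. It assumes synthetic arrays in place of real decoded files (no image is actually read), and the variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-ins for decoded images (assumption: no real file is read).
rgb = np.zeros((28, 28, 3), dtype=np.uint8)    # JPEG-like: height, width, 3 channels
rgba = np.zeros((28, 28, 4), dtype=np.uint8)   # PNG-like: the 4th channel is alpha

# Dropping the alpha channel keeps only the first three channels.
rgb_from_png = rgba[:, :, :3]

print(rgb.shape, rgba.shape, rgb_from_png.shape)
print(rgb.nbytes)  # bytes occupied in memory by one decoded RGB image
```

Once decoded, an image is nothing more than a block of bytes with a shape attached, which is why storing the bytes directly avoids the decoding step.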

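2. Binary Data

Reasons for using binary data

Suppose we have a batch of images to feed into an algorithm. The process is like a person climbing the stairs and sitting down at the entrance of a water slide, waiting for a short trip into the unknown.

The water slide has several lanes, so that, say, five people can get ready to slide at once. If the people waiting behind them are not quick enough, an entrance stays empty for a while, which directly lowers efficiency.

In this analogy, the slide entrance is the GPU compute port used in deep learning, and the people about to slide are the data. What we need to optimize is how to have the next batch of preprocessed data ready before the GPU has finished processing the current one.

Keeping the GPU busy at all times further improves the overall efficiency of the computation. One way to do so is to return the data to its [binary] essence.

Binary is how data actually sits on the hard disk, and it is also the most basic state of data while it is being processed. The first thing to do with a batch of images is therefore to get them into memory in binary form.

This is like asking everyone queuing for the water slide to take off their shoes, watches, and glasses in advance and bring only what they really need. When their turn comes, they sit down, get into position, and go, with no extra motions slowing things down.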
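The idea of preparing the next batch while the current one is being processed can be sketched with a bounded queue and a background thread. This is only an illustration of the principle, not how Tensorflow implements it; the functions `producer` and `consume` and the sleep that stands in for preprocessing are assumptions made for the example.

```python
import queue
import threading
import time

def producer(batches, buffer):
    """Prepare batches ahead of time and place them in a bounded buffer."""
    for batch in batches:
        time.sleep(0.01)          # stands in for decoding / preprocessing
        buffer.put(batch)         # blocks when the buffer is already full
    buffer.put(None)              # sentinel: no more data

def consume(buffer):
    """Stand-in for the compute device: take whatever batch is ready."""
    processed = []
    while True:
        batch = buffer.get()
        if batch is None:
            break
        processed.append(sum(batch))   # stands in for the actual computation
    return processed

batches = [[i, i + 1, i + 2] for i in range(5)]
buffer = queue.Queue(maxsize=2)        # at most two batches waiting in line
worker = threading.Thread(target=producer, args=(batches, buffer))
worker.start()
results = consume(buffer)
worker.join()
print(results)
```

The bounded queue plays the role of the people waiting at the slide entrance: the consumer never waits as long as the producer keeps at least one batch in line.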

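The introductory database chosen here, MNIST, has thoughtfully done this preprocessing for us. It comes as four files:

Test set images: t10k-images-idx3-ubyte.gz;
Test set labels: t10k-labels-idx1-ubyte.gz;
Training set images: train-images-idx3-ubyte.gz;
Training set labels: train-labels-idx1-ubyte.gz.

Image recognition mostly falls under supervised learning, so two of the four files hold the labels for the corresponding image sets. All four are stored in binary form.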

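3. The Approach to Loading Images

Now that we know the database consists of binary data, we can parse it with Python modules. The files are compressed as .gz, and the module that opens this file type is gzip. The code is as follows: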

import gzip
import os

import numpy as np

location = input('The directory of MNIST dataset:')
path = os.path.join(location, 'train-images-idx3-ubyte.gz')
try:
    with gzip.open(path, 'rb') as fi:
        # Pixels are unsigned bytes (0-255), so read them as uint8;
        # offset=16 skips the 16-byte file header.
        data_i = np.frombuffer(fi.read(), dtype=np.uint8, offset=16)
    images_flat_all = data_i.reshape(-1, 784)
    print(images_flat_all)
    print('------Separation -----')
    print('Size of images_flat: ', len(images_flat_all))
except FileNotFoundError:
    print("The file directory doesn't exist!")

### ------Result is shown below -----###
# The directory of MNIST dataset:/home/abc/MNIST_DATA
# [[0 0 0 ... 0 0 0]
#  [0 0 0 ... 0 0 0]
#  [0 0 0 ... 0 0 0]
#  ...
#  [0 0 0 ... 0 0 0]
#  [0 0 0 ... 0 0 0]
#  [0 0 0 ... 0 0 0]]
# ------Separation -----
# Size of images_flat:  60000

path_label = os.path.join(location, 'train-labels-idx1-ubyte.gz')
with gzip.open(path_label, 'rb') as fl:
    # Labels only take the values 0-9, so int8 reads them correctly;
    # offset=8 skips the 8-byte file header.
    data_l = np.frombuffer(fl.read(), dtype=np.int8, offset=8)
print(data_l)
print('-----Separation -----')
print('Size of images_labels: ', len(data_l), type(data_l[0]))

### ----- Result is shown below ------ ###
# [5 0 4 ... 5 6 8]
# -----Separation -----
# Size of images_labels:  60000 <class 'numpy.int8'>

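The code has two halves. The first half extracts the 60,000 image samples of the MNIST training set; each sample is a 28×28 image flattened into a vector of length 1×784.

The second half extracts the labels that belong to the training images, which state which digit each image actually shows. There are likewise 60,000 labels. (Note: the test set and other databases of this kind use the same storage format.)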
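The values offset=16 and offset=8 used above skip the header of each file. In the IDX format, an image file starts with four big-endian 32-bit integers (magic number 2051, image count, rows, columns) and a label file starts with two (magic number 2049, label count). The sketch below reads such a header with the standard library; it builds a small synthetic file in memory rather than touching the real dataset, and the helper `make_fake_idx3` is hypothetical.

```python
import gzip
import io
import struct

def make_fake_idx3(num_images, rows, cols):
    """Build a gzip-compressed IDX3 blob with all-zero pixels (synthetic data)."""
    header = struct.pack('>IIII', 2051, num_images, rows, cols)
    pixels = bytes(num_images * rows * cols)
    return gzip.compress(header + pixels)

blob = make_fake_idx3(num_images=3, rows=28, cols=28)

with gzip.open(io.BytesIO(blob), 'rb') as f:
    # The first 16 bytes are the header; everything after is pixel data.
    magic, count, rows, cols = struct.unpack('>IIII', f.read(16))
    payload = f.read()

print(magic, count, rows, cols, len(payload))
```

Reading the header this way also gives the image count and size, so they do not have to be hard-coded as 784.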

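4. Explanation of the Code

Based on what we know about neural networks, when an image is fed into a network, every row of the two-dimensional matrix that represents the image has to be joined into one long one-dimensional vector, and this vector serves as the unit for one image.

MNIST goes a step further and concatenates the one-dimensional vectors of all 60,000 images into one very long vector, which is stored on the computer in binary form. To view the data as images, the following steps restore it:

Open the compressed file with gzip.open in 'rb' (read binary) mode;
Use .frombuffer to turn the data into an np.array;
Use dtype to interpret the raw bytes as 8-bit integers that humans can read (np.uint8 for pixel values from 0 to 255);
In the raw MNIST files, the image data only begins at byte 16, while the label data begins at byte 8, so offset is set to 16 or 8 respectively;
The data read out is one long run of 60,000 vectors joined together, so it has to be reshaped. The -1 in .reshape(-1, 784) acts like an unknown: as long as there are 784 columns, the number of rows is whatever it works out to be;
After extracting the labels, a one_hot() conversion is still needed so that, for example, [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] means "3". The purpose is to make it easy to plug the labels into the loss function and search for the optimum.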
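The dtype, reshape, and one-hot steps listed above can be tried on a few synthetic bytes. This is a minimal sketch with made-up data; it also shows why pixel data should be read as unsigned bytes.

```python
import numpy as np

raw = bytes([0, 127, 128, 255] * 3)            # 12 synthetic bytes

# The same bytes interpreted as unsigned and as signed 8-bit integers.
as_uint8 = np.frombuffer(raw, dtype=np.uint8)
as_int8 = np.frombuffer(raw, dtype=np.int8)
print(as_uint8[:4].tolist())   # values above 127 are kept
print(as_int8[:4].tolist())    # values above 127 wrap around to negatives

# -1 lets numpy work out the number of rows from the total length.
rows = as_uint8.reshape(-1, 4)
print(rows.shape)

# One-hot encoding: pick rows of the identity matrix by label.
labels = np.array([3, 0, 9])
one_hot = np.eye(10, dtype=float)[labels]
print(one_hot[0].tolist())
```

Indexing an identity matrix with the label array is the same trick the one_hot method of the MNIST class in section 6 relies on.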

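Describing the data with numpy arrays is efficient, and numpy is compatible with most data processing libraries, which is a real advantage in both convenience and speed.

The two links "numpy.frombuffer" and "Using dynamic arrays in NumPy" explain the use of the function in more depth.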

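5. Linear Model

With the data format and loading method understood, the next step is to apply the simplest linear model to the dataset and, through many iterations of gradient descent, find the minimum of the loss function we define for the model.

Recalling the content of the first chapter, the code for a linear function is as follows: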

import numpy as np
import tensorflow as tf  # written against the Tensorflow 1.x API (tf.Session, tf.train)

x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.1 + 0.3

weight = tf.Variable(tf.random_uniform(shape=[1], minval=-1.0, maxval=1.0))
bias = tf.Variable(tf.zeros(shape=[1]))
y = weight * x_data + bias

loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
training = optimizer.minimize(loss)

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

for step in range(101):
    sess.run(training)
    if step % 10 == 0:
        print('Round {}, weight: {}, bias: {}'.format(
            step, sess.run(weight[0]), sess.run(bias[0])))

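Here, each x has two corresponding y values. y_data is the correct answer we defined in advance, and the y computed from wx + b is the estimate we want to fit to it. These two y values for the same x are then plugged into a chosen loss function.

The loss chosen above is the squared error: the squared differences are averaged with reduce_mean(), and gradient descent then drives this value as low as possible.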
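The same fit can be written out by hand to show what the optimizer does at each step. This is a sketch in plain numpy rather than Tensorflow; the starting values, the fixed seed, and the number of iterations are assumptions for the example, while the learning rate 0.5 and the target line 0.1x + 0.3 come from the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.random(100)
y_data = x_data * 0.1 + 0.3          # the "correct answer"

weight, bias = 0.0, 0.0              # assumed starting point
learning_rate = 0.5

for step in range(1000):
    y = weight * x_data + bias       # current estimate
    error = y - y_data
    loss = np.mean(error ** 2)       # mean squared error
    # Gradients of the mean squared error with respect to weight and bias.
    grad_w = np.mean(2 * error * x_data)
    grad_b = np.mean(2 * error)
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

print(weight, bias, loss)
```

After enough iterations, weight and bias approach 0.1 and 0.3, the values used to generate y_data.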

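Image classification works the same way. Each pixel of an image plays the role of x in the computation above: it is the input shared by the correct label and the predicted label.

y_data corresponds to the correct label, while the predicted label is obtained only after a series of linear additions and multiplications followed by normalization.

One thing that looks different in the image case is that the computations for all pixels are packed into one large matrix and processed in parallel as small units of a single operation, which greatly speeds up the overall computation.

However, the memory available to the computer is limited, so we have to load image data in suitable amounts for parallel processing. If it is overloaded, the whole computation crashes.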
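Feeding the data in suitable amounts usually means drawing a small random batch per step. The sketch below uses synthetic arrays in place of the MNIST images and labels; the array sizes and the seed are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((1000, 784))            # synthetic stand-in for flattened images
labels = rng.integers(0, 10, size=1000)     # synthetic stand-in for digit labels

batch_size = 32
# Random row indices (sampling with replacement), shared by images and labels
# so that each image stays paired with its own label.
idx = rng.integers(0, len(images), size=batch_size)
x_batch = images[idx]
y_batch = labels[idx]

print(x_batch.shape, y_batch.shape)
```

Only the selected rows are handed to the model, so memory use per step depends on batch_size rather than on the size of the whole dataset.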

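6. MNIST in a Linear Model

Having gone over the linear model and the components of the MNIST dataset, the next step is to build a handwritten digit recognizer on top of a linear model in Tensorflow. A few points need to be restated:

batch size: the number of images in each training batch has to be controlled so that memory does not run out;

loss function: the loss function measures the gap between the prediction and the actual answer.

The training steps are then as follows:

We need a simple and convenient way to call up the MNIST data, so we write a class;

Build the Tensorflow dataflow graph, using nodes to design the linear operation wx + b;

Plug the result and the actual labels into the loss function to get the loss value;

Use gradient descent to find the minimum of the loss value;

After iterative training, check the accuracy of the result;

Check which labels the wrongly classified images were assigned.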

import gzip
import os

import numpy as np

################ Step No.1 to well manage the dataset. ################
class MNIST:
    # Images size is told in the official website 28*28 px.
    image_size = 28
    image_size_flat = image_size * image_size

    # Let the validation set flexible when making an instance.
    def __init__(self, val_ratio=0.1, data_dir='MNIST_data'):
        self.val_ratio = val_ratio
        self.data_dir = data_dir

        # Load 4 files to individual lists with one string pixels.
        img_train = self.load_flat_images('train-images-idx3-ubyte.gz')
        lab_train = self.load_labels('train-labels-idx1-ubyte.gz')
        img_test = self.load_flat_images('t10k-images-idx3-ubyte.gz')
        lab_test = self.load_labels('t10k-labels-idx1-ubyte.gz')

        # Determine the actual number of training / validation sets.
        self.val_train_num = round(len(img_train) * self.val_ratio)
        self.main_train_num = len(img_train) - self.val_train_num

        # The normalized image pixels value can be more convenient when training.
        # np.int64 replaces the removed np.int alias and matches the tf.int64 placeholder.
        self.img_train = img_train[0:self.main_train_num] / 255.0
        self.lab_train = lab_train[0:self.main_train_num].astype(np.int64)
        self.img_train_val = img_train[self.main_train_num:] / 255.0
        self.lab_train_val = lab_train[self.main_train_num:].astype(np.int64)

        # Also convert the format of testing set.
        self.img_test = img_test / 255.0
        self.lab_test = lab_test.astype(np.int64)

    # Extract the same codes from "load_flat_images" and "load_labels".
    # This method won't be called during training procedure.
    def load_binary_to_num(self, dataset_name, offset):
        path = os.path.join(self.data_dir, dataset_name)
        with gzip.open(path, 'rb') as binary_file:
            # The datasets files are stored in 8 bits, mind the format.
            data = np.frombuffer(binary_file.read(), np.uint8, offset=offset)
        return data

    # This method won't be called during training procedure.
    def load_flat_images(self, dataset_name):
        # Images offset position is 16 by default format
        data = self.load_binary_to_num(dataset_name, offset=16)
        images_flat_all = data.reshape(-1, self.image_size_flat)
        return images_flat_all

    # This method won't be called during training procedure.
    def load_labels(self, dataset_name):
        # Labels offset position is 8 by default format.
        labels_all = self.load_binary_to_num(dataset_name, offset=8)
        return labels_all

    # This method would be called for training usage.
    def one_hot(self, labels):
        # Properly use numpy module to mimic the one hot effect.
        class_num = np.max(self.lab_test) + 1
        convert = np.eye(class_num, dtype=float)[labels]
        return convert
#---------------------------------------------------------------------#

path = '/home/abc/MNIST_data'
data = MNIST(val_ratio=0.1, data_dir=path)

import tensorflow as tf  # written against the Tensorflow 1.x API (placeholders, sessions)

flat_size = data.image_size_flat
label_num = np.max(data.lab_test) + 1

################ Step No.2 to construct tensor graph. ################
x_train = tf.placeholder(dtype=tf.float32, shape=[None, flat_size])
t_label_oh = tf.placeholder(dtype=tf.float32, shape=[None, label_num])
t_label = tf.placeholder(dtype=tf.int64, shape=[None])

################ These are the values ################
# Initialize the beginning weights and biases by random_normal method.
weights = tf.Variable(tf.random_normal([flat_size, label_num],
                                       mean=0.0, stddev=1.0, dtype=tf.float32))
biases = tf.Variable(tf.random_normal([label_num],
                                      mean=0.0, stddev=1.0, dtype=tf.float32))
########### that we wish to get by training ##########

logits = tf.matmul(x_train, weights) + biases  # < Annotation No.1 >

# Shrink the distances between values into 0 to 1 by softmax formula.
p_label_soh = tf.nn.softmax(logits)
# Pick the position of largest value along y axis.
p_label = tf.argmax(p_label_soh, axis=1)
#---------------------------------------------------------------------#

####### Step No.3 to get a loss value by certain loss function. #######
# This softmax function can not accept input being "softmaxed" before.
CE = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=t_label_oh)
# Shrink all loss values in a matrix to only one averaged loss.
loss = tf.reduce_mean(CE)
#---------------------------------------------------------------------#

#### Step No.4 get a minimized loss value using gradient descent. ####
# Decrease this only averaged loss to a minimum value with a
# gradient-based optimizer (Adam).
optimizer = tf.train.AdamOptimizer(learning_rate=0.5).minimize(loss)
#---------------------------------------------------------------------#

# First return a boolean list values by tf.equal function
correct_predict = tf.equal(p_label, t_label)
# And cast them into 0 and 1 values so that its average value would be accuracy.
accuracy = tf.reduce_mean(tf.cast(correct_predict, dtype=tf.float32))

sess = tf.Session()
sess.run(tf.global_variables_initializer())

###### Step No.5 iterate the training set and check the accuracy. #####
# The trigger to train the linear model with a defined cycles.
def optimize(iteration, batch_size=32):
    for i in range(iteration):
        total = len(data.lab_train)
        idx = np.random.randint(0, total, size=batch_size)
        # Randomly pick training images / labels with a defined batch size.
        x_train_batch = data.img_train[idx]
        t_label_batch_oh = data.one_hot(data.lab_train[idx])
        batch_dict = {x_train: x_train_batch, t_label_oh: t_label_batch_oh}
        sess.run(optimizer, feed_dict=batch_dict)

# The trigger to check the current accuracy value
def Accuracy():
    # Use the totally separate dataset to test the trained model
    test_dict = {x_train: data.img_test,
                 t_label_oh: data.one_hot(data.lab_test),
                 t_label: data.lab_test}
    Acc = sess.run(accuracy, feed_dict=test_dict)
    print('Accuracy on Test Set: {0:.2%}'.format(Acc))
#---------------------------------------------------------------------#

### Step No.6 plot wrong predicted pictures with its predicted label.##
import matplotlib.pyplot as plt

# We can decide how many wrong predicted images are going to be shown up.
# We can focus on the specific wrong predicted labels
def wrong_predicted_images(pic_num=(3, 4), label_number=None):
    test_dict = {x_train: data.img_test,
                 t_label_oh: data.one_hot(data.lab_test),
                 t_label: data.lab_test}
    correct_pred, p_lab = sess.run([correct_predict, p_label], feed_dict=test_dict)

    # To reverse the boolean value in order to pick up wrong labels
    wrong_pred = np.logical_not(correct_pred)

    # Pick up the wrong doing elements from the corresponding places
    wrong_img_test = data.img_test[wrong_pred]
    wrong_t_label = data.lab_test[wrong_pred]
    wrong_p_label = p_lab[wrong_pred]

    fig, axes = plt.subplots(pic_num[0], pic_num[1])
    fig.subplots_adjust(hspace=0.3, wspace=0.3)
    edge = data.image_size

    for ax in axes.flat:
        # If we were not interested in certain label number,
        # pick up the wrong predicted images randomly.
        if label_number is None:
            i = np.random.randint(0, len(wrong_t_label))
            pic = wrong_img_test[i].reshape(edge, edge)
            ax.imshow(pic, cmap='binary')
            xlabel = "True: {0}, Pred: {1}".format(wrong_t_label[i], wrong_p_label[i])
        # If we are interested in certain label number,
        # pick up the specific wrong images number randomly.
        else:
            # Mind that np.where return a "tuple" that should be indexing.
            specific_idx = np.where(wrong_t_label == label_number)[0]
            i = np.random.randint(0, len(specific_idx))
            pic = wrong_img_test[specific_idx[i]].reshape(edge, edge)
            ax.imshow(pic, cmap='binary')
            xlabel = "True: {0}, Pred: {1}".format(wrong_t_label[specific_idx[i]],
                                                   wrong_p_label[specific_idx[i]])
        ax.set_xlabel(xlabel)
        # Pictures don't need any ticks, so we remove them in both dimensions
        ax.set_xticks([])
        ax.set_yticks([])
    plt.show()
#---------------------------------------------------------------------#

Accuracy()                      # Accuracy before doing anything
optimize(10); Accuracy()        # Iterate 10 times
optimize(1000); Accuracy()      # Iterate 10 + 1000 times
optimize(10000); Accuracy()     # Iterate 10 + 1000 + 10000 times

### ----- Results are shown below ----- ###
# Accuracy on Test Set: 11.51%
# Accuracy on Test Set: 68.37%
# Accuracy on Test Set: 86.38%
# Accuracy on Test Set: 89.34%

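Annotation No.1 tf.matmul(x_train, weights)

Once the overall training principle of a neural network is understood, this step is the most important sub-topic. The matrix model has to accommodate a random_batch of any size while still obeying the rules of matrix multiplication, as the figure below describes:

[Figure: x (batch, 784) · w (784, 10) = logits (batch, 10)]

The order of the matrices matters. After our processing, the dataset already has the shape of the left matrix, and since the expected output is the right matrix, the order can only be x·w. The number of rows of x (the random batch size) determines the number of rows of predicted labels, and w determines how many class labels there are.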
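The shape rule can be checked directly with numpy. This sketch uses random arrays in place of the trained variables; the sizes 784 and 10 come from the code above, and everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
flat_size, label_num = 784, 10
weights = rng.normal(size=(flat_size, label_num))
biases = rng.normal(size=label_num)

# Any batch size works, because only the inner dimension (784) must match.
for batch_size in (1, 32, 100):
    x = rng.random((batch_size, flat_size))
    logits = x @ weights + biases      # biases are broadcast across the rows
    print(x.shape, '@', weights.shape, '->', logits.shape)

# Reversing the order fails because the inner dimensions no longer match.
try:
    weights @ x
except ValueError as err:
    print('weights @ x fails:', err)
```

This is why the placeholder for x is declared with shape [None, flat_size]: the first dimension is left open for the batch size.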

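Reason for using one_hot()

As the figure above shows, the result of the linear operation on the dataset can only have size=[None, 10], but the labels supplied with the dataset are the digits themselves. We therefore need a way to turn each digit into a vector of 10 elements, and the first choice is one_hot(). The one_hot result is also what the loss function is computed from.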
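How the one-hot labels enter the loss can be sketched in numpy. This is a hand-written version of softmax cross entropy for illustration, not Tensorflow's implementation; the example logits and the three-class setup are made up.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, one_hot_labels):
    """Per-sample cross entropy between softmax(logits) and one-hot labels."""
    probs = softmax(logits)
    return -np.sum(one_hot_labels * np.log(probs), axis=1)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
one_hot = np.eye(3)[labels]            # same trick as the one_hot() method

losses = cross_entropy(logits, one_hot)
print(losses, losses.mean())
```

The one-hot vector simply selects the predicted probability of the correct class, so the loss is small when that probability is close to 1.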

7. Finally

Call the function defined above with the following code:

wrong_predicted_images(pic_num=[3, 3], label_number=5)


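You can choose how many images to display at once, and the images shown are picked at random each time. You can also choose which label to inspect: with the call above set to 5, only images whose true label is 5 and that were misclassified are shown, along with the wrong predictions. Finally, once the whole computation has finished, run the following code to close the tf.Session and release memory: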

sess.close()
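The selection logic inside wrong_predicted_images can be tried on its own with a few made-up labels. The arrays below are hypothetical and stand in for the true and predicted labels of the test set.

```python
import numpy as np

true_labels = np.array([5, 3, 5, 1, 5, 7])     # hypothetical true labels
pred_labels = np.array([5, 8, 3, 1, 6, 7])     # hypothetical predictions

# Boolean mask of the wrong predictions, then keep only those entries.
wrong = true_labels != pred_labels
wrong_true = true_labels[wrong]
wrong_pred = pred_labels[wrong]

# np.where returns a tuple, so [0] extracts the index array.
idx_of_5 = np.where(wrong_true == 5)[0]
print(wrong_true.tolist(), wrong_pred.tolist(), idx_of_5.tolist())
```

Here the digits 5 that were misread as 3 and 6 end up at positions 1 and 2 of the filtered arrays, which is what label_number=5 selects from.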

Reference: https://mp.weixin.qq.com/s?__biz=MjM5MjAwODM4MA==&mid=2650707508&idx=2&sn=56166a2d432ca79db14de0127f3023eb&chksm=bea6e7e789d16ef164ef118fbf610807bc3d064660c0148ade777b6ef48734ae118d9e69538c&mpshare=1&scene=1&srcid=1030aHFfirjAEe70YkgMMSQ7#rd
