TensorFlow Series, Part 1: Loading Data
This article describes how to load data from various sources into datasets that TensorFlow can consume, generally meaning a tf.data.Dataset. It covers the following kinds of data sources:
- Predefined public datasets
- Data in memory
- CSV files
- TFRecord
- Data files in arbitrary formats
- Sparse data files
For a more complete guide to loading data, see: https://www.tensorflow.org/tutorials/load_data/images?hl=zh-cn
```python
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import os

print(tf.__version__)
```
```
2.5.0
```
1. Predefined public datasets
For convenience, TensorFlow packages a number of commonly used datasets so that they can be used directly. For the full list, see:
https://www.tensorflow.org/datasets/overview
TensorFlow offers two kinds of datasets:
- Simple datasets, loaded directly with keras.datasets.***.load_data()
- Datasets hosted in tensorflow_datasets.
1.1 Simple datasets
Common examples are mnist and fashion_mnist; load_data() returns the data as numpy.ndarray.
```python
(x_train_all, y_train_all), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
print(type(x_train_all))
x_train_all[5, 1], y_train_all[5]
```
```
<class 'numpy.ndarray'>
(array([  0,   0,   0,   1,   0,   0,  20, 131, 199, 206, 196, 202, 242,
        255, 255, 250, 222, 197, 206, 188, 126,  17,   0,   0,   0,   0,
          0,   0], dtype=uint8),
 2)
```
```python
(x_train_all, y_train_all), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train_all[14, 14], y_train_all[14]
```
```
(array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  29,
        255, 254, 109,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0], dtype=uint8),
 1)
```
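The `_all` suffix suggests the training data is meant to be split into training and validation sets later. A minimal sketch of one common way to do this; the 5000-sample validation size and the [0, 1] scaling are illustrative assumptions, not part of the original:

```python
# Split off the first 5000 samples as a validation set and scale pixels to [0, 1].
# The split size here is an assumption for illustration.
x_valid, x_train = x_train_all[:5000] / 255.0, x_train_all[5000:] / 255.0
y_valid, y_train = y_train_all[:5000], y_train_all[5000:]
print(x_train.shape, x_valid.shape, x_test.shape)
```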
1.2 tensorflow_datasets
This section covers the datasets provided by the tensorflow_datasets package.
As an aside, downloads from tensorflow_datasets are blocked in some networks, so you may need a proxy. If you are on a server that cannot reach the download site, you can first download the data on another machine (it is normally saved under ~/tensorflow_datasets) and then upload that directory to the same path on the server. TensorFlow checks the local directory for existing files before attempting a download.
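If the data has been placed somewhere other than the default location, tfds.load also accepts a data_dir argument. A minimal sketch; the dataset name and directory path are examples, not values from the original:

```python
import tensorflow_datasets as tfds

# Point tfds at a directory that already contains the downloaded dataset.
# download=False makes the call fail fast instead of trying to fetch data.
dataset = tfds.load(
    "mnist",
    split="train",
    data_dir="~/tensorflow_datasets",  # example path; adjust to where the data was uploaded
    download=False,
)
```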
tfds.load() is a convenient way to load a dataset and returns a tf.data.Dataset; if with_info=True, it instead returns a tuple of (Dataset, ds_info).
For the full details, see:
https://www.tensorflow.org/datasets/api_docs/python/tfds/load
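A minimal sketch of the two return forms, using mnist as an example dataset:

```python
# With a split given, the return value is a single tf.data.Dataset.
ds = tfds.load("mnist", split="train")

# With with_info=True the return value is a (dataset, info) tuple;
# info carries metadata such as feature descriptions and split sizes.
ds, info = tfds.load("mnist", split="train", with_info=True)
print(info.features["label"].num_classes)
```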
1.3 The flowers dataset
```python
import tensorflow_datasets as tfds

dataset, info = tfds.load("tf_flowers", as_supervised=True, with_info=True)
class_names = info.features["label"].names
n_classes = info.features["label"].num_classes
dataset_size = info.splits["train"].num_examples

test_set_raw, valid_set_raw, train_set_raw = tfds.load(
    "tf_flowers",
    split=["train[:10%]", "train[10%:25%]", "train[25%:]"],
    as_supervised=True)

# Plot a few flowers to take a look
plt.figure(figsize=(12, 10))
index = 0
for image, label in train_set_raw.take(9):
    index += 1
    plt.subplot(3, 3, index)
    plt.imshow(image)
    plt.title("Class: {}".format(class_names[label]))
    plt.axis("off")

plt.show()
```
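The tf_flowers images come in varying sizes, so before feeding them to a model they typically need to be resized and batched. A minimal sketch; the 224x224 target size, batch size, and the flowers_train/flowers_valid names are illustrative assumptions:

```python
def preprocess(image, label):
    # Resize to a fixed shape and scale pixel values to [0, 1].
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label

flowers_train = train_set_raw.shuffle(1000).map(preprocess).batch(32).prefetch(1)
flowers_valid = valid_set_raw.map(preprocess).batch(32).prefetch(1)
```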
2. Loading data from memory
This section converts in-memory data (numpy arrays) into a Dataset.
from_tensor_slices() turns every element of a numpy array into an element of a TensorFlow Dataset:
```python
dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
print(dataset)
for item in dataset:
    print(item)
```
```
<TensorSliceDataset shapes: (), types: tf.int64>
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
```
We can apply all kinds of transformations to this Dataset, for example:
```python
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)
```
```
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)
```
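Other common transformations can be chained in the same way. A small illustrative sketch; the specific chain of operations is not part of the original example, only standard tf.data methods:

```python
dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
dataset = (dataset
           .shuffle(buffer_size=10)   # randomize element order
           .map(lambda x: x * 2)      # apply a transformation to each element
           .batch(4)                  # group elements into batches
           .prefetch(1))              # overlap preprocessing with consumption
for item in dataset:
    print(item)
```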
We can also combine several arrays into a single Dataset; a common case is pairing features and labels into training samples:
```python
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
dataset3 = tf.data.Dataset.from_tensor_slices((x, y))
print(dataset3)

for item_x, item_y in dataset3:
    print(item_x.numpy(), item_y.numpy())
```
```
<TensorSliceDataset shapes: ((2,), ()), types: (tf.int64, tf.string)>
[1 2] b'cat'
[3 4] b'dog'
[5 6] b'fox'
```
Alternatively:
```python
dataset4 = tf.data.Dataset.from_tensor_slices({"feature": x, "label": y})
for item in dataset4:
    print(item["feature"].numpy(), item["label"].numpy())
```
```
[1 2] b'cat'
[3 4] b'dog'
[5 6] b'fox'
```
3. Loading data from CSV files
This section shows how TensorFlow loads CSV files into a Dataset. Besides the approach described here, if the data fits comfortably in memory you can also load it with pandas.read_csv and then use from_tensor_slices() as shown above.
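A minimal sketch of that pandas-based alternative; the file name and the "label" column are placeholders, not files created in this article:

```python
import pandas as pd
import tensorflow as tf

# Hypothetical CSV with a "label" column; adjust names to your own file.
df = pd.read_csv("my_data.csv")
labels = df.pop("label")
dataset = tf.data.Dataset.from_tensor_slices((df.values, labels.values))
dataset = dataset.shuffle(len(df)).batch(32)
```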
3.1 Generating the CSV files
Since we do not have ready-made CSV files, we generate them from one of the predefined public datasets:
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Fetch the data
housing = fetch_california_housing()
x_train_all, x_test, y_train_all, y_test = train_test_split(
    housing.data, housing.target, random_state=7)
x_train, x_valid, y_train, y_valid = train_test_split(
    x_train_all, y_train_all, random_state=11)
print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)
print(x_test.shape, y_test.shape)

# Standardize
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_valid_scaled = scaler.transform(x_valid)
x_test_scaled = scaler.transform(x_test)

# Write the CSV files
output_dir = "generate_csv"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

def save_to_csv(output_dir, data, name_prefix, header=None, n_parts=10):
    path_format = os.path.join(output_dir, "{}_{:02d}.csv")
    filenames = []
    for file_idx, row_indices in enumerate(
            np.array_split(np.arange(len(data)), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filenames.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header + "\n")
            for row_index in row_indices:
                f.write(",".join([repr(col) for col in data[row_index]]))
                f.write('\n')
    return filenames

train_data = np.c_[x_train_scaled, y_train]
valid_data = np.c_[x_valid_scaled, y_valid]
test_data = np.c_[x_test_scaled, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header_str = ",".join(header_cols)

train_filenames = save_to_csv(output_dir, train_data, "train",
                              header_str, n_parts=20)
valid_filenames = save_to_csv(output_dir, valid_data, "valid",
                              header_str, n_parts=10)
test_filenames = save_to_csv(output_dir, test_data, "test",
                             header_str, n_parts=10)

# Take a look at the generated files:
import pprint
print("train filenames:")
pprint.pprint(train_filenames)
print("valid filenames:")
pprint.pprint(valid_filenames)
print("test filenames:")
pprint.pprint(test_filenames)
```
```
(11610, 8) (11610,)
(3870, 8) (3870,)
(5160, 8) (5160,)
train filenames:
['generate_csv/train_00.csv',
 'generate_csv/train_01.csv',
 'generate_csv/train_02.csv',
 'generate_csv/train_03.csv',
 'generate_csv/train_04.csv',
 'generate_csv/train_05.csv',
 'generate_csv/train_06.csv',
 'generate_csv/train_07.csv',
 'generate_csv/train_08.csv',
 'generate_csv/train_09.csv',
 'generate_csv/train_10.csv',
 'generate_csv/train_11.csv',
 'generate_csv/train_12.csv',
 'generate_csv/train_13.csv',
 'generate_csv/train_14.csv',
 'generate_csv/train_15.csv',
 'generate_csv/train_16.csv',
 'generate_csv/train_17.csv',
 'generate_csv/train_18.csv',
 'generate_csv/train_19.csv']
valid filenames:
['generate_csv/valid_00.csv',
 'generate_csv/valid_01.csv',
 'generate_csv/valid_02.csv',
 'generate_csv/valid_03.csv',
 'generate_csv/valid_04.csv',
 'generate_csv/valid_05.csv',
 'generate_csv/valid_06.csv',
 'generate_csv/valid_07.csv',
 'generate_csv/valid_08.csv',
 'generate_csv/valid_09.csv']
test filenames:
['generate_csv/test_00.csv',
 'generate_csv/test_01.csv',
 'generate_csv/test_02.csv',
 'generate_csv/test_03.csv',
 'generate_csv/test_04.csv',
 'generate_csv/test_05.csv',
 'generate_csv/test_06.csv',
 'generate_csv/test_07.csv',
 'generate_csv/test_08.csv',
 'generate_csv/test_09.csv']
```
3.2 Loading the data from the CSV files
```python
# 1. filenames -> dataset
# 2. read each file -> dataset -> datasets -> merge
# 3. parse csv lines
def csv_reader_dataset(filenames, n_readers=5,
                       batch_size=32, n_parse_threads=5,
                       shuffle_buffer_size=10000):
    dataset = tf.data.Dataset.list_files(filenames)
    dataset = dataset.repeat()
    dataset = dataset.interleave(
        lambda filename: tf.data.TextLineDataset(filename).skip(1),
        cycle_length=n_readers)
    # shuffle returns a new dataset, so the result must be assigned back
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(parse_csv_line,
                          num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset

def parse_csv_line(line, n_fields=9):
    defs = [tf.constant(np.nan)] * n_fields
    parsed_fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(parsed_fields[0:-1])
    y = tf.stack(parsed_fields[-1:])
    return x, y

train_set = csv_reader_dataset(train_filenames, batch_size=3)
for x_batch, y_batch in train_set.take(2):
    print("x:")
    pprint.pprint(x_batch)
    print("y:")
    pprint.pprint(y_batch)
```
```
x:
<tf.Tensor: shape=(3, 8), dtype=float32, numpy=
array([[-0.32652634,  0.4323619 , -0.09345459, -0.08402992,  0.8460036 ,
        -0.02663165, -0.56176794,  0.1422876 ],
       [ 0.48530516, -0.8492419 , -0.06530126, -0.02337966,  1.4974351 ,
        -0.07790658, -0.90236324,  0.78145146],
       [-1.0591781 ,  1.3935647 , -0.02633197, -0.1100676 , -0.6138199 ,
        -0.09695935,  0.3247131 , -0.03747724]], dtype=float32)>
y:
<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[2.431],
       [2.956],
       [0.672]], dtype=float32)>
x:
<tf.Tensor: shape=(3, 8), dtype=float32, numpy=
array([[ 8.0154431e-01,  2.7216142e-01, -1.1624393e-01, -2.0231152e-01,
        -5.4305160e-01, -2.1039616e-02, -5.8976209e-01, -8.2418457e-02],
       [ 4.9710345e-02, -8.4924191e-01, -6.2146995e-02,  1.7878747e-01,
        -8.0253541e-01,  5.0660671e-04,  6.4664572e-01, -1.1060793e+00],
       [ 2.2754266e+00, -1.2497431e+00,  1.0294788e+00, -1.7124432e-01,
        -4.5413753e-01,  1.0527152e-01, -9.0236324e-01,  9.0129471e-01]],
      dtype=float32)>
y:
<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[3.226],
       [2.286],
       [3.798]], dtype=float32)>
```
```python
batch_size = 32
train_set = csv_reader_dataset(train_filenames, batch_size=batch_size)
valid_set = csv_reader_dataset(valid_filenames, batch_size=batch_size)
test_set = csv_reader_dataset(test_filenames, batch_size=batch_size)
```
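As an aside, TensorFlow also provides a higher-level helper, tf.data.experimental.make_csv_dataset, which builds a similar batched, shuffled CSV pipeline but yields (features_dict, label) pairs. A minimal sketch, not used elsewhere in this article; the label_name refers to the header column written by save_to_csv above:

```python
# Builds a dataset of (features_dict, label) pairs directly from CSV files.
raw_train_set = tf.data.experimental.make_csv_dataset(
    train_filenames,
    batch_size=32,
    label_name="MedianHouseValue",  # header column written by save_to_csv above
    num_epochs=1,
    shuffle=True)
```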
3.3 Training the model
```python
model = keras.models.Sequential([
    keras.layers.Dense(30, activation='relu', input_shape=[8]),
    keras.layers.Dense(1),
])
model.compile(loss="mean_squared_error", optimizer="sgd")
callbacks = [keras.callbacks.EarlyStopping(patience=5, min_delta=1e-2)]

history = model.fit(train_set,
                    validation_data=valid_set,
                    steps_per_epoch=11160 // batch_size,
                    validation_steps=3870 // batch_size,
                    epochs=100,
                    callbacks=callbacks)
```
```
Epoch 1/100
348/348 [==============================] - 1s 3ms/step - loss: 1.5927 - val_loss: 2.1706
Epoch 2/100
348/348 [==============================] - 1s 2ms/step - loss: 0.7043 - val_loss: 0.5049
Epoch 3/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4733 - val_loss: 0.4638
Epoch 4/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4384 - val_loss: 0.4345
Epoch 5/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4070 - val_loss: 0.4233
Epoch 6/100
348/348 [==============================] - 1s 4ms/step - loss: 0.4066 - val_loss: 0.4139
Epoch 7/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4051 - val_loss: 0.4155
Epoch 8/100
348/348 [==============================] - 1s 4ms/step - loss: 0.3824 - val_loss: 0.3957
Epoch 9/100
348/348 [==============================] - 1s 3ms/step - loss: 0.3956 - val_loss: 0.3884
Epoch 10/100
348/348 [==============================] - 1s 3ms/step - loss: 0.3814 - val_loss: 0.3856
Epoch 11/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4826 - val_loss: 0.3887
Epoch 12/100
348/348 [==============================] - 1s 3ms/step - loss: 0.3653 - val_loss: 0.3853
Epoch 13/100
348/348 [==============================] - 1s 3ms/step - loss: 0.3765 - val_loss: 0.3810
Epoch 14/100
348/348 [==============================] - 1s 4ms/step - loss: 0.3632 - val_loss: 0.3775
Epoch 15/100
348/348 [==============================] - 1s 4ms/step - loss: 0.3654 - val_loss: 0.3758
```
```python
model.evaluate(test_set, steps=5160 // batch_size)
```
```
161/161 [==============================] - 1s 2ms/step - loss: 0.3811
0.38114801049232483
```
Summary
This article showed how to build tf.data.Dataset pipelines from predefined public datasets (keras.datasets and tensorflow_datasets), from numpy arrays in memory, and from CSV files, and then trained a simple regression model directly on a CSV-backed dataset.