TensorFlow (4): Loading CSV Files, pandas DataFrames, Images, and Text Files with TensorFlow
1. Loading pandas DataFrames with tf.data
from __future__ import absolute_import, division, print_function, unicode_literals
import pandas as pd
import tensorflow as tf

Read the CSV file with pandas:
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
df = pd.read_csv(csv_file)
df.head()
df.dtypes

Convert the thal column (an object dtype in the dataframe) to a discrete numeric value:
df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))

for feat, targ in dataset.take(5):
    print('Features: {}, Target: {}'.format(feat, targ))

train_dataset = dataset.shuffle(len(df)).batch(1)

Create and train a model
def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

model = get_compiled_model()
model.fit(train_dataset, epochs=15)

2. Loading Images with tf.data
Setup
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

Download the dataset
import pathlib

data_root_orig = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    fname='flower_photos', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)

for item in data_root.iterdir():
    print(item)

import random
all_image_paths = list(data_root.glob('*/*'))
all_image_paths = [str(path) for path in all_image_paths]
random.shuffle(all_image_paths)

image_count = len(all_image_paths)
image_count
all_image_paths[:10]

List the available labels:
label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
label_names
Assign an index to each label:
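The assignment itself is not reproduced here; a minimal sketch is a dict built from the label_names list above (the name label_to_index is an illustrative choice that the following step assumes):

label_to_index = dict((name, index) for index, name in enumerate(label_names))
label_to_index  # maps each flower class name to an integer index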
Create a list containing the label index of every file:
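A sketch of that list, assuming the label_to_index mapping above: each image path's parent directory name is its label, so look up the directory name's index.

all_image_labels = [label_to_index[pathlib.Path(path).parent.name]
                    for path in all_image_paths]
print('First 10 label indices: ', all_image_labels[:10])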
Load and format the images
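The load_and_preprocess_image helper mapped over the paths below is not shown in this post; a minimal sketch that reads the file, decodes the JPEG, resizes it to 192x192, and scales pixel values into [0, 1] could look like this:

def preprocess_image(image):
    image = tf.image.decode_jpeg(image, channels=3)  # raw bytes -> uint8 image tensor
    image = tf.image.resize(image, [192, 192])       # the input size used throughout this example
    image /= 255.0                                   # normalize to the [0, 1] range
    return image

def load_and_preprocess_image(path):
    image = tf.io.read_file(path)                    # read the raw file contents
    return preprocess_image(image)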
Slice the array of path strings to get a dataset of strings:
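This amounts to a single from_tensor_slices call; path_ds is the name the mapping code below expects:

path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)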
Now create a new dataset that loads and formats the images on the fly by mapping load_and_preprocess_image over the dataset of paths.
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
image_ds
<ParallelMapDataset shapes: (192, 192, 3), types: tf.float32>

Using the same from_tensor_slices method, you can create a dataset of labels:
label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))

for label in label_ds.take(10):
    print(label_names[label.numpy()])

sunflowers
sunflowers
roses
tulips
sunflowers
sunflowers
dandelion
sunflowers
dandelion
roses

Since these datasets are in the same order, you can zip them together to get a dataset of (image, label) pairs:
image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))
image_label_ds
<ZipDataset shapes: ((192, 192, 3), ()), types: (tf.float32, tf.int64)>

ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))

# The tuples are unpacked into the positional arguments of the mapped function
def load_and_preprocess_from_path_label(path, label):
    return load_and_preprocess_image(path), label

image_label_ds = ds.map(load_and_preprocess_from_path_label)
image_label_ds
<MapDataset shapes: ((192, 192, 3), ()), types: (tf.float32, tf.int32)>

Basic training setup: use the tf.data.Dataset.apply method with the fused tf.data.experimental.shuffle_and_repeat function to shuffle and repeat the data:
# Basic training setup
BATCH_SIZE = 32

ds = image_label_ds.apply(
    tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE)
ds = ds.prefetch(buffer_size=AUTOTUNE)
ds
<PrefetchDataset shapes: ((None, 192, 192, 3), (None,)), types: (tf.float32, tf.int32)>

Fetch a copy of MobileNet v2 from tf.keras.applications. It will be used for a simple transfer learning example. Set the MobileNet weights to non-trainable:
mobile_net = tf.keras.applications.MobileNetV2(input_shape=(192, 192, 3), include_top=False)
mobile_net.trainable = False

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_192_no_top.h5
9412608/9406464 [==============================] - 0s 0us/step

Before passing the input to the MobileNet model, you need to convert its range from [0, 1] to [-1, 1]:
def change_range(image, label):
    return 2*image - 1, label

keras_ds = ds.map(change_range)

# The dataset may take a few seconds to start while it fills its shuffle buffer.
image_batch, label_batch = next(iter(keras_ds))

feature_map_batch = mobile_net(image_batch)
print(feature_map_batch.shape)
(32, 6, 6, 1280)

Build the model
model = tf.keras.Sequential([
    mobile_net,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(len(label_names), activation='softmax')])

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=["accuracy"])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
mobilenetv2_1.00_192 (Model) (None, 6, 6, 1280)        2257984
_________________________________________________________________
global_average_pooling2d (Gl (None, 1280)              0
_________________________________________________________________
dense_6 (Dense)              (None, 5)                 6405
=================================================================
Total params: 2,264,389
Trainable params: 6,405
Non-trainable params: 2,257,984
_________________________________________________________________

Before passing the dataset to model.fit(), specify the actual number of steps per epoch:
steps_per_epoch = tf.math.ceil(len(all_image_paths)/BATCH_SIZE).numpy()
steps_per_epoch
115.0

model.fit(ds, epochs=1, steps_per_epoch=115)
115/115 [==============================] - 10s 86ms/step - loss: 0.6913 - accuracy: 0.7405
<tensorflow.python.keras.callbacks.History at 0x7f9ee776d0f0>

3. Loading Text Data with tf.data
We will use three different English translations of the same work (Homer's Iliad) and train a model to identify the translator from a single line of text.
Setup
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import tensorflow_datasets as tfds
import os

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)

parent_dir = os.path.dirname(text_dir)
parent_dir

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step
'/root/.keras/datasets'

Load the text into datasets. Iterating over each dataset's examples returns (example, label) pairs.
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

Combine these labeled datasets into a single dataset and shuffle it.
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

You can use tf.data.Dataset.take and print to see what the (example, label) pairs look like. The numpy property shows each Tensor's value.
for ex in all_labeled_data.take(5):
    print(ex)

(<tf.Tensor: shape=(), dtype=string, numpy=b"In boxing, Clytomedes, OEnops' son,">, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'in your heart, and this, all about one single girl, whereas we now'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"With angry taunts he drove the gather'd crowds.">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'bravest of the Achaeans."'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"Olympian over-arch'd with clouds of gold">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

Build the vocabulary
First, build a vocabulary by tokenizing the text into a collection of individual words, as sketched after the list below:

- Iterate over each example's numpy value.
- Use tfds.features.text.Tokenizer to split it into tokens.
- Collect these tokens into a Python set to remove duplicates.
- Get the size of the vocabulary for later use.
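One way to implement these steps, assuming the tfds tokenizer named above (vocabulary_set and vocab_size are the names the encoding and model code below rely on):

tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    # tokenize the raw line and add its words to the set
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size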
Encode the examples
Build an encoder by passing vocabulary_set to tfds.features.text.TokenTextEncoder. The encoder's encode method takes a line of text and returns a list of integers.
Try running it on a single line of text and see what the output looks like:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)
b"In boxing, Clytomedes, OEnops' son,"

encoded_example = encoder.encode(example_text)
print(encoded_example)
[6870, 1006, 14062, 7080, 16501]

Run the encoder over the whole dataset by wrapping it in tf.py_function and passing that to the dataset's map method.
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text, label):
    # py_func doesn't set the shape of the returned tensors.
    encoded_text, label = tf.py_function(encode,
                                         inp=[text, label],
                                         Tout=(tf.int64, tf.int64))

    # `tf.data.Datasets` work best if all components have a shape set,
    # so set the shapes manually:
    encoded_text.set_shape([None])
    label.set_shape([])

    return encoded_text, label

all_encoded_data = all_labeled_data.map(encode_map_fn)
all_encoded_data
<MapDataset shapes: ((None,), ()), types: (tf.int64, tf.int64)>

Split the dataset into test and training sets, and batch them
Use tf.data.Dataset.take and tf.data.Dataset.skip to create a smaller test dataset and a larger training set.
Before being passed into the model, the datasets need to be batched. Typically, the examples inside a batch need to be the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words. So use tf.data.Dataset.padded_batch (instead of batch) to pad the examples to the same size.
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data
<ShuffleDataset shapes: ((None,), ()), types: (tf.int64, tf.int64)>

train_data = train_data.padded_batch(BATCH_SIZE,
                                     padded_shapes=(tf.TensorShape([None]), tf.TensorShape([])))

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE,
                                   padded_shapes=(tf.TensorShape([None]), tf.TensorShape([])))

Build the model
Since we introduced a new token (the zero used for padding), the vocabulary size (vocab_size) increases by one.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size + 1, 64))

# LSTM layer, which lets the model understand words in their context.
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

# One or more densely connected layers.
# Edit the list in the `for` line to experiment with layer sizes.
for units in [64, 64]:
    model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_data, epochs=3, validation_data=test_data)

Epoch 1/3
697/697 [==============================] - 18s 25ms/step - loss: 0.5225 - accuracy: 0.7457 - val_loss: 0.3818 - val_accuracy: 0.8258
Epoch 2/3
697/697 [==============================] - 17s 24ms/step - loss: 0.3010 - accuracy: 0.8678 - val_loss: 0.3646 - val_accuracy: 0.8340
Epoch 3/3
697/697 [==============================] - 17s 24ms/step - loss: 0.2316 - accuracy: 0.8990 - val_loss: 0.3820 - val_accuracy: 0.8384
<tensorflow.python.keras.callbacks.History at 0x7f9f45767b00>

eval_loss, eval_acc = model.evaluate(test_data)
print('\nEval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))

79/79 [==============================] - 2s 27ms/step - loss: 0.3820 - accuracy: 0.8384
Eval loss: 0.3820337951183319, Eval accuracy: 0.8384000062942505