An Introduction to tfds.load() and tf.data.Dataset
tfds.load() takes the following parameters:
```python
tfds.load(name, split=None, data_dir=None, batch_size=None, shuffle_files=False,
          download=True, as_supervised=False, decoders=None, read_config=None,
          with_info=False, builder_kwargs=None, download_and_prepare_kwargs=None,
          as_dataset_kwargs=None, try_gcs=False)
```
The most important parameters are:
- name: the name of the dataset
- split: which split(s) of the dataset to load
- data_dir: the location of the data, or where it will be downloaded to
- batch_size: the batch size
- shuffle_files: whether to shuffle the input files
- as_supervised: return (input, label) tuples (by default, elements are returned as dictionaries)
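For example, a minimal call might look like the following sketch (it assumes the 'mnist' dataset is available for download; as_supervised=True makes each element an (image, label) tuple instead of a dictionary):

```python
import tensorflow_datasets as tfds

# Minimal sketch: load the MNIST training split as (image, label) tuples.
train_ds = tfds.load('mnist', split='train', shuffle_files=True, as_supervised=True)
```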
1. Splitting the dataset
```python
# Take just the training split (datasets are split into 'train' and 'test' by default)
train_ds = tfds.load('mnist', split='train')

# Take both splits
train_ds, test_ds = tfds.load('mnist', split=['train', 'test'])

# Take both splits, merged into a single dataset
train_test_ds = tfds.load('mnist', split='train+test')

# From record 10 (included) to record 20 (excluded) of the train split
train_10_20_ds = tfds.load('mnist', split='train[10:20]')

# The first 10% of the train split
train_10pct_ds = tfds.load('mnist', split='train[:10%]')

# The first 10% plus the last 80% of the train split
train_10_80pct_ds = tfds.load('mnist', split='train[:10%]+train[-80%:]')

# ---------------------------------------------------
# 10-fold cross-validation:
# Each validation set is 10% of the train split:
# [0%:10%], [10%:20%], ..., [90%:100%].
vals_ds = tfds.load('mnist', split=[
    f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
])
# Each training set is the complementary 90%:
# [10%:100%] (validation set: [0%:10%]),
# [0%:10%] + [20%:100%] (validation set: [10%:20%]), ...,
# [0%:90%] (validation set: [90%:100%]).
trains_ds = tfds.load('mnist', split=[
    f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)
])
```
Splits can also be expressed with the ReadInstruction API, which has the same effect as the above:
```python
# The full `train` split.
train_ds = tfds.load('mnist', split=tfds.core.ReadInstruction('train'))

# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist', split=[
    tfds.core.ReadInstruction('train'),
    tfds.core.ReadInstruction('test'),
])

# The full `train` and `test` splits, interleaved together.
ri = tfds.core.ReadInstruction('train') + tfds.core.ReadInstruction('test')
train_test_ds = tfds.load('mnist', split=ri)

# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
    'train', from_=10, to=20, unit='abs'))

# The first 10% of train split.
train_10pct_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
    'train', to=10, unit='%'))

# The first 10% of train + the last 80% of train.
ri = (tfds.core.ReadInstruction('train', to=10, unit='%') +
      tfds.core.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = tfds.load('mnist', split=ri)

# 10-fold cross-validation:
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist', split=[
    tfds.core.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist', split=[
    (tfds.core.ReadInstruction('train', to=k, unit='%') +
     tfds.core.ReadInstruction('train', from_=k+10, unit='%'))
    for k in range(0, 100, 10)])
```
2. The returned object
The returned object is a tf.data.Dataset, or a (tf.data.Dataset, tfds.core.DatasetInfo) pair if the metadata was requested with with_info=True.
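For example, a small sketch of loading with metadata:

```python
import tensorflow_datasets as tfds

# with_info=True makes tfds.load also return the dataset's metadata.
train_ds, info = tfds.load('mnist', split='train', with_info=True)
print(info.features)                      # feature types, e.g. image and label
print(info.splits['train'].num_examples)  # number of training examples
```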
3. Specifying a directory
Specifying the directory is straightforward (by default the data is placed under the user's home directory):
```python
train_ds = tfds.load('mnist', split='train', data_dir='~/user')
```
4. Getting the images and labels
Because the returned object is a tf.data.Dataset, we can transform the dataset before iterating over it, in order to get data in exactly the form we need.
tf.data.Dataset has the following important methods:
4.1 shuffle
Shuffles the data.
```python
shuffle(buffer_size, seed=None, reshuffle_each_iteration=None)
```
Randomly reshuffles the elements of this dataset. The dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new ones. For a perfect shuffle, the buffer size must be greater than or equal to the full size of the dataset. For example, if your dataset contains 10,000 elements but buffer_size is set to 1,000, shuffle will initially pick a random element from only the first 1,000 elements in the buffer. Once an element is selected, its slot in the buffer is filled by the next (i.e. the 1,001st) element, keeping the buffer at 1,000 elements. reshuffle_each_iteration controls whether the shuffle order should differ for each epoch.
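A small usage sketch (the order shown in the comment is illustrative only; shuffling is random):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)
# buffer_size equals the dataset size here, so the shuffle is uniform over
# all elements; reshuffle_each_iteration (default True) gives a new order
# each epoch.
dataset = dataset.shuffle(buffer_size=5, seed=42)
print(list(dataset.as_numpy_iterator()))  # e.g. [1, 4, 0, 3, 2]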
4.2 batch

Sets the batch size (how many elements per batch); iteration then yields one batch of that many elements at a time.
```python
batch(batch_size, drop_remainder=False)

dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
list(dataset.as_numpy_iterator())
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]

dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
list(dataset.as_numpy_iterator())
# [array([0, 1, 2]), array([3, 4, 5])]
```
Returns a Dataset.
4.3 map
Works much like the ordinary map function: it applies a transformation to every element of the dataset.
```python
map(map_func, num_parallel_calls=None, deterministic=None)

dataset = tf.data.Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]
dataset = dataset.map(lambda x: x + 1)
list(dataset.as_numpy_iterator())      # ==> [ 2, 3, 4, 5, 6 ]
```
Returns a Dataset.
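The signature also accepts num_parallel_calls; here is a hedged sketch (tf.data.AUTOTUNE is available in recent TF 2.x releases; older versions use tf.data.experimental.AUTOTUNE):

```python
import tensorflow as tf

# Let tf.data choose how many elements to process in parallel.
dataset = tf.data.Dataset.range(1, 6)
dataset = dataset.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
print(list(dataset.as_numpy_iterator()))  # [2, 4, 6, 8, 10]
```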
4.4 as_numpy_iterator
Returns an iterator that converts all elements of the dataset to numpy.
Use as_numpy_iterator to inspect the contents of your dataset. To see the shapes and dtypes of the elements, print the dataset elements directly instead of using as_numpy_iterator.
```python
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
    print(element)
# tf.Tensor(1, shape=(), dtype=int32)
# tf.Tensor(2, shape=(), dtype=int32)
# tf.Tensor(3, shape=(), dtype=int32)

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
    print(element)
# 1
# 2
# 3
```
4.5 An example of transforming a dataset
With the pattern below we can get the data into the format we need:
```python
# First use map() to resize and normalize the images, then shuffle, then set
# the batch_size that each iteration step will yield
dataset_train = dataset_train.map(
    lambda img, label: (tf.image.resize(img, (224, 224)) / 255.0, label)
).shuffle(1024).batch(batch_size)

# This is the test set, so we don't shuffle; we only resize the images
dataset_test = dataset_test.map(
    lambda img, label: (tf.image.resize(img, (224, 224)) / 255.0, label)
).batch(batch_size)
```
Iterating over the data:
```python
for images, labels in dataset_train:
    labels_pred = model(images, training=True)
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        y_true=labels, y_pred=labels_pred)
    loss = tf.reduce_mean(loss)
    ...
```
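The elided part of the loop would normally compute gradients and update the model. A hedged sketch of one common completion (model is assumed to be a tf.keras model defined elsewhere; the Adam optimizer is an assumption, not necessarily the original author's choice):

```python
optimizer = tf.keras.optimizers.Adam()

for images, labels in dataset_train:
    with tf.GradientTape() as tape:
        labels_pred = model(images, training=True)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            y_true=labels, y_pred=labels_pred)
        loss = tf.reduce_mean(loss)
    # Backpropagate and apply the weight update.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```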