Shuffling data in numpy and pandas

Published: 2024/1/23

import numpy as np
import pandas as pd
import sklearn
import urllib
import os
import tarfile

Shuffling data

This article covers shuffling for numpy.ndarray and pandas.DataFrame in turn.

numpy.ndarray

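Before a dataset is split, it is usually put into random order first.

numpy.random provides two functions that can shuffle data, shuffle() and permutation(). The main difference between them:

shuffle() reorders the original data in place and returns None.
permutation() copies the original data, shuffles the copy, and returns the shuffled array. The original data is left unchanged.

For a multi-dimensional array, both functions shuffle along the first axis only, so whole rows are reordered and the contents of each row stay intact.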

Generate the data:

data = np.arange(100).reshape(10, -1)
print(data)

[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]

Shuffle with permutation(). The first array printed is the shuffled copy x, the second is data, which is unchanged:

x = np.random.permutation(data)
print(x)
print(data)

[[50 51 52 53 54 55 56 57 58 59]
 [70 71 72 73 74 75 76 77 78 79]
 [40 41 42 43 44 45 46 47 48 49]
 [90 91 92 93 94 95 96 97 98 99]
 [20 21 22 23 24 25 26 27 28 29]
 [10 11 12 13 14 15 16 17 18 19]
 [30 31 32 33 34 35 36 37 38 39]
 [80 81 82 83 84 85 86 87 88 89]
 [60 61 62 63 64 65 66 67 68 69]
 [ 0  1  2  3  4  5  6  7  8  9]]
[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]

Shuffle with shuffle(). This time data itself is reordered:

np.random.shuffle(data)
print(data)

[[40 41 42 43 44 45 46 47 48 49]
 [20 21 22 23 24 25 26 27 28 29]
 [50 51 52 53 54 55 56 57 58 59]
 [80 81 82 83 84 85 86 87 88 89]
 [30 31 32 33 34 35 36 37 38 39]
 [ 0  1  2  3  4  5  6  7  8  9]
 [60 61 62 63 64 65 66 67 68 69]
 [10 11 12 13 14 15 16 17 18 19]
 [90 91 92 93 94 95 96 97 98 99]
 [70 71 72 73 74 75 76 77 78 79]]
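
The difference between the two functions can also be checked in code instead of by eye. The following is a minimal sketch, not the article's own code. It assumes NumPy 1.17 or later for np.random.default_rng, and the seed 42 is an arbitrary value chosen only so the run is repeatable.

```python
import numpy as np

# Seeded generator so the run is repeatable (the seed 42 is an arbitrary assumption).
rng = np.random.default_rng(42)

data = np.arange(100).reshape(10, -1)
original = data.copy()

# permutation() returns a shuffled copy; `data` itself is untouched.
shuffled_copy = rng.permutation(data)
copy_left_data_alone = np.array_equal(data, original)

# shuffle() reorders `data` in place and returns None.
return_value = rng.shuffle(data)

# Only the first axis is shuffled, so the set of rows is the same before and after.
rows_before = sorted(map(tuple, original))
rows_after = sorted(map(tuple, data))

print(copy_left_data_alone)       # True
print(return_value)               # None
print(rows_before == rows_after)  # True
```

Using a seeded generator is one possible way to make a shuffle reproducible; the legacy np.random.shuffle() and np.random.permutation() calls shown above behave the same way with respect to in-place versus copy.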

pandas.DataFrame

To shuffle a DataFrame, sample() is all that is needed.

We build a DataFrame from the iris dataset:

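from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame()
# In the iris data, column 0 is sepal length and column 1 is sepal width;
# 'heigh' and 'length' are simply the column names used in this article.
df['heigh'] = iris['data'][:, 0]
df['length'] = iris['data'][:, 1]
df['label'] = iris['target']
print(df.head())

   heigh  length  label
0    5.1     3.5      0
1    4.9     3.0      0
2    4.7     3.2      0
3    4.6     3.1      0
4    5.0     3.6      0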

Using sample()

We shuffle df with sample(). As the output shows, df itself is unchanged:

df_shuffle = df.sample(frac=1)
print(df_shuffle.head())
print(df.head())

     heigh  length  label
40     5.0     3.5      0
80     5.5     2.4      1
55     5.7     2.8      1
96     5.7     2.9      1
108    6.7     2.5      2
   heigh  length  label
0    5.1     3.5      0
1    4.9     3.0      0
2    4.7     3.2      0
3    4.6     3.1      0
4    5.0     3.6      0

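The frac parameter is the fraction of rows to return, so frac=1 returns every row in random order. If the shuffled DataFrame should have its index renumbered in the usual 0, 1, 2, ... order, add reset_index(drop=True):

df_shuffle2 = df.sample(frac=1).reset_index(drop=True)
print(df_shuffle2.head())

   heigh  length  label
0    7.7     2.6      2
1    6.5     3.2      2
2    5.6     2.8      2
3    4.6     3.6      0
4    7.4     2.8      2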
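
As a supplement to the sample() calls above, the sketch below shows how frac and reset_index(drop=True) behave, plus the random_state parameter of sample(), which the article does not use. The small DataFrame is a hypothetical stand-in for the iris data, and random_state=0 is an arbitrary value chosen for repeatability.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical data, not the iris DataFrame from the text).
df = pd.DataFrame({"value": np.arange(10), "label": np.arange(10) % 2})

# random_state fixes the shuffle; the value 0 is an arbitrary assumption.
first = df.sample(frac=1, random_state=0)
second = df.sample(frac=1, random_state=0)
same_order = first.index.equals(second.index)

# A frac below 1 returns that fraction of the rows, drawn without replacement.
half = df.sample(frac=0.5, random_state=0)

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old index.
renumbered = first.reset_index(drop=True)

print(same_order)                 # True
print(len(half))                  # 5
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Without drop=True, reset_index() would keep the old index as an extra column, which is usually not wanted after a shuffle.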

Using sklearn

sklearn.utils.shuffle() can also shuffle a DataFrame:

df_shuffle3 = sklearn.utils.shuffle(df)
print(df_shuffle3.head())

     heigh  length  label
88     5.6     3.0      1
17     5.1     3.5      0
7      5.0     3.4      0
132    6.4     2.8      2
67     5.8     2.7      1
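
sklearn.utils.shuffle() also accepts several arrays at once and applies the same permutation to each, which is convenient when features and labels are stored separately. The sketch below illustrates this under stated assumptions: X and y are hypothetical arrays, not data from the article, and random_state=0 is an arbitrary value for repeatability.

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical features and labels; row i of X belongs to y[i].
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Arrays passed together are shuffled with the same permutation.
# random_state=0 is an arbitrary assumption for repeatability.
X_shuffled, y_shuffled = shuffle(X, y, random_state=0)

# Row i of X was [2*i, 2*i + 1], so the pairing can be checked directly.
still_aligned = bool(np.all(X_shuffled[:, 0] == 2 * y_shuffled))

# shuffle() returns shuffled copies; the inputs are not modified.
inputs_unchanged = np.array_equal(y, np.arange(10))

print(still_aligned)     # True
print(inputs_unchanged)  # True
```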

Using numpy

This approach is not recommended.

df_shuffle4 = df.iloc[np.random.permutation(len(df))]
print(df_shuffle4.head())

     heigh  length  label
130    7.4     2.8      2
144    6.7     3.3      2
110    6.5     3.2      2
99     5.7     2.8      1
143    6.8     3.2      2
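
To make the mechanics of the iloc approach explicit, here is a short sketch. The DataFrame is a hypothetical stand-in for df, and the seeded generator (seed 7, an arbitrary value) is an assumption added for repeatability; the article itself calls np.random.permutation() without a seed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's df.
df = pd.DataFrame({"value": np.arange(10) * 1.5})

# Seeded generator; the seed 7 is an arbitrary assumption.
rng = np.random.default_rng(7)
order = rng.permutation(len(df))   # the positions 0..9 in random order

# iloc selects rows by position, so the original index labels travel with the rows.
shuffled = df.iloc[order]
labels_follow_rows = shuffled.index.tolist() == order.tolist()

# Sorting by index label puts the rows back in their original order.
restored = shuffled.sort_index()
order_recovered = restored["value"].tolist() == df["value"].tolist()

print(labels_follow_rows)  # True
print(order_recovered)     # True
```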
