
11 Python Pandas tricks that make your work more efficient

Published: 2025/3/19 · python · 豆豆

Pandas is a widely used Python package for structured data. There are many nice tutorials on it, but here I'd still like to introduce a few cool tricks the readers may not have seen before, and I believe they're useful.

read_csv

Everyone knows this command. But if the data you're trying to read is large, try adding the argument nrows=5 to read in only a tiny portion of the table before actually loading the whole thing. Then you can avoid the mistake of choosing the wrong delimiter (it may not always be comma-separated).

(Or, you can use the 'head' command in Linux to check out the first 5 rows (say) of any text file: head -n 5 data.txt)

Then, you can extract the column list with df.columns.tolist(), and add the argument usecols=['c1', 'c2', …] to load only the columns you need. Also, if you know the data types of a few specific columns, adding the argument dtype={'c1': str, 'c2': int, …} makes loading faster. Another advantage of this argument is that if you have a column containing both strings and numbers, it's good practice to declare its type as string, so you won't get errors when trying to merge tables using this column as a key.
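As a quick sketch of the workflow above (the file name data.csv and its columns are made up for illustration):

```python
import pandas as pd

# Write a tiny stand-in file; any large CSV works the same way.
pd.DataFrame({"c1": ["a", "b", "c"], "c2": [1, 2, 3], "c3": [0.1, 0.2, 0.3]}).to_csv(
    "data.csv", index=False
)

# Peek at the first rows to confirm the delimiter and columns look right.
peek = pd.read_csv("data.csv", nrows=2)
print(peek.columns.tolist())  # ['c1', 'c2', 'c3']

# Then load only the columns you need, with explicit dtypes.
df = pd.read_csv("data.csv", usecols=["c1", "c2"], dtype={"c1": str, "c2": int})
print(df.dtypes.tolist())
```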

select_dtypes

If data preprocessing has to be done in Python, then this command would save you some time. After reading in a table, the default data types for each column could be bool, int64, float64, object, category, timedelta64, or datetime64. You can first check the distribution by

df.dtypes.value_counts()

to know all possible data types of your dataframe, then do

df.select_dtypes(include=['float64', 'int64'])

to select a sub-dataframe with only numerical features.

copy

This is an important command if you haven’t heard of it already. If you do the following commands:

import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 1, 1]})
df2 = df1
df2['a'] = df2['a'] + 1
df1.head()

You'll find that df1 is changed. This is because df2 = df1 does not make a copy of df1 and assign it to df2; it sets up a reference pointing to df1. So any change to df2 results in a change to df1. To fix this, you can do either

df2 = df1.copy()

or

from copy import deepcopy

df2 = deepcopy(df1)

map

This is a cool command to do easy data transformations. You first define a dictionary with ‘keys’ being the old values and ‘values’ being the new values.

level_map = {1: 'high', 2: 'medium', 3: 'low'}
df['c_level'] = df['c'].map(level_map)

Some examples: converting True/False to 1/0 (for modeling); defining levels; user-defined lexical encodings.
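The True/False case can be sketched like this (the column name flag is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"flag": [True, False, True]})

# Map booleans to integers with a plain dict, old value -> new value.
df["flag_int"] = df["flag"].map({True: 1, False: 0})
print(df["flag_int"].tolist())  # [1, 0, 1]
```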

apply or not apply?

If we'd like to create a new column using a few other columns as inputs, the apply function can be quite useful.

def rule(x, y):
    if x == 'high' and y > 10:
        return 1
    else:
        return 0

df = pd.DataFrame({'c1': ['high', 'high', 'low', 'low'], 'c2': [0, 23, 17, 4]})
df['new'] = df.apply(lambda x: rule(x['c1'], x['c2']), axis=1)
df.head()

In the code above, we define a function with two input variables and use the apply function to apply it to columns 'c1' and 'c2'.

But the problem with apply is that it's sometimes too slow. Say you'd like to calculate the maximum of two columns 'c1' and 'c2'; of course you can do

df['maximum'] = df.apply(lambda x: max(x['c1'], x['c2']), axis=1)

but you’ll find it much slower than this command:

df['maximum'] = df[['c1', 'c2']].max(axis=1)

Takeaway: Don't use apply if you can get the same work done with other built-in functions (they're often faster). For example, if you want to round column 'c' to integers, do round(df['c'], 0) or df['c'].round(0) instead of using the apply function.
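A minimal sketch checking that the two versions agree (toy data, made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 5, 3], "c2": [4, 2, 9]})

# Row-wise apply: flexible but slow on large frames.
via_apply = df.apply(lambda x: max(x["c1"], x["c2"]), axis=1)

# Vectorized built-in: same result, much faster at scale.
via_builtin = df[["c1", "c2"]].max(axis=1)

print(via_builtin.tolist())  # [4, 5, 9]
```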

value counts

This is a command to check value distributions. For example, if you’d like to check what are the possible values and the frequency for each individual value in column ‘c’ you can do

df['c'].value_counts()

There are some useful tricks / arguments of it:
A. normalize=True: if you want to check the frequency instead of counts.
B. dropna=False: if you also want to include missing values in the stats.
C. sort=False: show the stats sorted by values instead of their counts.
D. df['c'].value_counts().reset_index(): if you want to convert the stats table into a pandas dataframe and manipulate it.
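The variants above can be sketched on a toy column (values made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "a", "b", np.nan])

counts = s.value_counts()               # counts only: a -> 2, b -> 1, NaN dropped
freqs = s.value_counts(normalize=True)  # frequencies: a -> 2/3, b -> 1/3
with_na = s.value_counts(dropna=False)  # keeps the NaN row in the stats
as_df = s.value_counts().reset_index()  # a plain dataframe you can manipulate

print(with_na.sum())  # 4
```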

number of missing values

When building models, you might want to exclude rows with too many missing values, or rows where all values are missing. You can use .isnull() and .sum() to count the number of missing values within the specified columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 3], 'c1': [0, 0, np.nan], 'c2': [np.nan, 1, 1]})
df = df[['id', 'c1', 'c2']]
df['num_nulls'] = df[['c1', 'c2']].isnull().sum(axis=1)
df.head()

select rows with specific IDs

In SQL we can do this using SELECT * FROM … WHERE ID IN ('A001', 'C022', …) to get records with specific IDs. If you want to do the same thing with pandas, you can do

df_filter = df['ID'].isin(['A001', 'C022', ...])
df[df_filter]

Percentile groups

You have a numerical column, and would like to classify the values in that column into groups, say top 5% into group 1, 5–20% into group 2, 20%-50% into group 3, bottom 50% into group 4. Of course, you can do it with pandas.cut, but I’d like to provide another option here:

import numpy as np

cut_points = [np.percentile(df['c'], i) for i in [50, 80, 95]]
df['group'] = 1
for i in range(3):
    df['group'] = df['group'] + (df['c'] < cut_points[i])  # or <= cut_points[i]

which is fast to run (no apply function used).
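A self-contained sketch of the trick on random data (the column name c and the sample size are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"c": rng.normal(size=1000)})

# Cut points at the 50th, 80th and 95th percentiles.
cut_points = [np.percentile(df["c"], i) for i in [50, 80, 95]]

# Each boolean comparison adds 1 when the value falls below a cut point,
# so: group 1 = top 5%, 2 = 5-20%, 3 = 20-50%, 4 = bottom 50%.
df["group"] = 1
for i in range(3):
    df["group"] = df["group"] + (df["c"] < cut_points[i])

print(df["group"].value_counts().sort_index())
```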

to_csv

Again this is a command that everyone would use. I’d like to point out two tricks here. The first one is

print(df[:5].to_csv())

You can use this command to print out exactly the first five rows of what is going to be written to the file.

Another trick is dealing with integers and missing values mixed together. If a column contains both missing values and integers, its data type will be float instead of int. When you export the table, you can add float_format='%.0f' to round all the floats to integers. Use this trick only if you want integer outputs for all float columns; it gets rid of all the annoying '.0's.
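A small sketch of both tricks together (column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# The NaN forces the 'score' column to float64 even though the values are ints.
df = pd.DataFrame({"id": [1, 2, 3], "score": [10, np.nan, 30]})

# Preview exactly what would be written, with floats rounded to integers.
out = df.to_csv(index=False, float_format="%.0f")
print(out)
```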

Specify the data type when loading

For example, integers default to int64; specifying int8 as the data type can greatly reduce memory usage (very useful when the dataset is very large).
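A minimal sketch of the memory saving, assuming the values fit in int8's range (-128 to 127):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c": np.arange(100, dtype=np.int64)})

before = df["c"].memory_usage(index=False)  # 8 bytes per value
df["c"] = df["c"].astype("int8")
after = df["c"].memory_usage(index=False)   # 1 byte per value

print(before, after)  # 800 100
```

Note that downcasting silently wraps values outside the target range, so check df['c'].min() and df['c'].max() first.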

