Querying Data in Pandas with SQL: A Highly Efficient Approach
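Today we continue looking at how Pandas and SQL can be used together. SQL statements can be used to filter data in Pandas directly, through the pandasql module. First, install the module:

    pip install pandasql

If you are working in a Jupyter notebook, you can also install it like this:

    !pip install pandasql

Importing the data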
First we import the data:

    import pandas as pd
    from pandasql import sqldf

    df = pd.read_csv("Dummy_Sales_Data_v1.csv", sep=",")
    df.head()
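To make the idea behind sqldf more concrete, here is a minimal sketch of the general mechanism: copy a DataFrame into a temporary SQLite database, run the query there, and read the result back. It is a simplified illustration, not pandasql's actual implementation. The helper name `run_sql` and the sample rows are assumptions made up for this example; the only grounded fact is that pandasql documents its queries as using SQLite syntax.

```python
import sqlite3

import pandas as pd


def run_sql(query, tables):
    """Hypothetical helper: run a SQLite query on {table_name: DataFrame}."""
    conn = sqlite3.connect(":memory:")
    try:
        # Copy every DataFrame into the temporary database as a table.
        for name, frame in tables.items():
            frame.to_sql(name, conn, index=False)
        # Read the query result back as a new DataFrame.
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


# Made-up rows for illustration only; not taken from Dummy_Sales_Data_v1.csv.
sample = pd.DataFrame(
    {
        "OrderID": [1, 2, 3],
        "Quantity": [10, 55, 20],
        "Shipping_Address": ["Kenya", "Germany", "Kenya"],
    }
)

result = run_sql(
    "SELECT OrderID, Quantity FROM sample WHERE Quantity < 40",
    {"sample": sample},
)
print(result)
```

With pandasql itself none of this plumbing is needed: sqldf looks up the DataFrames named in the query for you.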
We start with a preliminary exploration of the imported dataset:

    df.info()

output

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 9999 entries, 0 to 9998
    Data columns (total 12 columns):
     #   Column               Non-Null Count  Dtype
    ---  ------               --------------  -----
     0   OrderID              9999 non-null   int64
     1   Quantity             9999 non-null   int64
     2   UnitPrice(USD)       9999 non-null   int64
     3   Status               9999 non-null   object
     4   OrderDate            9999 non-null   object
     5   Product_Category     9963 non-null   object
     6   Sales_Manager        9999 non-null   object
     7   Shipping_Cost(USD)   9999 non-null   int64
     8   Delivery_Time(Days)  9948 non-null   float64
     9   Shipping_Address     9999 non-null   object
     10  Product_Code         9999 non-null   object
     11  OrderCode            9999 non-null   int64
    dtypes: float64(1), int64(5), object(6)
    memory usage: 937.5+ KB

Before filtering the data any further, we rename three of the columns so that their names no longer contain parentheses, which are awkward to reference in SQL. The code is as follows:
    df.rename(
        columns={
            "Shipping_Cost(USD)": "ShippingCost_USD",
            "UnitPrice(USD)": "UnitPrice_USD",
            "Delivery_Time(Days)": "Delivery_Time_Days",
        },
        inplace=True,
    )
    df.info()

output

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 9999 entries, 0 to 9998
    Data columns (total 12 columns):
     #   Column              Non-Null Count  Dtype
    ---  ------              --------------  -----
     0   OrderID             9999 non-null   int64
     1   Quantity            9999 non-null   int64
     2   UnitPrice_USD       9999 non-null   int64
     3   Status              9999 non-null   object
     4   OrderDate           9999 non-null   object
     5   Product_Category    9963 non-null   object
     6   Sales_Manager       9999 non-null   object
     7   ShippingCost_USD    9999 non-null   int64
     8   Delivery_Time_Days  9948 non-null   float64
     9   Shipping_Address    9999 non-null   object
     10  Product_Code        9999 non-null   object
     11  OrderCode           9999 non-null   int64
    dtypes: float64(1), int64(5), object(6)
    memory usage: 937.5+ KB

Selecting columns with SQL
We first try selecting a few columns, such as OrderID, Quantity, Sales_Manager and Status. In plain SQL the statement looks like this:

    SELECT OrderID, Quantity, Sales_Manager,
           Status, Shipping_Address, ShippingCost_USD
    FROM df

When used together with Pandas, it is written like this:

    query = """
    SELECT OrderID, Quantity, Sales_Manager,
           Status, Shipping_Address, ShippingCost_USD
    FROM df
    """

    df_orders = sqldf(query)
    df_orders.head()
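For comparison, the same column selection can be written in plain pandas by indexing with a list of column names. The sketch below uses a few made-up rows rather than the article's CSV; only the column names follow the article, and the extra Product_Code column is there to show that unselected columns are dropped.

```python
import pandas as pd

# Made-up rows for illustration only; the column names follow the article.
df = pd.DataFrame(
    {
        "OrderID": [1, 2, 3],
        "Quantity": [10, 55, 20],
        "Sales_Manager": ["Ann", "Bob", "Cid"],
        "Status": ["Shipped", "Delivered", "Not Shipped"],
        "Shipping_Address": ["Kenya", "Germany", "Kenya"],
        "ShippingCost_USD": [30, 25, 35],
        "Product_Code": ["A1", "B2", "C3"],
    }
)

# Rough equivalent of: SELECT OrderID, Quantity, ... FROM df
columns = [
    "OrderID",
    "Quantity",
    "Sales_Manager",
    "Status",
    "Shipping_Address",
    "ShippingCost_USD",
]
df_orders = df[columns]
print(df_orders.head())
```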
Filtering with WHERE in SQL

We can add a condition to the SQL statement to filter the data, as follows:

    query = """
    SELECT *
    FROM df_orders
    WHERE Shipping_Address = 'Kenya'
    """

    df_kenya = sqldf(query)
    df_kenya.head()
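As a point of reference, a WHERE clause with a single equality condition corresponds to a boolean mask in plain pandas. This is a small sketch on made-up rows, not on the article's dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
df_orders = pd.DataFrame(
    {
        "OrderID": [1, 2, 3],
        "Quantity": [10, 55, 20],
        "Shipping_Address": ["Kenya", "Germany", "Kenya"],
    }
)

# Rough equivalent of: SELECT * FROM df_orders WHERE Shipping_Address = 'Kenya'
is_kenya = df_orders["Shipping_Address"] == "Kenya"
df_kenya = df_orders[is_kenya]
print(df_kenya)
```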
When there is more than one condition, join them with AND, as follows:

    query = """
    SELECT *
    FROM df_orders
    WHERE Shipping_Address = 'Kenya'
      AND Quantity < 40
      AND Status IN ('Shipped', 'Delivered')
    """

    df_kenya = sqldf(query)
    df_kenya.head()
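The combined conditions can be read as three masks joined together. A hedged pandas sketch of the same logic, again on made-up rows: `&` plays the role of AND, and `isin` plays the role of IN.

```python
import pandas as pd

# Made-up rows for illustration only.
df_orders = pd.DataFrame(
    {
        "OrderID": [1, 2, 3, 4, 5],
        "Quantity": [10, 5, 50, 20, 30],
        "Status": ["Shipped", "Delivered", "Shipped", "Not Shipped", "Delivered"],
        "Shipping_Address": ["Kenya", "Germany", "Kenya", "Kenya", "Kenya"],
    }
)

# Each SQL condition becomes one boolean mask; parentheses are required
# because & binds more tightly than the comparison operators.
mask = (
    (df_orders["Shipping_Address"] == "Kenya")
    & (df_orders["Quantity"] < 40)
    & (df_orders["Status"].isin(["Shipped", "Delivered"]))
)
df_kenya = df_orders[mask]
print(df_kenya)
```

Only the rows that satisfy all three conditions remain; here order 3 fails the quantity test and order 4 fails the status test.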
Grouping

In the same way, we can use GROUP BY in SQL to group the selected data, as follows:

    query = """
    SELECT Shipping_Address,
           COUNT(OrderID) AS Orders
    FROM df_orders
    GROUP BY Shipping_Address
    """

    df_group = sqldf(query)
    df_group.head(10)
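For readers more familiar with pandas, the GROUP BY with COUNT above corresponds roughly to `groupby` followed by `count`, which also counts only non-null values. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
df_orders = pd.DataFrame(
    {
        "OrderID": [1, 2, 3, 4, 5],
        "Shipping_Address": ["Kenya", "Germany", "Kenya", "India", "Kenya"],
    }
)

# Rough equivalent of:
#   SELECT Shipping_Address, COUNT(OrderID) AS Orders
#   FROM df_orders GROUP BY Shipping_Address
df_group = (
    df_orders.groupby("Shipping_Address")["OrderID"]
    .count()
    .reset_index(name="Orders")
)
print(df_group)
```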
Sorting

Sorting in SQL is done with ORDER BY, as follows:

    query = """
    SELECT Shipping_Address,
           COUNT(OrderID) AS Orders
    FROM df_orders
    GROUP BY Shipping_Address
    ORDER BY Orders
    """

    df_group = sqldf(query)
    df_group.head(10)
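ORDER BY sorts in ascending order unless DESC is given. The pandas counterpart is `sort_values`, sketched here on a small made-up table of order counts; the descending variant is included only to show the contrast and is not part of the query above.

```python
import pandas as pd

# Made-up grouped counts for illustration only.
df_group = pd.DataFrame(
    {
        "Shipping_Address": ["Germany", "India", "Kenya"],
        "Orders": [2, 1, 3],
    }
)

# Rough equivalent of: ... ORDER BY Orders (ascending by default)
ascending = df_group.sort_values("Orders")

# Rough equivalent of: ... ORDER BY Orders DESC
descending = df_group.sort_values("Orders", ascending=False)

print(ascending)
print(descending)
```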
Merging data

We first create a second dataset, which will be used to merge two datasets later, as follows:

    query = """
    SELECT OrderID,
           Quantity,
           Product_Code,
           Product_Category,
           UnitPrice_USD
    FROM df
    """

    df_products = sqldf(query)
    df_products.head()
Here we want the intersection of the two datasets, so we use an INNER JOIN, as follows:

    query = """
    SELECT T1.OrderID,
           T1.Shipping_Address,
           T2.Product_Category
    FROM df_orders T1
    INNER JOIN df_products T2
        ON T1.OrderID = T2.OrderID
    """

    df_combined = sqldf(query)
    df_combined.head()
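To illustrate what "intersection" means for an INNER JOIN, the sketch below uses two small made-up tables whose OrderID values only partly overlap, and the pandas `merge` method with `how="inner"`. In the article both tables come from the same DataFrame, so every OrderID would match; the partial overlap here is an assumption chosen to make the effect visible.

```python
import pandas as pd

# Made-up rows for illustration only; OrderID 1 has no product row
# and OrderID 4 has no order row.
df_orders = pd.DataFrame(
    {
        "OrderID": [1, 2, 3],
        "Shipping_Address": ["Kenya", "Germany", "India"],
    }
)
df_products = pd.DataFrame(
    {
        "OrderID": [2, 3, 4],
        "Product_Category": ["Healthcare", "Office", "Fashion"],
    }
)

# Rough equivalent of the INNER JOIN on OrderID: keep only matching keys.
df_combined = df_orders.merge(df_products, on="OrderID", how="inner")[
    ["OrderID", "Shipping_Address", "Product_Category"]
]
print(df_combined)
```

Only OrderID 2 and 3 appear in the result, because they are the only keys present in both tables.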
Using LIMIT

LIMIT in SQL restricts the number of rows a query returns. To see the first 10 rows of the result, the code is as follows:

    query = """
    SELECT OrderID, Quantity, Sales_Manager,
           Status, Shipping_Address,
           ShippingCost_USD
    FROM df
    LIMIT 10
    """

    df_orders_limit = sqldf(query)
    df_orders_limit
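In plain pandas, the closest counterpart to LIMIT 10 is `head(10)`, which returns the first 10 rows in the DataFrame's current order. A small sketch on generated, made-up rows:

```python
import pandas as pd

# 25 made-up rows for illustration only.
df = pd.DataFrame(
    {
        "OrderID": list(range(1, 26)),
        "Quantity": [n * 2 for n in range(1, 26)],
    }
)

# Rough equivalent of: SELECT OrderID, Quantity FROM df LIMIT 10
df_orders_limit = df[["OrderID", "Quantity"]].head(10)
print(df_orders_limit)
```

Note that SQL does not guarantee which rows LIMIT returns unless the query also has an ORDER BY, whereas `head` always follows the existing row order.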