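Hive Window Functions: A Summary
Understanding PRECEDING and FOLLOWING in Hive
Before looking at Hive window functions in general, let's see what PRECEDING and FOLLOWING mean in a Hive window specification.
Hive window functions can aggregate over a specified number of rows before or after the current row. The keywords for this are PRECEDING and FOLLOWING; the examples below show how to use them.
Plain window functions are fairly simple, so this section covers the partitioned case and focuses on how ROWS BETWEEN is used after partitioning and ordering.
The key is to understand what the keywords in ROWS BETWEEN mean: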
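| Keyword | Meaning |
| --- | --- |
| preceding | rows before the current row |
| following | rows after the current row |
| current row | the current row |
| unbounded | no limit (the frame extends to the partition boundary) |
| unbounded preceding | starting from the first row of the partition |
| unbounded following | ending at the last row of the partition |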
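The keyword table above can be read as a small rule for turning a frame specification into a range of row positions. The Python sketch below illustrates that rule for ROWS frames only. The function name `rows_frame` and the tuple encoding of the bounds are assumptions made for this illustration; they are not part of Hive.

```python
def rows_frame(i, n, start, end):
    """Return (lo, hi), the inclusive row positions of a ROWS frame.

    i is the position of the current row and n the number of rows in the
    partition. Bounds use a hypothetical tuple encoding:
    ("unbounded_preceding",), ("preceding", k), ("current_row",),
    ("following", k), ("unbounded_following",).
    """
    def resolve(bound):
        kind = bound[0]
        if kind == "unbounded_preceding":
            return 0
        if kind == "preceding":
            return i - bound[1]
        if kind == "current_row":
            return i
        if kind == "following":
            return i + bound[1]
        if kind == "unbounded_following":
            return n - 1
        raise ValueError(f"unknown bound: {kind}")

    # Clamp to the partition; the frame is empty when lo > hi.
    lo = max(resolve(start), 0)
    hi = min(resolve(end), n - 1)
    return lo, hi


# A partition of 5 rows, current row at position 2 (0-based).
print(rows_frame(2, 5, ("preceding", 2), ("following", 1)))
print(rows_frame(2, 5, ("unbounded_preceding",), ("current_row",)))
print(rows_frame(2, 5, ("current_row",), ("unbounded_following",)))
```

Near the edges of a partition the frame simply shrinks, which is why the first and last rows of a partition aggregate over fewer rows than the rows in the middle.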
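Example

    select country, time, charge,
           max(charge) over (partition by country order by time) as normal,
           max(charge) over (partition by country order by time
                             rows between unbounded preceding and current row) as unb_pre_cur,
           max(charge) over (partition by country order by time
                             rows between 2 preceding and 1 following) as pre2_fol1,
           max(charge) over (partition by country order by time
                             rows between current row and unbounded following) as cur_unb_fol
    from temp

Note: by default, the computation covers the rows of the partition from its first row up to the current row.
rows between unbounded preceding and current row gives the same result as the default as long as the order by key has no ties within a partition; the default frame is RANGE-based, so it also includes rows tied with the current row.
rows between 2 preceding and 1 following computes over the 2 rows before the current row, the current row itself, and the 1 row after it.
rows between current row and unbounded following computes over the current row through the last row of the partition.
rows between has essentially the same meaning for the avg, min, max and sum window functions; compare the result columns to see the effect of each frame.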
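To make the frames in the example concrete, here is a minimal Python sketch that evaluates them per partition. The sample rows and the helper name `windowed` are hypothetical (the contents of the `temp` table are not shown in this article), and the code only mimics the frame semantics; it is not how Hive executes the query.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (country, time, charge) rows; not the real contents of `temp`.
rows = [
    ("cn", 1, 5), ("cn", 2, 10), ("cn", 3, 30),
    ("cn", 4, 2), ("cn", 5, 1), ("cn", 6, 4),
    ("us", 1, 7), ("us", 2, 3),
]


def windowed(values, agg, before, after):
    """Apply agg over ROWS BETWEEN <before> PRECEDING AND <after> FOLLOWING.

    None stands for UNBOUNDED and 0 for CURRENT ROW.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if before is None else max(i - before, 0)
        hi = n - 1 if after is None else min(i + after, n - 1)
        out.append(agg(values[lo:hi + 1]))
    return out


result = {}
# partition by country order by time
ordered = sorted(rows, key=itemgetter(0, 1))
for country, group in groupby(ordered, key=itemgetter(0)):
    charges = [charge for _, _, charge in group]
    result[country] = {
        "unb_pre_cur": windowed(charges, max, None, 0),
        "pre2_fol1": windowed(charges, max, 2, 1),
        "cur_unb_fol": windowed(charges, max, 0, None),
        # The same frame works unchanged with another aggregate.
        "sum_pre2_fol1": windowed(charges, sum, 2, 1),
    }

for country, columns in result.items():
    print(country, columns)
```

Because `time` is unique within each country in this sample, the `normal` column of the query would equal `unb_pre_cur` here. Swapping `max` for `min`, `sum` or an average changes only the aggregate, not which rows are selected.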
Create a temporary table in the Hive environment
Load the test data

    load data local inpath 'text.txt' into table tmp_student;

The content of text.txt is:
Check that the data was loaded successfully

    hive> select * from tmp_student;
    adf   3   測試公司1    45
    xx    3   測試公司2    55
    cfe   2   測試公司2    74
    3dd   3   測試公司5    NULL
    fda   1   測試公司7    80
    gds   2   測試公司9    92
    ffd   1   測試公司10   95
    dss   1   測試公司4    95
    ddd   3   測試公司3    99
    gf    3   測試公司9    99
    Time taken: 1.314 seconds, Fetched: 10 row(s)

Next, to see how preceding and following behave, run the following SQL:

    select name, score,
      sum(score) over(order by score range between 2 preceding and 2 following) s1, -- all rows whose score is within 2 of the current row's score
      sum(score) over(order by score rows between 2 preceding and 2 following) s2, -- the current row plus 2 rows before and 2 rows after, 5 rows in total
      sum(score) over(order by score range between unbounded preceding and unbounded following) s3, -- all rows, no restriction
      sum(score) over(order by score rows between unbounded preceding and unbounded following) s4, -- all rows, no restriction
      sum(score) over(order by score) s5, -- first row through the current row (every row with the same score as the current row is included)
      sum(score) over(order by score rows between unbounded preceding and current row) s6, -- first row through the current row (rows with the same score that sort after the current row are not included, unlike s5)
      sum(score) over(order by score rows between 3 preceding and current row) s7, -- the current row plus the 3 rows before it
      sum(score) over(order by score rows between 3 preceding and 1 following) s8, -- the current row plus 3 rows before and 1 row after
      sum(score) over(order by score rows between current row and unbounded following) s9 -- the current row plus all rows after it
    from tmp.tmp_student
    order by score;

Notes:
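When ORDER BY is present but the window clause is omitted, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and the window clause are omitted, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
ROWS defines a physical window: the frame is counted in rows, so it does not depend on the current row's value (the value of the order by key), only on row position after sorting. This is the behavior most people expect.
RANGE defines a logical window: the frame depends on the current row's value (the value of the order by key), and the offsets are applied to that key.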
Running the Hive query above returns one row per student with the nine sums s1 through s9.
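The exercise above gives a fairly complete picture of preceding and following. A window function is loosely comparable to a window in Flink, except that the frame is re-evaluated for every row, so it behaves more like a sliding window than a tumbling one. Every aggregate is computed over the rows inside that frame: first determine which rows fall into the frame according to RANGE or ROWS, and the correct result follows easily. When determining the rows, RANGE gives a logical (value-based) extent and ROWS a physical (position-based) one; either way, the result can be viewed as an aggregate over rows N through M.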
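The two-step view described above (select the frame rows, then aggregate them) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hive's implementation: the helper names are made up, the scores are the ones from the sample table, and the row with a NULL score is left out because NULL ordering is not modelled here.

```python
from bisect import bisect_left, bisect_right

# Scores from the sample table, sorted ascending, NULL row omitted (assumption).
scores = [45, 55, 74, 80, 92, 95, 95, 99, 99]


def select_rows(values, i, before, after):
    """Step 1 for ROWS: pick rows by position. None means UNBOUNDED."""
    lo = 0 if before is None else max(i - before, 0)
    hi = len(values) - 1 if after is None else min(i + after, len(values) - 1)
    return values[lo:hi + 1]


def select_range(values, i, before, after):
    """Step 1 for RANGE: pick rows by key value. None means UNBOUNDED."""
    lo = 0 if before is None else bisect_left(values, values[i] - before)
    hi = len(values) if after is None else bisect_right(values, values[i] + after)
    return values[lo:hi]


def over(values, select, before, after, agg=sum):
    """Step 2: aggregate whatever step 1 selected, once per row."""
    return [agg(select(values, i, before, after)) for i in range(len(values))]


s1 = over(scores, select_range, 2, 2)     # range between 2 preceding and 2 following
s2 = over(scores, select_rows, 2, 2)      # rows between 2 preceding and 2 following
s5 = over(scores, select_range, None, 0)  # default frame when only ORDER BY is given
s6 = over(scores, select_rows, None, 0)   # rows between unbounded preceding and current row

for name, column in [("s1", s1), ("s2", s2), ("s5", s5), ("s6", s6)]:
    print(name, column)
```

The tied scores 95 and 99 are where the columns diverge: the RANGE-based `s5` gives both tied rows the same running total, while the ROWS-based `s6` keeps accumulating row by row.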
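Windowing functions
- FIRST_VALUE(col, bool DEFAULT)

Returns the value of col from the first row of the window within the partition. DEFAULT is false by default; when it is set to true, NULL values are skipped before the value is taken. With the default frame, FIRST_VALUE(col, bool DEFAULT) for the first row of each partition simply equals col; for the rows that follow, the result depends on whether the first row is NULL and on the true/false flag.
- LAST_VALUE(col, bool DEFAULT)

Returns the value of col from the last row of the window within the partition. DEFAULT is false by default; when it is set to true, NULL values are skipped before the value is taken.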
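The effect of the second argument is easiest to see on a column that contains NULLs. The sketch below is a simplified Python model, assuming the frame runs from the first row of the partition to the current row and that the rows are already in window order. The function names mirror the Hive functions, but the code is only an illustration.

```python
def first_value(frame, skip_nulls=False):
    """First value in the frame; with skip_nulls, the first non-NULL value."""
    for value in frame:
        if value is not None or not skip_nulls:
            return value
    return None  # empty frame, or only NULLs while skipping


def last_value(frame, skip_nulls=False):
    """Last value in the frame; with skip_nulls, the last non-NULL value."""
    return first_value(reversed(frame), skip_nulls)


# One partition, already in window order; None plays the role of NULL.
col = [None, "a", None, "b"]

# Frame for row i: first row of the partition through the current row.
frames = [col[: i + 1] for i in range(len(col))]

first_default = [first_value(f) for f in frames]
first_skipping = [first_value(f, True) for f in frames]
last_default = [last_value(f) for f in frames]
last_skipping = [last_value(f, True) for f in frames]

print(first_default)
print(first_skipping)
print(last_default)
print(last_skipping)
```

With this frame, LAST_VALUE without the flag just returns the current row's value, which is why it is usually combined with an explicit frame such as rows between current row and unbounded following.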
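- LEAD(col, n, DEFAULT)

Returns the value of col from the n-th row after the current row within the partition. n defaults to 1; when there is no n-th following row, DEFAULT is returned (DEFAULT itself defaults to NULL). When PARTITION BY is used, each partition is evaluated independently and never reads rows from another partition.

    WITH tmp AS (
      SELECT 1 AS group_id, 'a' AS col UNION ALL
      SELECT 1 AS group_id, 'b' AS col UNION ALL
      SELECT 1 AS group_id, 'c' AS col UNION ALL
      SELECT 2 AS group_id, 'd' AS col UNION ALL
      SELECT 2 AS group_id, 'e' AS col
    )
    SELECT group_id, col,
           LEAD(col) over(partition by group_id order by col) as col_new
    FROM tmp;

Result:

| group_id | col | col_new |
| --- | --- | --- |
| 1 | a | b |
| 1 | b | c |
| 1 | c | NULL |
| 2 | d | e |
| 2 | e | NULL |

This is equivalent to:

    WITH tmp AS (
      SELECT 1 AS group_id, 'a' AS col UNION ALL
      SELECT 1 AS group_id, 'b' AS col UNION ALL
      SELECT 1 AS group_id, 'c' AS col UNION ALL
      SELECT 2 AS group_id, 'd' AS col UNION ALL
      SELECT 2 AS group_id, 'e' AS col
    )
    SELECT group_id, col,
           LAST_VALUE(col) over(partition by group_id order by col
                                rows between 1 FOLLOWING and 1 FOLLOWING) as col_new
    FROM tmp;

Here rows between 1 FOLLOWING and 1 FOLLOWING starts one row after the current row and ends one row after it, so the frame is exactly the next row.
It returns the same result as the LEAD query above.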
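The equivalence between LEAD and a one-row frame can be checked with a short Python sketch. The helpers `lead` and `lag` below are simplified stand-ins written for this illustration (LAG, covered next in the article, is the mirror image), and the partitions reuse the values from the example query.

```python
def lead(values, n=1, default=None):
    """Value n rows after each row, or default when that row does not exist."""
    size = len(values)
    return [values[i + n] if i + n < size else default for i in range(size)]


def lag(values, n=1, default=None):
    """Value n rows before each row, or default when that row does not exist."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def last_value_next_row(values):
    """LAST_VALUE over ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING."""
    out = []
    for i in range(len(values)):
        frame = values[i + 1:i + 2]  # at most one row: the next one
        out.append(frame[-1] if frame else None)
    return out


# partition by group_id order by col, using the rows of the example.
partitions = {1: ["a", "b", "c"], 2: ["d", "e"]}

for group_id, cols in partitions.items():
    # Each partition is processed on its own, so values never cross groups.
    print(group_id, lead(cols), last_value_next_row(cols), lead(cols, 2, "z"))
```

An empty frame yields NULL in the same places where LEAD runs past the end of the partition, which is why the two queries return the same column.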
Using LEAD with a default value:

    WITH tmp AS (
      SELECT 1 AS group_id, 'a' AS col UNION ALL
      SELECT 1 AS group_id, 'b' AS col UNION ALL
      SELECT 1 AS group_id, 'c' AS col UNION ALL
      SELECT 2 AS group_id, 'd' AS col UNION ALL
      SELECT 2 AS group_id, 'e' AS col
    )
    SELECT group_id, col,
           LEAD(col, 2, 'z') over(partition by group_id order by col) as col_new
    FROM tmp;

Result:

| group_id | col | col_new |
| --- | --- | --- |
| 1 | a | c |
| 1 | b | z |
| 1 | c | z |
| 2 | d | z |
| 2 | e | z |

- LAG(col, n, DEFAULT)
Returns the value of col from the n-th row before the current row within the partition. n defaults to 1; when there is no n-th preceding row, DEFAULT is returned (DEFAULT itself defaults to NULL).

With n = 1 and no default, LAG(col) over(partition by group_id order by col) is equivalent to:

    WITH tmp AS (
      SELECT 1 AS group_id, 'a' AS col UNION ALL
      SELECT 1 AS group_id, 'b' AS col UNION ALL
      SELECT 1 AS group_id, 'c' AS col UNION ALL
      SELECT 2 AS group_id, 'd' AS col UNION ALL
      SELECT 2 AS group_id, 'e' AS col
    )
    SELECT group_id, col,
           FIRST_VALUE(col) over(partition by group_id order by col
                                 rows BETWEEN 1 PRECEDING and 1 PRECEDING) as col_new
    FROM tmp;

Both return:

| group_id | col | col_new |
| --- | --- | --- |
| 1 | a | NULL |
| 1 | b | a |
| 1 | c | b |
| 2 | d | NULL |
| 2 | e | d |

Using a default value:

    WITH tmp AS (
      SELECT 1 AS group_id, 'a' AS col UNION ALL
      SELECT 1 AS group_id, 'b' AS col UNION ALL
      SELECT 1 AS group_id, 'c' AS col UNION ALL
      SELECT 2 AS group_id, 'd' AS col UNION ALL
      SELECT 2 AS group_id, 'e' AS col
    )
    SELECT group_id, col,
           LAG(col, 2, 'zz') over(partition by group_id order by col) as col_new
    FROM tmp;

Result:

| group_id | col | col_new |
| --- | --- | --- |
| 1 | a | zz |
| 1 | b | zz |
| 1 | c | a |
| 2 | d | zz |
| 2 | e | zz |

The OVER clause
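**FUNCTION(expr) OVER([PARTITION BY statement] [ORDER BY statement] [window clause])**

Square brackets mark optional parts.

- FUNCTION: the standard aggregate functions (COUNT/SUM/MIN/MAX/AVG) and the analytic functions (RANK/ROW_NUMBER/DENSE_RANK and others)
- PARTITION BY: one or more columns
- ORDER BY: one or more columns
- window clause: (ROWS | RANGE) BETWEEN (UNBOUNDED PRECEDING | num PRECEDING | CURRENT ROW | num FOLLOWING) AND (num PRECEDING | CURRENT ROW | num FOLLOWING | UNBOUNDED FOLLOWING)

When the window clause is omitted but ORDER BY is present, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, that is, the window runs from the first row of the partition to the current row.
When both the window clause and ORDER BY are omitted, the default is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, **that is, the window runs from the first row of the partition to the last.**

Standard aggregate functions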
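The examples in this section apply one aggregate over several frames. As a rough cross-check of the default-frame rules stated in the OVER description, the Python sketch below recomputes the four columns of the COUNT example for its sample rows. The function name `count_columns` is made up for this illustration, and the code models the frame semantics only; it is not Hive's implementation.

```python
# (group_id, col) rows used by the COUNT example in this section.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "e"), (2, "e")]


def count_columns(rows):
    """Return (group_id, col, cnt1, cnt2, cnt3, cnt4) tuples for the sample rows."""
    partitions = {}
    for group_id, col in rows:
        partitions.setdefault(group_id, []).append(col)

    out = []
    for group_id, cols in partitions.items():
        cols = sorted(cols)  # order by col
        n = len(cols)
        for i, col in enumerate(cols):
            # No ORDER BY, no window clause: the whole partition.
            cnt1 = n
            # ORDER BY only: RANGE from the first row to the current row,
            # which also includes rows tied with the current row.
            cnt2 = sum(1 for other in cols if other <= col)
            # rows between current row and unbounded following.
            cnt3 = n - i
            # DISTINCT over the same frame as cnt3.
            cnt4 = len(set(cols[i:]))
            out.append((group_id, col, cnt1, cnt2, cnt3, cnt4))
    return out


for line in count_columns(rows):
    print(line)
```

The two tied 'e' rows both get cnt2 = 2, because the default RANGE frame ends at the last row that shares the current row's key rather than at the current row itself.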
COUNT(expr) OVER()
Returns the number of rows in the window.

    WITH tmp AS (
      SELECT 1 AS group_id, 'a' AS col UNION ALL
      SELECT 1 AS group_id, 'b' AS col UNION ALL
      SELECT 1 AS group_id, 'c' AS col UNION ALL
      SELECT 2 AS group_id, 'e' AS col UNION ALL
      SELECT 2 AS group_id, 'e' AS col
    )
    SELECT group_id, col,
           count(col) over(partition by group_id) as cnt1,
           count(col) over(partition by group_id order by col) as cnt2,
           count(col) over(partition by group_id order by col
                           rows between CURRENT ROW and UNBOUNDED following) as cnt3,
           count(distinct col) over(partition by group_id order by col
                                    rows between CURRENT ROW and UNBOUNDED following) as cnt4
    FROM tmp;

Result:

| group_id | col | cnt1 | cnt2 | cnt3 | cnt4 |
| --- | --- | --- | --- | --- | --- |
| 1 | a | 3 | 1 | 3 | 3 |
| 1 | b | 3 | 2 | 2 | 2 |
| 1 | c | 3 | 3 | 1 | 1 |
| 2 | e | 2 | 2 | 2 | 1 |
| 2 | e | 2 | 2 | 1 | 1 |

SUM(expr) OVER()
Returns the sum over the window.

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 2 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col
    )
    SELECT group_id, col,
           SUM(col) over(partition by group_id) as sum1,
           SUM(col) over(partition by group_id order by col) as sum2,
           SUM(col) over(partition by group_id order by col
                         rows between CURRENT ROW and UNBOUNDED following) as sum3,
           SUM(distinct col) over(partition by group_id order by col
                                  rows between CURRENT ROW and UNBOUNDED following) as sum4
    FROM tmp;

Result:

| group_id | col | sum1 | sum2 | sum3 | sum4 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 6 | 1 | 6 | 6 |
| 1 | 2 | 6 | 3 | 5 | 5 |
| 1 | 3 | 6 | 6 | 3 | 3 |
| 2 | 4 | 8 | 8 | 8 | 4 |
| 2 | 4 | 8 | 8 | 4 | 4 |

MIN(expr) OVER()
Returns the minimum value in the window.

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 2 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 5 AS col
    )
    SELECT group_id, col,
           MIN(col) over(partition by group_id) as min1,
           MIN(col) over(partition by group_id order by col) as min2,
           MIN(col) over(partition by group_id order by col
                         rows between CURRENT ROW and UNBOUNDED following) as min3
    FROM tmp;

Result:

| group_id | col | min1 | min2 | min3 |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 1 | 1 |
| 1 | 2 | 1 | 1 | 2 |
| 1 | 3 | 1 | 1 | 3 |
| 2 | 4 | 4 | 4 | 4 |
| 2 | 5 | 4 | 4 | 5 |

MAX(expr) OVER()
Returns the maximum value in the window.

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 2 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 5 AS col
    )
    SELECT group_id, col,
           MAX(col) over(partition by group_id) as max1,
           MAX(col) over(partition by group_id order by col) as max2,
           MAX(col) over(partition by group_id order by col
                         rows between CURRENT ROW and UNBOUNDED following) as max3
    FROM tmp;

Result:

| group_id | col | max1 | max2 | max3 |
| --- | --- | --- | --- | --- |
| 1 | 1 | 3 | 1 | 3 |
| 1 | 2 | 3 | 2 | 3 |
| 1 | 3 | 3 | 3 | 3 |
| 2 | 4 | 5 | 4 | 5 |
| 2 | 5 | 5 | 5 | 5 |

AVG(expr) OVER()
Returns the average over the window.

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 2 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col
    )
    SELECT group_id, col,
           AVG(col) over(partition by group_id) as avg1,
           AVG(col) over(partition by group_id order by col) as avg2,
           AVG(col) over(partition by group_id order by col
                         rows between CURRENT ROW and UNBOUNDED following) as avg3,
           AVG(distinct col) over(partition by group_id order by col
                                  rows between CURRENT ROW and UNBOUNDED following) as avg4
    FROM tmp;

Result:

| group_id | col | avg1 | avg2 | avg3 | avg4 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 2.0 | 1.0 | 2.0 | 2.0 |
| 1 | 2 | 2.0 | 1.5 | 2.5 | 2.5 |
| 1 | 3 | 2.0 | 2.0 | 3.0 | 3.0 |
| 2 | 4 | 4.0 | 4.0 | 4.0 | 4.0 |
| 2 | 4 | 4.0 | 4.0 | 4.0 | 4.0 |

Analytics functions

RANK() OVER()
Returns the rank within the partition (a custom window frame is not supported).

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 5 AS col
    )
    SELECT group_id, col,
           RANK() over(partition by group_id order by col desc) as r
    FROM tmp;

Result:

| group_id | col | r |
| --- | --- | --- |
| 1 | 3 | 1 |
| 1 | 3 | 1 |
| 1 | 1 | 3 |
| 2 | 5 | 1 |
| 2 | 4 | 2 |

ROW_NUMBER() OVER()
Returns the row number within the partition (a custom window frame is not supported).

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 5 AS col
    )
    SELECT group_id, col,
           ROW_NUMBER() over(partition by group_id order by col desc) as r
    FROM tmp;

Result:

| group_id | col | r |
| --- | --- | --- |
| 1 | 3 | 1 |
| 1 | 3 | 2 |
| 1 | 1 | 3 |
| 2 | 5 | 1 |
| 2 | 4 | 2 |

DENSE_RANK() OVER()
Returns the rank within the partition; equal ranks leave no gaps (a custom window frame is not supported).

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 5 AS col
    )
    SELECT group_id, col,
           DENSE_RANK() over(partition by group_id order by col desc) as r
    FROM tmp;

Result:

| group_id | col | r |
| --- | --- | --- |
| 1 | 3 | 1 |
| 1 | 3 | 1 |
| 1 | 1 | 2 |
| 2 | 5 | 1 |
| 2 | 4 | 2 |

CUME_DIST() OVER()
Returns the cumulative distribution within the partition: (number of rows whose value is less than or equal to the current value, or greater than or equal to it when ordering descending) / (total number of rows in the partition).

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 5 AS col
    )
    SELECT group_id, col,
           CUME_DIST() over(partition by group_id order by col asc) as d1,
           CUME_DIST() over(partition by group_id order by col desc) as d2
    FROM tmp;

Result:

| group_id | col | d1 | d2 |
| --- | --- | --- | --- |
| 1 | 3 | 1.0 | 0.6666666666666666 |
| 1 | 3 | 1.0 | 0.6666666666666666 |
| 1 | 1 | 0.3333333333333333 | 1.0 |
| 2 | 5 | 1.0 | 0.5 |
| 2 | 4 | 0.5 | 1.0 |

PERCENT_RANK() OVER()
Returns the percentage rank: (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1).

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 5 AS col
    )
    SELECT group_id, col,
           RANK() over(partition by group_id order by col asc) as r1,
           PERCENT_RANK() over(partition by group_id order by col asc) as p1,
           RANK() over(partition by group_id order by col desc) as r2,
           PERCENT_RANK() over(partition by group_id order by col desc) as p2
    FROM tmp;

Result:

| group_id | col | r1 | p1 | r2 | p2 |
| --- | --- | --- | --- | --- | --- |
| 1 | 3 | 2 | 0.5 | 1 | 0.0 |
| 1 | 3 | 2 | 0.5 | 1 | 0.0 |
| 1 | 1 | 1 | 0.0 | 3 | 1.0 |
| 2 | 5 | 2 | 1.0 | 1 | 0.0 |
| 2 | 4 | 1 | 0.0 | 2 | 1.0 |

NTILE(INTEGER x) OVER()
Returns the bucket number: the ordered partition is divided into x groups, called buckets, and every row in the partition is assigned a bucket number.

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 5 AS col
    )
    SELECT group_id, col,
           NTILE(2) over(partition by group_id order by col asc) as bucket_id
    FROM tmp;

Result:

| group_id | col | bucket_id |
| --- | --- | --- |
| 1 | 1 | 1 |
| 1 | 3 | 1 |
| 1 | 3 | 2 |
| 1 | 3 | 2 |
| 2 | 4 | 1 |
| 2 | 5 | 2 |

Aggregate functions inside the OVER clause
In Hive 2.1.0 and later, the OVER clause can also reference aggregate functions, for example:

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 5 AS col
    )
    SELECT group_id,
           RANK() over(order by sum(col) desc) as r
    FROM tmp
    group by group_id;

Result:

| group_id | r |
| --- | --- |
| 2 | 1 |
| 1 | 2 |

An alternative way to write the window clause
Put a WINDOW clause after FROM and refer to it by alias after OVER, as follows:

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 2 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col
    )
    SELECT group_id, col,
           AVG(col) over w1 as avg1,
           AVG(distinct col) over(partition by group_id order by col
                                  rows between CURRENT ROW and UNBOUNDED following) as avg2
    FROM tmp
    WINDOW w1 AS (partition by group_id order by col
                  rows between CURRENT ROW and UNBOUNDED following);

Result:

| group_id | col | avg1 | avg2 |
| --- | --- | --- | --- |
| 1 | 1 | 2.0 | 2.0 |
| 1 | 2 | 2.5 | 2.5 |
| 1 | 3 | 3.0 | 3.0 |
| 2 | 4 | 4.0 | 4.0 |
| 2 | 4 | 4.0 | 4.0 |

Several windows can be named in one WINDOW clause:

    WITH tmp AS (
      SELECT 1 AS group_id, 1 AS col UNION ALL
      SELECT 1 AS group_id, 2 AS col UNION ALL
      SELECT 1 AS group_id, 3 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col UNION ALL
      SELECT 2 AS group_id, 4 AS col
    )
    SELECT group_id, col,
           AVG(col) over w1 as avg1,
           AVG(distinct col) over w2 as avg2
    FROM tmp
    WINDOW w1 AS (partition by group_id order by col
                  rows between CURRENT ROW and UNBOUNDED following),
           w2 AS (partition by group_id order by col
                  rows between CURRENT ROW and UNBOUNDED following);

Result:

| group_id | col | avg1 | avg2 |
| --- | --- | --- | --- |
| 1 | 1 | 2.0 | 2.0 |
| 1 | 2 | 2.5 | 2.5 |
| 1 | 3 | 3.0 | 3.0 |
| 2 | 4 | 4.0 | 4.0 |
| 2 | 4 | 4.0 | 4.0 |
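The ranking functions covered earlier (ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST and NTILE) differ mainly in how they treat ties. The Python sketch below works through the definitions given in this article for a single partition. The function names `ranking` and `ntile` are made up for the illustration. Two points are assumptions rather than facts taken from the article: leftover rows in NTILE go to the lowest-numbered buckets, and PERCENT_RANK is 0.0 for a one-row partition.

```python
def ranking(values, descending=False):
    """Ranking columns for one partition ordered by value."""
    ordered = sorted(values, reverse=descending)
    n = len(ordered)
    out = []
    rank = dense = 0
    for i, value in enumerate(ordered):
        if i == 0 or value != ordered[i - 1]:
            rank = i + 1   # RANK jumps to the row number after a run of ties
            dense += 1     # DENSE_RANK advances by one, leaving no gaps
        # Rows up to and including the last row tied with the current one.
        through_peers = max(j for j, other in enumerate(ordered) if other == value) + 1
        out.append({
            "value": value,
            "row_number": i + 1,
            "rank": rank,
            "dense_rank": dense,
            "percent_rank": (rank - 1) / (n - 1) if n > 1 else 0.0,
            "cume_dist": through_peers / n,
        })
    return out


def ntile(n_rows, buckets):
    """Bucket number for each of n_rows ordered rows (assumed tie-breaking)."""
    base, extra = divmod(n_rows, buckets)
    out = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        out.extend([bucket] * size)
    return out


# group_id = 1 from the ranking examples, ordered by col desc.
for row in ranking([1, 3, 3], descending=True):
    print(row)

# group_id = 1 and group_id = 2 from the NTILE(2) example.
print(ntile(4, 2), ntile(2, 2))
```

For the sample partition this gives the same rank, dense rank, percent rank and cumulative distribution values as the result tables of the corresponding queries.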
Any suggestions and criticisms will be sincerely welcomed.