
Hive Analytic Window Functions: Article Series

Published: 2024/1/17

Typical use cases for analytic window functions:

(1) Sorting within partitions

(2) Dynamic GROUP BY

(3) Top N

(4) Cumulative calculations

(5) Hierarchical queries

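Hive Analytic Window Functions (1): SUM, AVG, MIN, MAX

Hive provides a growing number of analytic functions for complex statistical analysis. This series goes through all of the analytic window functions, one part at a time.

This part starts with the basic ones: SUM, AVG, MIN and MAX.

They compute statistics over all rows of a partition, or running (cumulative) statistics within it.

Data preparation:
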
  • CREATE EXTERNAL TABLE lxw1234 (

  • cookieid string,

  • createtime string, --day

  • pv INT

  • ) ROW FORMAT DELIMITED

  • FIELDS TERMINATED BY ','

  • stored as textfile location '/tmp/lxw11/';

  • DESC lxw1234;

  • cookieid STRING

  • createtime STRING

  • pv INT

  • hive> select * from lxw1234;

  • OK

  • cookie1 2015-04-10 1

  • cookie1 2015-04-11 5

  • cookie1 2015-04-12 7

  • cookie1 2015-04-13 3

  • cookie1 2015-04-14 2

  • cookie1 2015-04-15 4

  • cookie1 2015-04-16 4

  • SUM: note that the result depends on ORDER BY, which sorts in ascending order by default.

  • SELECT cookieid,

  • createtime,

  • pv,

  • SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row

  • SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start of the partition to the current row; same result as pv1

  • SUM(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition

  • SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows

  • SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row

  • SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows

  • FROM lxw1234;

  • cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

  • -----------------------------------------------------------------------------

  • cookie1 2015-04-10 1 1 1 26 1 6 26

  • cookie1 2015-04-11 5 6 6 26 6 13 25

  • cookie1 2015-04-12 7 13 13 26 13 16 20

  • cookie1 2015-04-13 3 16 16 26 16 18 13

  • cookie1 2015-04-14 2 18 18 26 17 21 10

  • cookie1 2015-04-15 4 22 22 26 16 20 8

  • cookie1 2015-04-16 4 26 26 26 13 13 4

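  • pv1: running total of pv from the start of the partition to the current row, e.g. the 11th = pv of the 10th + pv of the 11th = 1+5 = 6; the 12th = 10th + 11th + 12th = 13
    pv2: same as pv1
    pv3: sum of pv over all rows in the partition (cookie1)
    pv4: current row + 3 preceding rows within the partition, e.g. 11th = 10th + 11th; 12th = 10th + 11th + 12th; 13th = 10th + 11th + 12th + 13th; 14th = 11th + 12th + 13th + 14th
    pv5: current row + 3 preceding rows + 1 following row, e.g. 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21
    pv6: current row + all following rows, e.g. 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13; 14th = 14th + 15th + 16th = 2+4+4 = 10

    If ORDER BY is given without ROWS BETWEEN, the frame defaults to everything from the start of the partition to the current row. Strictly, Hive's documented default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on the ORDER BY key are included together; createtime is unique within each cookie here, so pv1 equals pv2.
    If ORDER BY is omitted as well, all values in the partition are aggregated.
    The key is to understand ROWS BETWEEN, also called the WINDOW clause:
    PRECEDING: rows before the current row
    FOLLOWING: rows after the current row
    CURRENT ROW: the current row
    UNBOUNDED: no limit; UNBOUNDED PRECEDING means from the first row of the partition, UNBOUNDED FOLLOWING means to the last row of the partition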
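
To make the frame boundaries concrete, here is a minimal Python sketch that emulates a ROWS BETWEEN frame over the cookie1 sample data. It is not how Hive evaluates windows internally; the helper name `rows_between` and its parameters are made up for illustration, with `None` standing in for UNBOUNDED.

```python
# pv values for cookie1, already sorted by createtime (taken from the sample data).
pv = [1, 5, 7, 3, 2, 4, 4]

def rows_between(values, preceding, following, agg=sum):
    """Apply agg over a ROWS frame around each row (hypothetical helper).

    preceding / following are row counts; None stands for UNBOUNDED.
    """
    out = []
    n = len(values)
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        out.append(agg(values[start:end]))
    return out

pv1 = rows_between(pv, None, 0)   # UNBOUNDED PRECEDING AND CURRENT ROW
pv4 = rows_between(pv, 3, 0)      # 3 PRECEDING AND CURRENT ROW
pv5 = rows_between(pv, 3, 1)      # 3 PRECEDING AND 1 FOLLOWING
pv6 = rows_between(pv, 0, None)   # CURRENT ROW AND UNBOUNDED FOLLOWING
max5 = rows_between(pv, 3, 1, agg=max)  # same frame as pv5, with MAX instead of SUM

print(pv1, pv4, pv5, pv6, max5)
```

Passing a different `agg` (for example `min`, `max`, or a mean function) changes only the aggregate, not the frame, which is why AVG, MIN and MAX accept the same clauses as SUM.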

    AVG, MIN and MAX are used in the same way as SUM.

  • --AVG

  • SELECT cookieid,

  • createtime,

  • pv,

  • AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row

  • AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start of the partition to the current row; same result as pv1

  • AVG(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition

  • AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows

  • AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row

  • AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows

  • FROM lxw1234;

  • cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

  • -----------------------------------------------------------------------------

  • cookie1 2015-04-10 1 1.0 1.0 3.7142857142857144 1.0 3.0 3.7142857142857144

  • cookie1 2015-04-11 5 3.0 3.0 3.7142857142857144 3.0 4.333333333333333 4.166666666666667

  • cookie1 2015-04-12 7 4.333333333333333 4.333333333333333 3.7142857142857144 4.333333333333333 4.0 4.0

  • cookie1 2015-04-13 3 4.0 4.0 3.7142857142857144 4.0 3.6 3.25

  • cookie1 2015-04-14 2 3.6 3.6 3.7142857142857144 4.25 4.2 3.3333333333333335

  • cookie1 2015-04-15 4 3.6666666666666665 3.6666666666666665 3.7142857142857144 4.0 4.0 4.0

  • cookie1 2015-04-16 4 3.7142857142857144 3.7142857142857144 3.7142857142857144 3.25 3.25 4.0

  • --MIN

  • SELECT cookieid,

  • createtime,

  • pv,

  • MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row

  • MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start of the partition to the current row; same result as pv1

  • MIN(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition

  • MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows

  • MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row

  • MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows

  • FROM lxw1234;

  • cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

  • -----------------------------------------------------------------------------

  • cookie1 2015-04-10 1 1 1 1 1 1 1

  • cookie1 2015-04-11 5 1 1 1 1 1 2

  • cookie1 2015-04-12 7 1 1 1 1 1 2

  • cookie1 2015-04-13 3 1 1 1 1 1 2

  • cookie1 2015-04-14 2 1 1 1 2 2 2

  • cookie1 2015-04-15 4 1 1 1 2 2 4

  • cookie1 2015-04-16 4 1 1 1 2 2 4

  • --MAX

  • SELECT cookieid,

  • createtime,

  • pv,

  • MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row

  • MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start of the partition to the current row; same result as pv1

  • MAX(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition

  • MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows

  • MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row

  • MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows

  • FROM lxw1234;

  • cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

  • -----------------------------------------------------------------------------

  • cookie1 2015-04-10 1 1 1 7 1 5 7

  • cookie1 2015-04-11 5 5 5 7 5 7 7

  • cookie1 2015-04-12 7 7 7 7 7 7 7

  • cookie1 2015-04-13 3 7 7 7 7 7 4

  • cookie1 2015-04-14 2 7 7 7 7 7 4

  • cookie1 2015-04-15 4 7 7 7 7 7 4

  • cookie1 2015-04-16 4 7 7 7 4 4 4

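
    Hive Analytic Window Functions (2): NTILE, ROW_NUMBER, RANK, DENSE_RANK

    This part covers the first group of sequence functions: NTILE, ROW_NUMBER, RANK and DENSE_RANK. The purpose of each is explained below.

    Note: sequence functions do not support the WINDOW clause (the ROWS BETWEEN clause explained in part (1) above).

    Data preparation:
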
  • CREATE EXTERNAL TABLE lxw1234 (

  • cookieid string,

  • createtime string, --day

  • pv INT

  • ) ROW FORMAT DELIMITED

  • FIELDS TERMINATED BY ','

  • stored as textfile location '/tmp/lxw11/';

  • DESC lxw1234;

  • cookieid STRING

  • createtime STRING

  • pv INT

  • hive> select * from lxw1234;

  • OK

  • cookie1 2015-04-10 1

  • cookie1 2015-04-11 5

  • cookie1 2015-04-12 7

  • cookie1 2015-04-13 3

  • cookie1 2015-04-14 2

  • cookie1 2015-04-15 4

  • cookie1 2015-04-16 4

  • cookie2 2015-04-10 2

  • cookie2 2015-04-11 3

  • cookie2 2015-04-12 5

  • cookie2 2015-04-13 6

  • cookie2 2015-04-14 3

  • cookie2 2015-04-15 9

  • cookie2 2015-04-16 7

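
    NTILE

    NTILE(n) splits the ordered rows of a partition into n buckets and returns the bucket number of the current row.
    NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.
    If the rows cannot be divided evenly, the extra rows go to the lowest-numbered buckets, one row each, starting with the first bucket.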
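
The bucket assignment can be sketched in a few lines of Python. This is an emulation for illustration only, not Hive code; the function name `ntile` simply mirrors the SQL function, and it takes the number of ordered rows in the partition as input.

```python
def ntile(row_count, n):
    """Return the bucket number (1..n) for each of row_count ordered rows."""
    base, extra = divmod(row_count, n)
    buckets = []
    for b in range(1, n + 1):
        # The first `extra` buckets each receive one additional row.
        size = base + (1 if b <= extra else 0)
        buckets.extend([b] * size)
    return buckets

print(ntile(7, 2))   # 7 days of one cookie split into 2 buckets
print(ntile(7, 3))   # 7 days of one cookie split into 3 buckets
print(ntile(14, 4))  # all 14 rows split into 4 buckets
```

With 7 rows per cookie, NTILE(2) gives bucket sizes 4 and 3, and NTILE(3) gives 3, 2 and 2, which matches the rn1 and rn2 columns in the result for the sample data.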

  • SELECT

  • cookieid,

  • createtime,

  • pv,

  • NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1, -- split each partition into 2 buckets

  • NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2, -- split each partition into 3 buckets

  • NTILE(4) OVER(ORDER BY createtime) AS rn3 -- split all rows into 4 buckets

  • FROM lxw1234

  • ORDER BY cookieid,createtime;

  • cookieid day pv rn1 rn2 rn3

  • -------------------------------------------------

  • cookie1 2015-04-10 1 1 1 1

  • cookie1 2015-04-11 5 1 1 1

  • cookie1 2015-04-12 7 1 1 2

  • cookie1 2015-04-13 3 1 2 2

  • cookie1 2015-04-14 2 2 2 3

  • cookie1 2015-04-15 4 2 3 3

  • cookie1 2015-04-16 4 2 3 4

  • cookie2 2015-04-10 2 1 1 1

  • cookie2 2015-04-11 3 1 1 1

  • cookie2 2015-04-12 5 1 1 2

  • cookie2 2015-04-13 6 1 2 2

  • cookie2 2015-04-14 3 2 2 3

  • cookie2 2015-04-15 9 2 3 4

  • cookie2 2015-04-16 7 2 3 4


    For example, to find the top third of days by pv for each cookie:

  • SELECT

  • cookieid,

  • createtime,

  • pv,

  • NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn

  • FROM lxw1234;

  • -- the rows with rn = 1 are the result we want

  • cookieid day pv rn

  • ----------------------------------

  • cookie1 2015-04-12 7 1

  • cookie1 2015-04-11 5 1

  • cookie1 2015-04-15 4 1

  • cookie1 2015-04-16 4 2

  • cookie1 2015-04-13 3 2

  • cookie1 2015-04-14 2 3

  • cookie1 2015-04-10 1 3

  • cookie2 2015-04-15 9 1

  • cookie2 2015-04-16 7 1

  • cookie2 2015-04-13 6 1

  • cookie2 2015-04-12 5 2

  • cookie2 2015-04-14 3 2

  • cookie2 2015-04-11 3 3

  • cookie2 2015-04-10 2 3

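
    ROW_NUMBER

    ROW_NUMBER() numbers the rows within a partition sequentially, starting from 1, in the given order.
    For example, ordering by pv descending gives each day's pv rank within its partition.
    ROW_NUMBER() has many uses, such as fetching the top-ranked row of each group, or the first referrer in a session.
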
  • SELECT

  • cookieid,

  • createtime,

  • pv,

  • ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn

  • FROM lxw1234;

  • cookieid day pv rn

  • -------------------------------------------

  • cookie1 2015-04-12 7 1

  • cookie1 2015-04-11 5 2

  • cookie1 2015-04-15 4 3

  • cookie1 2015-04-16 4 4

  • cookie1 2015-04-13 3 5

  • cookie1 2015-04-14 2 6

  • cookie1 2015-04-10 1 7

  • cookie2 2015-04-15 9 1

  • cookie2 2015-04-16 7 2

  • cookie2 2015-04-13 6 3

  • cookie2 2015-04-12 5 4

  • cookie2 2015-04-14 3 5

  • cookie2 2015-04-11 3 6

  • cookie2 2015-04-10 2 7


    RANK and DENSE_RANK

    RANK() returns the rank of a row within its partition; tied rows share a rank and leave gaps in the ranks that follow.
    DENSE_RANK() returns the rank of a row within its partition; tied rows share a rank and leave no gaps.

  • SELECT

  • cookieid,

  • createtime,

  • pv,

  • RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,

  • DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,

  • ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3

  • FROM lxw1234

  • WHERE cookieid = 'cookie1';

  • cookieid day pv rn1 rn2 rn3

  • --------------------------------------------------

  • cookie1 2015-04-12 7 1 1 1

  • cookie1 2015-04-11 5 2 2 2

  • cookie1 2015-04-15 4 3 3 3

  • cookie1 2015-04-16 4 3 3 4

  • cookie1 2015-04-13 3 5 4 5

  • cookie1 2015-04-14 2 6 5 6

  • cookie1 2015-04-10 1 7 6 7

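  • rn1: the 15th and 16th tie for 3rd place, and the 13th is ranked 5th

  • rn2: the 15th and 16th tie for 3rd place, and the 13th is ranked 4th

  • rn3: ROW_NUMBER always produces a unique sequence. When rows tie on the ORDER BY value, as the 15th and 16th do here, the order among them is not guaranteed; add a tie-breaking column to ORDER BY if a deterministic order is needed.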
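
The difference between the three numbering schemes is easiest to see side by side. The following Python sketch emulates them for a single partition ordered by value descending; `rank_functions` is a made-up helper for illustration, not a Hive API.

```python
def rank_functions(values):
    """Return (ordered, rank, dense_rank, row_number) for values sorted descending."""
    ordered = sorted(values, reverse=True)
    rank, dense, row_num = [], [], []
    for i, v in enumerate(ordered):
        if i > 0 and v == ordered[i - 1]:
            rank.append(rank[-1])    # tie: same rank as the previous row
            dense.append(dense[-1])
        else:
            rank.append(i + 1)       # position in the list, so gaps appear after ties
            dense.append(dense[-1] + 1 if dense else 1)
        row_num.append(i + 1)        # always unique, ties or not
    return ordered, rank, dense, row_num

# pv values of cookie1 from the sample data
ordered, rn1, rn2, rn3 = rank_functions([1, 5, 7, 3, 2, 4, 4])
print(ordered, rn1, rn2, rn3)
```

Python's sort is stable, so the two rows with pv = 4 keep their input order here; Hive makes no such promise for ROW_NUMBER.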

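
    Hive Analytic Window Functions (3): CUME_DIST, PERCENT_RANK

    These two sequence functions are not used very often, but they are covered here as well.

    Note: sequence functions do not support the WINDOW clause (the ROWS BETWEEN clause explained in part (1) above).

    Data preparation:
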
  • CREATE EXTERNAL TABLE lxw1234 (

  • dept STRING,

  • userid string,

  • sal INT

  • ) ROW FORMAT DELIMITED

  • FIELDS TERMINATED BY ','

  • stored as textfile location '/tmp/lxw11/';

  • hive> select * from lxw1234;

  • OK

  • d1 user1 1000

  • d1 user2 2000

  • d1 user3 3000

  • d2 user4 4000

  • d2 user5 5000

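
    CUME_DIST

    CUME_DIST returns (number of rows with a value less than or equal to the current value) / (total number of rows in the partition).
    For example, it gives the proportion of people whose salary is less than or equal to the current salary.
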
  • SELECT

  • dept,

  • userid,

  • sal,

  • CUME_DIST() OVER(ORDER BY sal) AS rn1,

  • CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2

  • FROM lxw1234;

  • dept userid sal rn1 rn2

  • -------------------------------------------

  • d1 user1 1000 0.2 0.3333333333333333

  • d1 user2 2000 0.4 0.6666666666666666

  • d1 user3 3000 0.6 1.0

  • d2 user4 4000 0.8 0.5

  • d2 user5 5000 1.0 1.0

  • rn1: there is no PARTITION BY, so all rows form one group of 5 rows

  • Row 1: 1 row has sal <= 1000, so 1/5 = 0.2

  • Row 3: 3 rows have sal <= 3000, so 3/5 = 0.6

  • rn2: partitioned by department; dept = d1 has 3 rows

  • Row 2: 2 rows have sal <= 2000, so 2/3 = 0.6666666666666666
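
The same arithmetic can be written as a short Python sketch. It assumes ascending order on the value, as in the query above; `cume_dist` is only an illustrative helper named after the SQL function.

```python
def cume_dist(values):
    """For each value: (rows with value <= current value) / (total rows)."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]

all_sal = [1000, 2000, 3000, 4000, 5000]
d1_sal = [1000, 2000, 3000]

rn1 = cume_dist(all_sal)  # no PARTITION BY: one group of 5 rows
rn2_d1 = cume_dist(d1_sal)  # partition dept = d1 only
print(rn1, rn2_d1)
```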

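
    PERCENT_RANK

    PERCENT_RANK returns (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1).
    I am not sure of its typical use cases; it may be useful when implementing certain specialized algorithms.
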
  • SELECT

  • dept,

  • userid,

  • sal,

  • PERCENT_RANK() OVER(ORDER BY sal) AS rn1, -- over all rows (no PARTITION BY)

  • RANK() OVER(ORDER BY sal) AS rn11, -- RANK value over all rows

  • SUM(1) OVER(PARTITION BY NULL) AS rn12, -- total number of rows

  • PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2

  • FROM lxw1234;

  • dept userid sal rn1 rn11 rn12 rn2

  • ---------------------------------------------------

  • d1 user1 1000 0.0 1 5 0.0

  • d1 user2 2000 0.25 2 5 0.5

  • d1 user3 3000 0.5 3 5 1.0

  • d2 user4 4000 0.75 4 5 0.0

  • d2 user5 5000 1.0 5 5 1.0

  • rn1: rn1 = (rn11-1) / (rn12-1)

  • Row 1: (1-1)/(5-1) = 0/4 = 0

  • Row 2: (2-1)/(5-1) = 1/4 = 0.25

  • Row 4: (4-1)/(5-1) = 3/4 = 0.75

  • rn2: partitioned by dept,

  • dept = d1 has 3 rows in total

  • Row 1: (1-1)/(3-1) = 0

  • Row 3: (3-1)/(3-1) = 1
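
As a cross-check of the formula, here is a minimal Python sketch. `percent_rank` is an illustrative helper, and returning 0.0 for a single-row partition is my assumption about the edge case, not something the article states.

```python
def percent_rank(values):
    """(RANK - 1) / (rows - 1) for each value, ranking in ascending order."""
    n = len(values)
    if n == 1:
        return [0.0]  # assumed edge case: avoids division by zero
    # RANK - 1 equals the number of rows strictly smaller than the current value.
    return [sum(1 for other in values if other < v) / (n - 1) for v in values]

rn1 = percent_rank([1000, 2000, 3000, 4000, 5000])  # no PARTITION BY
rn2_d1 = percent_rank([1000, 2000, 3000])           # dept = d1
rn2_d2 = percent_rank([4000, 5000])                 # dept = d2
print(rn1, rn2_d1, rn2_d2)
```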

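
    Hive Analytic Window Functions (4): LAG, LEAD, FIRST_VALUE, LAST_VALUE

    This part continues with these four analytic functions.

    Note: LAG and LEAD do not support the WINDOW clause. FIRST_VALUE and LAST_VALUE do accept one, and their results depend on the window frame in effect (the ROWS BETWEEN clause explained in part (1) above).

    Data preparation:
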
  • CREATE EXTERNAL TABLE lxw1234 (

  • cookieid string,

  • createtime string, -- page visit time

  • url STRING -- visited page

  • ) ROW FORMAT DELIMITED

  • FIELDS TERMINATED BY ','

  • stored as textfile location '/tmp/lxw11/';

  • hive> select * from lxw1234;

  • OK

  • cookie1 2015-04-10 10:00:02 url2

  • cookie1 2015-04-10 10:00:00 url1

  • cookie1 2015-04-10 10:03:04 1url3

  • cookie1 2015-04-10 10:50:05 url6

  • cookie1 2015-04-10 11:00:00 url7

  • cookie1 2015-04-10 10:10:00 url4

  • cookie1 2015-04-10 10:50:01 url5

  • cookie2 2015-04-10 10:00:02 url22

  • cookie2 2015-04-10 10:00:00 url11

  • cookie2 2015-04-10 10:03:04 1url33

  • cookie2 2015-04-10 10:50:05 url66

  • cookie2 2015-04-10 11:00:00 url77

  • cookie2 2015-04-10 10:10:00 url44

  • cookie2 2015-04-10 10:50:01 url55

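
    LAG

    LAG(col, n, DEFAULT) returns the value of col from the row n rows before the current row within the window.
    The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value, returned when there is no row n rows before the current row (NULL if not specified).
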
  • SELECT cookieid,

  • createtime,

  • url,

  • ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  • LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,

  • LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time

  • FROM lxw1234;

  • cookieid createtime url rn last_1_time last_2_time

  • -------------------------------------------------------------------------------------------

  • cookie1 2015-04-10 10:00:00 url1 1 1970-01-01 00:00:00 NULL

  • cookie1 2015-04-10 10:00:02 url2 2 2015-04-10 10:00:00 NULL

  • cookie1 2015-04-10 10:03:04 1url3 3 2015-04-10 10:00:02 2015-04-10 10:00:00

  • cookie1 2015-04-10 10:10:00 url4 4 2015-04-10 10:03:04 2015-04-10 10:00:02

  • cookie1 2015-04-10 10:50:01 url5 5 2015-04-10 10:10:00 2015-04-10 10:03:04

  • cookie1 2015-04-10 10:50:05 url6 6 2015-04-10 10:50:01 2015-04-10 10:10:00

  • cookie1 2015-04-10 11:00:00 url7 7 2015-04-10 10:50:05 2015-04-10 10:50:01

  • cookie2 2015-04-10 10:00:00 url11 1 1970-01-01 00:00:00 NULL

  • cookie2 2015-04-10 10:00:02 url22 2 2015-04-10 10:00:00 NULL

  • cookie2 2015-04-10 10:03:04 1url33 3 2015-04-10 10:00:02 2015-04-10 10:00:00

  • cookie2 2015-04-10 10:10:00 url44 4 2015-04-10 10:03:04 2015-04-10 10:00:02

  • cookie2 2015-04-10 10:50:01 url55 5 2015-04-10 10:10:00 2015-04-10 10:03:04

  • cookie2 2015-04-10 10:50:05 url66 6 2015-04-10 10:50:01 2015-04-10 10:10:00

  • cookie2 2015-04-10 11:00:00 url77 7 2015-04-10 10:50:05 2015-04-10 10:50:01

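  • last_1_time: the value 1 row back, with default '1970-01-01 00:00:00'

  • cookie1 row 1: there is no row 1 back, so the default 1970-01-01 00:00:00 is returned

  • cookie1 row 3: the value 1 row back is that of row 2, 2015-04-10 10:00:02

  • cookie1 row 6: the value 1 row back is that of row 5, 2015-04-10 10:50:01

  • last_2_time: the value 2 rows back, with no default specified

  • cookie1 row 1: there is no row 2 back, so NULL

  • cookie1 row 2: there is no row 2 back, so NULL

  • cookie1 row 4: the value 2 rows back is that of row 2, 2015-04-10 10:00:02

  • cookie1 row 7: the value 2 rows back is that of row 5, 2015-04-10 10:50:01
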

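    LEAD

    The opposite of LAG.
    LEAD(col, n, DEFAULT) returns the value of col from the row n rows after the current row within the window.
    The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value, returned when there is no row n rows after the current row (NULL if not specified).
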
  • SELECT cookieid,

  • createtime,

  • url,

  • ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  • LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,

  • LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time

  • FROM lxw1234;

  • cookieid createtime url rn next_1_time next_2_time

  • -------------------------------------------------------------------------------------------

  • cookie1 2015-04-10 10:00:00 url1 1 2015-04-10 10:00:02 2015-04-10 10:03:04

  • cookie1 2015-04-10 10:00:02 url2 2 2015-04-10 10:03:04 2015-04-10 10:10:00

  • cookie1 2015-04-10 10:03:04 1url3 3 2015-04-10 10:10:00 2015-04-10 10:50:01

  • cookie1 2015-04-10 10:10:00 url4 4 2015-04-10 10:50:01 2015-04-10 10:50:05

  • cookie1 2015-04-10 10:50:01 url5 5 2015-04-10 10:50:05 2015-04-10 11:00:00

  • cookie1 2015-04-10 10:50:05 url6 6 2015-04-10 11:00:00 NULL

  • cookie1 2015-04-10 11:00:00 url7 7 1970-01-01 00:00:00 NULL

  • cookie2 2015-04-10 10:00:00 url11 1 2015-04-10 10:00:02 2015-04-10 10:03:04

  • cookie2 2015-04-10 10:00:02 url22 2 2015-04-10 10:03:04 2015-04-10 10:10:00

  • cookie2 2015-04-10 10:03:04 1url33 3 2015-04-10 10:10:00 2015-04-10 10:50:01

  • cookie2 2015-04-10 10:10:00 url44 4 2015-04-10 10:50:01 2015-04-10 10:50:05

  • cookie2 2015-04-10 10:50:01 url55 5 2015-04-10 10:50:05 2015-04-10 11:00:00

  • cookie2 2015-04-10 10:50:05 url66 6 2015-04-10 11:00:00 NULL

  • cookie2 2015-04-10 11:00:00 url77 7 1970-01-01 00:00:00 NULL

  • -- The logic is the same as for LAG, except that LAG looks backward and LEAD looks forward.
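
Both functions amount to an index shift within the ordered partition. The Python sketch below emulates them for cookie1, using only the time-of-day part of createtime for brevity; `lag` and `lead` are illustrative helpers, and `None` plays the role of NULL.

```python
def lag(values, n=1, default=None):
    """Value n rows before each row, or default when that row does not exist."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]

def lead(values, n=1, default=None):
    """Value n rows after each row, or default when that row does not exist."""
    return [values[i + n] if i + n < len(values) else default for i in range(len(values))]

# createtime of cookie1, sorted ascending (time of day only)
times = ["10:00:00", "10:00:02", "10:03:04", "10:10:00", "10:50:01", "10:50:05", "11:00:00"]

last_1_time = lag(times, 1, "1970-01-01 00:00:00")
last_2_time = lag(times, 2)
next_1_time = lead(times, 1, "1970-01-01 00:00:00")
next_2_time = lead(times, 2)
print(last_1_time, last_2_time, next_1_time, next_2_time)
```

A common use of this pattern is to subtract the lagged timestamp from the current one to get the time between consecutive page views.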


    FIRST_VALUE

    Returns the first value in the partition after sorting, up to the current row.

  • SELECT cookieid,

  • createtime,

  • url,

  • ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  • FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1

  • FROM lxw1234;

  • cookieid createtime url rn first1

  • ---------------------------------------------------------

  • cookie1 2015-04-10 10:00:00 url1 1 url1

  • cookie1 2015-04-10 10:00:02 url2 2 url1

  • cookie1 2015-04-10 10:03:04 1url3 3 url1

  • cookie1 2015-04-10 10:10:00 url4 4 url1

  • cookie1 2015-04-10 10:50:01 url5 5 url1

  • cookie1 2015-04-10 10:50:05 url6 6 url1

  • cookie1 2015-04-10 11:00:00 url7 7 url1

  • cookie2 2015-04-10 10:00:00 url11 1 url11

  • cookie2 2015-04-10 10:00:02 url22 2 url11

  • cookie2 2015-04-10 10:03:04 1url33 3 url11

  • cookie2 2015-04-10 10:10:00 url44 4 url11

  • cookie2 2015-04-10 10:50:01 url55 5 url11

  • cookie2 2015-04-10 10:50:05 url66 6 url11

  • cookie2 2015-04-10 11:00:00 url77 7 url11

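
    LAST_VALUE

    Returns the last value in the partition after sorting, up to the current row. With the default frame, which ends at the current row, this is the current row's own value, as the result below shows.
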
  • SELECT cookieid,

  • createtime,

  • url,

  • ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  • LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1

  • FROM lxw1234;

  • cookieid createtime url rn last1

  • -----------------------------------------------------------------

  • cookie1 2015-04-10 10:00:00 url1 1 url1

  • cookie1 2015-04-10 10:00:02 url2 2 url2

  • cookie1 2015-04-10 10:03:04 1url3 3 1url3

  • cookie1 2015-04-10 10:10:00 url4 4 url4

  • cookie1 2015-04-10 10:50:01 url5 5 url5

  • cookie1 2015-04-10 10:50:05 url6 6 url6

  • cookie1 2015-04-10 11:00:00 url7 7 url7

  • cookie2 2015-04-10 10:00:00 url11 1 url11

  • cookie2 2015-04-10 10:00:02 url22 2 url22

  • cookie2 2015-04-10 10:03:04 1url33 3 1url33

  • cookie2 2015-04-10 10:10:00 url44 4 url44

  • cookie2 2015-04-10 10:50:01 url55 5 url55

  • cookie2 2015-04-10 10:50:05 url66 6 url66

  • cookie2 2015-04-10 11:00:00 url77 7 url77

  • If ORDER BY is omitted, the row order within the partition is not guaranteed (in the run below it followed the order of the records in the file), so the results can be wrong:

  • SELECT cookieid,

  • createtime,

  • url,

  • FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2

  • FROM lxw1234;

  • cookieid createtime url first2

  • ----------------------------------------------

  • cookie1 2015-04-10 10:00:02 url2 url2

  • cookie1 2015-04-10 10:00:00 url1 url2

  • cookie1 2015-04-10 10:03:04 1url3 url2

  • cookie1 2015-04-10 10:50:05 url6 url2

  • cookie1 2015-04-10 11:00:00 url7 url2

  • cookie1 2015-04-10 10:10:00 url4 url2

  • cookie1 2015-04-10 10:50:01 url5 url2

  • cookie2 2015-04-10 10:00:02 url22 url22

  • cookie2 2015-04-10 10:00:00 url11 url22

  • cookie2 2015-04-10 10:03:04 1url33 url22

  • cookie2 2015-04-10 10:50:05 url66 url22

  • cookie2 2015-04-10 11:00:00 url77 url22

  • cookie2 2015-04-10 10:10:00 url44 url22

  • cookie2 2015-04-10 10:50:01 url55 url22

  • SELECT cookieid,

  • createtime,

  • url,

  • LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2

  • FROM lxw1234;

  • cookieid createtime url last2

  • ----------------------------------------------

  • cookie1 2015-04-10 10:00:02 url2 url5

  • cookie1 2015-04-10 10:00:00 url1 url5

  • cookie1 2015-04-10 10:03:04 1url3 url5

  • cookie1 2015-04-10 10:50:05 url6 url5

  • cookie1 2015-04-10 11:00:00 url7 url5

  • cookie1 2015-04-10 10:10:00 url4 url5

  • cookie1 2015-04-10 10:50:01 url5 url5

  • cookie2 2015-04-10 10:00:02 url22 url55

  • cookie2 2015-04-10 10:00:00 url11 url55

  • cookie2 2015-04-10 10:03:04 1url33 url55

  • cookie2 2015-04-10 10:50:05 url66 url55

  • cookie2 2015-04-10 11:00:00 url77 url55

  • cookie2 2015-04-10 10:10:00 url44 url55

  • cookie2 2015-04-10 10:50:01 url55 url55

  • To get the last value of the partition after sorting, a workaround is needed:

  • SELECT cookieid,

  • createtime,

  • url,

  • ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  • LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,

  • FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2

  • FROM lxw1234

  • ORDER BY cookieid,createtime;

  • cookieid createtime url rn last1 last2

  • -------------------------------------------------------------

  • cookie1 2015-04-10 10:00:00 url1 1 url1 url7

  • cookie1 2015-04-10 10:00:02 url2 2 url2 url7

  • cookie1 2015-04-10 10:03:04 1url3 3 1url3 url7

  • cookie1 2015-04-10 10:10:00 url4 4 url4 url7

  • cookie1 2015-04-10 10:50:01 url5 5 url5 url7

  • cookie1 2015-04-10 10:50:05 url6 6 url6 url7

  • cookie1 2015-04-10 11:00:00 url7 7 url7 url7

  • cookie2 2015-04-10 10:00:00 url11 1 url11 url77

  • cookie2 2015-04-10 10:00:02 url22 2 url22 url77

  • cookie2 2015-04-10 10:03:04 1url33 3 1url33 url77

  • cookie2 2015-04-10 10:10:00 url44 4 url44 url77

  • cookie2 2015-04-10 10:50:01 url55 5 url55 url77

  • cookie2 2015-04-10 10:50:05 url66 6 url66 url77

  • cookie2 2015-04-10 11:00:00 url77 7 url77 url77

  • Tip: when using analytic functions, pay close attention to the ORDER BY clause. If it is used incorrectly, the results will not be what you expect.
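
The LAST_VALUE results above follow from the frame rather than from the function itself. The Python sketch below emulates FIRST_VALUE and LAST_VALUE for cookie1 under two frames; it is an illustration, not Hive code. In HiveQL, widening the frame with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING should be an alternative to the FIRST_VALUE ... ORDER BY createtime DESC workaround, although I have not run that variant against the sample table, so treat it as an assumption.

```python
# url values of cookie1, sorted by createtime (from the sample data)
urls = ["url1", "url2", "1url3", "url4", "url5", "url6", "url7"]

# Default frame with ORDER BY: start of partition .. current row.
first_default = [urls[: i + 1][0] for i in range(len(urls))]
last_default = [urls[: i + 1][-1] for i in range(len(urls))]

# Frame widened to the whole partition:
# UNBOUNDED PRECEDING .. UNBOUNDED FOLLOWING.
last_full = [urls[-1] for _ in urls]

print(first_default)
print(last_default)
print(last_full)
```

With the default frame the last value is always the current row, which is why last1 simply repeats the url column; only the widened frame yields url7 for every row.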

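    Hive Analytic Window Functions (5): GROUPING SETS, GROUPING__ID, CUBE, ROLLUP

    These analytic functions are typically used in OLAP scenarios, for metrics that cannot simply be summed and must be computed separately at each level when drilling up or down across dimensions, such as UV counts by hour, day and month.

    Data preparation:
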
  • CREATE EXTERNAL TABLE lxw1234 (

  • month STRING,

  • day STRING,

  • cookieid STRING

  • ) ROW FORMAT DELIMITED

  • FIELDS TERMINATED BY ','

  • stored as textfile location '/tmp/lxw11/';

  • hive> select * from lxw1234;

  • OK

  • 2015-03 2015-03-10 cookie1

  • 2015-03 2015-03-10 cookie5

  • 2015-03 2015-03-12 cookie7

  • 2015-04 2015-04-12 cookie3

  • 2015-04 2015-04-13 cookie2

  • 2015-04 2015-04-13 cookie4

  • 2015-04 2015-04-16 cookie4

  • 2015-03 2015-03-10 cookie2

  • 2015-03 2015-03-10 cookie3

  • 2015-04 2015-04-12 cookie5

  • 2015-04 2015-04-13 cookie6

  • 2015-04 2015-04-15 cookie3

  • 2015-04 2015-04-15 cookie2

  • 2015-04 2015-04-16 cookie1


    GROUPING SETS

    Aggregates by several different combinations of dimensions within a single GROUP BY query; this is equivalent to a UNION ALL of the GROUP BY results for each combination.

  • SELECT

  • month,

  • day,

  • COUNT(DISTINCT cookieid) AS uv,

  • GROUPING__ID

  • FROM lxw1234

  • GROUP BY month,day

  • GROUPING SETS (month,day)

  • ORDER BY GROUPING__ID;

  • month day uv GROUPING__ID

  • ------------------------------------------------

  • 2015-03 NULL 5 1

  • 2015-04 NULL 6 1

  • NULL 2015-03-10 4 2

  • NULL 2015-03-12 1 2

  • NULL 2015-04-12 2 2

  • NULL 2015-04-13 3 2

  • NULL 2015-04-15 2 2

  • NULL 2015-04-16 2 2

  • Equivalent to:

  • SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month

  • UNION ALL

  • SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day


    Another example:

  • SELECT

  • month,

  • day,

  • COUNT(DISTINCT cookieid) AS uv,

  • GROUPING__ID

  • FROM lxw1234

  • GROUP BY month,day

  • GROUPING SETS (month,day,(month,day))

  • ORDER BY GROUPING__ID;

  • month day uv GROUPING__ID

  • ------------------------------------------------

  • 2015-03 NULL 5 1

  • 2015-04 NULL 6 1

  • NULL 2015-03-10 4 2

  • NULL 2015-03-12 1 2

  • NULL 2015-04-12 2 2

  • NULL 2015-04-13 3 2

  • NULL 2015-04-15 2 2

  • NULL 2015-04-16 2 2

  • 2015-03 2015-03-10 4 3

  • 2015-03 2015-03-12 1 3

  • 2015-04 2015-04-12 2 3

  • 2015-04 2015-04-13 3 3

  • 2015-04 2015-04-15 2 3

  • 2015-04 2015-04-16 2 3

  • Equivalent to:

  • SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month

  • UNION ALL

  • SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day

  • UNION ALL

  • SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day

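
    GROUPING__ID indicates which grouping set a result row belongs to. The values shown here come from the Hive version used when the article was written; the computation of GROUPING__ID was changed in Hive 2.3.0, so newer versions may return different numbers for the same query.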
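
The "UNION ALL of several GROUP BY queries" reading can be emulated directly. The Python sketch below computes COUNT(DISTINCT cookieid) per grouping set over the sample rows; `grouping_sets_uv` is a made-up helper, and columns outside the current grouping set are reported as `None`, standing in for NULL.

```python
rows = [
    ("2015-03", "2015-03-10", "cookie1"), ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"), ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"), ("2015-04", "2015-04-13", "cookie4"),
    ("2015-04", "2015-04-16", "cookie4"), ("2015-03", "2015-03-10", "cookie2"),
    ("2015-03", "2015-03-10", "cookie3"), ("2015-04", "2015-04-12", "cookie5"),
    ("2015-04", "2015-04-13", "cookie6"), ("2015-04", "2015-04-15", "cookie3"),
    ("2015-04", "2015-04-15", "cookie2"), ("2015-04", "2015-04-16", "cookie1"),
]

def grouping_sets_uv(rows, grouping_sets):
    """Distinct cookie count per (month, day) key for each grouping set."""
    result = {}
    for gset in grouping_sets:
        groups = {}
        for month, day, cookie in rows:
            # Columns not in the current grouping set are replaced by None.
            key = (month if "month" in gset else None,
                   day if "day" in gset else None)
            groups.setdefault(key, set()).add(cookie)
        for key, cookies in groups.items():
            result[key] = len(cookies)
    return result

# GROUPING SETS (month, day)
uv = grouping_sets_uv(rows, [("month",), ("day",)])
for key in sorted(uv, key=str):
    print(key, uv[key])
```

UV is a distinct count, so the monthly figure cannot be obtained by adding up the daily figures; each grouping set has to be computed from the detail rows, which is what the sketch does.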

    CUBE

    Aggregates by every combination of the GROUP BY dimensions.

  • SELECT

  • month,

  • day,

  • COUNT(DISTINCT cookieid) AS uv,

  • GROUPING__ID

  • FROM lxw1234

  • GROUP BY month,day

  • WITH CUBE

  • ORDER BY GROUPING__ID;

  • month day uv GROUPING__ID

  • --------------------------------------------

  • NULL NULL 7 0

  • 2015-03 NULL 5 1

  • 2015-04 NULL 6 1

  • NULL 2015-04-12 2 2

  • NULL 2015-04-13 3 2

  • NULL 2015-04-15 2 2

  • NULL 2015-04-16 2 2

  • NULL 2015-03-10 4 2

  • NULL 2015-03-12 1 2

  • 2015-03 2015-03-10 4 3

  • 2015-03 2015-03-12 1 3

  • 2015-04 2015-04-16 2 3

  • 2015-04 2015-04-12 2 3

  • 2015-04 2015-04-13 3 3

  • 2015-04 2015-04-15 2 3

  • Equivalent to:

  • SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM lxw1234

  • UNION ALL

  • SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month

  • UNION ALL

  • SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day

  • UNION ALL

  • SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day


    ROLLUP

    A subset of CUBE: aggregation proceeds hierarchically, starting from the leftmost dimension.
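
One way to see the difference is to list the grouping sets each clause expands to. The Python sketch below does only that; `cube_sets` and `rollup_sets` are illustrative names, not Hive functions.

```python
from itertools import combinations

def cube_sets(dims):
    """All subsets of dims, from the empty set (grand total) to the full list."""
    return [c for r in range(len(dims) + 1) for c in combinations(dims, r)]

def rollup_sets(dims):
    """Only the prefixes of dims: (), (d1,), (d1, d2), ..."""
    return [tuple(dims[:i]) for i in range(len(dims) + 1)]

print(cube_sets(["month", "day"]))
print(rollup_sets(["month", "day"]))
print(rollup_sets(["day", "month"]))
```

ROLLUP over (month, day) never produces the day-only grouping set, and the order of the dimensions in GROUP BY decides which hierarchy is rolled up.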


  • For example, a hierarchical aggregation along the month dimension:

  • SELECT

  • month,

  • day,

  • COUNT(DISTINCT cookieid) AS uv,

  • GROUPING__ID

  • FROM lxw1234

  • GROUP BY month,day

  • WITH ROLLUP

  • ORDER BY GROUPING__ID;

  • month day uv GROUPING__ID

  • ---------------------------------------------------

  • NULL NULL 7 0

  • 2015-03 NULL 5 1

  • 2015-04 NULL 6 1

  • 2015-03 2015-03-10 4 3

  • 2015-03 2015-03-12 1 3

  • 2015-04 2015-04-12 2 3

  • 2015-04 2015-04-13 3 3

  • 2015-04 2015-04-15 2 3

  • 2015-04 2015-04-16 2 3

  • This supports the following drill-up path:

  • UV per month and day -> UV per month -> total UV

  • -- Swapping the order of month and day rolls up along the day dimension instead:

  • SELECT

  • day,

  • month,

  • COUNT(DISTINCT cookieid) AS uv,

  • GROUPING__ID

  • FROM lxw1234

  • GROUP BY day,month

  • WITH ROLLUP

  • ORDER BY GROUPING__ID;

  • day month uv GROUPING__ID

  • -------------------------------------------------------

  • NULL NULL 7 0

  • 2015-04-13 NULL 3 1

  • 2015-03-12 NULL 1 1

  • 2015-04-15 NULL 2 1

  • 2015-03-10 NULL 4 1

  • 2015-04-16 NULL 2 1

  • 2015-04-12 NULL 2 1

  • 2015-04-12 2015-04 2 3

  • 2015-03-10 2015-03 4 3

  • 2015-03-12 2015-03 1 3

  • 2015-04-13 2015-04 3 3

  • 2015-04-15 2015-04 2 3

  • 2015-04-16 2015-04 2 3

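  • This supports the following drill-up path:

  • UV per day and month -> UV per day -> total UV

  • (Here, aggregating by day and month gives the same result as aggregating by day alone, because each day belongs to exactly one month; with other combinations of dimensions the results would differ.)

    Functions like these are best understood by applying them to real scenarios and data; they are hard to grasp from a description alone.

    Official documentation: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup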
