Hive SQL面试题(附答案)
點擊上方 "大數據肌肉猿"關注,?星標一起成長
點擊下方鏈接,進入高質量學習交流群
今日更新| 950個轉型案例分享-大數據交流群
本文目錄:
一、行列轉換
二、排名中取他值
三、累計求值
四、窗口大小控制
五、產生連續數值
六、數據擴充與收縮
七、合并與拆分
八、模擬循環操作
九、不使用distinct或group by去重
十、容器--反轉內容
十一、多容器--成對提取數據
十二、多容器--轉多行
十三、抽象分組--斷點排序
十四、業務邏輯的分類與抽象--時效
十五、時間序列--進度及剩余
十六、時間序列--構造日期
十七、時間序列--構造累積日期
十八、時間序列--構造連續日期
十九、時間序列--取多個字段最新的值
二十、時間序列--補全數據
二十一、時間序列--取最新完成狀態的前一個狀態
二十二、非等值連接--范圍匹配
二十三、非等值連接--最近匹配
二十四、N指標--累計去重
一、行列轉換
描述:表中記錄了各年份各部門的平均績效考核成績。
表名:t1
表結構:
表內容:
a???b??c 2014??B??9 2015??A??8 2014??A??10 2015??B??7問題一:多行轉多列
問題描述:將上述表內容轉為如下輸出結果所示:
a??col_A?col_B 2014??10???9 2015??8????7參考答案:
select?a,max(case?when?b="A"?then?c?end)?col_A,max(case?when?b="B"?then?c?end)?col_B from?t1 group?by?a;問題二:如何將結果轉成源表?(多列轉多行)
問題描述:將問題一的結果轉成源表,問題一結果表名為t1_2。
參考答案:
select?a,b,c from?(select?a,"A"?as?b,col_a?as?c?from?t1_2?union?all?select?a,"B"?as?b,col_b?as?c?from?t1_2?? )tmp;問題三:同一部門會有多個績效,求多行轉多列結果
問題描述:2014年公司組織架構調整,導致部門出現多個績效,業務及人員不同,無法合并算績效,源表內容如下:
2014??B??9 2015??A??8 2014??A??10 2015??B??7 2014??B??6輸出結果如下所示:
a????col_A??col_B 2014???10????6,9 2015???8?????7參考答案:
select?a,max(case?when?b="A"?then?c?end)?col_A,max(case?when?b="B"?then?c?end)?col_B from?(select?a,b,concat_ws(",",collect_set(cast(c?as?string)))?as?cfrom?t1group?by?a,b )tmp group?by?a;二、排名中取他值
表名:t2
表字段及內容:
問題一:按a分組取b字段最小時對應的c字段
輸出結果如下所示:
a???min_c 2014??3 2015??4參考答案:
selecta,c?as?min_c from (selecta,b,c,row_number()?over(partition?by?a?order?by?b)?as?rn?from?t2? )a where?rn?=?1;問題二:按a分組取b字段排第二時對應的c字段
輸出結果如下所示:
a??second_c 2014??1 2015??3參考答案:
selecta,c?as?second_c from (selecta,b,c,row_number()?over(partition?by?a?order?by?b)?as?rn?from?t2? )a where?rn?=?2;問題三:按a分組取b字段最小和最大時對應的c字段
輸出結果如下所示:
a????min_c??max_c 2014??3??????2 2015??4??????3參考答案:
selecta,min(if(asc_rn?=?1,?c,?null))?as?min_c,max(if(desc_rn?=?1,?c,?null))?as?max_c from (selecta,b,c,row_number()?over(partition?by?a?order?by?b)?as?asc_rn,row_number()?over(partition?by?a?order?by?b?desc)?as?desc_rn?from?t2? )a where?asc_rn?=?1?or?desc_rn?=?1 group?by?a;問題四:按a分組取b字段第二小和第二大時對應的c字段
輸出結果如下所示:
a????min_c??max_c 2014??1??????1 2015??3??????4參考答案:
selectret.a,max(case?when?ret.rn_min?=?2?then?ret.c?else?null?end)?as?min_c,max(case?when?ret.rn_max?=?2?then?ret.c?else?null?end)?as?max_c from?(select*,row_number()?over(partition?by?t2.a?order?by?t2.b)?as?rn_min,row_number()?over(partition?by?t2.a?order?by?t2.b?desc)?as?rn_maxfrom?t2 )?as?ret where?ret.rn_min?=?2 or?ret.rn_max?=?2 group?by?ret.a;問題五:按a分組取b字段前兩小和前兩大時對應的c字段
注意:需保持b字段最小、最大排首位
輸出結果如下所示:
a????min_c??max_c 2014??3,1?????2,1 2015??4,3?????3,4參考答案:
selecttmp1.a?as?a,min_c,max_c from? (select?a,concat_ws(',',?collect_list(c))?as?min_cfrom(selecta,b,c,row_number()?over(partition?by?a?order?by?b)?as?asc_rnfrom?t2)awhere?asc_rn?<=?2?group?by?a? )tmp1? join? (select?a,concat_ws(',',?collect_list(c))?as?max_cfrom(selecta,b,c,row_number()?over(partition?by?a?order?by?b?desc)?as?desc_rn?from?t2)awhere?desc_rn?<=?2group?by?a? )tmp2? on?tmp1.a?=?tmp2.a;三、累計求值
表名:t3
表字段及內容:
問題一:按a分組按b字段排序,對c累計求和
輸出結果如下所示:
a????b???sum_c 2014??A???3 2014??B???4 2014??C???6 2015??A???4 2015??D???7參考答案:
select?a,?b,?c,?sum(c)?over(partition?by?a?order?by?b)?as?sum_c from?t3;問題二:按a分組按b字段排序,對c取累計平均值
輸出結果如下所示:
a????b???avg_c 2014??A???3 2014??B???2 2014??C???2 2015??A???4 2015??D???3.5參考答案:
select?a,?b,?c,?avg(c)?over(partition?by?a?order?by?b)?as?avg_c from?t3;問題三:按a分組按b字段排序,對b取累計排名比例
輸出結果如下所示:
a????b???ratio_c 2014??A???0.33 2014??B???0.67 2014??C???1.00 2015??A???0.50 2015??D???1.00參考答案:
select?a,?b,?c,?round(row_number()?over(partition?by?a?order?by?b)?/?(count(c)?over(partition?by?a)),2)?as?ratio_c from?t3? order?by?a,b;問題四:按a分組按b字段排序,對b取累計求和比例
輸出結果如下所示:
a????b???ratio_c 2014??A???0.50 2014??B???0.67 2014??C???1.00 2015??A???0.57 2015??D???1.00參考答案:
select?a,?b,?c,?round(sum(c)?over(partition?by?a?order?by?b)?/?(sum(c)?over(partition?by?a)),2)?as?ratio_c from?t3? order?by?a,b;四、窗口大小控制
表名:t4
表字段及內容:
問題一:按a分組按b字段排序,對c取前后各一行的和
輸出結果如下所示:
a????b???sum_c 2014??A???1 2014??B???5 2014??C???1 2015??A???3 2015??D???4參考答案:
select?a,b,lag(c,1,0)?over(partition?by?a?order?by?b)+lead(c,1,0)?over(partition?by?a?order?by?b)?as?sum_c from?t4;問題二:按a分組按b字段排序,對c取平均值
問題描述:前一行與當前行的均值!
輸出結果如下所示:
a????b???avg_c 2014??A???3 2014??B???2 2014??C???1.5 2015??A???4 2015??D???3.5參考答案:
selecta,b,case?when?lag_c?is?null?then?celse?(c+lag_c)/2?end?as?avg_c from(selecta,b,c,lag(c,1)?over(partition?by?a?order?by?b)?as?lag_cfrom?t4)temp;五、產生連續數值
輸出結果如下所示:
1 2 3 4 5 ... 100參考答案:
不借助其他任何外表,實現產生連續數值
此處給出兩種解法,其一:
其二:
selectrow_number()?over()?as?id from??(select?split(space(99),?'?')?as?x)?t lateral?view explode(x)?ex;那如何產生1至1000000連續數值?
參考答案:
selectrow_number()?over()?as?id from??(select?split(space(999999),?'?')?as?x)?t lateral?view explode(x)?ex;六、數據擴充與收縮
表名:t6
表字段及內容:
問題一:數據擴充
輸出結果如下所示:
a?????b 3???3、2、1 2???2、1 4???4、3、2、1參考答案:
select??t.a,concat_ws('、',collect_set(cast(t.rn?as?string)))?as?b from (??select??t6.a,b.rnfrom?t6left?join(?selectrow_number()?over()?as?rnfrom??(select?split(space(5),?'?')?as?x)?t?--?space(5)可根據t6表的最大值靈活調整lateral?viewexplode(x)?pe)?bon?1?=?1where?t6.a?>=?b.rnorder?by?t6.a,?b.rn?desc? )?t group?by??t.a;問題二:數據擴充,排除偶數
輸出結果如下所示:
a?????b 3???3、1 2???1 4???3、1參考答案:
select??t.a,concat_ws('、',collect_set(cast(t.rn?as?string)))?as?b from (??select??t6.a,b.rnfrom?t6left?join(?selectrow_number()?over()?as?rnfrom??(select?split(space(5),?'?')?as?x)?tlateral?viewexplode(x)?pe)?bon?1?=?1where?t6.a?>=?b.rn?and?b.rn?%?2?=?1order?by?t6.a,?b.rn?desc? )?t group?by??t.a;問題三:如何處理字符串累計拼接
問題描述:將小于等于a字段的值聚合拼接起來
輸出結果如下所示:
a?????b 3?????2、3 2?????2 4?????2、3、4參考答案:
select??t.a,concat_ws('、',collect_set(cast(t.a1?as?string)))?as?b from (???select??t6.a,b.a1from?t6left?join(???select??a?as?a1?from?t6)?bon?1?=?1where?t6.a?>=?b.a1order?by?t6.a,?b.a1? )?t group?by??t.a;問題四:如果a字段有重復,如何實現字符串累計拼接
輸出結果如下所示:
a?????b 2?????2 3?????2、3 3?????2、3、3 4?????2、3、3、4參考答案:
select?a,b from? (select??t.a,t.rn,concat_ws('、',collect_list(cast(t.a1?as?string)))?as?bfrom(???select??a.a,a.rn,b.a1from(select??a,row_number()?over(order?by?a?)?as?rn?from?t6)?aleft?join(???select??a?as?a1,row_number()?over(order?by?a?)?as?rn??from?t6)?bon?1?=?1where?a.a?>=?b.a1?and?a.rn?>=?b.rn?order?by?a.a,?b.a1?)?tgroup?by??t.a,t.rnorder?by?t.a,t.rn )?tt;問題五:數據展開
問題描述:如何將字符串"1-5,16,11-13,9"擴展成"1,2,3,4,5,16,11,12,13,9"?注意順序不變。
參考答案:
select??concat_ws(',',collect_list(cast(rn?as?string))) from (select??a.rn,b.num,b.posfrom(selectrow_number()?over()?as?rnfrom?(select?split(space(20),?'?')?as?x)?t?--?space(20)可靈活調整lateral?viewexplode(x)?pe)?a?lateral?view?outer?posexplode(split('1-5,16,11-13,9',?','))?b?as?pos,?numwhere?a.rn?between?cast(split(num,?'-')[0]?as?int)?and?cast(split(num,?'-')[1]?as?int)?or?a.rn?=?numorder?by?pos,?rn? )?t;七、合并與拆分
表名:t7
表字段及內容:
問題一:合并
輸出結果如下所示:
2014??A、B 2015??B、D參考答案:
selecta,concat_ws('、',?collect_set(t.b))?b from?t7 group?by?a;問題二:拆分
問題描述:將分組合并的結果拆分出來
參考答案:
selectt.a,d from (selecta,concat_ws('、',?collect_set(t7.b))?bfrom?t7group?by?a )t lateral?view? explode(split(t.b,?'、'))?table_tmp?as?d;八、模擬循環操作
表名:t8
表字段及內容:
問題一:如何將字符'1'的位置提取出來
輸出結果如下所示:
1,3,4 2,4參考答案:
select?a,concat_ws(",",collect_list(cast(index?as?string)))?as?res from?(select?a,index+1?as?index,chrfrom?(select?a,concat_ws(",",substr(a,1,1),substr(a,2,1),substr(a,3,1),substr(a,-1))?strfrom?t8)?tmp1lateral?view?posexplode(split(str,","))?t?as?index,chrwhere?chr?=?"1" )?tmp2 group?by?a;九、不使用distinct或group by去重
表名:t9
表字段及內容:
問題一:不使用distinct或group by去重
輸出結果如下所示:
2014??A 2016??A 2014??B 2015??B參考答案:
selectt2.year,t2.num from(select*,row_number()?over?(partition?by?t1.year,t1.num)?as?rank_1from?(select?a?as?year,d?as?numfrom?t9union?allselect?b?as?year,d?as?numfrom?t9union?allselect?c?as?year,d?as?numfrom?t9)t1 )t2 where?rank_1=1 order?by?num;十、容器--反轉內容
表名:t10
表字段及內容:
問題一:反轉逗號分隔的數據:改變順序,內容不變
輸出結果如下所示:
BAD,CA,AB EA,BD參考答案:
select?a,concat_ws(",",collect_list(reverse(str))) from? (select?a,strfrom?t10lateral?view?explode(split(reverse(a),","))?t?as?str )?tmp1 group?by?a;問題二:反轉逗號分隔的數據:改變內容,順序不變
輸出結果如下所示:
BA,AC,DAB DB,AE參考答案:
select?a,concat_ws(",",collect_list(reverse(str))) from? (select?a,strfrom?t10lateral?view?explode(split(a,","))?t?as?str )?tmp1 group?by?a;十一、多容器--成對提取數據
表名:t11
表字段及內容:
問題一:成對提取數據,字段一一對應
輸出結果如下所示:
a???????b A???????1 B???????3 B???????4 C???????5 D???????2參考答案:
select?a_inx,b_inx from? (select?a,b,a_id,a_inx,b_id,b_inxfrom?t11lateral?view?posexplode(split(a,'/'))?t?as?a_id,a_inxlateral?view?posexplode(split(b,'/'))?t?as?b_id,b_inx )?tmp where?a_id=b_id;十二、多容器--轉多行
表名:t12
表字段及內容:
問題一:轉多行
輸出結果如下所示:
a????????d???????e 001?????type_b????A 001?????type_b????B 001?????type_c????1 001?????type_c????3 001?????type_c????5 002?????type_b????B 002?????type_b????C 002?????type_b????D 002?????type_c????4 002?????type_c????5參考答案:
select?a,d,e from? (selecta,"type_b"?as?d,str?as?efrom?t12lateral?view?explode(split(b,"/"))?t?as?strunion?all?selecta,"type_c"?as?d,str?as?efrom?t12lateral?view?explode(split(c,"/"))?t?as?str )?tmp order?by?a,d;十三、抽象分組--斷點排序
表名:t13
表字段及內容:
問題一:斷點排序
輸出結果如下所示:
a????b????c? 2014??1????1 2015??1????2 2016??1????3 2017??0????1 2018??0????2 2019??-1???1 2020??-1???2 2021??-1???3 2022??1????1 2023??1????2參考答案:
select??a,b,row_number()?over(?partition?by?b,repair_a?order?by?a?asc)?as?c--按照b列和[b的組首]分組,排序 from? (select??a,b,a-b_rn?as?repair_a--根據b列值出現的次序,修復a列值為b首次出現的a列值,稱為b的[組首]from?(select?a,b,row_number()?over(?partition?by?b?order?by??a??asc?)?as?b_rn--按b列分組,按a列排序,得到b列各值出現的次序from?t13?)tmp1 )tmp2--注意,如果不同的b列值,可能出現同樣的組首值,但組首值需要和a列值?一并參與分組,故并不影響排序。 order?by?a?asc;十四、業務邏輯的分類與抽象--時效
日期表:d_date
表字段及內容:
工作日:周一至周五09:30-18:30
客戶申請表:t14
表字段及內容:
問題一:計算上表中從申請到通過占用的工作時長
輸出結果如下所示:
a?????????d 1????????0.67h 2???????10.67h參考答案:
select?a,round(sum(diff)/3600,2)?as?d from?(select?a,apply_time,pass_time,dates,rn,ct,is_work,case?when?is_work=1?and?rn=1?then?unix_timestamp(concat(dates,'?18:30:00'),'yyyy-MM-dd?HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd?HH:mm:ss')when?is_work=0?then?0when?is_work=1?and?rn=ct?then?unix_timestamp(pass_time,'yyyy-MM-dd?HH:mm:ss')-unix_timestamp(concat(dates,'?09:30:00'),'yyyy-MM-dd?HH:mm:ss')when?is_work=1?and?rn!=ct?then?9*3600end?difffrom?(select?a,apply_time,pass_time,time_diff,day_diff,rn,ct,date_add(start,rn-1)?datesfrom?(select?a,apply_time,pass_time,time_diff,day_diff,strs,start,row_number()?over(partition?by?a)?as?rn,count(*)?over(partition?by?a)?as?ctfrom?(select?a,apply_time,pass_time,time_diff,day_diff,substr(repeat(concat(substr(apply_time,1,10),','),day_diff+1),1,11*(day_diff+1)-1)?strsfrom?(select?a,apply_time,pass_time,unix_timestamp(pass_time,'yyyy-MM-dd?HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd?HH:mm:ss')?time_diff,datediff(substr(pass_time,1,10),substr(apply_time,1,10))?day_difffrom?(select?a,max(case?when?b='申請'?then?c?end)?apply_time,max(case?when?b='通過'?then?c?end)?pass_timefrom?t14group?by?a)?tmp1)?tmp2)?tmp3?lateral?view?explode(split(strs,","))?t?as?start)?tmp4)?tmp5join?d_date?on?tmp5.dates?=?d_date.date_id )?tmp6 group?by?a;十五、時間序列--進度及剩余
表名:t15
表字段及內容:
問題一:求每天的累計周工作日,剩余周工作日
輸出結果如下所示:
date_id??????week_to_work??week_left_work 2017-07-31??????1?????????????4 2017-08-01??????2?????????????3 2017-08-02??????3?????????????2 2017-08-03??????4?????????????1 2017-08-04??????5?????????????0 2017-08-05??????5?????????????0 2017-08-06??????5?????????????0參考答案:
此處給出兩種解法,其一:
其二:
select date_id, week_to_work, week_sum_work-week_to_work?as?week_left_work from(selectdate_id,sum(is_work)?over(partition?by?year,week?order?by?date_id)?as?week_to_work,sum(is_work)?over(partition?by?year,week)?as?week_sum_workfrom(selectdate_id,is_work,year(date_id)?as?year,weekofyear(date_id)?as?weekfrom?t15)?ta )?tb?order?by?date_id;十六、時間序列--構造日期
問題一:直接使用SQL實現一張日期維度表,包含以下字段:
date?????????????????string???????????????日期 d_week???????????????string???????????????年內第幾周 weeks????????????????int??????????????????周幾 w_start??????????????string???????????????周開始日 w_end????????????????string???????????????周結束日 d_month?????????????int??????????????????第幾月 m_start?????????????string???????????????月開始日 m_end???????????????string???????????????月結束日 d_quarter????????????int????????????????????第幾季 q_start?????????????string???????????????季開始日 q_end???????????????string???????????????季結束日 d_year???????????????int????????????????????年份 y_start?????????????string???????????????年開始日 y_end???????????????string???????????????年結束日參考答案:
drop?table?if?exists?dim_date; create?table?if?not?exists?dim_date(`date`?string?comment?'日期',d_week?string?comment?'年內第幾周',weeks?string?comment?'周幾',w_start?string?comment?'周開始日',w_end?string?comment?'周結束日',d_month?string?comment?'第幾月',m_start?string?comment?'月開始日',m_end?string?comment?'月結束日',d_quarter?int?comment?'第幾季',q_start?string?comment?'季開始日',q_end?string?comment?'季結束日',d_year?int?comment?'年份',y_start?string?comment?'年開始日',y_end?string?comment?'年結束日' ); --自然月:?指每月的1號到那個月的月底,它是按照陽歷來計算的。就是從每月1號到月底,不管這個月有30天,31天,29天或者28天,都算是一個自然月。insert?overwrite?table?dim_date select?`date`,?d_week?--年內第幾周,?case?weekidwhen?0?then?'周日'when?1?then?'周一'when?2?then?'周二'when?3?then?'周三'when?4?then?'周四'when?5?then?'周五'when?6?then?'周六'end??as?weeks?--?周,?date_add(next_day(`date`,'MO'),-7)?as?w_start?--周一,?date_add(next_day(`date`,'MO'),-1)?as?w_end???--?周日_end--?月份日期,?concat('第',?monthid,?'月')??as?d_month,?m_start,?m_end--?季節,?quarterid?as?d_quart,?concat(d_year,?'-',?substr(concat('0',?(quarterid?-?1)?*?3?+?1),?-2),?'-01')?as?q_start?--季開始日,?date_sub(concat(d_year,?'-',?substr(concat('0',?(quarterid)?*?3?+?1),?-2),?'-01'),?1)?as?q_end???--季結束日--?年,?d_year,?y_start,?y_endfrom?(select?`date`,?pmod(datediff(`date`,?'2012-01-01'),?7)??????????????????as?weekid????--獲取周幾,?cast(substr(`date`,?6,?2)?as?int)????????????????????????as?monthid???--獲取月份,?casewhen?cast(substr(`date`,?6,?2)?as?int)?<=?3?then?1when?cast(substr(`date`,?6,?2)?as?int)?<=?6?then?2when?cast(substr(`date`,?6,?2)?as?int)?<=?9?then?3when?cast(substr(`date`,?6,?2)?as?int)?<=?12?then?4end???????????????????????????????????????????????????????as?quarterid?--獲取季節?可以直接使用?quarter(`date`),?substr(`date`,?1,?4)?????????????????????????????????????as?d_year????--?獲取年份,?trunc(`date`,?'YYYY')????????????????????????????????????as?y_start???--年開始日,?date_sub(trunc(add_months(`date`,?12),?'YYYY'),?1)?as?y_end?????--年結束日,?date_sub(`date`,?dayofmonth(`date`)?-?1)?????????????????as?m_start???--當月第一天,?last_day(date_sub(`date`,?dayofmonth(`date`)?-?1))??????????m_end?????--當月最后一天,?weekofyear(`date`)???????????????????????????????????????as?d_week????--年內第幾周from?(--?'2021-04-01'是開始日期,?'2022-03-31'是截止日期select?date_add('2021-04-01',?t0.pos)?as?`date`from?(select?posexplode(split(repeat('o',?datediff(from_unixtime(unix_timestamp('2022-03-31',?'yyyy-mm-dd'),'yyyy-mm-dd'),'2021-04-01')),?'o')))?t0)?t1)?t2;十七、時間序列--構造累積日期
表名:t17
表字段及內容:
問題一:每一日期,都擴展成月初至當天
輸出結果如下所示:
date_id????date_to_day 2017-08-01??2017-08-01 2017-08-02??2017-08-01 2017-08-02??2017-08-02 2017-08-03??2017-08-01 2017-08-03??2017-08-02 2017-08-03??2017-08-03這種累積相關的表,常做橋接表。
參考答案:
selectdate_id,date_add(date_start_id,pos)?as?date_to_day from (selectdate_id,date_sub(date_id,dayofmonth(date_id)-1)?as?date_start_idfrom?t17 )?m??lateral?view? posexplode(split(space(datediff(from_unixtime(unix_timestamp(date_id,'yyyy-MM-dd')),from_unixtime(unix_timestamp(date_start_id,'yyyy-MM-dd')))),?''))?t?as?pos,?val;十八、時間序列--構造連續日期
表名:t18
表字段及內容:
問題一:構造連續日期
問題描述:將表中數據的b字段擴充至范圍[2018-01-01, 2018-01-07],并累積對c求和。
b字段的值是較稀疏的。
輸出結果如下所示:
a?????????????b??????????c??????d 101????????2018-01-01?????10?????10 101????????2018-01-02??????0?????10 101????????2018-01-03?????20?????30 101????????2018-01-04??????0?????30 101????????2018-01-05??????0?????30 101????????2018-01-06?????40?????70 101????????2018-01-07??????0?????70 102????????2018-01-01??????0??????0 102????????2018-01-02?????20?????20 102????????2018-01-03??????0?????20 102????????2018-01-04?????30?????50 102????????2018-01-05??????0?????50 102????????2018-01-06??????0?????50 102????????2018-01-07?????60????110參考答案:
selecta,b,c,sum(c)?over(partition?by?a?order?by?b)?as?d from (selectt1.a,t1.b,casewhen?t18.b?is?not?null?then?t18.celse?0end?as?cfrom(selecta,date_add(s,pos)?as?bfrom(selecta,?'2018-01-01'?as?s,?'2018-01-07'?as?rfrom?(select?a?from?t18?group?by?a)?ta)?m??lateral?view?posexplode(split(space(datediff(from_unixtime(unix_timestamp(r,'yyyy-MM-dd')),from_unixtime(unix_timestamp(s,'yyyy-MM-dd')))),?''))?t?as?pos,?val)?t1left?join?t18on??t1.a?=?t18.a?and?t1.b?=?t18.b )?ts;十九、時間序列--取多個字段最新的值
表名:t19
表字段及內容:
問題一:如何一并取出最新日期
輸出結果如下所示:
date_a???a????date_b????b????date_c???c 2017????BC????2015?????23????2016????d參考答案:
此處給出三種解法,其一:
其二:
SELECT??a.date_id,a.a,b.date_id,b.b,c.date_id,c.c FROM (SELECT??t.date_id,t.aFROM??(SELECT??t.date_id,t.a,t.b,t.cFROM?t19?t?INNER?JOIN????t19?t1?ON?t.date_id?=?t1.date_id?AND?t.a?IS?NOT?NULL)?tORDER?BY?t.date_id?DESCLIMIT?1 )?a LEFT?JOIN? (SELECT??t.date_id,t.bFROM????(SELECT??t.date_id,t.bFROM?t19?t?INNER?JOIN?t19?t1?ON?t.date_id?=?t1.date_id?AND?t.b?IS?NOT?NULL)?tORDER?BY?t.date_id?DESCLIMIT?1 )?b?ON?1?=?1? LEFT?JOIN (SELECT??t.date_id,t.cFROM????(SELECT??t.date_id,t.cFROM?t19?t?INNER?JOIN?t19?t1?ON?t.date_id?=?t1.date_id?AND?t.c?IS?NOT?NULL)?tORDER?BY?t.date_id?DESCLIMIT???1 )?c ON?1?=?1;其三:
select?*? from?? (select?t1.date_id?as?date_a,t1.a?from?(select?t1.date_id,t1.a??from?t19?t1?where?t1.a?is?not?null)?t1inner?join?(select?max(t1.date_id)?as?date_id???from?t19?t1?where?t1.a?is?not?null)?t2on?t1.date_id=t2.date_id )?t1 cross?join (select?t1.date_b,t1.b?from?(select?t1.date_id?as?date_b,t1.b??from?t19?t1?where?t1.b?is?not?null)?t1inner?join?(select?max(t1.date_id)?as?date_id???from?t19?t1?where?t1.b?is?not?null)t2on?t1.date_b=t2.date_id )?t2 cross?join? (select?t1.date_c,t1.c?from?(select?t1.date_id?as?date_c,t1.c??from?t19?t1?where?t1.c?is?not?null)?t1inner?join?(select?max(t1.date_id)?as?date_id???from?t19?t1?where?t1.c?is?not?null)t2on?t1.date_c=t2.date_id )?t3;二十、時間序列--補全數據
表名:t20
表字段及內容:
問題一:如何使用最新數據補全表格
輸出結果如下所示:
date_id???a???b????c 2014?????AB??12????bc 2015?????AB??23????bc 2016?????AB??23????d 2017?????BC??23????d參考答案:
select?date_id,?first_value(a)?over(partition?by?aa?order?by?date_id)?as?a,first_value(b)?over(partition?by?bb?order?by?date_id)?as?b,first_value(c)?over(partition?by?cc?order?by?date_id)?as?c from (select?date_id,a,b,c,count(a)?over(order?by?date_id)?as?aa,count(b)?over(order?by?date_id)?as?bb,count(c)?over(order?by?date_id)?as?ccfrom?t20 )tmp1;二十一、時間序列--取最新完成狀態的前一個狀態
表名:t21
表字段及內容:
上表中B為完成狀態。
問題一:取最新完成狀態的前一個狀態
輸出結果如下所示:
date_id??a????b 2016?????1????A 2013?????2????A 2015?????3????A參考答案:
此處給出兩種解法,其一:
其二:
selectnext_date_id?as?date_id,a,next_b?as?b from(select*,min(nk)?over(partition?by?a,b)?as?minbfrom(select*,row_number()?over(partition?by?a?order?by?date_id?desc)?nk,lead(date_id)?over(partition?by?a?order?by?date_id?desc)?next_date_id,lead(b)?over(partition?by?a?order?by?date_id?desc)?next_bfrom(select?*?from?t21)?t)?t )?t where?minb?=?nk?and?b?=?'B';問題二:如何將完成狀態的過程合并
輸出結果如下所示:
a???b_merge 1???A、B、A、B 2???A、B 3???A、A、B參考答案:
selecta,collect_list(b)?as?b from(select*,min(if(b?=?'B',nk,null))?over(partition?by?a)?as?minbfrom(select*,row_number()?over(partition?by?a?order?by?date_id?desc)?nkfrom(select?*?from?t21)?t)?t )?t where?nk?>=?minb group?by?a;二十二、非等值連接--范圍匹配
表f是事實表,表d是匹配表,在hive中如何將匹配表中的值關聯到事實表中?
表d相當于拉鏈過的變化維,但日期范圍可能是不全的。
表f:
date_id??p_id2017????C2018????B2019????A2013????C表d:
d_start????d_end????p_id???p_value2016?????2018?????A???????12016?????2018?????B???????22008?????2009?????C???????42010?????2015?????C???????3問題一:范圍匹配
輸出結果如下所示:
date_id??p_id???p_value2017????C??????null2018????B??????22019????A??????null2013????C??????3**參考答案:
此處給出兩種解法,其一:
其二:
select?date_id,p_id,flag?as?p_value from?(select?f.date_id,f.p_id,d.d_start,d.d_end,d.p_value,if(f.date_id?between?d.d_start?and?d.d_end,d.p_value,null)?flag,max(d.d_end)?over(partition?by?date_id)?max_endfrom?fleft?join?don?f.p_id?=?d.p_id )?tmp where?d_end?=?max_end;二十三、非等值連接--最近匹配
表t23_1和表t23_2通過a和b關聯時,有相等的取相等的值匹配,不相等時每一個a的值在b中找差值最小的來匹配。
t23_1和t23_2為兩個班的成績單,t23_1班的每個學生成績在t23_2班中找出成績最接近的成績。
表t23_1:a中無重復值
a 1 2 4 5 8 10表t23_2:b中無重復值
b 2 3 7 11 13問題一:單向最近匹配
輸出結果如下所示:
注意:b的值可能會被丟棄
參考答案:
select?*? from (select?ttt1.a,ttt1.b?from(select?tt1.a,t23_2.b,dense_rank()?over(partition?by?tt1.a?order?by?abs(tt1.a-t23_2.b))?as?dr?from?(select?t23_1.a?from?t23_1?left?join?t23_2?on?t23_1.a=t23_2.b?where?t23_2.b?is?null)?tt1?cross?join?t23_2)?ttt1?where?ttt1.dr=1?union?allselect?t23_1.a,t23_2.b?from?t23_1?inner?join?t23_2?on?t23_1.a=t23_2.b )?result_t? order?by?result_t.a;二十四、N指標--累計去重
假設表A為事件流水表,客戶當天有一條記錄則視為當天活躍。
表A:
time_id??????????user_id 2018-01-01?10:00:00????001 2018-01-01?11:03:00????002 2018-01-01?13:18:00????001 2018-01-02?08:34:00????004 2018-01-02?10:08:00????002 2018-01-02?10:40:00????003 2018-01-02?14:21:00????002 2018-01-02?15:39:00????004 2018-01-03?08:34:00????005 2018-01-03?10:08:00????003 2018-01-03?10:40:00????001 2018-01-03?14:21:00????005假設客戶活躍非常,一天產生的事件記錄平均達千條。
問題一:累計去重
輸出結果如下所示:
日期???????當日活躍人數?????月累計活躍人數_截至當日 date_id???user_cnt_act????user_cnt_act_month 2018-01-01??????2????????????????2 2018-01-02??????3????????????????4 2018-01-03??????3????????????????5參考答案:
SELECT??tt1.date_id,tt2.user_cnt_act,tt1.user_cnt_act_month FROM (???--?④?按照t.date_id分組求出user_cnt_act_month,得到tt1SELECT??t.date_id,COUNT(user_id)?AS?user_cnt_act_monthFROM(???--?③?表a和表b進行笛卡爾積,按照a.date_id,b.user_id分組,保證截止到當日的用戶唯一,得出表t。SELECT??a.date_id,b.user_idFROM(???--?①?按照日期分組,取出date_id字段當主表的維度字段?得出表aSELECT??from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')?AS?date_idFROM?test.temp_tanhaidi_20211213_1GROUP?BY??from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd'))?aINNER?JOIN(???--?②?按照date_id、user_id分組,保證每天每個用戶只有一條記錄,得出表bSELECT??from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')?AS?date_id,user_idFROM?test.temp_tanhaidi_20211213_1GROUP?BY??from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd'),user_id)?bON?1?=?1WHERE?a.date_id?>=?b.date_idGROUP?BY??a.date_id,b.user_id)?tGROUP?BY??t.date_id )?tt1 LEFT?JOIN (???--?⑥?按照date_id分組求出user_cnt_act,得到tt2SELECT??date_id,COUNT(user_id)?AS?user_cnt_actFROM(???--?⑤?按照日期分組,取出date_id字段當主表的維度字段?得出表aSELECT??from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')?AS?date_id,user_idFROM?test.temp_tanhaidi_20211213_1GROUP?BY??from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd'),user_id)?aGROUP?BY?date_id )?tt2 ON?tt2.date_id?=?tt1.date_id--end--
掃描下方二維碼添加好友,備注【交流】 可私聊交流,也可進資源豐富學習群更文不易,點個“在看”支持一下👇總結
以上是生活随笔為你收集整理的Hive SQL面试题(附答案)的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 运维工程师mysql面试题及答案_运维工
- 下一篇: Spark SQL面试题