[Learning Hive Together] Part 12: Hive SQL Optimization

Published: 2024/3/12 | Database | 豆豆

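11. Hive SQL Optimization

This chapter covers, at the HQL level only, the optimization points to watch for in day-to-day HQL development. It does not cover tuning of parameters or configuration at the Hadoop level.

Most of it comes from blog posts I published earlier, collected and tidied up here.

11.1 Use Partition Pruning and Column Pruning

In a SELECT, fetch only the columns you need and, whenever the table is partitioned, filter on the partition column. Avoid SELECT *.

For partition pruning with an outer join: if the filter on the secondary table is written in the WHERE clause, the full tables are joined first and only then filtered. For example: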

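SELECT a.id
FROM lxw1234_a a
LEFT OUTER JOIN t_lxw1234_partitioned b
ON (a.id = b.url)
WHERE b.day = '2015-05-10';

The correct way is to write the condition in the ON clause, so that the partition filter is applied before the join and the rows of a that have no match are kept: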

SELECT a.id
FROM lxw1234_a a
LEFT OUTER JOIN t_lxw1234_partitioned b
ON (a.id = b.url AND b.day = '2015-05-10');

Or write it directly as a subquery:

SELECT a.id
FROM lxw1234_a a
LEFT OUTER JOIN (SELECT url FROM t_lxw1234_partitioned WHERE day = '2015-05-10') b
ON (a.id = b.url);

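11.2 Use COUNT DISTINCT Sparingly

With small data volumes it does not matter. With large volumes, a COUNT DISTINCT has to be completed by a single Reduce Task (with a GROUP BY, all the values of one group still land on one reducer). That reducer has too much data to process, which makes the whole job hard to finish. COUNT DISTINCT is therefore usually replaced by a GROUP BY followed by a COUNT:

SELECT day,
       COUNT(DISTINCT id) AS uv
FROM lxw1234
GROUP BY day;

can be rewritten as:

SELECT day,
       COUNT(id) AS uv
FROM (SELECT day, id FROM lxw1234 GROUP BY day, id) a
GROUP BY day;

This takes one more job, but with large data volumes it is definitely worth it.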

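11.3 Check for Many-to-Many Joins

Whenever tables are joined, you must investigate whether the join is many-to-many. At the very least, the join key has to be unique in one of the tables or result sets.

If a single join key has a very large number of records, the amount of data sent to the corresponding Reduce Task will be huge, which makes the whole job hard to finish or keeps it from finishing at all.

Avoid Cartesian products as well, for the same reason: if one key carries a very large amount of data, the job can hardly complete.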
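One way to carry out that check before writing the join is to look for join keys that occur more than once on each side. The sketch below reuses the tables from section 11.1; the choice of url as the join key, the partition value and the LIMIT are assumptions made for illustration only.

```sql
-- Join keys that appear more than once in the right-hand table.
-- If this returns rows and the same check on lxw1234_a (grouped by id)
-- also returns rows for those keys, the join is many-to-many.
SELECT url,
       COUNT(*) AS cnt
FROM t_lxw1234_partitioned
WHERE day = '2015-05-10'
GROUP BY url
HAVING COUNT(*) > 1
ORDER BY cnt DESC
LIMIT 20;
```

The keys at the top of this list are also the ones most likely to overload a single reducer.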

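11.4 Use MapJoin Sensibly

For the principles and mechanics of MapJoin, see [Learning Hive Together] Part 10.

The size limit for the small table in a MapJoin can be tuned with a parameter.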
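As a rough sketch of what that tuning can look like: the two settings below are standard Hive properties, but the threshold value is only an example, and treating lxw1234_a as the small table is an assumption made for illustration.

```sql
-- Let Hive convert a common join into a map join automatically
set hive.auto.convert.join=true;
-- Size in bytes below which a table is treated as the small table
-- (25000000 is an example value, not a recommendation)
set hive.mapjoin.smalltable.filesize=25000000;

-- Explicit hint form: ask Hive to load a into memory on the map side.
-- Assumes lxw1234_a is small enough to fit in memory.
SELECT /*+ MAPJOIN(a) */ a.id
FROM lxw1234_a a
JOIN t_lxw1234_partitioned b
ON (a.id = b.url)
WHERE b.day = '2015-05-10';
```

Depending on the Hive version, the hint may be ignored unless hive.ignore.mapjoin.hint is set to false, so the automatic conversion is usually the simpler route.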

11.5 Use UNION ALL Sensibly

A UNION ALL over the same table is much faster than a multi-insert.

The reason is that Hive itself optimizes this kind of UNION ALL so that the source table is scanned only once (see http://www.apacheserver.net/How-is-Union-All-optimized-in-Hive-at229466.htm). A multi-insert also scans the source table only once, but because it has to insert into several partitions it does a lot of additional work, which makes it take far longer. Test and experiment as much as you can during development!

The test table lxw_test3 holds about 1.2 billion records.

UNION ALL: about 7 minutes

create table lxw_test5 as
select type, popt_id, login_date
from (
  select 'm3_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01'
  union all
  select 'mn_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09'
  union all
  select 'm3_g_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='1'
  union all
  select 'm3_l_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='2'
  union all
  select 'm3_s_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='3'
  union all
  select 'm3_o_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='4'
  union all
  select 'mn_g_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='1'
  union all
  select 'mn_l_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='2'
  union all
  select 'mn_s_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='3'
  union all
  select 'mn_o_login' as type, popt_id, login_date from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='4'
) x;

Multi-insert: about 25 minutes

from lxw_test3
insert overwrite table lxw_test6 partition (flag = '1')
select 'm3_login' as type, popt_id, login_date
where login_date>='2012-02-01' and login_date<'2012-05-01'
insert overwrite table lxw_test6 partition (flag = '2')
select 'mn_login' as type, popt_id, login_date
where login_date>='2012-05-01' and login_date<='2012-05-09'
insert overwrite table lxw_test6 partition (flag = '3')
select 'm3_g_login' as type, popt_id, login_date
where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='1'
insert overwrite table lxw_test6 partition (flag = '4')
select 'm3_l_login' as type, popt_id, login_date
where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='2'
insert overwrite table lxw_test6 partition (flag = '5')
select 'm3_s_login' as type, popt_id, login_date
where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='3'
insert overwrite table lxw_test6 partition (flag = '6')
select 'm3_o_login' as type, popt_id, login_date
where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='4'
insert overwrite table lxw_test6 partition (flag = '7')
select 'mn_g_login' as type, popt_id, login_date
where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='1'
insert overwrite table lxw_test6 partition (flag = '8')
select 'mn_l_login' as type, popt_id, login_date
where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='2'
insert overwrite table lxw_test6 partition (flag = '9')
select 'mn_s_login' as type, popt_id, login_date
where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='3'
insert overwrite table lxw_test6 partition (flag = '10')
select 'mn_o_login' as type, popt_id, login_date
where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='4';

11.6 Run Jobs in Parallel

Anyone who has used Oracle RAC knows what parallel is for.

Parallel execution can indeed speed up a task considerably, but it does not reduce the resources the task consumes.

Hive has an option for parallel execution as well.

For details, see:

http://superlxw1234.iteye.com/blog/1703713

-- enable parallel execution of jobs
set hive.exec.parallel=true;
-- maximum degree of parallelism allowed for one SQL statement; the default is 8
set hive.exec.parallel.thread.number=16;

Jobs generated from the same SQL statement are started in parallel when they do not depend on one another. For example:

from (
  select phone, to_phone, substr(to_phone,-1) as key
  from youni_contact4_lxw
  where youni_id='1'
  and length(to_phone) = 11
  and substr(to_phone,1,2) IN ('13','14','15','18')
  group by phone, to_phone, substr(to_phone,-1)
) t
insert overwrite table youni_contact41_lxw partition(pt='0')
select phone, to_phone where key='0'
insert overwrite table youni_contact41_lxw partition(pt='1')
select phone, to_phone where key='1'
insert overwrite table youni_contact41_lxw partition(pt='2')
select phone, to_phone where key='2'
insert overwrite table youni_contact41_lxw partition(pt='3')
select phone, to_phone where key='3'
insert overwrite table youni_contact41_lxw partition(pt='4')
select phone, to_phone where key='4'
insert overwrite table youni_contact41_lxw partition(pt='5')
select phone, to_phone where key='5'
insert overwrite table youni_contact41_lxw partition(pt='6')
select phone, to_phone where key='6'
insert overwrite table youni_contact41_lxw partition(pt='7')
select phone, to_phone where key='7'
insert overwrite table youni_contact41_lxw partition(pt='8')
select phone, to_phone where key='8'
insert overwrite table youni_contact41_lxw partition(pt='9')
select phone, to_phone where key='9';

This SQL generates 11 jobs. The first job builds the temporary table and all the later jobs depend on it, so nothing runs in parallel at that point. Once the first job completes, the remaining jobs all start in parallel.

Running time comparison:

Parallel execution disabled: 35 minutes

8 parallel jobs: 10 minutes

16 parallel jobs: 6 minutes

Of course, this only pays off when the system has spare resources. Without free resources, the parallel jobs cannot start either.

11.7 Use Local MR

If the SQL you run in Hive touches only a small amount of data, running it with local MR is much faster than submitting it to the Hadoop cluster.

For details, see:

http://superlxw1234.iteye.com/blog/1703546

For example, on the cluster:

hive> select 1 from dual;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201208151631_2040444, Tracking URL = http://jt.dc.sh-wgq.sdo.com:50030/jobdetails.jsp?jobid=job_201208151631_2040444
Kill Command = /home/hdfs/hadoop-current/bin/hadoop job -Dmapred.job.tracker=10.133.10.103:50020 -kill job_201208151631_2040444
2012-10-23 10:55:17,646 Stage-1 map = 0%, reduce = 0%
2012-10-23 10:55:27,807 Stage-1 map = 100%, reduce = 0%
Ended Job = job_201208151631_2040444
OK
1
Time taken: 17.853 seconds

Settings for local MR:

-- enable local MR
set hive.exec.mode.local.auto=true;
-- maximum input size for local MR; local MR is used when the input is smaller than this value
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- maximum number of input files for local MR; local MR is used when the number of input files is smaller than this value
set hive.exec.mode.local.auto.tasks.max=10;

Local MR is used only when all three conditions hold at the same time. In later Hive releases the last setting appears under the name hive.exec.mode.local.auto.input.files.max, so check the documentation for your version.

The same query with local MR:

hive> select 1 from dual;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/liuxiaowen/liuxiaowen_20121023105757_31c966be-ee79-4c23-a467-648290b338ac.log
Job running in-process (local Hadoop)
2012-10-23 10:58:03,728 null map = 100%, reduce = 0%
Ended Job = job_local_0001
OK
1
Time taken: 4.842 seconds

11.8 Use Dynamic Partitions Sensibly

See [Learning Hive Together] Part 6: Hive Dynamic Partitions

http://lxw1234.com/archives/2015/06/286.htm
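As an illustration of where dynamic partitions can help, the ten static-partition inserts of the youni_contact41_lxw example in section 11.6 could probably be collapsed into one statement. This is a sketch, not the author's method: it assumes that the last character of to_phone is always one of the digits 0 to 9, and it only writes partitions that actually receive rows.

```sql
-- Enable dynamic partitioning; nonstrict mode allows every partition
-- column to be determined at run time
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The value of pt is taken from the last column of the SELECT
insert overwrite table youni_contact41_lxw partition(pt)
select phone,
       to_phone,
       substr(to_phone, -1) as pt
from youni_contact4_lxw
where youni_id = '1'
  and length(to_phone) = 11
  and substr(to_phone, 1, 2) IN ('13', '14', '15', '18')
group by phone, to_phone, substr(to_phone, -1);
```

Whether this is faster than the multi-insert depends on the data and the cluster, so it is worth measuring both.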

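11.9 Avoid Data Skew

Data skew is one of the biggest performance killers in Hive development.

Symptoms:

Task progress stays at 99% (or 100%) for a long time;

The job monitoring page shows that only a few (one or several) reduce subtasks have not finished;

The amount of local read/write data is very large.

Operations that cause data skew:

GROUP BY, COUNT DISTINCT, JOIN

Causes:

Unevenly distributed keys

Characteristics of the business data itself

Some commonly used remedies for data skew are listed below: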

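Skew caused by COUNT DISTINCT and GROUP BY:

When there are many empty or NULL values, or a single value has an especially large number of records, filter that value out first and handle it separately at the end:

SELECT CAST(COUNT(DISTINCT imei) + 1 AS bigint)
FROM lxw1234
WHERE pt = '2012-05-28'
AND imei <> 'lxw1234';

For example, if on a given day the IMEI value 'lxw1234' is especially frequent and I want the total number of distinct IMEIs, I can count the ones that are not 'lxw1234' first and then add 1.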

Multiple COUNT DISTINCT

This is usually worked around with UNION ALL + ROW_NUMBER() + SUM + GROUP BY.
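A minimal sketch of that workaround, assuming a table lxw1234 that carries the columns day, id and imei together (the source uses these names in separate examples, so their coexistence in one table is an assumption). It is meant to replace SELECT day, COUNT(DISTINCT id), COUNT(DISTINCT imei) FROM lxw1234 GROUP BY day.

```sql
SELECT day,
       -- rn = 1 occurs exactly once per distinct (day, src, val)
       SUM(CASE WHEN src = 'id'   AND rn = 1 THEN 1 ELSE 0 END) AS id_cnt,
       SUM(CASE WHEN src = 'imei' AND rn = 1 THEN 1 ELSE 0 END) AS imei_cnt
FROM (
    SELECT day, src, val,
           ROW_NUMBER() OVER (PARTITION BY day, src, val ORDER BY val) AS rn
    FROM (
        -- stack both columns into one, tagged by their origin;
        -- NULLs are dropped because COUNT(DISTINCT ...) ignores them too
        SELECT day, 'id' AS src, CAST(id AS string) AS val
        FROM lxw1234
        WHERE id IS NOT NULL
        UNION ALL
        SELECT day, 'imei' AS src, CAST(imei AS string) AS val
        FROM lxw1234
        WHERE imei IS NOT NULL
    ) u
) t
GROUP BY day;
```

The work is spread over the keys (day, src, val) instead of being concentrated per day. One difference to keep in mind: a day whose id and imei are all NULL would not appear in this result at all.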

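Skew caused by JOIN

The join key contains many empty values or one special value, such as "NULL":

Handle the empty values separately and keep them out of the join;

Or append a random number to the empty or special value and use the result as the join key.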
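Both remedies can be sketched with the tables from section 11.1, under the assumptions that lxw1234_a.id contains many NULLs, that id and url are strings, and that no real url starts with the made-up prefix 'null_'.

```sql
-- Remedy 1: join only the non-NULL keys, then append the NULL rows unchanged
SELECT x.id, x.url
FROM (
    SELECT a.id, b.url
    FROM (SELECT id FROM lxw1234_a WHERE id IS NOT NULL) a
    LEFT OUTER JOIN (SELECT url FROM t_lxw1234_partitioned WHERE day = '2015-05-10') b
    ON (a.id = b.url)
    UNION ALL
    SELECT id, CAST(NULL AS string) AS url
    FROM lxw1234_a
    WHERE id IS NULL
) x;

-- Remedy 2: replace NULL keys with random keys so that they are spread
-- over many reducers; they still match nothing on the right-hand side
SELECT a.id, b.url
FROM (
    SELECT id,
           CASE WHEN id IS NULL
                THEN concat('null_', CAST(rand() AS string))
                ELSE id
           END AS join_key
    FROM lxw1234_a
) a
LEFT OUTER JOIN (SELECT url FROM t_lxw1234_partitioned WHERE day = '2015-05-10') b
ON (a.join_key = b.url);
```

The first form reads lxw1234_a twice but moves no NULL rows through the join. The second form needs a single pass at the cost of shuffling the NULL rows.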

Joining columns of different data types

Convert them to the same data type before joining.
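For instance, if lxw1234_a.id were a numeric column and t_lxw1234_partitioned.url a string column (an assumption made only for this illustration), the conversion could be made explicit in the join condition:

```sql
-- Cast the numeric key to string so that both sides are compared,
-- and distributed, as the same type
SELECT a.id
FROM lxw1234_a a
LEFT OUTER JOIN (SELECT url FROM t_lxw1234_partitioned WHERE day = '2015-05-10') b
ON (CAST(a.id AS string) = b.url);
```

Making the cast explicit avoids relying on Hive's implicit conversion, which can turn values that do not convert cleanly into NULLs that pile up in one place.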

11.10 Control the Number of Maps and Reduces

See http://lxw1234.com/archives/2015/04/15.htm
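The linked article is not reproduced here. As a rough orientation only, the settings below are the ones commonly used for this purpose; the property names are standard Hive and Hadoop properties, while every value is an example chosen for illustration.

```sql
-- Fewer, larger map tasks: combine small input files into bigger splits
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;

-- Reducers: let Hive derive the count from the input size ...
set hive.exec.reducers.bytes.per.reducer=500000000;
set hive.exec.reducers.max=200;
-- ... or fix the count explicitly for the session
set mapred.reduce.tasks=20;
```

Suitable values depend on file sizes and cluster capacity, so treat these numbers as placeholders.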

11.11 Compress Intermediate Results

See http://superlxw1234.iteye.com/blog/1741103

Intermediate results with LZO, final output with Gzip:

set mapred.output.compress = true;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type = BLOCK;
set mapred.compress.map.output = true;
set mapred.map.output.compression.codec = org.apache.hadoop.io.compress.LzoCodec;
set hive.exec.compress.output = true;
set hive.exec.compress.intermediate = true;
set hive.intermediate.compression.codec = org.apache.hadoop.io.compress.LzoCodec;

Intermediate results with LZO, final output uncompressed (hive.exec.compress.output is not set here):

set mapred.output.compress = true;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.LzoCodec;
set mapred.output.compression.type = BLOCK;
set mapred.compress.map.output = true;
set mapred.map.output.compression.codec = org.apache.hadoop.io.compress.LzoCodec;
set hive.exec.compress.intermediate = true;
set hive.intermediate.compression.codec = org.apache.hadoop.io.compress.LzoCodec;

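11.12 Miscellaneous

Watch how Hive jobs execute in the MapReduce web UI;
Understand how HQL is translated into MapReduce;
HQL optimization is really MapReduce optimization. As a distributed computing model, its core requirement is that data be spread evenly across the nodes. Only then can it deliver its full power; otherwise a single unevenly loaded node holds everything back.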
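A simple way to start looking at how HQL turns into MapReduce is Hive's EXPLAIN statement. The sketch below applies it to the rewritten query from section 11.2.

```sql
-- Print the execution plan instead of running the query
EXPLAIN
SELECT day,
       COUNT(id) AS uv
FROM (SELECT day, id FROM lxw1234 GROUP BY day, id) a
GROUP BY day;
```

The output lists the stages and their dependencies together with the operator tree of each stage, which shows how many jobs a statement produces and where the shuffles happen.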
