Hive Quick Start

Published: 2025/3/21

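I am a big-data major, and this document started out as my notes from the Hive course at my university. I later filled it out when I had to reinstall Hive after reinstalling my operating system. I came across it again by chance today; judging by the formatting, I had meant to publish it and then forgot, so I am posting it now. The content comes mainly from the handouts my instructor wrote for the course and from online material. (Note: this was originally published in early September 2018.)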

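This document is meant for quickly learning or reviewing the Hive features used most often. It takes about twenty minutes to read, after which you should be able to get started with Hive.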

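External and managed tables

Managed (internal) tables

Hive creates a managed table (also called an internal table) by default. Its data is stored under the directory set by hive.metastore.warehouse.dir, which defaults to /user/hive/warehouse.

Loading data moves (cuts) the file into that location, so the file no longer exists at its original path. This applies to files that are already on HDFS; with load data local inpath, the local file is copied instead.

Dropping the table deletes both the data and the metadata.

A plain create table xxx (xx xxx) statement creates a managed table.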

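External tables

The files of an external table can live on an external system, as long as Hive has permission to access them.

An external table that points at existing files does not move them; Hive only records metadata.

Dropping an external table removes only the metadata; the underlying data is not deleted.

To tell whether a table is managed or external, run DESCRIBE FORMATTED table_name and look at the Table Type row (MANAGED_TABLE or EXTERNAL_TABLE).

To create an external table, add the external keyword: create external table xxx (xxx)

When the data an external table points at changes, queries see the change automatically and no special handling is needed. The exception is a partitioned table, where a new partition directory still has to be registered, for example with ALTER TABLE ... ADD PARTITION.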
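As a minimal sketch of the points above, the statements below create an external table over an existing HDFS directory and then check its type. The table name students_ext and the path /user/fonttian/external/students are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table name and path, for illustration only.
-- LOCATION points the table at files that already exist; nothing is moved.
create external table if not exists students_ext (name string, age int, sex string, brithday date)
row format delimited fields terminated by '\t'
location '/user/fonttian/external/students';

-- The "Table Type" row reads EXTERNAL_TABLE for this table
-- and MANAGED_TABLE for a managed one.
describe formatted students_ext;

-- Dropping the table removes only the metadata; the files under LOCATION stay.
drop table students_ext;
```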

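-- List databases
show databases;

-- Create a database; its location is on HDFS
create database if not exists sysoa
comment 'OA database'
location '/user/database/hive/warehouse/sysoa.db';

-- Drop a database; CASCADE drops all of its tables first
drop database if exists userdb cascade;

-- Switch to a database
use class;

-- Create a managed table
create table if not exists students2 (name string, age int, sex string, brithday date)
row format delimited fields terminated by '\t';

-- Load data from the local file system
load data local inpath '/home/fonttian/database/hive/students2' overwrite into table students2;

-- Create an external table stored as ORC
create external table if not exists students3 (name string, age int, sex string, brithday date)
row format delimited fields terminated by '\t'
stored as orc;

-- Delete the table's data but keep its structure (managed tables only)
truncate table students2;

-- Drop the table: data and structure for a managed table, metadata only for an external table.
-- Later examples reuse students2, so recreate and reload it after trying this.
drop table students2;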

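A data-loading problem when the storage format is SequenceFile

When a table's storage format is SequenceFile, loading a txt file into it makes Hive report a file-format error, because load data only moves or copies the file and does not convert it. The same applies to other non-text formats such as ORC, which the students3 table above uses. The fix is to load the txt file into a text-format table first and then insert from that table into the SequenceFile table.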

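-- Loading the text file straight into students3, which is not a text-format table,
-- is the step that triggers the file-format error
load data local inpath '/home/fonttian/database/hive/students2' overwrite into table students3;

-- Create another external table (no STORED AS clause, so it uses the default file format)
create external table if not exists students3_orc (name string, age int, sex string, brithday date)
row format delimited fields terminated by '\t';

-- Instead, fill the tables from a table that already holds the data
insert into table students3 select * from students2;
insert into table students3_orc select * from students3;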

Partitions

-- Create an external table partitioned by a date column
create external table if not exists students4 (name string, age int, sex string, brithday date)
partitioned by (day date)
row format delimited fields terminated by '\t';

-- Load data into the partition day="2018-3-26" of the partitioned external table
load data local inpath '/home/fonttian/database/hive/students2' into table students4 partition (day="2018-3-26");

-- If queries against that table do not work, use an int partition column instead
create external table if not exists students5 (name string, age int, sex string, brithday date)
partitioned by (pt_int int)
row format delimited fields terminated by '\t';

load data local inpath '/home/fonttian/database/hive/students2' into table students5 partition (pt_int=1);
load data local inpath '/home/fonttian/database/hive/students2' into table students5 partition (pt_int=2);

select * from students5;
select * from students5 where pt_int = 1;
select * from students5 where pt_int > 0;

-- Create an external table stored as Parquet and fill it from students3
create external table if not exists students3_parquet (name string, age int, sex string, brithday date)
row format delimited fields terminated by '\t'
stored as parquet;

insert into table students3_parquet select * from students3;

-- Query with a filter ('male' is an illustrative value; adjust it to your data)
select * from students2 where age > 30 and sex = 'male';

-- List the partitions of a table (this fails if the table is not partitioned)
show partitions students5;

-- Or inspect the table structure
describe extended students5;
desc formatted students5;

-- Drop a partition
alter table students5 drop partition (pt_int=2);
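A partition can also be filled from a query instead of a file. The sketch below assumes the students3 and students5 tables defined earlier; the partition value pt_int=3 is an arbitrary choice for illustration.

```sql
-- Static partition insert: every selected row goes into partition pt_int=3.
-- The partition column is not part of the select list.
insert into table students5 partition (pt_int=3)
select name, age, sex, brithday from students3;

-- The new partition is now listed together with the loaded ones
show partitions students5;
```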

Exporting data

-- Export data with insert
insert overwrite local directory "/home/fonttian/database/hive/learnhive"
select * from students5;

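However, data exported this way is awkward to read directly because of the delimiter: by default, fields are separated by ^A (\x01).

Specifying your own delimiter with a formatted export, or using a streaming export, avoids this problem.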

-- Formatted export with a custom delimiter
insert overwrite local directory "/home/fonttian/database/hive/learnhive"
row format delimited fields terminated by '\t'
collection items terminated by '\n'
select * from students5;

# Streaming export; run this from the shell, not inside Hive
bin/hive -e "use class;select * from students5;" > /home/fonttian/database/hive/learnhive/students5.txt

To export to HDFS instead, simply drop the local keyword.
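For instance, an HDFS export might look like the sketch below. The target path /user/fonttian/export/students5 is hypothetical, and the custom delimiter on a non-local export assumes a reasonably recent Hive release (1.2.0 or later, as far as I know).

```sql
-- Hypothetical HDFS path, for illustration only.
-- Without the local keyword the directory is on HDFS; its existing contents are overwritten.
insert overwrite directory '/user/fonttian/export/students5'
row format delimited fields terminated by '\t'
select * from students5;
```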

DML

Queries

Grouping (group by/having)

Average salary of each department
Highest salary for each job within each department
Departments whose average salary exceeds 2000
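The three exercises above could be written roughly as follows. The notes do not define an employee table, so the emp table and its columns (ename, deptno, job, sal) are assumptions made for this sketch.

```sql
-- Hypothetical table, for illustration only
create table if not exists emp (ename string, deptno int, job string, sal double)
row format delimited fields terminated by '\t';

-- Average salary of each department
select deptno, avg(sal) as avg_sal
from emp
group by deptno;

-- Highest salary for each job within each department
select deptno, job, max(sal) as max_sal
from emp
group by deptno, job;

-- Departments whose average salary exceeds 2000
select deptno, avg(sal) as avg_sal
from emp
group by deptno
having avg(sal) > 2000;
```

where filters rows before grouping, while having filters the groups after aggregation, which is why the last query needs having.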

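Table joins (join)

Sorting

order by

Global sort
Sorts the entire data set, so only a single reducer is used.

sort by

Sorts the data within each reducer; the overall result set is not globally sorted.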
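A minimal order by sketch against the students5 table from the partition section. Because every row passes through one reducer, it is common to add a limit on large tables; in strict mode (hive.mapred.mode=strict) Hive requires one.

```sql
-- Global sort: the whole result is ordered by age, oldest first
select * from students5 order by age desc;

-- With a limit, only the first rows of the sorted result are returned
select * from students5 order by age desc limit 10;
```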

-- Tune the number of reducers first if necessary
-- set hive.exec.reducers.max=<number>
-- set mapreduce.job.reduces=<number>

-- Query students5, sorted by age within each reducer
select * from students5 sort by age asc;

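distribute by

Similar to partitioning in MapReduce: it decides which reducer each row is sent to, and it is used together with sort by. Note that here too the output needs an explicit delimiter so that the exported data can be read directly.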

insert overwrite local directory '/home/fonttian/database/hive/learnhive/students5_distribute_by'
row format delimited fields terminated by '\t'
collection items terminated by '\n'
select * from students5 distribute by pt_int sort by age asc;

Notes:
distribute by must come before sort by.

cluster by

When the distribute by column and the sort by column are the same, cluster by can be used in their place.
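As a sketch of that equivalence, the two queries below distribute and sort the rows of students5 in the same way. cluster by does not take asc or desc, so a descending sort still needs the distribute by plus sort by form.

```sql
-- Long form: send rows with the same age to the same reducer, then sort by age within each reducer
select * from students5 distribute by age sort by age;

-- Short form with the same effect
select * from students5 cluster by age;
```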

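join

Hive supports only equi-joins, outer joins, and left semi joins (recent releases relax the equality-only restriction on join conditions).

First, load some data to work with: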

```sql
-- Create an external table for scores
create external table if not exists score (name string, math int, chinese int, english int)
row format delimited fields terminated by '\t'
stored as textfile;

-- Load data
load data local inpath '/home/fonttian/database/hive/score' overwrite into table score;

-- Create an external table for jobs
create external table if not exists job (name string, likes string)
row format delimited fields terminated by '\t'
stored as textfile;

-- Load data
load data local inpath '/home/fonttian/database/hive/job' overwrite into table job;
```

- More than two tables can be joined.

```sql
select students2.name, students2.age, score.math, job.likes
from students2
join score on (students2.name = score.name)
join job on (job.name = score.name);
```

- If every join uses the same join key, the joins are converted into a single map/reduce job.
- Put the largest table last in the join. In each map/reduce job, the reducer buffers the rows of every table in the join sequence except the last one, then streams the last table through and writes the results to the file system.
- To restrict the output of a join, put the filter either in the where clause or in the join clause. The latter is recommended, because it avoids some mistakes.

```sql
select students5.name, score.math
from score left outer join students5
on (score.name = students5.name and students5.pt_int = 1);

select students5.name, score.math
from students5 left outer join score
on (score.name = students5.name and students5.pt_int = 1);
```

- LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery. Its restriction is that the right-hand table may only be referenced in the ON clause; it cannot be used in the where clause, the select clause, or anywhere else.

```sql
-- IN subquery
select job.name, job.likes from job where job.name in (select score.name from score);

-- Equivalent left semi join
select job.name, job.likes from job left semi join score on (score.name = job.name);
```
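One mistake the ON-clause recommendation avoids shows up with outer joins. The sketch below uses the score and students5 tables from above and contrasts the two placements of the same filter.

```sql
-- Filter in the ON clause: every row of score is kept;
-- the students5 columns are NULL where no matching row exists
select score.name, score.math, students5.age
from score left outer join students5
on (score.name = students5.name and students5.pt_int = 1);

-- Filter in the WHERE clause: unmatched rows have pt_int = NULL, fail the filter,
-- and are removed, so the query behaves like an inner join
select score.name, score.math, students5.age
from score left outer join students5
on (score.name = students5.name)
where students5.pt_int = 1;
```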

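Regular expressions

The regexp operator

Syntax: A REGEXP B

Operand types: strings

Description: same as RLIKE; the result is true if any substring of A matches the Java regular expression B.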

-- Count the rows whose name does not contain eight consecutive digits
select count(*) from students5 where name not regexp '\\d{8}';

-- Count the rows whose name does not start with T
select count(*) from students5 where name not regexp '^T.*';
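Because a match anywhere in the string counts, anchors matter. The literal-only queries below are a small sketch of that behaviour; they assume a Hive version that allows select without a from clause (0.13 or later).

```sql
-- REGEXP and RLIKE are interchangeable
select 'foobar' regexp 'foo';   -- true: a substring matches
select 'foobar' rlike '^bar';   -- false: the string does not start with bar
select 'foobar' rlike 'bar$';   -- true: the string ends with bar
```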

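The regexp_extract function

Syntax: regexp_extract(string subject, string pattern, int index)

Return type: string

Description: matches the string subject against the regular expression pattern and returns the text captured by group number index; index 0 returns the entire match.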

-- Match 'IloveYou' against '(I)(.*?)(You)' and return the first group; the result is I
select regexp_extract('IloveYou','(I)(.*?)(You)',1) from students5 limit 1;

-- Match 'IloveYou' against 'I(.*?)(You)' and return the second group; the result is You
select regexp_extract('IloveYou','I(.*?)(You)',2) from students5 limit 1;

-- Index 0 returns the entire match; the result is IloveYou
select regexp_extract('IloveYou','(I)(.*?)(You)',0) from students5 limit 1;
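A slightly more practical sketch: pulling the parts out of a date-like string. The input literal is made up for illustration, and the backslash is doubled because the Hive string literal consumes one level of escaping before the pattern reaches the regex engine.

```sql
-- Group 1 is the year
select regexp_extract('2018-03-26', '(\\d{4})-(\\d{2})-(\\d{2})', 1);  -- 2018

-- Group 2 is the month
select regexp_extract('2018-03-26', '(\\d{4})-(\\d{2})-(\\d{2})', 2);  -- 03
```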

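The regexp_replace function

Syntax: regexp_replace(string A, string B, string C)

Return type: string

Description: replaces the parts of string A that match the Java regular expression B with C. Note that escape characters are needed in some cases. It is similar to the regexp_replace function in Oracle.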

-- Returns 'Ilove'
select regexp_replace("IloveYou","You","") from students5 limit 1;

-- Returns 'Ilovelili'
select regexp_replace("IloveYou","You","lili") from students5 limit 1;
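The escaping caveat is easiest to see with a metacharacter such as the dot. In this sketch the input string is made up for illustration.

```sql
-- Escaped dot: only the literal dots are replaced
select regexp_replace('a.b.c', '\\.', '-');  -- a-b-c

-- Unescaped dot: it matches any character, so every character is replaced
select regexp_replace('a.b.c', '.', '-');    -- -----
```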

beeline and hiveserver2

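# Start hiveserver2 in the background
$ nohup bin/hive --service hiveserver2 &

# Check whether hiveserver2 is running
$ ps -aux | grep hiveserver2

# Stop it (20670 is the process ID reported by ps)
$ kill -9 20670

$ bin/beeline

# Connect to Hive with the default account
beeline> !connect jdbc:hive2://localhost:10000 scott tiger

# Connect to Hive with the account and password from the configuration
beeline> !connect jdbc:hive2://localhost:10000 fonttian 123456

# Quit
beeline> !quit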

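References

Hadoop Hive概念學習系列之hive的正則表達式初步（六） (part 6 of a blog series on Hive concepts: a first look at regular expressions in Hive)

My course material: the handouts written by my instructor.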
