Hive (Part 3): Advanced Hive Operations

Published: 2023/12/20


I. Hive JOIN operations

Syntax:
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
Hive supports equality joins, outer joins, and left semi joins. Hive versions before 2.2.0 do not support non-equality join conditions, because such conditions are very hard to translate into map/reduce jobs.
Hive also supports joins of more than two tables.

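Points to note when writing join queries:

  1. Only equality joins are supported

For example:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
are both valid, whereas:
SELECT a.* FROM a JOIN b ON (a.id>b.id)
is rejected (on Hive versions before 2.2.0).
  2. More than two tables can be joined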

For example:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
If every join in the query uses the same join key, the whole query is compiled into a single map/reduce job. For example:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
is compiled into a single map/reduce job, because only b.key1 is used as the join key for b.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
This query, by contrast, is compiled into two map/reduce jobs, because b.key1 is used in the first join condition and b.key2 in the second.

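  3. What each map/reduce job does during a join

The reducer buffers the rows of every table in the join sequence except the last one, then streams the last table through and writes the results to the file system. This keeps memory use on the reduce side low. In practice, put the largest table last; otherwise a lot of memory is wasted on buffering.
For example:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
All tables use the same join key, so a single map/reduce job computes the result. The reduce side buffers the rows of tables a and b, then computes the join result each time it reads a row of table c. Similarly:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
This query uses two map/reduce jobs. The first buffers table a and streams table b; the second buffers the output of the first job and streams table c.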
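The buffering behaviour described above can be modelled in a few lines of Python. This is only a sketch of the idea, not Hive's actual reducer code: the function name `reduce_side_join` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def reduce_side_join(buffered, streamed):
    """Join two lists of (key, value) pairs the way one reducer might:
    hold the earlier table in memory, then stream the last table past it."""
    cache = defaultdict(list)      # memory cost grows with the buffered table
    for key, value in buffered:
        cache[key].append(value)
    for key, value in streamed:    # rows of the last table are never stored
        for cached_value in cache.get(key, []):
            yield (key, cached_value, value)


# Hypothetical data: a small table "a" and a larger table "b".
small_a = [(1, "a1"), (2, "a2")]
large_b = [(1, "b1"), (1, "b2"), (3, "b3")]

joined = list(reduce_side_join(small_a, large_b))
print(joined)
```

Only `small_a` is ever held in `cache`, which is why writing the largest table last keeps reducer memory low.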

  4. Outer joins: the LEFT, RIGHT, and FULL OUTER keywords control how rows without a match are handled

    (1) Create two tables

create table tablea (id int, name string) row format delimited fields terminated by ',';
create table tableb (id int, age int) row format delimited fields terminated by ',';

    (2) Prepare the data

    (3) Load a.txt into tablea and b.txt into tableb

    (4) The data is ready

    (5) Join demo

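      a. Inner join: returns the rows that satisfy the join condition on both sides

         select * from tablea a inner join tableb b on a.id=b.id;

      b. left join (equivalent to left outer join)

         1. The left table is the driving side: every left-table row is kept
         2. Columns with no match are filled with null
         3. The result has at least as many rows as the left table, and exactly as many when each left row matches at most one right row
         HQL statement: select * from tablea a left join tableb b on a.id=b.id;

      c. right join (equivalent to right outer join)

         1. The right table is the driving side: every right-table row is kept
         2. Columns with no match are filled with null
         3. The result has at least as many rows as the right table, and exactly as many when each right row matches at most one left row
         HQL statement: select * from tablea a right join tableb b on a.id=b.id;

      d. left semi join. Older Hive versions did not support IN/EXISTS subqueries (the Hive language manual lists them as supported from 0.13), so left semi join was used instead; it is also an efficient implementation of IN/EXISTS. Only columns of the left table are returned, and the right table may be referenced only in the ON clause.
         select * from tablea a left semi join tableb b on a.id=b.id;

      e. full outer join: keeps the unmatched rows of both tables, filling the missing side with null
         select * from tablea a full outer join tableb b on a.id=b.id;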
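To make the differences between the five join types concrete, here is a small Python model of their row semantics. The sample rows are hypothetical (the original data files are not reproduced in the post), and the helper functions are illustrative only; Hive itself does not guarantee any particular output order.

```python
# Hypothetical sample rows standing in for a.txt and b.txt.
tablea = [(1, "huangbo"), (2, "xuzheng"), (4, "wangbaoqiang")]  # (id, name)
tableb = [(1, 30), (2, 29), (3, 21)]                              # (id, age)


def inner_join(left, right):
    return [l + r for l in left for r in right if l[0] == r[0]]


def left_join(left, right):
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            out.extend(l + r for r in matches)
        else:
            out.append(l + (None, None))   # unmatched left row, NULL-padded
    return out


def right_join(left, right):
    out = []
    for r in right:
        matches = [l for l in left if l[0] == r[0]]
        if matches:
            out.extend(l + r for l in matches)
        else:
            out.append((None, None) + r)   # unmatched right row, NULL-padded
    return out


def full_outer_join(left, right):
    left_ids = {l[0] for l in left}
    unmatched_right = [(None, None) + r for r in right if r[0] not in left_ids]
    return left_join(left, right) + unmatched_right


def left_semi_join(left, right):
    right_ids = {r[0] for r in right}
    return [l for l in left if l[0] in right_ids]   # left columns only


for join in (inner_join, left_join, right_join, full_outer_join, left_semi_join):
    print(join.__name__, join(tablea, tableb))
```

With this data the left and right joins each return three rows, the full outer join returns four, and the semi join returns only the two left rows whose id also appears in tableb.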

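II. Hive data types

  Hive supports two kinds of data types: primitive (atomic) types and complex types.

  1. Primitive data types

(1) Early Hive versions have no date type: dates are stored as strings, and common date-format conversions are done with user-defined functions. (Later versions added TIMESTAMP in 0.8.0 and DATE in 0.12.0.)
(2) Hive is written in Java, and Hive's primitive types map one-to-one onto Java's primitive types, except for String.
(3) Signed integer types: TINYINT, SMALLINT, INT, and BIGINT are equivalent to Java's byte, short, int, and long primitive types; they are 1-byte, 2-byte, 4-byte, and 8-byte signed integers respectively.
(4) Hive's floating-point types FLOAT and DOUBLE correspond to Java's float and double primitive types.
(5) Hive's BOOLEAN type corresponds to Java's boolean primitive type.
(6) Hive's String type is comparable to a database VARCHAR: it is a variable-length string, but you cannot declare a maximum number of characters for it. In theory it can hold up to 2 GB of characters.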

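  2. Complex data types

Notes:
ARRAY: an ARRAY is a sequence of elements of the same data type that can be accessed by index. For example, if an ARRAY variable fruits holds ['apple','orange','mango'], then fruits[1] returns orange, because ARRAY indexes start at 0.
MAP: a MAP holds key->value pairs whose values are accessed by key. For example, if "userlist" is a map in which username is the key and password is the value, then userlist['username'] returns that user's password.
STRUCT: a STRUCT can hold elements of different data types, which are accessed with dot notation. For example, if user is a STRUCT, then user.address returns that user's address.

  Example:

Notes:
(1) The name field is a primitive type; favors is an array that can hold many hobbies; scores is a map that can hold the scores of several courses; address is a struct that stores address information.
(2) ROW FORMAT DELIMITED declares that the keywords that follow specify the column and element delimiters.
(3) FIELDS TERMINATED BY sets the field delimiter.
(4) COLLECTION ITEMS TERMINATED BY sets the element delimiter (between the elements of an Array, between the elements of a Struct, and between the key-value pairs of a Map).
(5) MAP KEYS TERMINATED BY sets the delimiter between a key and its value in a Map, LINES TERMINATED BY sets the delimiter between rows, and STORED AS TEXTFILE specifies the format in which the uploaded data file is stored.
Summary: in a relational database this would take at least three tables (a student table, a hobby table, and a score table), whereas Hive can do it with one. In other words, complex data types let a single table express what would otherwise be a multi-table relationship.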
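For readers more familiar with general-purpose languages, the three complex types behave much like a list, a dictionary, and a small record class. The Python sketch below is only an analogy: the `Address` fields and all sample values are hypothetical, since the original post does not list them.

```python
from dataclasses import dataclass


@dataclass
class Address:
    # Hypothetical STRUCT fields; the original post does not spell them out.
    street: str
    city: str


fruits = ["apple", "orange", "mango"]     # like ARRAY<string>
userlist = {"username": "password"}       # like MAP<string,string>

# One "row" combining a primitive, an array, a map, and a struct column.
student = {
    "name": "huangbo",
    "favors": ["reading", "running"],
    "scores": {"yuwen": 80, "shuxue": 89},
    "address": Address(street="1 Example Rd", city="beijing"),
}

print(fruits[1])                 # index access, 0-based like Hive arrays
print(userlist["username"])      # key access, like userlist['username']
print(student["address"].city)   # dot access, like user.address in Hive
```

The single `student` record plays the role of the one-table design described in the summary: hobbies and scores live inside the row instead of in separate tables.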

  3. Worked examples:

  (1) Array

Table definitions (person keeps the locations as one plain string; person1 parses them into an array):
create table person(name string,work_locations string)
row format delimited fields terminated by '\t';

create table person1(name string,work_locations array<string>)
row format delimited fields terminated by '\t'
collection items terminated by ',';
Data (name and locations are separated by a tab):
huangbo beijing,shanghai,tianjin,hangzhou
xuzheng changchu,chengdu,wuhan
wangbaoqiang dalian,shenyang,jilin
Load the data:
load data local inpath '/root/person.txt' into table person1;
Queries:
Select * from person1;
Select name from person1;
Select work_locations from person1;
Select work_locations[0] from person1;

  (2) MAP (three delimiters are involved, and they must all be different or parsing fails)

Table definition:
create table score(name string, scores map<string,int>)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';
Data (name and scores are separated by a tab):
huangbo yuwen:80,shuxue:89,yingyu:95
xuzheng yuwen:70,shuxue:65,yingyu:81
wangbaoqiang yuwen:75,shuxue:100,yingyu:75
Load the data:
load data local inpath '/root/score.txt' into table score;
Queries:
Select * from score;
Select name from score;
Select scores from score;
Select s.scores['yuwen'] from score s;
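The role of the three delimiters is easier to see when a single data line is split by hand. The following Python sketch mimics what the table definition asks Hive to do; `parse_score_line` is a made-up helper, and the tab between name and scores is assumed from the `fields terminated by '\t'` clause.

```python
def parse_score_line(line, field_sep="\t", item_sep=",", key_sep=":"):
    """Split one text row into (name, scores) using the three delimiters."""
    name, raw_scores = line.split(field_sep)      # FIELDS TERMINATED BY
    scores = {}
    for item in raw_scores.split(item_sep):       # COLLECTION ITEMS TERMINATED BY
        course, value = item.split(key_sep)       # MAP KEYS TERMINATED BY
        scores[course] = int(value)
    return name, scores


name, scores = parse_score_line("huangbo\tyuwen:80,shuxue:89,yingyu:95")
print(name, scores)
print(scores["yuwen"])    # analogous to scores['yuwen'] in HQL
```

If two of the separators were the same character, the splits above could no longer tell fields, items, and keys apart, which is why the delimiters must differ.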

  (3) struct

Table definition:
create table structtable(id int,course struct<name:string,score:int>)
row format delimited fields terminated by '\t'
collection items terminated by ',';
Data (id and course are separated by a tab):
1 english,80
2 math,89
3 chinese,95
Load the data:
load data local inpath '/root/structtable.txt' into table structtable;
Queries:
Select * from structtable;
Select id from structtable;
Select course from structtable;
Select t.course.name from structtable t;
Select t.course.score from structtable t;

Supplement:

// Take the top three records by id, descending
select id, name, sex ,age, department from studentss order by id desc limit 3;

// Local sort (within each reducer)
select id, name, sex ,age, department from studentss sort by id desc;

// distribute by
insert overwrite local directory '/home/hadoop/studentIndex' select id, name ,sex, age, department from studentss distribute by id sort by id desc;
insert overwrite local directory '/home/hadoop/studentIndex' select id, name ,sex, age, department from studentss distribute by id sort by age desc, id asc;

// cluster by
insert overwrite local directory '/home/hadoop/studentIndex1' select id, name ,sex, age, department from studentss cluster by id;

// Join queries: prepare tables and data
create table a(aid int, name string) row format delimited fields terminated by ',';
create table b(bid int, age int) row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/a.txt' into table a;
load data local inpath '/home/hadoop/b.txt' into table b;

// Inner join
select a.*, b.* from a inner join b on a.aid = b.bid;

// Outer joins
select a.*, b.* from a left join b on a.aid = b.bid;
select a.*, b.* from a right join b on a.aid = b.bid;
select a.*, b.* from a full join b on a.aid = b.bid;

// semi join
select a.* from a where a.aid in (select b.bid from b);
select a.* from a left semi join b on a.aid = b.bid;

// Complex data type: array
create table arrayTable(id int, orderid array<int>, name string) row format delimited fields terminated by '\t' collection items terminated by ',';
load data local inpath '/home/hadoop/arrayData.txt' into table arrayTable;
select orderid[1] from arrayTable where name = 'huangbo';

// Data for the map type
huangbo yuwen:80,shuxue:89,yingyu:95
xuzheng yuwen:70,shuxue:65,yingyu:81
wangbaoqiang yuwen:75,shuxue:100,yingyu:75

// Table definition
create table mapTable(name string, score map<string, int>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';
load data local inpath '/home/hadoop/mapdata.txt' into table mapTable;

// Query data from the map column
select name, score['yuwen'] as yuwen, score['shuxue'] as shuxue, score['yingyu'] as yingyu from mapTable;

// Query from the map table and store the result in a new table
create table score_table as select name, score['yuwen'] as yuwen, score['shuxue'] as shuxue, score['yingyu'] as yingyu from mapTable;

// Data for the struct type
1 english,80
2 math,89
3 chinese,95

// The same data as flat columns
id subject number
1 english 80
2 math 89
3 chinese 95

// Table definition for the struct type
create table structTable(id int, score struct<subject:string, number:int>) row format delimited fields terminated by '\t' collection items terminated by ',';
load data local inpath '/home/hadoop/structdata.txt' into table structTable;

// Query from the struct table and store the result in a new table
create table struct_table as select score.subject as subject, score.number as number from structTable;

2017-01-12 14:45:23
2017-01-12/14:45:23

// Parse JSON-format strings
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
{"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"}

// Create the table
create table json(linedata string);
load data local inpath '/home/hadoop/rating.json' into table json;

// Create a bucketed table
create table bucket_student(id int, name string, sex string, age int, department string) clustered by(age) sorted by(id desc) into 4 buckets row format delimited fields terminated by ',';

// Load the data
insert into table bucket_student select * from studentss;

When the statement above runs, the data inside the buckets of bucket_student is not sorted automatically. To get ordered data in bucket_student, rewrite the load statement:
insert into table bucket_student select * from studentss sort by id desc;
This still does not resolve the sorted by(id desc) clause declared when bucket_student was created. I will report back once I have worked out why.

PS: Loading data into a bucketed table with load does not bucket the data. load only moves the files; keep this in mind.

PS: The four BY clauses related to bucketing and sorting:
distribute by: distributes rows into buckets without sorting
cluster by: distributes rows and sorts them on the same columns
order by: global sort
sort by: local sort (within each reducer)

Tip: the forms used in a CREATE TABLE statement end in "ed" (clustered by), while the forms used in queries do not (cluster by).
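The difference between the four BY clauses can be modelled on a handful of rows. The Python sketch below is an illustration under simplifying assumptions: keys are integers, rows are assigned to reducers with a plain modulo (Hive's real partitioning hash differs), and the sample rows are made up.

```python
def distribute_by(rows, key, num_reducers):
    """Send each row to a reducer based on its key; no sorting."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[row[key] % num_reducers].append(row)   # assumed modulo partitioner
    return buckets


def sort_by(buckets, key, reverse=False):
    """Sort inside each reducer only; no order across reducers."""
    return [sorted(b, key=lambda r: r[key], reverse=reverse) for b in buckets]


def cluster_by(rows, key, num_reducers):
    """distribute by + ascending sort by on the same key."""
    return sort_by(distribute_by(rows, key, num_reducers), key)


def order_by(rows, key, reverse=False):
    """One global sort over all rows."""
    return sorted(rows, key=lambda r: r[key], reverse=reverse)


# Hypothetical rows standing in for the studentss table.
rows = [{"id": i, "age": a} for i, a in [(5, 20), (2, 22), (4, 19), (1, 21), (3, 20)]]


def ids(bucket):
    return [r["id"] for r in bucket]


print([ids(b) for b in distribute_by(rows, "id", 2)])
print([ids(b) for b in sort_by(distribute_by(rows, "id", 2), "id", reverse=True)])
print([ids(b) for b in cluster_by(rows, "id", 2)])
print(ids(order_by(rows, "id", reverse=True)))
```

Each inner list represents the output of one reducer, so only `order_by` produces a single fully ordered result.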


Reposted from: https://www.cnblogs.com/liuwei6/p/6691406.html
