Hive Basics (Part 1)

Published: 2024/7/5

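I. What Is Hive

      Hive is a data warehouse tool built on Hadoop and intended for offline (batch) analysis. It maps structured data files to database tables and provides SQL-like querying: it accepts the SQL statements a user enters, translates them into MapReduce programs that query and process the data on HDFS, and then returns the result or writes it back to HDFS.

Key points: Hive stores its data files in HDFS, uses MapReduce for analysis and computation, and offers SQL as the query interface.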
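To make the "SQL becomes MapReduce" idea concrete, here is a minimal Python sketch of how a query such as select gender,max(age) from t_1 group by gender could be evaluated in map, shuffle and reduce steps. It only illustrates the idea and is not Hive's actual execution plan; the sample lines and the function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical rows of a comma-delimited data file: id,gender,age
LINES = ["1,male,23", "2,female,31", "3,male,40", "4,female,19"]

def map_phase(lines):
    """Emit one (gender, age) pair per input line, as a mapper would."""
    for line in lines:
        _id, gender, age = line.split(",")
        yield gender, int(age)

def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Apply the aggregate (max) to each group."""
    return {key: max(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(LINES)))
print(result)  # {'male': 40, 'female': 31}
```

The group by column plays the role of the shuffle key, and the aggregate function is what the reducer computes for each key.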

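II. Installing and Configuring Hive

1.1. Using the embedded Derby database as the metastore

      (1) The machine on which Hive is installed should have a Hadoop environment (the installation directory and the HADOOP_HOME environment variable). (2) Simply unpack a Hive installation package. A Hive instance installed this way records its metadata in the embedded Derby database. This mode is inconvenient to use, because only one process can open the embedded Derby metastore at a time.

1.2. Using MySQL as the metastore

      Taking a Linux installation as an example: (1) upload the MySQL installation package; (2) unpack it; (3) install the MySQL server package; (4) install the MySQL client package; (5) start the MySQL service; (6) change the initial password; (7) test. Note: MySQL must allow remote logins.

2. Edit the Hive configuration file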

      Edit the hive-site.xml configuration file:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <!-- add a <value> element holding the MySQL user name -->
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <!-- add a <value> element holding the MySQL password -->
  </property>
</configuration>
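A forgotten value in hive-site.xml is easy to miss, so a quick check before starting Hive can help. The following Python sketch parses a configuration text and reports which of the four metastore connection properties have no value. The SAMPLE text and the helper names are assumptions made for this example; in practice you would read your own hive-site.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration text with only two of the four properties filled in.
SAMPLE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>"""

REQUIRED = [
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
]

def read_properties(xml_text):
    """Return a dict of property name -> value (None when <value> is absent)."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

def missing_properties(xml_text):
    """List the required properties that are absent or have an empty value."""
    props = read_properties(xml_text)
    return [name for name in REQUIRED if not props.get(name)]

print(missing_properties(SAMPLE))
```

For the sample text, the user name and password properties are reported as missing.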

3. Upload a MySQL JDBC driver jar to the lib directory under the Hive installation directory

4. Add HIVE_HOME to the system environment variables in /etc/profile

5. source /etc/profile

6. Start Hive

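III. Three Ways to Start Hive

1. Start an interactive Hive query shell

      bin/hive

      hive>

2. Start the Hive network service, then connect to it with the beeline client to run queries

      Start the service: bin/hiveserver2

      Start the client and connect to the Hive service: bin/beeline -u jdbc:hive2://slave2:10000 -n root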

3. Run Hive from a script

      When there are many Hive query jobs, running them from a script is more efficient. The core of this mechanism is that hive can execute a given HQL statement as a one-off command. For example:

vi t_t.sh

#!/bin/bash
hive -e "insert into table t_max select gender,max(age) from t_1 group by gender"
hive -e 'create table t_sum as select id,sum(amount) from t_1 group by id'

IV. Creating Databases and Tables and Importing Data (DDL)

1. Create a database: create database hello;    Database directory: /user/hive/warehouse/hello.db

      Hive has a default database named default, whose directory is /user/hive/warehouse

2. Create tables:

      Managed (internal) table: the table directory follows Hive's conventions and lives under the warehouse directory /user/hive/warehouse

            create table t1(id int,name string,age int)
            row format delimited fields terminated by ',';

      External table: the table directory is specified by the user who creates the table

            create external table t2(id int,name string,age int)
            row format delimited fields terminated by ','
            location '/xx/xx/';

Note: dropping a managed table deletes both the table directory and the table's metadata.

      Dropping an external table deletes only the table's metadata; the table directory remains.
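The difference between dropping a managed table and dropping an external table can be modelled with a toy Python sketch. The TinyMetastore class below is a hypothetical stand-in invented for this example, not a Hive API; it only mimics the rule that metadata is always removed while the data directory is removed for managed tables only.

```python
import os
import shutil
import tempfile

class TinyMetastore:
    """Toy stand-in for the metastore: table name -> (directory, is_external)."""

    def __init__(self, warehouse):
        self.warehouse = warehouse
        self.tables = {}

    def create_table(self, name, location=None):
        external = location is not None
        path = location if external else os.path.join(self.warehouse, name)
        os.makedirs(path, exist_ok=True)
        self.tables[name] = (path, external)
        return path

    def drop_table(self, name):
        path, external = self.tables.pop(name)  # metadata is always removed
        if not external:
            shutil.rmtree(path)                  # data is removed only for managed tables

root = tempfile.mkdtemp()
store = TinyMetastore(os.path.join(root, "warehouse"))
managed_dir = store.create_table("t1")
external_dir = store.create_table("t2", location=os.path.join(root, "xx"))
store.drop_table("t1")
store.drop_table("t2")
print(os.path.exists(managed_dir), os.path.exists(external_dir))  # False True
```

This is why external tables suit data that other tools also read: dropping the table definition leaves the files in place.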

3. Drop a table

      drop table t1;

4. Import data into a table

      In fact, it is enough to put the data file into the table directory:    hadoop fs -put ......

      Hive commands:

            If the file is on the local disk of the Hive machine: load data local inpath '/home/t_t.dat' into table t1;

            If the file is on HDFS: load data inpath '/t_t.dat' into table t1;

      Reminder: Hive performs no checks and enforces no constraints on the data that users import.
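Because nothing is validated at load time, the schema is applied only when the data is read. The Python sketch below imitates that behaviour for a table declared as (id int, name string, age int) with ',' as the delimiter: text that cannot be parsed as an integer becomes None (the counterpart of NULL), missing trailing fields become None, and extra fields are ignored. The function names are made up for this example, and the rules are a simplified assumption about how delimited text tables are typically read, not Hive's actual parser.

```python
def parse_int(text):
    """Return an int, or None (standing in for NULL) when the text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None

def read_row(line, delimiter=","):
    """Interpret one raw line against the schema (id int, name string, age int)."""
    fields = line.split(delimiter)
    fields += [None] * (3 - len(fields))   # missing trailing fields become NULL
    id_text, name, age_text = fields[:3]   # extra fields are ignored
    return (
        parse_int(id_text) if id_text is not None else None,
        name,
        parse_int(age_text) if age_text is not None else None,
    )

for raw in ["1,zhangsan,28", "x,lisi,abc", "3,wangwu", "4,zhaoliu,20,extra"]:
    print(read_row(raw))
```

A malformed file therefore loads without error and only shows up later as unexpected NULL values in query results.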

5. Alter a table definition

      Rename a table: alter table table_name rename to new_table_name

            alter table t1 rename to t2;

      Change a column's name or type: alter table table_name change [column] col_old_name col_new_name column_type [comment col_comment] [first|after column_name]

            alter table t1 change id oid float first;

      Add or replace columns: alter table table_name add|replace columns (col_name data_type [comment col_comment], ...)

            alter table t1 add columns (gender string,phone int);

            alter table t1 replace columns (id int,age int,name string);

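6. Partitioned tables

      Partition subdirectories are created for the data files inside the table directory, so that at query time the MapReduce program can process only the data in the specified partition subdirectories. This reduces the amount of data read and improves efficiency.

      Creating a partitioned managed table:

create table t1(id int,uid string,price float,amount int)
partitioned by (day string,city string)
row format delimited fields terminated by ',';

      Mapping an existing folder to a partition of the table:

alter table t2_ex add partition(day='2018-11-25') location '/2018-11-25';

Note: a partition column must not be a column that already exists in the table definition.

      Loading data into a specific partition:

load data [local] inpath '/home/t_t.txt' into table t2 partition(day='2018-11-25',city='shengzheng');

      Querying by partition:

select count(*) from t2 where day='2018-11-25' and city='shengzheng';

Partition columns can be used like ordinary table columns, so a where clause can select the partitions to read.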
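The effect of partitioning on the directory layout and on the amount of data read can be sketched in Python. The example assumes the usual key=value naming of partition subdirectories (for instance day=2018-11-25/city=shengzheng); the helper functions, the file name data.txt and the sample rows are invented for illustration.

```python
import os
import tempfile

def partition_dir(table_dir, **partition):
    """Build the subdirectory path of one partition, e.g. day=.../city=..."""
    parts = [f"{key}={value}" for key, value in partition.items()]
    return os.path.join(table_dir, *parts)

def load_into_partition(table_dir, rows, **partition):
    """Append rows to a data file inside the partition's subdirectory."""
    path = partition_dir(table_dir, **partition)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "data.txt"), "a") as f:
        for row in rows:
            f.write(",".join(map(str, row)) + "\n")

def count_rows(table_dir, **partition):
    """Count rows by reading only the files of the requested partition."""
    path = partition_dir(table_dir, **partition)
    total = 0
    for name in os.listdir(path):
        with open(os.path.join(path, name)) as f:
            total += sum(1 for _ in f)
    return total

table = os.path.join(tempfile.mkdtemp(), "t2")
load_into_partition(table, [(1, "u1", 9.5, 2), (2, "u2", 3.0, 1)],
                    day="2018-11-25", city="shengzheng")
load_into_partition(table, [(3, "u3", 7.0, 4)],
                    day="2018-11-26", city="shengzheng")
print(count_rows(table, day="2018-11-25", city="shengzheng"))  # 2
```

The count never opens the files of the other day, which is the saving that a where clause on partition columns provides.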

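7. Create a table from an existing table

      (1) create table t_t2 like t_t1;    The new table t_t2 has the same structure as the source table t_t1 but contains no data.

      (2) create table t_t2 as select id,name from t_t1;    The table is created from the columns of the select query, and the query result is inserted into the new table.

8. Export table data to files under a given path

insert overwrite [local] directory '......'
row format delimited fields terminated by ','
select * from t1;

With local, the data is exported to files on the local disk; without it, the data is exported to HDFS.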

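V. SQL Syntax

      SQL computation model 1: row-by-row (row-wise expressions and row-wise filtering).    Example: select id,upper(name),age from t1;

      SQL computation model 2: grouped (group-wise expressions and group-wise filtering).

            Example: select gender,avg(money) from t1 where money >=1000 group by gender having avg(age) <= 23;

      The join mechanism: a join combines the data of several tables into one table that serves as the input data set of the query. Hive (at least in older releases) does not support non-equi joins, that is, join conditions other than equality.

           Cartesian product join.    Example: select a.*,b.* from a join b;

           Inner join.    Example: select a.*,b.* from a join b on a.id = b.id;

           Left outer join: every row of the left table is returned in the query's input data set.    Example: select a.*,b.* from a left join b on a.id = b.id;

           Right outer join: every row of the right table is returned in the query's input data set.    Example: select a.*,b.* from a right join b on a.id = b.id;

           Full outer join: every row of both tables is returned in the query's input data set.    Example: select a.*,b.* from a full join b on a.id = b.id;

           Left semi join: Hive-specific syntax. Rows are matched as in an inner join, but only the left table's part is returned as the query's input set.

            It is equivalent to: select a.* from a where id in (select distinct id from b);

            Example: select a.id,a.name from a left semi join b on a.id = b.id;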
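The join variants are easier to compare on concrete rows. Since Hive cannot be run from a plain interpreter, the sketch below uses Python's built-in sqlite3 module as a stand-in; the tables a and b and their rows are invented for this example. SQLite has no left semi join keyword, so the semi join is written in the equivalent in (subquery) form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table a (id integer, name text);
    create table b (id integer, score integer);
    insert into a values (1, 'x'), (2, 'y'), (3, 'z');
    insert into b values (1, 10), (1, 11), (4, 40);
""")

def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Cartesian product: every row of a paired with every row of b (3 x 3 rows).
cartesian = rows("select a.*, b.* from a join b")

# Inner join: only the pairs whose ids match.
inner = rows("select a.id, b.score from a join b on a.id = b.id order by b.score")

# Left outer join: unmatched rows of a are kept, with NULL for b's columns.
left = rows("select a.id, b.score from a left join b on a.id = b.id "
            "order by a.id, b.score")

# Semi join: rows of a that have a match in b, each returned once.
semi = rows("select a.id, a.name from a where a.id in (select id from b)")

print(len(cartesian), inner, left, semi)
```

Note that id 1 appears twice in the inner join because b holds two matching rows, but only once in the semi join.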

      Subquery: in essence, the result set of one select query is used as the input data set of the next query.

select a.city,a.city_sum
from
(select city,sum(price*amount) as city_sum
from t1
group by city) a
where a.city_sum>300;

      order by sorting: order by is always written at the end of a select statement, before limit. limit n restricts the number of rows the select returns.

select city,sum(price*amount) as city_sum
from t1
group by city
order by city_sum asc
limit 2;
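The two queries above can be tried on a few sample rows. This sketch again uses Python's sqlite3 module as a stand-in for Hive; the rows inserted into t1 are invented for illustration, and an order by is added to the first query only to make the printed order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (id integer, city text, price real, amount integer);
    insert into t1 values
        (1, 'beijing', 100.0, 2),
        (2, 'beijing', 50.0, 4),
        (3, 'shanghai', 30.0, 5),
        (4, 'guangzhou', 200.0, 3);
""")

# The inner query aggregates per city; the outer query filters its result set.
filtered = conn.execute("""
    select a.city, a.city_sum
    from (select city, sum(price * amount) as city_sum
          from t1
          group by city) a
    where a.city_sum > 300
    order by a.city_sum asc
""").fetchall()

# order by runs after grouping, and limit trims the sorted result.
smallest_two = conn.execute("""
    select city, sum(price * amount) as city_sum
    from t1
    group by city
    order by city_sum asc
    limit 2
""").fetchall()

print(filtered)      # [('beijing', 400.0), ('guangzhou', 600.0)]
print(smallest_two)  # [('shanghai', 150.0), ('beijing', 400.0)]
```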

      The in filter clause: select a.* from a where id in (select distinct id from b);

      The distinct keyword: no expression may appear before distinct, and the expressions after distinct are deduplicated as a combination.

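VI. Data Types

1. Numeric types

      tinyint (1-byte integer); smallint (2-byte integer); int/integer (4-byte integer); bigint (8-byte integer); float (4-byte floating point); double (8-byte double-precision floating point)

2. String types

      string; varchar(20) (string with a declared length from 1 to 65535; longer values are truncated); char (string, maximum length 255)

3. BOOLEAN

      true; false

4. Date/time types

      timestamp (holds year, month, day, hour, minute, second and fractional seconds); date (holds year, month and day only)

5. array type

      array<data_type>

6. map type

      map<primitive_type, data_type>

7. struct type

      struct<col_name : data_type, ...>

      A struct can be used when a single column should describe a whole user record.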

VII. Common Built-in Functions

1. Type conversion functions

select cast("10" as int);

select cast("2018-11-25" as date);

select cast(current_timestamp as date);

2. Math functions

select round(2.5);  ## 3  rounds half up

select round(2.2315,3);  ## 2.232

select ceil(2.2);  ## 3  rounds up; select ceiling(2.2); is equivalent

select floor(2.2);  ## 2  rounds down

select abs(-2.2);  ## 2.2  absolute value

select greatest(id1,id2,id3);  ## row-level function: the largest of several input arguments

select least(2,3,7);  ## row-level function: the smallest of several input arguments
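The rounding rule is worth checking by hand, because not every language rounds the same way. The Python sketch below reproduces round-half-up with the decimal module and contrasts it with Python's built-in round, which rounds halves to the nearest even number. The helper name hive_style_round is made up for this example, and treating the input as decimal text is a simplifying assumption.

```python
import math
from decimal import Decimal, ROUND_HALF_UP

def hive_style_round(value, digits=0):
    """Round half up; value is given as text so no binary float error creeps in."""
    quantum = Decimal(10) ** -digits          # digits=3 -> Decimal('0.001')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

print(hive_style_round("2.5"))        # 3
print(hive_style_round("2.2315", 3))  # 2.232
print(round(2.5))                     # 2, Python rounds halves to even

# Counterparts of the other functions in the list above.
print(math.ceil(2.2), math.floor(2.2), abs(-2.2), max(2, 3, 7), min(2, 3, 7))
```

The last digit of 2.2315 is a 5, so rounding half up to three places gives 2.232 rather than 2.231.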

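3. String functions

upper(string str)  ## converts to upper case

lower(string str)  ## converts to lower case

substr(string str, int start)  ## extracts a substring

substring(string str, int start)

substr(string, int start, int len)

substring(string, int start, int len)

concat(string A, string B...)  ## concatenates strings

concat_ws(string SEP, string A, string B...)  ## concatenates strings with the separator SEP

length(string A)  ## returns the length of the string

split(string str, string pat)  ## splits a string and returns an array

Note: select split("192.168.33.44","."); is wrong, because . is a special character in regular-expression syntax. Escape it instead: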

select split("192.168.33.44","\\.") ;
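The reason the unescaped dot fails is that the second argument of split is a regular expression, in which . matches any character. The same effect can be seen with Python's re module, used here only as an illustration of the regex behaviour and not as a model of Hive's implementation.

```python
import re

ip = "192.168.33.44"

# "." as a regular expression matches every character, so only empty pieces remain.
unescaped = re.split(".", ip)

# An escaped dot matches only the literal "." character.
escaped = re.split(r"\.", ip)

print(unescaped)
print(escaped)  # ['192', '168', '33', '44']
```

In the HiveQL string literal the backslash itself has to be escaped, which is why the pattern is written as "\\." there.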

4. Date/time functions

select current_timestamp;  ## return type: timestamp; the current timestamp (full date and time)

select current_date;  ## return type: date; the current date

Unix timestamp to formatted string: from_unixtime

from_unixtime(bigint unixtime[, string format])

Examples: select from_unixtime(unix_timestamp());

select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");

Formatted string to Unix timestamp: unix_timestamp. The return value is a long integer.

With no argument, it returns the current time as a timestamp in seconds, that is, the number of seconds elapsed since 1970-01-01 00:00:00 UTC.

select unix_timestamp();

unix_timestamp(string date, string pattern)

Examples: select unix_timestamp("2018-11-25 10:00:00");

select unix_timestamp("2018-08-10 10:00:00","yyyy-MM-dd HH:mm:ss");

Convert a string to a date:

select to_date("2018-11-25 10:00:00");
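The relationship between the three conversions can be sketched with Python's datetime module. The functions below borrow the Hive names only for readability; they are assumptions written for this example, they always work in UTC (Hive uses the time zone of its environment), and they take strftime patterns such as %Y/%m/%d instead of the Java-style yyyy/MM/dd patterns.

```python
from datetime import datetime, timezone

def from_unixtime(seconds, fmt="%Y-%m-%d %H:%M:%S"):
    """Format a Unix timestamp (seconds) as text, here always in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)

def unix_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse text into a Unix timestamp (seconds), treating the text as UTC."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

def to_date(text):
    """Keep only the date part of a 'YYYY-MM-DD HH:MM:SS' string."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").date().isoformat()

stamp = unix_timestamp("2018-11-25 10:00:00")
print(from_unixtime(stamp))                        # 2018-11-25 10:00:00
print(from_unixtime(stamp, "%Y/%m/%d %H:%M:%S"))   # 2018/11/25 10:00:00
print(to_date("2018-11-25 10:00:00"))              # 2018-11-25
```

Converting to a timestamp and back with the same pattern returns the original text, which is a handy sanity check for a format string.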

5. Conditional functions

      if          select id,if(age>18,'adult','children') from t1;

      case when

            case
                  when condition1 then result1
                  when condition2 then result2
                  ...
                  when conditionN then resultN
            end

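6. Collection functions

array(5,4,6,1)   constructs an integer array

array('hello','hi','nihao')   constructs a string array

array_contains(Array<T>, value)   tests whether the array contains the value; returns a boolean

sort_array(Array<T>)   returns the sorted array

size(Array<T>)   returns the length of an array as an int

size(Map<K,V>)   returns the number of entries in a map as an int

map_keys(Map<K,V>)   returns all keys of a map column; the result type is an array

map_values(Map<K,V>)   returns all values of a map column; the result type is an array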
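For readers who know Python better than HiveQL, the collection functions map closely onto list and dict operations. The sketch below defines Python functions with the same names purely as an analogy; the sample array and the sample map are invented for this example.

```python
numbers = [5, 4, 6, 1]
family = {"father": "xiaoming", "mother": "xiaohong"}

def array_contains(array, value):
    """True when the value occurs in the array."""
    return value in array

def sort_array(array):
    """Return a new array sorted in ascending order."""
    return sorted(array)

def size(collection):
    """Number of elements of an array or number of entries of a map."""
    return len(collection)

def map_keys(mapping):
    """All keys of the map, as an array."""
    return list(mapping.keys())

def map_values(mapping):
    """All values of the map, as an array."""
    return list(mapping.values())

print(array_contains(numbers, 6), sort_array(numbers), size(numbers), size(family))
print(map_keys(family), map_values(family))
```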

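7. Common grouped aggregate functions

sum(column) : the sum of all values of the column within a group

avg(column) : the average of all values of the column within a group

max(column) : the largest value of the column within a group

min(column) : the smallest value of the column within a group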
