转】R利剑NoSQL系列文章 之 Hive
原博文出自于: http://blog.fens.me/category/%E6%95%B0%E6%8D%AE%E5%BA%93/page/3/ 感謝!
?
?
?
?
?
Posted:
Jul 27, 2013Tags:
HadoophivelinuxRRHivesqlComments:
11 CommentsR利劍NoSQL系列文章 之 Hive
R利劍NoSQL系列文章,主要介紹通過R語言連接使用nosql數(shù)據(jù)庫。涉及的NoSQL產(chǎn)品,包括Redis,MongoDB,?HBase,?Hive,?Cassandra,?Neo4j。希望通過我的介紹讓廣大的R語言愛好者,有更多的開發(fā)選擇,做出更多地激動人心的應(yīng)用。
關(guān)于作者:
- 張丹(Conan), 程序員Java,R,PHP,Javascript
- weibo:@Conan_Z
- blog: http://blog.fens.me
- email: bsspirit@gmail.com
轉(zhuǎn)載請注明:
http://blog.fens.me/nosql-r-hive/
第四篇 R利劍Hive,分為5個章節(jié)。
1. Hive介紹
Hive是建立在Hadoop上的數(shù)據(jù)倉庫基礎(chǔ)構(gòu)架。它提供了一系列的工具,可以用來進行數(shù)據(jù)提取轉(zhuǎn)化加載(ETL),這是一種可以存儲、查詢和分析存儲在 Hadoop 中的大規(guī)模數(shù)據(jù)的機制。Hive 定義了簡單的類 SQL 查詢語言,稱為 HQL,它允許熟悉 SQL 的用戶查詢數(shù)據(jù)。同時,這個語言也允許熟悉 MapReduce 開發(fā)者的開發(fā)自定義的 mapper 和 reducer 來處理內(nèi)建的 mapper 和 reducer 無法完成的復(fù)雜的分析工作。
Hive 沒有專門的數(shù)據(jù)格式。 Hive 可以很好的工作在 Thrift 之上,控制分隔符,也允許用戶指定數(shù)據(jù)格式
上面內(nèi)容摘自 百度百科(http://baike.baidu.com/view/699292.htm)
hive與關(guān)系數(shù)據(jù)庫的區(qū)別:
- 數(shù)據(jù)存儲不同:hive基于hadoop的HDFS,關(guān)系數(shù)據(jù)庫則基于本地文件系統(tǒng)
- 計算模型不同:hive基于hadoop的mapreduce,關(guān)系數(shù)據(jù)庫則基于索引的內(nèi)存計算模型
- 應(yīng)用場景不同:hive是OLAP數(shù)據(jù)倉庫系統(tǒng)提供海量數(shù)據(jù)查詢的,實時性很差;關(guān)系數(shù)據(jù)庫是OLTP事務(wù)系統(tǒng),為實時查詢業(yè)務(wù)服務(wù)
- 擴展性不同:hive基于hadoop很容易通過分布式增加存儲能力和計算能力,關(guān)系數(shù)據(jù)庫水平擴展很難,要不斷增加單機的性能
2. Hive安裝
Hive是基于Hadoop開發(fā)的數(shù)據(jù)倉庫產(chǎn)品,所以首先我們要先有Hadoop的環(huán)境。
Hadoop安裝,請參考:Hadoop環(huán)境搭建,?創(chuàng)建Hadoop母體虛擬機
Hive的安裝,請參考:Hive安裝及使用攻略
Hadoop-1.0.3的下載地址
http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/
Hive-0.9.0的下載地址
http://archive.apache.org/dist/hive/hive-0.9.0/
Hive安裝好后
啟動hiveserver的服務(wù)
打開hive shell
~ hive shell Logging initialized using configuration in file:/home/conan/hadoop/hive-0.9.0/conf/hive-log4j.proper ties Hive history file=/tmp/conan/hive_job_log_conan_201306261459_153868095.txt#查看hive的表 hive> show tables; hive_algo_t_account o_account r_t_account Time taken: 2.12 seconds#查看o_account表的數(shù)據(jù) hive> select * from o_account; 1 abc@163.com 2013-04-22 12:21:39 2 dedac@163.com 2013-04-22 12:21:39 3 qq8fed@163.com 2013-04-22 12:21:39 4 qw1@163.com 2013-04-22 12:21:39 5 af3d@163.com 2013-04-22 12:21:39 6 ab34@163.com 2013-04-22 12:21:39 7 q8d1@gmail.com 2013-04-23 09:21:24 8 conan@gmail.com 2013-04-23 09:21:24 9 adeg@sohu.com 2013-04-23 09:21:24 10 ade121@sohu.com 2013-04-23 09:21:24 11 addde@sohu.com 2013-04-23 09:21:24 Time taken: 0.469 seconds3. RHive安裝
請?zhí)崆芭渲煤肑AVA的環(huán)境:
~ java -version java version "1.6.0_29" Java(TM) SE Runtime Environment (build 1.6.0_29-b11) Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)安裝R:Ubuntu 12.04,請更新源再下載R2.15.3版本
~ sudo sh -c "echo deb http://mirror.bjtu.edu.cn/cran/bin/linux/ubuntu precise/ >>/etc/apt/sources.list" ~ sudo apt-get update ~ sudo apt-get install r-base-core=2.15.3-1precise0precise1安裝R依賴庫:rjava
#配置rJava ~ sudo R CMD javareconf#啟動R程序 ~ sudo R install.packages("rJava")安裝RHive
install.packages("RHive") library(RHive) Loading required package: rJava Loading required package: Rserve This is RHive 0.0-7. For overview type ‘?RHive’. HIVE_HOME=/home/conan/hadoop/hive-0.9.0 call rhive.init() because HIVE_HOME is set.由于RHive已經(jīng)從CRAN上移除,需要動手下載安裝,下載地址:https://cran.r-project.org/src/contrib/Archive/RHive/。我們需要動手下載RHive_0.0-7.tar.gz包,然后通過命令進行安裝。
# 安裝RHive ~ R CMD INSTALL RHive_0.0-7.tar.gz4. RHive函數(shù)庫
rhive.aggregate rhive.connect rhive.hdfs.exists rhive.mapapply rhive.assign rhive.desc.table rhive.hdfs.get rhive.mrapply rhive.basic.by rhive.drop.table rhive.hdfs.info rhive.napply rhive.basic.cut rhive.env rhive.hdfs.ls rhive.query rhive.basic.cut2 rhive.exist.table rhive.hdfs.mkdirs rhive.reduceapply rhive.basic.merge rhive.export rhive.hdfs.put rhive.rm rhive.basic.mode rhive.exportAll rhive.hdfs.rename rhive.sapply rhive.basic.range rhive.hdfs.cat rhive.hdfs.rm rhive.save rhive.basic.scale rhive.hdfs.chgrp rhive.hdfs.tail rhive.script.export rhive.basic.t.test rhive.hdfs.chmod rhive.init rhive.script.unexport rhive.basic.xtabs rhive.hdfs.chown rhive.list.tables rhive.size.table rhive.big.query rhive.hdfs.close rhive.load rhive.write.table rhive.block.sample rhive.hdfs.connect rhive.load.table rhive.close rhive.hdfs.du rhive.load.table2Hive和RHive的基本操作對比:
#連接到hive Hive: hive shell RHive: rhive.connect("192.168.1.210")#列出所有hive的表 Hive: show tables; RHive: rhive.list.tables()#查看表結(jié)構(gòu) Hive: desc o_account; RHive: rhive.desc.table('o_account'), rhive.desc.table('o_account',TRUE)#執(zhí)行HQL查詢 Hive: select * from o_account; RHive: rhive.query('select * from o_account')#查看hdfs目錄 Hive: dfs -ls /; RHive: rhive.hdfs.ls()#查看hdfs文件內(nèi)容 Hive: dfs -cat /user/hive/warehouse/o_account/part-m-00000; RHive: rhive.hdfs.cat('/user/hive/warehouse/o_account/part-m-00000')#斷開連接 Hive: quit; RHive: rhive.close()5. RHive基本使用操作
#初始化 rhive.init()#連接hive rhive.connect("192.168.1.210")#查看所有表 rhive.list.tables()tab_name 1 hive_algo_t_account 2 o_account 3 r_t_account#查看表結(jié)構(gòu) rhive.desc.table('o_account');col_name data_type comment 1 id int 2 email string 3 create_date string#執(zhí)行HQL查詢 rhive.query("select * from o_account");id email create_date 1 1 abc@163.com 2013-04-22 12:21:39 2 2 dedac@163.com 2013-04-22 12:21:39 3 3 qq8fed@163.com 2013-04-22 12:21:39 4 4 qw1@163.com 2013-04-22 12:21:39 5 5 af3d@163.com 2013-04-22 12:21:39 6 6 ab34@163.com 2013-04-22 12:21:39 7 7 q8d1@gmail.com 2013-04-23 09:21:24 8 8 conan@gmail.com 2013-04-23 09:21:24 9 9 adeg@sohu.com 2013-04-23 09:21:24 10 10 ade121@sohu.com 2013-04-23 09:21:24 11 11 addde@sohu.com 2013-04-23 09:21:24#關(guān)閉連接 rhive.close() [1] TRUE創(chuàng)建臨時表
rhive.block.sample('o_account', subset="id<5") [1] "rhive_sblk_1372238856"rhive.query("select * from rhive_sblk_1372238856");id email create_date 1 1 abc@163.com 2013-04-22 12:21:39 2 2 dedac@163.com 2013-04-22 12:21:39 3 3 qq8fed@163.com 2013-04-22 12:21:39 4 4 qw1@163.com 2013-04-22 12:21:39#查看hdfs的文件 rhive.hdfs.ls('/user/hive/warehouse/rhive_sblk_1372238856/')permission owner group length modify-time 1 rw-r--r-- conan supergroup 141 2013-06-26 17:28file 1 /user/hive/warehouse/rhive_sblk_1372238856/000000_0rhive.hdfs.cat('/user/hive/warehouse/rhive_sblk_1372238856/000000_0') 1abc@163.com2013-04-22 12:21:39 2dedac@163.com2013-04-22 12:21:39 3qq8fed@163.com2013-04-22 12:21:39 4qw1@163.com2013-04-22 12:21:39按范圍分割字段數(shù)據(jù)
rhive.basic.cut('o_account','id',breaks='0:100:3') [1] "rhive_result_20130626173626" attr(,"result:size") [1] 443rhive.query("select * from rhive_result_20130626173626");email create_date id 1 abc@163.com 2013-04-22 12:21:39 (0,3] 2 dedac@163.com 2013-04-22 12:21:39 (0,3] 3 qq8fed@163.com 2013-04-22 12:21:39 (0,3] 4 qw1@163.com 2013-04-22 12:21:39 (3,6] 5 af3d@163.com 2013-04-22 12:21:39 (3,6] 6 ab34@163.com 2013-04-22 12:21:39 (3,6] 7 q8d1@gmail.com 2013-04-23 09:21:24 (6,9] 8 conan@gmail.com 2013-04-23 09:21:24 (6,9] 9 adeg@sohu.com 2013-04-23 09:21:24 (6,9] 10 ade121@sohu.com 2013-04-23 09:21:24 (9,12] 11 addde@sohu.com 2013-04-23 09:21:24 (9,12]Hive操作HDFS
#查看hdfs文件目錄 rhive.hdfs.ls()permission owner group length modify-time file 1 rwxr-xr-x conan supergroup 0 2013-04-24 01:52 /hbase 2 rwxr-xr-x conan supergroup 0 2013-06-23 10:59 /home 3 rwxr-xr-x conan supergroup 0 2013-06-26 11:18 /rhive 4 rwxr-xr-x conan supergroup 0 2013-06-23 13:27 /tmp 5 rwxr-xr-x conan supergroup 0 2013-04-24 19:28 /user#查看hdfs文件內(nèi)容 rhive.hdfs.cat('/user/hive/warehouse/o_account/part-m-00000') 1abc@163.com2013-04-22 12:21:39 2dedac@163.com2013-04-22 12:21:39 3qq8fed@163.com2013-04-22 12:21:39轉(zhuǎn)載請注明:
http://blog.fens.me/nosql-r-hive/
?
總結(jié)
以上是生活随笔為你收集整理的转】R利剑NoSQL系列文章 之 Hive的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: .NET仓储模式高级用例
- 下一篇: 高性能服务器架构思路(五)——分布式缓存