An Alternative Way of Loading or Importing Data into Hive Tables Running on Top of an HDFS-Based Data Lake

Before penning down this article, I would like to extend my appreciation to all the healthcare workers, from the cleaning and sanitation staff to the nurses, doctors, and everyone else who is continuously battling to save humanity from the ongoing Covid-19 pandemic across the globe.

The fundamental goal of this article is to show how we can load or import data into Hive tables without explicitly executing the "load" command. With this approach, data scientists can query, or even visualize directly in various data visualization tools, for quick investigation in a scenario where raw data is continuously ingested into an HDFS-based data lake from external sources on a consistent schedule. Otherwise, the "load" command would have to be executed as an additional step to stack the processed data into Hive's tables. Here we are considering an existing environment with the following components, set up either on the cloud or on-premises.

  • A multi-node cluster where HDFS is installed and configured, with Hive running on top of HDFS and a MySQL database as the metastore.

  • Raw data is assumed to be dumped from multiple sources into the HDFS data lake landing zone by leveraging Kafka, Flume, customized data ingestion tools, etc.

  • From the landing zone, raw data moves to the refining zone to clean out junk, and subsequently into the processing zone, where the clean data gets processed. Here we assume the processed data is stored in text files in CSV format.

Hive input is directory-based, similar to many Hadoop tools: the input for an operation is taken as all the files in a given directory. Using the HDFS command line, let's create a directory in HDFS with "$ hdfs dfs -mkdir <<name of the folder>>". The same can be done through the Hadoop administrative UI, depending on the user's HDFS ACL settings. Now move the data files from the processing zone into the newly created HDFS folder. As an example, here we consider simple order data that was ingested into the data lake and eventually transformed into consolidated text files in CSV format after cleaning and filtering; the staging steps are sketched below.

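The following is a rough sketch of those staging steps; the directory path and file name are illustrative assumptions rather than values from the original article.

$ hdfs dfs -mkdir -p /data/orderdata                              # create the target directory in HDFS (path is illustrative)
$ hdfs dfs -put /processing-zone/orders.csv /data/orderdata/      # stage the processed CSV file(s) from the processing zone
$ hdfs dfs -ls /data/orderdata                                    # confirm the files landed in the new folder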

The next step is to create an external table in Hive, where the location is the path of the HDFS directory created in the previous step. Below is a command we could use to create the external table from the Hive CLI; the LOCATION clause tells Hive where to find the input files.

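A minimal sketch of such a statement is shown below. The column names and types are illustrative assumptions (the original article's exact schema is not reproduced), while the delimiter, storage format, and LOCATION match the CSV files staged above.

hive> CREATE EXTERNAL TABLE IF NOT EXISTS OrderData (
          order_id    STRING,     -- illustrative columns; adjust to the actual CSV layout
          customer_id STRING,
          product_id  STRING,
          quantity    INT,
          price       DOUBLE,
          order_date  STRING
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/data/orderdata';

Because the table is EXTERNAL, dropping it later removes only the metadata in the metastore; the CSV files under /data/orderdata stay untouched.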

If the command worked, an OK will be printed, and when a Hive query is executed, the Hive engine fetches the data from these input text files internally by leveraging a processing engine such as MapReduce, or others like Spark and Tez. Ideally, Spark or Tez can be configured as the processing engine in hive-site.xml in order to improve data processing speed over a huge volume of input files.

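For reference, the engine can also be switched per session from the Hive CLI via the same hive.execution.engine property that hive-site.xml uses, assuming Tez (or Spark) is already installed on the cluster:

hive> SET hive.execution.engine=tez;      -- valid values are mr, tez and spark; this overrides hive-site.xml for the session
hive> SELECT COUNT(*) FROM OrderData;     -- subsequent queries in this session now run on Tez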

Once the table creation is successful, we can cross-check it in the "metastore" schema in the MySQL database. To do that, log in to the MySQL CLI, which might be running on a different node in the cluster, connect to the "metastore" database, and pull records from the "TBLS" table. This displays the information of the created Hive table.

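A small sketch of that cross-check, assuming the metastore database is named "metastore" (a common default); TBLS is the standard metastore table holding table definitions:

mysql> USE metastore;
mysql> SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM TBLS;   -- the new table should appear with TBL_TYPE = EXTERNAL_TABLE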

The import can be verified through the Hive CLI by listing the first few rows in the table.

hive> Select * from OrderData;

Additionally, the "ANALYZE TABLE ... COMPUTE STATISTICS" command can be executed in the Hive CLI to gather and view detailed statistics about the table, such as the number of files, the number of rows, and the total data size.

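A brief sketch using standard Hive syntax, with the table name from the example above:

hive> ANALYZE TABLE OrderData COMPUTE STATISTICS;     -- launches a job that gathers the statistics
hive> DESCRIBE FORMATTED OrderData;                   -- numFiles, numRows, totalSize etc. appear under Table Parameters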

The primary advantage of this approach is that data can be queried, analyzed, etc. within a minimal span of time, without additionally performing an explicit data loading operation. It also helps data scientists check the quality of the data before running their machine learning jobs on the data lake or cluster. You can read here how to install and configure Apache Hive on a multi-node Hadoop cluster with MySQL as the metastore.

Written by Gautam Goswami

Enthusiastic about learning and sharing knowledge on Big Data and related advancements. Plays at the intersection of innovation, music, and craftsmanship.

Originally published at https://dataview.in on August 4, 2020.

Translated from: https://medium.com/@gautambangalore/an-alternative-way-of-loading-or-importing-data-into-hive-tables-running-on-top-of-hdfs-based-data-d3eee419eb46
