Hive: Query customer top search queries and product views count using Apache Hive

Published: 2023/12/3

apache hive

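This post covers how to use Apache Hive to query the search click data stored under Hadoop. As examples, we will generate each customer's top search queries and statistics on total product views.

This post continues the previous posts:

Customer product search clicks analytics using big data,
Flume: Gathering customer product search clicks data using Apache Flume,

We already have the customer search click data gathered in Hadoop HDFS using Flume.

Here we go a step further and use Hive to query the data stored under Hadoop.

Hive

Hive lets us query big data using HiveQL, a SQL-like language.

Hadoop Data

As shared in the last post, the search click data is stored under Hadoop in the format "/searchevents/2014/05/15/16/". The data is stored in a separate directory created for each hour.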

The files are created as:

hdfs://localhost.localdomain:54321/searchevents/2014/05/06/16/searchevents.1399386809864
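As a rough illustration of the hourly layout, the directory for an event could be derived from its timestamp as sketched below. The class name HourlyPath, the method forTimestamp, and the choice of time zone are assumptions made for this example. The post does not say which zone is used to name the directories, so the zone is passed in explicitly.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the original project.
final class HourlyPath {
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // Builds "/searchevents/<year>/<month>/<day>/<hour>/" for an epoch-millis timestamp.
    static String forTimestamp(long epochMillis, ZoneId zone) {
        return "/searchevents/"
                + HOURLY.format(Instant.ofEpochMilli(epochMillis).atZone(zone))
                + "/";
    }

    public static void main(String[] args) {
        // 1399386809864 is the numeric suffix of the sample file name above.
        System.out.println(forTimestamp(1399386809864L, ZoneOffset.UTC));
        System.out.println(forTimestamp(1399386809864L, ZoneOffset.ofHours(2)));
    }
}
```

With a UTC+2 offset the sample timestamp falls in hour 16, which agrees with the sample directory; in UTC it falls in hour 14. Whichever zone names the directories must also be used when adding Hive partitions, otherwise events end up attributed to the wrong hour.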

The data is stored as a DataStream:

{"eventid":"e8470a00-c869-4a90-89f2-f550522f8f52-1399386809212-72","hostedmachinename":"192.168.182.1334","pageurl":"http://jaibigdata.com/0","customerid":72,"sessionid":"7871a55c-a950-4394-bf5f-d2179a553575","querystring":null,"sortorder":"desc","pagenumber":0,"totalhits":8,"hitsshown":44,"createdtimestampinmillis":1399386809212,"clickeddocid":"23","favourite":null,"eventidsuffix":"e8470a00-c869-4a90-89f2-f550522f8f52","filters":[{"code":"searchfacettype_brand_level_2","value":"Apple"},{"code":"searchfacettype_color_level_2","value":"Blue"}]}
{"eventid":"2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0-1399386809743-61","hostedmachinename":"192.168.182.1330","pageurl":"http://jaibigdata.com/0","customerid":61,"sessionid":"78286f6d-cc1e-489c-85ce-a7de8419d628","querystring":"queryString59","sortorder":"asc","pagenumber":3,"totalhits":32,"hitsshown":9,"createdtimestampinmillis":1399386809743,"clickeddocid":null,"favourite":null,"eventidsuffix":"2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0","filters":[{"code":"searchfacettype_age_level_2","value":"0-12 years"}]}

Spring Data

We will use Spring for Apache Hadoop to run Hive jobs from Spring. To set up the Hive environment in your application, use the following configuration:

<hdp:configuration id="hadoopConfiguration" resources="core-site.xml">
    fs.default.name=hdfs://localhost.localdomain:54321
    mapred.job.tracker=localhost.localdomain:54310
</hdp:configuration>

<hdp:hive-server auto-startup="true" port="10234" min-threads="3"
    id="hiveServer" configuration-ref="hadoopConfiguration">
</hdp:hive-server>

<hdp:hive-client-factory id="hiveClientFactory" host="localhost" port="10234">
</hdp:hive-client-factory>

<hdp:hive-runner id="hiveRunner" run-at-startup="false"
    hive-client-factory-ref="hiveClientFactory">
</hdp:hive-runner>

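Check the Spring context file applicationContext-elasticsearch.xml for more details. We will use hiveRunner to run the Hive scripts.

All Hive scripts in the application are located under the resources hive folder.
The service that runs all the Hive scripts can be found in HiveSearchClicksServiceImpl.java.

Set up the database

Let's set up the database first so that we can query the data.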

DROP DATABASE IF EXISTS search CASCADE;
CREATE DATABASE search;

Query search events using an external table

We will create an external table search_clicks to read the search event data stored under Hadoop.

USE search;

CREATE EXTERNAL TABLE IF NOT EXISTS search_clicks (
    eventid STRING,
    customerid BIGINT,
    hostedmachinename STRING,
    pageurl STRING,
    totalhits INT,
    querystring STRING,
    sessionid STRING,
    sortorder STRING,
    pagenumber INT,
    hitsshown INT,
    clickeddocid STRING,
    filters ARRAY<STRUCT<code:STRING, value:STRING>>,
    createdtimestampinmillis BIGINT
)
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
ROW FORMAT SERDE 'org.jai.hive.serde.JSONSerDe'
LOCATION 'hdfs:///searchevents/';

JSONSerDe

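The custom SerDe "org.jai.hive.serde.JSONSerDe" is used to map the JSON data. Check JSONSerDe.java for more details.

If you run the queries from Eclipse itself, the dependencies are resolved automatically. If you run them from the Hive console, make sure you first create a jar file for the class and add the class and its dependencies to the Hive console before running the Hive queries.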

# create the Hive JSON SerDe jar
jar cf jaihivejsonserde-1.0.jar org/jai/hive/serde/JSONSerDe.class

# run on the Hive console to add the jar
add jar /opt/hive/lib/jaihivejsonserde-1.0.jar;

# or add the jar path to hive-site.xml permanently
<property>
    <name>hive.aux.jars.path</name>
    <value>/opt/hive/lib/jaihivejsonserde-1.0.jar</value>
</property>

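Create Hive partitions

We will use the Hive partitioning strategy to read the data stored in Hadoop under hierarchical locations. Based on the location "/searchevents/2014/05/06/16/" above, we pass the following parameter values (DBNAME=search, TBNAME=search_clicks, YEAR=2014, MONTH=05, DAY=06, HOUR=16).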

USE ${hiveconf:DBNAME};

ALTER TABLE ${hiveconf:TBNAME} ADD IF NOT EXISTS
PARTITION (year='${hiveconf:YEAR}', month='${hiveconf:MONTH}', day='${hiveconf:DAY}', hour='${hiveconf:HOUR}')
LOCATION "hdfs:///searchevents/${hiveconf:YEAR}/${hiveconf:MONTH}/${hiveconf:DAY}/${hiveconf:HOUR}/";

To run the script:

Collection<HiveScript> scripts = new ArrayList<>();

// Values substituted for the ${hiveconf:...} variables in the script
Map<String, String> args = new HashMap<>();
args.put("DBNAME", dbName);
args.put("TBNAME", tbName);
args.put("YEAR", year);
args.put("MONTH", month);
args.put("DAY", day);
args.put("HOUR", hour);

HiveScript script = new HiveScript(new ClassPathResource("hive/add_partition_searchevents.q"), args);
scripts.add(script);
hiveRunner.setScripts(scripts);
hiveRunner.call();
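The parameter values passed to the script could also be derived from the directory path rather than supplied one by one. The following is a minimal sketch under that assumption; PartitionArgs and fromPath are hypothetical names for this example and are not part of the original project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original project.
final class PartitionArgs {
    // Splits "/searchevents/2014/05/06/16/" into the hiveconf variables used by the script.
    static Map<String, String> fromPath(String dbName, String tbName, String path) {
        // Drop leading and trailing slashes, then split into the five path segments.
        String[] parts = path.replaceAll("^/+|/+$", "").split("/");
        if (parts.length != 5) {
            throw new IllegalArgumentException(
                    "Expected <root>/<year>/<month>/<day>/<hour>: " + path);
        }
        Map<String, String> args = new LinkedHashMap<>();
        args.put("DBNAME", dbName);
        args.put("TBNAME", tbName);
        args.put("YEAR", parts[1]);
        args.put("MONTH", parts[2]);
        args.put("DAY", parts[3]);
        args.put("HOUR", parts[4]);
        return args;
    }

    public static void main(String[] a) {
        System.out.println(fromPath("search", "search_clicks", "/searchevents/2014/05/06/16/"));
    }
}
```

The values are kept as strings so that leading zeros such as "05" and "06" survive, which matters because the partition columns are declared as STRING and the location path uses the zero-padded form.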

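In a later post, we will cover how to use an Oozie coordinator job to create Hive partitions for the hourly data automatically.

Get all search click events

Get the search events stored in the external table search_clicks. Pass the following parameter values (DBNAME=search, TBNAME=search_clicks, YEAR=2014, MONTH=05, DAY=06, HOUR=16).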

USE ${hiveconf:DBNAME};

select eventid, customerid, querystring, filters
from ${hiveconf:TBNAME}
where year='${hiveconf:YEAR}' and month='${hiveconf:MONTH}' and day='${hiveconf:DAY}' and hour='${hiveconf:HOUR}';

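This returns all the data under the specified location and also helps you test your custom SerDe.

Find product views in the last 30 days

How many times a product has been viewed/clicked in the last n days.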

USE search;

DROP TABLE IF EXISTS search_productviews;

CREATE TABLE search_productviews(id STRING, productid BIGINT, viewcount INT);

-- product views count in the last 30 days.
INSERT INTO TABLE search_productviews
select clickeddocid as id, clickeddocid as productid, count(*) as viewcount
from search_clicks
where clickeddocid is not null
  and createdTimeStampInMillis > ((unix_timestamp() * 1000) - 2592000000)
group by clickeddocid
order by productid;

To run the script:

Collection<HiveScript> scripts = new ArrayList<>();
HiveScript script = new HiveScript(new ClassPathResource("hive/load-search_productviews-table.q"));
scripts.add(script);
hiveRunner.setScripts(scripts);
hiveRunner.call();

Sample data, selected from the "search_productviews" table:

# id, productid, viewcount
61, 61, 15
48, 48, 8
16, 16, 40
85, 85, 7
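The constant 2592000000 in the query above is 30 days expressed in milliseconds, and unix_timestamp() returns seconds, hence the multiplication by 1000. The same window could be computed on the Java side as sketched below; ThirtyDayWindow and its methods are hypothetical names for this example.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the original project.
final class ThirtyDayWindow {
    // 30 days in milliseconds: 30 * 24 * 60 * 60 * 1000 = 2,592,000,000.
    static final long WINDOW_MILLIS = TimeUnit.DAYS.toMillis(30);

    // Oldest createdtimestampinmillis value that is still excluded from the window.
    static long cutoff(long nowMillis) {
        return nowMillis - WINDOW_MILLIS;
    }

    // Same comparison as the WHERE clause: created > (now - 30 days).
    static boolean inWindow(long createdMillis, long nowMillis) {
        return createdMillis > cutoff(nowMillis);
    }

    public static void main(String[] args) {
        System.out.println(WINDOW_MILLIS);
        // The same product in int arithmetic overflows, so keep the math in long.
        int overflowed = 30 * 24 * 60 * 60 * 1000;
        System.out.println(overflowed);
        System.out.println(inWindow(1399386809212L, System.currentTimeMillis()));
    }
}
```

The window is larger than the maximum int value, which is why the sketch uses long throughout. Passing the current time in as a parameter keeps the comparison easy to test.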

Find customer top queries in the last 30 days

USE search;

DROP TABLE IF EXISTS search_customerquery;

CREATE TABLE search_customerquery(id STRING, customerid BIGINT, querystring STRING, querycount INT);

-- customer top query string in the last 30 days
INSERT INTO TABLE search_customerquery
select concat(customerid,"_",queryString), customerid, querystring, count(*) as querycount
from search_clicks
where querystring is not null
  and customerid is not null
  and createdTimeStampInMillis > ((unix_timestamp() * 1000) - 2592000000)
group by customerid, querystring
order by customerid;

Sample data, selected from the "search_customerquery" table:

# id, querystring, count, customerid
61_queryString59, queryString59, 5, 61
298_queryString48, queryString48, 3, 298
440_queryString16, queryString16, 1, 440
47_queryString85, queryString85, 1, 47
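To make the grouping concrete, here is a small in-memory sketch of what the query computes: rows with a null customer or query are skipped, and the remaining rows are counted per customer and query string. The classes CustomerQueryCount and Click are hypothetical stand-ins for this example, and the 30-day filter is left out for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration, not part of the original project.
final class CustomerQueryCount {
    // Minimal stand-in for a search_clicks row; only the two grouped columns are kept.
    static final class Click {
        final Long customerId;
        final String queryString;

        Click(Long customerId, String queryString) {
            this.customerId = customerId;
            this.queryString = queryString;
        }
    }

    // Skips null keys, groups by customer and query string, and counts the rows.
    static Map<String, Long> count(List<Click> clicks) {
        return clicks.stream()
                .filter(c -> c.customerId != null && c.queryString != null)
                .collect(Collectors.groupingBy(
                        // Same shape as concat(customerid, "_", queryString) in the query.
                        c -> c.customerId + "_" + c.queryString,
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Click> clicks = Arrays.asList(
                new Click(61L, "queryString59"),
                new Click(61L, "queryString59"),
                new Click(298L, "queryString48"),
                new Click(61L, null));
        System.out.println(count(clicks));
    }
}
```

The map key plays the role of the id column, so each entry corresponds to one row of search_customerquery.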

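Analyze facets/filters for guided navigation

You can further extend the Hive queries to generate statistics on how end customers behave when they use facets/filters to search for relevant products.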

USE search;

-- How many times a particular filter has been clicked.
select count(*)
from search_clicks
where array_contains(filters, struct("searchfacettype_color_level_2", "Blue"));

-- How many distinct customers clicked the filter.
select DISTINCT customerid
from search_clicks
where array_contains(filters, struct("searchfacettype_color_level_2", "Blue"));

-- Top query filters by a customer.
select customerid, filters.code, filters.value, count(*) as filtercount
from search_clicks
group by customerid, filters.code, filters.value
order by filtercount DESC
limit 100;
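The first query counts events whose filters array contains a given (code, value) pair. The sketch below mirrors that check in memory, using the two sample events shown earlier in the post. FilterStats and its methods are hypothetical names for this example, and a list of map entries stands in for the ARRAY<STRUCT<code, value>> column.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of the original project.
final class FilterStats {
    // One (code, value) filter, standing in for STRUCT<code, value>.
    static Map.Entry<String, String> filter(String code, String value) {
        return new SimpleImmutableEntry<>(code, value);
    }

    // Counts events whose filter list contains the given (code, value) pair.
    static long countEventsWithFilter(List<List<Map.Entry<String, String>>> events,
                                      String code, String value) {
        Map.Entry<String, String> wanted = filter(code, value);
        return events.stream()
                .filter(filters -> filters.contains(wanted))
                .count();
    }

    public static void main(String[] args) {
        // Filters of the two sample events from the post.
        List<List<Map.Entry<String, String>>> events = Arrays.asList(
                Arrays.asList(
                        filter("searchfacettype_brand_level_2", "Apple"),
                        filter("searchfacettype_color_level_2", "Blue")),
                Collections.singletonList(
                        filter("searchfacettype_age_level_2", "0-12 years")));
        System.out.println(
                countEventsWithFilter(events, "searchfacettype_color_level_2", "Blue"));
    }
}
```

Both the code and the value have to match for an event to be counted, so an event filtered on another color or on a different facet is not included.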

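The data extraction Hive queries can be scheduled nightly or hourly, depending on requirements, and executed with a job scheduler such as Oozie. The resulting data can then be used for BI analytics or to improve the customer experience.

In later posts we will cover analyzing the generated data further:

Using ElasticSearch Hadoop to index the customer top queries and product views data
Using Oozie to schedule coordinated jobs for Hive partitions, and a bundle job to index the data into ElasticSearch
Using Pig to count the total number of unique customers, etc.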

Translated from: https://www.javacodegeeks.com/2014/05/hive-query-customer-top-search-query-and-product-views-count-using-apache-hive.html
