Elasticsearch Queries: Intelligent Book Recommendation
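1. Introduction to intelligent recommendation with Elasticsearch
Elasticsearch is a popular search engine: it finds matching documents from the keywords a user enters so that the user can reach the information they want. A recommendation system follows a similar process. It starts from a data record that characterizes a user or an item, then finds the records closest to that record and recommends them to the user.
The more like this query finds documents that are similar to a given document. It first selects a set of keywords that represent the input document, then builds a query from those keywords, and finally searches the index for similar documents.
The more like this query provided by Elasticsearch is a simple recommendation system based on document similarity, built on Elasticsearch's underlying inverted index and its document relevance algorithm.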
2. Data preparation
Elasticsearch version: 6.8
Index: books
Book data model:
{
  "bookId": "23303789",
  "title": "罪與罰",
  "author": "陀思妥耶夫斯基",
  "version": 231868826,
  "format": "epub",
  "type": 0,
  "price": 21,
  "originalPrice": 0,
  "soldout": 0,
  "bookStatus": 1,
  "payType": 4097,
  "intro": "",
  "centPrice": 2100,
  "finished": 1,
  "maxFreeChapter": 9,
  "free": 0,
  "mcardDiscount": 0,
  "ispub": 1,
  "cpid": 2571052,
  "publishTime": "2016-10-12 00:00:00",
  "category": "精品小說-世界名著",
  "hasLecture": 1,
  "lastChapterIdx": 47,
  "paperBook": {"skuId": "12075198"},
  "newRating": 917,
  "newRatingCount": 4017,
  "newRatingDetail": {"good": 3685, "fair": 280, "poor": 52, "recent": 362, "title": "神作"},
  "finishReading": 0
}

Index the data into Elasticsearch:
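import requests

# Browser-like User-Agent header sent with the book list requests
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}

maxIndex = 0
while maxIndex < 500:
    # Fetch one page of the book list, starting at offset maxIndex
    response = requests.get(url=f'https://127.0.0.1/web/bookListInCategory/all?maxIndex={maxIndex}', headers=header)
    obj = response.json()
    books = obj['books']
    for book in books:
        info = book['bookInfo']
        book_id = info['bookId']
        # Index the book into the books index, using bookId as the document _id
        requests.post(url=f'http://127.0.0.1:9200/books/_doc/{book_id}', json=info)
    # A full page holds 20 books; a shorter page means there is nothing left to fetch
    if len(books) == 20:
        maxIndex += len(books)
    else:
        break

3. Book recommendation based on more like this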
We can pass in a long piece of text and use the books' intro field to find similar books:
GET books/_search
{
  "_source": ["bookId", "title", "author", "intro", "category", "publishTime"],
  "query": {
    "more_like_this": {
      "fields": ["intro"],
      "like": "入世20年,世界給中國帶來了什么?中國給世界帶去了什么?從一開始的“狼來了”,憂慮中國的工業內環境會受到致命沖擊,到在一個充分競爭的開放市場,中國在全球化中獲益良多。當美元的鐮刀劃過世界的血管,卻沒能造就一個更強大的美利堅。中國正在逐步融入全球市場,重塑外貿版圖,如今已是全球第二大經濟體,并且在更多的方面展現出了領導者的姿態。對于領航者而言,前方只有無人區。過去,跟隨、復制、拿來主義的追趕模式正在崩壞,創新增量的時代正在到來。中國經濟的未來20年,指向何方?"
    }
  },
  "size": 3
}

We can see that Elasticsearch returned the three best-matching books, and the first of them is the book our like text was taken from:
{
  "took": 3,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 320,
    "max_score": 25.865322,
    "hits": [
      {
        "_index": "books", "_type": "_doc", "_id": "42752779", "_score": 25.865322,
        "_source": {
          "publishTime": "2021-12-01 00:00:00",
          "author": "《商界》雜志社",
          "intro": "",
          "title": "入世20年:中國經濟進入“無人區”(《商界》2021年第12期)",
          "category": "期刊專欄-財經",
          "bookId": "42752779"
        }
      },
      {
        "_index": "books", "_type": "_doc", "_id": "33810600", "_score": 9.83586,
        "_source": {
          "publishTime": "2020-05-01 00:00:00",
          "author": "史蒂芬·柯維",
          "intro": "",
          "title": "高效能人士的七個習慣(30周年紀念版)(全新增訂版)",
          "category": "個人成長-認知思維",
          "bookId": "33810600"
        }
      },
      {
        "_index": "books", "_type": "_doc", "_id": "42824214", "_score": 8.847967,
        "_source": {
          "publishTime": "2021-02-01 00:00:00",
          "author": "傅瑩",
          "intro": "",
          "title": "看世界2:百年變局下的挑戰和抉擇",
          "category": "",
          "bookId": "42824214"
        }
      }
    ]
  }
}

We can also reference a specific book directly in like, here [入世20年:中國經濟進入“無人區”(《商界》2021年第12期)], to find books similar to it:
GET books/_search
{
  "_source": ["bookId", "title", "author", "intro", "category", "publishTime"],
  "query": {
    "more_like_this": {
      "fields": ["intro"],
      "like": [{"_index": "books", "_id": "42752779"}]
    }
  },
  "size": 3
}

We can see that Elasticsearch has automatically excluded the current document 42752779:
{
  "took": 4,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 319,
    "max_score": 9.83586,
    "hits": [
      {
        "_index": "books", "_type": "_doc", "_id": "33810600", "_score": 9.83586,
        "_source": {
          "publishTime": "2020-05-01 00:00:00",
          "author": "史蒂芬·柯維",
          "intro": "",
          "title": "高效能人士的七個習慣(30周年紀念版)(全新增訂版)",
          "category": "個人成長-認知思維",
          "bookId": "33810600"
        }
      },
      {
        "_index": "books", "_type": "_doc", "_id": "42824214", "_score": 8.847967,
        "_source": {
          "publishTime": "2021-02-01 00:00:00",
          "author": "傅瑩",
          "intro": "",
          "title": "看世界2:百年變局下的挑戰和抉擇",
          "category": "",
          "bookId": "42824214"
        }
      },
      {
        "_index": "books", "_type": "_doc", "_id": "42867305", "_score": 8.118775,
        "_source": {
          "publishTime": "2022-01-01 00:00:00",
          "author": "林毅夫",
          "intro": "",
          "title": "中國經濟的前景",
          "category": "經濟理財-財經",
          "bookId": "42867305"
        }
      }
    ]
  }
}

4. How more like this works
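The basic idea of an intelligent recommendation system is to compute which items are most similar and recommend them to the user. Elasticsearch's more like this is built on exactly this simple idea: it uses the inverted index as its underlying data structure and its own tf-idf relevance model to compute how similar two documents are. The higher the relevance score, the more similar the documents.
When the query runs, more like this first uses the analyzer of the specified field to tokenize the input string, or the corresponding field of the input document. It then selects, according to its configuration, the top n keywords that best characterize the document, and finally combines those keywords into a query to look for similar documents.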
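The selection step described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not Elasticsearch's implementation: the documents are already tokenized, the idf formula is a simplified one rather than Lucene's, and the function names `select_terms` and `build_should_query` as well as the sample corpus are made up for this example.

```python
import math
from collections import Counter


def select_terms(doc_tokens, corpus, max_query_terms=5):
    """Rank the distinct tokens of one document by a simplified tf-idf
    score and return the highest-scoring ones."""
    n_docs = len(corpus)
    term_freq = Counter(doc_tokens)
    scored = []
    for term, tf in term_freq.items():
        # Document frequency: number of corpus documents containing the term
        df = sum(1 for tokens in corpus if term in tokens)
        # Simplified idf (an assumption, not Lucene's exact formula)
        idf = math.log(1 + n_docs / (1 + df))
        scored.append((tf * idf, term))
    # Highest score first; ties are broken alphabetically for determinism
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


def build_should_query(field, terms):
    """Combine the selected terms into a bool/should query body."""
    return {"bool": {"should": [{"term": {field: t}} for t in terms]}}


# Made-up, pre-tokenized corpus; the real analyzer is out of scope here
corpus = [
    ["china", "economy", "world"],
    ["china", "habits"],
    ["china", "world"],
    ["china", "finance"],
]
doc = ["china", "economy", "economy", "world"]
top_terms = select_terms(doc, corpus, max_query_terms=2)
query = build_should_query("intro", top_terms)
```

In this toy corpus "china" appears in every document, so it scores lowest and is dropped first, while the repeated and rarer "economy" ranks highest. The sketch leaves out the minimum-should-match setting and the exclusion of the source document.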
We can use the following query to see how Elasticsearch does this:
GET books/_search
{
  "profile": "true",
  "_source": ["bookId", "title", "author", "intro", "category", "publishTime"],
  "query": {
    "more_like_this": {
      "fields": ["intro"],
      "like": [{"_index": "books", "_id": "42752779"}]
    }
  },
  "size": 3
}

We can see the terms that the more like this query finally used, and the query it ran on each shard to find similar documents:
{
  "took": 4,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "profile": {
    "shards": [
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][0]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:一個 intro:什么 intro:世界 intro:正在 intro:中國)~1) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 524949
        }]}]
      },
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][1]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:一個 intro:什么 intro:世界 intro:全球 intro:中國)~1) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 444670
        }]}]
      },
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][2]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:一個 intro:什么 intro:全球 intro:世界 intro:帶 intro:中國)~1) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 454063
        }]}]
      },
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][3]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:一個 intro:什么 intro:全球 intro:帶 intro:世界 intro:正在 intro:中國)~2) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 430971
        }]}]
      },
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][4]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:一個 intro:什么 intro:世界 intro:全球 intro:中國)~1) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 745984
        }]}]
      }
    ]
  }
}

From the output above we can see that, for performance reasons, Elasticsearch by default selects rather few keywords. Representing a document with so few keywords is coarse-grained and loses a lot of information, so the recommendations are not ideal.
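Reading the selected terms out of the profile output by eye gets tedious once there are more of them. The following is a minimal Python sketch that pulls the terms of one field out of each shard's description string. It assumes the description format shown in the profile output above; the helper names `terms_from_description` and `terms_per_shard` are hypothetical.

```python
import re


def terms_from_description(description, field="intro"):
    """Extract the terms of `field` from one profiled query description,
    e.g. "((intro:a intro:b)~1) -ConstantScore(...)" gives ["a", "b"]."""
    # Match "<field>:<term>", where the term ends at whitespace or a parenthesis
    pattern = re.compile(r"(?<!\w)" + re.escape(field) + r":([^\s()]+)")
    return pattern.findall(description)


def terms_per_shard(profile_response, field="intro"):
    """Map each shard id in a profiled search response to its query terms."""
    result = {}
    for shard in profile_response["profile"]["shards"]:
        terms = []
        for search in shard["searches"]:
            for query in search["query"]:
                terms.extend(terms_from_description(query["description"], field))
        result[shard["id"]] = terms
    return result
```

Passing the parsed JSON of a profiled response to `terms_per_shard` gives one term list per shard, which makes it easy to compare the lists; in the output above they differ from shard to shard.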
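5. Tuning recommendations with more like this parameters
From the analysis in section 4, we can see that Elasticsearch currently extracts relatively few keywords from the input document. Elasticsearch provides the following parameters to control which of the terms extracted from the input document take part in the query:
min_term_freq: the minimum term frequency a term must have to take part in the query; the default is 2.
min_doc_freq: the minimum document frequency a term must have to take part in the query; the default is 5.
max_doc_freq: the maximum document frequency a term may have to take part in the query; unbounded by default.
Because the text in our intro field is fairly short, we set min_term_freq=1 so that more keywords take part in the query, and set max_doc_freq=30 to exclude terms that carry little meaning.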
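As a rough illustration of how these three thresholds act on candidate terms, here is a small Python sketch. It is a simplified model, not Elasticsearch's code: the function name `filter_terms` is hypothetical, and the document frequencies in the example are invented rather than taken from the books index.

```python
from collections import Counter


def filter_terms(doc_tokens, doc_freq, min_term_freq=2, min_doc_freq=5,
                 max_doc_freq=None):
    """Return the terms of one document that pass the three thresholds.

    doc_tokens: tokens of the input document (already analyzed)
    doc_freq:   mapping of term -> number of indexed documents containing it
    """
    term_freq = Counter(doc_tokens)
    kept = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq:
            continue  # occurs too rarely in the input document
        if df < min_doc_freq:
            continue  # occurs in too few indexed documents
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # occurs in too many documents to be distinctive
        kept.append(term)
    return sorted(kept)


# Invented example data: only "china" occurs twice in the input document
tokens = ["china", "china", "world", "frontier", "future"]
doc_freq = {"china": 120, "world": 40, "frontier": 6, "future": 25}

with_defaults = filter_terms(tokens, doc_freq)
tuned = filter_terms(tokens, doc_freq, min_term_freq=1, max_doc_freq=30)
```

With the defaults only the repeated term survives; lowering min_term_freq admits the single-occurrence terms, and the max_doc_freq cap then removes the very common ones. The real query with these two settings is shown next.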
GET books/_search
{
  "profile": "true",
  "_source": ["bookId", "title", "author", "intro", "category", "publishTime"],
  "query": {
    "more_like_this": {
      "fields": ["intro"],
      "like": [{"_index": "books", "_id": "42752779"}],
      "min_term_freq": 1,
      "max_doc_freq": 30
    }
  },
  "size": 3
}

Looking at what Elasticsearch returns, we can see that both the terms used by the query and the hits have improved considerably:
{
  "took": 4,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 31,
    "max_score": 19.660702,
    "hits": [
      {
        "_index": "books", "_type": "_doc", "_id": "42867305", "_score": 19.660702,
        "_source": {
          "publishTime": "2022-01-01 00:00:00",
          "author": "林毅夫",
          "intro": "",
          "title": "中國經濟的前景",
          "category": "經濟理財-財經",
          "bookId": "42867305"
        }
      },
      {
        "_index": "books", "_type": "_doc", "_id": "33396343", "_score": 14.927922,
        "_source": {
          "publishTime": "2020-01-01 00:00:00",
          "author": "黃漢城 史哲 林小琬",
          "intro": "",
          "title": "中國城市大洗牌",
          "category": "經濟理財-財經",
          "bookId": "33396343"
        }
      },
      {
        "_index": "books", "_type": "_doc", "_id": "31231802", "_score": 14.390678,
        "_source": {
          "publishTime": "2018-02-01 00:00:00",
          "author": "李光耀",
          "intro": "",
          "title": "李光耀觀天下(精裝版)",
          "category": "",
          "bookId": "31231802"
        }
      }
    ]
  },
  "profile": {
    "shards": [
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][0]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:時代 intro:開始 intro:更 intro:未來 intro:會 intro:展現 intro:出了 intro:什么 intro:世界 intro:正在 intro:中國)~3) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 536546
        }]}]
      },
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][1]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:展現 intro:更 intro:未來 intro:會 intro:出了 intro:開始 intro:只有 intro:時代 intro:來了 intro:方面 intro:什么 intro:世界 intro:全球 intro:中國)~4) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 713527
        }]}]
      },
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][2]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:開始 intro:會 intro:更 intro:只有 intro:對于 intro:來了 intro:未來 intro:時代 intro:過去 intro:出了 intro:模式 intro:什么 intro:全球 intro:世界 intro:帶 intro:中國)~4) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 554927
        }]}]
      },
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][3]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:時代 intro:會 intro:更 intro:開始 intro:對于 intro:出了 intro:過去 intro:來了 intro:展現 intro:什么 intro:全球 intro:帶 intro:世界 intro:正在 intro:中國)~4) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 678723
        }]}]
      },
      {
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][4]",
        "searches": [{"query": [{
          "type": "BooleanQuery",
          "description": "((intro:更 intro:會 intro:時代 intro:出了 intro:開始 intro:只有 intro:展現 intro:未來 intro:強大 intro:方面 intro:過去 intro:來了 intro:對于 intro:什么 intro:世界 intro:全球 intro:中國)~5) -ConstantScore(_id:[fe 42 75 27 79])",
          "time_in_nanos": 877326
        }]}]
      }
    ]
  }
}

6. Improving recommendation accuracy based on business characteristics
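Among a book's fields, the category is certainly an important dimension. Let's modify the query to boost the weight of records that share the same category: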
GET books/_search
{
  "profile": "true",
  "_source": ["bookId", "title", "author", "intro", "category", "publishTime"],
  "query": {
    "bool": {
      "must": [
        {
          "more_like_this": {
            "fields": ["intro"],
            "like": [{"_index": "books", "_id": "42752779"}],
            "min_term_freq": 1,
            "max_doc_freq": 30
          }
        }
      ],
      "should": [
        {"match": {"category": {"query": "財經", "boost": 3}}}
      ],
      "minimum_should_match": 0
    }
  },
  "size": 3
}

Looking at the results Elasticsearch returns, the hits are now all finance books:
{
  "took": 5,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 31,
    "max_score": 30.677063,
    "hits": [
      {
        "_index": "books", "_type": "_doc", "_id": "42867305", "_score": 30.677063,
        "_source": {
          "publishTime": "2022-01-01 00:00:00",
          "author": "林毅夫",
          "intro": "",
          "title": "中國經濟的前景",
          "category": "經濟理財-財經",
          "bookId": "42867305"
        }
      },
      {
        "_index": "books", "_type": "_doc", "_id": "33396343", "_score": 24.42368,
        "_source": {
          "publishTime": "2020-01-01 00:00:00",
          "author": "黃漢城 史哲 林小琬",
          "intro": "",
          "title": "中國城市大洗牌",
          "category": "經濟理財-財經",
          "bookId": "33396343"
        }
      },
      {
        "_index": "books", "_type": "_doc", "_id": "32440996", "_score": 17.54503,
        "_source": {
          "publishTime": "2019-06-12 00:00:00",
          "author": "肖星",
          "intro": "",
          "title": "一本書讀懂財報(全新修訂版)",
          "category": "經濟理財-財經",
          "bookId": "32440996"
        }
      }
    ]
  }
}

Note: because of the current platform's review policy, the values of the books' intro field have been blanked out; for the complete content, please go to …
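The boosted query from this section can also be assembled in Python, which makes it easier to reuse for other books and categories. The following is a minimal sketch: the function name `build_boosted_mlt_query` and its parameters are hypothetical, and the field names and parameter values are simply copied from the query above.

```python
def build_boosted_mlt_query(doc_id, category_keyword, index="books",
                            boost=3, size=3):
    """Build a search body that requires similarity on the intro field and
    boosts hits whose category matches `category_keyword`."""
    more_like_this = {
        "fields": ["intro"],
        # Reference an indexed document instead of passing raw text
        "like": [{"_index": index, "_id": doc_id}],
        "min_term_freq": 1,
        "max_doc_freq": 30,
    }
    category_boost = {
        "match": {"category": {"query": category_keyword, "boost": boost}}
    }
    return {
        "_source": ["bookId", "title", "author", "intro", "category",
                    "publishTime"],
        "query": {
            "bool": {
                "must": [{"more_like_this": more_like_this}],
                # Optional clause: it only raises the score when it matches
                "should": [category_boost],
                "minimum_should_match": 0,
            }
        },
        "size": size,
    }


body = build_boosted_mlt_query("42752779", "財經")
```

The resulting dict can be sent as the JSON body of a request to the books/_search endpoint, for example with the requests library used in the indexing script. Because the category clause sits in should with minimum_should_match set to 0, books from other categories can still be returned, only with a lower score.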