Elasticsearch in Plain Language 10: A Deep Dive into Search Techniques, Multi-Field Search with the best fields Strategy Based on dis_max

Published: 2025/3/21

Contents

  • Overview
  • TF/IDF
  • Links
  • Example
    • DSL
    • Ordinary query
    • dis_max query
  • best fields strategy: dis_max

Overview

Part ten of my notes from following the ES course taught by 中华石杉.

Course URL: https://www.roncoo.com/view/55


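TF/IDF

The classic scoring mechanism of Apache Lucene. (Since Elasticsearch 5.0 the default similarity is BM25, which builds on the same two factors.)

  • TF (Term Frequency): a term-based factor that indicates how many times a term occurs in a given document.
    The higher the term frequency, the higher the document's score.

  • IDF (Inverse Document Frequency): a term-based factor that tells the scoring formula how rare the term is.
    The higher the inverse document frequency, the rarer the term. The scoring formula uses this factor to boost documents that contain rare terms.

Term vector: a term vector is a miniature inverted index for a single document. Each dimension of the vector is a pair of a term and its frequency, and it can also hold the term's positions. Both Lucene and ES leave term vectors disabled by default; enable the option when a feature needs it, for example certain kinds of highlighting.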
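To make the two factors concrete, here is a minimal sketch that computes them over the five titles used later in this article. It uses the textbook formula idf = ln(N / df), which is an assumption for illustration and not Lucene's exact formula; the tokenizer and the function names are hypothetical as well.

```python
import math
from collections import Counter

# The five titles from this article's sample data, used as a toy corpus.
docs = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}


def tokenize(text):
    # Naive tokenizer (assumption): lowercase, drop commas, split on whitespace.
    return text.lower().replace(",", " ").split()


def term_frequency(term, text):
    # TF: how many times the term occurs in one document.
    return Counter(tokenize(text))[term]


def inverse_document_frequency(term, corpus):
    # IDF (textbook form, not Lucene's exact formula): ln(N / df),
    # where df is the number of documents containing the term.
    df = sum(1 for text in corpus.values() if term in tokenize(text))
    if df == 0:
        return 0.0
    return math.log(len(corpus) / df)


for term in ("blog", "java", "spark"):
    print(term, round(inverse_document_frequency(term, docs), 4))
```

"blog" occurs in every title, so its IDF is zero and it contributes nothing to ranking, while "spark" occurs in a single title and gets the largest weight of the three.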


Links

Official guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/_tuning_best_fields_queries.html

https://www.elastic.co/guide/en/elasticsearch/reference/7.2/query-dsl-dis-max-query.html


On dis_max appearing not to work when the data set is small: https://stackoverflow.com/questions/38065692/dis-max-query-isnt-looking-for-the-best-matching-clause


A related article by another blogger:
https://blog.csdn.net/dm_vincent/article/details/41820537


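Example

ES version: 6.4.1

To make the demonstration clear, we delete the forum index used in the earlier articles and rebuild it.

The DSL is as follows.

DSL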

DELETE /forum

PUT /forum
{ "settings" : { "number_of_shards" : 1 } }

POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"tag":["java","hadoop"]}}
{"update":{"_id":"2"}}
{"doc":{"tag":["java"]}}
{"update":{"_id":"3"}}
{"doc":{"tag":["hadoop"]}}
{"update":{"_id":"4"}}
{"doc":{"tag":["java","elasticsearch"]}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"tag_cnt":2}}
{"update":{"_id":"2"}}
{"doc":{"tag_cnt":1}}
{"update":{"_id":"3"}}
{"doc":{"tag_cnt":1}}
{"update":{"_id":"4"}}
{"doc":{"tag_cnt":2}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"view_cnt":30}}
{"update":{"_id":"2"}}
{"doc":{"view_cnt":50}}
{"update":{"_id":"3"}}
{"doc":{"view_cnt":100}}
{"update":{"_id":"4"}}
{"doc":{"view_cnt":80}}

POST /forum/article/_bulk
{"index":{"_id":5}}
{"articleID":"DHJK-B-1395-#Ky5","userID":3,"hidden":false,"postDate":"2019-06-01","tag":["elasticsearch"],"tag_cnt":1,"view_cnt":10}

POST /forum/article/_bulk
{"update":{"_id":"5"}}
{"doc":{"postDate":"2019-05-01"}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"title":"this is java and elasticsearch blog"}}
{"update":{"_id":"2"}}
{"doc":{"title":"this is java blog"}}
{"update":{"_id":"3"}}
{"doc":{"title":"this is elasticsearch blog"}}
{"update":{"_id":"4"}}
{"doc":{"title":"this is java, elasticsearch, hadoop blog"}}
{"update":{"_id":"5"}}
{"doc":{"title":"this is spark blog"}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"content":"i like to write best elasticsearch article"}}
{"update":{"_id":"2"}}
{"doc":{"content":"i think java is the best programming language"}}
{"update":{"_id":"3"}}
{"doc":{"content":"i am only an elasticsearch beginner"}}
{"update":{"_id":"4"}}
{"doc":{"content":"elasticsearch and hadoop are all very good solution, i am a beginner"}}
{"update":{"_id":"5"}}
{"doc":{"content":"spark is best big data solution based on scala ,an programming language similar to java"}}
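The _bulk API expects newline-delimited JSON: each action and each document body on its own line, with a trailing newline at the end. If you generate such requests from a script instead of typing them into the console, a small helper along these lines may be useful. This is a sketch only; the function name `to_bulk_body` is hypothetical, and actually sending the body to a cluster is left out.

```python
import json


def to_bulk_body(actions):
    # Build a newline-delimited JSON body for the _bulk API.
    # `actions` is a list of (action_line, source_or_None) pairs.
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        if source is not None:
            lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


body = to_bulk_body([
    ({"index": {"_id": 5}}, {"articleID": "DHJK-B-1395-#Ky5", "userID": 3}),
    ({"update": {"_id": "5"}}, {"doc": {"postDate": "2019-05-01"}}),
])
print(body, end="")
```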

That completes the test data. Now let's see how dis_max works.

GET /forum/article/_search

The data looks like this:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "forum", "_type": "article", "_id": "1", "_score": 1,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01",
          "tag": ["java", "hadoop"], "tag_cnt": 2, "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "2", "_score": 1,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02",
          "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "3", "_score": 1,
        "_source": {
          "articleID": "JODL-X-1937-#pV7", "userID": 2, "hidden": false, "postDate": "2017-01-01",
          "tag": ["hadoop"], "tag_cnt": 1, "view_cnt": 100,
          "title": "this is elasticsearch blog",
          "content": "i am only an elasticsearch beginner"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "4", "_score": 1,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true, "postDate": "2017-01-02",
          "tag": ["java", "elasticsearch"], "tag_cnt": 2, "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "5", "_score": 1,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2019-05-01",
          "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      }
    ]
  }
}

Ordinary query

First, look at an ordinary DSL query:

GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java solution" } },
        { "match": { "content": "java solution" } }
      ],
      "minimum_should_match": 1
    }
  }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 1.5179626,
    "hits": [
      {
        "_index": "forum", "_type": "article", "_id": "2", "_score": 1.5179626,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02",
          "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "5", "_score": 1.4233948,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2019-05-01",
          "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "4", "_score": 1.2832261,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true, "postDate": "2017-01-02",
          "tag": ["java", "elasticsearch"], "tag_cnt": 2, "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "1", "_score": 0.4889865,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01",
          "tag": ["java", "hadoop"], "tag_cnt": 2, "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      }
    ]
  }
}

Let's analyze the result.

As taught in the course, the relevance score of each document is computed like this: add up the scores of the matching queries, multiply by the number of matched queries, and divide by the total number of queries. (This is the coordination factor of Lucene's classic similarity. Lucene 7, on which ES 6.x is built, removed that factor, so on 6.4.1 the bool score is just the sum of the matching clauses' scores. The ranking argument is the same either way: doc2 collects score from two fields.)

Work out the score of doc2:

{ "match": { "title": "java solution" }} produces a score for doc2
{ "match": { "content": "java solution" }} also produces a score for doc2

Suppose the two scores are 1.1 and 1.2, so together 1.1 + 1.2 = 2.3
matched queries = 2
total queries = 2

2.3 * 2 / 2 = 2.3


Work out the score of doc5:

{ "match": { "title": "java solution" }} produces no score for doc5
{ "match": { "content": "java solution" }} produces a score for doc5

So only one query scores, say 2.3
matched queries = 1
total queries = 2

2.3 * 1 / 2 = 1.15

Score of doc5 = 1.15 < score of doc2 = 2.3
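The simplified calculation above can be sketched in a few lines of Python. This is an illustration of the course's formula only, not of what Elasticsearch 6.4.1 computes internally; the function name and the per-clause scores (1.1, 1.2, 2.3) are the assumed values from the walk-through, with None standing for a clause that does not match.

```python
def simplified_bool_score(clause_scores):
    # Course formula: sum of matching clause scores
    # * number of matched clauses / total number of clauses.
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    return sum(matched) * len(matched) / len(clause_scores)


# Assumed per-clause scores in the order [title, content].
doc2 = simplified_bool_score([1.1, 1.2])
doc5 = simplified_bool_score([None, 2.3])

print(round(doc2, 2), round(doc5, 2))
print("doc2 ranks first" if doc2 > doc5 else "doc5 ranks first")
```

Even though doc5 has the single strongest field match, the two weaker matches of doc2 add up and push it to the top.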


The document with id=2 is ranked first, but we would really like id=5 to come first: after all, the content field of id=5 contains both java and solution. So let's look at dis_max.


dis_max query

GET /forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "java solution" } },
        { "match": { "content": "java solution" } }
      ]
    }
  }
}

Response:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 1.4233948,
    "hits": [
      {
        "_index": "forum", "_type": "article", "_id": "5", "_score": 1.4233948,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2019-05-01",
          "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "2", "_score": 0.93952733,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02",
          "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "4", "_score": 0.79423964,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true, "postDate": "2017-01-02",
          "tag": ["java", "elasticsearch"], "tag_cnt": 2, "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "1", "_score": 0.4889865,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01",
          "tag": ["java", "hadoop"], "tag_cnt": 2, "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      }
    ]
  }
}
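To compare the two responses side by side, the short script below copies the _score values from the bool and dis_max results and prints both rankings along with the per-document gap. One way to read the gap, which is an interpretation and not something the responses state, is as the extra score that bool adds from the weaker field. The variable and function names are only for illustration.

```python
# _score per document id, copied from the two responses in this article.
bool_scores = {"2": 1.5179626, "5": 1.4233948, "4": 1.2832261, "1": 0.4889865}
dis_max_scores = {"5": 1.4233948, "2": 0.93952733, "4": 0.79423964, "1": 0.4889865}


def ranking(scores):
    # Document ids ordered from highest to lowest score.
    return sorted(scores, key=scores.get, reverse=True)


print("bool:   ", ranking(bool_scores))
print("dis_max:", ranking(dis_max_scores))

# Gap between the bool score and the dis_max score for each document.
for doc_id in ranking(bool_scores):
    gap = bool_scores[doc_id] - dis_max_scores[doc_id]
    print(doc_id, round(gap, 4))
```

Documents 5 and 1 keep the same score under both queries, which fits the fact that each of them matches in only one field; documents 2 and 4 match in both fields and lose score under dis_max.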

best fields strategy: dis_max

best fields strategy: a document in which a single field matches as many of the keywords as possible should be ranked first, rather than a document in which many fields each match only a few of the keywords.

The dis_max query simply takes, among its queries, the score of the highest-scoring one.

For example:

{ "match": { "title": "java solution" }} produces a score for doc2: 1.1
{ "match": { "content": "java solution" }} also produces a score for doc2: 1.2

Take the maximum: 1.2


{ "match": { "title": "java solution" }} produces no score for doc5
{ "match": { "content": "java solution" }} produces a score for doc5: 2.3

Take the maximum: 2.3

So the score of doc2 = 1.2 < the score of doc5 = 2.3, and doc5 is ranked higher.
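The same example as a minimal Python sketch. The function name and the per-clause scores are the assumed values from the walk-through. The `tie_breaker` argument mirrors the dis_max parameter of the same name from the Elasticsearch reference (default 0.0, which gives the pure "take the maximum" behaviour); this article's query does not set it.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    # Score of the best matching clause, plus tie_breaker times
    # the scores of the other matching clauses.
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


# Assumed per-clause scores in the order [title, content].
doc2 = dis_max_score([1.1, 1.2])
doc5 = dis_max_score([None, 2.3])

print(doc2, doc5)
print("doc5 ranks first" if doc5 > doc2 else "doc2 ranks first")
```

With the default tie_breaker of 0.0 only the best field counts, so a document cannot climb the ranking just by matching weakly in several fields.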
