
Implementing Baidu-style search with Elasticsearch (ES 5.5.0)

Published: 2024/1/8

Source code: GitHub

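Business requirements (background):

  • Implement search-engine-style prefix search (prefix queries on Chinese text, on full pinyin, and on pinyin initials).
  • Implement full-text search over article summaries with title boosting: titles get a higher weight than body content, results are sorted by relevance, and the 20 most relevant articles are returned.

I. Search-engine prefix search

    Chinese search:
    1. Searching "劉" matches "劉德華", "劉斌", and "劉德志".
    2. Searching "劉德" matches "劉德華" and "劉德志".
    Summary: the search text must be a prefix of a name; the result is the subset of names in the collection that start with it.

    Full-pinyin search:
    1. Searching "li" matches "劉德華", "劉斌", and "劉德志".
    2. Searching "liud" matches "劉德華" and "劉德志".
    3. Searching "liudeh" matches "劉德華".
    Summary: the search text, read as pinyin, must be a prefix of a name's full pinyin (劉德華 becomes "liudehua").

    Pinyin-initials search:
    1. Searching "w" matches "我是中國人" and "我愛我的祖國".
    2. Searching "wszg" matches "我是中國人".
    Summary: the first letter of each character's pinyin is joined into one string (我是中國人 becomes "wszgr"), and the search text must be a prefix of that string.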

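    Solutions:

    Option 1: Store the Chinese text, its pinyin initials, and its full pinyin in three separate index fields, none of them analyzed. This is the simplest and most direct approach, because the inverted index stores the values as they are. At search time, run a wildcard (LIKE-style) query against each of the three fields and return the matching records. (Pros: simple storage format, and the smallest inverted index. Cons: LIKE-style matching is expensive at query time; a prefix query costs far more than a term query.)

    Option 2: Derive three sub-fields from a single field (multi-field) to hold the Chinese text, the pinyin initials, and the full pinyin, and give each analyzer an edge_ngram token filter that expands the tokens before they are written to the inverted index. At search time, query all the sub-fields. (Pros: compact format; queries use exact term matching and the input needs no analysis, so queries are efficient. Cons: it trades space for time, although that is acceptable for this index. The derived fields make storage and retrieval more complex, and the scores from the three sub-fields are added together, which muddies the relevance ranking.)

    Option 3: Store the data in a single field that is not split into words (keyword; the settings below use the keyword tokenizer). When the inverted index is built, generate all three kinds of tokens (original text, full pinyin, pinyin initials) and expand each of them with edge_ngram to cover the requirements above. The search input is not analyzed. (Pros: simple storage; a search needs only an exact term query against one field, so queries are very fast. Cons: it trades space for time, but uses less space than option 2, which is acceptable for this index.)

    Elasticsearch offers many other ways to handle this scenario; these three are the typical ones. I chose option 3.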
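To make option 3 concrete, here is a minimal in-memory sketch of the idea, not Elasticsearch code. It assumes Java 11+. The class `EdgeNgramSketch`, its methods, and the tiny hand-written pinyin table are all hypothetical and cover only the sample names; in Elasticsearch the pinyin plugin and the edge_ngram filter do this work. The names are written as Unicode escapes so the file compiles whatever the source encoding is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class EdgeNgramSketch {
    // Sample names from the requirements, written as Unicode escapes.
    static final String LIU_DEHUA = "\u5289\u5fb7\u83ef"; // Liu Dehua
    static final String LIU_BIN = "\u5289\u658c";          // Liu Bin
    static final String LIU_DEZHI = "\u5289\u5fb7\u5fd7";  // Liu Dezhi

    // Hypothetical pinyin table for the five characters used above (assumption for the demo).
    private static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('\u5289', "liu");
        PINYIN.put('\u5fb7', "de");
        PINYIN.put('\u83ef', "hua");
        PINYIN.put('\u658c', "bin");
        PINYIN.put('\u5fd7', "zhi");
    }

    // All prefixes of s with length minGram..maxGram: what edge_ngram emits for one token.
    static List<String> edgeNgrams(String s, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        int limit = Math.min(maxGram, s.length());
        for (int len = minGram; len <= limit; len++) {
            out.add(s.substring(0, len));
        }
        return out;
    }

    // Index-time terms for one name: prefixes of the original text,
    // of the joined full pinyin, and of the pinyin initials.
    static Set<String> indexTerms(String name) {
        StringBuilder full = new StringBuilder();
        StringBuilder initials = new StringBuilder();
        for (char c : name.toCharArray()) {
            String py = PINYIN.get(c);
            if (py != null) {
                full.append(py);
                initials.append(py.charAt(0));
            }
        }
        Set<String> terms = new LinkedHashSet<>();
        terms.addAll(edgeNgrams(name, 1, 15));
        terms.addAll(edgeNgrams(full.toString(), 1, 15));
        terms.addAll(edgeNgrams(initials.toString(), 1, 15));
        return terms;
    }

    // Builds a term -> names inverted index.
    static Map<String, Set<String>> buildIndex(List<String> names) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String name : names) {
            for (String term : indexTerms(name)) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(name);
            }
        }
        return index;
    }

    // Exact lookup with no analysis of the input: the analogue of a term query.
    static Set<String> search(Map<String, Set<String>> index, String input) {
        return index.getOrDefault(input, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = buildIndex(List.of(LIU_DEHUA, LIU_BIN, LIU_DEZHI));
        System.out.println(search(index, "liud"));
        System.out.println(search(index, "ldh"));
    }
}
```

All the expansion happens at index time, so a search is a single hash lookup. That is the space-for-time trade the option describes.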

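    Preparation:

    • Install the pinyin analysis plugin and learn its parameters
    • Learn how to use Elasticsearch edge_ngram
    • Learn how to use Elasticsearch multi-fields
    • Get familiar with the various Elasticsearch query types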

    Code:

    baidu_settings.json:

    {
      "refresh_interval": "2s",
      "number_of_replicas": 1,
      "number_of_shards": 2,
      "analysis": {
        "filter": {
          "autocomplete_filter": {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 15
          },
          "pinyin_first_letter_and_full_pinyin_filter": {
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_full_pinyin": false,
            "keep_joined_full_pinyin": true,
            "keep_none_chinese": false,
            "keep_original": false,
            "limit_first_letter_length": 16,
            "lowercase": true,
            "trim_whitespace": true,
            "keep_none_chinese_in_first_letter": true
          },
          "full_pinyin_filter": {
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_full_pinyin": false,
            "keep_joined_full_pinyin": true,
            "keep_none_chinese": false,
            "keep_original": true,
            "limit_first_letter_length": 16,
            "lowercase": true,
            "trim_whitespace": true,
            "keep_none_chinese_in_first_letter": true
          }
        },
        "analyzer": {
          "full_prefix_analyzer": {
            "type": "custom",
            "char_filter": ["html_strip"],
            "tokenizer": "keyword",
            "filter": ["lowercase", "full_pinyin_filter", "autocomplete_filter"]
          },
          "chinese_analyzer": {
            "type": "custom",
            "char_filter": ["html_strip"],
            "tokenizer": "keyword",
            "filter": ["lowercase", "autocomplete_filter"]
          },
          "pinyin_analyzer": {
            "type": "custom",
            "char_filter": ["html_strip"],
            "tokenizer": "keyword",
            "filter": ["pinyin_first_letter_and_full_pinyin_filter", "autocomplete_filter"]
          }
        }
      }
    }

    baidu_mapping.json:

    {
      "baidu_type": {
        "properties": {
          "full_name": {
            "type": "text",
            "analyzer": "full_prefix_analyzer"
          },
          "age": {
            "type": "integer"
          }
        }
      }
    }

    import java.util.ArrayList;
    import java.util.List;

    import org.elasticsearch.client.transport.TransportClient;
    import org.junit.Test;

    // ESConnect, BaseIndex, BulkInsert and BulkOperation are the author's own
    // wrapper classes; they are not shown in this post.
    public class PrefixTest {

        @Test
        public void testCreateIndex() throws Exception {
            TransportClient client = ESConnect.getInstance().getTransportClient();
            // Create the index with the analysis settings
            BaseIndex.createWithSetting(client, "baidu_index", "esjson/baidu_settings.json");
            // Define the type and its field mapping
            BaseIndex.createMapping(client, "baidu_index", "baidu_type", "esjson/baidu_mapping.json");
        }

        @Test
        public void testBulkInsert() throws Exception {
            TransportClient client = ESConnect.getInstance().getTransportClient();
            // Sample documents to bulk-index
            List<Object> list = new ArrayList<>();
            list.add(new BulkInsert(12L, "我們都有一個家名字叫中國", 12));
            list.add(new BulkInsert(13L, "兄弟姐妹都很多景色也不錯 ", 13));
            list.add(new BulkInsert(14L, "家里盤著兩條龍是長江與黃河", 14));
            list.add(new BulkInsert(15L, "還有珠穆朗瑪峰兒是最高山坡", 15));
            list.add(new BulkInsert(16L, "我們都有一個家名字叫中國", 16));
            list.add(new BulkInsert(17L, "兄弟姐妹都很多景色也不錯", 17));
            list.add(new BulkInsert(18L, "看那一條長城萬里在云中穿梭", 18));
            boolean flag = BulkOperation.batchInsert(client, "baidu_index", "baidu_type", list);
            System.out.println(flag);
        }
    }

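    Sorry, the Java code above goes through my own wrapper classes. Creating an index from Java is easy to look up elsewhere; the point here is the approach, not the Java implementation.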
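Since the wrapper classes are not shown, here is one possible way to create the index without them. This is a sketch, not the author's method: it swaps the TransportClient for a plain HTTP call to the create-index REST endpoint, using the JDK's `java.net.http` client (Java 11+). The class `CreateIndexSketch` and its methods are hypothetical, and it assumes the two JSON files can be read from the working directory under the paths used in the test class.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CreateIndexSketch {
    // Wraps the settings and mapping documents into one create-index request body.
    static String createIndexBody(String settingsJson, String mappingJson) {
        return "{\"settings\":" + settingsJson.trim()
                + ",\"mappings\":" + mappingJson.trim() + "}";
    }

    // PUT <baseUrl>/<index> creates the index with the given body.
    static HttpRequest createIndexRequest(String baseUrl, String index, String body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the files sit at these relative paths on disk.
        String settings = Files.readString(Path.of("esjson/baidu_settings.json"));
        String mapping = Files.readString(Path.of("esjson/baidu_mapping.json"));

        HttpRequest request = createIndexRequest(
                "http://192.168.20.114:9200", "baidu_index",
                createIndexBody(settings, mapping));

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Because baidu_mapping.json already has the type name `baidu_type` as its top-level key, it can be dropped in unchanged as the value of `mappings`.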

    Next, check what the custom analyzer produces:

    http://192.168.20.114:9200/baidu_index/_analyze?text=劉德華AT2016&analyzer=full_prefix_analyzer

    {
      "tokens": [
        {"token": "劉", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "劉德", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "劉德華", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "劉德華a", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "劉德華at", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "劉德華at2", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "劉德華at20", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "劉德華at201", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "劉德華at2016", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "l", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "li", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "liu", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "liud", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "liude", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "liudeh", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "liudehu", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "liudehua", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "l", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "ld", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "ldh", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "ldha", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "ldhat", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "ldhat2", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "ldhat20", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "ldhat201", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
        {"token": "ldhat2016", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0}
      ]
    }

    Done.
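With those tokens in the index, a search can be a single term query on `full_name`. The sketch below is an assumption about how that query might be sent, not code from the original project: it builds the search body by hand and posts it with the JDK's `java.net.http` client (Java 11+). The class `TermSearchSketch`, its methods, and the result size of 10 are hypothetical. A term query does not analyze its input, and the indexed tokens are lowercase, so the sketch lowercases the user's text before sending it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class TermSearchSketch {
    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds a search body with one term query. The input is trimmed and
    // lowercased here because a term query skips analysis.
    static String termQueryBody(String field, String input, int size) {
        String term = input.trim().toLowerCase(Locale.ROOT);
        return "{\"size\":" + size
                + ",\"query\":{\"term\":{\"" + escapeJson(field) + "\":\""
                + escapeJson(term) + "\"}}}";
    }

    public static void main(String[] args) throws Exception {
        String body = termQueryBody("full_name", "liud", 10);

        HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://192.168.20.114:9200/baidu_index/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

The same body works for Chinese, full-pinyin, and initials input, since all three kinds of prefix live in the one `full_name` field.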

    References:

    http://blog.csdn.net/napoay/article/details/53907921
    https://elasticsearch.cn/question/407
    http://blog.csdn.net/xifeijian/article/details/51095762
    http://www.cnblogs.com/xing901022/p/5910139.html
    http://www.cnblogs.com/clonen/p/6674492.html

    https://github.com/medcl/elasticsearch-analysis-pinyin

    https://github.com/medcl/elasticsearch-analysis-ik

    I'll write up the full-text search part later, when I have time.


    Reposted from: https://my.oschina.net/LucasZhu/blog/1543956
