Providing a Word Segmentation Service with Elasticsearch in Practice
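1. The problem
In real-world development there are business scenarios that need the word frequencies of a passage or an article.
Word frequency counting presupposes word segmentation.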
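To make the word-frequency use case concrete, here is a minimal sketch of the counting step, assuming the text has already been segmented into a list of tokens. The names `TermFrequency`, `count`, and `top` are illustrative and do not come from the original project, and the sample tokens are made up (ASCII is used only to keep the sample portable; tokens returned by a Chinese segmenter work the same way).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TermFrequency {
    /** Counts how often each token occurs, keeping first-seen order. */
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String token : tokens) {
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    /** Returns the n most frequent entries; ties keep first-seen order. */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> freq, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        // List.sort is stable, so equal counts stay in first-seen order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Hypothetical tokens, as a segmenter might return them.
        List<String> tokens = List.of("search", "engine", "search", "index", "search", "engine");
        System.out.println(top(count(tokens), 2)); // prints [search=3, engine=2]
    }
}
```

Whatever tool does the segmentation, the counting step stays the same, which is why the choice of segmenter is the real question here.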
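Commonly used Chinese word segmenters include:
1. IK: https://github.com/medcl/elasticsearch-analysis-ik
2. Jieba: https://github.com/huaban/elasticsearch-analysis-jieba
3. ANSJ: https://github.com/NLPchina/elasticsearch-analysis-ansj
In practice, we can wrap any of these tools in an interface or a service to perform segmentation.
But have you ever considered using Elasticsearch's analysis plugins to do the segmentation directly and expose that as a service?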
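2. Feasibility
1. When Elasticsearch handles Chinese text, Chinese word segmentation is a prerequisite for building the inverted index.
For that segmentation, the IK analysis plugin is the one we most commonly use.
2. In a normal ES deployment, the analyzer is already chosen up front, at design time.
In short, using Elasticsearch to perform word segmentation is entirely feasible.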
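One reason this is practical is that Elasticsearch exposes its analyzers over plain HTTP through the `_analyze` API, so any HTTP client can request a segmentation. The sketch below only builds such a request with the JDK's `java.net.http` classes (Java 11+). The class `AnalyzeRequests` and its methods are illustrative names, the base URL is an assumption about a local node, and POST is used here because `_analyze` accepts both GET and POST.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class AnalyzeRequests {
    /** Escapes a string for use inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body of an _analyze request. */
    static String body(String analyzer, String text) {
        return "{\"analyzer\":\"" + jsonEscape(analyzer)
                + "\",\"text\":\"" + jsonEscape(text) + "\"}";
    }

    /** Builds (but does not send) a POST request to <baseUrl>/<index>/_analyze. */
    static HttpRequest build(String baseUrl, String index, String analyzer, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(analyzer, text), StandardCharsets.UTF_8))
                .build();
    }
}
```

A request built with, say, `AnalyzeRequests.build("http://localhost:9200", "test_index", "ik_smart", text)` could then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, provided a node with the IK plugin is actually running at that address.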
2. DSL implementation in Elasticsearch
GET test_index/_analyze
{
  "analyzer": "ik_smart",
  "text": "9年后,我還是沒有跑出去 | 震后余生"
}
Result:
{
  "tokens": [
    {"token": "9", "start_offset": 0, "end_offset": 1, "type": "ARABIC", "position": 0},
    {"token": "年后", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 1},
    {"token": "我", "start_offset": 4, "end_offset": 5, "type": "CN_WORD", "position": 2},
    {"token": "還是", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 3},
    {"token": "沒有", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 4},
    {"token": "跑出去", "start_offset": 9, "end_offset": 12, "type": "CN_WORD", "position": 5},
    {"token": "震后", "start_offset": 15, "end_offset": 17, "type": "CN_WORD", "position": 6},
    {"token": "余生", "start_offset": 17, "end_offset": 19, "type": "CN_WORD", "position": 7}
  ]
}
3. Elasticsearch Java interface implementation
The following implementation is based on Jest 5.3.3.
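import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import io.searchbox.client.JestClient;
import io.searchbox.client.JestResult;
import io.searchbox.indices.Analyze;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public static String IK_TYPE = "ik_smart";

/*
 * @brief:  get the word segmentation result
 * @param:  the article/string to segment
 * @return: set of distinct tokens (extend as needed for your own business scenario)
 * @date:   20180704
 */
public static Set<String> getIkAnalyzeSearchTerms(String searchContent) {
    // Call the IK analyzer. JestHelper and TEST_INDEX are defined elsewhere in the project (not shown here).
    JestClient client = JestHelper.getClient();
    Analyze ikAnalyze = new Analyze.Builder()
            .index(TEST_INDEX)
            .analyzer(IK_TYPE)
            .text(searchContent)
            .build();
    Set<String> keySet = new HashSet<String>();
    try {
        JestResult result = client.execute(ikAnalyze);
        // Guard against a failed request, which would carry no "tokens" array.
        if (!result.isSucceeded()) {
            return keySet;
        }
        JsonArray jsonArray = result.getJsonObject().getAsJsonArray("tokens");
        int arraySize = jsonArray.size();
        for (int i = 0; i < arraySize; ++i) {
            JsonElement curKeyword = jsonArray.get(i).getAsJsonObject().get("token");
            // Logger.info("rst = " + curKeyword.getAsString());
            keySet.add(curKeyword.getAsString());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return keySet;
}

With the Java interface in place, exposing a RESTful API on top of it becomes relatively simple.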
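As a rough illustration of that last step, the sketch below wraps a segmentation function in a small HTTP endpoint using the JDK's built-in `com.sun.net.httpserver` server. This is not the author's service: the class `SegmentationService`, the `/analyze` path, and the `start` method are hypothetical, and the segmenter is passed in as a plain `Function` so that a method such as `getIkAnalyzeSearchTerms` could be plugged in. A production service would more likely use a web framework and a real JSON library.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

class SegmentationService {
    /** Renders tokens as a JSON array of strings (escapes only quotes and backslashes). */
    static String toJsonArray(Set<String> tokens) {
        return tokens.stream()
                .map(t -> "\"" + t.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(",", "[", "]"));
    }

    /** Starts a server whose POST /analyze endpoint segments the raw request body. */
    static HttpServer start(int port, Function<String, Set<String>> segmenter) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/analyze", exchange -> {
            if (!"POST".equals(exchange.getRequestMethod())) {
                exchange.sendResponseHeaders(405, -1); // method not allowed, no body
                exchange.close();
                return;
            }
            String text = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            byte[] json = toJsonArray(segmenter.apply(text)).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json; charset=utf-8");
            exchange.sendResponseHeaders(200, json.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(json);
            }
        });
        server.start();
        return server;
    }
}
```

For a quick local trial without Elasticsearch, a whitespace splitter can stand in for the real segmenter, for example `SegmentationService.start(8080, text -> Set.of(text.trim().split("\\s+")))` (note that `Set.of` rejects duplicate elements, so this stand-in only suits input without repeated words). Keeping the segmenter behind a `Function` means the HTTP layer does not need to know whether the tokens come from IK via Jest or from some other source.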
4. Summary
Making full use of Elasticsearch's own features to optimize and simplify business scenarios is the way to go!