[Lucene 4.8 Tutorial, Part 4] Analysis

Published: 2025/3/13

1. Basics

(1) Key concepts

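Analysis, in Lucene, is the process of converting field (Field) text into its most basic indexed representation: terms (Term). During a search, these terms determine which documents match the query conditions.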

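An analyzer encapsulates the analysis process. It converts text into tokens by performing a series of operations. This process is called tokenization, and the chunks of text extracted from the text stream are called tokens. A token combined with its field name forms a term.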
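The relationship between text, tokens, and terms can be illustrated with a minimal plain-Java sketch. It deliberately does not use Lucene's classes: the class `TermSketch`, its methods, and the `field:token` string form are hypothetical names chosen for illustration, and the whitespace splitting stands in for whatever a real analyzer would do.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only; this is not how Lucene represents terms internally.
class TermSketch {

    // Tokenization: split the text on whitespace; each non-empty chunk is one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // A term is a token combined with the name of the field it came from.
    static List<String> toTerms(String field, String text) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(text)) {
            terms.add(field + ":" + token);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [contents:quick, contents:brown, contents:fox]
        System.out.println(toTerms("contents", "quick brown fox"));
    }
}
```

The same token in two different fields yields two different terms, which is why a query has to name the field it searches.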

(2) When analyzers are used

During indexing

Directory returnIndexDir = FSDirectory.open(indexDir);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
IndexWriter writer = new IndexWriter(returnIndexDir, iwc);

When searching with a QueryParser object

QueryParser parser = new QueryParser(Version.LUCENE_48, "contents",new SimpleAnalyzer(Version.LUCENE_48));

When highlighting results in a search

(3) Four commonly used analyzers:

WhitespaceAnalyzer, as the name implies, simply splits text into tokens on whitespace characters and makes no other effort to normalize the tokens.
SimpleAnalyzer first splits tokens at non-letter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters.
StopAnalyzer is the same as SimpleAnalyzer, except it removes common words (called stop words, described more in section XXX). By default it removes common words in the English language (the, a, etc.), though you can pass in your own set.
StandardAnalyzer is Lucene's most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names,
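To make the differences between the first three analyzers concrete, here is a rough plain-Java emulation of their described behavior. It is a sketch under stated assumptions, not Lucene's implementation: the class `AnalyzerSketch`, its method names, and the four-word stop list are hypothetical, and the regular expression `[^\p{L}]+` only approximates "split at non-letter characters". StandardAnalyzer is left out because its rules are too involved for a short example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Emulates the described behavior with the JDK only; none of this is Lucene code.
class AnalyzerSketch {

    // Tiny illustrative stop list (an assumption); Lucene's default English list is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    // Like WhitespaceAnalyzer: split on whitespace, no other normalization.
    static List<String> whitespace(String text) {
        return nonEmpty(text.split("\\s+"));
    }

    // Like SimpleAnalyzer: split at non-letter characters, then lowercase.
    // Digits are non-letters, so they silently disappear.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : nonEmpty(text.split("[^\\p{L}]+"))) {
            tokens.add(piece.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like StopAnalyzer: the simple analysis plus stop word removal.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    // Drops the empty strings that String.split can produce at the start.
    private static List<String> nonEmpty(String[] pieces) {
        List<String> result = new ArrayList<>();
        for (String piece : pieces) {
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox2 jumped";
        System.out.println(whitespace(text)); // [The, Quick-Brown, fox2, jumped]
        System.out.println(simple(text));     // [the, quick, brown, fox, jumped]
        System.out.println(stop(text));       // [quick, brown, fox, jumped]
    }
}
```

Note how `fox2` survives whitespace splitting intact but loses its digit under the simple and stop variants, which is the pitfall the SimpleAnalyzer description warns about.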

4. Other notes

When creating an IndexWriter, you must specify an analyzer, for example:
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
writer = new IndexWriter(returnIndexDir, iwc);
However, each time you add a document to the writer, you can also specify an analyzer for that particular document, for example:
writer.addDocument(doc, new SimpleAnalyzer(Version.LUCENE_48));

Reposted from: https://www.cnblogs.com/eaglegeek/p/4557911.html
