

Lucene from Beginner to Advanced (version 6.6.0)

Published: 2025/3/20

Lucene Study Notes


Preface

This guide is based on the latest Lucene 6.6.0. Many methods found in older tutorials are deprecated and no longer apply, so this article tries to use the simplest possible approach for getting started.

The examples in Chapter 2 are the official ones. They are well written and detailed, but ship without a single comment; every comment in them was added by me, so some interpretations may be wrong. Please bear with me, and feel free to report any incorrect annotations.

From Chapter 3 on, the examples are my own. They are simple and easy to understand, so I recommend starting directly from Chapter 3.

1 Resource Preparation

1.1 Getting-Started Documentation

Official documentation: http://lucene.apache.org/core/6_6_0/index.html

You can follow the official examples from this documentation.

1.2 Development Documentation

Lucene core API documentation: http://lucene.apache.org/core/6_6_0/core/index.html

1.3 Importing the Maven Dependencies

Import the jar packages required to use Lucene:

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>6.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>6.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>6.6.0</version>
</dependency>
<!-- official demo examples -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-demo</artifactId>
  <version>6.6.0</version>
</dependency>

1.4 Luke

Luke is a dedicated index viewer for Lucene.

GitHub: https://github.com/DmitryKey/luke

Installation steps:

  • Clone the repository.
  • Run mvn install from the project directory. (Make sure you have Java and Maven installed before doing this.)
  • Use luke.sh or luke.bat to launch Luke from the command line, depending on your OS.
  • (Alternatively, for older versions of Luke you can download the jar file directly from the releases page and run it with java -jar luke-with-deps.jar.)

2 Getting Started

2.1 IndexFiles

The official example IndexFiles.java creates a Lucene index.

The class needs program arguments passed to its main method at launch. There are three ways to supply them; here we use one: enter the arguments in an IDEA run configuration as follows:

2.1.1 The content of Test.txt:

    numberA

    numberB

    number 范德薩 jklj

    test

    你好

    不錯啊

2.1.2 Code

package com.bingo.backstage;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

/**
 * Created by MoSon on 2017/6/30.
 */
public class IndexFiles {
    private IndexFiles() {}

    public static void main(String[] args) {
        // pass arguments at launch, e.g. -docs <path to your files>
        String usage = "java com.bingo.backstage.IndexFiles"
                + " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
                + "This indexes the documents in DOCS_PATH, creating a Lucene index"
                + " in INDEX_PATH that can be searched with SearchFiles";
        String indexPath = "index";
        String docsPath = null;
        boolean create = true;
        for (int i = 0; i < args.length; ++i) {
            if ("-index".equals(args[i])) {
                indexPath = args[i + 1];
                ++i;
            } else if ("-docs".equals(args[i])) {
                docsPath = args[i + 1];
                ++i;
            } else if ("-update".equals(args[i])) {
                create = false;
            }
        }
        if (docsPath == null) {
            System.err.println("Usage: " + usage);
            System.exit(1);
        }
        Path docDir = Paths.get(docsPath);
        if (!Files.isReadable(docDir)) {
            System.out.println("Document directory '" + docDir.toAbsolutePath()
                    + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }
        Date start = new Date();
        try {
            System.out.println("Indexing to directory '" + indexPath + "'...");
            // open the index directory
            FSDirectory dir = FSDirectory.open(Paths.get(indexPath));
            StandardAnalyzer analyzer = new StandardAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
            if (create) {
                iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            } else {
                iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            }
            IndexWriter writer = new IndexWriter(dir, iwc);
            indexDocs(writer, docDir);
            writer.close();
            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        }
    }

    static void indexDocs(final IndexWriter writer, Path path) throws IOException {
        if (Files.isDirectory(path)) {
            Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    } catch (IOException ignore) {
                        // skip files that cannot be read
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
    }

    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
        try (InputStream stream = Files.newInputStream(file)) {
            Document doc = new Document();
            Field pathField = new StringField("path", file.toString(), Field.Store.YES);
            doc.add(pathField);
            doc.add(new LongPoint("modified", lastModified));
            doc.add(new TextField("contents",
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
            if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
                System.out.println("adding " + file);
                writer.addDocument(doc);
            } else {
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        }
    }
}

2.1.3 Run Result

A directory is generated automatically under the project root to store the index.


Viewing it with Luke:


Notice that the Chinese content was not added to the index.


2.1.4 Analysis

The IndexFiles class creates a Lucene index.

The main() method parses the command-line arguments, then prepares to instantiate an IndexWriter by opening a Directory and instantiating a StandardAnalyzer and an IndexWriterConfig.

The value of the -index command-line argument is the name of the filesystem directory where all index information should be stored. If IndexFiles is invoked with a relative path in the -index argument, or if -index is omitted and the default relative index path "index" is used, the index path is created as a subdirectory of the current working directory (if it does not already exist). On some platforms the index path may be created in a different directory, such as the user's home directory.

The value of the -docs command-line argument is the location of the directory containing the files to be indexed.

The -update command-line argument tells IndexFiles not to delete the index if it already exists. When -update is not given, IndexFiles first wipes the slate clean before indexing any documents.

Lucene's IndexWriter uses a Directory to store information in the index. Besides the FSDirectory implementation we use here, there are several other Directory subclasses that can write to RAM, a database, and so on.

A Lucene Analyzer is a processing pipeline that breaks text up into indexed tokens, also called terms, and optionally performs further operations on those tokens, such as downcasing, synonym insertion, and filtering out unwanted tokens. The Analyzer we use is StandardAnalyzer, which applies the Word Break rules of the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29, lowercases the tokens, and then filters out stop words. Stop words are common language words such as articles (a, an, the, and so on) and other tokens that may have less value for searching. Note that each language has different rules, and you should use the appropriate analyzer for each language.
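The pipeline stages just described (word break, lowercasing, stop-word removal) can be illustrated with a stdlib-only toy sketch. This is not Lucene's implementation: StandardAnalyzer follows the full UAX #29 rules, while this sketch just splits on non-letters, and the class name and stop-word list are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ToyAnalyzer {
    // a tiny, invented subset of English stop words (real analyzers ship a fuller list)
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "and", "of");

    // word-break on runs of non-letters, lowercase each token, then drop stop words
    public static List<String> analyze(String text) {
        return Arrays.stream(text.split("[^\\p{L}]+"))
                .filter(t -> !t.isEmpty())
                .map(String::toLowerCase)
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox and a Lazy Dog"));
        // → [quick, brown, fox, lazy, dog]
    }
}
```

The articles "The" and "a" and the conjunction "and" disappear, and everything else is lowercased, which is exactly why the index built above contains only lowercase terms.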

A single IndexWriterConfig instance holds all of the configuration for the IndexWriter. For example, we set its OpenMode based on the value of the -update command-line argument.

Looking further down in the file, after the IndexWriter is instantiated you will find the indexDocs() code. This recursive function crawls the directories and creates Document objects. A Document is simply a data object representing the text content of a file along with its creation time and location. These instances are added to the IndexWriter. If the -update command-line argument is given, the IndexWriterConfig OpenMode is set to OpenMode.CREATE_OR_APPEND, and rather than simply adding documents to the index, the IndexWriter updates them: it tries to find an already-indexed document with the same identifier (in our case, the file path is the identifier); if one exists, it is deleted from the index, and then the new document is added.

2.2 SearchFiles

Searching the files.

2.2.1 Code

package com.bingo.backstage;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Date;

/**
 * Created by MoSon on 2017/6/30.
 */
public class SearchFiles {
    private SearchFiles() {}

    public static void main(String[] args) throws Exception {
        String usage = "Usage:\tjava com.bingo.backstage.SearchFiles"
                + " [-index dir] [-field f] [-repeat n] [-queries file] [-query string]"
                + " [-raw] [-paging hitsPerPage]\n\n"
                + "See http://lucene.apache.org/core/4_1_0/demo/ for details.";
        if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
            System.out.println(usage);
            System.exit(0);
        }
        String index = "index";
        String field = "contents";
        String queries = null;
        int repeat = 0;
        boolean raw = false;
        String queryString = null;
        int hitsPerPage = 10;
        for (int i = 0; i < args.length; ++i) {
            if ("-index".equals(args[i])) {
                index = args[i + 1];
                ++i;
            } else if ("-field".equals(args[i])) {
                field = args[i + 1];
                ++i;
            } else if ("-queries".equals(args[i])) {
                queries = args[i + 1];
                ++i;
            } else if ("-query".equals(args[i])) {
                queryString = args[i + 1];
                ++i;
            } else if ("-repeat".equals(args[i])) {
                repeat = Integer.parseInt(args[i + 1]);
                ++i;
            } else if ("-raw".equals(args[i])) {
                raw = true;
            } else if ("-paging".equals(args[i])) {
                hitsPerPage = Integer.parseInt(args[i + 1]);
                if (hitsPerPage <= 0) {
                    System.err.println("There must be at least 1 hit per page.");
                    System.exit(1);
                }
                ++i;
            }
        }
        // open the index
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
        IndexSearcher searcher = new IndexSearcher(reader);
        StandardAnalyzer analyzer = new StandardAnalyzer();
        BufferedReader in;
        if (queries != null) {
            in = Files.newBufferedReader(Paths.get(queries), StandardCharsets.UTF_8);
        } else {
            in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        }
        QueryParser parser = new QueryParser(field, analyzer);
        do {
            if (queries == null && queryString == null) {
                System.out.println("Enter query: ");
            }
            String line = queryString != null ? queryString : in.readLine();
            if (line == null || line.length() == -1) {
                break;
            }
            line = line.trim();
            if (line.length() == 0) {
                break;
            }
            Query query = parser.parse(line);
            System.out.println("Searching for: " + query.toString(field));
            if (repeat > 0) {
                // repeat the search to benchmark its speed
                Date start = new Date();
                for (int i = 0; i < repeat; ++i) {
                    searcher.search(query, 100);
                }
                Date end = new Date();
                System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
            }
            doPagingSearch(in, searcher, query, hitsPerPage, raw,
                    queries == null && queryString == null);
        } while (queryString == null);
        reader.close();
    }

    public static void doPagingSearch(BufferedReader in, IndexSearcher searcher, Query query,
                                      int hitsPerPage, boolean raw, boolean interactive) throws IOException {
        TopDocs results = searcher.search(query, 5 * hitsPerPage);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = results.totalHits;
        System.out.println(numTotalHits + " total matching documents");
        int start = 0;
        int end = Math.min(numTotalHits, hitsPerPage);
        while (true) {
            if (end > hits.length) {
                System.out.println("Only results 1 - " + hits.length + " of " + numTotalHits
                        + " total matching documents collected.");
                System.out.println("Collect more (y/n) ?");
                String quit = in.readLine();
                if (quit.length() == 0 || quit.charAt(0) == 'n') {
                    break;
                }
                hits = searcher.search(query, numTotalHits).scoreDocs;
            }
            end = Math.min(hits.length, start + hitsPerPage);
            for (int i = start; i < end; ++i) {
                if (raw) {
                    System.out.println("doc=" + hits[i].doc + " score=" + hits[i].score);
                } else {
                    Document doc = searcher.doc(hits[i].doc);
                    String path = doc.get("path");
                    if (path != null) {
                        System.out.println((i + 1) + ". " + path);
                        String title = doc.get("title");
                        if (title != null) {
                            System.out.println("   Title: " + doc.get("title"));
                        }
                    } else {
                        System.out.println((i + 1) + ". No path for this document");
                    }
                }
            }
            if (!interactive || end == 0) {
                break;
            }
            if (numTotalHits >= end) {
                boolean quit = false;
                while (true) {
                    System.out.print("Press ");
                    if (start - hitsPerPage >= 0) {
                        System.out.print("(p)revious page, ");
                    }
                    if (start + hitsPerPage < numTotalHits) {
                        System.out.print("(n)ext page, ");
                    }
                    System.out.println("(q)uit or enter number to jump to a page.");
                    String choice = in.readLine();
                    if (choice.length() == 0 || choice.charAt(0) == 'q') {
                        quit = true;
                        break;
                    }
                    if (choice.charAt(0) == 'p') {
                        start = Math.max(0, start - hitsPerPage);
                        break;
                    }
                    if (choice.charAt(0) == 'n') {
                        if (start + hitsPerPage < numTotalHits) {
                            start += hitsPerPage;
                        }
                        break;
                    }
                    int page = Integer.parseInt(choice);
                    if ((page - 1) * hitsPerPage < numTotalHits) {
                        start = (page - 1) * hitsPerPage;
                        break;
                    }
                    System.out.println("No such page");
                }
                if (quit) {
                    break;
                }
                end = Math.min(numTotalHits, start + hitsPerPage);
            }
        }
    }
}

2.2.2 Run Result

As you can see, the results match what the Luke tool showed above: a query only hits when the term matches exactly.


2.2.3 Analysis

This class works together with an IndexSearcher, a StandardAnalyzer (the same one used in the IndexFiles class), and a QueryParser. The query parser is constructed with an analyzer used to interpret the query text in the same way the documents are interpreted: finding word boundaries, downcasing, and removing useless words like "a", "an" and "the". The Query object contains the result from the QueryParser, which is passed to the searcher. Note that it is also possible to programmatically construct a rich Query object without using the query parser; the query parser merely decodes the Lucene query syntax into the corresponding Query object.

SearchFiles uses the IndexSearcher.search(query, n) method, which returns at most n hits. The results are printed in pages, sorted by score (i.e. relevance).

2.3 SimpleSortedSetFacetsExample

A simple example, easier to understand than the previous two demos.

It shows simple faceted indexing and search using SortedSetDocValuesFacetField and SortedSetDocValuesFacetCounts.

The code below contains comments; reading them alongside the code makes it easier to follow.

2.3.1 Code

package com.bingo.backstage.facet;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.sortedset.DefaultSortedSetDocValuesReaderState;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

/**
 * Created by MoSon on 2017/6/30.
 */
public class SimpleSortedSetFacetsExample {
    // RAMDirectory: a memory-resident Directory implementation.
    // By default the locking implementation is SingleInstanceLockFactory.
    private final Directory indexDir = new RAMDirectory();
    private final FacetsConfig config = new FacetsConfig();

    public SimpleSortedSetFacetsExample() {}

    private void index() throws IOException {
        // initialize the index writer
        // WhitespaceAnalyzer only splits on whitespace: no lowercasing, no Chinese support,
        // and no further normalization of the produced tokens
        // OpenMode: CREATE overwrites an existing index; APPEND appends to it
        // IndexWriter creates and maintains the index
        IndexWriter indexWriter = new IndexWriter(this.indexDir,
                (new IndexWriterConfig(new WhitespaceAnalyzer())).setOpenMode(OpenMode.CREATE));
        // build a document
        Document doc = new Document();
        // create Field objects and put them into the document
        doc.add(new SortedSetDocValuesFacetField("Author", "Bob"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2010"));
        // write it to the IndexWriter
        indexWriter.addDocument(this.config.build(doc));

        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2010"));
        indexWriter.addDocument(this.config.build(doc));

        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2012"));
        indexWriter.addDocument(this.config.build(doc));

        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Susan"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2012"));
        indexWriter.addDocument(this.config.build(doc));

        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Frank"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "1999"));
        indexWriter.addDocument(this.config.build(doc));

        indexWriter.close();
    }

    // search and count the facets over the matching documents
    private List<FacetResult> search() throws IOException {
        // mostly one layer wrapping another
        // DirectoryReader is the CompositeReader implementation that reads indexes in a Directory
        DirectoryReader indexReader = DirectoryReader.open(this.indexDir);
        // search via an IndexReader
        IndexSearcher searcher = new IndexSearcher(indexReader);
        DefaultSortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
        // collects hits for subsequent faceting; once a search has run and the hits are
        // collected, instantiate a Facets subclass to do the facet counting
        FacetsCollector fc = new FacetsCollector();
        // utility method: run the search and collect all hits into the provided Collector
        FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
        // count facets across all collected hits
        SortedSetDocValuesFacetCounts facets = new SortedSetDocValuesFacetCounts(state, fc);
        List<FacetResult> results = new ArrayList<FacetResult>();
        // getTopChildren: returns the top child labels under the given path
        results.add(facets.getTopChildren(10, "Author"));
        results.add(facets.getTopChildren(10, "Publish Year"));
        indexReader.close();
        return results;
    }

    private FacetResult drillDown() throws IOException {
        DirectoryReader indexReader = DirectoryReader.open(this.indexDir);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        DefaultSortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
        DrillDownQuery q = new DrillDownQuery(this.config);
        // add the drill-down condition
        q.add("Publish Year", "2012");
        FacetsCollector fc = new FacetsCollector();
        FacetsCollector.search(searcher, q, 10, fc);
        SortedSetDocValuesFacetCounts facets = new SortedSetDocValuesFacetCounts(state, fc);
        // get the matching authors
        FacetResult result = facets.getTopChildren(10, "Author");
        indexReader.close();
        return result;
    }

    public List<FacetResult> runSearch() throws IOException {
        this.index();
        return this.search();
    }

    public FacetResult runDrillDown() throws IOException {
        this.index();
        return this.drillDown();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Facet counting example:");
        System.out.println("-----------------------");
        SimpleSortedSetFacetsExample example = new SimpleSortedSetFacetsExample();
        List<FacetResult> results = example.runSearch();
        System.out.println("Author: " + results.get(0));
        System.out.println("Publish Year: " + results.get(1));
        System.out.println("\n");
        System.out.println("Facet drill-down example (Publish Year/2012):");
        System.out.println("---------------------------------------------");
        System.out.println("Author: " + example.runDrillDown());
    }
}

2.3.2 Run Result


3 Quick Start

3.1 Creating an Index

This is an example I wrote myself, and it is easy to understand.

It simply adds some content to the index.

3.1.1 Code

package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LegacyLongField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;

import static org.apache.lucene.document.TextField.TYPE_STORED;

/**
 * Created by MoSon on 2017/6/30.
 */
public class CreateIndex {
    public static void main(String[] args) throws IOException {
        // define the IndexWriter
        // "index" is a relative path under the current project
        Path path = FileSystems.getDefault().getPath("", "index");
        Directory directory = FSDirectory.open(path);
        // define the analyzer
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer)
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        // define a document
        Document document = new Document();
        // define the document fields
        document.add(new LegacyLongField("id", 5499, Field.Store.YES));
        document.add(new Field("title", "小米6", TYPE_STORED));
        document.add(new Field("sellPoint", "驍龍8356G內存,雙攝!", TYPE_STORED));
        // write the data
        indexWriter.addDocument(document);
        // add another document
        document = new Document();
        document.add(new LegacyLongField("id", 8324, Field.Store.YES));
        document.add(new Field("title", "OnePlus5", TYPE_STORED));
        document.add(new Field("sellPoint", "8核,8G運行內存", TYPE_STORED));
        indexWriter.addDocument(document);
        // commit
        indexWriter.commit();
        // close
        indexWriter.close();
    }
}

3.1.2 Result

Below is the result viewed with Luke:


3.2 Token Search

Query the content that matches a given condition.

3.2.1 Code

package com.bingo.backstage;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;

/**
 * Created by MoSon on 2017/7/1.
 */
public class Search {
    public static void main(String[] args) throws IOException {
        // define the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        // define the index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        // define the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // search content
        // define the query term
        Term term = new Term("sellPoint", "");
        Query query = new TermQuery(term);
        // collect the top 10 documents
        TopDocs topDocs = indexSearcher.search(query, 10);
        // print the hit count
        System.out.println("hits: " + topDocs.totalHits);
        // take out the documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // iterate over the results
        for (ScoreDoc scoreDoc : scoreDocs) {
            // fetch the document via indexSearcher.doc
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("id: " + doc.get("id"));
            System.out.println("sellPoint: " + doc.get("sellPoint"));
        }
        // close the index reader
        indexReader.close();
    }
}

3.2.2 Run Result

The matching results are queried and displayed.

4 Core APIs for Building a Lucene Index

Directory: the index directory

Analyzer: the tokenizer

Document: a document object in the index

IndexableField: a data item inside a document

IndexWriterConfig: configuration for index creation

IndexWriter: the object that writes the index

5 The IK Analyzer

5.1 Download

Download a version of IKAnalyzer suitable for Lucene.

Link: http://download.csdn.net/detail/fanpei_moukoy/9796612

5.2 Basic Usage

Use the IK analyzer to segment Chinese text into meaningful words.

Usage: replace the StandardAnalyzer in the code with IKAnalyzer.


Result:

Common words are recognized and segmented, but it is still not enough; for example, words such as "雙攝像頭" (dual camera) and "驍龍" (Snapdragon) are not recognized.


5.3 Custom Dictionary

Create the configuration file:


Create a custom extension dictionary:


Tokenization result:
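IK's segmentation is dictionary-driven, which is why adding entries to the extension dictionary changes the result. A minimal stdlib sketch of forward maximum matching against a tiny user dictionary illustrates the idea; this is not IK's actual algorithm or API, and the class name and dictionary entries are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class TinyDictSegmenter {
    // user dictionary, analogous to entries added to an ext.dic file (illustrative words)
    private final Set<String> dict;
    private final int maxLen;

    public TinyDictSegmenter(Set<String> dict) {
        this.dict = dict;
        this.maxLen = dict.stream().mapToInt(String::length).max().orElse(1);
    }

    // forward maximum matching: at each position take the longest dictionary word,
    // falling back to a single character when nothing matches
    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String match = text.substring(i, i + 1);
            for (int j = end; j > i + 1; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) {
                    match = cand;
                    break;
                }
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        TinyDictSegmenter seg = new TinyDictSegmenter(Set.of("驍龍", "雙攝像頭", "內存"));
        System.out.println(seg.segment("驍龍雙攝像頭"));
        // → [驍龍, 雙攝像頭]
    }
}
```

With "驍龍" and "雙攝像頭" in the dictionary they come out as whole tokens; remove them and the same input degrades to single characters, which mirrors the before/after effect of the extension dictionary above.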


5.4 Paged Search

Code:

package com.bingo.backstage;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;

/**
 * Created by MoSon on 2017/7/1.
 */
public class SearchPage {

    public static void main(String[] args) throws IOException, ParseException {
        // define the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        // define the index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        // define the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // search keyword
        String keyWords = "內存";

        // paging info
        Integer page = 1;
        Integer pageSize = 20;
        Integer start = (page - 1) * pageSize;
        Integer end = start + pageSize;

        // fuzzy search through the query parser
        Query query = new QueryParser("sellPoint", new IKAnalyzer()).parse(keyWords);

        // collect the top `end` documents
        TopDocs topDocs = indexSearcher.search(query, end);

        Integer totalPage = (topDocs.totalHits % pageSize == 0)
                ? topDocs.totalHits / pageSize
                : (topDocs.totalHits / pageSize) + 1;

        System.out.println("\"" + keyWords + "\" matched " + topDocs.totalHits
                + " documents, page " + page + "/" + totalPage);
        // print the hit count
        System.out.println("hits: " + topDocs.totalHits);
        // take out the documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        int length = scoreDocs.length > end ? end : scoreDocs.length;
        // iterate over the current page
        for (int i = start; i < length; i++) {
            ScoreDoc doc = scoreDocs[i];
            System.out.println("score: " + doc.score);
            Document document = indexSearcher.doc(doc.doc);
            System.out.println("ID: " + document.get("id"));
            System.out.println("sellPoint: " + document.get("sellPoint"));
            System.out.println("-----------------------");
        }

        // close the index reader
        indexReader.close();
    }
}
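The paging arithmetic in SearchPage can be checked in isolation. A stdlib-only sketch with hypothetical helper names: totalPages rounds the hit count up to whole pages, and pageRange yields the half-open index range [start, end) of the hits shown on a 1-based page.

```java
public class PagingMath {
    // total pages: ceil(totalHits / pageSize), written with the remainder check
    static int totalPages(int totalHits, int pageSize) {
        return (totalHits % pageSize == 0) ? totalHits / pageSize : totalHits / pageSize + 1;
    }

    // index range [start, end) of the hits shown on a 1-based page
    static int[] pageRange(int page, int pageSize, int totalHits) {
        int start = (page - 1) * pageSize;
        int end = Math.min(start + pageSize, totalHits);
        return new int[]{start, end};
    }

    public static void main(String[] args) {
        System.out.println(totalPages(45, 20)); // 45 hits, 20 per page → 3 pages
        int[] r = pageRange(3, 20, 45);
        System.out.println(r[0] + ".." + r[1]); // page 3 shows hits 40..45
    }
}
```

Note that the search itself still has to request at least `end` hits from IndexSearcher.search, as the example above does; Lucene only returns the top-n documents it was asked for.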

Result:


6 Building and Searching a File Index

Import one million records and build an index from them.

6.1 Creating the Index

package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.*;
import java.nio.file.FileSystems;
import java.nio.file.Path;

import static org.apache.lucene.document.TextField.TYPE_STORED;

/**
 * Created by MoSon on 2017/7/4.
 */
public class ReadTxt {
    public static void main(String[] args) throws IOException {
        Path path = FileSystems.getDefault().getPath("", "index");
        String extPath = "H:\\IDEAWorkspace\\lucene\\src\\main\\resources\\ext.dic";
        Directory directory = FSDirectory.open(path);
        // define the analyzer
//        Analyzer analyzer = new StandardAnalyzer();
        Analyzer analyzer = new IKAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer)
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        String filePath = "H:\\myfile\\品高\\茂名全量地址20170401boss+.csv";
        FileInputStream fis = new FileInputStream(filePath);
        InputStreamReader isr = new InputStreamReader(fis, "GBK");
        BufferedReader br = new BufferedReader(isr);
        String content;
        String levelOne = "";
        String levelTwo = "";
        String levelThree = "";
        String levelFour = "";
        String levelFive = "";
        int i = 0;
        /* first pass (disabled): extract the distinct address levels into ext.dic
        while ((content = br.readLine()) != null) {
            if (i == 1000) {
                break;
            }
            String[] split = content.split(",");
            String tempOne = "";
            String tempTwo = "";
            String tempThree = "";
            String tempFour = "";
            String tempFive = "";
            if (i == 1) {
                levelOne = split[2];
                levelTwo = split[3];
                levelThree = split[4];
                levelFour = split[5];
                levelFive = split[6];
            }

            tempOne = split[2];
            tempTwo = split[3];
            tempThree = split[4];
            tempFour = split[5];
            tempFive = split[6];

            StringBuilder sb = new StringBuilder();
            // use equals; when "" may be present, avoid putting it first
            if (levelOne != null && levelOne != "" && tempOne != "" && tempOne != null) {
                if (!tempOne.equals(levelOne)) {
                    sb.append("\n" + levelOne);
                    levelOne = tempOne;
                    System.out.println("11" + levelOne + tempOne);
                }
            }
            if (levelTwo != null && levelTwo != "" && tempTwo != "" && tempTwo != null) {
                if (!tempTwo.equals(levelTwo)) {
                    sb.append("\n" + levelTwo);
                    levelTwo = tempTwo;
                }
            }
            if (levelThree != null && levelThree != "" && tempThree != "" && tempThree != null) {
                if (!tempThree.equals(levelThree)) {
                    sb.append("\n" + levelThree);
                    levelThree = tempThree;
                }
            }
            if (levelFour != null && levelFour != "" && tempFour != "" && tempFour != null) {
                if (!tempFour.equals(levelFour)) {
                    sb.append("\n" + levelFour);
                    levelFour = tempFour;
                }
            }
            if (levelFive != null && levelFive != "" && tempFive != "" && tempFive != null) {
                if (!tempFive.equals(levelFive)) {
                    sb.append("\n" + levelFive);
                    levelFive = tempFive;
                }
            }
            if (i == 422) {
                System.out.println("address" + sb.toString() + tempFive + levelFive);
            }

//            System.out.println("address" + sb.toString() + tempFive + levelFive);
            if (sb != null) {
                // append to the extension dictionary file
                FileOutputStream fos = new FileOutputStream(extPath, true);
                OutputStreamWriter osr = new OutputStreamWriter(fos);
                BufferedWriter bw = new BufferedWriter(osr);
                bw.write(sb.toString(), 0, sb.length());
                bw.close();
            }
            i++;
        }
        */

        long start = System.currentTimeMillis();
        System.out.println("start:" + start);
        while ((content = br.readLine()) != null) {
            // skip the first line (disabled)
            /*if (i == 0) {
                continue;
            }*/
            /*if (i == 1000) {
                break;
            }*/

            // define a document
            Document document = new Document();
            // read each line
//            System.out.println(content);
            String[] split = content.split(",");
            String id = split[0];
            String address = split[1];

//            System.out.println(id + ":" + address);
            document.add(new Field("id", id, TYPE_STORED));
            document.add(new Field("address", address, TYPE_STORED));
            indexWriter.addDocument(document);
            i++;
        }
        long end = System.currentTimeMillis();
        System.out.println("end:" + end);
        float time = end - start;
        System.out.println("elapsed: " + time);
        // commit
        indexWriter.commit();
        // close
        indexWriter.close();
        br.close();
        isr.close();
        fis.close();
    }
}
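ReadTxt assumes each CSV line is `id,address` and keeps only the first two comma-separated fields, so an address that itself contains a comma would be truncated. A stdlib-only sketch of that parsing step, with an invented class name and invented sample rows standing in for the real CSV:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CsvLines {
    // parse "id,address" rows the same way ReadTxt does: split each line on commas
    // and keep fields 0 and 1 (anything after a comma inside the address is lost)
    public static List<String[]> parse(String csv) throws IOException {
        List<String[]> rows = new ArrayList<>();
        BufferedReader br = new BufferedReader(new StringReader(csv));
        String line;
        while ((line = br.readLine()) != null) {
            String[] split = line.split(",");
            rows.add(new String[]{split[0], split[1]});
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = parse("1,Addr A\n2,Addr B");
        System.out.println(rows.size() + " rows, first id=" + rows.get(0)[0]);
        // → 2 rows, first id=1
    }
}
```

For a production import, a real CSV parser (or at least `split(",", 2)` to keep the whole address) would be safer, but the sketch mirrors what the indexing loop above actually does.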

6.2 Result

At first it took about 100 seconds to build the index for one million records. Later runs sped up, probably because only two applications were open, and indexing finished in under a minute.


6.3 Fuzzy Search

Searching for "茂名" matched everything, over a million records, in a little more than one second.

Result:


7 Getting the Analyzer's Tokenization Result

7.1 Using the IK Analyzer

Like Baidu, we first tokenize the sentence we want to search, and then search by the resulting keywords.

Code:

package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/**
 * Created by MoSon on 2017/7/5.
 */
public class AnalyzerResult {

    /**
     * Get the token output of the given analyzer.
     * @param analyzeStr the string to tokenize
     * @param analyzer   the analyzer to use
     * @return the list of tokens
     */
    public List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> response = new ArrayList<String>();
        TokenStream tokenStream = null;
        try {
            // Returns a TokenStream for the given field, tokenizing the reader's content.
            tokenStream = analyzer.tokenStream("address", new StringReader(analyzeStr));
            // The text of the current token.
            CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
            // Consumers must call reset() before the first incrementToken();
            // it resets the stream to a clean state so it can be reused.
            tokenStream.reset();
            // Consumers (e.g. IndexWriter) call incrementToken() to advance to the next token.
            while (tokenStream.incrementToken()) {
                response.add(attr.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return response;
    }

    public static void main(String[] args) {
        List<String> analyseResult = new AnalyzerResult()
                .getAnalyseResult("茂名市信宜市丁堡鎮丁堡鎮片區丁堡街道181301", new IKAnalyzer());
        for (String result : analyseResult) {
            System.out.println(result);
        }
    }
}
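The token list that getAnalyseResult returns can contain duplicates (the sample sentence tokenizes "丁堡鎮" twice), and each duplicate would later become a redundant query clause. A small, hypothetical helper that drops repeats while keeping first-seen order:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class TokenDedup {
    // LinkedHashSet keeps insertion order while discarding repeated tokens.
    static List<String> dedup(List<String> tokens) {
        return new ArrayList<>(new LinkedHashSet<>(tokens));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("丁堡鎮", "丁堡鎮", "片區", "丁堡", "片區");
        System.out.println(dedup(tokens)); // [丁堡鎮, 片區, 丁堡]
    }
}
```

Passing the deduplicated list into the query-building step in section 8 would avoid double-counting the same term.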

Token output:


7.2 Using the built-in CJK analyzer

Just replace IKAnalyzer in the class above with CJKAnalyzer.


Token output:


It splits the text almost entirely into two-character pairs, which is not as good as the IK analyzer's output.
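The two-character behaviour comes from CJKAnalyzer being a bigram tokenizer: it emits every adjacent pair of CJK characters rather than real words. The scheme itself can be sketched in plain Java (this is only the core idea, not Lucene's actual implementation, which also handles non-CJK runs and offsets):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Emit every adjacent two-character pair, the way a CJK bigram tokenizer does.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("茂名市信宜市")); // [茂名, 名市, 市信, 信宜, 宜市]
    }
}
```

This explains the comparison above: bigrams never miss a substring match, but a dictionary-based tokenizer like IK produces far fewer, more meaningful terms.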

8 Going further

Combining everything so far: tokenize the input first, then search by the resulting keywords, with the most similar documents ranked first.

This uses a boolean search.

Code:
package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;
import java.io.StringReader;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Created by MoSon on 2017/7/5.
 */
public class BooleanSearchQuery {
    public static void main(String[] args) throws IOException, ParseException {
        long start = System.currentTimeMillis();
        System.out.println("start: " + start);
        // open the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        // open an index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        // create the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // boolean search over two fixed terms:
        /*TermQuery termQuery1 = new TermQuery(term1);
        TermQuery termQuery2 = new TermQuery(term2);
        BooleanClause booleanClause1 = new BooleanClause(termQuery1, BooleanClause.Occur.MUST);
        BooleanClause booleanClause2 = new BooleanClause(termQuery2, BooleanClause.Occur.SHOULD);
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(booleanClause1);
        builder.add(booleanClause2);
        BooleanQuery query = builder.build();*/

        /**
         * Going further: a boolean search over many keywords.
         */
        // collect the terms
        List<Term> termList = new ArrayList<Term>();
        // tokenize the input sentence
        List<String> analyseResult = new AnalyzerResult()
                .getAnalyseResult("信宜市1234ewrq13asd丁堡鎮丁堡鎮片區丁堡街道181301", new IKAnalyzer());
        for (String result : analyseResult) {
            termList.add(new Term("address", result));
        }
        // wrap each term in a TermQuery
        List<TermQuery> termQueries = new ArrayList<TermQuery>();
        for (Term term : termList) {
            termQueries.add(new TermQuery(term));
        }
        // wrap each TermQuery in a SHOULD clause
        List<BooleanClause> booleanClauses = new ArrayList<BooleanClause>();
        for (TermQuery termQuery : termQueries) {
            booleanClauses.add(new BooleanClause(termQuery, BooleanClause.Occur.SHOULD));
        }
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (BooleanClause booleanClause : booleanClauses) {
            builder.add(booleanClause);
        }
        // run the search
        BooleanQuery query = builder.build();
        // take the top 20 documents
        TopDocs topDocs = indexSearcher.search(query, 20);
        // print the hit count
        System.out.println("hits: " + topDocs.totalHits);
        // walk the results
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            float score = scoreDoc.score; // relevance score
            System.out.println("score: " + score);
            // fetch the stored document through the searcher
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("id: " + doc.get("id"));
            System.out.println("address: " + doc.get("address"));
        }
        // close the index reader
        indexReader.close();
        long end = System.currentTimeMillis();
        System.out.println("end: " + end);
        long time = end - start;
        System.out.println("elapsed: " + time + " ms");
    }

    /**
     * Get the token output of the given analyzer.
     * @param analyzeStr the string to tokenize
     * @param analyzer   the analyzer to use
     * @return the list of tokens
     */
    public List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> response = new ArrayList<String>();
        TokenStream tokenStream = null;
        try {
            // Returns a TokenStream for the given field, tokenizing the reader's content.
            tokenStream = analyzer.tokenStream("address", new StringReader(analyzeStr));
            // The text of the current token.
            CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
            // Consumers must call reset() before the first incrementToken().
            tokenStream.reset();
            // incrementToken() advances the stream to the next token.
            while (tokenStream.incrementToken()) {
                response.add(attr.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return response;
    }
}
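Because every clause uses BooleanClause.Occur.SHOULD, documents matching more of the query terms score higher, which is why the closest addresses come out first. That ordering idea can be illustrated with a toy re-implementation in plain Java; counting raw term overlap is a crude stand-in for Lucene's actual TF-IDF/BM25 scoring, not its real formula:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OverlapRanking {
    // Score = number of query terms the document text contains; sort descending.
    static List<String> rank(List<String> docs, List<String> terms) {
        List<String> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingInt((String d) ->
                (int) terms.stream().filter(d::contains).count()).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("信宜市丁堡鎮丁堡街道", "茂名市區某路", "信宜市某村");
        List<String> terms = List.of("信宜市", "丁堡", "街道");
        // 3 matching terms beats 1 beats 0, so the full address ranks first
        System.out.println(rank(docs, terms).get(0)); // 信宜市丁堡鎮丁堡街道
    }
}
```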

Result:

The input sentence is:


Search results:


This concludes the introductory part; if you are interested, see the advanced articles under "my more articles" at the bottom.


Summary

The above is the full content of "Lucene从入门到进阶(6.6.0版本)" as collected by 生活随笔; hopefully it helps you solve the problems you ran into.
