[Lucene 4.8 Tutorial, Part 2] Indexing
I. Basics
0. What the official documentation says
(1) org.apache.lucene.index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
(2) Two main packages are involved:
org.apache.lucene.index: Code to maintain and access indices.
org.apache.lucene.document: The logical representation of a Document for indexing and searching.
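1. The key classes involved in creating an index are:
(1) IndexWriter: the core component of the indexing process. It creates a new index or opens an existing one, and adds, deletes, and updates the indexed documents.
(2) Document: a collection of fields.
(3) Field and its subclasses: a single field, such as the document's creation time, author, or content.
(4) Analyzer: the analyzer, which splits field text into the terms that get indexed.
(5) Directory: describes where the Lucene index is stored.
2. The basic process of indexing documents is:
(1) Create the index store (IndexWriter)
(2) Create a Document for each file
(3) Write the document into the index
The basic program is as follows: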

    package org.jediael.search.index;

    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import org.jediael.util.LoadProperties;

    // 1. Create the index store (IndexWriter)
    // 2. Create a Document for each file
    // 3. Write the document into the index
    public class IndexFiles {

        private IndexWriter writer = null;

        public void indexAllFileinDirectory(String indexPath, String docsPath) throws IOException {
            // Location of the files to index. If the argument is null,
            // fall back to the default set in search.properties.
            if (docsPath == null) {
                docsPath = LoadProperties.getProperties("docsDir");
            }
            final File docDir = new File(docsPath);
            if (!docDir.exists() || !docDir.canRead()) {
                System.out.println("Document directory '" + docDir.getAbsolutePath()
                        + "' does not exist or is not readable, please check the path");
                System.exit(1);
            }

            // Location of the index files. If the argument is null,
            // fall back to the default set in search.properties.
            if (indexPath == null) {
                indexPath = LoadProperties.getProperties("indexDir");
            }
            final File indexDir = new File(indexPath);
            if (!indexDir.exists() || !indexDir.canRead()) {
                System.out.println("Index directory '" + indexDir.getAbsolutePath()
                        + "' does not exist or is not readable, please check the path");
                System.exit(1);
            }

            try {
                // 1. Create the index store (IndexWriter)
                if (writer == null) {
                    initialIndexWriter(indexDir);
                }
                index(writer, docDir);
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // The writer is still null if initialIndexWriter() failed.
                if (writer != null) {
                    writer.close();
                    writer = null; // a closed writer cannot be reused
                }
            }
        }

        // Uses the simplest form of the singleton pattern to return a single IndexWriter.
        // Note that this is not thread-safe and needs further work.
        private void initialIndexWriter(File indexDir) throws IOException {
            Directory returnIndexDir = FSDirectory.open(indexDir);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,
                    new StandardAnalyzer(Version.LUCENE_48));
            writer = new IndexWriter(returnIndexDir, iwc);
        }

        private void index(IndexWriter writer, File filetoIndex) throws IOException {
            if (filetoIndex.isDirectory()) {
                // Recurse into sub-directories.
                String[] files = filetoIndex.list();
                if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                        index(writer, new File(filetoIndex, files[i]));
                    }
                }
            } else {
                // 2. Create a Document for the file. Consider whether a new
                // Document object really has to be created every time.
                Document doc = new Document();
                Field pathField = new StringField("path", filetoIndex.getPath(), Field.Store.YES);
                doc.add(pathField);
                doc.add(new LongField("modified", filetoIndex.lastModified(), Field.Store.YES));
                doc.add(new StringField("title", filetoIndex.getName(), Field.Store.YES));
                doc.add(new TextField("contents", new FileReader(filetoIndex)));
                // System.out.println("Indexing " + filetoIndex.getName());
                // 3. Write the document into the index
                writer.addDocument(doc);
            }
        }
    }

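Some notes:
(1) The code uses the simplest form of the singleton pattern to return a single IndexWriter. It is not thread-safe and needs further work.
(2) Creating instances of IndexWriter, IndexReader, and similar classes is expensive. Unless there is a reason not to, create one instance with the singleton pattern and reuse it.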
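One way to address the thread-safety caveat in note (1) is a lazily initialized holder with double-checked locking. The sketch below is plain Java and does not touch Lucene: `LazyHolder` and `LazyHolderDemo` are hypothetical names, the factory is a stand-in for whatever code builds the real IndexWriter, and Java 8 or later is assumed because of `Supplier` and lambdas.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: lazily creates one shared instance, safe across threads. */
final class LazyHolder<T> {
    private final Supplier<T> factory;
    private volatile T instance; // volatile is required for double-checked locking

    LazyHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T local = instance;
        if (local == null) {             // first check, without locking
            synchronized (this) {
                local = instance;
                if (local == null) {     // second check, under the lock
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}

final class LazyHolderDemo {
    /** Calls get() from several threads and returns how often the factory ran. */
    static int countCreations(int threads) {
        AtomicInteger created = new AtomicInteger();
        LazyHolder<Object> holder = new LazyHolder<>(() -> {
            created.incrementAndGet();
            return new Object(); // stand-in for an expensive object such as an IndexWriter
        });
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(holder::get);
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println(countCreations(8)); // prints 1
    }
}
```

In the program above, the factory would wrap the body of initialIndexWriter(); since that code throws IOException, the lambda would have to catch it and rethrow it as an unchecked exception.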
3. The relationship between the index, Document, and Field
In short, multiple Fields make up a Document, and multiple Documents make up an index.
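To make that containment concrete, here is a toy model in plain Java. It is not Lucene code and does not reflect how Lucene stores anything: `ToyIndex` and `ToyDocument` are hypothetical names, and splitting values on whitespace is a simplifying assumption. It only shows the shape of the relationship: a document is a set of named fields, and the index maps each field term back to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model only: an index is a list of documents plus a term lookup table. */
final class ToyIndex {

    /** A document is just a set of named field values. */
    static final class ToyDocument {
        final Map<String, String> fields = new LinkedHashMap<>();

        ToyDocument add(String name, String value) {
            fields.put(name, value);
            return this;
        }
    }

    private final List<ToyDocument> documents = new ArrayList<>();
    // "field:term" -> ids of the documents whose field contains that term
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Adds a document and returns its id (its position in the index). */
    int addDocument(ToyDocument doc) {
        int docId = documents.size();
        documents.add(doc);
        for (Map.Entry<String, String> field : doc.fields.entrySet()) {
            String[] terms = field.getValue().toLowerCase(Locale.ROOT).split("\\s+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> ids = postings.computeIfAbsent(
                        field.getKey() + ":" + term, key -> new ArrayList<>());
                // Record each document only once per term.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                    ids.add(docId);
                }
            }
        }
        return docId;
    }

    /** Returns the ids of the documents whose given field contains the term. */
    List<Integer> search(String field, String term) {
        return postings.getOrDefault(
                field + ":" + term.toLowerCase(Locale.ROOT), Collections.<Integer>emptyList());
    }
}
```

Adding two documents titled "Lucene in Action" and "Tika in Action" and then calling search("title", "action") returns both ids, while search("title", "lucene") returns only the first.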
They are connected through the following calls:

    Document doc = new Document();
    Field pathField = new StringField("path", filetoIndex.getPath(), Field.Store.YES);
    doc.add(pathField);
    writer.addDocument(doc);

II. About Field
(I) Basic ways to create a field
1. Before Lucene 4.x, a Field was created like this:

    Field field = new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED);
    Field field = new Field("contents", new FileReader(f));
    Field field = new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED);

The four arguments to Field are: the field name, the field value, whether to store it, and whether to analyze it. File names, URLs, file paths, and similar content do not need to be analyzed.
2. Since Lucene 4, a large number of concrete Field implementations are defined. Pick the one that fits your need directly; the generic Field class is no longer used to create fields.
Direct Known Subclasses: BinaryDocValuesField, DoubleField, FloatField, IntField, LongField, NumericDocValuesField, SortedDocValuesField, SortedSetDocValuesField, StoredField, StringField, TextField
For example, the fields above can be rewritten as:

    Field field = new StringField("path", filetoIndex.getPath(), Field.Store.YES);
    Field field = new LongField("modified", filetoIndex.lastModified(), Field.Store.NO);
    Field field = new TextField("contents", new FileReader(filetoIndex));

Since 4.x, StringField is NOT_ANALYZED (the field content is not split into tokens), while TextField is ANALYZED, so this property no longer has to be specified when creating a Field object. See http://stackoverflow.com/questions/19042587/how-to-prevent-a-field-from-not-analyzing-in-lucene
In other words, every Field subclass has its own default INDEXED and ANALYZED settings, which no longer need to be given explicitly.
Official documentation:
StringField: A field that is indexed but not tokenized: the entire String value is indexed as a single token. For example this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache
TextField: A field that is indexed and tokenized, without term vectors. For example this would be used on a 'body' field, that contains the bulk of a document's text.
(II) Some Field options
1. Field.Store.YES/NO
Most Field constructors take an argument that specifies whether the content is stored in the index.
Stored content can be returned in search results and shown to the user.
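The most visible difference between the two is whether document.get("fileName") returns the content.
For example, a file's title is usually Field.Store.YES, because it generally has to be shown to the user. The same goes for the author, the abstract, and similar information.
The body of a file, however, may not need to be stored. For one thing, the body is large; for another, there is no need to keep it in the index, because the user can simply be pointed to the original file.
2. Boosting
Fields can be boosted. In Lucene 4.x the boost is set per field (Field.setBoost); the document-level boost of earlier versions was removed, so to boost a whole Document, boost each of its fields. Note that boosting is one factor that influences the order of results, but only one: together with the other factors it makes up Lucene's ranking algorithm.
(III) Indexing rich text (non-plain text)
The statement used above to index the body:

    Field field = new TextField("contents", new FileReader(filetoIndex));

works only for plain text. For rich formats such as Word, Excel, and PDF, FileReader reads nothing but garbage, and no useful index is produced.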
To index such documents, first extract the body text with a tool such as Tika, and then index it. http://stackoverflow.com/questions/16640292/lucene-4-2-0-index-pdf
Lucene doesn't handle files at all, really. That demo handles plain text files, but core Lucene doesn't. FileStreamReader is a Java standard stream reader, and for your purposes, it will only handle plain text. This works on the Unix philosophy. Lucene indexes content. Tika extracts content from rich documents. I've added links to a couple of examples using Tika, one with Lucene directly, the other using Solr (which you might want to consider as well).
A simple example follows: first use Tika to extract the body text of a Word document, then index the text with a TextField.

    doc.add(new TextField("contents", TikaBasicUtil.extractContent(filetoIndex), Field.Store.NO));

Note that StringField cannot be used here. A StringField indexes its whole value as a single term, and a term may not be longer than 32766 bytes in UTF-8. Otherwise an exception is thrown:
IllegalArgumentException: Document contains at least one immense term in field="contents" (whose UTF8 encoding is longer than the max length 32766)
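Because the limit in that message counts UTF-8 bytes rather than characters, a value can overflow it with far fewer than 32766 characters. The plain-Java sketch below checks a value before it is put into a single-term field. `TermLengthCheck` and its methods are hypothetical names, and the constant is copied from the exception message above rather than read from Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical pre-check for values meant to be indexed as one term. */
final class TermLengthCheck {
    // Limit taken from the exception message quoted above (an assumption of this sketch).
    static final int MAX_TERM_BYTES = 32766;

    /** Number of bytes the value occupies when encoded as UTF-8. */
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the whole value would fit into a single term. */
    static boolean fitsInSingleTerm(String value) {
        return utf8Length(value) <= MAX_TERM_BYTES;
    }

    /** Builds a string that repeats one character, for the demo below. */
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // 32766 ASCII characters are exactly 32766 bytes.
        System.out.println(fitsInSingleTerm(repeat('a', 32766)));      // true
        // 11000 CJK characters take 3 bytes each in UTF-8, 33000 bytes in total.
        System.out.println(fitsInSingleTerm(repeat('\u7d22', 11000))); // false
    }
}
```

Extracted document bodies will normally fail this check, which is one more reason to index them with a TextField, where the analyzer breaks the text into many short terms.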
A simple example of extracting rich text with Tika for indexing follows. Note that it handles not only Word documents but also PDF, Excel, and so on.

    package org.jediael.util;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public class TikaBasicUtil {

        public static String extractContent(File f) {
            // 1. Create a parser
            Parser parser = new AutoDetectParser();
            InputStream is = null;
            try {
                Metadata metadata = new Metadata();
                metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
                is = new FileInputStream(f);
                ContentHandler handler = new BodyContentHandler();
                ParseContext context = new ParseContext();
                context.set(Parser.class, parser);
                // 2. Run the parser's parse() method
                parser.parse(is, handler, metadata, context);
                String returnString = handler.toString();
                System.out.println(returnString.length());
                return returnString;
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (SAXException e) {
                e.printStackTrace();
            } catch (TikaException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (is != null) {
                        is.close();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            // Reached only when extraction failed.
            return "No Contents";
        }
    }

III. About Document
FSDocument RAMDocument
IV. About IndexWriter
1. Creating an IndexWriter

    Directory returnIndexDir = FSDirectory.open(indexDir);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    writer = new IndexWriter(returnIndexDir, iwc);
    System.out.println(writer.getConfig().getOpenMode() + "");
    System.out.println(iwc.getOpenMode());

Creating an IndexWriter takes two arguments: a Directory object, which specifies where the index is written, and an IndexWriterConfig object, which specifies the writer's configuration.
2. IndexWriterConfig
(1) Inheritance hierarchy
- java.lang.Object
  - org.apache.lucene.index.LiveIndexWriterConfig
    - org.apache.lucene.index.IndexWriterConfig
- All Implemented Interfaces: Cloneable
(2) Holds all the configuration that is used to create an IndexWriter. Once IndexWriter has been created with this object, changes to this object will not affect the IndexWriter instance.
(3) IndexWriterConfig.OpenMode: specifies how the index directory is opened. There are three modes:
APPEND: Opens an existing index. If an index already exists, the newly indexed content is appended to it, regardless of whether the documents duplicate existing ones. So if the same documents are indexed twice, a search returns twice as many results as before.
CREATE: Creates a new index or overwrites an existing one. If an index already exists, it is deleted first and a new index is created.
CREATE_OR_APPEND (the default): Creates a new index if one does not exist, otherwise it opens the index and documents will be appended.
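The effect of the three modes on an existing index can be modelled in a few lines of plain Java. This is a toy model, not Lucene code: `OpenModeModel` is a hypothetical name, an index is reduced to a list of document names, and `null` stands for "no index in the directory yet". Failing on APPEND when nothing exists is this model's reading of "opens an existing index".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of what each open mode does to the documents already in an index. */
final class OpenModeModel {
    enum Mode { CREATE, APPEND, CREATE_OR_APPEND }

    /**
     * Returns the documents in the index after opening it in the given mode
     * and adding newDocs. Pass existing == null when no index exists yet.
     */
    static List<String> open(Mode mode, List<String> existing, List<String> newDocs) {
        List<String> result = new ArrayList<>();
        switch (mode) {
            case CREATE:
                // Old content is discarded; start from an empty index.
                break;
            case APPEND:
                if (existing == null) {
                    throw new IllegalStateException("no existing index to append to");
                }
                result.addAll(existing);
                break;
            case CREATE_OR_APPEND:
                if (existing != null) {
                    result.addAll(existing);
                }
                break;
        }
        result.addAll(newDocs); // no duplicate detection, as described above
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a.txt", "b.txt");
        System.out.println(open(Mode.APPEND, docs, docs).size()); // prints 4
        System.out.println(open(Mode.CREATE, docs, docs).size()); // prints 2
    }
}
```

The first call shows the doubling described for APPEND: indexing the same two files again leaves four entries in the index.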
V. About Analyzer
This section only covers analyzers as they relate to indexing. For more detail on analyzers, see http://blog.csdn.net/jediael_lu/article/details/33303499 [Lucene 4.8 Tutorial, Part 4] Analysis
An analyzer must be specified when the IndexWriter is created, for example:

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
    writer = new IndexWriter(returnIndexDir, iwc);

However, each time a document is added to the writer, an analyzer can also be specified for that particular document, for example:

    writer.addDocument(doc, new SimpleAnalyzer(Version.LUCENE_48));

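To see why the choice of analyzer matters at index time, the sketch below imitates two behaviours in plain Java without using Lucene. `ToyAnalyzers` is a hypothetical class: `simple` splits text at non-letters and lower-cases it, which is how SimpleAnalyzer is documented to behave, and `keyword` keeps the whole value as a single token, like a StringField that is not analyzed. It works on `char` values, so characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy imitation of two ways a field value can be turned into index terms. */
final class ToyAnalyzers {

    /** Splits at every non-letter and lower-cases the pieces. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Keeps the whole value as one token, unchanged. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        System.out.println(simple("Lucene-4.8 Tutorial: Indexing")); // [lucene, tutorial, indexing]
        System.out.println(keyword("/docs/Readme.txt"));             // [/docs/Readme.txt]
    }
}
```

A query only matches if it produces the same terms, which is why a path indexed as one token has to be searched with the exact path, while body text can be found by any of its words.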
VI. About Directory