
A Study of Lucene's Scoring (score) Mechanism


First, we need to understand Lucene's scoring formula. The classic formula (as documented for DefaultSimilarity/TFIDFSimilarity) is:

score(q, d) = coord(q, d) × queryNorm(q) × Σ_{t in q} [ tf(t in d) × idf(t)² × boost(t.field in d) × norm(t, d) ]

That is, the score is the sum, over every term t in the query q, of that term's match score against document d, scaled by several weighting factors. Each factor is described in the table below:

Table 3.5 Factors in the scoring formula

| Scoring factor | Description |
|---|---|
| tf(t in d) | Term frequency factor: the frequency of term t in document d. |
| idf(t) | Inverse document frequency: measures how "unique" a term is. Terms that occur in many documents get a low idf; rare terms get a high idf. |
| boost(t.field in d) | Boost for the field or document, set at indexing time. Lets you statically weight an individual field or document. |
| lengthNorm(t.field in d) | Normalization value of the field, reflecting the number of terms it contains. Computed at indexing time and stored in the index norms. Shorter fields (fewer tokens) receive a larger weight from this factor. |
| coord(q, d) | Coordination factor: based on how many of the query terms the document contains. Gives an AND-like reward to documents that match more of the search terms. |
| queryNorm(q) | Normalization value for the query, computed from the sum of the squared weights of the query terms (specifically 1/√sumOfSquaredWeights; see below). |
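The index-time boost in the table is set through setBoost. A minimal sketch against the same old Lucene 2.x API used in the example below (both calls existed in that API; the boosts get folded into the stored norm at indexing time):

```java
Document doc = new Document();
Field title = new Field("bookname", "ab bc", Field.Store.YES, Field.Index.TOKENIZED);
title.setBoost(1.5f);  // static field boost, multiplied into the stored norm
doc.setBoost(2.0f);    // static document boost
doc.add(title);
```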


You can inspect exactly how a document's score is composed via the Searcher.explain(Query query, int doc) method. Example (note: this code uses the old Lucene 2.x API, e.g. Hits and the IndexWriter(String, Analyzer, boolean) constructor, which were removed in later versions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ScoreSortTest {

    public final static String INDEX_STORE_PATH = "index";

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(INDEX_STORE_PATH, new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);

        Document doc1 = new Document();
        Document doc2 = new Document();
        Document doc3 = new Document();
        Field f1 = new Field("bookname", "bc bc",    Field.Store.YES, Field.Index.TOKENIZED);
        Field f2 = new Field("bookname", "ab bc",    Field.Store.YES, Field.Index.TOKENIZED);
        Field f3 = new Field("bookname", "ab bc cd", Field.Store.YES, Field.Index.TOKENIZED);
        doc1.add(f1);
        doc2.add(f2);
        doc3.add(f3);
        writer.addDocument(doc1);
        writer.addDocument(doc2);
        writer.addDocument(doc3);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(INDEX_STORE_PATH);
        TermQuery q = new TermQuery(new Term("bookname", "bc"));
        q.setBoost(2f);
        Hits hits = searcher.search(q);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.print(doc.get("bookname") + "\t\t");
            System.out.println(hits.score(i));
            // Show how this hit's score was computed:
            System.out.println(searcher.explain(q, hits.id(i)));
        }
    }
}
```

Run output:

```
bc bc		0.629606
0.629606 = (MATCH) fieldWeight(bookname:bc in 0), product of:
  1.4142135 = tf(termFreq(bookname:bc)=2)
  0.71231794 = idf(docFreq=3, numDocs=3)
  0.625 = fieldNorm(field=bookname, doc=0)

ab bc		0.4451987
0.4451987 = (MATCH) fieldWeight(bookname:bc in 1), product of:
  1.0 = tf(termFreq(bookname:bc)=1)
  0.71231794 = idf(docFreq=3, numDocs=3)
  0.625 = fieldNorm(field=bookname, doc=1)

ab bc cd		0.35615897
0.35615897 = (MATCH) fieldWeight(bookname:bc in 2), product of:
  1.0 = tf(termFreq(bookname:bc)=1)
  0.71231794 = idf(docFreq=3, numDocs=3)
  0.5 = fieldNorm(field=bookname, doc=2)
```
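The fieldWeight values above can be reproduced by hand as tf × idf × fieldNorm; for "bc bc": 1.4142135 × 0.71231794 × 0.625 ≈ 0.629606. Below is a minimal sketch that re-implements the two factor formulas in plain Java purely for illustration (it does not call into Lucene):

```java
public class ExplainByHand {
    // tf and idf as DefaultSimilarity defines them
    static float tf(float freq) { return (float) Math.sqrt(freq); }
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        float idf = idf(3, 3); // "bc" occurs in all 3 documents
        System.out.println(tf(2) * idf * 0.625f); // "bc bc"    -> ~0.629606
        System.out.println(tf(1) * idf * 0.625f); // "ab bc"    -> ~0.4451987
        System.out.println(tf(1) * idf * 0.5f);   // "ab bc cd" -> ~0.35615897
    }
}
```

(Note that the final hit scores equal these fieldWeight values; the queryNorm section below explains why.)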

Relevant source code:

Calculating idf

idf is the inverse document frequency of the term, computed as:

```java
/** Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>. */
@Override
public float idf(long docFreq, long numDocs) {
  return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}
```

docFreq is the number of documents retrieved for the given term, and numDocs is the total number of documents in the index. In the run above, docFreq = 3 and numDocs = 3 (all three documents contain "bc"), so idf = ln(3/(3+1)) + 1 ≈ 0.71231794, which matches the explain output.

Calculating queryNorm

queryNorm is implemented in the DefaultSimilarity class as follows:

```java
/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
public float queryNorm(float sumOfSquaredWeights) {
  return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}
```

Here, sumOfSquaredWeights is computed by the sumOfSquaredWeights method of the org.apache.lucene.search.TermQuery.TermWeight class:

```java
public float sumOfSquaredWeights() {
  queryWeight = idf * getBoost();    // compute query weight
  return queryWeight * queryWeight;  // square it
}
```

By default sumOfSquaredWeights = idf * idf, since Lucene's default boost is 1.0 (our example, however, sets boost = 2).
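Plugging in the numbers from the example (a single-term query with boost = 2):

queryWeight = idf × boost = 0.71231794 × 2 = 1.42463588
sumOfSquaredWeights = 1.42463588² ≈ 2.0295874
queryNorm = 1/√2.0295874 ≈ 0.7019346

Note that queryWeight × queryNorm = 1.0 exactly: for a single-term query, queryNorm cancels the query-side idf and boost. This is why the final scores in the run output above are exactly the fieldWeight values.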

Calculating fieldWeight

The explainScore method in org/apache/lucene/search/similarities/TFIDFSimilarity.java contains:

```java
// explain field weight
Explanation fieldExpl = new Explanation();
fieldExpl.setDescription("fieldWeight in " + doc + ", product of:");

Explanation tfExplanation = new Explanation();
tfExplanation.setValue(tf(freq.getValue()));
tfExplanation.setDescription("tf(freq=" + freq.getValue() + "), with freq of:");
tfExplanation.addDetail(freq);
fieldExpl.addDetail(tfExplanation);
fieldExpl.addDetail(stats.idf);

Explanation fieldNormExpl = new Explanation();
float fieldNorm = norms != null ? decodeNormValue(norms.get(doc)) : 1.0f;
fieldNormExpl.setValue(fieldNorm);
fieldNormExpl.setDescription("fieldNorm(doc=" + doc + ")");
fieldExpl.addDetail(fieldNormExpl);

fieldExpl.setValue(tfExplanation.getValue() *
                   stats.idf.getValue() *
                   fieldNormExpl.getValue());

result.addDetail(fieldExpl);
```

The key statement is:

```java
fieldExpl.setValue(tfExplanation.getValue() *
                   stats.idf.getValue() *
                   fieldNormExpl.getValue());
```

Expressed as a formula:

fieldWeight = tf × idf × fieldNorm

tf and idf are computed as shown earlier. fieldNorm is fixed at indexing time and is simply decoded from the index here; explainScore does not recompute it. With DefaultSimilarity it is in effect the lengthNorm: the longer the field, the smaller the norm. Its computation is in org/apache/lucene/search/similarities/DefaultSimilarity.java:

```java
public float lengthNorm(FieldInvertState state) {
  final int numTerms;
  if (discountOverlaps)
    numTerms = state.getLength() - state.getNumOverlap();
  else
    numTerms = state.getLength();
  return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}
```
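This accounts for the fieldNorm values in the run output, with one subtlety: the norm is stored in the index as a single byte, so precision is lost in the encode/decode round trip. For the two-term field "bc bc", lengthNorm = 1/√2 ≈ 0.7071, which the one-byte encoding truncates to 0.625; for the three-term "ab bc cd", 1/√3 ≈ 0.5774 becomes 0.5. A minimal sketch of the round trip, assuming a Lucene version (e.g. 4.x) where SmallFloat still exposes floatToByte315/byte315ToFloat:

```java
import org.apache.lucene.util.SmallFloat;

public class NormEncodingDemo {
    public static void main(String[] args) {
        // lengthNorm for a 2-term and a 3-term field (boost = 1)
        float twoTerms   = (float) (1.0 / Math.sqrt(2)); // 0.7071...
        float threeTerms = (float) (1.0 / Math.sqrt(3)); // 0.5773...
        // Encode into the single index byte and decode again (lossy):
        System.out.println(SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(twoTerms)));   // 0.625
        System.out.println(SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(threeTerms))); // 0.5
    }
}
```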


Reposted from: https://www.cnblogs.com/davidwang456/p/6150388.html
