
Text Mining: Determining Text Similarity

Published: 2025/3/19


劉勇 (Liu Yong)  Email: lyssym@sina.com

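Introduction

For text similarity determination, this article presents two algorithms, cosine similarity and SimHash, and gives solutions to several problems encountered in a real project. Practical testing shows that cosine similarity is suited to short texts, whereas SimHash is suited to long texts and can be applied in big-data environments.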

Cosine Similarity

Principle

Law of cosines:

[Figure 1: Illustration of the law of cosines]

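Properties:

The cosine value lies in [-1, 1]. The closer the value is to 1, the closer the angle between the two vectors is to 0°, the more closely their directions agree, and the higher the similarity. Note that in text similarity determination the components of a text feature vector are non-negative weights, so the cosine value lies in [0, 1]: the closer the angle between the vectors is to 90°, the less similar the two vectors are.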
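
For reference, the cosine of the angle between two n-dimensional vectors is usually written as below. This is the standard definition, not a formula taken from the article; the symbols a, b and n are notation chosen here.

```latex
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}}\;\sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

When every component a_i and b_i is non-negative, the numerator is non-negative, which is why the value stays within [0, 1] for text vectors.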

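Vector Space Model

The VSM (Vector Space Model) reduces the processing of text content to vector operations in a vector space.

Concepts:

1) Document (D): a document or a fragment of one; usually denotes a whole document.

2) Term (T): the basic linguistic unit of text content features, including characters, words, word groups, or phrases.

3) Weight (W): the weight of term T, i.e. its importance in document D.

Weights:

The usual statistic for the importance of a word in one text of a text collection or corpus is TF-IDF (term frequency–inverse document frequency): a term's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. Details are omitted here. In the actual project, however, the text collection changes, and changes quickly, so TF-IDF is not suitable. This article instead uses a very simple and intuitive measure: the term frequency represents the term's importance (i.e. its weight) in the text.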
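
As a minimal sketch of the term-frequency weighting described above: the class and method names are hypothetical, and the tokens are assumed to come from a word segmenter (the article uses IK) that has already been run.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TermFrequencyDemo {
    /** Counts how often each term occurs; insertion order is kept for readability. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String token : tokens) {
            freq.merge(token, 1, Integer::sum); // first occurrence stores 1, later ones add 1
        }
        return freq;
    }

    public static void main(String[] args) {
        // Hypothetical, already segmented input.
        List<String> tokens = Arrays.asList("text", "similarity", "text", "mining");
        System.out.println(termFrequencies(tokens)); // {text=2, similarity=1, mining=1}
    }
}
```

Unlike TF-IDF, these weights depend only on the text itself, so nothing has to be recomputed when the collection changes.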

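Vector alignment:

In practice, the two vectors representing the texts' features have different lengths, so they must be processed before comparison. Two approaches exist: 1) remove unimportant terms from the vectors so that both have the same length; this currently relies on empirically chosen keywords, and its accuracy cannot be guaranteed; 2) merge the vectors: for each term of the merged vector, if the term occurs in the original vector, use its term frequency, otherwise set that component to 0. For example:

Text1: 我/是/中國人/ (I / am / Chinese)

Text2: 我們/是/中國人/ (we / are / Chinese)

Vector: 我/是/中國人/我們/

Vector1 = (1, 1, 1, 0)

Vector2 = (0, 1, 1, 1)

The “/” above is the separator produced by IK word segmentation in smart mode. The merged vector is shown as Vector, and the aligned vectors are Vector1 and Vector2. The similarity is then determined from the cosine of the two vectors.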
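
The alignment in this example can be sketched as follows. This is a simplified illustration, not the article's program: the names are hypothetical, synonyms are ignored, and the token lists are assumed to be segmenter output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class AlignedCosineDemo {
    /** Merged vocabulary: the terms of a, followed by the terms of b not seen before. */
    static List<String> mergeVocabulary(List<String> a, List<String> b) {
        Set<String> vocab = new LinkedHashSet<>(a);
        vocab.addAll(b);
        return new ArrayList<>(vocab);
    }

    /** Term-frequency vector of tokens over the merged vocabulary (0 where a term is absent). */
    static int[] align(List<String> vocab, List<String> tokens) {
        int[] vec = new int[vocab.size()];
        for (String token : tokens) {
            int i = vocab.indexOf(token);
            if (i >= 0) {
                vec[i]++;
            }
        }
        return vec;
    }

    /** Cosine of two equally long vectors; 0 if either vector is all zeros. */
    static double cosine(int[] x, int[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return (nx == 0 || ny == 0) ? 0.0 : dot / Math.sqrt(nx * ny);
    }

    public static void main(String[] args) {
        List<String> text1 = Arrays.asList("我", "是", "中國人");
        List<String> text2 = Arrays.asList("我們", "是", "中國人");
        List<String> vocab = mergeVocabulary(text1, text2); // 我, 是, 中國人, 我們
        int[] v1 = align(vocab, text1); // (1, 1, 1, 0)
        int[] v2 = align(vocab, text2); // (0, 1, 1, 1)
        System.out.println(cosine(v1, v2)); // 2 / 3, about 0.667
    }
}
```

For the two example vectors the dot product is 2 and both norms are √3, so the cosine is 2/3.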

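Special Cases

In the actual project, two special cases were found; the solution to each is given below.

1) A long sentence contains a short sentence (not necessarily in full):

Text1: “貫徹強軍目標出實招用實勁努力開創部隊建設新局面” (roughly: "Implement the goal of a strong military with concrete measures and real effort, and strive to open a new chapter in force building")

Text2: “在接見駐浙部隊領導干部時強調貫徹強軍目標出實招用實勁努力開創部隊建設新局面” (the same headline preceded by "Stressed when meeting leading officers of the troops stationed in Zhejiang")

Both texts are real web page titles. If they are judged by plain cosine similarity alone, the misjudgment rate is fairly high. The solution: if the long sentence is at least 1.5 times as long as the short one (length measured in terms after Chinese word segmentation, not in characters), take successive windows of the short sentence's length from the long sentence and compare each with the short sentence until the end of the long sentence is reached. If any comparison reaches the preset threshold, exit the loop and judge the texts similar; otherwise judge them dissimilar.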
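
The windowed comparison can be sketched roughly as below. It is a simplification under stated assumptions: the names are hypothetical, synonyms are ignored, and the windows are not padded with neighbouring terms as the article's source program does. The thresholds 0.85 (windowed) and 0.75 (plain) are the values that appear in that program.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WindowedMatchDemo {
    /** Term-frequency map of a token list. */
    static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        return freq;
    }

    /** Cosine similarity of two token lists, using term frequency as the weight. */
    static double cosine(List<String> a, List<String> b) {
        Map<String, Integer> ca = counts(a);
        Map<String, Integer> cb = counts(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            int x = e.getValue();
            na += x * x;
            dot += x * cb.getOrDefault(e.getKey(), 0);
        }
        for (int y : cb.values()) {
            nb += y * y;
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /** True if some window of the long text, as long as the short text, reaches the threshold. */
    static boolean anyWindowMatches(List<String> shortText, List<String> longText, double threshold) {
        int n = shortText.size();
        for (int start = 0; start + n <= longText.size(); start++) {
            if (cosine(shortText, longText.subList(start, start + n)) >= threshold) {
                return true; // stop at the first window that is similar enough
            }
        }
        return false;
    }

    /** Uses windows when one text is at least 1.5 times as long as the other. */
    static boolean similar(List<String> a, List<String> b, double windowThreshold, double plainThreshold) {
        List<String> shorter = a.size() <= b.size() ? a : b;
        List<String> longer = a.size() <= b.size() ? b : a;
        if (longer.size() >= shorter.size() * 1.5) {
            return anyWindowMatches(shorter, longer, windowThreshold);
        }
        return cosine(a, b) >= plainThreshold;
    }

    public static void main(String[] args) {
        List<String> shortText = Arrays.asList("b", "c", "d");
        List<String> longText = Arrays.asList("a", "b", "c", "d", "e", "f");
        System.out.println(cosine(shortText, longText));              // about 0.707, below both thresholds
        System.out.println(similar(shortText, longText, 0.85, 0.75)); // true: the window b, c, d matches
    }
}
```

The example shows the effect the section describes: the plain cosine of the two lists falls below the threshold even though the short text is fully contained in the long one, while the windowed comparison finds the match.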

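2) The texts contain synonymous expressions:

Text1: “臺灣居民明日起持臺胞證可通關無需辦理簽注” (roughly: "From tomorrow, Taiwan residents can clear customs with the Taiwan compatriot permit, with no endorsement required")

Text2: “明起臺胞來京無需辦理簽注電子臺胞證年內實施” (roughly: "From tomorrow Taiwan compatriots visiting Beijing need no endorsement; electronic permits to be introduced within the year")

In these two texts, “臺胞” and “臺灣居民” (both referring to Taiwan residents), and “明日起” and “明起” (both meaning "from tomorrow"), are synonymous expressions; they can be treated as near-synonyms, although they do not fall strictly within that category. The solution is to introduce a synonym dictionary. Given the richness of the Chinese vocabulary, this alleviates the problem to some extent but is not a fundamental solution.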
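
One simple way to apply a synonym dictionary is to map every token to a canonical form before the vectors are built. The sketch below does that; it differs from the article's source program, which looks synonyms up while comparing terms. The names, the token list and the two dictionary entries are illustrative assumptions (the content of the real synonyms.dict is not given).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SynonymNormalizeDemo {
    /** Replaces each token by its canonical form if the dictionary lists one. */
    static List<String> normalize(List<String> tokens, Map<String, String> synonyms) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String t : tokens) {
            out.add(synonyms.getOrDefault(t, t)); // unknown tokens pass through unchanged
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative entries only: variant -> canonical form.
        Map<String, String> synonyms = new HashMap<>();
        synonyms.put("臺胞", "臺灣居民");
        synonyms.put("明起", "明日起");

        // Hypothetical segmentation of part of Text2.
        List<String> tokens = Arrays.asList("明起", "臺胞", "無需", "辦理", "簽注");
        System.out.println(normalize(tokens, synonyms));
    }
}
```

After normalization both texts share the terms “臺灣居民” and “明日起”, so the corresponding vector components line up.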

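Application Scenarios, Advantages, and Disadvantages

The algorithm is currently applied to merging and clustering web page titles, and other scenarios are being explored.

Advantages: accurate results; well suited to short texts.

Disadvantages: every text must be vectorized and a cosine computed for each pair, which consumes considerable CPU time, so the method is not suited to long texts such as web page bodies or documents.

Source code of the cosine similarity algorithm: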

Class ElementDict:

    /** A term together with its frequency in one text. */
    public class ElementDict {
        private String term;
        private int freq;

        public ElementDict(String term, int freq) {
            this.term = term;
            this.freq = freq;
        }

        public void setFreq(int freq) {
            this.freq = freq;
        }

        public String getTerm() {
            return term;
        }

        public int getFreq() {
            return freq;
        }
    }

Class TextCosine:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    public class TextCosine {
        // Synonym dictionary; each line of synonyms.dict has the form "term→synonym".
        private final Map<String, String> map = new HashMap<>();

        public TextCosine() {
            String path = TextCosine.class.getResource("synonyms.dict").getPath();
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(path), "UTF-8"))) {
                String s;
                while ((s = br.readLine()) != null) {
                    String[] synonymsEnum = s.split("→");
                    if (synonymsEnum.length >= 2) { // skip malformed lines
                        map.put(synonymsEnum[0], synonymsEnum[1]);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        /** Segments str with IK (smart mode) and counts term frequencies, folding synonyms together. */
        public List<ElementDict> tokenizer(String str) {
            List<ElementDict> list = new ArrayList<>();
            IKAnalyzer analyzer = new IKAnalyzer(true);
            try {
                TokenStream stream = analyzer.tokenStream("", str);
                CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    int index = isContain(cta.toString(), list);
                    if (index >= 0) {
                        list.get(index).setFreq(list.get(index).getFreq() + 1);
                    } else {
                        list.add(new ElementDict(cta.toString(), 1));
                    }
                }
                stream.end();
                stream.close();
                analyzer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return list;
        }

        // Two terms match if they are equal or the dictionary lists one as the other's synonym.
        private boolean matches(String a, String b) {
            return a.equals(b) || b.equals(map.get(a)) || a.equals(map.get(b));
        }

        /** Returns the index of the entry matching str (directly or via a synonym), or -1. */
        public int isContain(String str, List<ElementDict> list) {
            for (int i = 0; i < list.size(); i++) {
                if (matches(list.get(i).getTerm(), str)) {
                    return i;
                }
            }
            return -1;
        }

        /**
         * Builds the merged term list of both texts. A term is added only if neither it
         * nor a synonym of it is already present. (The original else-branch re-added
         * terms that were already in the list, which produced duplicates.)
         */
        public List<String> mergeTerms(List<ElementDict> list1, List<ElementDict> list2) {
            List<ElementDict> all = new ArrayList<>(list1);
            all.addAll(list2);
            List<String> list = new ArrayList<>();
            for (ElementDict ed : all) {
                boolean present = false;
                for (String term : list) {
                    if (matches(term, ed.getTerm())) {
                        present = true;
                        break;
                    }
                }
                if (!present) {
                    list.add(ed.getTerm());
                }
            }
            return list;
        }

        /** Returns 1 if the two term lists are judged similar, otherwise 0. */
        public int analysisTerms(List<ElementDict> list1, List<ElementDict> list2) {
            int len1 = list1.size();
            int len2 = list2.size();
            if (len2 >= len1 * 1.5) {
                return matchWindows(list1, list2) ? 1 : 0;
            } else if (len1 >= len2 * 1.5) {
                return matchWindows(list2, list1) ? 1 : 0;
            }
            return getEasyResult(analysis(list1, list2)) ? 1 : 0;
        }

        // Slides a window of the short list's length over the long list and
        // compares each padded window with the short list.
        private boolean matchWindows(List<ElementDict> shortList, List<ElementDict> longList) {
            int lenSmall = shortList.size();
            int lenBig = longList.size();
            for (int i = 0; i + lenSmall <= lenBig; i++) {
                List<ElementDict> window = new ArrayList<>(longList.subList(i, i + lenSmall));
                adjustList(window, longList, lenBig, lenSmall, i);
                if (getResult(analysis(shortList, window))) {
                    return true;
                }
            }
            return false;
        }

        /** Pads the window with up to two extra terms taken from the long list. */
        public List<ElementDict> adjustList(List<ElementDict> newList, List<ElementDict> list,
                                            int lenBig, int lenSmall, int index) {
            int gap = lenBig - lenSmall;
            int size = Math.min(gap / 2, 2);
            if (index < gap / 2) {
                for (int i = 0; i < size; i++) {
                    newList.add(list.get(lenSmall + index + i));
                }
            } else {
                // The original condition was "i > size", so this loop never ran;
                // "i < size" is assumed to be the intent.
                for (int i = 0; i < size; i++) {
                    newList.add(list.get(lenBig - index - i));
                }
            }
            return newList;
        }

        public double analysis(List<ElementDict> list1, List<ElementDict> list2) {
            List<String> list = mergeTerms(list1, list2);
            List<Integer> weightList1 = assignWeight(list, list1);
            List<Integer> weightList2 = assignWeight(list, list2);
            return countCosSimilarity(weightList1, weightList2);
        }

        /** Builds the aligned vector: exactly one component per merged term, 0 if absent. */
        public List<Integer> assignWeight(List<String> list, List<ElementDict> list1) {
            List<Integer> vecList = new ArrayList<>(list.size());
            for (String str : list) {
                int weight = 0;
                for (ElementDict ed : list1) {
                    if (matches(ed.getTerm(), str)) {
                        weight += ed.getFreq();
                    }
                }
                vecList.add(weight);
            }
            return vecList;
        }

        public double countCosSimilarity(List<Integer> list1, List<Integer> list2) {
            long element = 0;
            long denominator1 = 0;
            long denominator2 = 0;
            for (int i = 0; i < list1.size(); i++) {
                int left = list1.get(i);
                int right = list2.get(i);
                element += (long) left * right;
                denominator1 += (long) left * left;
                denominator2 += (long) right * right;
            }
            if (denominator1 == 0 || denominator2 == 0) {
                return 0; // floating-point division by zero yields NaN, not an exception
            }
            return element / Math.sqrt((double) denominator1 * denominator2);
        }

        /** Threshold for the windowed (long vs. short) comparison. */
        public boolean getResult(double scores) {
            System.out.println(scores);
            return scores >= 0.85;
        }

        /** Threshold for texts of comparable length. */
        public boolean getEasyResult(double scores) {
            System.out.println(scores);
            return scores >= 0.75;
        }
    }

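Note: the synonym dictionary file “synonyms.dict” is large and can be built by yourself (the program expects one “term→synonym” pair per line), so it is not reproduced here.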

SimHash

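SimHash is the text similarity method Google uses for processing massive numbers of web pages. Its main purpose is dimensionality reduction: it maps a high-dimensional feature vector to an f-bit fingerprint, and the Hamming distance between the fingerprints of two documents indicates whether the documents are duplicates or similar.

Procedure

The algorithm is elegantly designed. Its main steps are:

1. Quantize the document's features into a vector;

2. Compute the hash value of each feature term and weight it;

3. Accumulate the weighted values position by position over the f-bit fingerprint;

4. In the accumulated result, mark a bit as 1 if the value at that position is positive, otherwise as 0.

Note: the f-bit fingerprint can be 16, 32, or 64 bits, or another width, depending on the application.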

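Figure 2, taken from the paper by Charikar, the author of SimHash, illustrates the process. In terms of the actual project: Doc is a text; the features are the terms obtained by Chinese word segmentation of the text, organized as a column vector; the weight is the frequency of each term in the text. Each term is then hashed by some hash function, giving the combinations of 1s and 0s shown in the figure; the remaining parts are not elaborated here. Note that Charikar's paper does not prescribe a particular hash function. In the author's view, any hash function whose values are well balanced and well dispersed will do, and it can be designed to suit the application. The project uses a self-designed hash function; although it has not been fully validated, test results show that it currently meets the requirements.

Figure 2: The SimHash process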
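
The four steps can be sketched compactly with a 64-bit fingerprint held in a long. This is only an illustration under assumptions: the class and method names are hypothetical, and hash64 is an FNV-1a-style stand-in over UTF-16 code units, not the hash function designed for the article's project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SimHashSketch {
    /** Stand-in 64-bit string hash (FNV-1a style); an assumption, not the article's hash. */
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Computes a 64-bit fingerprint from terms and their weights (term frequencies). */
    static long fingerprint(Map<String, Integer> weightedTerms) {
        int[] v = new int[64];
        for (Map.Entry<String, Integer> e : weightedTerms.entrySet()) {
            long h = hash64(e.getKey());   // step 2: hash each term
            int weight = e.getValue();
            for (int i = 0; i < 64; i++) {
                // step 3: add the weight where the hash bit is 1, subtract it where it is 0
                v[i] += ((h >>> i) & 1L) == 1L ? weight : -weight;
            }
        }
        long fp = 0L;
        for (int i = 0; i < 64; i++) {
            if (v[i] > 0) {                // step 4: positive sum -> bit 1, otherwise 0
                fp |= 1L << i;
            }
        }
        return fp;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new LinkedHashMap<>();
        doc.put("text", 2);
        doc.put("similarity", 1);
        System.out.println(Long.toBinaryString(fingerprint(doc)));
    }
}
```

A heavily weighted term dominates the sign at every position, which is how term frequency influences the fingerprint. The article's own source program uses BigInteger so that the fingerprint width stays configurable, and it sets a bit when the sum is greater than or equal to zero.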

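Hamming Distance

The Hamming distance is used in error-control coding for data transmission; it is the number of positions at which two words of equal length differ. Because the fingerprint computed by SimHash consists of 0s and 1s, the Hamming distance is used to measure document similarity or duplication. Further details are omitted here.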
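
For 64-bit fingerprints held in a long, the Hamming distance is the number of 1 bits in the XOR of the two values. A minimal sketch with hypothetical names follows; the maximum distance of 5 used in the example is the default that appears in the article's source program.

```java
final class HammingDemo {
    /** Number of bit positions in which a and b differ. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b); // XOR leaves a 1 exactly where the bits differ
    }

    /** Treats two fingerprints as near-duplicates if they differ in at most maxDistance bits. */
    static boolean isNearDuplicate(long a, long b, int maxDistance) {
        return hammingDistance(a, b) <= maxDistance;
    }

    public static void main(String[] args) {
        System.out.println(hammingDistance(0b1011L, 0b0010L));    // 2
        System.out.println(isNearDuplicate(0b1011L, 0b0010L, 5)); // true
    }
}
```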

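Application Scenarios, Advantages, and Disadvantages

The algorithm is currently applied to topic detection and content aggregation, and other scenarios are also being explored.

Advantages: fast text processing; the computed fingerprints can be stored in a database, so the method is well suited to similarity detection over massive text collections.

Disadvantages: a short text provides little data for the hash computation, so the recognition rate for short-text similarity is low.

Source code of the SimHash algorithm: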

Class TermDict:

    /** A term together with its frequency in one text. */
    public class TermDict {
        private String term;
        private int freq;

        public TermDict(String term, int freq) {
            this.term = term;
            this.freq = freq;
        }

        public String getTerm() {
            return term;
        }

        public void setTerm(String term) {
            this.term = term;
        }

        public int getFreq() {
            return freq;
        }

        public void setFreq(int freq) {
            this.freq = freq;
        }
    }

Class SimHash:

    import java.io.IOException;
    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    public class SimHash {
        private String tokens;     // the text to fingerprint
        private int hashBits = 64; // fingerprint width f
        private int distance = 5;  // maximum Hamming distance that counts as a match

        public SimHash(String tokens) {
            this.tokens = tokens;
        }

        public SimHash(String tokens, int hashBits, int distance) {
            this.tokens = tokens;
            this.hashBits = hashBits;
            this.distance = distance;
        }

        /** Segments the text with IK (smart mode) and counts term frequencies. */
        public List<TermDict> tokenizer() {
            List<TermDict> terms = new ArrayList<>();
            IKAnalyzer analyzer = new IKAnalyzer(true);
            try {
                TokenStream stream = analyzer.tokenStream("", this.tokens);
                CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    int index = isContain(cta.toString(), terms);
                    if (index >= 0) {
                        terms.get(index).setFreq(terms.get(index).getFreq() + 1);
                    } else {
                        terms.add(new TermDict(cta.toString(), 1));
                    }
                }
                stream.end();
                stream.close();
                analyzer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return terms;
        }

        /** Returns the index of the entry whose term equals str, or -1. */
        public int isContain(String str, List<TermDict> terms) {
            for (int i = 0; i < terms.size(); i++) {
                if (str.equals(terms.get(i).getTerm())) {
                    return i;
                }
            }
            return -1;
        }

        /** Computes the fingerprint from the weighted terms. */
        public BigInteger simHash(List<TermDict> terms) {
            int[] v = new int[hashBits];
            for (TermDict td : terms) {
                int weight = td.getFreq();
                BigInteger bt = shiftHash(td.getTerm());
                for (int i = 0; i < hashBits; i++) {
                    // Add the weight where the hash bit is 1, subtract it where it is 0.
                    if (bt.testBit(i)) {
                        v[i] += weight;
                    } else {
                        v[i] -= weight;
                    }
                }
            }

            BigInteger fingerPrint = BigInteger.ZERO;
            for (int i = 0; i < hashBits; i++) {
                if (v[i] >= 0) { // a sum of exactly zero is also mapped to 1 here
                    fingerPrint = fingerPrint.setBit(i);
                }
            }
            return fingerPrint;
        }

        /** The self-designed string hash; the result is masked to hashBits bits. */
        public BigInteger shiftHash(String str) {
            if (str == null || str.length() == 0) {
                return BigInteger.ZERO;
            }
            char[] sourceArray = str.toCharArray();
            BigInteger x = BigInteger.valueOf((long) sourceArray[0] << 7);
            BigInteger m = BigInteger.valueOf(131313);
            for (char item : sourceArray) {
                x = x.multiply(m).add(BigInteger.valueOf(item));
            }
            BigInteger mask = BigInteger.ONE.shiftLeft(hashBits).subtract(BigInteger.ONE);
            boolean flag = true;
            for (char item : sourceArray) {
                // Alternate between a left and a right shift of the character code.
                long shifted = flag ? (long) item << 3 : (long) item >> 3;
                x = x.multiply(m).xor(BigInteger.valueOf(shifted)).and(mask);
                flag = !flag;
            }
            // x is masked and therefore non-negative, so the original check for -1
            // could never fire and has been dropped.
            return x;
        }

        public BigInteger getSimHash() {
            return simHash(tokenizer());
        }

        /** Number of fingerprint bits in which this text and hashData differ. */
        public int getHammingDistance(SimHash hashData) {
            BigInteger m = BigInteger.ONE.shiftLeft(hashBits).subtract(BigInteger.ONE);
            BigInteger a = getSimHash(); // computed once instead of on every use
            BigInteger b = hashData.getSimHash();
            System.out.println(getFingerPrint(a.toString(2)));
            System.out.println(getFingerPrint(b.toString(2)));
            int tot = a.xor(b).and(m).bitCount();
            System.out.println(tot);
            return tot;
        }

        /** Left-pads the binary string with zeros to hashBits characters. */
        public String getFingerPrint(String str) {
            StringBuilder sb = new StringBuilder();
            for (int i = str.length(); i < hashBits; i++) {
                sb.append('0');
            }
            return sb.append(str).toString();
        }

        public void getResult(SimHash hashData) {
            if (getHammingDistance(hashData) <= distance) {
                System.out.println("match");
            } else {
                System.out.println("false");
            }
        }
    }

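Note: “131313” in the source code is simply a fairly large multiplier chosen by the author and has no special meaning; the value can be set as needed. (The original text calls it a prime, but 131313 = 3 × 43771 is not prime.)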

Author: 志青云集
Source: http://www.cnblogs.com/lyssym/p/4880896.html
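Copyright of this article is shared by the author and 博客園 (cnblogs). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original must be placed prominently on the article page.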
