simhash notes
simhash has its limitations: on short texts of fewer than 500 characters it does not perform very well.
Example
from simhash import Simhash


def simhash_similarity(text_a, text_b):
    """Compute the simhash similarity of two texts."""
    a_simhash = Simhash(text_a)
    b_simhash = Simhash(text_b)
    # Bit length of the larger fingerprint; bit_length() avoids counting
    # the '0b' prefix that len(bin(...)) would include.
    max_hashbit = max(a_simhash.value.bit_length(), b_simhash.value.bit_length())
    distance = a_simhash.distance(b_simhash)
    similar = 1 - distance / max_hashbit
    return similar
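To make the normalisation in simhash_similarity concrete, here is a minimal sketch that works on plain integers instead of Simhash objects. It assumes that the distance between two fingerprints is their Hamming distance (the number of differing bits) and that fingerprints are 64 bits wide, matching the zfill(64) used in the source code section. The names hamming_distance, similarity and FINGERPRINT_BITS are illustrative and not part of the simhash package.

```python
FINGERPRINT_BITS = 64  # assumed fingerprint width


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int, bits: int = FINGERPRINT_BITS) -> float:
    """Map a Hamming distance onto [0, 1]; 1.0 means identical fingerprints."""
    return 1 - hamming_distance(a, b) / bits


# Two toy fingerprints that differ in exactly two bit positions.
a = 0b1011
b = 0b0010
print(hamming_distance(a, b))  # 2
print(similarity(a, b))        # 0.96875
```

Dividing by a fixed width keeps scores comparable across pairs, whereas dividing by the bit length of the larger value makes the denominator depend on how many leading zero bits the fingerprints happen to have.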
Source code
SPLITS_NUM = 8  # number of segments each 64-bit fingerprint is split into
SEGMENT_BITS = 64 // SPLITS_NUM  # width of one segment in bits


def build_simhash_dict(hash_list):
    """Build an inverted index over simhash fingerprints: every element of
    hash_list is filed under each of its bit segments."""
    hash_dict = {}
    for h in hash_list:
        # Zero-padded 64-bit binary string of the fingerprint.
        hash_string = format(h.value, '064b')
        for j in range(SPLITS_NUM):
            key = hash_string[j * SEGMENT_BITS:(j + 1) * SEGMENT_BITS]
            hash_dict.setdefault(key, []).append(h)
    return hash_dict


def compute_shortest_distance(h, hash_dict):
    """Return the smallest Hamming distance between h and the indexed
    fingerprints that share a segment with it. Any fingerprint with
    distance < 8 shares at least one segment with h and is therefore found;
    if there are no candidates, return the default value 1000."""
    hash_string = format(h.value, '064b')
    reverse_list = []
    # Collect every indexed fingerprint that shares a segment with h.
    for j in range(SPLITS_NUM):
        key = hash_string[j * SEGMENT_BITS:(j + 1) * SEGMENT_BITS]
        if key in hash_dict:
            reverse_list.extend(hash_dict[key])
    min_value = 1000  # default when no candidate shares a segment
    for candidate in reverse_list:
        distance = h.distance(candidate)
        if distance < min_value:
            min_value = distance
    return min_value
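The segment lookup relies on the pigeonhole principle: if two 64-bit fingerprints differ in fewer than 8 bits, the differing bits can touch at most 7 of the 8 segments, so at least one segment is identical. The following is a minimal sketch on plain integers rather than Simhash objects. split_segments and shares_segment are hypothetical helpers, not part of the simhash package, and unlike the index above they only compare segments at the same position.

```python
import random

FINGERPRINT_BITS = 64
SPLITS_NUM = 8
SEGMENT_BITS = FINGERPRINT_BITS // SPLITS_NUM


def split_segments(value: int) -> list:
    """Split a 64-bit fingerprint into 8 segments of 8 bits each."""
    bits = format(value, "064b")
    return [bits[j * SEGMENT_BITS:(j + 1) * SEGMENT_BITS] for j in range(SPLITS_NUM)]


def shares_segment(a: int, b: int) -> bool:
    """True if a and b agree on at least one segment at the same position."""
    return any(x == y for x, y in zip(split_segments(a), split_segments(b)))


rng = random.Random(0)
base = rng.getrandbits(FINGERPRINT_BITS)

# Flip 7 distinct bits: at most 7 segments change, so one must survive intact.
near = base
for pos in rng.sample(range(FINGERPRINT_BITS), 7):
    near ^= 1 << pos

# Flip one bit in every segment: no segment survives, so a lookup would miss it.
far = base
for j in range(SPLITS_NUM):
    far ^= 1 << (j * SEGMENT_BITS)

print(shares_segment(base, near))  # True
print(shares_segment(base, far))   # False
```

The second case has Hamming distance exactly 8, which is why the guarantee only holds for distances below 8 when the fingerprint is split into 8 segments.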
References
- https://geek.digiasset.org/pages/affiliate/text-simhash-good-re-process-deep_21Apr03114628313403/
- https://zhuanlan.zhihu.com/p/71488127
- https://cloud.tencent.com/developer/article/1379302 (this article explains the topic well and is clear about how to apply it in practice)
- https://blog.csdn.net/ineedstudytosurvive/article/details/113986137 (good source code)