Bloom Filters with Python + Redis
A simple Python implementation
pip install mmh3
If the installation fails with a C++ compilation error, you can install Microsoft Visual C++ Build Tools.
Example reposted from https://www.cnblogs.com/naive/p/5815433.html
from bitarray import bitarray  # third-party
import mmh3  # third-party


class BloomFilter(set):

    def __init__(self, size, hash_count):
        super(BloomFilter, self).__init__()
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        # Set one bit per seeded hash of the item.
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item):
        # The item may be present only if every one of its bits is set.
        out = True
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                out = False
        return out


def main():
    bloom = BloomFilter(10000, 10)
    animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
               'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
               'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
    # First insertion of animals into the bloom filter
    for animal in animals:
        bloom.add(animal)

    # Membership existence for already inserted animals
    # There should not be any false negatives
    for animal in animals:
        if animal in bloom:
            print('{} is in bloom filter as expected'.format(animal))
        else:
            print('Something went terribly wrong for {}'.format(animal))
            print('FALSE NEGATIVE!')

    # Membership existence for not inserted animals
    # There could be false positives
    other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                     'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                     'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                     'hawk']
    for other_animal in other_animals:
        if other_animal in bloom:
            print('{} is not in the bloom, but a false positive'.format(other_animal))
        else:
            print('{} is not in the bloom filter as expected'.format(other_animal))


if __name__ == '__main__':
    main()
Output
dog is in bloom filter as expected
cat is in bloom filter as expected
giraffe is in bloom filter as expected
fly is in bloom filter as expected
mosquito is in bloom filter as expected
horse is in bloom filter as expected
eagle is in bloom filter as expected
bird is in bloom filter as expected
bison is in bloom filter as expected
boar is in bloom filter as expected
butterfly is in bloom filter as expected
ant is in bloom filter as expected
anaconda is in bloom filter as expected
bear is in bloom filter as expected
chicken is in bloom filter as expected
dolphin is in bloom filter as expected
donkey is in bloom filter as expected
crow is in bloom filter as expected
crocodile is in bloom filter as expected
badger is not in the bloom filter as expected
cow is not in the bloom filter as expected
pig is not in the bloom filter as expected
sheep is not in the bloom, but a false positive
bee is not in the bloom filter as expected
wolf is not in the bloom filter as expected
fox is not in the bloom filter as expected
whale is not in the bloom filter as expected
shark is not in the bloom, but a false positive
fish is not in the bloom, but a false positive
turkey is not in the bloom filter as expected
duck is not in the bloom filter as expected
dove is not in the bloom filter as expected
deer is not in the bloom filter as expected
elephant is not in the bloom, but a false positive
frog is not in the bloom filter as expected
falcon is not in the bloom filter as expected
goat is not in the bloom filter as expected
gorilla is not in the bloom filter as expected
hawk is not in the bloom filter as expected
The output shows several false positives (4 of the 20 animals that were never inserted), but no false negatives.
Unlike this implementation, many Bloom filter implementations in other languages do not take the number of hash functions as a parameter. In practice the false-positive rate matters more than the hash count, so the hash count is derived from the false-positive rate the user asks for. Typically, size and error_rate are the real parameters of a Bloom filter: if you lower error_rate at initialization, the implementation adjusts the number of hash functions accordingly.
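How the hash count follows from size and error_rate can be sketched with the standard sizing formulas, m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The function name bloom_parameters and its arguments are illustrative and do not come from any particular library:

```python
import math


def bloom_parameters(capacity, error_rate):
    """Return (m, k): number of bits and number of hash functions.

    capacity   -- expected number of inserted items (n)
    error_rate -- target false-positive rate (p), with 0 < p < 1
    """
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole bit
    m = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest whole hash function
    k = max(1, round(m / capacity * math.log(2)))
    return m, k


m, k = bloom_parameters(1000, 0.01)
print(m, k)  # 9586 7

# A lower error rate needs more bits and more hash functions.
m_strict, k_strict = bloom_parameters(1000, 0.001)
print(m_strict, k_strict)
```

For 1,000 items at a 1% error rate this comes to roughly 9.6 bits per item and 7 hash functions.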
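False positives
A Bloom filter can state with certainty that an element "definitely does not exist", but for some elements it can only say "may exist". Depending on the application, this is either a serious drawback or an irrelevant detail. If you can tolerate false positives when checking whether an element exists, a Bloom filter is worth considering.
In addition, lowering the false-positive rate requires more hash functions, which increases the latency of both insertions and lookups. The other key point of this section: if the hash functions are mutually independent and the input elements are uniformly distributed over the hash space, then in theory the actual false-positive rate will not exceed the theoretical value. Otherwise, correlated hash functions and more frequent hash collisions push the actual false-positive rate above the theoretical value.
When using a Bloom filter, consider the potential impact of false positives.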
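The gap between the measured and the theoretical false-positive rate can be checked empirically. The sketch below is an assumption-laden stand-in for the filter above: it uses seeded SHA-256 from the standard library instead of mmh3, stores one byte per bit for simplicity, and the names TinyBloom and theoretical_fp_rate are hypothetical. The theoretical rate is the usual approximation (1 - e^(-kn/m))^k.

```python
import hashlib
import math


class TinyBloom:
    """Minimal in-memory Bloom filter for experiments (not the article's class)."""

    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _indexes(self, item):
        # Derive hash_count positions by prefixing the item with a seed.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for index in self._indexes(item):
            self.bits[index] = 1

    def __contains__(self, item):
        return all(self.bits[index] for index in self._indexes(item))


def theoretical_fp_rate(n, m, k):
    """Approximate false-positive rate after n insertions into m bits with k hashes."""
    return (1 - math.exp(-k * n / m)) ** k


n, m, k = 1000, 10000, 7
bloom = TinyBloom(m, k)
for i in range(n):
    bloom.add(f"member-{i}")

# Inserted items must always be found: no false negatives.
false_negatives = sum(f"member-{i}" not in bloom for i in range(n))

# Items that were never inserted may still be reported as present.
probes = 20000
false_positives = sum(f"other-{i}" in bloom for i in range(probes))
measured = false_positives / probes
expected = theoretical_fp_rate(n, m, k)

print(f"false negatives: {false_negatives}")
print(f"measured {measured:.4f} vs theoretical {expected:.4f}")
```

With a well-mixed hash such as SHA-256 the two rates should land close to each other; a weaker or correlated set of hash functions would be expected to push the measured rate higher.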
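Determinism
When you use the same size and the same number of hash functions, whether a given element gets a positive or a negative answer from the Bloom filter is deterministic. If an element x "may exist" now, it will still "may exist" five minutes, an hour, a day, or even a week from now. I was a little surprised when I first learned this. A Bloom filter is probabilistic, so surely its results involve some randomness? They do not. The probabilistic part is that we cannot tell in advance which elements will be reported as "may exist".
In other words, once the filter has concluded that an element "may exist", that conclusion does not change.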
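A minimal sketch of this property, assuming seeded SHA-256 in place of mmh3 and hypothetical helper names (bit_positions, build_filter, might_contain): two filters built independently from the same items, even in a different insertion order, end up with identical bits and give identical answers.

```python
import hashlib


def bit_positions(item, size, hash_count):
    """Positions set for an item; depends only on item, size and hash_count."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big") % size
        for seed in range(hash_count)
    ]


def build_filter(items, size, hash_count):
    bits = [0] * size
    for item in items:
        for position in bit_positions(item, size, hash_count):
            bits[position] = 1
    return bits


def might_contain(bits, item, hash_count):
    return all(bits[p] for p in bit_positions(item, len(bits), hash_count))


animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito']
first = build_filter(animals, 10000, 10)
second = build_filter(list(reversed(animals)), 10000, 10)  # different insertion order

queries = ['dog', 'sheep', 'shark']
answers_first = [might_contain(first, q, 10) for q in queries]
answers_second = [might_contain(second, q, 10) for q in queries]

print(first == second)                  # True
print(answers_first == answers_second)  # True
```

One design note: Python's built-in hash() for strings is randomized per process by default, so it would break this property across runs; an explicitly seeded hash such as mmh3 avoids that.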
A Redis-based Bloom filter in Python: BloomFilter_imooc
Downloading BloomFilter_imooc
Download: https://github.com/liyaopinner/BloomFilter_imooc
Source code of py_bloomfilter.py (the Bloom filter):
import math
import time

import mmh3
import redis


class PyBloomFilter():
    # 100 built-in random seeds
    SEEDS = [543, 460, 171, 876, 796, 607, 650, 81, 837, 545, 591, 946, 846, 521, 913, 636, 878, 735, 414, 372,
             344, 324, 223, 180, 327, 891, 798, 933, 493, 293, 836, 10, 6, 544, 924, 849, 438, 41, 862, 648, 338,
             465, 562, 693, 979, 52, 763, 103, 387, 374, 349, 94, 384, 680, 574, 480, 307, 580, 71, 535, 300, 53,
             481, 519, 644, 219, 686, 236, 424, 326, 244, 212, 909, 202, 951, 56, 812, 901, 926, 250, 507, 739, 371,
             63, 584, 154, 7, 284, 617, 332, 472, 140, 605, 262, 355, 526, 647, 923, 199, 518]

    # capacity: estimated number of items to deduplicate
    # error_rate: acceptable false-positive rate
    # conn: Redis client connection
    # key: prefix of the key names used in Redis
    def __init__(self, capacity=1000000000, error_rate=0.00000001, conn=None, key='BloomFilter'):
        self.m = math.ceil(capacity*math.log2(math.e)*math.log2(1/error_rate))  # total number of bits needed
        self.k = math.ceil(math.log(2)*self.m/capacity)  # minimum number of hashes: k = ln(2) * m / n
        self.mem = math.ceil(self.m/8/1024/1024)  # memory needed, in MB
        # Number of 512 MB blocks needed. The block is chosen from the first
        # character of value, which must be ASCII, so there are at most 256 blocks.
        self.blocknum = math.ceil(self.mem/512)
        self.seeds = self.SEEDS[0:self.k]
        self.key = key
        self.N = 2**31-1
        self.redis = conn
        # print(self.mem)
        # print(self.k)

    def add(self, value):
        name = self.key + "_" + str(ord(value[0]) % self.blocknum)
        hashs = self.get_hashs(value)
        for hash in hashs:
            self.redis.setbit(name, hash, 1)

    def is_exist(self, value):
        name = self.key + "_" + str(ord(value[0]) % self.blocknum)
        hashs = self.get_hashs(value)
        exist = True
        for hash in hashs:
            exist = exist & self.redis.getbit(name, hash)
        return exist

    def get_hashs(self, value):
        # Map each signed 32-bit hash to a non-negative bit offset.
        hashs = list()
        for seed in self.seeds:
            hash = mmh3.hash(value, seed)
            if hash >= 0:
                hashs.append(hash)
            else:
                hashs.append(self.N - hash)
        return hashs


pool = redis.ConnectionPool(host='127.0.0.1', port=6379, db=0)
conn = redis.StrictRedis(connection_pool=pool)

# Usage
# if __name__ == "__main__":
#     bf = PyBloomFilter(conn=conn)  # connect to Redis through the connection pool
#     bf.add('www.jobbole.com')  # add a domain under the default Redis key
#     bf.add('www.luyin.org')  # add a domain under the default Redis key
#     print(bf.is_exist('www.zhihu.com'))  # prints 1 if the domain may exist, 0 if it does not
#     print(bf.is_exist('www.luyin.org'))  # prints 1 if the domain may exist, 0 if it does not
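The two mappings inside PyBloomFilter (signed hash to bit offset, and value to block name) can be examined without a Redis server. The sketch below copies that logic into standalone functions; the names to_offset and block_name are illustrative, and the block count of 9 is what the default capacity and error_rate appear to work out to.

```python
N = 2**31 - 1            # same constant as self.N in PyBloomFilter
REDIS_MAX_BITS = 2**32   # a Redis string is limited to 512 MB, i.e. 2**32 bits


def to_offset(signed_hash):
    """Map a signed 32-bit hash to a non-negative bit offset, as get_hashs does."""
    return signed_hash if signed_hash >= 0 else N - signed_hash


def block_name(key, value, blocknum):
    """Choose the Redis key for a value from its first character, as add/is_exist do."""
    return key + "_" + str(ord(value[0]) % blocknum)


# Non-negative hashes keep their value; negative ones land in the upper half.
print(to_offset(0))          # 0
print(to_offset(2**31 - 1))  # 2147483647
print(to_offset(-1))         # 2147483648
print(to_offset(-2**31))     # 4294967295

# Every offset fits inside a single 512 MB Redis bitmap.
assert all(0 <= to_offset(h) < REDIS_MAX_BITS for h in (-2**31, -1, 0, 2**31 - 1))

print(block_name("BloomFilter", "www.zhihu.com", 9))  # BloomFilter_2
print(block_name("BloomFilter", "www.luyin.org", 9))  # BloomFilter_2
```

Because the block depends only on the first character, values that share a prefix such as "www." all go to the same block. If that skew matters, hashing the whole value to pick the block would be one possible alternative.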
Reposted from: https://www.cnblogs.com/yhll/p/9842514.html