當前位置：首頁 > 运维知识 > 数据库 >内容正文

数据库

用Redis快速实现BloomFilter！

發布時間：2025/5/22 数据库 32 豆豆

生活随笔收集整理的這篇文章主要介紹了用Redis快速实现BloomFilter！小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

背景

最近工作上有個類似需求是：現有約3億條數據詞典存在于一個csv文件A中，作為數據源。對于用戶輸入的任意單詞M，需要快速的在A中匹配M單詞是否存在。（A文件約3G大小左右，總行數三億）

拿到這個需求，你的第一想法怎么做呢？

正常思路可能是：

將csv文件A導入某關系型數據庫。

sql查詢按M匹配。

上面的方式有個明顯的缺點是：慢！

3億多行的數據，即便是建好索引進行檢索，匹配到也得話不少時間（筆者沒親自試過，感興趣的朋友可以自行測試測試，理論上快不起來的）。

目前能在時間復雜度和空間復雜度上達到最佳的方案，恐怕就是Bloom Filter了，維基地址：Bloom Filter

此處給不太了解Bloom Filter的讀者看，熟悉的朋友直接看下一節。

本文場景Bloom Filter 使用思路解釋：
假設申請了一段bit位大數組（即數組中的元素只能是一個bit位，1或0，默認元素值都為0）
將csv文件A中的每個單詞，經過多個hash函數進行hash運算之后得到在大數組中對應的多個下標位置
將步驟2中得到的多個下標位置的bit位都置為1.
對于用戶輸入的任意單詞M，按照2的步驟得到多個下標位置，其對應大數組中的值全部為1則存在，否則不存在。

方案選型
實現Bloom Filter的方法很多，有各種語言版本的，這里為了真切感受一下算法的魅力，筆者這里決定用java 代碼徒手擼了！

另一方面，考慮到分布式應用的需要，顯然在單機內存上構建Bloom Filter存儲是不太合適的。這里選擇redis。

redis有以下為操作，可以用于實現bloomfilter：

redis> SETBIT bit 10086 1 (integer) 0 redis> GETBIT bit 10086 (integer) 1 redis> GETBIT bit 100 # bit 默認被初始化為 0 (integer) 0

實現細節

實現bloom filter的關鍵是hash函數，一般為了降低誤報率、減少hash碰撞的影響，會選擇多個hash函數。

那么，怎么寫一個hash函數呢？

不要方，我們要的hash是 input: String --> output: int , jdk里面的String類不是恰好也有一個hashCode 方法嗎？翻出來看一看！

/** * Returns a hash code for this string. The hash code for a * {@code String} object is computed as * <blockquote><pre> * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] * </pre></blockquote> * using {@code int} arithmetic, where {@code s[i]} is the * <i>i</i>th character of the string, {@code n} is the length of * the string, and {@code ^} indicates exponentiation. * (The hash value of the empty string is zero.) * * @return a hash code value for this object. */ public int hashCode() {int h = hash;if (h == 0 && value.length > 0) {char val[] = value;for (int i = 0; i < value.length; i++) {h = 31 * h + val[i];}hash = h;}return h; }

看到這一行h = 31 * h + val[i];，貌似原理其實也很簡單，每個字符對應的ascii碼，經過一個公式計算依次加起來。這里有個系數31, 稍微變一下，不就可以有多個hash函數了嗎。

以下是稍加修改后的hash函數：

//總的bitmap大小 64M private static final int cap = 1 << 29; /* * 不同哈希函數的種子，一般取質數 * seeds數組共有8個值，則代表采用8種不同的哈希函數 */ private int[] seeds = new int[]{3, 5, 7, 11, 13, 31, 37, 61}; private int hash(String value, int seed) {int result = 0;int length = value.length();for (int i = 0; i < length; i++) {result = seed * result + value.charAt(i);}return (cap - 1) & result; }

剩下的事情便很簡單了，對每個詞典A中的單詞，依次調seeds 中對應的hash函數（這里一共是8個），用redis的setbit操作，將下標值置為1.

redis代碼 (這里用pipeline 包裝了下。)

@Service public class RedisService {@Autowiredprivate StringRedisTemplate template;public void multiSetBit(String name, Boolean value, long... offsets) {template.executePipelined((RedisCallback<Object>) connection -> {for (long offset : offsets) {connection.setBit(name.getBytes(), offset, value);}return null;});}public List<Boolean> multiGetBit(String name, long... offsets) {List<Object> results = template.executePipelined((RedisCallback<Object>) connection -> {for (long offset : offsets) {connection.getBit(name.getBytes(), offset);}return null;});List<Boolean> list = new ArrayList<>();results.forEach(obj -> {list.add((Boolean) obj);});return list;} }

最后，代碼串起來大概長這個樣子：

FileInputStream inputStream = new FileInputStream("/XXXX.csv"); BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream)); HashSet<long> totalSet=new HashSet<>(); String word=null; while((word = bufferedReader.readLine()) != null){for (int seed : seeds) {int hash = hash(word, seed);totalSet.add((long) hash);}long[] offsets = new long[totalSet.size()];int i=0;for (long l:totalSet){offsets[i++]=l;}redisService.multiSetBit("BLOOM_FILTER_WORDS_DICTIONARY", true, offsets); }

查的時候也類似：

String word = "XXXX"; //實際輸入 long[] offsets = new long[seeds.length]; for (int i = 0; i < seeds.length; i++) {int hash = hash(mobile, seeds[i]);offsets[i] = hash; } List<Boolean> results = redisService.multiGetBit("BLOOM_FILTER_WORDS_DICTIONARY", offsets); //判斷是否都為true （則存在) Boolean isExisted=true; for (Boolean result:results){if(!result){isExisted=false;break;} }

注意事項

setbit的offset是用大小限制的，在0到 232（最大使用512M內存）之間，即0~4294967296之前，超過這個數會自動將offset轉化為0，因此使用的時候一定要注意。

寫在最后

轉載于:https://www.cnblogs.com/Java-no-1/p/11178761.html

《新程序員》：云原生和全面數字化實踐50位技術專家共同創作，文字、視頻、音頻交互閱讀

總結

以上是生活随笔為你收集整理的用Redis快速实现BloomFilter！的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：常考的js笔试语句
下一篇： linux cmake编译源码,linu