当前位置:网站首页>Are you sure about this filter?
Are you sure about this filter?
2022-06-21 06:21:00 【Java geek Technology】

01、 What is? BloomFilter( The bloon filter )
The bloon filter ( English :Bloom Filter) yes 1970 Proposed by bron in . It's actually a very long binary vector and a series of random mapping functions . It is mainly used to judge whether an element is in a set . Usually, we encounter a lot of business scenarios to determine whether an element is in a collection , At this time, we often adopt Hashmap,Set Or other collections to save data , Then compare and judge , But if there are many elements , If we use this method, it will be a waste of space . At this time we need BloomFilter To help us .
1.1、BloomFilter principle
BloomFilter Is a fixed size binary vector or bitmap (bitmap) And a series of ( Usually several ) The mapping function consists of . The principle of Bloom filter is , When a variable is added to a set , adopt K Mapping functions map this variable to... In the bitmap K A little bit , Set them as 1. When we query a variable, we just need to see if these points are all 1 We can find out if there is one in the set , If any of these points 0, Then the queried variable must not be in ; If it's all 1, The query variable is very Probably stay . Be careful , Here is the possibility that , It doesn't have to exist ! This is the basic idea of bloon filter .
As shown in the figure below , character string "ziyou" After four mapping function operations, four points on the bitmap are set to 1. When we need to judge “ziyou” Whether or not a string exists only needs to be mapped to the string once , Get four 1 Just explain “ziyou” It's possible .

Why is it possible , Not necessarily ? That's because the mapping function itself is a hash function , Hash functions can collide , This means that there will be a string that may be “ziyou01” The four points obtained by the same four mapping functions are followed by “ziyou” It's the same , In this case, we say that there is a miscalculation . In addition, it is also possible that 1 It is the result of four different variables , This does not prove that the string “ziyou” There must be , As shown in the following picture 1 It could be a string “ Zhang San ” To calculate the , The same goes for other positions 1 It can also be calculated from other strings .

1.2 characteristic
So we can make it clear through the above example
- If an element is judged to exist, the element does not necessarily exist , But when the result of judgment does not exist, it must not exist .
- The bloon filter can add elements , But you can't delete elements . Because deleting elements will increase the rate of misjudgment .
02、 Use scenarios
2.1、 Webpage URL duplicate removal
When we use web crawlers ( Reptiles need to be careful ), It is often necessary to record what URL It has been crawled , Which ones haven't climbed yet , At this time, we can adopt BloomFilter To those who have crawled URL For storage , In this way, this can be determined during the next crawl URL Have you ever climbed .
2.2、 Black and white list storage
There is often a feature that has different processing methods for different devices or users , At this time, there may be a white list or a blacklist , So according to BloomFilter Characteristics of the filter , We can also use it to store these data , Although there is a certain miscalculation rate , But to a certain extent, it can solve this problem well .
2.3、 Summary
In addition to the two scenarios mentioned above , In fact, there are many scenes , For example, hot data access , Spam filtering and so on , In fact, the unified feature of these scenarios is to judge whether an element is in a set , The principle is the same .
03、 Code practice
3.1、 Realize it by yourself
package com.test.pkg;
import java.util.BitSet;
/**
* <br>
* <b>Function:</b><br>
* <b>Author:</b>@author ziyou<br>
* <b>Date:</b>2019-10-23 23:21<br>
* <b>Desc:</b> nothing <br>
*/
public class BloomFilterTest {
/**
* Initialize the bloom filter bitmap size
*/
private static final int DEFAULT_SIZE = 2 << 24;
/**
* In order to reduce the error rate , Here we choose some numbers as the reference numbers
*/
private static final int[] seeds = {3, 5, 7, 11, 13, 31, 37, 61};
/**
* Set up bitmap
*/
private static BitSet bitset = new BitSet(DEFAULT_SIZE);
/**
* Set up hash Number of functions
*/
private static HashFunction[] functions = new HashFunction[seeds.length];
/**
* Add data
*
* @param value The value added by the requirement
*/
public static void put(String value) {
if (value != null) {
for (HashFunction f : functions) {
// Calculation hash Value and modify bitmap The corresponding position in is true
bitset.set(f.hash(value), true);
}
}
}
/**
* Determine whether the corresponding element exists
*
* @param value The elements that need to be judged
* @return result
*/
public static boolean check(String value) {
if (value == null) {
return false;
}
boolean ret = true;
for (HashFunction f : functions) {
ret = bitset.get(f.hash(value));
// One hash The function returns false And then jump out of the loop
if (!ret) {
break;
}
}
return ret;
}
public static void main(String[] args) {
String value = "test";
for (int i = 0; i < seeds.length; i++) {
functions[i] = new HashFunction(DEFAULT_SIZE, seeds[i]);
}
put(value);
System.out.println(check("value"));
}
}
class HashFunction {
private int size;
private int seed;
public HashFunction(int size, int seed) {
this.size = size;
this.seed = seed;
}
public int hash(String value) {
int result = 0;
int len = value.length();
for (int i = 0; i < len; i++) {
result = seed * result + value.charAt(i);
}
int r = (size - 1) & result;
return (size - 1) & result;
}
}
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.
- 96.
- 97.
- 98.
- 99.
- 100.
- 101.
- 102.
- 103.
- 104.
- 105.
- 106.
- 107.
- 108.
- 109.
We wrote a simple BloomFilter , adopt put Method input data , adopt check Method to determine whether the element exists , Basic function , The comments in the code are also very clear , But their own realization must be inefficient , So let's take a look at what the industry leaders have done for us BloomFilter.
2.4、Guava Medium BloomFilter
package com.test.pkg;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
/**
* <br>
* <b>Function:</b><br>
* <b>Author:</b>@author ziyou<br>
* <b>Date:</b>2019-10-24 00:17<br>
* <b>Desc:</b> nothing <br>
*/
public class BloomFilterTest02 {
public static void main(String[] args) {
BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), 100000, 0.01);
for (int i = 0; i < 100000; i++) {
bloomFilter.put(i);
}
System.out.println(bloomFilter.mightContain(1));
System.out.println(bloomFilter.mightContain(2));
System.out.println(bloomFilter.mightContain(3));
System.out.println(bloomFilter.mightContain(100001));
}
}
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
Guava Has helped us achieve BloomFilter Code for , We just need to call it where we use it .
Here we briefly explain the last two parameters in the constructor , One is the amount of data expected to be included , One is the allowable error value . The code will be based on the two values we fill in , Automatically help us calculate the size of the array , And the number of hash functions required , Here's the picture . More details , Readers can check the source code by themselves , We won't introduce it here .


04、 summary
This article article I'd like to introduce it to you BloomFilter, An efficient way to determine whether an element exists in a set , It can be used in our daily work , Combined with daily work scenarios , You can choose .
边栏推荐
- Leetcode 75 - 颜色分类 [Medium] 三种实现方法
- tf. contrib. slim. conv2d()
- tf. compat. v1.pad
- IP - 射频数据转换器 -04- API使用指南 - 系统设置相关函数
- 牛客-TOP101-BM26
- FPGA - 7系列 FPGA SelectIO -03- 逻辑资源之ILOGIC
- Memorizing Normality to Detect Anomaly: Memory-augmented Deep Autoencoder for Unsupervised Anomaly D
- 小程序【第一期】
- That's great. MySQL's summary is too comprehensive
- 内卷大厂系列《LRU 缓存淘汰算法》
猜你喜欢

FPGA - 7系列 FPGA SelectIO -06- 逻辑资源之ODELAY

【数据挖掘】期末复习 第二章

MSF intranet penetration

Deeply understand the gradient disappearance of RNN and why LSTM can solve the gradient disappearance
![[data mining] final review Chapter 3](/img/9a/68c3ce12b0f55e9b26b6f054e2ffe7.png)
[data mining] final review Chapter 3

Digital signal processing-07-dds IP application example

【数据挖掘】期末复习 第四章

用递归和循环两种方法解决华为4月20日机试第一题(100分)

笔记 How Powerful are Spectral Graph Neural Networks

数字信号处理-07-DDS IP应用实例
随机推荐
如何限制内网网速
构建和保护小型网络考试
Unity隐藏目录和隐藏文件
当今的数学是否过于繁琐?
[MySQL] SQL statement execution process of MySQL
用递归和循环两种方法解决华为4月20日机试第一题(100分)
User defined thread pool
520泡泡的源码
pyshark使用教程
Idea usage record
Niuke-top101-bm25
TypeError: iter() returned non-iterator of type ‘xxx‘
Pyshark tutorial
leetcode 410. Maximum value of split array - (Day30)
IP - RF data converter -04- API User Guide - ADC status indication function
DDD 实践手册(4. Aggregate — 聚合)
Chapter 2: Data Model (final review of database)
Pycharm设置项目的默认解释器
leetcode 410. 分割数组的最大值——(每日一难day30)
lambda-stream