当前位置:网站首页>Are you sure about this filter?

Are you sure about this filter?

2022-06-21 06:21:00 Java geek Technology


 Are you sure this filter will ?_ The bloon filter

01、 What is? BloomFilter( The bloon filter )

The bloon filter ​( English :Bloom Filter) yes 1970 Proposed by bron in . It's actually a very long binary vector and a series of random mapping functions . It is mainly used to judge whether an element is in a set . Usually, we encounter a lot of business scenarios to determine whether an element is in a collection , At this time, we often adopt Hashmap,Set Or other collections to save data , Then compare and judge , But if there are many elements , If we use this method, it will be a waste of space . At this time we need BloomFilter To help us .

1.1、BloomFilter principle

BloomFilter Is a fixed size binary vector or bitmap (bitmap) And a series of ( Usually several ) The mapping function consists of . The principle of Bloom filter is , When a variable is added to a set , adopt K Mapping functions map this variable to... In the bitmap K A little bit , Set them as 1. When we query a variable, we just need to see if these points are all 1 We can find out if there is one in the set , If any of these points 0, Then the queried variable must not be in ; If it's all 1, The query variable is very ​ Probably ​ stay . Be careful , Here is the possibility that , It doesn't have to exist ! This is the basic idea of bloon filter .

As shown in the figure below , character string "ziyou" After four mapping function operations, four points on the bitmap are set to 1. When we need to judge “ziyou” Whether or not a string exists only needs to be mapped to the string once , Get four 1 Just explain “ziyou” It's possible .

 Are you sure this filter will ?_ The bloon filter _02


Why is it possible , Not necessarily ? That's because the mapping function itself is a hash function , Hash functions can collide , This means that there will be a string that may be “ziyou01” The four points obtained by the same four mapping functions are followed by “ziyou” It's the same , In this case, we say that there is a miscalculation . In addition, it is also possible that 1 It is the result of four different variables , This does not prove that the string “ziyou” There must be , As shown in the following picture 1 It could be a string “ Zhang San ” To calculate the , The same goes for other positions 1 It can also be calculated from other strings .


 Are you sure this filter will ?_ Mapping function _03


1.2 characteristic

So we can make it clear through the above example


  • If an element is judged to exist, the element does not necessarily exist , But when the result of judgment does not exist, it must not exist .
  • The bloon filter can add elements , But you can't delete elements ​. Because deleting elements will increase the rate of misjudgment .

02、 Use scenarios

2.1、 Webpage URL duplicate removal

When we use web crawlers ( Reptiles need to be careful ), It is often necessary to record what URL It has been crawled , Which ones haven't climbed yet , At this time, we can adopt BloomFilter To those who have crawled URL For storage , In this way, this can be determined during the next crawl URL Have you ever climbed .

2.2、 Black and white list storage

There is often a feature that has different processing methods for different devices or users , At this time, there may be a white list or a blacklist , So according to BloomFilter Characteristics of the filter , We can also use it to store these data , Although there is a certain miscalculation rate , But to a certain extent, it can solve this problem well .

2.3、 Summary

In addition to the two scenarios mentioned above , In fact, there are many scenes , For example, hot data access , Spam filtering and so on , In fact, the unified feature of these scenarios is to judge whether an element is in a set , The principle is the same .

03、 Code practice

3.1、 Realize it by yourself

       
package com.test.pkg;


import java.util.BitSet;


/**
* <br>
* <b>Function:</b><br>
* <b>Author:</b>@author ziyou<br>
* <b>Date:</b>2019-10-23 23:21<br>
* <b>Desc:</b> nothing <br>
*/
public class BloomFilterTest {


/**
* Initialize the bloom filter bitmap size
*/
private static final int DEFAULT_SIZE = 2 << 24;
/**
* In order to reduce the error rate , Here we choose some numbers as the reference numbers
*/
private static final int[] seeds = {3, 5, 7, 11, 13, 31, 37, 61};
/**
* Set up bitmap
*/
private static BitSet bitset = new BitSet(DEFAULT_SIZE);
/**
* Set up hash Number of functions
*/
private static HashFunction[] functions = new HashFunction[seeds.length];




/**
* Add data
*
* @param value The value added by the requirement
*/
public static void put(String value) {
if (value != null) {
for (HashFunction f : functions) {
// Calculation hash Value and modify bitmap The corresponding position in is true
bitset.set(f.hash(value), true);
}
}
}


/**
* Determine whether the corresponding element exists
*
* @param value The elements that need to be judged
* @return result
*/
public static boolean check(String value) {
if (value == null) {
return false;
}
boolean ret = true;
for (HashFunction f : functions) {
ret = bitset.get(f.hash(value));
// One hash The function returns false And then jump out of the loop
if (!ret) {
break;
}
}
return ret;
}


public static void main(String[] args) {
String value = "test";
for (int i = 0; i < seeds.length; i++) {
functions[i] = new HashFunction(DEFAULT_SIZE, seeds[i]);
}
put(value);
System.out.println(check("value"));
}
}


class HashFunction {


private int size;
private int seed;


public HashFunction(int size, int seed) {
this.size = size;
this.seed = seed;
}


public int hash(String value) {
int result = 0;
int len = value.length();
for (int i = 0; i < len; i++) {
result = seed * result + value.charAt(i);
}
int r = (size - 1) & result;
return (size - 1) & result;
}


}
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79.
  • 80.
  • 81.
  • 82.
  • 83.
  • 84.
  • 85.
  • 86.
  • 87.
  • 88.
  • 89.
  • 90.
  • 91.
  • 92.
  • 93.
  • 94.
  • 95.
  • 96.
  • 97.
  • 98.
  • 99.
  • 100.
  • 101.
  • 102.
  • 103.
  • 104.
  • 105.
  • 106.
  • 107.
  • 108.
  • 109.

We wrote a simple BloomFilter , adopt put Method input data , adopt check Method to determine whether the element exists , Basic function , The comments in the code are also very clear , But their own realization must be inefficient , So let's take a look at what the industry leaders have done for us BloomFilter.

2.4、Guava Medium BloomFilter

       
package com.test.pkg;


import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;


/**
* <br>
* <b>Function:</b><br>
* <b>Author:</b>@author ziyou<br>
* <b>Date:</b>2019-10-24 00:17<br>
* <b>Desc:</b> nothing <br>
*/
public class BloomFilterTest02 {


public static void main(String[] args) {
BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), 100000, 0.01);
for (int i = 0; i < 100000; i++) {
bloomFilter.put(i);
}
System.out.println(bloomFilter.mightContain(1));
System.out.println(bloomFilter.mightContain(2));
System.out.println(bloomFilter.mightContain(3));
System.out.println(bloomFilter.mightContain(100001));
}
}
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.

Guava Has helped us achieve BloomFilter Code for , We just need to call it where we use it .

Here we briefly explain the last two parameters in the constructor , One is the amount of data expected to be included , One is the allowable error value . The code will be based on the two values we fill in , Automatically help us calculate the size of the array , And the number of hash functions required , Here's the picture . More details , Readers can check the source code by themselves , We won't introduce it here .

 Are you sure this filter will ?_ character string _04


 Are you sure this filter will ?_ Mapping function _05


04、 summary

This article article I'd like to introduce it to you BloomFilter, An efficient way to determine whether an element exists in a set , It can be used in our daily work , Combined with daily work scenarios , You can choose .

原网站

版权声明
本文为[Java geek Technology]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/02/202202221548226576.html