当前位置：网站首页>Are you sure about this filter?

Are you sure about this filter?

2022-06-21 06:21:00 【Java geek Technology】

Are you sure this filter will ？_ The bloon filter

01、 What is? BloomFilter（ The bloon filter ）

The bloon filter （ English ：Bloom Filter） yes 1970 Proposed by bron in . It's actually a very long binary vector and a series of random mapping functions . It is mainly used to judge whether an element is in a set . Usually, we encounter a lot of business scenarios to determine whether an element is in a collection , At this time, we often adopt Hashmap,Set Or other collections to save data , Then compare and judge , But if there are many elements , If we use this method, it will be a waste of space . At this time we need BloomFilter To help us .

1.1、BloomFilter principle

BloomFilter Is a fixed size binary vector or bitmap （bitmap） And a series of （ Usually several ） The mapping function consists of . The principle of Bloom filter is , When a variable is added to a set , adopt K Mapping functions map this variable to... In the bitmap K A little bit , Set them as 1. When we query a variable, we just need to see if these points are all 1 We can find out if there is one in the set , If any of these points 0, Then the queried variable must not be in ; If it's all 1, The query variable is very Probably stay . Be careful , Here is the possibility that , It doesn't have to exist ！ This is the basic idea of bloon filter .

As shown in the figure below , character string "ziyou" After four mapping function operations, four points on the bitmap are set to 1. When we need to judge “ziyou” Whether or not a string exists only needs to be mapped to the string once , Get four 1 Just explain “ziyou” It's possible .

Are you sure this filter will ？_ The bloon filter _02

Why is it possible , Not necessarily ？ That's because the mapping function itself is a hash function , Hash functions can collide , This means that there will be a string that may be “ziyou01” The four points obtained by the same four mapping functions are followed by “ziyou” It's the same , In this case, we say that there is a miscalculation . In addition, it is also possible that 1 It is the result of four different variables , This does not prove that the string “ziyou” There must be , As shown in the following picture 1 It could be a string “ Zhang San ” To calculate the , The same goes for other positions 1 It can also be calculated from other strings .

Are you sure this filter will ？_ Mapping function _03

1.2 characteristic

So we can make it clear through the above example

If an element is judged to exist, the element does not necessarily exist , But when the result of judgment does not exist, it must not exist .
The bloon filter can add elements , But you can't delete elements . Because deleting elements will increase the rate of misjudgment .

02、 Use scenarios

2.1、 Webpage URL duplicate removal

When we use web crawlers （ Reptiles need to be careful ）, It is often necessary to record what URL It has been crawled , Which ones haven't climbed yet , At this time, we can adopt BloomFilter To those who have crawled URL For storage , In this way, this can be determined during the next crawl URL Have you ever climbed .

2.2、 Black and white list storage

There is often a feature that has different processing methods for different devices or users , At this time, there may be a white list or a blacklist , So according to BloomFilter Characteristics of the filter , We can also use it to store these data , Although there is a certain miscalculation rate , But to a certain extent, it can solve this problem well .

2.3、 Summary

In addition to the two scenarios mentioned above , In fact, there are many scenes , For example, hot data access , Spam filtering and so on , In fact, the unified feature of these scenarios is to judge whether an element is in a set , The principle is the same .

03、 Code practice

3.1、 Realize it by yourself

       
        package com.test.pkg;
        

        

        
import java.util.BitSet;
        

        

        
/**
        
 * <br>
        
 * <b>Function：</b><br>
        
 * <b>Author：</b>@author ziyou<br>
        
 * <b>Date：</b>2019-10-23 23:21<br>
        
 * <b>Desc：</b> nothing <br>
        
 */
        
public class BloomFilterTest {
        

        

        
    /**
        
     *  Initialize the bloom filter  bitmap  size 
        
     */
        
    private static final int DEFAULT_SIZE = 2 << 24;
        
    /**
        
     *  In order to reduce the error rate , Here we choose some numbers as the reference numbers 
        
     */
        
    private static final int[] seeds = {3, 5, 7, 11, 13, 31, 37, 61};
        
    /**
        
     *  Set up  bitmap
        
     */
        
    private static BitSet bitset = new BitSet(DEFAULT_SIZE);
        
    /**
        
     *  Set up  hash  Number of functions 
        
     */
        
    private static HashFunction[] functions = new HashFunction[seeds.length];
        

        

        

        

        
    /**
        
     *  Add data 
        
     *
        
     * @param value  The value added by the requirement 
        
     */
        
    public static void put(String value) {
        
        if (value != null) {
        
            for (HashFunction f : functions) {
        
                // Calculation  hash  Value and modify  bitmap  The corresponding position in is  true
        
                bitset.set(f.hash(value), true);
        
            }
        
        }
        
    }
        

        

        
    /**
        
     *  Determine whether the corresponding element exists 
        
     *
        
     * @param value  The elements that need to be judged 
        
     * @return  result 
        
     */
        
    public static boolean check(String value) {
        
        if (value == null) {
        
            return false;
        
        }
        
        boolean ret = true;
        
        for (HashFunction f : functions) {
        
            ret = bitset.get(f.hash(value));
        
            // One  hash  The function returns  false  And then jump out of the loop 
        
            if (!ret) {
        
                break;
        
            }
        
        }
        
        return ret;
        
    }
        

        

        
    public static void main(String[] args) {
        
        String value = "test";
        
        for (int i = 0; i < seeds.length; i++) {
        
            functions[i] = new HashFunction(DEFAULT_SIZE, seeds[i]);
        
        }
        
        put(value);
        
        System.out.println(check("value"));
        
    }
        
}
        

        

        
class HashFunction {
        

        

        
    private int size;
        
    private int seed;
        

        

        
    public HashFunction(int size, int seed) {
        
        this.size = size;
        
        this.seed = seed;
        
    }
        

        

        
    public int hash(String value) {
        
        int result = 0;
        
        int len = value.length();
        
        for (int i = 0; i < len; i++) {
        
            result = seed * result + value.charAt(i);
        
        }
        
        int r = (size - 1) & result;
        
        return (size - 1) & result;
        
    }
        

        

        
}
       
       
        1.
        2.
        3.
        4.
        5.
        6.
        7.
        8.
        9.
        10.
        11.
        12.
        13.
        14.
        15.
        16.
        17.
        18.
        19.
        20.
        21.
        22.
        23.
        24.
        25.
        26.
        27.
        28.
        29.
        30.
        31.
        32.
        33.
        34.
        35.
        36.
        37.
        38.
        39.
        40.
        41.
        42.
        43.
        44.
        45.
        46.
        47.
        48.
        49.
        50.
        51.
        52.
        53.
        54.
        55.
        56.
        57.
        58.
        59.
        60.
        61.
        62.
        63.
        64.
        65.
        66.
        67.
        68.
        69.
        70.
        71.
        72.
        73.
        74.
        75.
        76.
        77.
        78.
        79.
        80.
        81.
        82.
        83.
        84.
        85.
        86.
        87.
        88.
        89.
        90.
        91.
        92.
        93.
        94.
        95.
        96.
        97.
        98.
        99.
        100.
        101.
        102.
        103.
        104.
        105.
        106.
        107.
        108.
        109.

We wrote a simple BloomFilter , adopt put Method input data , adopt check Method to determine whether the element exists , Basic function , The comments in the code are also very clear , But their own realization must be inefficient , So let's take a look at what the industry leaders have done for us BloomFilter.

2.4、Guava Medium BloomFilter

       
        package com.test.pkg;
        

        

        
import com.google.common.hash.BloomFilter;
        
import com.google.common.hash.Funnels;
        

        

        
/**
        
 * <br>
        
 * <b>Function：</b><br>
        
 * <b>Author：</b>@author ziyou<br>
        
 * <b>Date：</b>2019-10-24 00:17<br>
        
 * <b>Desc：</b> nothing <br>
        
 */
        
public class BloomFilterTest02 {
        

        

        
    public static void main(String[] args) {
        
        BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), 100000, 0.01);
        
        for (int i = 0; i < 100000; i++) {
        
            bloomFilter.put(i);
        
        }
        
        System.out.println(bloomFilter.mightContain(1));
        
        System.out.println(bloomFilter.mightContain(2));
        
        System.out.println(bloomFilter.mightContain(3));
        
        System.out.println(bloomFilter.mightContain(100001));
        
    }
        
}
       
       
        1.
        2.
        3.
        4.
        5.
        6.
        7.
        8.
        9.
        10.
        11.
        12.
        13.
        14.
        15.
        16.
        17.
        18.
        19.
        20.
        21.
        22.
        23.
        24.
        25.
        26.
        27.
        28.

Guava Has helped us achieve BloomFilter Code for , We just need to call it where we use it .

Here we briefly explain the last two parameters in the constructor , One is the amount of data expected to be included , One is the allowable error value . The code will be based on the two values we fill in , Automatically help us calculate the size of the array , And the number of hash functions required , Here's the picture . More details , Readers can check the source code by themselves , We won't introduce it here .

Are you sure this filter will ？_ character string _04

Are you sure this filter will ？_ Mapping function _05

04、 summary

This article article I'd like to introduce it to you BloomFilter, An efficient way to determine whether an element exists in a set , It can be used in our daily work , Combined with daily work scenarios , You can choose .

原网站

版权声明
本文为[Java geek Technology]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/02/202202221548226576.html