当前位置:网站首页>Explain in simple terms the bloom filter
Explain in simple terms the bloom filter
2022-06-22 19:36:00 【JackieZhengChina】
I've seen such a passage on the Internet before
Data structures are nothing different. They are like the bookshelves of your application where you can organize your data. Different data structures will give you different facility and benefits. To properly use the power and accessibility of the data structures you need to know the trade-offs of using one.
The main idea of this passage is : The data structure is no different , In applications, they are like bookshelves that can organize data , Different data structures will provide you with different convenience and benefits , You need to weigh your needs carefully and use them properly . The bloon filter Is the representative of practicing this sentence , So today, let's talk about bloom filter in simple terms (Bloom Filter).
The bloon filter (Bloom Filter)
As usual , Before learning a new knowledge , We need to understand his basic concepts
The bloon filter (Bloom Filter) yes 1970 Proposed by bron in . It's actually a very long binary vector and a series of random mapping functions . The bloom filter can be used to retrieve whether an element is in a collection . Its advantage is that the spatial efficiency and query time are much better than the general algorithm , The disadvantage is that it has certain error recognition rate and deletion difficulty .
We can see this sentence in the basic concept of Bloom filter , The bloom filter can be used to retrieve whether an element is in a collection . Some partners may ask : We just use HashMap Don't you just search for elements , Why use a bloom filter ?HashMap It can really help us realize this function , and HashMap The position of an element can be determined by one operation , It can also help us check whether an element is in a collection . Then let's think about another question : If there is... In this element set One billion A random number , How do we judge whether a certain number exists ?
If the element set contains a billion random numbers , Then first of all, it will bring us problems in space , Occupy by a random number 4 If the sub section is calculated , This billion random numbers will take up nearly 4G Storage space , Space consumption is very large , At this time, you need to use bloom filter (Bloom Filter) To help us solve this problem .
The basic idea of Bloom filter
As we mentioned above, the element set contains a billion random numbers , If you store these billions of elements directly, you need to take up a lot of space , So we definitely can't store elements directly . So how do we store it ? The storage method is also very ingenious , You can use digit groups , Do not store elements directly , It's the state of whether the storage element exists , This can save a lot of storage space ~( I don't know how the great gods came up with such a clever way )

Let's take a look at how the bloom filter determines whether an element is in a set
① Now there is a set and an initial state of 0 The number group of

② The elements in the collection pass through N A hash function calculates the position of the element in the array , And the corresponding position in the array 0 Change to 1

③ If you need to judge the element at this time X Whether there is , So the element X Will also pass through this N The operation of a hash function to obtain several positions in the array , If the values in several positions are 1, Then it proves that the element X It is likely to exist in the set , On the contrary, it proves that the element X It must not exist in the set . As shown in the figure below , This element X after N The values stored at the calculated positions of the hash elements are not all 1, Then prove that the element X It doesn't exist in the set .

The characteristics of the bloom filter
Above, we talked about the idea of Bloom filter , In the ③ There is such a sentence in the point : If the values in several positions are 1, Then it proves that the element X Probably Exist in and set . Why is it all 1 The situation is likely to exist , Not necessarily ? This is related to the characteristics of hash function ...
We all know that hash function is a function that can convert input data of any size into output data of a specific size ( The converted data is called Hashi value ), Hash functions also have the following two features :
- If the hash value obtained by the same hash function is different , Then the original input values of the two hash values must be different .
- If two hash values obtained from the same hash function are equal , The original input values of two hash values may be equal , It may not be equal .
This is similar to Java Of two objects in HashCode equal , But the object itself is not necessarily equal . To put it bluntly , The values of the projection points of the bit group calculated by the hash function are all 1, It doesn't have to be set when the variable to be queried was saved in before , It may also be the point mapped by other elements . This leads to a characteristic of Bloom filter : There is a certain miscalculation .
In the League of Heroes , You can fully trust bloom , But you have to be careful when writing code
So can we delete the elements in the digit group ? Obviously not , Because the same point on the bit array may have multiple input value mappings , If deleted, it will affect the judgment results of other elements in the bloom filter . This is another feature of Bloom filter : You cannot delete elements in the bloom filter .
So we can sum up the advantages and disadvantages of Bloom filter
advantage : In space and time , All have great advantages . Because it doesn't store complete data , Is a binary vector , It can save a lot of memory space , In terms of time complexity , Because the query is calculated according to the hash function , So suppose there is N A hash function , So time complexity is O(N); At the same time, when storing an element, it is not the element itself , It's a binary vector , Therefore, it has certain advantages in some scenes with strict confidentiality requirements .
shortcoming : There is a certain miscalculation ( The more elements stored in the bloom filter , The higher the miscalculation rate ); You cannot delete elements in the bloom filter , As the time goes by , Because... Cannot be deleted , More and more elements are stored in it , As a result, more and more memory is occupied , The misjudgment rate is getting higher and higher , Finally, I had to reset .
Application of bloon filter
We've all heard of “ Cache penetration ”, How should we solve cache penetration ? you 're right , It is through the bloom filter to solve this problem .
The problem of cache penetration is mainly due to the incoming Key Values in Redis There is no such thing as , Then it will be directly typed on the database , So as to increase the pressure on the database . In this case , Can be in Redis Add a bloom filter before , Add the data of... To the database in advance , In the query Redis First judge through the bloom filter Key Whether the value exists , If it doesn't exist, go straight back , If Key If the value exists , Then continue to execute according to the original process .
The solution to cache penetration is If the judgment result of Bloom filter is that it does not exist, it must not exist This feature , However, due to some misjudgment of Bloom filter , Therefore, cache penetration cannot be completely solved , But it can greatly alleviate the problem of cache penetration .
Analog implementation of Bloom filter
Finally, paste a piece of code , To simulate the implementation of Bloom filter
import java.util.BitSet;
/**
* @description: MyBloomFilter
* @author: Zhuang Ba .liziye
* @create: 2022-04-01 16:50
**/
public class MyBloomFilter {
/**
* A length of 10 Billion bits
*/
private static final int DEFAULT_SIZE = 256 << 22;
/**
* In order to reduce the error rate , Use addition hash Algorithm , So define a 8 A prime array of elements
*/
private static final int[] seeds = {3, 5, 7, 11, 13, 31, 37, 61};
/**
* It's equivalent to building 8 Different hash Algorithm
*/
private static HashFunction[] functions = new HashFunction[seeds.length];
/**
* Initialize the bloom filter bitmap, Positional array
*/
private static BitSet bitset = new BitSet(DEFAULT_SIZE);
/**
* Add data
*
* @param value Value to add
*/
public static void add(String value) {
if (value != null) {
for (HashFunction f : functions) {
// Calculation hash Value and modify bitmap The corresponding position in is true
bitset.set(f.hash(value), true);
}
}
}
/**
* Determine whether the corresponding element exists
* @param value The elements that need to be judged
* @return result
*/
public static boolean contains(String value) {
if (value == null) {
return false;
}
boolean ret = true;
for (HashFunction f : functions) {
ret = bitset.get(f.hash(value));
// One hash The function returns false And then jump out of the loop
if (!ret) {
break;
}
}
return ret;
}
/**
* test
*/
public static void main(String[] args) {
for (int i = 0; i < seeds.length; i++) {
functions[i] = new HashFunction(DEFAULT_SIZE, seeds[i]);
}
// add to 1 100 million elements
for (int i = 0; i < 100000000; i++) {
add(String.valueOf(i));
}
String id = "123456789";
add(id);
System.out.println(" Elements 123456789 Whether there is :" + contains(id));
System.out.println(" Elements 234567890 Whether there is :" + contains("234567890"));
}
}
class HashFunction {
private int size;
private int seed;
public HashFunction(int size, int seed) {
this.size = size;
this.seed = seed;
}
public int hash(String value) {
int result = 0;
int len = value.length();
for (int i = 0; i < len; i++) {
result = seed * result + value.charAt(i);
}
int r = (size - 1) & result;
return (size - 1) & result;
}
}
Copy code

author : He refused to cross the river
link :https://juejin.cn/post/7083294020323000356
source : Rare earth digs gold
The copyright belongs to the author . Commercial reprint please contact the author for authorization , Non-commercial reprint please indicate the source .
---------------------
author : Chestnut Shao
source :CSDN
original text :https://blog.csdn.net/weixin_38556197/article/details/124111236
Copyright notice : This is the author's original article , Please attach a link to the blog post !
Content analysis By:CSDN,CNBLOG One click reprint plugin for blog posts
边栏推荐
- 消息中间件(一)MQ详解及四大MQ比较
- Flush difficult to open an account? Is it safe to open an account online?
- ABAQUS 使用RSG绘制插件初体验
- jniLibs. Srcdirs = ['LIBS'] what's the use?
- Decorator mode of structural mode
- Ts as const
- k8s部署mysql
- C #, introductory tutorial -- a little knowledge about function parameter ref and source program
- Message Oriented Middleware (I) MQ explanation and comparison of four MQS
- Method of activity jump to fragment (intent)
猜你喜欢
随机推荐
wpa_ CLI parameter description
函数的导数与微分的关系
贪心之分配问题(1)
K个一组翻转链表[链表拆解/翻转/拼装]
AUTOCAD——五种标注快捷键
NLP-D57-nlp比赛D26&刷题D13&读论文&&找了一个多小时bug
mysql数据库设计
【干货|接口测试必备技能-常见接口协议解析】
antd tree 树状选择器 子类必填
Shell script explanation (IV) -- while loop and until loop of loop statements (additional examples and analysis)
Digital business cloud: build a digital supply chain system to enable enterprises to optimize and upgrade their logistics supply chain
[nfs无法挂载问题] mount.nfs: access denied by server while mounting localhost:/data/dev/mysql
Screw数据库文档生成器
Activereports report practical application tutorial (19) -- multi data source binding
集群、分布式、微服务概念和区别
常用技术注解
Wavelet transform DB4 for four layer decomposition and signal reconstruction matlab analysis and C language implementation
Is flush easy to open an account? Is it safe to open a mobile account?
Experiment 7 trigger
Thread pool: reading the source code of threadpoolexcutor







