当前位置:网站首页>How much do you know about the bloom filter and cuckoo filter in redis?
How much do you know about the bloom filter and cuckoo filter in redis?
2022-06-22 19:00:00 【Java confidant_】
Click on the official account , Practical technical articles Know in time 
Everybody knows , In the computer ,IO It has always been a bottleneck , Many frameworks, technologies and even hardware are designed to reduce IO Born from operation , Let's talk about filters today , Let's start with a scene :
Our business backend involves databases , When a request message queries for some information , You may first check whether there is relevant information in the cache , If so, return to , If not, you may have to query in the database , There is a problem at this time , If many requests are for data that does not exist in the database , Then the database must respond frequently to this unnecessary IO Inquire about , If more , Most databases IO Are responding to such meaningless requests , So how to keep these requests out ? The filter was born :
The bloon filter
The bloon filter (Bloom Filter) The general idea is , When the information you requested comes , First check the data you queried. Do I have , If so, push the request to the database , If not, go back to , how ?

Pictured , One bitmap Used to record ,bitmap The original values are all 0, When a data is saved , With three Hash The function is calculated three times Hash value , And will bitmap The corresponding position is set to 1, Above picture ,bitmap Of 1,3,6 The location is marked as 1, At this time, if a data request comes , Still use the previous three Hash Function calculation Hash value , If it's the same data , It must still be mapped to 1,3,6 position , Then it can be judged that the data has been stored before , If the three locations of the new data map , One doesn't match , If mapped to 1,3,7 position , because 7 Is it 0, That is, this data has not been added to the database before , So go straight back .
The problem with the bloom filter
This way up here , You should have found out , There are some problems with the bloom filter :
The first on the one hand, , The bloom filter may misjudge :
If there is such a scenario , Put the packet 1 when , take bitmap Of 1,3,6 Bit set to 1, Put the packet 2 When will bitmap Of 3,6,7 Bit set to 1, At this point, a packet request that has not been saved 3, After hashing three times , Corresponding bitmap The loci are 1,6,7, This data has not been saved before , But because of the packet 1 and 2 When saving, set the corresponding point to 1, So request 3 It will also overwhelm the database , This situation , It will increase with the increase of stored data .

In the second , The bloom filter cannot delete data , There are two difficulties in deleting data :
One is , Due to the possibility of misjudgment , Not sure if the data exists in the database , For example, packets 3.
Two is , When you delete the flag on the bitmap corresponding to a packet , May affect other packets , For example, in the example above , If you delete a packet 1, It means that bitmap1,3,6 Bit is set to 0, At this point, the packet 2 To request , Does not exist , because 3,6 The two bits have been set to 0.
Bloom filter plus
In order to solve the problem of Bloom filter above , An enhanced bloom filter appears (Counting Bloom Filter), The idea of this filter is to combine the bitmap Replace with array , When a position in the array is mapped once +1, When deleted -1, This avoids the need to recalculate the remaining packets after the data is deleted by the ordinary bloom filter Hash The problem of , But there is still no way to avoid misjudgment .

Cuckoo filter
In order to solve the problem that the bloom filter cannot delete elements , The paper 《Cuckoo Filter:Better Than Bloom》 The author proposes a cuckoo filter . Compared to the cuckoo filter , The bloom filter has the following shortcomings : Poor query performance 、 The efficiency of space utilization is low 、 Reverse operation is not supported ( Delete ) as well as Counting is not supported .
Poor query performance Because the bloom filter needs to use multiple hash Function to detect multiple different sites in a bitmap , These sites span a wide range of memory , It can lead to CPU Cache row hit rate is low .
Space efficiency is low Because at the same misjudgment rate , The space utilization rate of cuckoo filter is significantly higher than that of bloom , Space can probably be saved 40% many . However, the bloom filter does not require that the length of the bitmap must be 2 The index of , The cuckoo filter must have this requirement . From this point of view , It seems that bloom filter has more spatial scalability .
Reverse delete operation is not supported This problem really hit the soft rib of Bloom filter . In a dynamic system, elements always come and go . The bloom filter is like an imprint , Come here, there will be traces , Even if you leave, you can't clean it up . For example, your system only left 1kw Elements , But on the whole, there are hundreds of millions of water elements , Bloom filter is helpless , It will store the imprints of these lost elements there forever . As time goes by , This filter will get more and more crowded , Until one day you find that its misjudgment rate is too high , Have to rebuild .
Cuckoo filter claims to have solved this problem , It can effectively support reverse deletion . And make it an important selling point , Tempt you to give up the bloom filter and switch to the cuckoo filter .
Why is it called cuckoo ?
There is an idiom ,「 Dog in the manger 」, So are cuckoos , Cuckoos never build their own nests . It lays its eggs in other people's nests , Let others help hatch . After the little cuckoo broke its shell , Because cuckoos are relatively large , It will be the adoptive mother's other children ( Or eggs ) Squeeze out of the nest —— Fell from a high altitude and died .
Cuckoo
The simplest cuckoo hash structure is a one-dimensional array structure , There will be two. hash The algorithm maps the new element to two positions in the array . If one of the two positions is empty , Then you can put the elements directly in . But if both positions are full , It has to 「 Dog in the manger 」, Kick one at random , Then he occupied this position .
p1 = hash1(x) % l
p2 = hash2(x) % lUnlike cuckoos , The cuckoo hash algorithm will help these victims ( Squeezed eggs ) Look for other nests . Because each element can be placed in two places , As long as any free position , You can put it in . So the sad egg will see if his other position is free , If it is empty , Move over and everyone will be happy . But what if this position is occupied by others ? good , Then it will do it again 「 Dog in the manger 」, Transfer the victim's role to others . Then the new victim will repeat the process until all the eggs find their nests .
Cuckoo hash problem
But there's a problem , That is, if the array is too crowded , Kicking around hundreds of times without stopping , This will seriously affect the insertion efficiency . At this time, cuckoo hash will set a threshold , When the continuous nesting behavior exceeds a certain threshold , I think this array is almost full . At this time, it needs to be expanded , Repositioning all elements .
There will be another problem , That is, there may be a run cycle . For example, two different elements ,hash The next two positions are exactly the same , At this time, they have no problem in one position . But here comes the third element , it hash Then the position is the same as them , Obviously , There will be a run cycle . But let three different elements go through twice hash The back position is the same , This probability is not very high , Unless it's yours hash The algorithm is too frustrating .
Cuckoo hash algorithm's attitude towards this run cycle is that the array is too crowded , Need to expand ( It's not like that ).
Optimize
The average space utilization of the cuckoo hash algorithm above is not high , Perhaps only 50%. At this percentage , The number of consecutive runs will soon exceed the threshold . The value of such a hash algorithm is not obvious , So it needs to be improved .
One of the improvements is to increase hash function , Let each element have more than two nests , But three nests 、 Four nests . This can greatly reduce the probability of collision , Increase space utilization to 95% about .
Another improvement is to hang multiple seats in each position of the array , So even if the two elements are hash In the same position , You don't have to 「 Dog in the manger 」, Because there are multiple seats here , You can sit anywhere you like . Unless there are more seats , A run is needed . Obviously, this will also significantly reduce the number of runs . The space utilization of this scheme is only 85% about , But the query efficiency will be very high , Multiple seats in the same memory space , Can be used effectively CPU Cache .
Therefore, a more efficient solution is to integrate the above two improved solutions , For example, use 4 individual hash function , On each position 2 A seat . In this way, we can get time efficiency , You can also get space efficiency . Such a combination can even increase the space utilization 99%, This is a great space efficiency .
Cuckoo filter
The cuckoo filter is the same as the cuckoo hash structure , It's also a one-dimensional array , But unlike cuckoo hash , Cuckoo hashes store the entire element , The cuckoo filter only stores the fingerprint information of the element ( How many? bit, Similar to bloom filter ). Here, the filter sacrifices the accuracy of the data for spatial efficiency . It is precisely because the fingerprint information of the element is stored , So there will be a misjudgment rate , This is the same as the bloom filter .
First of all, cuckoo filter will only use two hash function , But each position can hold multiple seats . these two items. hash The function selection is special , Because only fingerprint information can be stored in the filter . When the fingerprints in this position are run , It needs to calculate another dual position . The calculation of this dual position requires the element itself , Let's recall the previous hash position calculation formula .
fp = fingerprint(x)
p1 = hash1(x) % l
p2 = hash2(x) % lWe know p1 and x The fingerprints of , There is no way to directly calculate p2 Of .
special hash function
The clever thing about cuckoo filter is that it designs a unique hash function , So that it can be based on p1 and Elemental fingerprints Calculate directly p2, Without the need for a complete x Elements .
fp = fingerprint(x)
p1 = hash(x)
p2 = p1 ^ hash(fp) // Exclusive or As can be seen from the above formula , When we know fp and p1, You can directly calculate p2. Similarly, if we know p2 and fp, It can also be calculated directly p1 —— Duality .
p1 = p2 ^ hash(fp) So we don't need to know the current position is p1 still p2, Just put the current location and hash(fp) The dual position can be obtained by XOR calculation . And just make sure hash(fp) != 0 To make sure p1 != p2, In this way, there will be no problem that kicking yourself leads to a dead cycle .
Maybe you'll ask why it's here hash The function does not need to modulo the length of the array ? It's actually needed , However, the cuckoo filter forces that the length of the array must be 2 The index of , So modulo the length of an array is equivalent to modulo hash At the end of the value n position . During XOR operation , Ignore the low n position Just other bits . The calculated position p Keep it low n Bit is the final dual position .
source :cnblogs.com/Courage129/p/14337466.html
recommend
Technical involution group , Learn together !!

PS: Because the official account platform changed the push rules. , If you don't want to miss the content , Remember to click after reading “ Looking at ”, Add one “ Star standard ”, In this way, each new article push will appear in your subscription list for the first time . spot “ Looking at ” Support us !
边栏推荐
- Set of redis data structure
- Linked list 4- 21 merge two ordered linked lists
- Does CDC 2.2.1 monitoring sqlserver not support monitoring multiple databases?
- Pytoch -- error reporting solution: "torch/optim/adamw.py" beta1, unboundlocalerror: local variable 'beta1‘
- RSPS2022 Finalist | Dr. Yang Bai 简介
- 零基础学编程/学逆向/过检测(frida实战)
- Array implementation of circular linked list
- What happened to this page when sqlserver was saving
- wpa_supplicant的状态机迁移
- chrome突然无法复制粘贴了
猜你喜欢

Redis中的布隆过滤器与布谷鸟过滤器,你了解多少?

Killed by the script, and "resurrected" by camping

牛客网:判断是否为回文字符串

第八届 GopherChina 大会蓄势待发!

第四届青年生命科学论坛 | 第一轮通知
Golang implements reliable delay queue based on redis

Exness sorted out three problems to be solved in Musk's acquisition of Twitter

postman学习

<JVM上篇:内存与垃圾回收篇>08-对象实例化及直接内存

RSPS2022 Finalist | Dr. Yang Bai 简介
随机推荐
Filebeat收集日志数据传输到Redis,通过Logstash来根据日志字段创建不同的ES索引
2022年5月中国游戏厂商及应用出海 EMEA 地区收入30强
The Fourth Youth Life Science Forum | first round notice
Correct method of converting Inkscape into DXF file SVG exporting DXF file
静态链表(一)
2022年T电梯修理复训题库及答案
牛客网:最小覆盖子串
阻碍华为5G手机的关键芯片取得突破,国产芯片已取得一成份额
Conda——基础配置命令总结
2022年R2移动式压力容器充装试题模拟考试平台操作
Activity跳转到Fragment的方法(Intent)
d的dip1000,1
I became a big enemy when I bought wanghong ice cream
Jenkins configuration project integration pin notification
SaaS application development guide
Linked list 4- 21 merge two ordered linked lists
Array implementation of circular linked list
Network intelligent operation and maintenance helps improve operation and maintenance efficiency
Concepts and solutions of redis' cache penetration, cache avalanche and cache breakdown problems
centerOS 安装mangodb