Preface
This article implements a simple Bloom filter in C#. To keep the explanation short, the design is deliberately simple and is intended for learning purposes only.
Thanks to @Shi Zong for taking the time to offer guidance despite a busy schedule.
A brief introduction to the Bloom filter
The Bloom filter is a special kind of hash table that can quickly determine whether a piece of data exists while using very little storage space. It is often used for data filtering where a certain false positive rate is acceptable, and in scenarios such as preventing cache penetration.
Compared with a traditional hash table such as HashSet in .NET, it has the following advantages and disadvantages.
Advantages:
- It takes up less storage space, because unlike HashSet it does not need to store the original key data.
Disadvantages:
- It has a false positive rate. Data that the filter reports as absent is definitely absent, but data that it reports as present is not necessarily present. This follows from how the Bloom filter is implemented.
- It does not support deleting data. The reason is explained below.
Data storage
The data of a Bloom filter is stored in a bitmap. In short, a bitmap is an array of bits. A hash table calls the position where each element is stored a bucket, and each bit of the bitmap is one bucket of the Bloom filter.
Each bucket of a Bloom filter can only store 0 or 1. When data is inserted, the Bloom filter uses a hash function to compute the bucket that corresponds to the inserted key and sets that bucket to 1.
A query likewise uses the hash function to compute the bucket that corresponds to the key. If the value of the bucket is 1, the key is considered present.
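The insert and query steps above can be sketched with a single hash function. This is a minimal illustration rather than the article's implementation: SingleHashFilter is a hypothetical name, a bool[] stands in for the bitmap, and string.GetHashCode stands in for a real hash function (its values are only stable within one process).

```csharp
using System;

// Hypothetical single-hash filter: one bucket per key, no collision handling.
class SingleHashFilter
{
    private readonly bool[] _buckets;

    public SingleHashFilter(int buckets) => _buckets = new bool[buckets];

    // Map the key to a bucket; the cast to uint keeps the index non-negative.
    private int BucketOf(string key) =>
        (int)((uint)key.GetHashCode() % (uint)_buckets.Length);

    // Insert: set the key's bucket to 1.
    public void Add(string key) => _buckets[BucketOf(key)] = true;

    // Query: the key is considered present if its bucket is 1.
    public bool MightContain(string key) => _buckets[BucketOf(key)];
}
```

A key that was added is always reported as present. A key that was never added is reported as present only if it happens to land on a bucket that another key has already set.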
Resolving hash collisions
Because the Bloom filter uses a hash function, it cannot escape the problem of hash collisions. For a Bloom filter, a hash collision means a false positive.
Traditional hash tables resolve collisions with techniques such as open addressing and separate chaining. The Bloom filter takes a different approach: it uses multiple hash functions.
In the figure below, Bar and Baz are inserted into the Bloom filter. Hash1 maps them to the same position, but Hash2 maps them to different positions, so Bar and Baz can still be distinguished.
Even with this approach, a collision is still possible, as shown in the figure below:
Because the Bloom filter does not keep the original value of an inserted key, hash collisions are unavoidable. We can only add hash functions to reduce the probability of a collision, that is, to reduce the false positive rate.
Suppose the Bloom filter has m buckets and k hash functions, and n keys have already been inserted. The false positive rate ε can be obtained through mathematical derivation, and the formula is as follows:
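In the notation above, the standard approximation (which assumes the k hash functions are independent and spread keys uniformly over the m buckets) is:

```latex
\varepsilon \approx \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}
            \approx \left(1 - e^{-kn/m}\right)^{k}
```

Here $(1 - 1/m)^{kn}$ is the probability that a given bucket is still 0 after n insertions, and a false positive requires all k buckets checked for a key to be 1.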
For the detailed derivation, see https://en.wikipedia.org/wiki/Bloom_filter.
The false positive rate of a Bloom filter grows with the number of inserted keys n and shrinks as the number of buckets m grows. Adding hash functions also lowers it, but only up to a point: for given m and n the rate is lowest at roughly k = (m/n) ln 2, and beyond that extra hash functions fill the bitmap faster and make things worse. To reduce the false positive rate we can therefore increase m or, within that limit, increase k. Increasing m means the filter takes up more storage space, and increasing k means insertion and query become slower.
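To get a feel for these trade-offs, the approximation ε ≈ (1 - e^(-kn/m))^k can be evaluated directly. The sketch below is only an illustration under that approximation; BloomMath and its method names are hypothetical and are not part of the article's code.

```csharp
using System;

// Hypothetical helper for estimating Bloom filter parameters.
static class BloomMath
{
    // Approximate false positive rate for m buckets, k hash functions, n inserted keys.
    public static double FalsePositiveRate(long m, int k, long n) =>
        Math.Pow(1 - Math.Exp(-(double)k * n / m), k);

    // The k that minimizes the rate for given m and n: (m / n) * ln 2, at least 1.
    public static int OptimalHashFunctions(long m, long n) =>
        Math.Max(1, (int)Math.Round((double)m / n * Math.Log(2)));
}
```

For m = 10,000 and n = 1,000 the optimum is k = 7, which gives a rate of roughly 0.008. Raising k to 20 with the same m and n pushes the estimate up to about 0.05, which shows that more hash functions are not always better.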
Why the Bloom filter does not support deletion
The Bloom filter resolves collisions with multiple hash functions, which also means that several inserted elements may share the same bucket. Deleting one element would therefore also clear some of the buckets that belong to other elements. For this reason, a Bloom filter built on a bitmap does not support deletion.
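The problem can be reproduced in a few lines. In this sketch the bucket positions for Bar and Baz are made up for illustration (they are not the output of any real hash function), and DeletionDemo is a hypothetical name.

```csharp
using System;

static class DeletionDemo
{
    // Returns whether Bar is still reported as present after Baz is naively "deleted".
    public static bool BarSurvivesDeletingBaz()
    {
        var bits = new bool[8];
        int[] bar = { 1, 4 }; // assumed buckets of Bar
        int[] baz = { 4, 6 }; // assumed buckets of Baz; bucket 4 is shared

        foreach (int i in bar) bits[i] = true;  // insert Bar
        foreach (int i in baz) bits[i] = true;  // insert Baz
        foreach (int i in baz) bits[i] = false; // naive deletion of Baz

        // Bar was never deleted, yet its shared bucket 4 has been cleared.
        return Array.TrueForAll(bar, i => bits[i]);
    }
}
```

The method returns false: clearing the buckets of Baz turns Bar into a false negative, which a Bloom filter must never produce.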
Implementing a Bitmap in C#
Before implementing the Bloom filter, we first need to implement a Bitmap.
In C#, we cannot use a bit directly as the smallest unit of storage, but with the help of bit operations we can build one on top of other data types such as byte. Below, byte is used as the example for describing the Bitmap implementation, but int, long and so on would work just as well.
Bit operations
Here is a brief introduction to the bit operations in C#:
Symbol | Description | Rule |
---|---|---|
& | AND | The result is 1 only when both bits are 1 |
\| | OR | The result is 0 only when both bits are 0 |
^ | XOR | The result is 0 when the two bits are the same and 1 when they differ |
~ | NOT | 0 becomes 1, 1 becomes 0 |
<< | Left shift | All bits move left by the given number of positions; the low bits are filled with 0 |
>> | Right shift | All bits move right by the given number of positions; the high bits are filled with 0 (for negative values of signed types, with 1) |
Generally speaking, the operands of a bit operation consist of multiple bits. When the &, | and ^ operators are applied to two numbers, the two numbers are aligned on the right and the result is computed bit by bit.
// The 0b prefix marks a binary literal
short a = 0b0111111111111001;
byte b = 0b11111111;
short c = (short)(a & b);  // 0b0000000011111001
short d = (short)(a | b);  // 0b0111111111111111
short e = (short)(a ^ b);  // 0b0111111100000110
byte f = (byte)~b;         // 0b00000000
short g = (short)(b << 1); // 0b0000000111111110
short h = (short)(b >> 1); // 0b0000000001111111
Building a Bitmap with bit operations
Implementing a Bitmap on top of byte means being able to modify and read the value of every bit in a byte. The modifications should also be idempotent.
- Set the bit at a given position to 1
As the rules above show, a bit operation cannot modify a single bit of a bit sequence in isolation. It always combines two sequences, aligned on the right, bit by bit.
The | operator can do this. Suppose we want to change the bit at index 3, counted from the right and starting at 0. We prepare a bit sequence whose bit at that position is 1 and whose other bits are 0, and combine it with the sequence to be changed using |.
// To set bit 3 of a (counted from the right) to 1, prepare a mask b
byte a = 0b10100010;
byte b = 1 << 3; // 0b00001000
a |= b;          // 0b10101010
- Set the bit at a given position to 0
This is the opposite of setting a bit to 1: prepare a bit sequence whose bit at the given position is 0 and whose other bits are 1, and combine it with the sequence to be changed using &.
byte a = 0b10101010;
byte b = 1 << 3; // 0b00001000
b = (byte)~b;    // 0b11110111 (~ yields an int, so the cast back to byte is required)
a &= b;          // 0b10100010
- Read the value at a given position
Use the & operator: if the result is not 0, the bit at the given position is 1.
byte a = 0b10101010;
byte b = 1 << 3; // 0b00001000
a &= b;          // 0b00001000, which is not 0, so bit 3 was 1
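The three operations above can be wrapped in small helpers, which also makes the idempotence requirement easy to check. This is only a sketch: BitOps and its method names are hypothetical and do not appear in the article's repository.

```csharp
using System;

// Hypothetical helpers for manipulating a single bit of a byte.
static class BitOps
{
    // OR with a mask that has only the target bit set.
    public static byte SetBit(byte value, int index) =>
        (byte)(value | (1 << index));

    // AND with a mask that has every bit set except the target bit.
    public static byte ClearBit(byte value, int index) =>
        (byte)(value & ~(1 << index));

    // The bit is 1 if masking out everything else leaves a non-zero value.
    public static bool TestBit(byte value, int index) =>
        (value & (1 << index)) != 0;
}
```

Calling SetBit or ClearBit twice with the same index gives the same result as calling it once, which is the idempotence the Bitmap relies on.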
With these basic operations in hand, we store the data in a byte array.
using System;

class Bitmap
{
    private readonly byte[] _bytes;
    private readonly long _capacity;

    public Bitmap(long capacity)
    {
        _capacity = capacity;
        // Round up so that every bit has a byte to live in
        _bytes = new byte[(capacity + 7) / 8];
    }

    public long Capacity => _capacity;

    public void Set(long index)
    {
        CheckIndex(index);
        // Which byte the bit lives in
        long byteIndex = index / 8;
        // Which bit inside that byte
        int bitIndex = (int)(index % 8);
        _bytes[byteIndex] |= (byte)(1 << bitIndex);
    }

    public void Remove(long index)
    {
        CheckIndex(index);
        long byteIndex = index / 8;
        int bitIndex = (int)(index % 8);
        _bytes[byteIndex] &= unchecked((byte)~(1 << bitIndex));
    }

    public bool Get(long index)
    {
        CheckIndex(index);
        long byteIndex = index / 8;
        int bitIndex = (int)(index % 8);
        return (_bytes[byteIndex] & (byte)(1 << bitIndex)) != 0;
    }

    private void CheckIndex(long index)
    {
        // Reject negative indexes as well as indexes past the end
        if (index < 0 || index >= _capacity)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }
    }
}
Implementing the Bloom filter in C#
With the Bitmap in place, a simple Bloom filter can be completed once a hash function is available. Here we follow the implementation in guava, a Java library.
https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java
Using MurmurHash3
Like guava, we use MurmurHash3 as the hash function.
The following is a usable implementation that the author found on GitHub.
https://github.com/darrenkopp/murmurhash-net
With this library, a byte array of any length can be hashed to a 128-bit value, that is, 16 bytes.
byte[] data = Guid.NewGuid().ToByteArray();
// returns a 128-bit algorithm using "unsafe" code with default seed
HashAlgorithm murmur128 = MurmurHash.Create128(managed: false);
byte[] hash = murmur128.ComputeHash(data);
Converting a key of any type into a byte array
Definition of Funnel and Sink
Keys of various types have to be converted into a byte array that MurmurHash can process directly. For this we borrow the following two concepts from guava:
Funnel: converts all kinds of data into a byte array, including built-in types such as int, bool and string as well as custom complex types.
Sink: the core component of Funnel, which acts as the data buffer. When a Funnel converts an instance of a custom complex type into a byte array, it breaks the data apart and writes the pieces into the sink one by one.
Funnel can be defined as the following delegate, which accepts the original value and writes it into the sink.
delegate void Funnel<in T>(T from, ISink sink);
Sink converts data of different types into bytes and concatenates them.
using System.Text;

interface ISink
{
    ISink PutByte(byte b);
    ISink PutBytes(byte[] bytes);
    ISink PutBool(bool b);
    ISink PutShort(short s);
    ISink PutInt(int i);
    ISink PutString(string s, Encoding encoding);
    ISink PutObject<T>(T obj, Funnel<T> funnel);
    // ... other built-in types; readers can add them as needed
}
A simple Funnel implementation looks like this:
public static class Funnels
{
    public static readonly Funnel<string> StringFunnel = (from, sink) =>
        sink.PutString(from, Encoding.UTF8);

    public static readonly Funnel<int> IntFunnel = (from, sink) =>
        sink.PutInt(from);
}
A Funnel for a custom complex type breaks the data apart and writes the pieces into the sink one by one. Instance members of a complex type may themselves be complex types, so Sink has to provide a PutObject method that supports this nested decomposition.
Funnel<Foo> funnelFoo = (foo, sink) =>
{
    sink.PutString(foo.A, Encoding.UTF8);
    sink.PutInt(foo.B);
    // The nested Bar member gets its own funnel
    Funnel<Bar> funnelBar = (bar, barSink) => barSink.PutBool(bar.C);
    sink.PutObject(foo.Bar, funnelBar);
};

class Foo
{
    public string A { get; set; }
    public int B { get; set; }
    public Bar Bar { get; set; }
}

class Bar
{
    public bool C { get; set; }
}
Implementation of Sink
The core of Sink is a byte array buffer. With ArrayPool we can easily implement a ByteBuffer.
using System;
using System.Buffers;
using System.Buffers.Binary;

class ByteBuffer : IDisposable
{
    private readonly int _capacity;
    private readonly byte[] _buffer;
    private int _offset;
    private bool _disposed;

    public ByteBuffer(int capacity)
    {
        _capacity = capacity;
        // Rent may return a larger array; _capacity remains the logical limit
        _buffer = ArrayPool<byte>.Shared.Rent(capacity);
    }

    public void Put(byte b)
    {
        CheckInsertable(sizeof(byte));
        _buffer[_offset] = b;
        _offset++;
    }

    public void Put(byte[] bytes)
    {
        CheckInsertable(bytes.Length);
        bytes.CopyTo(_buffer.AsSpan(_offset, bytes.Length));
        _offset += bytes.Length;
    }

    public void PutInt(int i)
    {
        CheckInsertable(sizeof(int));
        BinaryPrimitives.WriteInt32BigEndian(GetRemainingAsSpan(), i);
        _offset += sizeof(int);
    }

    public void PutShort(short s)
    {
        CheckInsertable(sizeof(short));
        // Write exactly 2 bytes, matching the offset advance below
        BinaryPrimitives.WriteInt16BigEndian(GetRemainingAsSpan(), s);
        _offset += sizeof(short);
    }

    // ... implementations for the other primitive types

    public Span<byte> GetBuffer() => _buffer.AsSpan(.._offset);

    public bool HasRemaining() => _offset < _capacity;

    public void Dispose()
    {
        // Guard against returning the same array to the pool twice
        if (_disposed) return;
        _disposed = true;
        ArrayPool<byte>.Shared.Return(_buffer);
    }

    private void CheckInsertable(int size)
    {
        if (_disposed)
        {
            throw new ObjectDisposedException(typeof(ByteBuffer).FullName);
        }
        // Make sure the whole value fits, not just its first byte
        if (_offset + size > _capacity)
        {
            throw new OverflowException("Byte buffer overflow");
        }
    }

    private Span<byte> GetRemainingAsSpan() => _buffer.AsSpan(_offset..);
}
Sink is a further encapsulation of ByteBuffer that adapts it to the current usage scenario.
using System;
using System.Text;

class Sink : ISink, IDisposable
{
    private readonly ByteBuffer _byteBuffer;

    /// <summary>
    /// Creates a new <see cref="Sink"/> instance
    /// </summary>
    /// <param name="expectedInputSize">The maximum expected size, in bytes, of a single input element</param>
    public Sink(int expectedInputSize)
    {
        _byteBuffer = new ByteBuffer(expectedInputSize);
    }

    public ISink PutByte(byte b)
    {
        _byteBuffer.Put(b);
        return this;
    }

    public ISink PutBytes(byte[] bytes)
    {
        _byteBuffer.Put(bytes);
        return this;
    }

    public ISink PutBool(bool b)
    {
        _byteBuffer.Put((byte)(b ? 1 : 0));
        return this;
    }

    public ISink PutShort(short s)
    {
        _byteBuffer.PutShort(s);
        return this;
    }

    public ISink PutInt(int i)
    {
        _byteBuffer.PutInt(i);
        return this;
    }

    public ISink PutString(string s, Encoding encoding)
    {
        _byteBuffer.Put(encoding.GetBytes(s));
        return this;
    }

    public ISink PutObject<T>(T obj, Funnel<T> funnel)
    {
        // Let the funnel decompose the object and write its parts into this sink
        funnel(obj, this);
        return this;
    }

    // Copy out everything written so far
    public byte[] GetBytes() => _byteBuffer.GetBuffer().ToArray();

    public void Dispose()
    {
        _byteBuffer.Dispose();
    }
}
k hash functions and the Bloom filter implementation
As mentioned above, the Bloom filter uses k hash functions to deal with hash collisions. In practice, we can split the result of a single murmur hash computation (16 bytes) into two halves and convert each half to a long (a long is 8 bytes).
The two halves are stored as hash1 and hash2, and the k-th hash function is a combination of hash1 and hash2:
hash(k) = hash1 + (k-1) * hash2
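As a rough sketch of how this formula turns two hash values into k bucket indexes, consider the helper below. DoubleHashing and BucketIndexes are hypothetical names used only for illustration. The sign bit is cleared before taking the remainder so that the index is never negative.

```csharp
using System;

static class DoubleHashing
{
    // Bucket index for each of the k derived hash functions:
    // hash(i) = hash1 + (i - 1) * hash2, for i = 1..k.
    public static long[] BucketIndexes(long hash1, long hash2, int k, long buckets)
    {
        var indexes = new long[k];
        long combined = hash1;
        for (int i = 0; i < k; i++)
        {
            // Clear the sign bit so that the remainder is non-negative
            indexes[i] = (combined & long.MaxValue) % buckets;
            // Overflow simply wraps around, which is fine for hashing
            combined = unchecked(combined + hash2);
        }
        return indexes;
    }
}
```

With hash1 = 5, hash2 = 3, k = 3 and 10 buckets, the derived values are 5, 8 and 11, which map to the bucket indexes 5, 8 and 1.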
using System;
using System.Buffers.Binary;
using System.Runtime.CompilerServices;
using System.Security.Cryptography;
using Murmur; // namespace of the murmurhash-net package

public class BloomFilter<T>
{
    private readonly int _hashFunctions;
    private readonly Funnel<T> _funnel;
    private readonly int _expectedInputSize;
    private readonly Bitmap _bitmap;
    private readonly HashAlgorithm _murmur128;

    /// <summary>
    /// Creates a new <see cref="BloomFilter{T}"/> instance
    /// </summary>
    /// <param name="funnel">The <see cref="Funnel{T}"/> implementation for the type of the inserted elements</param>
    /// <param name="buckets">The number of buckets in the internal Bitmap; the larger it is, the lower the false positive rate</param>
    /// <param name="hashFunctions">The number of hash functions; more functions lower the false positive rate, up to the optimum for the given bucket count</param>
    /// <param name="expectedInputSize">The estimated maximum size, in bytes, of a single inserted element</param>
    public BloomFilter(Funnel<T> funnel, int buckets, int hashFunctions = 2, int expectedInputSize = 128)
    {
        _hashFunctions = hashFunctions;
        _funnel = funnel;
        _expectedInputSize = expectedInputSize;
        _bitmap = new Bitmap(buckets);
        _murmur128 = MurmurHash.Create128(managed: false);
    }

    public void Add(T item)
    {
        long bitSize = _bitmap.Capacity;
        var (hash1, hash2) = Hash(item);
        long combinedHash = hash1;
        for (int i = 0; i < _hashFunctions; i++)
        {
            // Clear the sign bit so that the bucket index is never negative
            _bitmap.Set((combinedHash & long.MaxValue) % bitSize);
            combinedHash += hash2;
        }
    }

    public bool MightContains(T item)
    {
        long bitSize = _bitmap.Capacity;
        var (hash1, hash2) = Hash(item);
        long combinedHash = hash1;
        for (int i = 0; i < _hashFunctions; i++)
        {
            // A single 0 bucket proves the item was never added
            if (!_bitmap.Get((combinedHash & long.MaxValue) % bitSize))
            {
                return false;
            }
            combinedHash += hash2;
        }
        return true;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private (long Hash1, long Hash2) Hash(T item)
    {
        byte[] inputBytes;
        using (var sink = new Sink(_expectedInputSize))
        {
            sink.PutObject(item, _funnel);
            inputBytes = sink.GetBytes();
        }
        // Split the 16-byte hash into two 8-byte halves
        var hashSpan = _murmur128.ComputeHash(inputBytes).AsSpan();
        long lowerEight = BinaryPrimitives.ReadInt64LittleEndian(hashSpan.Slice(0, 8));
        long upperEight = BinaryPrimitives.ReadInt64LittleEndian(hashSpan.Slice(8, 8));
        return (lowerEight, upperEight);
    }
}
Extensions
Bloom filter with counters
The Bitmap-based Bloom filter above does not support deletion. However, if the bit array is replaced with an array whose buckets are n bits wide, each bucket can act as a counter. Deleting an element then means decrementing its counters by one, which gives a Bloom filter that supports deletion within a limited range. The price is that the storage space grows to n times the original.
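The counter idea can be sketched as follows. This is an illustration under simplifying assumptions, not the article's code: CountingBloomFilter is a hypothetical name, each bucket is one byte (n = 8), and the caller passes in the two hash values directly instead of going through Funnel and MurmurHash.

```csharp
using System;

// Hypothetical counting Bloom filter with one 8-bit counter per bucket.
class CountingBloomFilter
{
    private readonly byte[] _counters;
    private readonly int _hashFunctions;

    public CountingBloomFilter(int buckets, int hashFunctions)
    {
        _counters = new byte[buckets];
        _hashFunctions = hashFunctions;
    }

    public void Add(long hash1, long hash2)
    {
        foreach (int i in Indexes(hash1, hash2))
        {
            // Saturate at 255 instead of wrapping around to 0
            if (_counters[i] < byte.MaxValue) _counters[i]++;
        }
    }

    // Only valid for keys that were actually added before.
    public void Remove(long hash1, long hash2)
    {
        foreach (int i in Indexes(hash1, hash2))
        {
            // A saturated counter has lost its exact count, so leave it alone
            if (_counters[i] > 0 && _counters[i] < byte.MaxValue) _counters[i]--;
        }
    }

    public bool MightContain(long hash1, long hash2) =>
        Array.TrueForAll(Indexes(hash1, hash2), i => _counters[i] > 0);

    // Derive the bucket indexes with hash(i) = hash1 + (i - 1) * hash2.
    private int[] Indexes(long hash1, long hash2)
    {
        var result = new int[_hashFunctions];
        long combined = hash1;
        for (int i = 0; i < _hashFunctions; i++)
        {
            result[i] = (int)((combined & long.MaxValue) % _counters.Length);
            combined = unchecked(combined + hash2);
        }
        return result;
    }
}
```

Because a shared bucket is counted once per key, removing one key leaves the other keys' counters above zero. The "limited range" is the counter width: once a counter saturates it can no longer be decremented safely.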
A distributed Bloom filter
If you actually need a Bloom filter in a distributed environment, I recommend the following library, which is provided as a module for Redis. See the link below for details.
https://github.com/RedisBloom/RedisBloom
Code
For ease of study, all the code in this article has been collected on GitHub: https://github.com/eventhorizon-cli/EventHorizon.BloomFilter