Bloom Filter and Big Data - Synopse Open Source

Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
(from Wikipedia definition)

In short, to insert a value, you apply several hash functions to the value, and set the bit according to each resulting hash in a zeroed bit-array:

Then, to check if a value may be in the storage, you apply the same hash functions to the value:

If all bits corresponding to all hashes are set, then the value may have been inserted (x,y,z in the above diagram) - with some potential false positive.
If a bit corresponding to any hash in the array is not set (as w value above), then you can be sure that the w value was never inserted.

The benefits of this Bloom Filter algorithm, despite its simplicity, are huge:

When you initialize your Bloom Filter storage, you can define the maximum false positive ratio you expect, for a given number of items.
The resulting ratio of false positives would match the expectations, thanks to mathematical proof.
Memory needed to store the information is reduced: e.g. 9.6 bits per item for 1% of false positives, regardless of its value length.
Value lookup is almost instant: only a few hashes are to be performed on the value.
You could easily compact a storage with only a few items, since it will be mostly filled with zeros.
You can apply lossless union and intersection of bit storage, using OR and AND binary operations.
It is very easy and efficient to transmit deltas of bit storage, with very few memory use, between several states of the data: you are not forced to transmit the whole buffer each time some new items are added.

Typical use cases may be:

Avoid uneeded disk lookup for Big Data engines, by maintaining a BF structure in memory.
Routing over several nodes, with minimal resource consumption (RAM, CPU, disk) of the BF on the router: this may be used at IP level, or at application level.
P2P synchronization, avoiding most data transfert.
Web cache optimization: only cache a page if it was used more than once.
Fast identification of spam content or malicious URI.
Infrastructure measurements.

The Bloom filter principle:
Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

TSynBloomFilter

We just added a TSynBloomFilter class to our Open Source framework core.

It features:

Automatic computation of all internal storage details (bits storage size, number of hash functions) from basic high-level information, i.e. estimation of maximum number of items, and false positive probability.
Insertion and lookup is very optimized for speed, using hardware-accelerated crc32c hashing, reduced to only two calls, following mathematically approved best practice.
Persistence of the data using an efficient RLE compression of continuous zeros.
Full regression tests provided (as usual).
Thread-safe process.
Works on Delphi 5 up to 10.1 Berlin, and FPC.

It is very easy to use (extracted from the regression tests):

  b := TSynBloomFilter.Create(200000);
  try
    // b.FalsePositivePercent=1
    // b.Size=200000
    // b.Bits=1964937=1.8MB
    // b.HashFunctions=7
    for i := 1 to 1000 do
      Check(not b.MayExist(@i,sizeof(i)));
    Check(b.Inserted=0);
    for i := 1 to 1000 do
      b.Insert(@i,sizeof(i));
    Check(b.Inserted=1000);
    for i := 1 to 1000 do
      Check(b.MayExist(@i,sizeof(i)));
    savedbuffer := b.SaveTo;
  finally
    b.Free;
  end;
  b := TSynBloomFilter.Create(savedbuffer);
  try
    for i := 1 to 1000 do
      Check(b.MayExist(@i,sizeof(i)));
  finally
    b.Free;
  end;

The value lookup is called MayExist(), to clearly state that it returns a probability of a value presence.

You can check the documentation of this class online.

TSynBloomFilterDiff

In addition to the stand-alone TSynBloomFilter class, a dedicated class allows remote synchronization of Bloom Filter values, with minimal data transmission.

This class can compute incremental serialization of the bits storage, following a revision number.
The first transmission of the BF data may consist in up to 10 bits per stored value, but later synchronization would only consume about 10 bytes per newly inserted value.
If you consider that 10,000,000 items presence with 1% of false positive ratio would need 11.5 MB of data, a further synchronization after 100 items insertions would only return 1 KB of data.

When compared to a regular database storing 10,000,000 string values, you can easily guess the gain of using this class.
Storing the presence 1,000,000,000 string values would need 1.15 GB of RAM, which still fit in a 32-bit process, and is less resource consuming than a modern Internet Browser with several opened tabs. ;)

You can check the documentation of this class online.

Feel free to discuss on our forum, as usual!