## Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970, that is used to test whether an element is a
member of a set. False positive matches are possible, but false negatives are
not, thus a Bloom filter has a 100% recall rate. In other words, a query
returns either "possibly in set" or "definitely not in set". Elements can be
added to the set, but not removed (though this can be addressed with a
"counting" filter). The more elements that are added to the set, the larger the
probability of false positives.

(from Wikipedia
definition)

In short, *to insert a value*, you apply several hash functions to
the value, and set the bit according to each resulting hash in a zeroed
bit-array:

Then, *to check if a value may be in the storage*, you apply the same
hash functions to the value:

- If all bits corresponding to all hashes are set, then the value may have been inserted (x,y,z in the above diagram) - with some potential false positive.
- If a bit corresponding to any hash in the array is not set (as w value above), then you can be sure that the w value was never inserted.

- When you initialize your Bloom Filter storage, you can define the maximum
false positive ratio you expect, for a given number of items.

The resulting ratio of false positives would match the expectations, thanks to mathematical proof. - Memory needed to store the information is reduced: e.g. 9.6 bits per item for 1% of false positives, regardless of its value length.
- Value lookup is almost instant: only a few hashes are to be performed on the value.
- You could easily compact a storage with only a few items, since it will be mostly filled with zeros.
- You can apply lossless union and intersection of bit storage, using OR and AND binary operations.
- It is very easy and efficient to transmit deltas of bit storage, with very few memory use, between several states of the data: you are not forced to transmit the whole buffer each time some new items are added.

- Avoid uneeded disk lookup for Big Data engines, by maintaining a BF structure in memory.
- Routing over several nodes, with minimal resource consumption (RAM, CPU, disk) of the BF on the router: this may be used at IP level, or at application level.
- P2P synchronization, avoiding most data transfert.
- Web cache optimization: only cache a page if it was used more than once.
- Fast identification of spam content or malicious URI.
- Infrastructure measurements.

The Bloom filter principle:

*Wherever a list or set is used, and space is at a premium, consider
using a Bloom filter if the effect of false positives can be
mitigated.*

## TSynBloomFilter

We just added a `TSynBloomFilter`

class to our Open Source
framework core.

It features:

- Automatic computation of all internal storage details (bits storage size, number of hash functions) from basic high-level information, i.e. estimation of maximum number of items, and false positive probability.
- Insertion and lookup is very optimized for speed, using hardware-accelerated crc32c hashing, reduced to only two calls, following mathematically approved best practice.
- Persistence of the data using an efficient RLE compression of continuous zeros.
- Full regression tests provided (as usual).
- Thread-safe process.
- Works on Delphi 5 up to 10.1 Berlin, and FPC.

It is very easy to use (extracted from the regression tests):

` b := TSynBloomFilter.Create(200000);`

try

// b.FalsePositivePercent=1

// b.Size=200000

// b.Bits=1964937=1.8MB

// b.HashFunctions=7

for i := 1 to 1000 do

Check(not b.MayExist(@i,sizeof(i)));

Check(b.Inserted=0);

for i := 1 to 1000 do

b.Insert(@i,sizeof(i));

Check(b.Inserted=1000);

for i := 1 to 1000 do

Check(b.MayExist(@i,sizeof(i)));

savedbuffer := b.SaveTo;

finally

b.Free;

end;

b := TSynBloomFilter.Create(savedbuffer);

try

for i := 1 to 1000 do

Check(b.MayExist(@i,sizeof(i)));

finally

b.Free;

end;

The value lookup is called `MayExist()`

, to clearly state that it
returns a *probability* of a value presence.

You can check the documentation of this class online.

## TSynBloomFilterDiff

In addition to the stand-alone `TSynBloomFilter`

class, a
dedicated class allows remote synchronization of Bloom Filter values, with
minimal data transmission.

This class can compute *incremental serialization* of the bits
storage, following a revision number.

The first transmission of the BF data may consist in up to 10 bits per stored
value, but later synchronization would only consume about 10 bytes per newly
inserted value.

If you consider that 10,000,000 items presence with 1% of false positive ratio
would need 11.5 MB of data, a further synchronization after 100 items
insertions would only return 1 KB of data.

When compared to a regular database storing 10,000,000 string values, you
can easily guess the gain of using this class.

Storing the presence 1,000,000,000 string values would need 1.15 GB of RAM,
which still fit in a 32-bit process, and is less resource consuming than a
modern Internet Browser with several opened tabs. ;)

You can check the documentation of this class online.

Feel free to discuss on our forum, as usual!