Bloom filters

Bloom filter

Image released to the public domain by David Eppstein.

https://shane-tomlinson.github.io/bloomfilter-presentation/

https://goo.gl/3FuYLy

Me

Shane Tomlinson

https://shanetomlinson.com

@shane_tomlinson

https://github.com/shane-tomlinson

Start with a problem

top 25 passwords

Top 25 Most Used Internet Passwords - An infographic by the team at Tom Fanelli - Infographic Marketing.

Goal

Ban users from using one of the top 50k most common passwords.

Math(s) speak: determine if an item is a member of a set.

Ideas?

Constraints

  1. No data can be sent to the server.
  2. Download must be "reasonably" sized.

What data structures could we use?

What if the occasional false positive is OK?

Hash table

Create a bit array, hash entries to a position in array. Hash table
Image released to the public domain by Jorge Stolfi.

Hash table properties

  • Downloadble
  • Worst case O(1) insertion and lookup
  • Fixed size
  • No need to re-balance

Collisions

Reducing collisions

  • Expand the size of the hash table.
  • Use more than one hash function.

Bloom filter

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.

https://en.wikipedia.org/wiki/Bloom_filter

It looks like an onion

Properties

  1. Has two primary functions, add and test.
  2. If an item is in the set, test will always return true.
  3. If an item is not in the set, test might return true.
  4. Rate of false positives is configurable.

Secondary properties

  1. False positive rate of <1% uses ~9.6 bits per member.
  2. Supports union and intersection of two or more filters.

Nomenclature

k
number of hash functions.
m
size of hash table, in bits.
n
size of set.

Bloom filter algorithm

Create

  1. Create a bit array of size m.
  2. Fill bit array with all 0s.

Add

  1. Run item through k independent hash functions.
  2. Set the corresponding entries in the bit array to 1.

Test

  1. Run item through k independent hash functions.
  2. Check the corresponding entries in the bit array.
    1. If all entries are 1, return true, otherwise return false.

K independent ... WTF?

K independent hash functions

Can be simulated with only two hash functions.

To find an arbitrary i:

hash(i) = (a + b * i ) % m

Where a is the result of Hash 1, b is the result of Hash 2, and m is the max value of the bit array.

Code

https://github.com/LondonAlgorithms/bloomfilter.js

Resources

Wikipedia: https://en.wikipedia.org/wiki/Bloom_filter

MDN bit operations: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Bitwise_Operators

MDN Uint32Array: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint32Array

Simulate k hashes with only 2: https://willwhim.wpengine.com/2011/09/03/producing-n-hash-functions-by-hashing-only-once/