Two Bits Are Better Than One: making bloom filters 2x more accurate
54 points by matheusalmeida 5 days ago | 6 comments
pkoird 3 hours ago
Clever. My first impression was that surely this saturates the filter too fast since we're setting more bits at once, but it looks like the maths checks out. It's one of those non-intuitive things that I am glad I learned today.
lemagedurage 3 hours ago
True, I had the same feeling. The article does work off 256K elements in a Bloom filter of 2M. After 1M elements, using 2 bits actually increases the false positive rate, but at that point the false positive rate is already higher than 50%.
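You can sanity-check that crossover with the textbook formula p ≈ (1 - e^(-kn/m))^k. Treating the "2M" figure as 2 million independent bits (an assumption on my part; the article's single-hash trick won't match the idealized model exactly), the crossover between 1 and 2 bits per element lands right around 1M elements:

    import math

    def fpr(n, m, k):
        # Idealized Bloom filter false positive rate: (1 - e^(-k*n/m))^k
        return (1.0 - math.exp(-k * n / m)) ** k

    m = 2_000_000  # assumption: "2M" read as 2 million bits
    for n in (256_000, 1_000_000, 1_500_000):
        print(f"n={n}: k=1 -> {fpr(n, m, 1):.3f}, k=2 -> {fpr(n, m, 2):.3f}")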
h33t-l4x0r 2 hours ago
Hmm, Bloom filters seem important. I'm wondering why my CS education never even touched on them and it's tbh triggering my imposter syndrome.
benmanns 48 minutes ago
They were only touched on (and just barely) in my CS education, so don’t feel too left out. Spend an evening or two on the Wiki for Probabilistic data structures[0]. With a CS education you should have the baseline knowledge to find them really fascinating. Enjoy!
Oh, and I don’t find myself actually implementing any of these very often or knowing that they are in use. I occasionally use things like APPROX_COUNT_DISTINCT in Snowflake[1], which is a HyperLogLog (linked in the Wiki).
[0]: https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...
[1]: https://docs.snowflake.com/en/sql-reference/functions/approx...
It seems like the authors wanted to use a single hash for performance (?). Maybe they correctly determined that naive Bloom filters have poor cache locality and reinvented block bloom filters from there.
Overall, I think block Bloom filters should be the default most people reach for. They completely solve the cache locality issue (a single cache miss per element lookup), and they only cost something like a 10–15% space increase to do it. I had a simple implementation running at something like 20ns per query with maybe k=9; a naive Bloom filter would be about 9x that.
There’s some discussion in the article about using a single hash to come up with the various indexing locations, but it’s simpler to just think of block Bloom filters as:
1. Hash-0 gets you the block index
2. Hash-1 through hash-k get you the bits inside the block
If your implementation slices a single hash into multiple smaller hashes instead, that's fine.
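As a rough sketch of that structure (not the article's code; the block size, hash function, and k here are assumptions I picked), a block Bloom filter looks something like:

    import hashlib

    BLOCK_BITS = 512  # one 64-byte cache line per block (assumed)
    K = 8             # bits set per key within its block (assumed)

    class BlockBloomFilter:
        def __init__(self, num_blocks):
            self.num_blocks = num_blocks
            self.bits = bytearray(num_blocks * BLOCK_BITS // 8)

        def _probe(self, key):
            # Slice one wide hash into a block index plus K in-block offsets.
            h = int.from_bytes(hashlib.blake2b(key).digest(), "little")
            block = h % self.num_blocks
            h //= self.num_blocks
            offsets = []
            for _ in range(K):
                offsets.append(h % BLOCK_BITS)
                h //= BLOCK_BITS
            return block * BLOCK_BITS, offsets

        def add(self, key):
            base, offsets = self._probe(key)
            for off in offsets:
                self.bits[(base + off) // 8] |= 1 << ((base + off) % 8)

        def __contains__(self, key):
            base, offsets = self._probe(key)
            return all(self.bits[(base + off) // 8] >> ((base + off) % 8) & 1
                       for off in offsets)

Every probe for a given key lands inside the same 64-byte block, which is where the single-cache-miss-per-lookup property comes from.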
I also find the use of atomics to build the filter confusing here. If you're doing a join, you're presumably doing a batch of hashes, so it'd be much more efficient to partition your Bloom filter, lock the partitions, and do a bulk insertion.
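By partitioned bulk insertion I mean something like the following (continuing the toy Python above; the class and the partitioning scheme are mine, not the article's):

    import threading
    from collections import defaultdict

    class PartitionedBuilder:
        # `filt` is any filter with an add(key) method (e.g. the sketch above);
        # `partition_of(key)` must map a key to the partition that owns the
        # words that key will touch.
        def __init__(self, filt, num_partitions, partition_of):
            self.filt = filt
            self.partition_of = partition_of
            self.locks = [threading.Lock() for _ in range(num_partitions)]

        def bulk_insert(self, keys):
            # Bucket the whole batch by partition, then take each partition's
            # lock once and insert its keys in one pass -- no per-bit atomics.
            buckets = defaultdict(list)
            for key in keys:
                buckets[self.partition_of(key) % len(self.locks)].append(key)
            for pid, part_keys in buckets.items():
                with self.locks[pid]:
                    for key in part_keys:
                        self.filt.add(key)

The only real requirement is that partition_of lines up with the filter's memory layout (e.g. contiguous ranges of blocks), so each lock genuinely owns a disjoint slice of the bit array.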