MITB Banner

What Is Bloom Filter And How Useful It Is For Data Science

Share

Bloom filter, an ingenious data structure invented 50 years ago by Howard Bloom to test whether a certain elements belongs to a certain set.

There are many applications of bloom filters such as detecting malicious URLs on browsers, wallet synchronisation for cryptocurrencies, reducing datacenter workloads, recommendation system.

 

Knowledge of hash function prior to knowing Bloom Filter would be of great use here. But, in short, hash functions can be thought of as code given to each data item. The data item can be anything. It can be a string saying Hello World and there will be a corresponding code. This code is used as an index to share the information in a secure way and to retrieve information from memory core. This code tells the computer what to look for and where to.

Bloom filters have a number of hash functions, which are used to set bits in the bit array. The bit positions are determined by the hash functions.

As stated earlier, bloom filter was invented to check whether a set contains certain elements.

When a question is asked to check whether certain data item exists, then the hashed index or code(unique ID) corresponding to that item is checked. If the filter contains all the parts of the code with that of item then it can be concluded that the data item exists.

There can be false positives from time to time, and if you don’t need to remove items, it’s a great choice. A Bloom filter has constant time complexity for both adding items and asking whether they are present, and it requires very little space relative to the size of the items needed to store and check. It gets these properties in large part because it is based on hash functions.

Bloom Filter For Recommendation System

The use of machine learning for recommending movies based on user’s history and techniques like collaborative filtering has been widely documented and companies like Netflix and Amazon employ them on a large scale. In the case of Medium, one of the widely popular publishing platforms which has more than 30 million users every month.

For Medium, a simple method for suggesting a story might be “because you follow the author.” A more complex one might be collaborative filtering, which can be described as “people who recommended lots of the stories that you recommended also recommended this story”.

Medium has thousands of blog posts and each of these has an ID and is stored in a database table, but the table is so big, and accessed so frequently, that it can’t fit on one machine. The database takes the ID, hashes it to a unique value, and uses that to jump straight to the record location like we discussed earlier.

So, when a certain story is recommended to a user, the algorithm needs to be sure whether it has been suggested before or had already been read. read story is stored as a data item along with many other data and it is inefficient to call all these to verify quick actions. This is where Bloom filter comes into play. It is  a mechanism that is particularly good at answering the question “has this been seen before?”.

Key Takeaways

  • There can be false positives but no false negatives.
  • Bloom filters do not store data items like other data structures
  • Eliminates unnecessary disk accesses for information retrieval(queries)

Read in detail about Bloom filter here

PS: The story was written using a keyboard.
Picture of Ram Sagar

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
Related Posts

Download our Mobile App

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox
Recent Stories

Featured

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Courses & Careers

Become a Certified Generative AI Engineer

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

AIM Conference Calendar

Immerse yourself in AI and business conferences tailored to your role, designed to elevate your performance and empower you to accomplish your organization’s vital objectives. Revel in intimate events that encapsulate the heart and soul of the AI Industry.

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed