Oleksii Trekhleb | Javascript algorithms (Bloom filter)

This is a series of books diving deep into the core mechanisms of the JavaScript language.

 · 4 phút đọc.

This is a series of books diving deep into the core mechanisms of the JavaScript language.

A bloom filter is a space-efficient probabilistic data structure designed to test whether an element is present in a set. It is designed to be blazingly fast and use minimal memory at the cost of potential false positives. False positive matches are possible, but false negatives are not – in other words, a query returns either possibly in set or definitely not in set.

Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if conventional error-free hashing techniques were applied.

Algorithm description

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions, generating a uniform random distribution. Typically, k is a constant, much smaller than m, which is proportional to the number of elements to be added; the precise choice of k and the constant of proportionality of m are determined by the intended false positive rate of the filter.

Here is an example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.

Bloom Filter

Operations

There are two main operations a bloom filter can perform: insertion and search. Search may result in false positives. Deletion is not possible.

In other words, the filter can take in items. When we go to check if an item has previously been inserted, it can tell us either no or maybe.

Both insertion and search are O(1) operations.

Making the filter

A bloom filter is created by allotting a certain size. In our example, we use 100 as a default length. All locations are initialized to false.

Insertion

During insertion, a number of hash functions, in our case 3 hash functions, are used to create hashes of the input. These hash functions output indexes. At every index received, we simply change the value in our bloom filter to true.

During a search, the same hash functions are called and used to hash the input. We then check if the indexes received all have a value of true inside our bloom filter. If they all have a value of true, we know that the bloom filter may have had the value previously inserted.

However, it’s not certain, because it’s possible that other values previously inserted flipped the values to true. The values aren’t necessarily true due to the item currently being searched for. Absolute certainty is impossible unless only a single item has previously been inserted.

While checking the bloom filter for the indexes returned by our hash functions, if even one of them has a value of false, we definitively know that the item was not previously inserted.

False Positives

The probability of false positives is determined by three factors: the size of the bloom filter, the number of hash functions we use, and the number of items that have been inserted into the filter.

The formula to calculate probablity of a false positive is:

( 1 - e -kn/m ) k

k = number of hash functions

m = filter size

n = number of items inserted

These variables, k, m, and n, should be picked based on how acceptable false positives are. If the values are picked and the resulting probability is too high, the values should be tweaked and the probability re-calculated.

Applications

A bloom filter can be used on a blogging website. If the goal is to show readers only articles that they have never seen before, a bloom filter is perfect. It can store hashed values based on the articles. After a user reads a few articles, they can be inserted into the filter. The next time the user visits the site, those articles can be filtered out of the results.

Some articles will inevitably be filtered out by mistake, but the cost is acceptable. It’s ok if a user never sees a few articles as long as they have other, brand new ones to see every time they visit the site.

nhavantuonglai

Share:
Quay lại.

Có thể bạn chưa đọc

Xem tất cả »
Tối ưu hình ảnh

Tối ưu hình ảnh

Giúp Google và người dùng tìm thấy nội dung website hướng dẫn nâng cao những kỹ thuật giúp tối ưu SEO hiệu quả đem lại thứ hạng tốt trên công…

Đăng ký nhận bảng tin hàng tuần

Liên lạc trao đổi

Liên lạc thông qua Instagram

Thông qua Instagram, bạn có thể trao đổi trực tiếp và tức thời, cũng như cập nhật những thông tin mới nhất từ nhavantuonglai.

Tức thời

Bạn có thể gửi và nhận tin nhắn nhanh chóng, trực tiếp, giúp những vấn đề cá nhân của bạn được giải quyết tức thời và hiệu quả hơn.

Thân thiện

Vì tính chất là kênh liên lạc nhanh, nên bạn có thể bỏ qua những nghi thức giao tiếp thông thường, chỉ cần lịch sự và tôn trọng thì sẽ nhận được sự phản hồi đầy thân thiện, thoải mái từ tác giả.

Trao đổi trên email

Thông qua email cá nhân, bạn có thể trao đổi thỏa thuận hợp tác, kết nối chuyên sâu và mang tính chuyên nghiệp.

Tin cậy

Trong một số trường hợp, email được dùng như một tài liệu pháp lý, chính vì vậy mà bạn có thể an tâm và tin cậy khi trao đổi với tác giả thông qua email.

Chuyên nghiệp

Cấu trúc của email đặt tính chuyên nghiệp lên hàng đầu, nên những thông tin, nội dung được viết trong email từ tác giả sẽ luôn đảm bảo điều này ở mức cao nhất.

nhavantuonglai · Ghiblis Music Piano Playlist