Indexing is one of the fundamental aspects of computer architecture and computer science. For many decades now, indexes have been designed by hand by computer scientists and we have worked with indexing techniques like B-trees and HashMaps. But researchers at Google and MIT have perceived a future where indexes could be learnt instead of being hand-designed. This research has been following recent patterns set by the industry of building domain-specific and task-specific hardware and fundamental computing tools that are targeted at specialised tasks. One of the ground-breaking examples of the same is Google’s TPU which is a special processor for neural networks. This work of learned indexes is an attempt to get the computer ready for the world of neural networks.
A team led by Tim Kraska of MIT and Jeff Dean of Google contest that indexes are models. These researchers say that a B-Tree Index, which is one of the most popular indexing techniques, can be looked at as a model to map a key to the position of data within a data structure such as an array. And in the same way a Hash Index, another popular technique, can be seen as a model to map a key to a position to a record in an unsorted array.
The researchers proposed a solution to replace traditional index structures with learned indexes using deep learning. The research initiative is an endeavour to understand in which conditions learned indexes are better than traditional ones and what are the main challenges in designing and implementing learned indexes.
Data Access And Management — Essential Part Of Computer Architecture
Computers and computer architectures work best when data management and data access strategies are efficient and robust. Currently there are numerous data access techniques such B-Trees, HashMaps and others, which suit a particular use case. For example a popular technique to check whether a record exists or not, is called as the Bloom filter. The researchers at Google believe that these are very general-purpose techniques and assume that they are unable to take advantage of patterns in the data that they encounter.
From Google’s side there has been a very strong push for task-based custom hardware design — case in point being the tensor processing unit (TPU). TPUs are designed to work with numerically-heavy AI and ML applications. Google likes to call it the “domain-specific architecture”. The idea behind the TPU was to replace the general purpose CPU with a matrix processor built only for neural network workloads. In contrast to today’s CPUs, TPUs can’t run multiple applications. TPUs can do one thing well, run billions of big multiplications and additions to assist neural networks operations with emphasis on power saving.
The researchers observed that today’s CPUs are great at SIMD, i.e Single Instruction Multi Data but GPUs and TPUs are going to take over. They also sight some reports suggesting GPUs will get one thousand times better by the year 2025 making building neural networks easy and affordable. The researchers want to continue this trend in hardware improvements by bringing the successes of neural networks to databases. At the same time, they are careful to underline that the learned indexes will not replace the traditional structures. The main aim is to complement the basic structured indexes.
B-Trees Vs. Learned Indexes
Some rough calculations by the researchers suggest that a B-Tree model will be fast if and only if it has a better precision gain than 1/100 per 50∗8=400 arithmetic operations. The researchers assume that the B-Tree sits in the cache. At the same time the researchers believe that the current ML practices offer higher speed.
To begin the simple experiment, the researchers imagine Range Index Models as CDF models. The research says that a model predicts the position given a key inside a sorted data structure like array acts like a cumulative distribution function (CDF). The researchers point to something interesting and exciting, that a B-Tree tries to “learn” the data distribution by building a regression tree. The researchers model it as:
p = f(Key) * N
The researchers use a 200M log record database to build a secondary index over the timestamp using neural networks. The researchers trained a 2 layer fully connected neural network with 32 neurons in each layer using the ReLU activation function. Here we can take the timestamps as the inputs and position of the timestamps as labels. Using such an architecture they constructed a learned index and measured the lookup performance.
The researchers claim that they achieved ≈1250 predictions per second and hence the whole model takes around 80,000 nanoseconds to execute the model. The search time was not better than full-blown brute force search. But compare it to a B-Tree where it takes only ≈300ns to traverse the data, which is two magnitudes faster than the neural network methods and 2−3× faster as searching over the key-space. The researchers claim that the main reason for failure are that regression trees used in B-Trees is much better at over-fitting. One of the other reasons also B-Trees are very cache-efficient.
The researchers also carried out many other experiments and found that the high cost for learning models is a big problem for learned indexes but still they feel learned indexes can be useful in many scenarios.
Researchers conducted many experiments to compare learned indexes with HashMaps which can be found here. They found that the model hash function overall has similar performance while utilising the memory better than a HashMap. They also state that the main improvements in indexing due to introduction of learning and neural networks will be in form of getting higher dimensional indexes and speeding up indexing. In conclusion the researchers say, “In summary, we have demonstrated that machine-learned models have the potential to provide significant benefits over state-of-the-art database indexes, and we believe this is a fruitful direction for future research.”