We know this for a fact that Moore’s Law does does not stand true anymore. Our computation requirement far exceeds what is available right now. There has been an explosion in the complexity and requirements of modern computing, especially with the advent of Machine Learning. This has forced many to look beyond traditional processors towards GPU’s etc. But then there is a group of researchers that believe that neural network itself can be the solution to make current computing more efficient.
Today’s computers assume that the memory usage to perform a certain task happens exactly the way it did in past. So, it creates a rules table that chronicle the usage of memory and mirror it for all future tasks. Think of it as a rules based mechanism for allocating memory, which is highly efficient when the availability for memory is high. But modern computational workloads are order of magnitude larger than traditional workloads. As the working set becomes larger than the predictive table, the prediction accuracy drops sharply. This trend poses a challenge and scaling predictive table is very costly and cumbersome.
Machine learning as of late has been used to advance the fields of software engineering optimisations, augmenting and replacing traditional data structures and rule based systems. In a recent study, led by Google researchers in partnership with researchers from University of Texas at Austin and Stanford University have found ways in which deep learning can be used to optimise use of computer hardware architecture.
Prefetching and the von Neumann bottleneck
Prefetching is a technique for speeding up computation by transferring information from computer slow memory to a faster memory (cache) before that information is used. This greatly enhances the computation speed because the information is available whose result is expected to be needed soon. Most predictors (prefetchers) in hardware are simply table-based.
Prefetching addresses a critical issue in von Neumann architecture computers (The computers we use today). The truth that computation is much faster than accessing memory and transfer of data. Some research has also shown that processors can spend up to 50% of all compute cycles waiting for data to arrive from memory. Prefetchers predict when to fetch what data into cache to reduce memory latency, and the key towards effective prefetching is to attack the difficult problem of predicting memory access patterns.
The main focus of the study was to identify and learn memory access patterns and learn efficient prefetchers. The study demonstrates the potential of deep learning on solving the von Neumann bottleneck. The von Neumann bottleneck is a limitation on throughput caused by the standard personal computer architecture. This study has been seen as pioneering work in the application of machine learning in understanding memory access. It has also opened a wide range of directions for machine learning to be applied in computer architecture.
Prefetching as a Prediction Problem
In the study, the researchers explore the utility of sequence-based neural networks in micro architectural systems. By taking inspiration from powerful techniques of neural networks to address sequence prediction problems, such as those found in natural language processing (NLP) and text understanding they apply sequence learning to the problem of prefetching. Prefetching can be seen as a regression problem. The output space, is both vast and extremely sparse, making it a poor fit for standard regression models. The researchers have taken inspiration from recent works in image and audio generation that discretize the space, namely PixelRNN and Wavenet.
Discretization makes prefetching more adaptable to neural language models, and the researchers leverage it as a starting point for building neural prefetchers. Prefetching means predicting the memory access operation that will miss the on-chip cache. The sequence of cache miss addresses so far can be observed as well as the sequence of instruction addresses. Therefore, thinking in terms of a prediction problem, an initial model could use two input features. At a given time step N, It could use the address and PC that generated a cache miss at that time step to predict the address of the miss at timestep N + 1.
There is one issue that is apparent. The address space of an application is very sparse. There are only order of 10 M unique cache block miss addresses appear on average out of the entire 264 physical address space. Hence the nature of this space makes it very difficult to model it as a time-series regression models.
Prefetching as a Classification Problem
Prefetching can be modelled as a classification problem rather than a regression. We can treat the address space as a large, discrete vocabulary, and perform classification. We could compare this problem with next-word or character prediction in natural language processing. Some addresses are much more commonly accesses than others, much like some words are used more commonly than others. In this way, effective vocabulary size can be manageable for RNN models.
This is a great way to model the problem. However, now there are 264 possible softmax targets, hence some quantization scheme is necessary. We can decrease the resolution of addresses towards the extremes of the address space, but in prefetching we need high resolution in every area where addresses are used. But there is a good news in that programs tend to behave in predictable ways. So a very small (compared to the whole space) and consistent set of addresses are ever seen. The primary quantization scheme the researchers used was to create a vocabulary of common addresses during training, and to use this as the set of targets during testing.
Neural Network models of prefetching
Two LSTM-based prefetching models models were experimented with. The first version is analogous to a standard language model, while the second exploits the structure of the memory access space in order to reduce the vocabulary size and reduce the model memory footprint. The researchers decided to take a layout for programs to behave consistently. Therefore, the strategy is to predict deltas, ∆N = Addr(N+1) −AddrN , instead of addresses directly.
Suppose we restricted the output vocabulary size in order to only model the most frequently occurring deltas. Our first model therefore restricts the output vocabulary size to a large, but feasible 50,000 of the most frequent, unique deltas since this vocabulary size can be managed by today’s language models.
The model uses a categorical (one-hot) representation for both the input and output deltas. The LSTM then performs classification over the delta vocabulary, and the K highest-probability deltas are chosen for prefetching. The prefetchers can return several predictions. The researchers opt to take top-10 predictions of the LSTM. There are several limitations to this approach. First, a large vocabulary increases the model’s computational and storage footprint. Second, cutting down the vocabulary necessarily puts a ceiling on the accuracy of the model. Dealing with rarely occurring deltas is non-trivial, as they will be seen relatively few times during training. This is known as the rare word problem in in NLP.
Clustering + LSTM
The researchers hypothesize that there is some interaction between addresses occurs locally in address space. This idea can be further exploited and the prefetcher can be designed in a way that the prefetcher has both local and global context. The researchers took the set of addresses and clustered them into 6 different regions using k-means.The data is then partitioned into these clusters, and deltas are computed within each cluster
A multi-task LSTM is used to model all the clusters and provided the cluster ID as an additional feature, which effectively gives each LSTM each set of biases. Partitioning of address space into narrower regions also means we get clusters of same order of magnitude. Hence the resulting deltas can be normalised and used as real-valued inputs to the LSTM. This allows us to further reduce the size of the model, as we do not need to keep around a large matrix of embeddings. Importantly, we still treat next-delta prediction as a classification problem.
Hence this model of LSTM addresses some of the issued embedding LSTM. The additional pre-processing step of clustering has its own trade-offs. The main trade-off being not being able to model the dynamics that cause the program to access different regions of the address space.