Recommendation systems are the cornerstone of a majority of the modern day billion dollar industries. From Amazon to Netflix to Pinterest. Amazon recommends commodities for its e-commerce website and Netflix has a limited number of movies to recommend but in the case of Pinterest, the numbers are not that small. It has more than 100 billion ideas saved and has to deliver to more than 200 million of its users in real time.
Pinterest opted for a graph-based system for this high performance task. They do this with their custom built technique named Pixie.
Pixie’s ease of generalisation to different types of tasks makes this possible, whether it’s creating many types of graphs, recommending different objects, or tuning the graphs to capture the absolute right content. Now with Pixie and recent optimisations, Pinterest is streamlined to serve relevant ideas in real time.
The above diagram illustrates the graphs connecting the boards and pins (images saved by the user). Here each edge shows a saved pin to the board. As the same pins can be saved to different boards for example, a bike can be saved into both ‘bikes’ board or ‘royal enfield’ or both. This eventually increases the number of edges to more than 100 billion. These graphs can be loaded on to AWS machines and deployed at scale.
The Pixie algorithm examines the portion of the graph nearest to the relevant nodes by using biased random walk algorithm.
A biased random walk is an approach for statistical analysis of graphs. They are used to extract symmetries when the graph contains complex networks.
At each step, it selects a random neighbor and visits the node, incrementing node visit counts as it visits more random neighbors. The probability Alpha, set at 0.5, to restart at node Q so our walks do not stray too far. And then neighboring nodes and boards undergo randomly sampling for 100,000 steps.
So, in the case of the bike example, if a user opens and checks bike images 3 times then this information will be given to Pixie and it comes up with thousands of similar images.
The nodes that have been visited 14 and 16 times are the ones that are most closely related to the query node.
Pixie continuously repeats this process in real-time as the data grows, so the users are always able to keep narrowing down their searches and find the exact ideas they’re looking for to pursue their goal.
The top 1,000 most visited nodes are retrieved , to avoid random walks through the graphs for complete 100,000 steps every time. To accomplish this, walking is done until the rank 1,000 candidate gets at least 20 visits. Along with this the engineering team at Pinterest also created Graph Pruning to cut down edges.
Though the recommendation systems have evolved over time, real-time(< 100 milliseconds) large scale application is still a challenging task for the online platforms. Pinterest’s Pixie is one of the best solutions available out today. Pixie uses 3 billion nodes and 17 billion edges and the results show that the user interaction has increased up to 50% when compared to Hadoop-based production system.
Key Takeaways Of Pixie
- 3 billion nodes are pruned and fit to 120 GB main memory. This reduction in memory improves performance.
- Graphs enable parallel execution hence lesser time(less than 60 milliseconds).
- Pixie can be scaled up by simply adding more machines(AWS) to the cluster.
Read more here