Alibaba has always been at the forefront of adopting new-age technologies to boost its business. The e-commerce giant dived into the then-booming augmented reality and virtual reality space back in 2016, with the aim of bridging the gap between the online and offline worlds for shoppers. Since then, Alibaba has come a long way, whether by using big data analytics or by launching its VR research laboratory, GnomeMagic Lab.
In its latest effort to keep improving the platform, the Chinese e-tailer is now virtualising its online retail environment to train reinforcement learning agents.
Adoption Of Reinforcement Learning At Alibaba
In a paper published by Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen and An-Xiang Zeng, the researchers describe how applying reinforcement learning can improve commodity search, and Alibaba has adopted the technique to accomplish exactly that.
The paper notes that while the combination of deep neural networks and reinforcement learning (RL) has made significant progress in areas like games and robotics, RL has yet to see wide application in real-world tasks. One reason is the cost of interaction: games and robotics offer cheap, repeatable trial and error, whereas large online systems have mostly been limited to supervised approaches, which cannot learn the sequential decision-making needed to maximise long-term reward.
The researchers describe a scenario in which an e-commerce search engine observes a buyer's request and displays a page of ranked commodities to the buyer. It then updates its decision model using the user's feedback, in pursuit of revenue maximisation. During a session, it keeps displaying new pages based on the latest information about the buyer as he or she continues to browse. Previous solutions to this problem are mostly based on supervised learning and are incapable of learning such sequential decisions. RL solutions are therefore highly appealing, but until now they could not be deployed, owing to their own set of challenges.
The Virtual Taobao
Though RL offers huge potential in complex user environments, it is difficult to apply in many real-world settings because current methods require training in a live system.
“One major barrier to directly applying RL in these scenarios is that current RL algorithms commonly require a large number of interactions with the environment, which take high physical costs, such as real money, time, bad user experiences, and even lives in medical tasks,” noted the researchers in their paper.
To avoid these costs, the researchers took an approach similar to the one Google used to optimise its data centre cooling: they built a simulator, “Virtual Taobao”, in which any RL algorithm can be trained offline to maximise long-term reward. This replica of the platform was created from real historical data.
Understanding Virtual Taobao
The researchers first built Virtual Taobao, a simulator learned from hundreds of millions of real customer-behaviour records through two proposed techniques: GAN-SD (GAN for Simulating Distributions) and MAIL (multi-agent adversarial imitation learning). They then trained policies in Virtual Taobao with no physical cost.
The GAN-for-Simulating-Distributions (GAN-SD) approach was used to simulate customers, including their requests. Since the original GAN methods often mismatch the target distribution in undesirable ways, GAN-SD adopts an extra distribution constraint to generate diverse customers. To generate interactions, the key component of Virtual Taobao, the researchers adopted the Multi-Agent Adversarial Imitation Learning (MAIL) approach.
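The extra distribution constraint can be thought of as a divergence penalty added to the usual generator objective. The sketch below is illustrative only: the non-saturating loss form, the `alpha` weight and the discrete customer-feature distributions are assumptions, not the paper's actual objective.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete distributions over customer features."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def gan_sd_generator_loss(d_scores_fake, gen_dist, target_dist, alpha=1.0):
    """Generator objective: fool the discriminator (non-saturating GAN loss)
    while keeping the generated customer distribution close to the real one.
    Illustrative form only; GAN-SD's exact constraint differs in detail."""
    adversarial = -float(np.mean(np.log(np.asarray(d_scores_fake) + 1e-8)))
    constraint = alpha * kl_divergence(gen_dist, target_dist)
    return adversarial + constraint
```

With the same discriminator scores, a generator whose customer distribution matches the historical one incurs a lower loss than one that collapses onto a few customer types.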
MAIL learns the customers' policies and the platform policy simultaneously. It trains a discriminator to distinguish simulated interactions from real ones. The discrimination signal is then fed back as a reward to train the customer and platform policies to generate more realistic interactions. Once customers and interactions can be generated, Virtual Taobao is built.
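Feeding the discrimination signal back as a reward can be sketched in the style of adversarial imitation learning: the more the discriminator believes a simulated interaction is real, the higher the reward the policies receive. The reward shaping below is a common GAIL-style form, assumed here rather than taken from the paper.

```python
import math

def discriminator_loss(p_real_on_real, p_real_on_sim):
    """Binary cross-entropy: push the discriminator to output 1 on real
    interactions and 0 on simulated ones."""
    eps = 1e-8
    return -(math.log(p_real_on_real + eps) + math.log(1.0 - p_real_on_sim + eps))

def imitation_reward(p_real_on_sim):
    """Reward for the customer/platform policies: high when a simulated
    interaction fools the discriminator (GAIL-style shaping, an assumption)."""
    eps = 1e-8
    return -math.log(1.0 - p_real_on_sim + eps)
```

As the simulated interactions become indistinguishable from real ones, `p_real_on_sim` rises and the policies' reward grows, closing the adversarial loop.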
“In experiments, we build Virtual Taobao from hundreds of millions of customers’ records, and compared it with the real environment. We found that Virtual Taobao successfully reconstructs properties very close to the real environment. We then employ Virtual Taobao to train platform policy for maximising the revenue”, they said.
Commodity Search Using Virtual Taobao
Once Virtual Taobao is built, commodity search can also be carried out inside it. The search engine in Taobao handles millisecond-level responses over billions of commodities, with a rich diversity of customer preferences.
From the engine's point of view, the Taobao platform works as follows:
- A customer comes and sends a search request to the search engine
- It makes an appropriate response to the request by sorting the related commodities and displaying the page view (PV) to the customer
- The customer gives a feedback signal (purchasing, turning to the next page, or leaving) according to the PVs as well as the buyer's own intention
- The search engine receives the signal and makes a new decision for the next PV request
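The four steps above form a standard agent-environment loop, which can be sketched as a toy environment. Everything here (the feedback probabilities, the item names, the class name) is made up for illustration; the real simulator generates this feedback with GAN-SD and MAIL rather than at random.

```python
import random

class ToyTaobaoLoop:
    """Toy stand-in for the search-engine/customer loop described above."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        self.page = 0
        return {"query": "running shoes", "page": self.page}  # customer's request

    def step(self, page_view):
        # Simulated customer feedback: purchase, next page, or leave.
        feedback = self.rng.choices(
            ["purchase", "next_page", "leave"], weights=[0.1, 0.6, 0.3])[0]
        reward = 1.0 if feedback == "purchase" else 0.0
        done = feedback in ("purchase", "leave")
        self.page += 1
        return {"query": "running shoes", "page": self.page}, reward, done, feedback

env = ToyTaobaoLoop()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    # The engine's decision: rank commodities for the next page view.
    page_view = ["item_a", "item_b", "item_c"]
    obs, reward, done, feedback = env.step(page_view)
    total_reward += reward
```

An RL algorithm would replace the fixed ranking with a learned policy and optimise `total_reward` across the whole session.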
The business goal of Taobao is to increase sales by optimising the strategy for displaying PVs. Since the feedback signal from a customer depends on a sequence of PVs, it is natural to treat this as a multi-step decision problem rather than a one-step supervised learning problem, and this research does exactly that.
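The difference between the two framings shows up in what gets optimised: a supervised model scores each page view in isolation, while RL maximises the cumulative (often discounted) reward over the whole session, so a purchase on a later page still credits the earlier ranking decisions. A minimal sketch, with an assumed discount factor:

```python
def session_return(page_rewards, gamma=0.9):
    """Discounted return of one browsing session: per-page rewards
    weighted by gamma**t. The discount factor 0.9 is illustrative."""
    return sum(r * gamma ** t for t, r in enumerate(page_rewards))

# A purchase on the third page view still rewards the session as a whole:
session_return([0.0, 0.0, 1.0])  # approximately 0.81, i.e. 1.0 * 0.9**2
```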
The GAN-SD and MAIL simulation tools allowed Alibaba to imitate the spontaneity of real-time Taobao activity and to train policies that deliver better performance, with the researchers reporting an overall strategy improvement of 3 percent. This suggests that simulation may be a useful way to apply reinforcement learning in other settings where complex physical environments have traditionally prohibited direct application, and the researchers hope to see RL applied to more complex physical tasks.