One of the key problems faced by almost all e-commerce giants across the world is item categorisation. These large online retailers carry a very large, long-tail inventory, with millions of items (product offers) entering the marketplace every day.
The quality of item categorisation plays a significant role in subsequent customer-facing applications such as search, product recommendation, trust and safety, product catalogue building, and seller utilities.
A correct item categorisation system is also essential for user experience as it helps determine the relevant presentation logic in surfacing the items to users through search and browsing.
How E-Commerce Giants Tackle the Categorisation Problem
The success of e-commerce can be attributed to commodities offered below the usual market price, a variety of deals, and effortless window shopping.
In order to give better deals to the customer while managing the profitability, the procurement costs have to be brought down.
A good old technique is to procure the goods in high volume.
At the Japanese e-commerce giant Rakuten Ichiba, engineers used a large-scale, multi-class hierarchical product categorisation method.
In one of their works, to improve item selection at a large scale, they used data containing 172 million product titles and descriptions.
To train the deep networks in a reasonable time, the word vectors need to be kept sparse even when the input layer is very large. This can be done using the selective reconstruction method.
In the case of Rakuten, models such as deep belief nets (DBN) and deep autoencoders (DAE) were used.
DBN and DAE with selective reconstruction are implemented on GPUs so that a large product base can be processed within a reasonable amount of time; conventional methods such as multinomial Naive Bayes are not practical at that scale.
Initially, the text is normalised by converting all Japanese characters to full-width and all non-Japanese characters to lower case. Then all HTML tags are removed from the descriptions.
For example, the string “iPhone 4s” is normalised to “iphone 4s”.
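The normalisation step could be sketched as follows. This is a minimal illustration, not Rakuten's actual code: the function name `normalise_text` is hypothetical, Unicode NFKC normalisation is an assumed way of unifying character widths (it maps half-width katakana to full-width and full-width Latin letters to plain ASCII), and a simple regular expression stands in for a real HTML parser.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher, adequate for a sketch


def normalise_text(raw: str) -> str:
    """Strip HTML tags, unify character widths, and lowercase the text."""
    text = TAG_RE.sub(" ", raw)      # remove HTML tags
    text = html.unescape(text)       # decode entities such as &amp;
    # Assumption: NFKC is used to unify widths (half-width katakana -> full-width).
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()              # only cased (non-Japanese) characters change
    return re.sub(r"\s+", " ", text).strip()


print(normalise_text("<b>iPhone 4s</b> ｹｰｽ"))  # iphone 4s ケース
```

Lowercasing leaves Japanese characters untouched, so the two rules can be applied to the whole string without first detecting which script each character belongs to.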
A frequency threshold is chosen so that words which appear only a few times are not selected as features.
This helps in eliminating most of the noisy information as well. Eliminating less-frequent words also helps to make the classifier practical and more robust for the classification of new products.
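As a rough illustration of frequency thresholding, the sketch below counts words across a toy corpus and keeps only those seen at least `min_count` times. The function names, the threshold value and the sample titles are all made up for the example; the source does not state which threshold was actually used.

```python
from collections import Counter


def build_vocabulary(documents, min_count=2):
    """Keep only words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for doc in documents for word in doc.split())
    return {word for word, n in counts.items() if n >= min_count}


def to_features(document, vocabulary):
    """Drop every word that is not in the vocabulary."""
    return [word for word in document.split() if word in vocabulary]


titles = ["iphone 4s case black", "iphone 4s charger", "iphone case leather"]
vocab = build_vocabulary(titles, min_count=2)
print(sorted(vocab))                              # ['4s', 'case', 'iphone']
print(to_features("iphone 4s case blak", vocab))  # ['iphone', '4s', 'case']
```

The misspelt "blak" in the last call is simply dropped, which shows how the threshold also filters out rare noise in new product titles.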
Spelling correction and abbreviation handling are two main challenges during item categorisation, and they form a crucial part of the data cleaning tasks for e-commerce sites.
Doing It The Walmart Way
Walmart too has deployed state-of-the-art machine learning models. In their case, they figured out that the generalised linear model (GLM) had an advantage over random forest models.
Walmart has a huge global presence in the retail market, and with Flipkart it wants to capture a significant chunk of the Indian e-commerce market.
Initially, Walmart's regional operations ran independently; the Walmart buyers in India had no knowledge of their American counterparts.
Since bringing down procurement costs is the main objective, it now makes sense to integrate the suppliers on a global scale when buying in high volume.
This integration comes with an inevitable challenge: data collection and mapping.
A starting point could be to build a sample dataset with a few classes and use it to create a pipeline. Incoming text is converted into a term frequency-inverse document frequency (tf-idf) matrix for feature extraction, and this matrix is then fed into machine learning models.
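To make the tf-idf step concrete, here is a small pure-Python sketch. It uses one common variant of the weighting (term count divided by document length, multiplied by the natural log of the number of documents over the document frequency); libraries such as scikit-learn apply different smoothing, and the source does not say which variant Walmart uses. The function name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, matrix) with one row of tf-idf weights per document."""
    tokenised = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    n_docs = len(tokenised)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenised for word in set(tokens))
    idf = {word: math.log(n_docs / df[word]) for word in vocabulary}
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        matrix.append([counts[w] / total * idf[w] for w in vocabulary])
    return vocabulary, matrix


vocab, weights = tfidf_matrix(["red apple", "red shirt"])
for row in weights:
    print(dict(zip(vocab, row)))
```

In this toy corpus "red" occurs in every document, so its weight is zero, while "apple" and "shirt" carry all the signal that separates the two items.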
Walmart follows these steps for feature engineering:
- Taking item description from raw data
- Data cleaning: removing special characters, stop words and single-letter words
- Stemming and translation
- Concatenating all the clean columns and creating the document-term matrix
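The steps above can be sketched roughly as below. Everything here is illustrative rather than Walmart's actual pipeline: the stop-word list is a tiny made-up subset, `crude_stem` is a placeholder that only strips a plural "s" (a real pipeline would use a proper stemmer), translation is omitted, and the column names `title` and `description` are assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "for", "of", "the", "with"}  # illustrative subset


def crude_stem(word):
    """Placeholder stemmer: strips a trailing plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean(text):
    """Remove special characters, stop words and single-character tokens."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    words = [w for w in text.split() if len(w) > 1 and w not in STOP_WORDS]
    return [crude_stem(w) for w in words]


def document_term_matrix(rows, columns):
    """Concatenate the given columns per row and count terms per document."""
    docs = [clean(" ".join(row.get(col, "") for col in columns)) for row in rows]
    vocabulary = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    return vocabulary, [[c[w] for w in vocabulary] for c in counts]


rows = [
    {"title": "Men's Cotton T-Shirts", "description": "Pack of 3 shirts"},
    {"title": "Cotton Socks", "description": "A pair of socks for men"},
]
vocab, dtm = document_term_matrix(rows, ["title", "description"])
print(vocab)  # ['cotton', 'men', 'pack', 'pair', 'shirt', 'sock']
print(dtm)    # [[1, 1, 1, 0, 2, 0], [1, 1, 0, 1, 0, 2]]
```

Note that the single-character rule also drops the stray "s" and "t" left behind when the apostrophe and hyphen are removed, as well as the lone digit "3".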
To convert the different classes of items into manageable clusters, the hierarchical clustering method is used. The sub-classes within the hierarchy follow Bayes’ rule.
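A bare-bones version of hierarchical (agglomerative) clustering over item classes might look like the sketch below. The average-linkage criterion, the distance threshold used as a stopping rule, the plain Jaccard distance in the example and the sample classes are all assumptions made for illustration; the source does not specify these details.

```python
def agglomerative_clusters(items, distance, threshold):
    """Merge the two closest clusters repeatedly until none are within `threshold`."""
    clusters = [[i] for i in range(len(items))]  # each item starts alone

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between the two clusters.
        total = sum(distance(items[i], items[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def jaccard_distance(a, b):
    """1 minus the share of words common to both sets."""
    return 1 - len(a & b) / len(a | b)


# Hypothetical classes, each described by the set of words seen in its items.
classes = [
    {"cotton", "shirt", "men"},
    {"cotton", "shirt", "women"},
    {"iphone", "case"},
    {"iphone", "charger"},
]
print(agglomerative_clusters(classes, jaccard_distance, threshold=0.7))
# [[0, 1], [2, 3]]
```

The result lists indices into `classes`: the two shirt classes end up in one cluster and the two phone accessory classes in another.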
Usually, while forming the final matrix, Jaccard distance is used as the metric to measure how much the words occurring in different classes overlap. In the case of Walmart, a modified Jaccard distance is used instead, to avoid improper grouping when classes contain unequal numbers of words.
The modified Jaccard distance considers the ratio of the number of words two classes share to the number of words in the smaller class.
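The difference between the two metrics can be seen in a short sketch. The modified version below is one reading of the description above, namely shared words divided by the size of the smaller set (this ratio is also known as the overlap coefficient); whether Walmart's exact formula matches is an assumption. The function names and sample word sets are made up.

```python
def jaccard_distance(a, b):
    """Standard Jaccard distance: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1 - len(a & b) / len(union)


def modified_jaccard_distance(a, b):
    """Assumed modified form: 1 - |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0 if a == b else 1.0
    return 1 - len(a & b) / smaller


small_class = {"cotton", "shirt"}
large_class = {"cotton", "shirt", "men", "pack", "blue", "large"}
print(jaccard_distance(small_class, large_class))           # about 0.67
print(modified_jaccard_distance(small_class, large_class))  # 0.0
```

With the standard metric the small class looks far from the large one simply because the large class has more words; the modified metric treats them as close because every word of the small class appears in the large one.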
Matching data globally is an intensive task. For example, what is called fries in the US is called chips in the UK. The use of synonyms and other localised terms makes text pattern recognition difficult.
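One simple way to handle such localised terms is to map them to a shared canonical token before feature extraction. The sketch below is purely illustrative: the lookup table, the locale codes and the canonical names are invented for the example, and a production system would need a far larger, curated dictionary.

```python
# Hypothetical per-locale lookup tables: local term -> canonical token.
LOCAL_TO_CANONICAL = {
    "en_US": {"fries": "fried_potato_strips", "chips": "potato_crisps"},
    "en_GB": {"chips": "fried_potato_strips", "crisps": "potato_crisps"},
}


def canonicalise(tokens, locale):
    """Replace locale-specific terms with canonical tokens; keep the rest as-is."""
    mapping = LOCAL_TO_CANONICAL.get(locale, {})
    return [mapping.get(token, token) for token in tokens]


print(canonicalise(["frozen", "fries"], "en_US"))  # ['frozen', 'fried_potato_strips']
print(canonicalise(["frozen", "chips"], "en_GB"))  # ['frozen', 'fried_potato_strips']
```

The mapping has to be keyed by locale because the same word can point to different products: "chips" in the US refers to what the UK calls crisps.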
Moreover, merchants may not be accurate when assigning products to categories, and the category assigned to the same product by different merchants may not be consistent. Automatic category recommendation based on the given product information helps solve these problems.
These challenges persist across all e-commerce platforms, and with intensified customer-centric retail strategies, there will be more developments in NLP and image classification techniques.