
Pinterest Is Getting Computer Vision Tasks Right With A Scalable Data Management System 

Deep learning models depend on high-quality data collected at scale to train successfully. Popular examples include Amazon and Myntra, which use image processing techniques to classify products based on the information present in an image.

Any machine learning pipeline contains many auxiliary tasks, which often require more engineering work than the model design itself. Data collection, cleaning, crowdsourcing, versioning, dataset storage, and taxonomy management are some examples.

Even though recommendation systems have evolved over time, serving recommendations in real time (under 100 milliseconds) at large scale is still a challenge for online platforms. Pinterest’s Pixie is one of the best solutions available today.

A Look At Pinterest Voyager – Hub For Storing CV Data 

Recommendation systems are the cornerstone of many of today's billion-dollar businesses, from Amazon to Netflix to Pinterest. Amazon recommends products on its e-commerce website, and Netflix has a limited number of movies to recommend, but in the case of Pinterest the numbers are not that small. It has more than 100 billion saved ideas and has to serve more than 200 million users in real time.

Pinterest’s Pixie operates on a graph of 3 billion nodes and 17 billion edges, and the results show that user interaction increased by up to 50% compared with a Hadoop-based production system.

Computer vision plays a key role in Pinterest’s everyday business. It enables search and discovery from every Pin and every image within a Pin.

To carry out tasks such as the label cleaning and dataset storage discussed above, Pinterest engineered a solution: Voyager.

Voyager is a centralised, scalable, and flexible data management tool. It serves as a hub for storing computer vision training data. It helps in the initial stages of data exploration where only a small dataset is required.

For an e-commerce platform, defining a taxonomy is crucial, and Voyager offers a quick way to define one using nothing more than text search.

An illustration of how text searching ‘heels’ can be used to add items to a category.
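
The text-search idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Voyager's actual implementation: the `Item` class, the `assign_category_by_text` function, and the `shoes/heels` category name are all hypothetical, and a real system would search an index rather than scan a list.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Item:
    """A hypothetical catalogue item with a free-text description."""
    item_id: str
    description: str
    categories: Set[str] = field(default_factory=set)


def assign_category_by_text(items: List[Item], query: str, category: str) -> List[str]:
    """Add `category` to every item whose description contains `query`
    as a whole word (case-insensitive). Returns the matched item ids."""
    query = query.lower()
    matched = []
    for item in items:
        if query in item.description.lower().split():
            item.categories.add(category)
            matched.append(item.item_id)
    return matched


# Toy data, invented for this example.
catalogue = [
    Item("p1", "Red suede heels with ankle strap"),
    Item("p2", "Leather office chair"),
    Item("p3", "Black block HEELS"),
]
matched_ids = assign_category_by_text(catalogue, "heels", "shoes/heels")
```

Here a single search for "heels" tags both shoe items with the new category and leaves the chair untouched, which is the kind of bulk assignment that makes building a taxonomy fast.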

Voyager is not just a web tool. It is also a system that handles data collection, cleaning, visualisation, and deep vision model training. This system is supported by an underlying unified data labelling schema.

The success of this schema comes down to the following factors:

  1. Each image can have image-level labels and region-level labels. A region can be defined as a box, a mask, a polygon, etc.
  2. Any region can have a set of labels drawn from several independent taxonomies (e.g., semantic category, colour palette, material, pattern).
  3. A region can be related to regions in other images (e.g., this couch in this living room scene is similar to that couch in the other living room scene).
  4. Labels from taxonomies that have no spatial association in the image can be recorded.
  5. Images are co-located as raw bytes with the metadata, not linked, to protect against relocated or modified images.
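
One way to picture a schema with these five properties is as a pair of Python dataclasses. This is a minimal sketch under assumptions: the class names, field names, and the `region_labels` helper are invented for illustration and do not come from Pinterest's own schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Region:
    """A labelled area of an image (factor 1)."""
    region_id: str
    shape: str                          # "box", "mask", "polygon", ...
    coords: List[Tuple[float, float]]   # box corners or polygon vertices, in pixels
    # Factor 2: labels keyed by independent taxonomy name.
    labels: Dict[str, str] = field(default_factory=dict)
    # Factor 3: (image_id, region_id) pairs for related regions in other images.
    related: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class ImageRecord:
    image_id: str
    raw_bytes: bytes                    # Factor 5: pixels stored with the metadata
    # Factors 1 and 4: labels that have no spatial association.
    image_labels: Dict[str, str] = field(default_factory=dict)
    regions: List[Region] = field(default_factory=list)


def region_labels(record: ImageRecord, taxonomy: str) -> List[str]:
    """Collect the labels that a record's regions carry for one taxonomy."""
    return [r.labels[taxonomy] for r in record.regions if taxonomy in r.labels]


# Toy records; the byte strings are placeholders, not real image data.
couch_a = Region("r1", "box", [(10.0, 20.0), (200.0, 180.0)],
                 labels={"category": "couch", "colour": "grey"})
room_a = ImageRecord("img_a", b"placeholder-bytes-a",
                     image_labels={"scene": "living room"}, regions=[couch_a])

couch_b = Region("r1", "box", [(40.0, 60.0), (260.0, 210.0)],
                 labels={"category": "couch", "material": "leather"},
                 related=[("img_a", "r1")])  # similar to the couch in img_a
room_b = ImageRecord("img_b", b"placeholder-bytes-b",
                     image_labels={"scene": "living room"}, regions=[couch_b])
```

Keeping the taxonomies as independent dictionary keys means a new taxonomy, such as pattern, can be added to some regions without touching the labels that already exist.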

Conclusion

One lesson from Pinterest is that its data management system significantly simplifies common deep learning tasks by shortening the turnaround time of the cycle of data collection, model training, debugging, and further data collection. This matters for any developer planning to build platforms around computer vision, since vision tasks grow more complicated over the course of a product's deployment.

This article draws information from the Pinterest Engineering blog.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.