Netflix, the popular online entertainment platform, is constantly reinventing itself to give viewers a richer experience. With its user base growing every day, viewership of TV shows has scaled up too, and the company is introducing a collection of tools and algorithms to make content more relevant and enticing for the audience. The video-on-demand streaming company expects these tools not only to help choose the right title image for a TV show, but also to attract more viewers.
AVA, as this set of tools and algorithms is called, helps Netflix users find shows on the platform that match their choice and taste. It analyses large volumes of images obtained from the video frames of a particular TV show to set a title image for that show. The title image makes the show more visually appealing and eases the task of merchandising (getting the content to reach users) for the shows’ curators and creators.
What is AVA?
The number of images on the internet is ever growing. Thanks to advances in technology, electronic gadgets for photography and video recording are also growing in number and becoming less expensive. As a result, handling and organising image data, be it storage, processing or classification, is challenging. To address this concern, a research team from the Autonomous University of Barcelona, Spain, in collaboration with Xerox Corporation, developed Aesthetic Visual Analysis (AVA) as a research project.
The project contains a vast database of over 250,000 images combined with metadata such as aesthetic scores for the images, semantic labels for more than 60 categories of images and many other characteristics. The researchers primarily use statistical measures such as the mean, variance and standard deviation of the scores to rate the images. Based on the distributions computed from these statistics, they assess the semantic challenges and choose the right images for the database. The methodology presented in the paper also discusses training on images that fit the project’s statistical criteria, such as goodness of fit measured by root-mean-square error (RMSE). The images are also classified by style.
With this, they primarily ease the problem of extensive benchmarking and make it possible to train on more images. They also show how computer applications could be made visually richer with large datasets and better aesthetic appeal. Altogether, the project consists of algorithms along with statistical analysis. With AVA, computing performance can be significantly optimised, with less load on the hardware.
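To make the statistics mentioned above concrete, here is a minimal Python sketch. It assumes each image comes with a histogram of votes on a 1-to-N score scale; the function names and the input layout are our own illustration, not taken from the AVA paper.

```python
import math


def score_stats(vote_counts):
    """Return (mean, variance, standard deviation) of an image's scores.

    vote_counts[i] is the number of votes for score i + 1
    (an assumed layout, chosen only for this illustration).
    """
    total = sum(vote_counts)
    if total == 0:
        raise ValueError("image has no votes")
    mean = sum((i + 1) * c for i, c in enumerate(vote_counts)) / total
    variance = sum(c * ((i + 1) - mean) ** 2
                   for i, c in enumerate(vote_counts)) / total
    return mean, variance, math.sqrt(variance)


def rmse(observed, fitted):
    """Root-mean-square error between two equally long sequences."""
    if len(observed) != len(fitted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return math.sqrt(squared / len(observed))
```

A low variance suggests that voters agree about an image, while a low RMSE between an observed score distribution and a fitted one indicates a good fit.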
How AVA works
In the usual scenario, content editors have to go through a vast assortment of video frames from a show to select a good title image. The number of frames can run into millions, depending on the number of episodes in a show. Manually screening the frames is almost impossible and often ineffective. This is where AVA comes into play: it uses its image classification algorithms to pick out the right image at the right time.
AVA follows a sequential method, analysing images obtained through the process of frame annotation. The variables required by the algorithms are annotated in the video frames, which is achieved using Netflix’s own framework called Archer. The framework is closely based on FFmpeg, a platform for video processing. Archer splits the video into very small chunks so that they can be processed in parallel. This makes it practical to run more useful algorithms in AVA.
After the frames are obtained, they are subjected to a series of image recognition algorithms to build metadata, which is then classified as visual, contextual and composition metadata. Some of the important details captured in the metadata are given below.
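The idea of splitting a video into small chunks for parallel processing can be sketched in Python as follows. This is not Archer's actual API: `split_into_chunks`, `annotate_chunk` and `annotate_video` are hypothetical names, the chunk size is arbitrary, and the per-chunk work is a placeholder where a real system would decode frames and run its detectors.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(total_frames, chunk_size):
    """Split frame indices [0, total_frames) into (start, end) ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, total_frames))
            for start in range(0, total_frames, chunk_size)]


def annotate_chunk(frame_range):
    """Placeholder for per-chunk work (hypothetical).

    A real pipeline would decode the frames in this range and
    attach annotations; here we only report the chunk's size.
    """
    start, end = frame_range
    return {"start": start, "end": end, "frames": end - start}


def annotate_video(total_frames, chunk_size=240, workers=4):
    """Annotate all chunks concurrently; results keep chunk order."""
    chunks = split_into_chunks(total_frames, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate_chunk, chunks))
```

Because each chunk is independent, the work can be spread across threads, processes or machines without changing the result.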
- Visual Metadata: For brightness, sharpness and color
- Contextual Metadata: For facial expressions, camera motion, camera angle and object detection
- Composition Metadata: For intricate image details such as depth of field and symmetry.
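A few of the properties listed above can be approximated with very simple measures. The sketch below is our own illustration in plain Python, not Netflix's implementation: it assumes a grayscale frame given as a list of rows of 0 to 255 values, uses mean intensity for brightness, the variance of a 4-neighbour Laplacian for sharpness, and a left-right mirror comparison for symmetry. Contextual properties such as faces or camera motion need trained models and are not covered.

```python
def brightness(gray):
    """Mean pixel intensity of a grayscale frame."""
    pixels = [p for row in gray for p in row]
    return sum(pixels) / len(pixels)


def sharpness(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Blurry frames have weak edges, so the variance stays small.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(gray[y - 1][x] + gray[y + 1][x]
                             + gray[y][x - 1] + gray[y][x + 1]
                             - 4 * gray[y][x])
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def symmetry(gray):
    """Left-right symmetry in [0, 1]; 1.0 means a perfect mirror image."""
    diffs = [abs(row[x] - row[-1 - x])
             for row in gray for x in range(len(row) // 2)]
    if not diffs:
        return 1.0
    return 1.0 - sum(diffs) / (255 * len(diffs))
```

Each function returns a single number per frame, which is the kind of value that could be stored alongside the frame as metadata.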
The annotations are made based on the above metadata. The ‘right’ picture is then selected by feeding all the relevant, appropriate images into an automated image processing framework.
And finally, it chooses the right picture
The ‘best’ image is chosen by considering three important aspects: the lead actors, visual range and image filters. Emphasis is given first to the lead actors of the show, since they add aesthetic appeal and make a visual impact.
The next thing is the diversity of the images present in the video frames, such as camera positions and image details like brightness, color and contrast, to name a few. With these in mind, image frames are easy to group based on similarities. This helps develop image support vectors. The vectors primarily assist in designing an image diversity index, where all the relevant images collected for an episode or even a movie can be scored based on visual appeal.
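One way to picture a diversity index is to treat every frame as a small feature vector (for example brightness, contrast and a camera-angle code) and measure how far apart the vectors are. The Python sketch below is a stand-in of our own, not the method Netflix uses: it takes the mean pairwise Euclidean distance as the index and a greedy farthest-point rule to pick a varied subset.

```python
import math
from itertools import combinations


def diversity_index(vectors):
    """Mean pairwise Euclidean distance; 0.0 for fewer than two vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def pick_diverse(vectors, k):
    """Greedily pick up to k indices that are far apart from each other.

    Starts from the first frame, then repeatedly adds the frame whose
    nearest already-chosen frame is farthest away.
    """
    if not vectors or k <= 0:
        return []
    chosen = [0]
    while len(chosen) < min(k, len(vectors)):
        remaining = [i for i in range(len(vectors)) if i not in chosen]
        best = max(remaining,
                   key=lambda i: min(math.dist(vectors[i], vectors[j])
                                     for j in chosen))
        chosen.append(best)
    return chosen
```

Near-duplicate frames sit close together in this space, so the greedy rule tends to skip them in favour of frames that look different.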
Apart from these factors, frames containing sensitive content such as violence, nudity and advertisements are also filtered: they are allotted low priority in the image vectors, and this way they are screened out of the process.
This article provides a brief overview of the implementation of AVA at Netflix. However, there might be other contributing factors in computer vision for sorting images and videos. The underlying software and applications are just the tip of the iceberg. In the coming days, Netflix can be expected to come up with a much more beautiful and richer interface to attract more viewers.
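Putting the selection criteria together, a simplified ranking step might look like the Python sketch below. Everything here is assumed for illustration: the `Candidate` fields, the flag names and the ordering rule (lead actor first, then visual appeal) are ours, and a production system would weigh these signals in a more nuanced way.

```python
from dataclasses import dataclass, field

# Flag names are hypothetical labels for the sensitive categories.
SENSITIVE = {"violence", "nudity", "advertisement"}


@dataclass
class Candidate:
    frame_id: str
    has_lead_actor: bool
    appeal: float                     # assumed visual-appeal score
    flags: set = field(default_factory=set)


def rank_candidates(candidates):
    """Drop flagged frames, then rank lead-actor frames first by appeal."""
    safe = [c for c in candidates if not (c.flags & SENSITIVE)]
    return sorted(safe,
                  key=lambda c: (c.has_lead_actor, c.appeal),
                  reverse=True)
```

For example, a frame flagged for violence never reaches the ranked list, however high its appeal score is.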