Even with all their knowledge of the real world, humans find it tricky to judge how near or far the objects in a video really are. It gets even messier when an algorithm is tasked with scanning videos. Moreover, real-world video streams can include objects or people moving along with the camera, among other complications.
Applying supervised learning to understand each individual frame in a video is expensive, since per-frame labels are needed for every video of the action of interest. Reading video data frame by frame also raises labelling issues: one cannot expect a well-defined label for every action or object in the data. So, researchers introduced a self-supervised learning method called Temporal Cycle-Consistency Learning (TCC). The technique was developed to identify similarities between videos when labelled data is almost non-existent.
What Is TCC?
The idea behind Temporal Cycle-Consistency Learning (TCC) is to find correspondences across time in multiple videos. These correspondences can be used to match frames across videos based on the similarity of the action being performed, and so to align the videos. The matching is done using nearest neighbours in the learned embedding space.
In short, TCC is a technique that helps a machine learning model gain more insight into videos. Feed the model a video and it skims through all the frames, learning embeddings that can be used for classification, transfer learning and other tasks.
As can be seen in the above picture, the procedure is as follows:
- The first step is to learn a frame encoder for image processing.
- All the frames of the videos are fed to the encoder and corresponding embeddings are produced.
- Two videos, say video 1 and video 2, are considered. A reference frame is chosen from video 1, and its nearest neighbour frame in video 2 (NN2) is found in the embedding space (not pixel space).
- The nearest neighbour of NN2 is then found back in video 1 (NN1). If the representations are cycle-consistent, NN1 will be the starting reference frame.
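The cycle-consistency check in the last two steps can be sketched in a few lines of Python. This is only an illustration and not the released codebase: the function names `nearest_neighbour` and `is_cycle_consistent` are hypothetical, the embeddings are hand-made 2-D toy vectors standing in for encoder outputs, and a hard nearest-neighbour lookup with Euclidean distance is assumed.

```python
import math


def nearest_neighbour(query, embeddings):
    """Return the index of the embedding closest to `query` (Euclidean distance)."""
    return min(range(len(embeddings)), key=lambda j: math.dist(query, embeddings[j]))


def is_cycle_consistent(i, video1, video2):
    """Check whether frame i of video 1 cycles back to itself via video 2."""
    nn2 = nearest_neighbour(video1[i], video2)    # step from video 1 to video 2
    nn1 = nearest_neighbour(video2[nn2], video1)  # step back from video 2 to video 1
    return nn1 == i


# Toy per-frame embeddings (assumed values, not real encoder outputs).
video1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
video2 = [(0.1, 0.0), (0.9, 0.1), (2.1, 0.0), (3.0, 0.0)]

# One boolean per frame of video 1: does the frame return to itself?
consistent = [is_cycle_consistent(i, video1, video2) for i in range(len(video1))]
```

A frame that fails this check is one whose embedding does not line up well with the other video. Because a hard lookup like this is not differentiable, training an encoder would probably need a softened version of the check.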
To help future researchers make the most of the information in videos, the team behind this innovation has released a codebase. This codebase contains implementations of many state-of-the-art self-supervised learning methods, including TCC.
Applications Of TCC
The team behind TCC list the following interesting applications:
- Improved Unsupervised Learning: TCC can classify the phases of different actions with as few as a single labelled video. It is handy in the few-shot scenario, where only a few labelled videos are available for training.
- Transfer learning between videos: TCC can be used to transfer metadata associated with any frame in one video to its matching frame in another video. This metadata can be sound or text. For example, sound from one video can be transferred to a muted video that shows a similar action.
- Per-frame Retrieval: The embeddings are powerful enough to differentiate between frames that look quite similar, such as frames just before or after the event has occurred.
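All three applications rest on the same operation: looking up a frame's nearest neighbour in the embedding space. The sketch below shows how per-frame metadata, here action-phase labels, might be copied from one labelled video to an unlabelled one. It is an illustration under stated assumptions: `transfer_metadata` is a hypothetical helper, the phase names are invented, and the 2-D embeddings are toy values rather than outputs of a trained TCC encoder.

```python
import math


def nearest_neighbour(query, embeddings):
    """Return the index of the embedding closest to `query` (Euclidean distance)."""
    return min(range(len(embeddings)), key=lambda j: math.dist(query, embeddings[j]))


def transfer_metadata(source_embeddings, source_metadata, target_embeddings):
    """Give each target frame the metadata of its nearest source frame."""
    return [
        source_metadata[nearest_neighbour(embedding, source_embeddings)]
        for embedding in target_embeddings
    ]


# A single labelled video: one embedding and one phase label per frame (assumed values).
labelled_embeddings = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
labelled_phases = ["wind-up", "throw", "follow-through"]

# An unlabelled video of a similar action, embedded by the same (hypothetical) encoder.
unlabelled_embeddings = [(0.1, 0.1), (0.2, 0.0), (1.1, 0.0), (1.9, 0.1)]

predicted_phases = transfer_metadata(
    labelled_embeddings, labelled_phases, unlabelled_embeddings
)
```

The same helper would work for any per-frame metadata, such as an audio snippet or a text caption, as long as it is stored as a list aligned with the source frames.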
Consider object detection in self-driving cars or a robot on an assembly line. In both cases, the system's actions can be tailored to the task if there is enough data to learn from, and most of that data can be recorded video. For example, a model could be trained on videos in which a car makes an anomalous lane change before an accident. This kind of training can help in designing warning systems for safer self-driving.
So, to make the most of videos, machine learning models should be smart enough to classify what is happening in a video based on the actions performed.