The ability of an ML model to deal with noisy training data depends in large part on the loss function used during training. For classification tasks, the standard choice is the logistic loss.
The logistic loss, also known as the softmax loss, is the usual way to train deep neural networks for classification. It applies the softmax function to the activations of the last layer to form the class probabilities, then takes the Kullback-Leibler divergence (relative entropy) between the true labels and the predicted probabilities. The logistic loss is known to be a convex function of the activations (and consequently, the weights) of the last layer.
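The two steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation; the function names `softmax` and `logistic_loss` and the example activations are assumptions made for this sketch, and the label is assumed to be a single class index (a one-hot target).

```python
import math

def softmax(activations):
    """Turn last-layer activations into class probabilities."""
    shift = max(activations)  # subtracting the max avoids overflow in exp
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def logistic_loss(activations, label):
    """Relative entropy between a one-hot label and the softmax output.

    For a one-hot target this reduces to -log(probability of the true class).
    """
    probs = softmax(activations)
    return -math.log(probs[label])

# Hypothetical activations for a three-class problem.
print(softmax([2.0, 1.0, 0.1]))
print(logistic_loss([2.0, 1.0, 0.1], label=0))
```

With a one-hot label the entropy of the target is zero, which is why the divergence collapses to a single negative log-probability.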
Where Does Logistic Loss Fall Short
The value of the logistic loss grows without bound as mislabelled examples (outliers) move farther away from the decision boundary. A single badly mislabelled example can therefore incur a very large penalty, and training stretches the decision boundary to reduce that penalty. The stretched boundary then sacrifices other, correctly labelled examples.
Consequently, the generalisation performance of the network can deteriorate quickly, even with a low level of label noise.
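The unbounded growth is easy to see in a two-class sketch. Assuming the labelled class trails the other class by some margin in activation space, the logistic loss is log(1 + e^margin), which grows roughly linearly with the margin. The helper name `logistic_loss_at_margin` and the margins tried are illustrative choices, not something taken from the paper.

```python
import math

def logistic_loss_at_margin(margin):
    """Two-class logistic loss when the labelled class trails by `margin`.

    Computes log(1 + exp(margin)) in a numerically stable form.
    """
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The penalty keeps growing as the mislabelled example moves away
# from the decision boundary; there is no ceiling.
for margin in (1.0, 10.0, 100.0, 1000.0):
    print(margin, logistic_loss_at_margin(margin))
```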
To handle noisy data better during training, researchers at Google Brain introduce a temperature into the exponential function and replace the softmax output layer of neural nets with a high-temperature generalisation. The result is a generalisation of the logistic loss with two tunable parameters.
In this context, the two “temperatures” characterise the boundedness of the loss and the rate of decline of the transfer function: t1 controls boundedness, and t2 controls tail-heaviness.
Setting t1 below 1.0 increases boundedness, while setting t2 above 1.0 gives a heavier-tailed transfer function. Setting both to 1.0 recovers the standard logistic loss. The next section covers how the bi-tempered loss is applied.
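The effect of each temperature can be checked numerically with the tempered logarithm and exponential. The sketch below uses the standard definitions log_t(x) = (x^(1-t) - 1) / (1 - t) and exp_t(x) = [1 + (1 - t) x]_+^(1/(1-t)); the function names and the temperature values 0.5 and 1.5 are illustrative assumptions.

```python
import math

def log_t(x, t):
    """Tempered logarithm; reduces to math.log when t == 1."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential; reduces to math.exp when t == 1."""
    if t == 1.0:
        return math.exp(x)
    base = max(1.0 + (1.0 - t) * x, 0.0)
    if base == 0.0 and t > 1.0:
        return math.inf  # exp_t diverges at x = 1 / (t - 1) when t > 1
    return base ** (1.0 / (1.0 - t))

# Boundedness: with t < 1 the tempered log never drops below -1 / (1 - t),
# whereas the ordinary log heads to minus infinity as x approaches 0.
print(log_t(1e-12, 0.5), math.log(1e-12))

# Tail-heaviness: with t > 1 the tempered exponential decays polynomially,
# so it stays far above the ordinary exponential for large negative inputs.
print(exp_t(-10.0, 1.5), math.exp(-10.0))
```

In this run the tempered log stays close to -2 while the ordinary log is below -27, and the tempered exponential is roughly 0.028 while the ordinary one is below 1e-4.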
Tackling Drawbacks Of Logistic Loss
When the last layer of a neural net is replaced by the bi-tempered generalisation of the logistic loss, the authors find that training becomes more robust to noise.
In this approach, the authors tackle shortcomings of the logistic loss, pertaining to its convexity as well as its tail-lightness, by replacing the logarithm and exponential functions with their corresponding “tempered” versions.
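For reference, the tempered functions and the resulting loss are usually written as below. Here a denotes the last-layer activations, y the target distribution, and \lambda_{t_2}(a) the normalisation constant that makes the tempered softmax outputs sum to one; this notation is an assumption chosen for this summary and may differ from the symbols in the paper.

```latex
\begin{align*}
\log_t(x) &= \frac{x^{1-t} - 1}{1 - t}, \qquad
\exp_t(x) = \bigl[\, 1 + (1 - t)\,x \,\bigr]_+^{1/(1-t)}, \qquad t \neq 1 \\[4pt]
\hat{y}_i &= \exp_{t_2}\!\bigl(a_i - \lambda_{t_2}(a)\bigr), \qquad
\sum_j \exp_{t_2}\!\bigl(a_j - \lambda_{t_2}(a)\bigr) = 1 \\[4pt]
L_{t_1}^{t_2}(a \mid y) &= \sum_i \Bigl( y_i \bigl(\log_{t_1} y_i - \log_{t_1} \hat{y}_i\bigr)
  - \tfrac{1}{2 - t_1}\bigl(y_i^{\,2 - t_1} - \hat{y}_i^{\,2 - t_1}\bigr) \Bigr)
\end{align*}
```

As t approaches 1, the tempered logarithm and exponential reduce to the ordinary ones, and the loss reduces to the logistic loss.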
The accompanying illustration compares the logistic loss with the bi-tempered logistic loss, with the temperature values (t1, t2) shown for the tempered loss. In each situation shown, the decision boundary recovered by training with the bi-tempered logistic loss is better than the one recovered with the standard logistic loss.
When both temperatures are the same, a construction based on the notion of a “matching loss” leads to loss functions that are convex in the last layer.
This construction yields bounded tempered loss functions that can handle large-margin outliers. It also introduces heavy-tailedness into the new tempered softmax function, which seems to handle small-margin mislabelled examples.
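A rough sketch of how a bounded loss handles a large-margin outlier is given below. It is not the authors' implementation: the normalisation constant of the tempered softmax is found here by plain bisection rather than by the iterative procedure in the paper, the label is assumed to be a single class index, and the sketch assumes t1 <= 1 <= t2. All function names and the temperatures 0.8 and 1.2 are illustrative.

```python
import math

def log_t(x, t):
    """Tempered logarithm; reduces to math.log when t == 1."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential; reduces to math.exp when t == 1."""
    if t == 1.0:
        return math.exp(x)
    base = max(1.0 + (1.0 - t) * x, 0.0)
    if base == 0.0 and t > 1.0:
        return math.inf
    return base ** (1.0 / (1.0 - t))

def tempered_softmax(activations, t, iterations=100):
    """Find lam with sum_i exp_t(a_i - lam) == 1 by bisection (an assumption;
    the paper uses a different procedure), then return the probabilities."""
    top = max(activations)
    # At lam = top the sum is at least 1; at the upper bound every term
    # is at most 1 / n, so the sum is at most 1.
    lo, hi = top, top - log_t(1.0 / len(activations), t)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(exp_t(a - mid, t) for a in activations) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [exp_t(a - lam, t) for a in activations]

def bi_tempered_loss(activations, label, t1, t2):
    """Bi-tempered loss for a one-hot label given as a class index."""
    probs = tempered_softmax(activations, t2)
    power_sum = sum(p ** (2.0 - t1) for p in probs)
    return -log_t(probs[label], t1) - (1.0 - power_sum) / (2.0 - t1)

# The labelled class 0 trails class 1 by `margin`, as for a mislabelled outlier.
for margin in (1.0, 10.0, 50.0):
    acts = [0.0, margin]
    plain = bi_tempered_loss(acts, 0, t1=1.0, t2=1.0)     # ordinary logistic loss
    tempered = bi_tempered_loss(acts, 0, t1=0.8, t2=1.2)  # bounded by 1 / (1 - t1)
    print(margin, plain, tempered)
```

With t1 = 0.8 the tempered loss can never exceed 1 / (1 - t1) = 5, however far the outlier sits from the boundary, while the ordinary loss keeps growing with the margin.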
A heavy tail means that very large values occur with higher probability. Heavy-tailed distributions therefore typically represent wild, as opposed to mild, randomness.
The experiments demonstrating the practical utility of the bi-tempered logistic loss were run on a wide variety of image classification tasks. For moderate-size experiments, the MNIST dataset of handwritten digits was used; for large-scale image classification, the authors used ImageNet-2012, which has 1000 classes.
- In the presence of mislabelled training examples near the classification boundary, the short tail of the softmax probabilities forces the classifier to closely follow the noisy training examples.
- A bounded, tempered loss function is constructed that can handle large-margin outliers.
- A temperature is introduced into the exponential function, and the softmax output layer of neural nets is replaced by a high-temperature generalisation.
- By tuning the two temperatures, researchers create loss functions that are already non-convex in the single-layer case.