Neural networks are powerhouses of prediction. The last couple of years have witnessed rapid growth in machine learning approaches to prediction. A neural network is one such approach: it applies arithmetic operations across many layers, sometimes hundreds, until an acceptable solution is obtained.
To keep a check on how accurate the solution is, loss functions are used. These functions are mathematical expressions whose results show by how much the algorithm has missed the target. Consider a self-driving car whose on-board camera misidentifies a cyclist as lane markings: that would be a bad day for the cyclist. Loss functions help avoid these kinds of misses by quantifying the errors so that training can reduce them.
The most commonly used loss functions are:
- Mean squared error
- Cross-entropy loss
- Hinge loss
- Huber loss
Mean squared error (MSE) is used to measure the accuracy of an estimator. The lower the value of MSE, the better the predictions.
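MSE is defined as the mean of the squared deviations of the predicted values from the true values:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and ‘n’ denotes the total number of samples in the data.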
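The definition above can be sketched in a few lines of plain Python. The function name `mean_squared_error` and the sample values are hypothetical, chosen only for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of the squared deviations of the predictions from the true values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)  # total number of samples
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


# Hypothetical values, for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # 1.5, since the squared deviations are 1, 0, 4 and 1
```

A perfect set of predictions gives an MSE of zero, and the value grows as predictions drift from the true values.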
Cross entropy is a widely used concept from information theory. It measures the number of bits needed to encode certain information when the encoding is based on an initial hypothesis about how that information is distributed.
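As a rough illustration of the "number of bits" reading, the sketch below computes the cross entropy between a true distribution and a hypothesised one using base-2 logarithms. The function name `cross_entropy_bits` and the two-outcome distributions are assumptions made for the example.

```python
import math


def cross_entropy_bits(p, q):
    """Average bits needed to encode events drawn from p with a code built for q."""
    # Outcomes with zero true probability contribute nothing to the sum.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


true_dist = [0.5, 0.5]  # hypothetical fair coin

matched = cross_entropy_bits(true_dist, [0.5, 0.5])       # hypothesis is correct
mismatched = cross_entropy_bits(true_dist, [0.25, 0.75])  # hypothesis is off

print(matched)     # 1.0 bit per outcome
print(mismatched)  # roughly 1.21 bits per outcome
```

The worse the initial hypothesis matches the true distribution, the more bits are needed on average.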
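In machine learning, this function is used in classification where a network has to choose between two distinct outputs. For a true label $y \in \{0, 1\}$ and a predicted probability $p$ of the positive class, the loss is:

$$L = -\bigl(y \log p + (1 - y) \log(1 - p)\bigr)$$

The second term on the right side accounts for false positives in the results and penalises them. The loss is also written as:

$$L = \begin{cases} -\log p & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 0 \end{cases}$$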
As the predicted probability moves away from the actual label in the data, the loss increases. Log loss penalises both types of errors, but especially those predictions that are confident and wrong. This function is also used for anomaly estimation.
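A minimal sketch of binary cross entropy (log loss) for a single sample shows how the penalty grows as the prediction moves away from the label. The function name `binary_cross_entropy` and the clamping constant `eps` are assumptions added for the example; the clamp only guards against taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Log loss for one sample: y is the label (0 or 1), p is the predicted P(y = 1)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# True label is 1; the prediction gets progressively worse.
for p in (0.9, 0.6, 0.1):
    print(p, round(binary_cross_entropy(1, p), 3))
# 0.9 -> 0.105, 0.6 -> 0.511, 0.1 -> 2.303
```

The confident but wrong prediction (0.1 for a true label of 1) costs roughly twenty times more than the confident and correct one.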
The hinge loss function is popular with Support Vector Machines (SVMs), where it is used for training classifiers.
It is defined as:

$$\ell(y) = \max(0, 1 - t \cdot y)$$

where ‘t’ is the intended output (+1 or −1) and ‘y’ is the classifier score.
Hinge loss is a convex function, but it is not differentiable everywhere (there is a kink where t·y = 1), which limits the methods available for minimising it.
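A small sketch, assuming the standard form max(0, 1 − t·y) with the intended output t equal to +1 or −1, shows how the loss behaves on either side of the margin. The function name `hinge_loss` and the scores are illustrative.

```python
def hinge_loss(t, y):
    """Hinge loss for intended output t (+1 or -1) and classifier score y."""
    return max(0.0, 1.0 - t * y)


print(hinge_loss(+1, 2.0))   # 0.0: correct side, outside the margin
print(hinge_loss(+1, 0.5))   # 0.5: correct side, but inside the margin
print(hinge_loss(+1, -1.0))  # 2.0: wrong side of the boundary
```

The loss is exactly zero once t·y reaches 1 and rises in a straight line below that point, which is where the kink comes from.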
Knowing When To Use A Loss Function
For a classification problem, hinge loss and logistic loss converge at almost the same rate, and both converge faster than squared loss.
Every loss function has its own set of advantages and disadvantages. Rather than disadvantages, it is more accurate to say that each one is unsuitable for certain problems.
The squared loss function, which rests on statistical assumptions about the mean, is more sensitive to outliers because it penalises them heavily. This results in slower convergence when compared to hinge loss or cross-entropy functions.
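To see how strongly a single outlier can dominate squared loss, here is a small sketch with made-up numbers. The data, the helper `squared_errors`, and the variable names are all hypothetical.

```python
def squared_errors(y_true, y_pred):
    """Per-sample squared deviations of the predictions from the true values."""
    return [(t - p) ** 2 for t, p in zip(y_true, y_pred)]


# Hypothetical data: the last true value is an outlier.
y_true = [10.0, 12.0, 11.0, 13.0, 50.0]
y_pred = [11.0, 11.0, 12.0, 12.0, 12.0]

errors = squared_errors(y_true, y_pred)
mse = sum(errors) / len(errors)
outlier_share = errors[-1] / sum(errors)

print(errors)                   # [1.0, 1.0, 1.0, 1.0, 1444.0]
print(mse)                      # 289.6
print(round(outlier_share, 3))  # 0.997
```

One point out of five accounts for nearly all of the loss, so an optimiser minimising this quantity would be pulled mostly towards fitting that single point.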
The hinge loss function penalises data points lying on the wrong side of the hyperplane linearly. Because hinge loss is not differentiable at the hinge point, it cannot be used directly with methods that rely on gradients, such as stochastic gradient descent (SGD), unless subgradients are used instead. In such cases cross entropy (log loss) can be used. It is convex like hinge loss and, being differentiable everywhere, can be minimised using SGD.
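The contrast in differentiability can be sketched by comparing the slope of each loss with respect to the classifier score. This is an illustrative sketch, not a training routine: it assumes labels t of +1 or −1, writes log loss in its margin form log(1 + exp(−t·y)), and uses hypothetical function names.

```python
import math


def hinge_subgradient(t, y):
    """One valid subgradient of max(0, 1 - t*y) with respect to the score y."""
    # The slope jumps from -t to 0 at t*y == 1; 0 is a valid choice at the kink.
    return -float(t) if t * y < 1 else 0.0


def logistic_loss(t, y):
    """Log loss in margin form, log(1 + exp(-t*y)), for t in {-1, +1}."""
    return math.log1p(math.exp(-t * y))


def logistic_gradient(t, y):
    """Derivative of the logistic loss with respect to the score y."""
    return -t / (1.0 + math.exp(t * y))


for y in (-1.0, 0.0, 1.0, 2.0):
    print(y, hinge_subgradient(1, y), round(logistic_gradient(1, y), 3))
```

The hinge slope is constant on the wrong side of the margin, which is the linear penalty described above, and then drops abruptly to zero. The logistic slope shrinks smoothly towards zero, which is what makes it convenient for gradient-based methods.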
A data scientist needs to look at data for what it is: empirical reality. Mathematically elegant models can look good on paper, but reality has other plans. This intuition, along with other metrics, should be the primary motivator in selecting a loss function. For example, consider a glass frame that has to be tested for the load it can tolerate. Glass, being a highly brittle material, will not succumb to the load until it does. In other words, sometimes failure comes without any warning. The load a few moments before failure can be seen as safe when it should really be treated as the warning load. Data analysis must therefore take into account the asymmetries that might have occurred during data collection.
Going by the data alone can fool us into making improper decisions, and that is the last thing a decision maker like a data scientist would want.