Calibration Techniques and Their Importance in Machine Learning
Introduction
Calibration in classification means transforming classifier scores into class membership probabilities.
Instead of predicting class values directly for a classification problem, it can be convenient to predict the probability of an observation belonging to each possible class.
Predicting probabilities allows some flexibility including deciding how to interpret the probabilities, presenting predictions with uncertainty, and providing more nuanced ways to evaluate the skill of the model.
Predicted probabilities that match the expected distribution of probabilities for each class are referred to as calibrated. A well-calibrated model has a calibration curve that hugs the straight line y = x.
Calibration is a comparison of the actual output and the expected output given by a model, i.e. we try to improve our model so that the distribution and behaviour of the predicted probabilities match the distribution and behaviour of the probabilities observed in the training data.
Many times, we face problems (data sets) whose evaluation metric is log loss. We calibrate our model when the probability estimate of a data point belonging to a class is very important. I discovered two powerful methods to improve the accuracy of predicted probabilities. We will discuss them in detail later in the blog.
Calibration Plots
Calibration plots are often line plots. Once we choose the number of bins and assign each prediction to a bin, each bin is converted to a dot on the plot. For each bin, the y-value is the proportion of true outcomes and the x-value is the mean predicted probability. Therefore, a well-calibrated model has a calibration curve that hugs the straight line y = x.
Follow the steps below to draw a Calibration Plot (a minimal code sketch follows the list):
- Create a data set with two columns: the actual class label (y) and the predicted probability (y^) given by the model.
- Sort this data set in ascending order of the predicted probability (y^).
- Divide the data set into bins of some fixed size. If the data set is large, keep the bin size large, and vice versa.
- In each bin, calculate the average of the yi's and the average of the yi^'s predicted by the model.
- Plot a graph with Average(yi's) on the y-axis and Average(yi^'s) on the x-axis, i.e. D = {Average(yi^'s), Average(yi's)}.
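Here is a minimal Python sketch of these steps (the arrays y_true and y_prob, holding the actual labels and the model's predicted probabilities, are assumed inputs for this illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_calibration(y_true, y_prob, n_bins=10):
    # Sort by predicted probability
    order = np.argsort(y_prob)
    y_true, y_prob = y_true[order], y_prob[order]

    # Split into bins of (roughly) equal size
    bins_true = np.array_split(y_true, n_bins)
    bins_prob = np.array_split(y_prob, n_bins)

    # Average of actual labels and of predicted probabilities per bin
    avg_true = [b.mean() for b in bins_true]
    avg_prob = [b.mean() for b in bins_prob]

    # Plot Average(y) vs Average(y^) together with the ideal y = x line
    plt.plot(avg_prob, avg_true, marker='o', label='model')
    plt.plot([0, 1], [0, 1], linestyle='--', label='perfectly calibrated')
    plt.xlabel('Mean predicted probability')
    plt.ylabel('Fraction of positives')
    plt.legend()
    plt.show()
```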
Types of Calibration Techniques:
There are two popular approaches to calibrating probabilities: Platt Scaling and Isotonic Regression.
1. Sigmoid Scaling / Platt's Scaling: In this technique we use a slight variation of the sigmoid function to fit our distribution of predicted probabilities to the distribution of probabilities observed in the training data. We actually perform logistic regression on the output of the model with respect to the actual labels. The mathematical function used is:
P(y = 1 | x) = 1 / (1 + exp(A·f(x) + B))
i.e., a logistic transformation of the classifier scores f(x), where A and B are two scalar parameters that are learned by the algorithm.
The parameters A and B are estimated using a maximum likelihood method that optimizes on the same training set as that used for the original classifier f.
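As a rough sketch (not the exact scikit-learn internals, which use a regularized maximum-likelihood fit with modified targets), Platt scaling can be approximated by fitting a one-feature logistic regression on held-out classifier scores f(x); the fitted coefficient and intercept play the roles of A and B. The arrays scores and y below are assumed inputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# scores: uncalibrated classifier outputs f(x) on a held-out set (1-D array)
# y: the corresponding true labels
def platt_scale(scores, y):
    lr = LogisticRegression()
    lr.fit(scores.reshape(-1, 1), y)  # learns A (lr.coef_) and B (lr.intercept_)
    # return a function mapping new scores to calibrated probabilities
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]
```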
Note:
Platt scaling has been shown to be effective for SVMs as well as other types of classification models, including boosted models and even naive Bayes classifiers, which produce distorted probability distributions. It is particularly effective for max-margin methods such as SVMs and boosted trees, which show sigmoidal distortions in their predicted probabilities, but has less of an effect with well-calibrated models such as logistic regression, multilayer perceptrons, and random forests.
An alternative approach to probability calibration is to fit an isotonic regression model to an ill-calibrated probability model. This has been shown to work better than Platt scaling, in particular when enough training data is available.
2. Isotonic Regression: In statistics, isotonic regression or monotonic regression is the technique of fitting a free-form line to a sequence of observations under the following constraints: the fitted free-form line has to be non-decreasing (or non-increasing) everywhere, and it has to lie as close to the observations as possible.
A benefit of isotonic regression is that it is not constrained by any functional form, such as the linearity imposed by linear regression, as long as the fitted function is monotonically increasing.
Thus, the isotonic regression problem corresponds to the following quadratic program (QP): given the training set {fi, yi} of classifier scores and labels, find the isotonic (non-decreasing) function m^ that minimizes Σi (yi − m(fi))², subject to m(f1) ≤ m(f2) ≤ … ≤ m(fn) for the scores sorted in increasing order.
The algorithm used to solve this problem is called the Pair-Adjacent Violators (PAV) algorithm, also known as Pool Adjacent Violators. A minimal sketch of the algorithm is given below.
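A minimal, unoptimized Python sketch of the PAV idea: walk over the targets sorted by classifier score and repeatedly pool adjacent blocks whose means violate the non-decreasing constraint, then assign each block its pooled mean (the input is assumed to be already sorted by score):

```python
def pav(y):
    """y: target values already sorted by classifier score f(x)."""
    # each block holds [sum of targets, count]; start with one block per point
    blocks = [[v, 1] for v in y]
    i = 0
    while i < len(blocks) - 1:
        # violation: mean of the current block exceeds mean of the next one
        if blocks[i][0] / blocks[i][1] > blocks[i + 1][0] / blocks[i + 1][1]:
            # pool the two blocks and step back to re-check earlier violations
            blocks[i][0] += blocks[i + 1][0]
            blocks[i][1] += blocks[i + 1][1]
            del blocks[i + 1]
            i = max(i - 1, 0)
        else:
            i += 1
    # expand pooled means back to one fitted value per observation
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted
```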
Code examples for better understanding and visualization:
- First, we will generate some random classification data using sklearn's make_classification and create an 80% train / 20% validation split, as sketched below.
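A sketch of this step (the parameter values and random seed below are illustrative assumptions, not necessarily those behind the reported scores):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# binary classification data; the parameter values are only illustrative
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           random_state=42)
# 80% train / 20% validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)
```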
- Next, let's train 3 models which will help us gain inferences (see the sketch after this list):
- Logistic Regression,
- Naive Bayes and
- Support Vector Classifier
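A minimal sketch of fitting the three models on the split created above (the SVC is left without probability=True, so we will later fall back to its decision_function scores):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = {
    'Logistic Regression': LogisticRegression(),
    'Naive Bayes': GaussianNB(),
    'SVC': SVC(),  # no probability=True: we will use decision_function scores later
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
```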
- Next, we will plot the distribution of the predicted probabilities on the validation set to see if there are any significant differences in trends (sketched below).
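One way to sketch this, reusing the models dictionary and validation split from the sketches above (for models without predict_proba, the decision_function output is min-max scaled into [0, 1] so the three are comparable):

```python
import matplotlib.pyplot as plt

def positive_scores(clf, X):
    # use predict_proba where available; otherwise scale decision_function to [0, 1]
    if hasattr(clf, 'predict_proba'):
        return clf.predict_proba(X)[:, 1]
    s = clf.decision_function(X)
    return (s - s.min()) / (s.max() - s.min())

for name, clf in models.items():
    plt.hist(positive_scores(clf, X_val), bins=20, alpha=0.5, label=name)
plt.xlabel('Predicted probability of the positive class')
plt.ylabel('Count')
plt.legend()
plt.show()
```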
Observations:
- Logistic Regression has a fairly even distribution. There is a slight accumulation of probabilities around 0 and 1; however, it isn't very significant.
- Naive Bayes has a slightly higher concentration of probabilities around 0 and 1, but still has some values in the mid range.
- SVC has almost all its values highly concentrated around 0 and 1, with negligible values in the mid probability range.
Let's plot Reliability Plots without using Calibration:
- Here, for models that do not expose 'clf.predict_proba' (such as the SVC without probability=True), we will use the 'clf.decision_function' scores, scaled to [0, 1], in place of predicted probabilities (i.e. y^); see the sketch below.
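A sketch of the uncalibrated reliability curves and Brier scores using scikit-learn's calibration_curve and brier_score_loss, reusing the positive_scores helper defined above (exact numbers will differ from those reported below):

```python
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss
import matplotlib.pyplot as plt

plt.plot([0, 1], [0, 1], linestyle='--', label='perfectly calibrated')
for name, clf in models.items():
    probs = positive_scores(clf, X_val)
    frac_pos, mean_pred = calibration_curve(y_val, probs, n_bins=10)
    brier = brier_score_loss(y_val, probs)
    plt.plot(mean_pred, frac_pos, marker='o', label=f'{name} (Brier: {brier:.3f})')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()
plt.show()
```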
Observation (metric used: Brier score)
Note: The smaller the Brier score, the better; hence the naming with "loss". Across all N predictions in a set, the Brier score measures the mean squared difference between the predicted probability (y^) assigned to item i and the actual outcome (y), i.e. BS = (1/N) Σi (yi^ − yi)². Therefore, the lower the Brier score for a set of predictions, the better the predictions are calibrated.
- Logistic Regression : 0.189
- Naive Bayes : 0.152
- Support Vector Machines : 0.105
- We can see that the calibration plots on the validation data deviate from the ideal straight line.
Let’s plot Reliability/Calibration Plots using Calibration (Sigmoid/Isotonic):
- We need to look at how confident the model is based on the predicted probabilities and actual values. Let's calibrate the models, which will allow us to apply informed post-processing to the probabilities so that the final output results in a better score (AUC, log loss, etc.).
- We can apply calibration just by using the CalibratedClassifierCV class available in the sklearn library in Python. For sigmoid calibration pass 'sigmoid' while creating the object of this class, and for isotonic pass 'isotonic' (see the sketch below).
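A minimal sketch for the SVC (the cv value is an illustrative assumption; the same pattern works for the other models):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
from sklearn.svm import SVC

# method='sigmoid' applies Platt scaling, method='isotonic' applies isotonic regression
for method in ('sigmoid', 'isotonic'):
    calibrated = CalibratedClassifierCV(SVC(), method=method, cv=3)
    calibrated.fit(X_train, y_train)
    probs = calibrated.predict_proba(X_val)[:, 1]
    print(f'SVC + {method}: Brier = {brier_score_loss(y_val, probs):.3f}')
```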
Observation after Platt/Sigmoid Scaling (metric used: Brier score)
- Logistic Regression : 0.149
- Naive Bayes : 0.139
- Support Vector Machines : 0.031
- We can clearly see that Platt scaling has improved our probability scores. The calibration curve is much closer to the straight line.
Observation after Isotonic Regression (metric used: Brier score)
- Logistic Regression : 0.146
- Naive Bayes : 0.131
- Support Vector Machines : 0.031
- We can clearly see that Isotonic Regression has improved our probability scores even more. The calibration curve is approximately a straight line, so the predicted probabilities (y^) are almost the same as the actual probabilities (y).
Pretty Table to compare all the results (Brier scores; lower is better):

| Model | Uncalibrated | Sigmoid/Platt | Isotonic |
| --- | --- | --- | --- |
| Logistic Regression | 0.189 | 0.149 | 0.146 |
| Naive Bayes | 0.152 | 0.139 | 0.131 |
| Support Vector Machines | 0.105 | 0.031 | 0.031 |
CONCLUSION:
- Notice the different ways in which the actual predictions try to hug the perfect-calibration line. For example, Logistic Regression exhibits the logistic (S-shaped) curve. Logistic Regression produced improved results when calibrated with Isotonic Regression.
- Naive Bayes scores improved after calibration and produced better results with Isotonic Regression than with Platt/Sigmoid Scaling.
- SVC, on the other hand, hugs the benchmark calibration line very closely, which means the mean predicted probability per bin is very close to the observed fraction of positives; this implies stable probabilities and higher reliability. We can also observe that the F1 Score of SVC+Sigmoid is better than that of SVC+Isotonic.
- Platt Scaling is most effective when the distortion in the predicted probabilities is sigmoid shaped.
- Isotonic Regression is a more powerful calibration method that can correct any monotonic distortion. However, Isotonic Regression is prone to over-fitting, especially when data is scarce.
- In general, we should use Isotonic Regression when we have a large amount of (cross-)validation data; otherwise Platt scaling should be used.
Where can you find my code?
Github link : https://github.com/SubhamIO/Calibration-Techniques-in-Machine-Learning
Thanks for reading!
- If you find my blog useful, please clap, share and follow me.
References:
- https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/
- https://en.wikipedia.org/wiki/Isotonic_regression
- https://en.wikipedia.org/wiki/Platt_scaling
- https://scikit-learn.org/stable/modules/calibration.html
- https://www.youtube.com/watch?v=RXMu96RJj_s&list=LLHBO8bM9cCDfnrqugwtjd-g&index=13&t=0s
- https://www.appliedaicourse.com/lecture/11/applied-machine-learning-online-course/3119/calibration-of-modelsneed-for-calibration/5/module-5-feature-engineering-productionization-and-deployment-of-ml-models
- https://drive.google.com/file/d/133odBinMOIVb_rh_GQxxsyMRyW-Zts7a/view