### An intro to probability scoring strategies in Python

Predicting probabilities rather than class labels for a classification problem can furnish extra nuance and uncertainty for the predictions.

The additional nuance allows more sophisticated metrics to be used to interpret and assess the forecasted probabilities. Generally, methods for assessing the accuracy of forecasted probabilities are referred to as scoring rules or scoring functions.

In this guide, you will discover three scoring methods that you can use to assess the forecasted probabilities on your classification predictive modelling problem.

After going through this guide, you will know:

- The log loss score, which heavily penalizes forecasted probabilities far away from their expected value.
- The Brier score, which is gentler than log loss but still penalizes in proportion to the distance from the expected value.
- The area under the ROC curve, which summarizes the likelihood of the model forecasting a higher probability for true positive cases than for true negative cases.

**Tutorial Overview**

This guide is divided into four parts:

- Log Loss Score
- Brier Score
- ROC AUC Score
- Tuning Predicted Probabilities

**Log Loss Score**

Log loss, also referred to as “logistic loss”, “logarithmic loss”, or “cross entropy”, can be used as a measure for assessing forecasted probabilities.

Every predicted probability is compared to the actual class output value (0 or 1), and a score is calculated that penalizes the probability based on its distance from the expected value. The penalty is logarithmic, giving a small score for small differences (0.1 or 0.2) and an enormous score for a large difference (0.9 or 1.0).

A model with perfect skill has a log loss score of 0.0.

In order to summarize the skill of a model leveraging log loss, the log loss is calculated for every forecasted probability, and the average loss is reported.
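As a minimal sketch of the calculation just described, the per-prediction penalty and its average can be written out by hand for the binary case. The helper name `mean_log_loss` and the clipping constant `eps` are illustrative assumptions, not part of scikit-learn; the clipping is only there to avoid taking the log of zero.

```python
from math import log

def mean_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss; eps is an assumed clipping constant that avoids log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # keep the probability strictly inside (0, 1)
        p = min(max(p, eps), 1.0 - eps)
        # penalty is -log(p) when the true class is 1 and -log(1 - p) when it is 0
        total += -(y * log(p) + (1 - y) * log(1.0 - p))
    return total / len(y_true)

# a confident correct prediction costs little, a confident wrong one costs a lot
small = mean_log_loss([1], [0.9])
large = mean_log_loss([1], [0.1])
print(small, large)
```

The two printed values illustrate the logarithmic penalty: moving the prediction for a true class of 1 from 0.9 to 0.1 multiplies the loss many times over.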

The log loss can be implemented in Python leveraging the log_loss() function in scikit-learn.

For instance:

```python
from sklearn.metrics import log_loss
...
model = ...
testX, testy = ...
# predict probabilities
probs = model.predict_proba(testX)
# keep the predictions for class 1 only
probs = probs[:, 1]
# calculate log loss
loss = log_loss(testy, probs)
```

In the binary classification scenario, the function takes a list of true outcome values and a list of probabilities as arguments and calculates the average log loss for the predictions.

We can make a single log loss score concrete with an example.

Given a particular known outcome of 0, we can forecast values of 0.0 to 1.0 in 0.01 increments (101 predictions) and calculate the log loss for each. The outcome is a curve showing how much every forecast is penalized as the probability gets further away from the expected value. We can repeat this for a known outcome of 1 and observe the same curve in reverse.

The complete example is listed below:

```python
# plot impact of logloss for single forecasts
from sklearn.metrics import log_loss
from matplotlib import pyplot
# predictions as 0 to 1 in 0.01 increments
yhat = [x*0.01 for x in range(0, 101)]
# evaluate predictions for a 0 true value
losses_0 = [log_loss([0], [x], labels=[0,1]) for x in yhat]
# evaluate predictions for a 1 true value
losses_1 = [log_loss([1], [x], labels=[0,1]) for x in yhat]
# plot input to loss
pyplot.plot(yhat, losses_0, label='true=0')
pyplot.plot(yhat, losses_1, label='true=1')
pyplot.legend()
pyplot.show()
```

Running the example creates a line plot showing the loss scores for probability predictions from 0.0 to 1.0 for the cases where the true label is 0 and 1.

This helps to build an intuition for the effect that the loss score has when assessing predictions.

Model skill is reported as the average log loss across the predictions in a test dataset.

As an average, we can expect that the score will be suitable with a balanced dataset and misleading when there is a large imbalance between the two classes in the test set. This is because predicting 0 or small probabilities will result in a small loss.

We can illustrate this by contrasting the distribution of loss values when forecasting differing constant probabilities for a balanced and an imbalanced dataset.

First, the example below forecasts values from 0.0 to 1.0 in 0.1 increments for a balanced dataset of 50 examples each of class 0 and class 1.

```python
# plot impact of logloss with balanced datasets
from sklearn.metrics import log_loss
from matplotlib import pyplot
# define a balanced dataset
testy = [0 for x in range(50)] + [1 for x in range(50)]
# loss for predicting different fixed probability values
predictions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
losses = [log_loss(testy, [y for x in range(len(testy))]) for y in predictions]
# plot predictions vs loss
pyplot.plot(predictions, losses)
pyplot.show()
```

Running the example, we can see that a model is better off forecasting probability values that are not sharp (close to the edges) and are pulled back towards the middle of the distribution.

The penalty for being wrong with a sharp probability is very large.

We can repeat this experiment with an imbalanced dataset with a 10:1 ratio of class 0 to class 1.

```python
# plot impact of logloss with imbalanced datasets
from sklearn.metrics import log_loss
from matplotlib import pyplot
# define an imbalanced dataset
testy = [0 for x in range(100)] + [1 for x in range(10)]
# loss for predicting different fixed probability values
predictions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
losses = [log_loss(testy, [y for x in range(len(testy))]) for y in predictions]
# plot predictions vs loss
pyplot.plot(predictions, losses)
pyplot.show()
```

Here, we can see that a model that is skewed towards forecasting very small probabilities will appear to perform well, optimistically so.

The naïve model that forecasts a constant probability of about 0.1, close to the base rate of class 1, will be the baseline model to beat.

The outcome indicates that model skill assessed with log loss should be interpreted carefully in the case of an imbalanced dataset, perhaps adjusted relative to the base rate for class 1 in the dataset.
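One way to read a log loss on an imbalanced test set is next to the loss of a naïve model that always predicts the base rate. The sketch below assumes the same 100:10 test set as the example above. The helper `mean_log_loss`, the made-up `model_probs`, and the ratio-style `skill` value are illustrative assumptions, not a scikit-learn feature.

```python
from math import log

def mean_log_loss(y_true, y_prob):
    """Average binary log loss for probabilities strictly between 0 and 1."""
    losses = [-(y * log(p) + (1 - y) * log(1.0 - p)) for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

# same imbalanced test set as above: 100 examples of class 0, 10 of class 1
testy = [0] * 100 + [1] * 10
# base rate of class 1, used as the constant naive prediction
base_rate = sum(testy) / len(testy)
naive_loss = mean_log_loss(testy, [base_rate] * len(testy))

# hypothetical model output, invented purely for illustration
model_probs = [0.05] * 100 + [0.4] * 10
model_loss = mean_log_loss(testy, model_probs)

# relative improvement over the naive baseline (positive means better than naive)
skill = 1.0 - model_loss / naive_loss
print(base_rate, naive_loss, model_loss, skill)
```

Reporting the loss relative to the naïve baseline makes it harder for a model to look good simply by predicting small probabilities everywhere.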

**Brier Score**

The Brier score, named for Glenn Brier, calculates the mean squared error between forecasted probabilities and the expected values.

The score summarizes the magnitude of the error in the probability forecasts.

The error score is always between 0.0 and 1.0, where a model with perfect skill has a score of 0.0.

Predictions that are further away from the expected probability are penalized, but less severely than in the case of log loss.

The skill of a model can be summarized as the average Brier score across all probabilities forecasted for a test dataset.
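Written out by hand, the score is simply the mean of the squared differences between each predicted probability and its 0/1 outcome. The helper name `brier_score` below is an illustrative assumption; it is a sketch of the definition, not a replacement for the library function.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    errors = [(p - y) ** 2 for y, p in zip(y_true, y_prob)]
    return sum(errors) / len(errors)

# two good predictions and one poor one
score = brier_score([1, 0, 1], [0.9, 0.2, 0.4])
# perfect and completely wrong predictions mark the two ends of the range
best = brier_score([1, 0], [1.0, 0.0])
worst = brier_score([1, 0], [0.0, 1.0])
print(score, best, worst)
```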

The Brier score can be calculated in Python leveraging the brier_score_loss() function in scikit-learn. It takes the true class values (0,1) and the forecasted probabilities for all instances in a test dataset as arguments and returns the average Brier score.

For instance:

```python
from sklearn.metrics import brier_score_loss
...
model = ...
testX, testy = ...
# predict probabilities
probs = model.predict_proba(testX)
# keep the predictions for class 1 only
probs = probs[:, 1]
# calculate brier score
loss = brier_score_loss(testy, probs)
```

We can assess the impact of forecasting errors by comparing the Brier score for single probability forecasts with increasing error from 0.0 to 1.0.

The complete example is listed below:

```python
# plot impact of brier for single forecasts
from sklearn.metrics import brier_score_loss
from matplotlib import pyplot
# predictions as 0 to 1 in 0.01 increments
yhat = [x*0.01 for x in range(0, 101)]
# evaluate predictions for a 1 true value (pos_label must be a scalar, not a list)
losses = [brier_score_loss([1], [x], pos_label=1) for x in yhat]
# plot input to loss
pyplot.plot(yhat, losses)
pyplot.show()
```

Running the example creates a plot of the predicted probability (x-axis) against the calculated Brier score (y-axis) for a true value of 1.

We can see a familiar quadratic curve: the score falls from 1.0 to 0.0 as the prediction approaches the true value, growing with the squared error.

Model skill is reported as the average Brier score across the forecasts in a test dataset.

As with log loss, we can expect that the score will be appropriate with a balanced dataset and misleading when there is a major imbalance between the two categories in the test set.

We can illustrate this by contrasting the distribution of loss values when forecasting differing constant probabilities for a balanced and an imbalanced dataset.

To start with, the example below forecasts values from 0.0 to 1.0 in 0.1 increments for a balanced dataset of 50 examples each of class 0 and class 1.

```python
# plot impact of brier score with balanced datasets
from sklearn.metrics import brier_score_loss
from matplotlib import pyplot
# define a balanced dataset
testy = [0 for x in range(50)] + [1 for x in range(50)]
# brier score for predicting different fixed probability values
predictions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
losses = [brier_score_loss(testy, [y for x in range(len(testy))]) for y in predictions]
# plot predictions vs loss
pyplot.plot(predictions, losses)
pyplot.show()
```

Running the example, we can see that a model is better off forecasting middle-of-the-road probability values like 0.5.

Unlike log loss, which is quite flat for close probabilities, the parabolic shape shows the clear quadratic increase in the score penalty as the error increases.

We can repeat this experiment with an imbalanced dataset with a 10:1 ratio of class 0 to class 1.

```python
# plot impact of brier score with imbalanced datasets
from sklearn.metrics import brier_score_loss
from matplotlib import pyplot
# define an imbalanced dataset
testy = [0 for x in range(100)] + [1 for x in range(10)]
# brier score for predicting different fixed probability values
predictions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
losses = [brier_score_loss(testy, [y for x in range(len(testy))]) for y in predictions]
# plot predictions vs loss
pyplot.plot(predictions, losses)
pyplot.show()
```

Running the instance, we observe a very different picture for the imbalanced dataset.

Much like the average log loss, the average Brier score will give optimistic scores on an imbalanced dataset, rewarding small forecast values that minimize error on the majority class.

In these cases, the Brier score should be compared relative to the naïve prediction (for example the base rate of the minority class, or about 0.1 in the above example) or normalized by the naïve score.

This latter approach is common and is referred to as the Brier Skill Score (BSS).

BSS = 1 – (BS / BS_ref)

Where BS is the Brier score of the model, and BS_ref is the Brier score of the naïve prediction.

The Brier Skill Score reports the relative skill of the probability prediction over the naïve forecast.

A good addition to the scikit-learn API would be a parameter on brier_score_loss() to support the calculation of the Brier Skill Score.
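Until such a parameter exists, the BSS formula above can be sketched by hand. The function names `brier_score` and `brier_skill_score` are hypothetical, the reference forecast is assumed to be the base rate of class 1 in the test set, and the sketch assumes both classes are present so that BS_ref is not zero.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def brier_skill_score(y_true, y_prob):
    """BSS = 1 - (BS / BS_ref), with the base rate as the assumed naive reference."""
    base_rate = sum(y_true) / len(y_true)
    bs_ref = brier_score(y_true, [base_rate] * len(y_true))
    bs = brier_score(y_true, y_prob)
    return 1.0 - bs / bs_ref

# imbalanced test set: 100 examples of class 0, 10 of class 1
testy = [0] * 100 + [1] * 10
# predicting the base rate for every example has no skill over the reference
naive_bss = brier_skill_score(testy, [10 / 110] * len(testy))
# perfect predictions reach the maximum skill
perfect_bss = brier_skill_score(testy, [float(y) for y in testy])
# hypothetical model output, invented purely for illustration
model_bss = brier_skill_score(testy, [0.05] * 100 + [0.4] * 10)
print(naive_bss, perfect_bss, model_bss)
```

A BSS of 0.0 means no improvement over the naïve forecast, 1.0 is a perfect forecast, and negative values mean the model is worse than the naïve forecast.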

**ROC AUC Score**

A forecasted probability for a binary (two-class) classification problem can be interpreted with a threshold.

The threshold defines the point at which the probability is mapped to class 0 vs. class 1, where the default threshold is 0.5. Alternative threshold values enable the model to be tuned for higher or lower false positives and false negatives.

Tuning the threshold by the operator is especially important on problems where one type of error is more or less important than another, or when a model makes disproportionately more or less of a particular type of error.

The Receiver Operating Characteristic, or ROC, curve is a plot of the true positive rate versus the false positive rate for the forecasts of a model at multiple thresholds between 0.0 and 1.0.

Predictions that possess no skill for a provided threshold are drawn on the diagonal of the plot from the bottom left to the top right. This line indicates no-skill predictions for every threshold.

Models that possess skill have a curve above this diagonal line that bows towards the top left corner.
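To make the construction of the curve concrete, the sketch below computes a single point on it by hand: the probabilities are mapped to class labels at one threshold, then the false positive rate and true positive rate are counted. The helper name `roc_point` and the toy data are illustrative assumptions; a probability equal to the threshold is treated as class 1 here.

```python
def roc_point(y_true, y_prob, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    # map probabilities to crisp labels: class 1 at or above the threshold
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    # count correct and incorrect positive predictions
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# toy data, invented for illustration
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]
# sweeping the threshold traces the curve from top right to bottom left
for threshold in [0.0, 0.3, 0.5, 0.7, 1.0]:
    print(threshold, roc_point(y_true, y_prob, threshold))
```

Lowering the threshold moves the point towards the top right (more true positives, but also more false positives), while raising it moves the point towards the bottom left.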

Below is an example of fitting a logistic regression model on a binary classification problem and calculating and plotting the ROC curve for the forecasted probabilities on a test set of 500 new data examples.

```python
# roc curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = LogisticRegression()
model.fit(trainX, trainy)
# predict probabilities
probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate roc curve
fpr, tpr, thresholds = roc_curve(testy, probs)
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
pyplot.plot(fpr, tpr)
# show the plot
pyplot.show()
```

Running the example creates a ROC curve that can be compared to the no-skill line on the main diagonal.

The integrated area under the ROC curve, referred to as AUC or ROC AUC, provides a measure of the skill of the model across all assessed thresholds.

An AUC score of 0.5 indicates no skill, for example a curve along the diagonal, while an AUC of 1.0 indicates perfect skill, with all points along the left y-axis and top x-axis toward the top left corner. An AUC of 0.0 indicates perfectly wrong predictions.

Predictions by models that have a larger area have better skill across the thresholds, although the specific shape of the curves will vary between models, potentially offering an opportunity to optimize models by a pre-chosen threshold. Typically, the threshold is chosen by the operator after the model has been prepared.

The AUC can be calculated in Python leveraging the roc_auc_score() function in scikit-learn.

This function takes a listing of true output values and forecasted probabilities as arguments and returns the ROC AUC.

For instance:

```python
from sklearn.metrics import roc_auc_score
...
model = ...
testX, testy = ...
# predict probabilities
probs = model.predict_proba(testX)
# keep the predictions for class 1 only
probs = probs[:, 1]
# calculate roc auc
auc = roc_auc_score(testy, probs)
```

An AUC score is a measure of the likelihood that the model that generated the forecasts will rank a randomly chosen positive example above a randomly chosen negative example. Specifically, that the probability will be higher for a real event (class=1) than a real non-event (class=0).

This is an instructive definition that offers two important intuitions:

- Naïve Prediction: A naïve prediction under ROC AUC is any constant probability. If the same probability is forecasted for every example, there is no discrimination between positive and negative cases, so the model has no skill (AUC=0.5).
- Insensitivity to Class Imbalance: ROC AUC is a summary of the model’s ability to correctly discriminate a single example across different thresholds. As such, it is not concerned with the base likelihood of each class.
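Both intuitions can be checked with a small hand-written version of the ranking definition: compare every positive example with every negative example and count how often the positive receives the higher probability. The helper name `pairwise_auc` and the toy data are illustrative assumptions, and ties are counted as half a win, which is one common convention.

```python
def pairwise_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs where the positive is ranked higher."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                # a tie carries no ranking information, so count it as half
                wins += 0.5
    return wins / (len(pos) * len(neg))

# toy data, invented for illustration
auc = pairwise_auc([0, 0, 0, 1, 1, 1], [0.1, 0.4, 0.6, 0.35, 0.8, 0.9])
# a constant prediction has no skill, whether the classes are balanced or not
balanced_const = pairwise_auc([0] * 50 + [1] * 50, [0.1] * 100)
imbalanced_const = pairwise_auc([0] * 100 + [1] * 10, [0.1] * 110)
print(auc, balanced_const, imbalanced_const)
```

The constant forecast scores the same on the balanced and the imbalanced data, which is the behaviour the two points above describe.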

Below, the example demonstrating the ROC curve is updated to calculate and display the AUC.

```python
# roc auc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = LogisticRegression()
model.fit(trainX, trainy)
# predict probabilities
probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate roc auc
auc = roc_auc_score(testy, probs)
print(auc)
```

Running the example calculates and prints the ROC AUC for the logistic regression model assessed on 500 new examples.

0.9028044871794871

An important consideration in choosing the ROC AUC is that it does not summarize the specific discriminative power of the model at any one threshold, but rather the general discriminative power across all thresholds.

It might be a better tool for model selection than for quantifying the practical skill of a model’s forecasted probabilities.

**Tuning Predicted Probabilities**

Predicted probabilities can be tuned to enhance or even game a performance measure.

For instance, the log loss and Brier scores quantify the average amount of error in the probabilities. As such, forecasted probabilities can be tuned to enhance these scores in a few ways:

- Making the probabilities less sharp (less confident): This implies modifying the forecasted probabilities away from the hard 0 and 1 bounds to restrict the impact of penalties of being completely incorrect.
- Shift the distribution to the naïve prediction (base rate): This implies shifting the mean of the forecasted probabilities to the probability of the base rate, like 0.5 for a balanced forecasting problem.
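The first adjustment can be sketched in a few lines: probabilities are pulled away from the hard 0 and 1 bounds, and the log loss is compared before and after. The helper names and the bounds of 0.05 and 0.95 are arbitrary assumptions chosen for illustration, and the predictions are made up.

```python
from math import log

def mean_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss; eps is an assumed clipping constant that avoids log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * log(p) + (1 - y) * log(1.0 - p))
    return total / len(y_true)

def soften(y_prob, low=0.05, high=0.95):
    """Pull probabilities away from the hard 0 and 1 bounds (bounds are assumptions)."""
    return [min(max(p, low), high) for p in y_prob]

y_true = [1, 1, 0, 0, 1]
# maximally confident forecasts; the last one is completely wrong
sharp = [1.0, 1.0, 0.0, 0.0, 0.0]
softened = soften(sharp)

sharp_loss = mean_log_loss(y_true, sharp)
softened_loss = mean_log_loss(y_true, softened)
print(sharp_loss, softened_loss)
```

The softened forecasts pay a small extra penalty on the four correct predictions but avoid the very large penalty on the one confident mistake, so the average loss drops.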

Generally, it might be useful to review the calibration of the probabilities using tools like a reliability diagram. This can be done using the calibration_curve() function in scikit-learn.

Some algorithms, like SVM and neural networks, might not forecast calibrated probabilities natively. In these scenarios, the probabilities can be calibrated and in turn may enhance the selected metric. Classifiers can be calibrated in scikit-learn leveraging the CalibratedClassifierCV class.
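A minimal sketch of both tools together is given below. It assumes the same synthetic dataset used in the ROC examples, an SVC with default settings as the uncalibrated model, sigmoid calibration with 5-fold cross-validation, and 10 bins for the reliability diagram data; all of these choices are illustrative rather than recommendations.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# generate 2 class dataset and split into train/test sets
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)

# wrap the SVM so that its scores are mapped to calibrated probabilities
calibrated = CalibratedClassifierCV(SVC(), method='sigmoid', cv=5)
calibrated.fit(trainX, trainy)

# keep probabilities for the positive outcome only
probs = calibrated.predict_proba(testX)[:, 1]

# reliability diagram data: observed fraction of positives
# against the mean predicted probability in each bin
frac_pos, mean_pred = calibration_curve(testy, probs, n_bins=10)
for predicted, observed in zip(mean_pred, frac_pos):
    print('%.3f %.3f' % (predicted, observed))
```

For well calibrated probabilities the two printed columns should be close to each other in every bin, which corresponds to points near the diagonal of a reliability diagram.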

**Further Reading**

This section furnishes additional resources on the subject if you are seeking to delve deeper.

**API**

- sklearn.metrics.log_loss API
- sklearn.metrics.brier_score_loss API
- sklearn.metrics.roc_curve API
- sklearn.metrics.roc_auc_score API
- sklearn.calibration.calibration_curve API
- sklearn.calibration.CalibratedClassifierCV API

**Articles**

- Scoring Rule, Wikipedia
- Cross entropy, Wikipedia
- Log Loss, fast.ai
- Brier Score, Wikipedia
- Receiver operating characteristic, Wikipedia

**Conclusion**

In this guide, you discovered three metrics that you can use to assess the forecasted probabilities on your classification predictive modelling problem.

Specifically, you learned:

- The log loss score, which heavily penalizes forecasted probabilities far away from their expected value.
- The Brier score, which is gentler than log loss but still penalizes in proportion to the distance from the expected value.
- The area under the ROC curve, which summarizes the probability of the model forecasting a higher probability for true positive cases than for true negative cases.