Logistic regression (Part II – evaluation)

(get code directly)

This post is a continuation of recent post where we implemented gradient descent method for solving logistic regression problem. Today we will try to evaluate our algorithm to find out how it works in the wild.

Last time the logistic regression problem was stated strictly as optimization problem which suggests our evaluation should consider the goal function itself – which is to some extend correct, indeed it makes sense to check how J changes while algorithm runs. But it is not the best idea one can have. Please recall our goal. We aimed for parametrized labelling machine producing labels for new upcoming obveravations/objects based on pre-defined features they are characterized by. The word new is crucial here. We want our model to fit to new data – to predict labels correctly for unseen data.

Therefore we need some external metrics to measure how our model performs on a data it does not know yet – or in other words we need to find out how well it generalizes. Generalization is the ability of machine learning model to fit unseen data. The opposite phenomenon is called “overfitting” – it happens when model predicts training data pretty well but it fails on new upcoming examples – it learns patterns that are unique for training data ommiting important general patterns that are common accross whole universe of possible objects of given kind.

Our next steps are then directed towards introducing external classification evaluation metrics, then building logistic regression model for chosen dataset (based on the code we already have) and applying these metrics onto it. In the meantime we will also take a look at how J behaves over time (iterations)

Lets begin then and present some basic terms around evaluation metrics for binary classificators and list their pros/cons. We will also provide Julia implementation fitting the code we already have.

Confusion matrix

Confusion matrix is a matrix (table) expressing classification results (class memebership) with respect to true labels. Its columns indices represent true classes while rows indices represent classes assigned by classification model. Entry of confussion matrix reflects number of objects falling into true-label/predicted-label category. In our case then confussion matrix is a square matrix of size 2.

One simple example, lets say we deal with the problem of binary classification of text email messages into classes called “spam” and “not spam”. The following confussion matrix is then produced:

\[
\begin{pmatrix}
\color{green}{\text{[actual spam predicted as spam]}} & \color{red}{\text{[not spam predicted as spam]}} \\
\color{red}{\text{[actual spam predicted as not spam]}} & \color{green}{\text{[not spam predicted as not spam]}}
\end{pmatrix}
\]

Depending on where they land observations falling into entries of the confussion matrix have the following names

\[
\begin{pmatrix}
\color{green}{\text{[true positive]}} & \color{red}{\text{[false positive]}} \\
\color{red}{\text{[false negative]}} & \color{green}{\text{[true negative]}}
\end{pmatrix} =
\begin{pmatrix}
\color{green}{\text{[TP]}} & \color{red}{\text{[FP]}} \\
\color{red}{\text{[FN]}} & \color{green}{\text{[TN]}}
\end{pmatrix}
\]

where positive part indicates email being a spam (one can imagine spam test with positive result if that helps) while negative inditaces email not being a spam.

Lets quickly implement confussion matrix calculation in Julia

function confussionmatrix(predictions, labels, d)
   c = zeros(d,d)
   for i in 1:length(labels)
       c[labels[i] + 1 ,predictions[i] + 1] += 1 
   end 
   return c
end

and play with it

labels = [ x > 0.5 ? 1 : 0 for x in randn(1,100)] 
predictions = [ rand(1)[1] > 0.1 ? labels[i] : (1 - labels[i]) for i in 1:100 ] # lets miss the true target with prob 0.1 
confussionmatrix(predictions,labels, 2)

resulting in something like (up to randomness..)
\begin{pmatrix}
\color{green}{\text{55}} & \color{red}{\text{9}} \\
\color{red}{\text{6}} & \color{green}{\text{33}}
\end{pmatrix}

There is quite a lot of evaluation metrics that one can derive out of these 4 values (TP FP TN FN). Lets list, implement and comment the most important:

Accuracy

\[
accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{number of correct predictions}}{\text{number of observations}}
\]

As described it is quite naive ratio of correct predictions. The main problem with such metric is high dependency on class imbalance. For example if we deal with 100 observations with 90 positive labels and our “classifier” classifies all new observations as “positive” we land with accuracy = 0.9 which may indicate (at first glance) strong classification ability while our classifier is rather a toy.

Example Julia implementation below

function accuracy(labels, predictions) 
   sum(predictions .== labels) / length(labels)
end 

Sensitivity

Also known as true positive rate or (recall) is defined as

\[
sensitivity = \frac{TP}{TP + FN}
\]

which is a ratio of correct predictions of positive labels and all true positive labels. In case of our spam/not spam example that would measure how well we can predict spam (meaning, among all spam messages how many of them we actually predict as spam)

Specificity

Also known as true negative rate is defined as

\[
specificity = \frac{TN}{TN + FP}
\]

which is a ratio of correct predictions of negative labels and all true negative labels. Again, in case of our email example, its a way to measure how well we can predict “healthy” email (TN + FP = number of all emails that are not spam)

Precision

defined as

\[
precision = \frac{TP}{TP + FP}
\]

Lets think about that in the context of our email toy example. That is ratio of number of emails correctly predicted as spam to the number of emails predicted as spam – including incorrect predictions.

Once again, quickly, for the training purposes implement these metrics in Julia:

function sensitivity(predictions, labels) 
    tp = 0 
    fptp = 0. 
    for i in 1:length(labels)
      if labels[i] == 1 
          tp += predictions[i]
          fptp += 1.  
      end    
    end 
    return tp / fptp 
end

function specificity(predictions, labels) 
    tn = 0 
    tnfp = 0. 
    for i in 1:length(labels)
      if labels[i] == 0 
          tn += 1 - predictions[i]
          tnfp += 1.  
      end    
    end 
    return tn / tnfp 
end

function precision(predictions, labels) 
    tp = 0 
    fp = 0. 
    for i in 1:length(labels)
      if labels[i] == 1 
          tp += predictions[i]
      else 
          fp += predictions[i]
      end    
    end 
    return tp / (tp + fp)
end

Having these pieces in place it is probably high time to mention another “problem” that will motivate last evaluation metric mentioned in this post. It is about problem of choosing proper threshold when dealing with models returning real values (like in logistic regression case) instead of “labels” directly (knn could be example of such).

Lets stick to Logistic regression. Choosing proper threshold is about what exactly should be considered positive or negative prediction. Please recall that logistic regression model crunches (tranformed) input through logistic function that returns values from range \( (0,1) \). In other words it does not return suggested binary label. Then one must choose threshold value that will serve as minimum value for the model output to be interpreted as positive prediction. Choosing one particular threshold results in a single classifier. Each choice implies specificity and sensivity values. Now if you consider all possible choices of threshold you land with set of 2 dimensional points in sensitivity-specificity space. These points draw a curve. The curve is called ROC curve.

ROC curve

Different ROC curves exist, here we will talk about one particular, the one in specificity/sensivity space. ROC curve is then defined as set of points in \( \mathcal{R}^{2} \) where each point is defined as

\[
( 1 – specificity(\mathcal{M(t)}, X_{test}) , sensitivity(\mathcal{M(t)}, X_{test}) )
\]

where \( \mathcal{M(t)} \) is a model under consideration for chosen threshold \( t \) and then sensitivity and specificity are calculated for given model and test set \( X_{test} \). First of all as range of possible values for sensitivity and specificity is \( (0,1) \) ROC curve exists in a unit square. Lets take a look at the example of ROC curve for logistic regression model on iris dataset with target = “versicolor” (on training dataset).

roc

and comment border points

  • the upper right point has sensivity = 1 and specificity = 0. It is the point with threshold = 0 implying classifier that always returns positive prediction. Therefore true positive rate (sensitivity) is 1 as all predictions are positive and true negative rate (specificity) is 0 for the same reason.
  • the bottom left point has sensivity = 0 and specificity = 1. It is the point with threshold = 1 representing a classifier that always returns negative prediction. Analogically true positive rate (sensitivity) is 0 as all positive cases are predicted incorrectly and true negative rate (specificity) is 1 as all predictions are negative predictions.

Points in between represent classifiers resulting from choosing thresholds from \((0, 1)\).

How will the ROC curve look for potentially ideal classifier, the one for whom exists threshold that perfectly separates observations ? For that particular perfect threshold it will generate specificity and sensitivity equal to 1 with lower values for other thresholds. The final ROC curve will look like this:

Rplot2

To wrap ROC curve information into single metric people use something called AUC – Area Under the Curve which is very common practical metric for classifier performance. AUC around 0.5 implies purely random classifier while AUC equal to 1 implies perfect one.

if you want to get deeper insight into presented topics take a look at Elements of Statistical Learning where PDF version is available.

Lets try to implement AUC calculation in Julia. Our required input is vector of labels (binary) and vector of predictions (real). The main trick here is to sum up areas of consecutive triangles made of: point \((1,0)\), \(i\)-th ROC curve point and its \((i+1)\)-th point. We will then need to identify all possible thresholds, iterate through these thresholds, compute current triangle area, sum these areas up to produce final result.

Lets try to write some code

function triangle_area(A,B,C) 
  0.5 * abs((A[1] - C[1]) * (B[2] - A[2]) - (A[1] - B[1]) * ( C[2] - A[2]))
end

function auc(labels, predictions) 
   n = length(labels)
   ROC_curve_points = ones(n+1,2)
   sorted_labels = labels[sortperm(predictions)];
   prev_point = [1., 1.];
   total_auc = 0.
   predicted_labels = ones(n)
   for i in 1:n
     predicted_labels[i] = 0 
     A = [1 - specificity(predicted_labels,sorted_labels), sensitivity(predicted_labels,sorted_labels) ]     
     ROC_curve_points[i,:] = A
	 B = prev_point
     C = [1 0] 
     total_auc += triangle_area(A,B,C)     
     prev_point = copy(A)
   end    
   return ROC_curve_points, total_auc
end

The triangle_area function computes triangle area using Shoelace formula, auc function first sort labels with repsect to predictions, then it sets all predictions to 1, at each iteration current permuted label is set to 0 and for such prediction point A is computed, then finally area is updated by current triangle area given by last point, current point and point [1 0]. Moreover ROC curve points are returned for later visualization.

The code is not so optimal – just note that specificity and sensitivity are computed separately at each iteration making the whole calculation of inefficient quadratic time. Optimal code should involve calculating all sensitivity and specificity values in one function cleverly relying on result from previous iterations. I will try to add this to the code on github – as soon as it happens you won’t see this remark.

Cross validation

Ok lets gather all pieces together. We know how to build logistic regression model, we know it tries to minimize certain goal function, we know how to compute AUC to evaluate the model, what we still miss is a way of dataset division into training and test set – to evaluate generalization ability. Common way is to divide dataset into \(K\) chunks – build model \(K\) times each time using one chunk as a test set for evaluation. This technique is called k-fold cross validation.

We will now propose simple implementation of k-fold cross validation and comment it right after.

function cross_validation(k, learn_method, dataset, labels, evaluation_metric)
 n = size(dataset,1)
  if n < k
     error("nr of folds is greater than number of observations")
  end
  # shuffle dataset
  shuffled_indices = shuffle([1:n])
  # divide dataset into k chunks
  fold_cardinality = div(n,k)
  fold_indices = [[(fold_cardinality *(i-1) + 1):(fold_cardinality *i)] for i in 1:k]
  # add mising observations to the last chunk
  fold_indices[end] = fold_indices[end][end] < n? vcat(fold_indices[end],(fold_indices[end][end]+1):n) : fold_indices[end]

  f = learn_method[:learning_method] 
  predict = learn_method[:inferencer]
  params = learn_method[:params]
  
  predictions = zeros(n)
  goal_function_values = []
  for fold = fold_indices
      test_indices = shuffled_indices[fold]
      training_indices = filter(a -> !(a in test_indices), shuffled_indices)
      training_dataset = dataset[training_indices,:]
      training_labels = labels[training_indices]
      
      model_params, goal_function_values = f(training_dataset, training_labels, params)
	  
      test_dataset = dataset[test_indices ,:]    	  
      predictions[test_indices] = predict(test_dataset, model_params)[:]	  
	  
  end
  
  return predictions, evaluation_metric(labels, predictions), goal_function_values
end

the main idea here is to

  • iterate through subsets of shuffled indices (data matrix rows)
  • use these subsets as test data and the remaining elements as training data
  • each time to produce the model using selected learning method (in our case logistic_regression_learn function) and learning parameters
  • each time store the predictions for given indices subset
  • having all predictions, to evaluate them using chosen evaluation metric (in our case auc function)

Real world example

Well, it is not really real world example because dataset we are going to examine is quite ‘easy’ to separate. It does come from real world though. We are going to work with dataset called “breast cancer”. It consists of 699 cases described by 10 attributes each. You can read detailed description of the dataset here.

Ok, first of all, you can get the whole code that was created around logistic regression posts from github repository here.

include("LogisticRegression.jl");
include("Evaluation.jl");

# Pkg.add("RDatasets") - install RDatasets if you haven't already 
# Phg.add("Gadfly") - install gadfly for visualization purposes 
using RDatasets
using Gadfly

biopsy  = dataset("MASS","biopsy");
median_of_missing_V6 = int(median(biopsy[:V6][!isna(biopsy[:V6])]))
biopsy[:V6][isna(biopsy[:V6])] = median_of_missing_V6
data = convert(Array{Float64,2}, biopsy[2:10]);
labels = convert(Array{Float64,1}, biopsy[:11] .== "benign");

params = Dict([:alpha, :eps, :max_iter], [0.001, 0.0005, 10000.]);
learn_method = Dict([:learning_method, :params, :inferencer], [logistic_regression_learn, params, predict]);
predictions, eval_results = cross_validation(10, learn_method, data, labels, auc);

plot(x = eval_results[1][:,1], y= eval_results[1][:,2], Geom.line, Guide.xlabel("1 - Specificity"), Guide.ylabel("Sensivitiy"))

Lets see. First we include our previously written code. Then we prepare data input and labels, propose learning method with proper params, set evaluation metric and run cross validation (choosing 10 folds). Finally we plot ROC curve that results in the following plot:

plot

We can almost ideally separate points for certain threshold. AUC is around 1. Lets now pick one certain threshold, compute sensitivity and specificity and comment on that. Lets choose threshold = 0.5

sensitivity(labels, predictions .> 0.5)
0.9696312364425163

specificity(labels, predictions .> 0.5)
0.9537815126050421

As you can see for that choice of threshold we get very good performance of our classifier. Lets think of that for a moment. What are the consequences of threshold choice taking into account domain (potential breast cancer diagnosis) our classifier operates on. Of course, the loss of incorrect prediction in case of breast cancer is very high. It may results in death of a patient. It differs from the opposite situation when we diagnose cancer for healthy patient. Of course it is extremally stressful for patient, too. Losses of these decisions can be encoded in a structure called Loss matrix – often provided by domain expert. Then threshold choice minimizes given loss.

There are still some unsolved misteries around the code we just wrote. How does one know alpha prameter? What should be the value of epsilon? When people don’t know but still want to look smart – they usually say it is rather an art then science. I’m gonna use that here. It is more an art then science, my dear friend. I honestly don’t know. But we can still empirically examine alpha choice. This time we can directly look how the goal function changes over time. Lets then experiment with alpha a bit and see how J changes over time (iterations)

params = Dict([:alpha, :eps, :max_iter], [0.00001, 0.005, 10000.]);
results1 = logistic_regression_learn(data,labels,params);

params = Dict([:alpha, :eps, :max_iter], [0.0001, 0.005, 10000.]);
results2 = logistic_regression_learn(data,labels,params);

plot(
     layer(x = 1:length(results1[2]), y = results1[2], Geom.line, Theme(default_color=color("red"))),
     layer(x = 1:length(results2[2]), y = results2[2], Geom.line, Theme(default_color=color("blue"))),
     Guide.XLabel("Iterations"), Guide.YLabel("J"),  Scale.x_continuous(minvalue = 0, maxvalue = 5000)
)

it produces:
plot

Choosing smaller alpha results in slower convergance. On the other hand choosing alpha that is too large will not produce good result, either. If the step size is too large algorithm will not have flexibility to find optimal solution – it will operate in resolution that won’t let it find the optimal solution – it will finish iterating after max iterations – failing. Final conclusion is, you have to try out few alphas, see how J changes – if it gets smaller without jumping around – it is probably ok. Another option is to experiment with adaptive gradient descent, where you don’t have to specify that parameter.

Summary

To summarize. I found implementing of gradient descent method for logistic regression problem quite convenient. It was fun. The main pain though, I still experience is lack of good IDE. I also never established any mature workflow because of that. Right now I still use REPL (console) and external file in text editor which I think is very basic approach

Somewhere in the future, I will try to investigate what IDEs are out there and how to work in Julia in a correct and convenient way. See you soon!

---

  • Ehsan Totoni

    Hi, you might find the HPAT package interesting for scaling these algorithms to big data. Data frames isn’t supported yet, however.
    https://github.com/IntelLabs/HPAT.jl