
Week 3: Evaluating & Interpreting Models

In this module we will learn how to define appropriate outcome and output metrics for AI projects. We will then discuss key metrics for evaluating regression and classification models and how to select one for use. We will wrap up with a discussion of common sources of error in machine learning projects and how to troubleshoot poor performance.

Learning Objectives

  • Differentiate between outcome and output metrics
  • Apply metrics to evaluate the performance of regression models
  • Apply metrics to evaluate the performance of classification models

Metrics in ML Projects


Video: Introduction and Objectives

Why Model Evaluation Matters:

  • Evaluating a machine learning model goes beyond just looking at its accuracy.
  • We need to understand the types of errors a model makes and their implications for its real-world use.

Key Concepts

  • Outcome vs. Output Metrics:
    • Outcome Metrics: Measure success in terms of the real-world problem the model is trying to solve (e.g., increased sales, improved patient diagnosis).
    • Output Metrics: Assess the technical performance of the model itself (e.g., accuracy, precision, recall). These help refine the model but must be carefully connected to real-world outcomes.

Evaluating Regression Models:

  • Regression models predict numerical values.
  • Common metrics:
    • Mean Squared Error (MSE)
    • Root Mean Squared Error (RMSE)
    • R-squared

Evaluating Classification Models:

  • Classification models predict categories.
  • Common metrics:
    • Accuracy
    • Precision
    • Recall
    • F1-score
    • Confusion matrix (visualizes different types of errors)

Key Takeaways

  • Choosing the right metrics depends on the specific problem you’re solving and the trade-offs you’re willing to make between different types of errors.
  • Metrics help us understand the strengths and weaknesses of our models, leading to further refinement and better decision-making.

In the last module, we talked about the process of building and training machine learning models. We discussed strategies to select and optimize models using validation sets and cross-validation. In this module, we're going to talk in depth about how to evaluate and interpret models and their outputs. At the end of this module, you should be able to differentiate between the two main types of metrics that we use: outcome metrics and output metrics. You should also be able to understand and apply metrics to evaluate the performance of the two main types of supervised machine learning models that we create: regression models and classification models.

Reading: Download Module Slides

Video: Outcomes vs Outputs

Key Concepts

  • Outcome Metrics: Measure the desired business impact of the model or product (e.g., reduced costs, increased revenue, improved safety). They are often expressed in dollars or time.
  • Output Metrics: Evaluate the technical performance of the model itself (e.g., classification accuracy, regression error). They focus on how well the model predicts the target outcome.

The Relationship

  1. Start with Business Understanding: Define the problem and what success looks like (outcome metrics).
  2. Outcome Metrics Guide Output Metrics: Choose output metrics that align with and help achieve the desired outcomes.

Case Studies

  • Turbulence Prediction for Airlines
    • Outcome Metric: Reduced safety incidents or claims.
    • Output Metric: Classification metric (accuracy in predicting turbulence).
  • Power Demand Forecasting for Utilities
    • Outcome Metric: Lower cost or emissions per megawatt hour.
    • Output Metric: Regression metric (accuracy in predicting power demand).

Key Takeaways

  • Choosing the right metrics starts with a clear understanding of the business problem you are solving.
  • Outcome metrics communicate value to stakeholders, while output metrics guide model development and refinement.

Which of the following are examples of potential "outcome" metrics for a machine learning project (select all that apply)?
  • Additional sales revenue (in $/day)
  • Time saved (in minutes/day)

Evaluation and interpretation of models is the primary focus of step five of the CRISP-DM process. However, defining metrics in order to evaluate models actually begins right at the beginning of the process, in step one, business understanding. When we define the problem that we're trying to solve through a model, a key part of that problem definition is defining what success looks like and identifying the metric that we're going to use to evaluate success. Our choice of metrics in the business understanding phase feeds directly into our evaluation of models when we reach step five of the process.

In machine learning, we generally use two different types of metrics to evaluate our model performance. The first type is called outcome metrics. Outcome metrics refer to the desired business impact of the model, or the broader product that we're trying to create, either for our own organization or for our customers. Typically the business impact is stated in terms of dollars, so it might be dollars of costs saved or dollars of revenue generated. Sometimes it can be time as well, but typically it's referring to some sort of an impact on a customer or our own business operations. Outcome metrics do not contain technical performance metrics about the model that we've created.

Output metrics, on the other hand, refer to the desired output from our model. These are typically stated in terms of one of the model performance metrics that we're going to learn about later in this lesson. Typically the output metrics for a model are not communicated to the customer except in rare cases. What our customer really cares about is the outcome that we're delivering to them, not so much the output from the model itself. Output metrics are also generally set after we've defined the desired outcome, and we allow the choice of outcome metric to then dictate our selection of the output metrics that we use to evaluate our model.

To illustrate the difference between outcome and output metrics, let's consider a couple of case studies. The first case study is focused on a tool to predict turbulence for airlines. Our objective here is to use atmospheric conditions to predict turbulence in advance of flights taking off. By predicting turbulence, we're able to optimize flight routes to ensure safe flights. An outcome metric that we might use to evaluate the performance of this tool might look something like this: a lower number of safety incidents per year for an airline customer of this tool, or perhaps a lower dollar value of safety-related claims made against that airline. This would be a direct result of the tool's ability to successfully predict turbulence and, therefore, to ensure that that airline is planning safe flight routes and minimizing potential safety incidents. The output metric that we might use to evaluate the quality of the model that we build to support this tool would typically be a classification metric, and we'll talk about some different options for those later on in this module.

Let's now consider a second case study. We're building a tool for electric utilities to be able to forecast power demand on their network. Forecasting demand is critically important for electric utilities to assist in planning their power generation. When electric utilities are able to do a good job forecasting demand, they're able to optimize the mix of energy that they generate, minimizing the cost and the emissions associated with that energy generation. When they do a poor job forecasting demand, utilities are often forced to use what are called peaker plants to meet the extra demand. The problem with using peaker plants is that they are often very expensive and result in higher emissions relative to the standard energy generation for utilities. Outcome metrics that we might select to evaluate our tool might be something like this: a lower cost per megawatt hour of power produced for our electric utility customers, or a lower emissions rate per megawatt hour of power produced. We would choose to evaluate the output of the model behind this product using a regression metric, which we'll talk about a little bit later.

Video: Model Output Metrics

What are Output Metrics?

  • Output metrics directly measure the technical performance of a machine learning model (e.g., classification accuracy, regression error).

Key Stages where Output Metrics are Critical

  1. Model Comparison: During the model building process, output metrics calculated on a validation set (or through cross-validation) help you compare and select the best-performing model.
  2. Final Evaluation: Before deploying your chosen model, you evaluate its performance on a held-out test set using output metrics. This gives you an unbiased estimate of real-world performance.
  3. Ongoing Monitoring: After deployment, you continuously track output metrics to monitor your model’s performance in production. This helps detect performance degradation over time.

Key Takeaway: Output metrics play a vital role throughout the machine learning process, guiding model selection, ensuring pre-deployment quality, and signaling when a model might need retraining or adjustment in a live setting.
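To make the model-comparison and final-evaluation stages concrete, here is a minimal sketch (not from the course; the dataset, candidate models, and choice of metric are illustrative) showing an output metric used first on a validation split and then once on a held-out test set:

```python
# Minimal sketch (not from the course): an output metric (MAE here) used to
# compare candidate models on a validation split, then reported once on a
# held-out test set before deployment. Dataset and models are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(random_state=0),
}

# Stage 1: compare candidate models using the output metric on the validation set
val_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_scores[name] = mean_absolute_error(y_val, model.predict(X_val))
best_name = min(val_scores, key=val_scores.get)

# Stage 2: refit the chosen model on all training data, then evaluate it once
# on the held-out test set for an unbiased estimate of real-world performance
best_model = candidates[best_name].fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, best_model.predict(X_test))
print("validation MAE:", val_scores, "| selected:", best_name, "| test MAE:", test_mae)
```

In practice, the validation split in stage 1 is often replaced by cross-validation, as discussed in the previous module; stage 3 (ongoing monitoring) would recompute the same metric on production data over time.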

We use model output metrics at several points in the modeling process. Firstly, we use output metrics in order to compare and evaluate the different models that we might create and make our selection of the model to use. Once we've selected our final model, we use output metrics to evaluate the performance of that model on our test set of data prior to deploying it and providing it to our customers. Finally, we'll use output metrics on a continual basis to evaluate the ongoing performance of our model as it's being used out in the real world. When we compare models, we calculate these output metrics using the validation set or cross-validation strategy, as we discussed in the last module. Once we've selected our final model to use, we then apply it to our test set, generating predictions that we use to calculate an output metric.

Regression and Classification Metrics


Video: Regression Error Metrics

Common Regression Metrics

  • Mean Squared Error (MSE): Calculates the average squared difference between actual and predicted values.
    • Pros: Emphasizes large errors (outliers).
    • Cons: Sensitive to scale, hard to compare between datasets.
  • Root Mean Squared Error (RMSE): The square root of MSE, making it more comparable to the original data scale.
  • Mean Absolute Error (MAE): Calculates the average absolute difference between actual and predicted values.
    • Pros: Less sensitive to outliers than MSE, easier to interpret.
    • Cons: Still scale-dependent.
  • Mean Absolute Percent Error (MAPE): Expresses error as a percentage of the actual values.
    • Pros: Easy for non-technical audiences to understand.
    • Cons: Can be skewed by small actual values leading to large percentage errors.

Which Metric to Choose

  • If large outliers are particularly bad, MSE/RMSE might be preferable.
  • If consistent small errors are a bigger concern, MAE could be a better choice.

Understanding R-Squared

  • R-Squared (Coefficient of Determination): Measures the proportion of variance in the target variable that’s explained by the model.
  • Range: 0 to 1
    • 1 = Perfect model
    • 0 = Model explains none of the variance
  • Calculation: 1 – (SSE / SST), where SSE is the sum of squared errors and SST is the total sum of squares.
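As a quick illustration of these definitions, here is a hedged sketch (not part of the course materials; the values are made up, and mean_absolute_percentage_error requires a reasonably recent scikit-learn) computing each metric on a small set of predictions:

```python
# Illustrative sketch: computing MSE, RMSE, MAE, MAPE, and R-squared on
# placeholder values with NumPy and scikit-learn.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 145.0, 190.0, 270.0, 290.0])

mse = mean_squared_error(y_true, y_pred)               # average squared error
rmse = np.sqrt(mse)                                    # back on the original scale
mae = mean_absolute_error(y_true, y_pred)              # average absolute error
mape = mean_absolute_percentage_error(y_true, y_pred)  # fraction; multiply by 100 for %
r2 = r2_score(y_true, y_pred)                          # 1 - (SSE / SST)
print(mse, rmse, mae, mape, r2)
```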

We are working on a regression modeling project for a client who is very concerned about minimizing any particularly large errors of the model on outlier data points. To compare multiple versions of our model and select a final version for the project, which metric would we most likely use?

Mean squared error

Correct. MSE penalizes large errors more heavily than MAE, and so will be a better metric for comparing models if we desire to minimize infrequent large errors occurring

We would like to compare two models using R-squared as a metric. Model A has an R-squared value of 0.85, and Model B has an R-squared value of 0.2. Which model does a better job explaining the variability in our target variable?

Model A

Correct. Model A has a much higher R-squared value, meaning it does a better job explaining the variability in the target variable

Let's begin the discussion of output metrics by talking about regression metrics. For regression modeling problems, we will typically use one of three common metrics: mean squared error, mean absolute error, or mean absolute percent error. Let's start with the most popular, which is called mean squared error. We calculate the mean squared error by summing up the squared differences between the actual target values and the predicted values and then dividing by the number of observations that we have. One of the challenges of mean squared error is that it's heavily influenced by outliers. When we have a particular instance of a large error, because of the square term in the formula, a heavy penalty is applied and, as a result, we get a very high MSE. Mean squared error is also influenced by the scale of our data. Therefore, it's impossible to compare a mean squared error on one problem to a mean squared error on another problem, because we're working with completely different datasets using different scales. Sometimes we'll use what's called RMSE, or root mean squared error, rather than mean squared error. RMSE is simply the square root of mean squared error.

A second common regression output metric is mean absolute error. In MAE, or mean absolute error, we're summing up the absolute value of the difference between the target and the prediction across all of the predictions that we make and dividing by the total number of predictions. MAE is also influenced by the scale of the problem; it is therefore impossible to compare an MAE value on one problem to another problem. However, as compared to mean squared error, MAE is more robust to outliers or very large errors; it tends to penalize large errors much less than MSE does because it doesn't contain that square term in the formula. MAE can also be a little bit easier to interpret in the context of a problem because, without that square term in the formula, MAE tends to be on a similar scale to the values that we're trying to predict. So it's a little bit more logical for us to understand when we see an MAE value relative to an MSE value in the context of the predictions we're trying to make.

We also sometimes use mean absolute percent error rather than mean absolute error. Mean absolute percent error, or MAPE, is calculated as the absolute value of the difference between the actual values and the predictions we're generating, divided by the actual values. We sum that up and divide by the total number of predictions to get our MAPE value. MAPE converts the error to a percentage rather than an absolute number. MAPE is typically very popular, particularly among non-technical audiences. Because it's easily understood, it's a common metric used to present to customers; again, because it's easy to understand and interpret. One of the challenges with MAPE is that it's skewed by high percentage errors for low values of y. Consider a case where we have a very low value of the target: we may have a very small error, but relative to that low target value, when we convert the small error to a percentage, it ends up being a very high percentage.

To understand the difference between mean absolute error and root mean squared error, let's take a look at an example. On the slide, we have two situations. In each situation, we have five output data points of a model that we've built. In case one, we have a small variance in errors for each of those five points: the error values are either one unit or two units for each of the five points. In case two, we have four points where we've made a perfect prediction with zero error, and for the fifth point we have a large error of seven units. In each case, the total error across the five points is equal to seven. In the first case, if we calculate our mean absolute error, we come up with 1.4. Likewise for case two, we can calculate MAE and also come to 1.4. However, when we calculate mean squared error, for case one it turns out to be 2.2, and for case two it turns out to be 9.8. Why is that? Again, because mean squared error severely penalizes large error values, even if it's a single error. In case two, we had one single large error of seven units, and because of the square term in the formula for mean squared error, that single error value is severely penalized. Sometimes this can be a good thing and sometimes it's not. In some problems that we're trying to model, being off by a large amount, or having a very large error one single time, can be a really bad thing. In other cases, we don't really care all that much if we're off once or twice by large error values, but what's really bad is if we're consistently off by small values. If we really care about minimizing the chances of any large, severe outlier errors, mean squared error may be a better metric to help us get a realistic picture of that. If we care more about whether we're consistently on or off by small amounts every time we make a prediction, we might want to look at mean absolute error.

Let's now introduce some terminology that's used to calculate R-squared. The total deviation from the mean, which we refer to as the sum of squares total, or SST, is equal to the sum of the squared differences between the actual y values and the mean y value. SST is the result of the addition of two terms: the sum of squares regression, or SSR, and the total squared error, or SSE. The sum of squares regression, or SSR, is calculated as the sum of the squared differences between the predicted y values and the mean y value; or, put another way, it is the amount of the variance that's explained by our model. The total squared error, or SSE, is the sum of the squared differences between the actual y values and the predicted y values, or the unexplained variance, or error, in our model. R-squared, then, is equal to the sum of squares regression over the sum of squares total, or the amount of variability explained by our model divided by the total variability in our y values. Put another way, R-squared can be calculated as one minus the sum of squared error divided by the sum of squares total. The R-squared value for a model is typically between zero and one, where an R-squared of one would indicate a perfect model that's able to completely explain all of the variance found in the y, or target, values. An R-squared of zero would mean that the model explains none of the variance found in our y or target values.
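The two cases described above can be verified with a few lines of code; this is an assumed reconstruction of the example (error values of 1 or 2 in case one, a single error of 7 in case two):

```python
# Quick check of the two cases above: each has the same total absolute error of
# 7 across five points, but MSE treats them very differently because of the
# squared term.
import numpy as np

case_one_errors = np.array([1, 2, 1, 2, 1])   # many small errors
case_two_errors = np.array([0, 0, 0, 0, 7])   # one large outlier error

for name, errors in [("case one", case_one_errors), ("case two", case_two_errors)]:
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    print(name, "MAE =", mae, "MSE =", mse)
# case one: MAE = 1.4, MSE = 2.2; case two: MAE = 1.4, MSE = 9.8
```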

Video: Classification Error Metrics: Confusion Matrix

Metrics for Classification Problems

  • Accuracy: The proportion of correct predictions out of total predictions. Simple, but can be misleading in these situations:
    • Class Imbalance: Where one category (e.g., “no heart disease”) is significantly more frequent than others. A model predicting the majority class all the time can appear very accurate but is useless.
  • Confusion Matrix: A table visually laying out True Positives, True Negatives, False Positives, and False Negatives. This helps with more nuanced metrics:
  • True Positive Rate (TPR), Recall, or Sensitivity: Measures how many actual positives were correctly identified. (True Positives / (True Positives + False Negatives) )
  • False Positive Rate (FPR): How many negatives were wrongly classified as positive. (False Positives / (False Positives + True Negatives) )
  • Precision: Out of values predicted as positive, how many were truly positive? (True Positives / (True Positives + False Positives) )
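A small, illustrative sketch (the labels and predictions are made up) showing how these quantities can be read off a confusion matrix in scikit-learn:

```python
# Illustrative sketch: computing recall (TPR), FPR, and precision from the
# confusion matrix of a small binary example with made-up labels.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)      # true positive rate / sensitivity
fpr = fp / (fp + tn)         # false positive rate
precision = tp / (tp + fp)   # of predicted positives, how many were real positives

# The same recall and precision via the built-in helpers
assert recall == recall_score(y_true, y_pred)
assert precision == precision_score(y_true, y_pred)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn, recall, fpr, precision)
```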

When to Use Which

  • Class Imbalance: Don’t rely on accuracy alone, use Precision, Recall, and FPR. Consider the real-world consequences of false negatives vs. false positives.
  • Multi-class Problems: Use a multi-class confusion matrix to see where your model confuses specific categories. Calculate metrics for each individual class or use macro-averages for an overall picture.

Suppose we are building a model to predict which days of the year a normal, healthy person will have a common cold. Would accuracy be the best choice of metric to evaluate our model?

No

Correct. Since we have a high class imbalance in this case because most days of the year a healthy person does not have a cold, accuracy is probably not the best choice of metric

We are building a classification model to identify manufacturing defects (a "positive" in our model) from among parts coming off a manufacturing line. In our test set we have 1000 images of parts. 50 of the 1000 contain defects ("positives") and the remaining images do not. Our model successfully identifies 40 true positives. What is the recall of our model?

80%

Correct. Our model successfully identified 40 out of the 50 positives (parts with defects), and so our recall is 40/50 = 80%

As in the previous question, we are building a model to identify defects ("positives" in this case) within products coming off a manufacturing line. We test our model on a test set of 1000 images of products coming off the line. Our model predicted that 50 of the images were positives (had defects) and the remaining 950 had no defects. We compare our model's predictions to the actual labels and determine that our model had 45 true positives. What was the precision of our model on the test set?

90%

Correct. Our model predicted 50 positives and of those 45 were true positives, so our precision was 45/50 = 90%

In the last lesson, we talked about common output metrics that are used for regression problems. We'll now talk about the classification scenario and cover some of the popular metrics used for classification types of tasks. By far the most common classification metric is accuracy. Accuracy is very popular, it's easy to understand, and you'll find accuracy values all over the place. Accuracy simply refers to the number of predictions that we've gotten correct divided by the total number of predictions that we've generated. The challenge with accuracy is that it can sometimes be deceiving in situations where we have what's called class imbalance, meaning that in our given problem we have a very large number of one class and a relatively much smaller number of values in our other class.

To illustrate this, let's consider a situation. I'm building a model to predict whether patients will have heart disease or not. I'm using data for this model from a medical study that included thousands of patients and several features for each patient, along with a label which is either a one or a zero, indicating whether they were diagnosed with heart disease or not. I use this dataset, I create a classifier model, and I evaluate the output of my model using accuracy, finding that I achieved 99.4 percent accuracy. Sounds great; I've got an excellent model. What's the problem? The problem is, if we look into our dataset a little bit deeper, we find that we had very high class imbalance in our dataset. The vast majority of patients in the study did not have heart disease. So the model that we created actually just predicted a zero, or no heart disease, for every single patient, and the model was right 99.4 percent of the time, but the model was actually pretty useless.

A better method of evaluating the output of a classification model is using what's called the confusion matrix. A confusion matrix is a matrix that illustrates on one axis the true values of our y. In the case of a binary, or zero/one, classification, we would split it into negative and positive (zero and one) values. On the other axis of our matrix we would highlight the predicted class, or y hat, again separated into one or zero. Using our confusion matrix, we can then start to calculate classification error metrics. In the top left quadrant of our matrix, where the true y value was a one and the predicted value was a one, we call these true positives. In the opposite corner, where our true value was a zero, or the negative class, and we successfully predicted a zero, we call these true negatives. In the case where the actual true value was a zero but we predicted a one, or positive, we call this a false positive. Likewise, when the true value was a one but we predicted a zero, or negative, we call this a false negative.

One of the error metrics that we'll use based on this confusion matrix is what's called the true positive rate, also called the recall or sensitivity of a model. True positive rate, or recall, refers to: out of all the positives, how many did we correctly identify as being positive? We calculate this as the number of true positives divided by the sum of the true positives plus the false negatives. We can also identify the false positive rate, or FPR, for a model. FPR refers to: out of all the negatives, how many did the model incorrectly classify as being positives? To calculate the FPR, we take our false positives divided by the sum of the false positives plus our true negatives. The precision value of our model refers to something a little bit different: out of the values that we predicted as being the positive class, or as being ones, how many of those were actually positives? We calculate this using the true positives divided by the true positives plus the false positives.

In the previous examples, we looked at a binary classification setting and the resulting confusion matrix. We can also apply a confusion matrix to problems where we have multiple classes we're trying to predict. We generate the confusion matrix the same way, except now, rather than a single one or zero on each axis, we have multiple classes. We use this multi-class confusion matrix to show us where the model is struggling to differentiate between certain classes. We can also calculate metrics using the multi-class confusion matrix, just as we did in the binary setting. However, now we can calculate these metrics for each class, for example the recall and precision of every class in our problem. We can also calculate average metrics across all the classes, which we call the macro-averaged recall or macro-averaged precision.

Video: Classification Error Metrics: ROC and PR Curves

ROC (Receiver Operating Characteristic) Curves

  • What they plot: True Positive Rate (TPR) vs. False Positive Rate (FPR) across different classification thresholds.
  • Thresholds: The probability cutoff for deciding if a prediction is positive or negative (e.g., above 0.5 probability = positive class).
  • AUROC: Area Under the ROC Curve. Summarizes model performance:
    • Perfect classifier = 1.0
    • Random guessing = 0.5
    • Higher AUROC = generally better model

Precision-Recall (PR) Curves

  • What they plot: Precision vs. Recall (same as True Positive Rate) across different thresholds.
  • Why they’re useful: Especially good for scenarios with class imbalance (e.g., few positives, many negatives). This is because PR curves focus on correctly identifying positives and don’t get inflated by correctly identifying easy negatives.
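The sketch below (synthetic, imbalanced data; not from the course) shows one way to produce both curves and the AUROC value from a model's predicted probabilities using scikit-learn and matplotlib:

```python
# Minimal sketch: ROC curve, PR curve, and AUROC from predicted probabilities
# on a synthetic, class-imbalanced dataset.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# ~10% positives to mimic a class-imbalanced problem
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, proba)                # TPR vs. FPR over thresholds
precision, recall, _ = precision_recall_curve(y_test, proba)
print("AUROC:", roc_auc_score(y_test, proba))         # 0.5 ~ random, 1.0 = perfect

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.plot(fpr, tpr)
ax1.set_xlabel("False positive rate")
ax1.set_ylabel("True positive rate (recall)")
ax1.set_title("ROC curve")
ax2.plot(recall, precision)
ax2.set_xlabel("Recall")
ax2.set_ylabel("Precision")
ax2.set_title("PR curve")
plt.tight_layout()
plt.show()
```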

Key Takeaways

  • Both ROC and PR curves help visualize model performance by plotting how it changes with different thresholds.
  • AUROC is a common metric, but PR curves are better when dealing with class imbalance.
  • Model evaluation is about choosing the right tool for the context of your problem!

What most likely happens to the recall / true positive rate of our model if we decrease the threshold value from the default of 0.5 to a value of 0.3?

It goes up

Correct. Our model would classify more points as positives, likely increasing the TPR / recall

What most likely happens to the false positive rate of our model if we decrease the threshold from the default of 0.5 to a value of 0.3?

It goes up

Correct. Our model will classify more points as positives, which will likely increase the false positive rate

We are working on a binary classification modeling project and have developed two different models. The first model (Model A) has an Area under the ROC (AUROC) of 0.73, and the second model (Model B) has an Area under ROC of 0.43. Which model should we select if we are using AUROC as our performance evaluation criteria?

Model A

Correct. Model A has the higher AUROC score of 0.73

One of the common ways of evaluating classification models is using what are called ROC, or receiver operating characteristic, curves. An ROC curve plots the true positive rate versus the false positive rate for a variety of different threshold values. So what is the threshold value? Most classification models return to us, rather than a discrete prediction of a one or a zero, the probability of predicting each class: the probability that a certain data point would be a one (positive) or a zero (negative). In order to convert these probabilities into discrete predictions of a zero or one, we have to set some threshold, and we say that if our probability is higher than the threshold, we generate the prediction of that class. For example, the default threshold is typically set at 0.5. If we generate a probability which happens to be 0.7, greater than our threshold of 0.5, we'll predict a one, the positive class. And if we generate a probability of 0.3, less than our threshold of 0.5, our prediction is a zero, or the negative class.

So, to convert these probabilistic model outputs into discrete outputs, we begin by setting a threshold. In the example on the slide, let's start with a threshold value of 0.3. For our first data point, our model outputs 0.85, clearly higher than our threshold of 0.3, so we generate a one as our prediction. Our second model output is lower than 0.3, so we generate a zero, then a one, another one, and finally one more one. We then take these predictions and calculate the true positive rate and false positive rate by comparing our predictions to the actual targets. We then change our threshold to 0.5, recalculate the predictions by comparing the model outputs to our new threshold value of 0.5, calculate the TPR and FPR, and repeat. Once we've done this several times, we can then plot these points on a graph of TPR versus FPR and connect the points to form a curve. This curve is called the ROC curve.

A common error metric for classification models associated with the ROC curve is what's called AUROC, or area under the ROC curve. As the name suggests, the way we calculate AUROC is simply by taking the area under the ROC curve that we've plotted. In the case of a perfect classification model, no matter what threshold value we select, we're always going to have a true positive rate of one and a false positive rate of zero. So our perfect classifier's point on the ROC curve would sit at a TPR of one and an FPR of zero, and if we calculate the area under that curve, it would simply be one. On the other hand, if we were to take a model which simply randomly guessed between zero and one, we would expect that for every threshold value our true positive rate would be equal to our false positive rate. When we plot that on the ROC curve, we would see a straight line from (0, 0) to (1, 1), and when we calculate the area under that straight line, it would be equal to roughly 0.5. Therefore, for real-world models that we generate, we would expect the area under the ROC curve to fall between 0.5, which is randomly guessing between zero and one, and an upper bound of one, indicating a perfect model. Higher AUROC values generally indicate better quality models.

Another evaluation technique that we can use is what's called the precision-recall curve, or PR curve. A PR curve is a plot of the precision versus the recall value for a model as we change the threshold value. PR curves are especially useful relative to ROC curves when we have situations with high class imbalance, for example when we have a lot of zeros and only a few ones. Think back to the situation we discussed earlier, when we were creating a model to predict patients with heart disease; that was a clear case of high class imbalance, with a very large number of patients without heart disease relative to the few patients with heart disease. Precision-recall curves, unlike ROC curves, do not factor in true negatives. Therefore, for situations like the model that we discussed, they're not biased by the fact that our model was successfully able to predict that a lot of patients do not have heart disease. So when we have a clear situation of class imbalance, we'll often choose to use a precision-recall curve rather than the ROC curve to evaluate our model.
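As a worked illustration of the threshold sweep described above (the probabilities and labels here are assumed, not the ones from the slide), lowering the threshold converts more points into predicted positives, which tends to raise both the TPR and the FPR:

```python
# Illustrative threshold sweep with made-up labels and predicted probabilities:
# convert probabilities to class labels at two thresholds and compare TPR/FPR.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
proba  = np.array([0.85, 0.20, 0.40, 0.65, 0.45, 0.10, 0.55, 0.35, 0.05, 0.30])

for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    print(f"threshold={threshold}: TPR={tp / (tp + fn):.2f}, FPR={fp / (fp + tn):.2f}")
# Lowering the threshold classifies more points as positive,
# so both TPR and FPR tend to rise.
```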

Review


Video: Troubleshooting Model Performance

Why Models Underperform: 5 Key Reasons

  1. Problem Framing and Metrics:
    • Crucial: Have you defined the problem correctly, and do your evaluation metrics accurately reflect what success looks like?
    • Example: Predicting outage severity across an entire region proved more useful than predicting outages per town.
  2. Data Quality and Quantity
    • Garbage In, Garbage Out: Insufficient, unclean, or outlier-filled data will inherently limit model performance, no matter how advanced your model is.
  3. Feature Engineering
    • Key Features: Did you miss essential features that strongly influence the outcome? Consult domain experts to ensure you’re including all relevant aspects.
  4. Model Fit
    • Experiment and Tune: Have you tested different algorithms and optimized their hyperparameters? Avoid underfitting (too simple) or overfitting (too complex).
  5. Inherent Error
    • Realistic Expectations: Real-world problems are complex. Even the best models cannot achieve 100% accuracy due to natural variation and noise.

Debugging Process

If your model isn’t performing well, investigate in this order:

  1. Problem framing and metrics
  2. Data quality and quantity
  3. Feature engineering
  4. Model fit
  5. Understand the limits of inherent error

No matter how good we are at building machine learning models, it's inevitable that we'll run into situations where a model we've created is not performing as well as we'd like it to. There are many reasons why models don't perform as well as we'd like, and debugging poorly performing models can be a real challenge. In this lesson we'll dive into some of the key reasons why models don't perform as well as expected. I've listed the five primary reasons here in order of severity. The first, and probably most important, source of error to look at is: are you properly framing and understanding the problem that you're trying to solve, and have you chosen the right metrics for that problem? Secondly, you have to look at the data that you have: the quantity and the quality of the data. You also have to consider the features: have you properly defined features, and have you included enough features of your data to explain the output of your model? Model fit can sometimes be an issue when we haven't done a proper job of training a model to the data that we have. And finally, in every problem in the real world there's always some amount of inherent error; no matter how good a job of collecting data and modeling we do, we're just not able to fully model the phenomenon that we're trying to describe.

If your model is not performing as well as you'd like it to, the first place to look is whether you've correctly framed the problem that you're trying to model and chosen the right metric to evaluate success in modeling that problem. Let's look at an example from a project that I was formerly involved in, building a tool for electric utilities to predict the severity of power outages in advance of severe weather events. When we first started on this project, we framed the problem in a way that focused on building a regression model to predict the number of power outages occurring within every town across the utility's territory. We really struggled to get a model that was accurate enough to do this. As we engaged with more and more utility customers, we learned that it really wasn't all that important to be able to predict the exact number of outages occurring within every town. Instead, what the utilities really cared about was that we were able to predict the expected severity of an event across the utility's entire aggregate territory, on a scale from 1 to 5, which was a scale that they commonly used in their operations, with 1 being a low-impact event with very few outages and 5 being a very high-impact event with widespread outages across the territory. So rather than building a regression model to predict the number of outages in each individual town, we pivoted and focused on a classification approach, where the objective of the model was to classify, for each upcoming severe weather event, whether it was going to fall at a 1, 2, 3, 4, or 5 on the utility's scale. By correctly classifying where that event was going to fall on the scale, the model provided incredibly valuable information to the utility that they could then use to prepare for the upcoming storm.

A second, very common source of error in modeling is having sufficient data to build a model in the first place. If you don't have a sufficient quantity of data, or if the quality of the data is not good, for example if you have a lot of missing data, your data is not clean, or you have a large number of outliers within your data, it's going to severely limit the performance that we can expect to achieve from a model that we build. No matter how much work we do on the modeling part itself, if the data is not of sufficient quality, our model outputs are also not going to be very good.

Along with considering the quantity and the quality of our data, we also have to consider whether we've defined the right features of the data to include in our model. Defining features can really be a challenge, and we often try to engage domain experts in this process to make sure that we properly define all the characteristics, or features, of our data that we need to explain the output that we're trying to predict. If we build a model but fail to include a couple of key features that are really important in predicting the output, the quality of our model is going to end up not being very good.

If we've done a good job defining the problem, if we have a sufficient quantity and quality of data, and if we've included all the features that we need to describe the output we're trying to predict, the next thing to look at is the model fit itself. Have we tried multiple types of algorithms? Have we adjusted the hyperparameters for each of those algorithms in order to find a model that has an optimal fit and the right balance of simplicity and complexity, so that our model is neither underfitting nor overfitting to the data that we have available?

And finally, if we've gone through all of those things and we're still struggling with the performance of our model, it's also important to consider that every problem in the real world is complex and noisy and has a certain level of inherent error. It's likely we'll never build a model that can achieve 100% or even 99% accuracy on a real-world phenomenon; the level of inherent error and noise found in nature is just too high for that. So the inherent error can also set an upper bound on the performance of the model that we're trying to build, and it's important to keep in mind that this upper bound can exist.

Video: Module Wrap-up

The Importance of Metrics

  • Defining Success: The right metrics are essential for defining what a successful machine learning project looks like, both from a business value perspective (outcome metrics) and a technical performance perspective (output metrics).
  • No Universal Solution: There’s no single metric that works for every project. Your choices should be tailored to the specific problem you’re solving.

Considerations for Metric Selection

  • Regression Example: Consider whether it’s worse to have a few very large errors or many smaller errors. This will influence your choice between metrics like mean squared error or mean absolute error.
  • Classification Example: Are false positives or false negatives more harmful? This will guide your choice between metrics like precision and recall.

Key Takeaways

  • Choosing the right metrics is crucial for a successful machine learning project.
  • There’s no one-size-fits-all answer – metrics must be chosen based on the specific problem and its context.
  • Understanding the trade-offs between different metrics allows for informed decisions.

In this module, we talked about how to evaluate the performance of machine learning models, and we covered some of the popular metrics used to evaluate regression and classification models. Selecting the proper outcome and output metrics for a machine learning project is really one of the keys to making a successful project. If we don't do a good job defining the business impact that we're expecting to achieve for our customers or our own organization, and how we can measure that using outcome metrics, and then also defining the technical performance metrics to evaluate our machine learning model, which we call the output metrics, it's really impossible to have a successful machine learning project, no matter how much work we do on the data or on the modeling itself. If our metrics are wrong, we're not going to be able to achieve the outcomes that we expected.

Our choices for the metrics to use should reflect the nature of the problem and also the consequences of the model being wrong on the problem that we're trying to solve. There is no one-size-fits-all when it comes to metrics; it really depends on the particular problem. For example, in regression problems, is it worse to be very wrong a couple of times or a little bit wrong a lot of times? That determination will impact whether we use a metric such as mean squared error or mean absolute error. For a classification type of problem, are false positives a lot worse than false negatives, or vice versa? Depending on the answer there, we may choose different metrics between precision and recall.

Quiz: Module 3 Quiz

Suppose we are building a recommendation system for an online retailer of home improvement products. Which of the below options would be the best outcome metric to use to evaluate the success of our project?

Outcome metrics are commonly stated in terms of (select all that apply):

When do we typically use model output metrics (select all that apply)?


To compare models and perform model selection

To evaluate the final model prior to deployment

To state the success criteria for solving the user problem we are trying to address

To monitor the ongoing performance of deployed models

What is one key difference between Mean Squared Error (MSE) and Mean Absolute Error (MAE)?

Why is Mean Absolute Percentage Error (MAPE) popular among non-technical audiences relative to MSE and MAE for quantifying the error of regression models?

Suppose I am building a classification model for a university to try to identify students likely to have Covid-19 (positives) vs. no Covid-19 (negatives) based on daily questionnaires about symptoms, and my main goal is to make sure I identify everyone with Covid-19 correctly in order to keep them from entering campus. I calculate the recall (or true positive rate) of my model as 92%. What does this number mean?

Which of the following is an incorrect statement regarding Receiver Operating Characteristic (ROC) curves?

I am building a model for an insurance company to predict which insured drivers will likely have a car accident within the next year. I calculate the accuracy of my model as 97.8%, and proclaim success. Why might my declaration of success be misleading? Keep in mind that only a small fraction of insured drivers have a car accident within a given year.

I am building a binary classification (positive/negative) model and use the typical threshold value of 0.5, meaning that if the output probability is above 0.5 the model predicts a positive, and if it is below 0.5 the model predicts a negative. What would happen to the recall of my model if I reduced my threshold to 0.3?

What is the value of displaying the confusion matrix for a classification model?