Week 3: Evaluating & Interpreting Models

In this module we will learn how to define appropriate outcome and output metrics for AI projects. We will then discuss key metrics for evaluating regression and classification models and how to select one for use. We will wrap up with a discussion of common sources of error in machine learning projects and how to troubleshoot poor performance.

Learning Objectives

  • Differentiate between outcome and output metrics
  • Apply metrics to evaluate the performance of regression models
  • Apply metrics to evaluate the performance of classification models

Metrics in ML Projects

Video: Introduction and Objectives

Why Model Evaluation Matters:

  • Evaluating a machine learning model goes beyond just looking at its accuracy.
  • We need to understand the types of errors a model makes and their implications for its real-world use.

Key Concepts

  • Outcome vs. Output Metrics:
    • Outcome Metrics: Measure success in terms of the real-world problem the model is trying to solve (e.g., increased sales, improved patient diagnosis).
    • Output Metrics: Assess the technical performance of the model itself (e.g., accuracy, precision, recall). These help refine the model but must be carefully connected to real-world outcomes.

Evaluating Regression Models:

  • Regression models predict numerical values.
  • Common metrics:
    • Mean Squared Error (MSE)
    • Root Mean Squared Error (RMSE)
    • R-squared

Evaluating Classification Models:

  • Classification models predict categories.
  • Common metrics:
    • Accuracy
    • Precision
    • Recall
    • F1-score
    • Confusion matrix (visualizes different types of errors)

Key Takeaways

  • Choosing the right metrics depends on the specific problem you’re solving and the trade-offs you’re willing to make between different types of errors.
  • Metrics help us understand the strengths and weaknesses of our models, leading to further refinement and better decision-making.

Video: Outcomes vs Outputs

Key Concepts

  • Outcome Metrics: Measure the desired business impact of the model or product (e.g., reduced costs, increased revenue, improved safety). They are often expressed in dollars or time.
  • Output Metrics: Evaluate the technical performance of the model itself (e.g., classification accuracy, regression error). They focus on how well the model predicts the target outcome.

The Relationship

  1. Start with Business Understanding: Define the problem and what success looks like (outcome metrics).
  2. Outcome Metrics Guide Output Metrics: Choose output metrics that align with and help achieve the desired outcomes.

Case Studies

  • Turbulence Prediction for Airlines
    • Outcome Metric: Reduced safety incidents or claims.
    • Output Metric: Classification metric (accuracy in predicting turbulence).
  • Power Demand Forecasting for Utilities
    • Outcome Metric: Lower cost or emissions per megawatt hour.
    • Output Metric: Regression metric (accuracy in predicting power demand).

Key Takeaways

  • Choosing the right metrics starts with a clear understanding of the business problem you are solving.
  • Outcome metrics communicate value to stakeholders, while output metrics guide model development and refinement.

Which of the following are examples of potential "outcome" metrics for a machine learning project (select all that apply)?
  • Additional sales revenue (in $/day)
  • Time saved (in minutes/day)

Video: Model Output Metrics

What are Output Metrics?

  • Output metrics directly measure the technical performance of a machine learning model (e.g., classification accuracy, regression error).

Key Stages where Output Metrics are Critical

  1. Model Comparison: During the model building process, output metrics calculated on a validation set (or through cross-validation) help you compare and select the best-performing model.
  2. Final Evaluation: Before deploying your chosen model, you evaluate its performance on a held-out test set using output metrics. This gives you an unbiased estimate of real-world performance.
  3. Ongoing Monitoring: After deployment, you continuously track output metrics to monitor your model’s performance in production. This helps detect performance degradation over time.

Key Takeaway: Output metrics play a vital role throughout the machine learning process, guiding model selection, ensuring pre-deployment quality, and signaling when a model might need retraining or adjustment in a live setting.
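
The first two stages can be sketched in a few lines of Python. This is a minimal illustration and not a prescribed workflow: the synthetic data, the 60/20/20 split, the two candidate models (`fit_mean`, `fit_line`) and the choice of mean absolute error are all assumptions made for the example.

```python
import random

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def fit_mean(train):
    """Hypothetical baseline: always predict the mean target of the training rows."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Hypothetical candidate: least-squares line through the training rows."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def evaluate(model, rows):
    """Output metric (MAE) of a fitted model on a set of (x, y) rows."""
    return mae([y for _, y in rows], [model(x) for x, _ in rows])

# Synthetic data (an assumption for illustration): y is roughly 2x + 1 plus noise.
random.seed(0)
data = [(float(x), 2.0 * x + 1.0 + random.gauss(0, 1.0)) for x in range(100)]
random.shuffle(data)
train, validation, test = data[:60], data[60:80], data[80:]

# Stage 1: compare candidates on the validation set and keep the best one.
candidates = {"mean": fit_mean(train), "line": fit_line(train)}
val_scores = {name: evaluate(model, validation) for name, model in candidates.items()}
best_name = min(val_scores, key=val_scores.get)

# Stage 2: report the chosen model's metric once, on the held-out test set.
test_score = evaluate(candidates[best_name], test)
print(best_name, val_scores, test_score)
```

The test set is touched only once, after the choice is made, which is what keeps the final estimate honest. Stage 3 would recompute the same metric on fresh production data at regular intervals.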

Regression and Classification Metrics

Video: Regression Error Metrics

Common Regression Metrics

  • Mean Squared Error (MSE): Calculates the average squared difference between actual and predicted values.
    • Pros: Emphasizes large errors (outliers).
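    • Cons: Scale-dependent and expressed in squared units, so it is hard to interpret and hard to compare between datasets.
  • Root Mean Squared Error (RMSE): The square root of MSE, which puts the error back in the same units as the target variable and makes it easier to interpret.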
  • Mean Absolute Error (MAE): Calculates the average absolute difference between actual and predicted values.
    • Pros: Less sensitive to outliers than MSE, easier to interpret.
    • Cons: Still scale-dependent.
  • Mean Absolute Percent Error (MAPE): Expresses error as a percentage of the actual values.
    • Pros: Easy for non-technical audiences to understand.
    • Cons: Can be skewed by small actual values leading to large percentage errors.
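
The four metrics above can be computed directly from their definitions. The following is a minimal sketch using only the standard library; the function name `regression_metrics` and the sample values are assumptions made for illustration.

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and MAPE (in percent) for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by each actual value, so it is undefined if any actual is 0
    # and becomes very large when actual values are close to 0.
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "mape": mape}

# Illustrative values only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 380.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Here the errors are -10, 10 and 20, so MSE is 200 (in squared units), RMSE is about 14.1 and MAE is about 13.3 (both in the units of the target), and MAPE is about 6.7%.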

Which Metric to Choose

  • If large outliers are particularly bad, MSE/RMSE might be preferable.
  • If every error should count in proportion to its size, so that many small errors matter as much as a few large ones, MAE could be a better choice.
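
A small numeric example shows how the two metrics can disagree about which model is better. The two error lists below are invented for illustration: hypothetical Model A is off by 3 on every point, while hypothetical Model B is off by 1 on nine points and by 15 on a single outlier.

```python
def mse(errors):
    """Mean squared error from a list of prediction errors."""
    return sum(e * e for e in errors) / len(errors)

def mae(errors):
    """Mean absolute error from a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

# Invented errors for two hypothetical models on the same 10 data points.
errors_a = [3.0] * 10           # consistently moderate errors
errors_b = [1.0] * 9 + [15.0]   # mostly small errors, one large miss

print("Model A: MAE", mae(errors_a), "MSE", mse(errors_a))
print("Model B: MAE", mae(errors_b), "MSE", mse(errors_b))
```

MAE prefers Model B (2.4 vs. 3.0) because its typical error is smaller, while MSE prefers Model A (9.0 vs. 23.4) because squaring makes the single large miss dominate. Which ranking is "right" depends on how costly a large error is in the application.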

Understanding R-Squared

  • R-Squared (Coefficient of Determination): Measures the proportion of variance in the target variable that’s explained by the model.
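  • Range: Usually between 0 and 1
    • 1 = Perfect model
    • 0 = Model explains none of the variance (no better than always predicting the mean of the target)
    • Negative values are possible, for example on held-out data, when the model fits worse than always predicting the mean
  • Calculation: 1 - (SSE / SST), where SSE is the sum of squared errors (actual minus predicted) and SST is the total sum of squares (actual minus the mean of the actual values).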
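
The calculation can be written out as a short function. This is a sketch with made-up data; `r_squared` is a name chosen for this example. The third case illustrates that the formula returns a negative value when predictions are worse than simply predicting the mean.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - SSE / SST for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # sum of squared errors
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    if sst == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1 - sse / sst

# Made-up values for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.1, 1.9, 3.2, 3.8]   # small errors
mean_only = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean of actual
reversed_fit = [4.0, 3.0, 2.0, 1.0]  # systematically wrong

print(r_squared(actual, close_fit))     # close to 1
print(r_squared(actual, mean_only))     # 0.0
print(r_squared(actual, reversed_fit))  # -3.0
```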

We are working on a regression modeling project for a client who is very concerned about minimizing any particularly large errors of the model on outlier datapoints. To compare multiple versions of our model and select a final version for the project, which metric would we most likely use?

Mean squared error

Correct. MSE penalizes large errors more heavily than MAE, and so will be a better metric for comparing models if we want to minimize infrequent large errors.

We would like to compare two models using R-squared as a metric. Model A has an R-squared value of 0.85, and Model B has an R-squared value of 0.2. Which model does a better job explaining the variability in our target variable?

Model A

Correct. Model A has a much higher R-squared value, meaning it does a better job explaining the variability in the target variable.

Video: Classification Error Metrics: Confusion Matrix

Metrics for Classification Problems

  • Accuracy: The proportion of correct predictions out of total predictions. Simple, but it can be misleading in situations such as:
    • Class Imbalance: One category (e.g., “no heart disease”) is significantly more frequent than the others. A model that always predicts the majority class can appear very accurate while being useless.
  • Confusion Matrix: A table visually laying out True Positives, True Negatives, False Positives, and False Negatives. This helps with more nuanced metrics:
  • True Positive Rate (TPR), Recall, or Sensitivity: Measures how many actual positives were correctly identified. TPR = True Positives / (True Positives + False Negatives)
  • False Positive Rate (FPR): How many actual negatives were wrongly classified as positive. FPR = False Positives / (False Positives + True Negatives)
  • Precision: Out of values predicted as positive, how many were truly positive? Precision = True Positives / (True Positives + False Positives)
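
The sketch below counts the four confusion-matrix cells and derives the metrics above from them. The data is invented (100 patients, 5 of whom have the disease), the helper names are assumptions, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for binary labels where 1 = positive, 0 = negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, recall (TPR), FPR and precision; 0.0 when a denominator is zero."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }

# Invented, imbalanced data: 5 positives followed by 95 negatives.
actual = [1] * 5 + [0] * 95

# A "model" that always predicts the majority (negative) class.
always_negative = [0] * 100
# A model that finds 4 of the 5 positives but raises 10 false alarms.
imperfect_model = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

baseline = classification_metrics(*confusion_counts(actual, always_negative))
model = classification_metrics(*confusion_counts(actual, imperfect_model))
print(baseline)  # accuracy 0.95, but recall 0.0
print(model)     # accuracy 0.89, recall 0.8
```

The always-negative baseline has the higher accuracy (0.95 vs. 0.89) yet finds none of the positives, which is exactly the failure mode that accuracy hides under class imbalance.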

When to Use Which

  • Class Imbalance: Don’t rely on accuracy alone; also look at Precision, Recall, and FPR. Consider the real-world consequences of false negatives vs. false positives.
  • Multi-class Problems: Use a multi-class confusion matrix to see where your model confuses specific categories. Calculate metrics for each individual class or use macro-averages for an overall picture.
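
For the multi-class case, a confusion matrix can be kept as a count of (actual, predicted) pairs, and per-class recall can be macro-averaged. This is a sketch with invented labels; the function names are assumptions made for illustration.

```python
from collections import Counter

def per_class_recall(actual, predicted):
    """Recall for each class: correct predictions of the class / actual members of it."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / totals[label] for label in totals}

def macro_average(per_class):
    """Unweighted mean of per-class scores, so every class counts equally."""
    return sum(per_class.values()) / len(per_class)

# Invented, imbalanced three-class data.
actual = ["cat"] * 6 + ["dog"] * 3 + ["bird"] * 1
predicted = ["cat"] * 6 + ["dog", "dog", "cat"] + ["cat"]

# Multi-class confusion matrix as counts of (actual, predicted) pairs.
confusion = Counter(zip(actual, predicted))
recalls = per_class_recall(actual, predicted)
accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

print(confusion)
print(recalls, macro_average(recalls), accuracy)
```

Overall accuracy is 0.8, but the macro-averaged recall is only about 0.56 because the single "bird" example is always misclassified as "cat". The confusion counts show exactly which pairs of classes the model mixes up.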

Suppose we are building a model to predict which days of the year a normal, healthy person will have a common cold. Would accuracy be the best choice of metric to evaluate our model?

No

Correct. We have a high class imbalance in this case, because on most days of the year a healthy person does not have a cold, so accuracy is probably not the best choice of metric.

We are building a classification model to identify manufacturing defects (a "positive" in our model) from among parts coming off a manufacturing line. In our test set we have 1000 images of parts. 50 of the 1000 contain defects ("positives") and the remaining images do not. Our model successfully identifies 40 true positives. What is the recall of our model?

80%

Correct. Our model successfully identified 40 out of the 50 positives (parts with defects), and so our recall is 40/50 = 80%.


As in the previous question, we are building a model to identify defects ("positives" in this case) within products coming off a manufacturing line. We test our model on a test set of 1000 images of products coming off the line. Our model predicted that 50 of the images were positives (had defects) and the remaining 950 had no defects. We compare our model's predictions to the actual labels and determine that our model had 45 true positives. What was the precision of our model on the test set?

90%

Correct. Our model predicted 50 positives and of those 45 were true positives, so our precision was 45/50 = 90%.


Video: Classification Error Metrics: ROC and PR Curves

ROC (Receiver Operating Characteristic) Curves

  • What they plot: True Positive Rate (TPR) vs. False Positive Rate (FPR) across different classification thresholds.
  • Thresholds: The probability cutoff for deciding if a prediction is positive or negative (e.g., above 0.5 probability = positive class).
  • AUROC: Area Under the ROC Curve. Summarizes model performance:
    • Perfect classifier = 1.0
    • Random guessing = 0.5
    • Higher AUROC = generally better model
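
To make the construction concrete, the sketch below sweeps the threshold over a handful of invented scores, records one (FPR, TPR) point per threshold, and estimates the area under the resulting curve with the trapezoid rule. The function names and data are assumptions for illustration, and a prediction counts as positive when its score is greater than or equal to the threshold.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, from the strictest threshold to the loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auroc(points):
    """Area under the ROC curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Invented labels (1 = positive) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points)
print(auroc(points))  # 8/9, about 0.889
```

The result, 8/9, matches another reading of AUROC: of the 9 possible (positive, negative) pairs in this data, the positive example has the higher score in 8.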

Precision-Recall (PR) Curves

  • What they plot: Precision vs. Recall (same as True Positive Rate) across different thresholds.
  • Why they’re useful: Especially good for scenarios with class imbalance (e.g., few positives, many negatives). This is because PR curves focus on correctly identifying positives and don’t get inflated by correctly identifying easy negatives.
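
Each point on a PR curve is the (precision, recall) pair obtained at one threshold. The sketch below computes three such points on invented, imbalanced data (2 positives among 10 examples). The function name is hypothetical, and treating precision as 1.0 when nothing is predicted positive is a convention assumed for this example.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # assumed convention
    recall = tp / (tp + fn)
    return precision, recall

# Invented data: 2 positives, 8 negatives.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]

for threshold in (0.7, 0.5, 0.25):
    precision, recall = precision_recall_at(labels, scores, threshold)
    print(threshold, round(precision, 3), round(recall, 3))
```

Lowering the threshold from 0.7 to 0.5 raises recall from 0.5 to 1.0, and lowering it further to 0.25 only adds false positives, dropping precision to about 0.33. The many correctly rejected negatives never enter either formula, which is why the curve is not flattered by an easy majority class.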

Key Takeaways

  • Both ROC and PR curves help visualize model performance by plotting how it changes with different thresholds.
  • AUROC is a common summary metric, but PR curves are often more informative when dealing with class imbalance.
  • Model evaluation is about choosing the right tool for the context of your problem!

What most likely happens to the recall / true positive rate of our model if we decrease the threshold value from the default of 0.5 to a value of 0.3?

It goes up

Correct. Our model would classify more points as positives, likely increasing the TPR / recall.

What most likely happens to the false positive rate of our model if we decrease the threshold from the default of 0.5 to a value of 0.3?

It goes up

Correct. Our model will classify more points as positives, which will likely increase the false positive rate.

We are working on a binary classification modeling project and have developed two different models. The first model (Model A) has an Area Under the ROC Curve (AUROC) of 0.73, and the second model (Model B) has an AUROC of 0.43. Which model should we select if we are using AUROC as our performance evaluation criterion?

Model A

Correct. Model A has the higher AUROC score of 0.73.

Video: Troubleshooting Model Performance

Why Models Underperform: 5 Key Reasons

  1. Problem Framing and Metrics:
    • Crucial: Have you defined the problem correctly, and do your evaluation metrics accurately reflect what success looks like?
    • Example: Predicting outage severity across an entire region proved more useful than predicting outages per town.
  2. Data Quality and Quantity
    • Garbage In, Garbage Out: Insufficient, unclean, or outlier-filled data will inherently limit model performance, no matter how advanced your model is.
  3. Feature Engineering
    • Key Features: Did you miss essential features that strongly influence the outcome? Consult domain experts to ensure you’re including all relevant aspects.
  4. Model Fit
    • Experiment and Tune: Have you tested different algorithms and optimized their hyperparameters? Avoid underfitting (too simple) or overfitting (too complex).
  5. Inherent Error
    • Realistic Expectations: Real-world problems are complex. Even the best models cannot achieve 100% accuracy due to natural variation and noise.

Debugging Process

If your model isn’t performing well, investigate in this order:

  1. Problem framing and metrics
  2. Data quality and quantity
  3. Feature engineering
  4. Model fit
  5. Understand the limits of inherent error

Video: Module Wrap-up

The Importance of Metrics

  • Defining Success: The right metrics are essential for defining what a successful machine learning project looks like, both from a business value perspective (outcome metrics) and a technical performance perspective (output metrics).
  • No Universal Solution: There’s no single metric that works for every project. Your choices should be tailored to the specific problem you’re solving.

Considerations for Metric Selection

  • Regression Example: Consider whether it’s worse to have a few very large errors or many smaller errors. This will influence your choice between metrics like mean squared error or mean absolute error.
  • Classification Example: Are false positives or false negatives more harmful? This will guide your choice between metrics like precision and recall.

Key Takeaways

  • Choosing the right metrics is crucial for a successful machine learning project.
  • There’s no one-size-fits-all answer – metrics must be chosen based on the specific problem and its context.
  • Understanding the trade-offs between different metrics allows for informed decisions.

