
Week 2: The Modeling Process

In this module we will discuss the key steps in the process of building machine learning models. We will learn about the sources of model complexity and how complexity impacts a model’s performance. We will wrap up with a discussion of strategies for comparing different models to select the optimal model for production.

Learning Objectives

  • Describe the steps to develop a machine learning model
  • Explain the bias-variance tradeoff
  • Identify possible sources of data leakage and strategies to prevent it

Building a Model


Video: Introduction and Objectives

Module 2 Overview

This module dives into the process of building and training machine learning models, along with these key challenges and considerations:

  • Steps in Model Development: The module will outline the iterative process of creating a machine learning model, from data preparation to deployment.
  • The Bias-Variance Tradeoff: This fundamental concept explains the balance between a model being too simple (underfit) or too complex (overfit). Finding the right balance is crucial.
  • Data Leakage: This refers to accidentally using information during training that won’t be available in real-world use, leading to misleadingly good results. You’ll learn how to spot and prevent this.

Learning Objectives

By the end of the module, you should be able to:

  • Describe the steps involved in developing a machine learning model.
  • Understand and explain the bias-variance tradeoff.
  • Identify sources of data leakage and devise strategies to avoid it.

Welcome to Module 2. In this module, we're going to talk about the process of developing and training machine learning models. We'll also talk a little bit more about some of the things that make that so challenging, and we'll discuss strategies for model selection and evaluating models. At the conclusion of this module, you should be able to describe the steps to develop a machine learning model; you should understand what the bias-variance tradeoff is and why that makes machine learning so difficult; and finally, you should be able to identify possible sources of data leakage in developing models, understand what that means, and identify possible strategies to prevent it in modeling.

Reading: Module Slides

Video: Building a Model

Understanding the Problem

  • Model building is part of a larger problem-solving process (like the CRISP-DM framework).
  • Start with a crystal clear understanding of the business problem and what success looks like. This drives all later steps.

The Modeling Process

  1. Data Gathering: Collect past observations (e.g., houses sold) with their features (bedrooms, size) and target values (sale price).
  2. Feature Selection: Choose the most important house characteristics that influence the sale price.
  3. Algorithm Choice: Select a machine learning algorithm as a template for your model (this defines how your model relates inputs to outputs).
  4. Hyperparameter Tuning: Adjust the ‘dials’ within the algorithm to fine-tune performance.
  5. Loss Function: Use a loss function to measure your model’s accuracy during training.
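
The five steps above map naturally onto a short scikit-learn workflow. The sketch below is a minimal illustration of the house-price example, assuming scikit-learn and NumPy are available; the data and feature names are made up for illustration, not taken from the course.

```python
# Minimal sketch of the modeling steps (illustrative synthetic data, scikit-learn assumed).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# 1. Data gathering: past observations (features) and their targets (sale prices).
rng = np.random.default_rng(0)
X = rng.uniform([1, 500], [5, 4000], size=(200, 2))                 # bedrooms, square feet
y = 50_000 * X[:, 0] + 120 * X[:, 1] + rng.normal(0, 20_000, 200)   # sale price

# 2. Feature selection: here we simply keep both columns.
features = ["bedrooms", "square_feet"]

# 3. Algorithm choice: ridge regression is the "template" relating inputs to output.
# 4. Hyperparameter tuning: alpha is a dial we can adjust to change model behavior.
model = Ridge(alpha=1.0)

# Training learns the coefficients that best relate the features to the target.
model.fit(X, y)

# 5. Loss function: the algorithm minimizes squared error internally; we can report it too.
print("Training MSE:", mean_squared_error(y, model.predict(X)))
print("Learned coefficients:", dict(zip(features, model.coef_)))
```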

Key Points

  • Not Linear: This is an iterative process. You might revisit feature selection, try different algorithms, etc., as you improve your model.
  • Data-Driven: Early work in understanding the problem and gathering the right data is vital to building a good model.

What provides the overall form or template for our machine learning model?

The algorithm we select to use

Correct. The choice of algorithm determines the overall form of the model we build

In this lesson, we’re going to talk
about the process of building a model. I'd first like to set that, though, in the broader context of solving problems using machine learning. We're going to initially focus on what's called the CRISP-DM process, which is a cross-industry standard
process for applying data science and machine learning to solve problems. The process starts with step one
that’s called Business Understanding, which focuses on having
a good solid understanding of the problem that you’re
trying to solve and what defines success in solving the
problem and how we might measure success. Step two focuses on collecting, organizing, and identifying the data that you
need to solve that problem. And step three,
we then prepare the data for modeling. Step four, we build our model and
then we continue to evaluate our model and finally deploy our final model. The entire process is really an iterative process in the sense that we may move forward a step or two, we may discover or
learn things as we go and then we may move back to previous
steps and iterate over and over again. Although we’re going to talk
today primarily about modeling, which is contained in
step four of the process. The efforts to build models
really start with step one. If we don’t understand the problem
that we’re trying to solve and have a concrete grasp of what to find
success in solving that problem, it’s impossible to build a good modeling,
no matter how much data you collect or how much effort you put
into developing your model. So the modeling process actually starts
with that step one business understanding, making sure we really have
a firm grasp of our problem and then continues on all the steps
up to step four, modeling: making sure we're collecting enough data,
making sure we’re collecting the right data and organizing it in the right set of
features, making sure we’ve cleaned and prepared our data and then finally
building the model itself in step four. So now let’s dive in a little bit
deeper into step four and talk about how we create a model. We start by collecting a set of past
observations and the associated targets. So using the example that
we have discussed before, our past observations would be a set
of houses that have been for sale and associated feature values of each of those
houses, such as the number of bedrooms or the square footage of the house. The targets would be the price
that that home had sold for. We feed all of that data into our model
in a process called model training. So what are we actually doing
when we train the model? The model is represented by an equation or
a set of equations which relates the input, the past observations, to
the output, the target or the sale price. When we train a model, we’re learning or
identifying the optimal coefficient values in that equation
or set of equations which defines the optimal relationship between
the input features and the output target. Once we define that relationship as
expressed through an equation or set of equations, we can use that model
and we can feed new data, in this case new houses and the associated feature values of those new houses, into the model. And as an output, we'll be able to make a
prediction of the expected sale price for each of those new houses. In the previous lesson, we talked
about the four components of a model. So let’s now bring them back and
see where they come in. The first component of the model
was the selection of features. So this is done in the earlier phases of
gathering and preparing our data through a process called feature selection and
feature engineering. And the objective of that process is to
identify which features of our data in this case, which characteristics of
a house have the most impact or the most value in terms of being able to predict
the target value or the sale price. As we start to build our model, we then have to make decisions about
which algorithm we may want to use. There are many different
machine learning algorithms and we’ll talk about some of those
algorithms in later lessons. You can think of an algorithm as
a template for the equation or the relationship that the model follows
to relate the input to the output. Once we've chosen the template, we then
have a set of individual parameters associated with that template and
we can tweak or tune these parameters. So think of these as dials in our
algorithm that we can tune or adjust a little bit to make the model perform a little bit better or a little bit worse, depending on which direction we turn these hyperparameters, in terms of predicting the output. And finally, the fourth component
was defining a loss or sometimes called a cost function. And we use our loss function as a way to
evaluate the performance of our model as we’re building the model itself. So as we choose the algorithm as we select
the features that we want to use for a model as we tune or tweet those
values of our hyper parameters. We use the cost function to tell us
whether we’re moving things in a positive direction or
we’re actually making things worse. So now let’s recap
the entire modeling process. And the process starts with gathering
the data that we think we need to solve our problem and build a model. And then we have to select the features of
the data that we’re going to use as part of our model. We then choose an algorithm
which acts as a template for defining the relationship between
the input, the observations, each of which has a set of features we’ve
identified and the output target value. Once we’ve chosen our template or the algorithm that we’re going to use,
we then set values for hyperparameters. We train our model using the past
observations we’ve collected and the target values. And we evaluate
the performance of our model. And the important thing to note is this
process is not a simple linear process. This process is actually highly iterative
in the sense that we'll go through these steps, we'll evaluate the performance of our model, we'll back up, and we'll make changes. For example, we may adjust the set of features we decided to use. We may try a new algorithm. We may adjust the hyperparameters that we're using, retrain, and then evaluate again until we're
happy with the end result and satisfied with the performance of
the model that we’ve developed.

Video: Feature Selection

What is Feature Selection?

  • The most important step in model building.
  • It involves choosing the characteristics of your data (features) that best help the model predict the desired output (target).
  • Example: For predicting house prices, features could include bedrooms, bathrooms, location, year built, etc.

Defining Features

  • Features are found where two things intersect:
    1. Factors that might influence the problem you want to solve.
    2. The data you have available or could collect.
  • Example: For power outages, factors include wind gusts, precipitation, season, etc. Some data might be harder to get than others.

Feature Selection Methods

  1. Expert Interviews: Talk to people who know the problem domain well. They can guide your choices (e.g., utility experts for the power outage case).
  2. Visualization: Plot individual features against the output to see if there are clear relationships.
  3. Statistical Tests: Measure the correlation between features and the output to find the strongest links.
  4. Model-Based Analysis: During model training, see which features the model finds most predictive. Eliminate features that have little impact.
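
Two of these methods, statistical tests (3) and model-based analysis (4), are easy to prototype. The sketch below is a rough illustration assuming scikit-learn and pandas; the column names and data are hypothetical stand-ins for the power-outage example, not the actual product data.

```python
# Rough sketch of statistical and model-based feature selection (hypothetical data).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "wind_gust_mph": rng.uniform(0, 80, 500),
    "precip_inches": rng.uniform(0, 5, 500),
    "month": rng.integers(1, 13, 500),
})
# Hypothetical target: outages driven mostly by wind gusts, plus noise.
df["outages"] = 3.0 * df["wind_gust_mph"] + rng.normal(0, 20, 500)

# Statistical test: correlation of each candidate feature with the target.
print(df.corr()["outages"].sort_values(ascending=False))

# Model-based analysis: train a model and inspect which features it relies on most.
X, y = df.drop(columns="outages"), df["outages"]
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, importance in zip(X.columns, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```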

Key Advice

  • Start Broad: It’s easier to remove features than to discover ones you missed. Collect as much data and consider as many features as possible initially.
  • Prioritize Expertise: Nothing beats the knowledge of those deeply familiar with the problem you’re trying to solve.

Suppose we are building a supervised machine learning model to predict students' grades on the final exam for a math course based on their activity throughout the semester. Which of the following might we use as features for the model (select all that apply)?
  • Students’ scores on weekly quizzes and homework assignments
  • Students’ attendance at the weekly lectures

We’re now going to
dive in a little bit deeper to each of the
steps in building a model. We’ll start with
feature selection. In my experience, feature
selection is actually the most important step
of building a model. Feature selection focuses
on identifying the set or subset of features or
characteristics of the data that we’ll use to
build and train our model. First let’s recap,
what are features? Features are characteristics
of our data. In example we’ve been
using of houses for sale, features would
include things like the number of bedrooms
or bathrooms, the neighborhood the
house is located in, or the school district
that it may be in, or the year the house was built. There are many possible
features that we could use in training a model. Our task in the feature
selection phase is to identify
which features have the most value in terms of
training the model to be able to accurately predict the
output target given the inputs. How do we define
features of our data? We define features really as the intersection
of two things. Number 1 is, what
are the factors that might influence the
problem we’re trying to solve? Again, in terms of
predicting sale prices, factors that might
influence the problem would be characteristics
of the house itself. They would include time-related factors, like the year the house was built or the year the house was sold. They might also be characteristics or factors about the neighborhood, or the broader area, or the city that the
house is located in. We have a large set of
possible factors that might influence our problem of
predicting a house price. The second thing we have to
consider is what data do we have available or what data
might we be able to collect? In the case of houses for sale, it’s generally pretty easy
to collect a lot of data about the houses themselves
or the local area. But sometimes when we
work with problems, it may be much harder to try
to collect certain pieces of information about factors that
can influence the problem. Generally, we’re
trying to define features as the intersection of what might influence our problem and what we
might be able to collect. We’ll now walk through a simple
case study of a problem I was working on in a team that
I used to lead in industry. Our objective with
this new product that we were building was
to be able to predict the severity and location
of power outages for electric utilities in advance of storms which might impact
an electric network. Power outages
primarily are caused by trees falling down on wires, but there’s many things that can cause trees to fall down. We spent a lot of effort
and time working on feature selection to
identify the relevant set of data and features of
that data that we could use to build an effective model for predicting power outages. We went through interviews
of industry experts, both in the utility industry and in meteorology, and we ended up with a set of factors including things such as wind gusts, precipitation amounts,
even things that you might not initially suspect,
such as seasonality. For example, trees that still have leaves on them are
much more likely to fall down on top of power lines
in strong winds and cause outages, relative to trees in winter time that
have no leaves and therefore are much less
likely to fall down. Sometimes features are very
obvious characteristics of your data or your problem. Other times, features
are much less obvious and involve a
heavy amount of research, which is best done by talking to industry experts who have a particular expertise about the problem you’re
trying to solve. There are several methods
of feature selection. The first and in my experience, the best method of
feature selection is to actually talk to experts about the problem that you’re
trying to solve. In the case of a power outage
tool that we were building, we talked to many individuals
from electric utilities. We talked to meteorologists
to better understand what factors most impact the
severity of power outages. That helped us then
narrow down on a set of features of our data to use
in developing the model. We can also use the data itself to better
understand relationships between possible input features and the output that
we’re trying to predict. We can do this through
visualization. For example, collect data
involving many features and create simple plots
of each feature of our data relative to the
output we’re predicting. In the example of the
power outage predictor, we might choose to plot
sustained wind speed relative to power outage or temperature
relative to power outage. That can help us
through visualization, identify possible strong
relationships between each of these potential
features and the output. Likewise, we can also apply statistical tests to
evaluate the strength of correlation or the relationships between possible
features and the output. We can do this by
collecting a large set of past data and then
performing tests using that past data to
identify relationships. Finally, there’s a set
of techniques we can employ when we’re building
the model itself. As we’re building and
training the model, we can look into which features
the model is relying most on to be able to effectively predict the output that
we're trying to predict. We can then narrow
our feature selection down to focus only on those features that
the model is using most in terms of
predicting the output, and we can eliminate or
get rid of features that the model is really not using at all to be able to
make its predictions. One tip I’d like to share on feature selection
is that including too few features in
your model is usually much worse than including
too many features. When in doubt, try to collect as much data as you can and as many possible features
or factors as you can about the problem that
you’re trying to solve. Build your model using all of those features initially
and see how it performs. When you then identify
features which are irrelevant or maybe duplicative, you can start to then
reduce your feature set. But starting with
a very small group of features can
be very dangerous because it may often occur that you’re accidentally leaving an important feature
or factor out, and as a result, it will be impossible to
achieve good model performance.

Video: Algorithm Selection

Algorithm Selection: Key Considerations

  • No One-Size-Fits-All: The “No Free Lunch Theorem” states there’s no single algorithm that’s best for every task. The optimal choice depends on the specific problem and dataset.
  • Typical Approach: Experiment with multiple algorithms and compare performance to find the best fit.
  • Three Main Criteria:
    1. Performance/Accuracy: How well the model predicts the desired output.
    2. Interpretability: How easy it is to understand the model’s decision-making process (important for explaining results).
    3. Computational Efficiency: How much processing power and time are needed for training and using the model.
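
As a rough way to see the first and third criteria side by side, the sketch below (a minimal example assuming scikit-learn; the data is synthetic) trains a simple linear model and a more complex ensemble on the same data and compares accuracy and training time. Interpretability is harder to measure in code: linear coefficients are easy to read off, while a forest of hundreds of trees is much harder to explain.

```python
# Rough sketch: comparing two algorithms on accuracy and training time (synthetic data).
import time
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(5_000, 10))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 5_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(n_estimators=200, random_state=0)):
    start = time.perf_counter()
    model.fit(X_train, y_train)                      # training cost differs a lot
    elapsed = time.perf_counter() - start
    score = r2_score(y_test, model.predict(X_test))  # predictive performance
    print(f"{type(model).__name__}: R^2 = {score:.3f}, training time = {elapsed:.2f}s")
```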

Types of Algorithms

  • Parametric: Defined by mathematical equations (e.g., linear regression).
  • Non-parametric: No single defining equation (e.g., decision trees).

Real-World Example: Netflix Prize

  • Goal: Improve movie rating predictions by 10%.
  • Winner: Used a complex ensemble of models, demonstrating the power of combining algorithms.
  • Trade-off: Netflix ultimately didn’t implement the winning model. The gain in performance didn’t justify the increased computational cost and engineering complexity.

Key Takeaway: Choosing the right algorithm involves balancing accuracy, interpretability, and efficiency based on your application’s specific needs.

Let’s now talk a bit about algorithm selection. So, the machine learning algorithm,
you can think of as a template that defines the relationship between
the input and the output for a model. There are many many types of
algorithms focused on different tasks, such as regression or classification. Algorithms can also be classified
into two primary types, parametric algorithms and
non-parametric algorithms. Parametric algorithms can be defined
by mathematical equations that relate the input and the output. So, linear regression, which many of you will know is a common
example of a parametric algorithm. In parametric algorithms,
our objective is to define the coefficients of that equation or
the set of equations, which governs the relationship
between the input and the output. Non-parametric algorithms, on the other
hand, do not have a single equation or set of equations that
define the relationship. A common example of a non parametric
algorithm would be a decision tree. Which we’ll talk about in a later module. One important thing to note
with algorithm selection, is what’s called the no
free lunch theorem. And this theorem says that there is
no one single algorithm that performs optimally across the range of all
possible machine learning tasks that you might be working on. The choice of algorithm which is optimal,
is really dependent on the particular task that you’re working on,
the problem you’re trying to solve, and the particular set of data that
you have available to use. And so, the typical approach to algorithm selection is to try multiple algorithms and
train models using multiple algorithms and compare, and see which algorithm gives us
the best result in terms of performance. When we’re selecting the algorithm, well
commonly consider three primary criteria. The most obvious one, is the performance
of the model or the accuracy of the model. And be able to generate
predictions of the output. But perhaps less obvious ones are,
interpret ability and computational efficiency. So, interpretability refers to how easy or
hard it is to understand what the model is actually doing when
it’s generating the predictions. And how it arrives to certain predictions. So, some algorithms, such as a linear
regression, for example, or decision tree. Give us a high level of interpret ability,
meaning it’s very easy to look into how the model is performing and
why it generates certain predictions. So that, if we have to explain
the prediction to a customer or user of our product. It’s very easy for us to understand
how we got to that prediction and in turn be able to explain it to someone. Other algorithms, such as neural networks, use very large sets of complex equations
to generate output predictions. And it can be very difficult to
understand how the neural network model is reaching the prediction. And so,
if we had to explain that to a customer or user, it would be very challenging for
us to do so. The third criterion we want to consider is computational efficiency. Simpler models,
such as linear regressions, generally are highly efficient,
can run very quickly, can both train and generate predictions very quickly
with low computational power. Other algorithms, such as again,
neural networks, can take a high degree of computational power to train, and
also to use in generating predictions. They could take many hours,
days, or even weeks to train. And they take a lot of
computational horsepower to do so. So, when we’re making our
decision about algorithms, we want to make sure to not only consider the performance or the accuracy, but really to consider the balance
of all three of these factors. And for the particular problem
that we’re working on, or product that we’re building. Which of these factors
is most important to us? One example I’d like to share here,
of considering all three of these factors in the appropriate balance,
is an example from Netflix. So, in the early 2000’s, Netflix ran
a competition called the Netflix Prize. And the objective of that competition was
for a team to be able to develop a model, which is highly effective at
predicting the ratings that a user would give to certain
movies that they watched. To create this model, Netflix made
available a large historical data set of its users who
had watched movies and the ratings that they had given to
each of the movies they had watched. And to win the Netflix prize,
a team had to be able to improve upon Netflix’s own internal algorithm,
by at least 10% in terms of generating accurate predictions
of users' ratings of movies. So, after the competition had been running for
multiple years, a team was finally able to reach that
10% performance improvement threshold and was declared the winner of the Netflix Prize. As the team made the information about their model publicly available, we quickly learned that the model
actually was a complex ensemble or collection of multiple models, each of those models using a different algorithm as its template. As the Netflix engineers evaluated, after the competition, the prospect of using the winning team's model in place of their own algorithms, they quickly realized that the idea of implementing this complex model using their own very large internal training data sets was actually not worth
the engineering effort required to do so. In the competition, the teams were given millions of movie
reviews to work with to train their model. Netflix internally manages on the order of billions of reviews. And so, in order to use this model that the team had built on the billions of reviews that Netflix has available to it, it would have been a tremendous engineering
effort to scale it up. And as the engineers evaluated
the effort and the computational power
required to run this model, relative to the performance improvement
that it gave over the existing model, they came to the conclusion that it was not worth it, and they decided to stick with
their original model.

Model Selection


Video: Bias-Variance Tradeoff

Model Complexity

  • Factors: Model complexity depends on the number of features, the choice of algorithm (simple like linear regression vs. complex like neural networks), and hyperparameter values.
  • Impact on Error: Model complexity directly impacts the error in these ways:
    • Bias: Simple models have higher bias. They cannot fully capture complex patterns, leading to consistent errors in prediction.
    • Variance: Complex models have higher variance. They over-adapt to the training data, including noise, leading to less accuracy with new data.

The Bias-Variance Tradeoff

  • Simpler models tend to have higher bias and lower variance.
  • Complex models have lower bias and higher variance.
  • Total Error = Bias² + Variance + Irreducible Error
  • Goal: Finding the optimal model complexity means minimizing the total error.
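
Written out, the points above correspond to the standard textbook decomposition of expected squared prediction error (the notation here is the usual one, not taken from the course slides):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^2}_{\text{too simple a model}}
  + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{sensitivity to the training sample}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```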

Underfitting and Overfitting

  • Underfitting: Model is too simple, resulting in high bias and high total error.
  • Overfitting: Model is too complex, resulting in high variance and high total error.

Visual Example

The provided graphs illustrate underfitting (simple straight line), optimal fit, and overfitting (complex curve tightly following training data).

Key Takeaway: The challenge in machine learning is finding the right balance between model complexity, bias, and variance to minimize error and achieve better predictions.
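
A quick way to see the tradeoff numerically is to fit models of increasing complexity and compare training error with error on held-out data: the simplest model does poorly on both (underfitting), while the most complex one does well on training data but worse on held-out data (overfitting). The sketch below is illustrative only, assuming scikit-learn; the true function, noise level, and polynomial degrees are arbitrary choices.

```python
# Illustrative sketch of underfitting vs. overfitting using polynomial models.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 200)        # true pattern plus irreducible noise
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 20):                             # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```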

If we determine that our model is overfitting on the data, which of the following aspects of our model might we adjust to reduce the complexity (select all that apply)?
  • The choice of algorithm
  • The values of the hyperparameters used in our model
  • The number of features used in the model

One of the things that
makes machine learning modeling challenging is finding the right degree
of complexity for a model that you’re building
for a given problem. Complexity of a model is the result of three
primary sources. One is the number of
features that you choose to include in your model. Obviously, the more
features you include, the more complex
the model becomes. The second source of complexity
is in the algorithm or the template that you’re using for the form of the
model you’re creating. They’re algorithms such as linear regression that
are much simpler, and algorithms such
as neural networks that are much more complex. The third source of complexity
in modeling comes from the values of what we call the hyper-parameters,
which are again, these knobs that you tune, which are specific to
your choice of algorithm that you selected
for your problem. Put together, these
three things result in some amount of
model complexity, which we can then vary
and make simpler or more complex through our choice of these three things:
number of features, selection of algorithm, and selection of values for
our hyper-parameters. The complexity of a
model that you create has a direct impact on
the error of that model. The error of a model can be broken down into two primary terms:
bias and variance. Bias refers to the error
that’s introduced by modeling a complex real life problem
using a simpler model, where that simple model
is unable to fully capture underlying
patterns within your data. Models with high bias would be simple models that are
consistently a little bit off target in generating their predictions because
they’re just not able to fully capture those fine
patterns within the data. Variance refers to
the sensitivity of a model to small
fluctuations in the data. Models with high variance
are very tightly fit on the training data and as
a result of that tight fit, they’ve interpreted noise
in the training data as actual patterns and they’ve attempted to model
those patterns, which in reality are just noise. As a result, when models
with high variance receive new sets of data to
generate predictions on, the predictions can be somewhat scattershot with high variance. There’s a natural
trade off between the bias and the
variance of a model. Models that are
simpler often have higher bias because
they’re unable to fully capture the real
underlying patterns in your data and they also
have low variance. On the other hand,
more complex models don’t have much lower bias, they’re much better at capturing those underlying patterns, but they often have
higher variance or the scattershot effect by very tightly fitting to
the training data. The total error of a model is the sum of the bias
and the variance terms. In technical terms, we say that the total error is equal to the squared bias
plus the variance, plus an additional error
term which refers to the inherent error
in any data set or the random noise that is inherent with any
particular problem that you’re trying to model. In practice we often use
the terms underfitting and overfitting to
refer to situations where we’ve created models
that are either too simple or too complex for the problem we’re
trying to model. Underfitting refers to the
situation where we’ve chosen a very simple model
and as a result, we have a model that has very high bias and low variance, but put together, the bias and variance add up to an error that's
higher than optimal. In underfitting, our
model is too simple and it’s unable to really capture those underlying
patterns in our data. Overfitting is the
opposite problem, where we’ve selected or
developed a model that is too complex for the data and the problem that
we’re trying to model. In overfitting,
we have low bias, but we have very high
variance and as a result, our total error, the sum of the bias and the variance
is higher than optimal. The optimal model complexity
occurs when the sum of the bias and the variance
is at a minimum. To show you an example of what overfitting and
underfitting look like, let’s consider this example. Our true function
in these graphs is represented by the orange curve. Our model function is
represented by the blue line. We can see on the graph
on the left side, we’ve modeled this function with a simple linear regression
and our model in fact is too simple to really capture
that underlying pattern in our data and this would be a
clear case of underfitting. In the middle diagram, we've done a pretty
good job matching our model to the true
underlying function in our data so that our model is fitting quite well yet still
has some degree of error, which refers to that
inherent error or noise that’s found
in every data set. The picture on the right is a good depiction of overfitting, where we've chosen a model
function that has very tightly fit the
data and as a result, has a very complex form. This may work on
our training data, but when we present this
function that we’ve created with new data to
generate predictions on, our model ends up doing
a pretty poor job generating accurate predictions, because it has fit itself to the noise it’s found in
the training data set and that same noise may
not be present in new data that we present
to generate predictions.

Video: Test and Validation Sets

Why Evaluate Model Performance?

  • During Development: To compare different algorithms, hyperparameter settings, and make informed choices about the best model configuration.
  • On Finalized Model: To assess how well the model generalizes to new, unseen data, which is the ultimate goal of predictive modeling.

The Importance of Data Splitting

  • Training Set: Used to build and train the model.
  • Test Set: Held back to simulate new data and provide an unbiased estimate of the model’s performance on real-world cases.
  • Typical Split: 80-90% training, 10-20% test.

Data Leakage: A Common Pitfall

  • What it is: Inadvertently using test data during the model building process.
  • Consequence: Overly optimistic performance estimates that don’t reflect true generalization ability.
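
One common and easy-to-miss form of leakage is fitting preprocessing steps (scaling, feature selection, and so on) on the full dataset before splitting. The sketch below, which assumes scikit-learn and uses synthetic data, contrasts the leaky pattern with the safe one.

```python
# Sketch contrasting a leaky preprocessing pattern with the safe one (synthetic data).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1_000, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 0.5, 1_000)

# Leaky: the scaler "sees" the test rows, so test-set statistics leak into training.
X_leaky = StandardScaler().fit_transform(X)           # fit on ALL data -- leakage
X_train_bad, X_test_bad, y_train_bad, y_test_bad = train_test_split(
    X_leaky, y, test_size=0.2, random_state=0)

# Safe: split first, then fit preprocessing on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)                # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)              # test set is transformed, never fit on
```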

Best Practice: Three-Way Split

  1. Training Set: For initial model building.
  2. Validation Set: To compare model variations (algorithms, hyperparameters) and select the best one.
  3. Test Set: Used only after model finalization to get an unbiased performance estimate.
  • Typical Split: Training 60-80%, Validation 10-20%, Test 10-20%

Key Takeaway: Rigorous evaluation with carefully separated data sets is crucial to ensure that machine learning models perform well in the real world and avoid the pitfall of overfitting.
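
A minimal version of the three-way split described above, assuming scikit-learn, might look like the sketch below; the 60/20/20 proportions and the candidate hyperparameter values are just illustrative choices within the typical ranges.

```python
# Minimal sketch of a train/validation/test workflow (synthetic data, 60/20/20 split).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(5)
X = rng.normal(size=(1_000, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 1_000)

# Carve off the test set first, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Use the validation set to compare candidate models (here, hyperparameter values).
best_alpha, best_err = None, float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0):
    err = mean_absolute_error(y_val, Ridge(alpha=alpha).fit(X_train, y_train).predict(X_val))
    if err < best_err:
        best_alpha, best_err = alpha, err

# Retrain the chosen model on train + validation, then touch the test set exactly once.
final_model = Ridge(alpha=best_alpha).fit(
    np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print("Unbiased test MAE:", mean_absolute_error(y_test, final_model.predict(X_test)))
```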

Why do we split our data into separate sets for model training and model testing?

Once we have created our final model using the training set, we can then use the test set to evaluate the model performance so that it better reflects the model’s true ability to generate good predictions on new data it has not seen before

Correct. We use the test set only after finalizing our model to evaluate its performance

In this lesson, we’re
going to talk about evaluating the
performance of the model. We evaluate model performance
in two different ways. The first way is as we’re
building and training a model, as we’re evaluating different
algorithms for our model, or tweaking those
hyperparameters, tuning those dials of the model, we want to evaluate performance
so we know whether we’re making things better or
making things worse. The second way we
evaluate performance is after we finalized our model, we then also want to evaluate performance using
new unseen data. Why do we do this twice? Well, the goal of predictive
modeling is to create a model that is
highly effective in generating predictions
on new data that the model has
never seen before, so data that was not used in building or training the
model in the first place. We can’t estimate
the performance on data that we don’t have, which is new data. Instead, what we do is take the data that we have collected, our training set, and divide it up into two different subsets. The first subset we use
for model training, so we use this to
build and then train the model and evaluate and compare different
types of models. The second subset is
called the test set. We use the test set of data in evaluating the performance of the final model
that we’ve built. We do this in order to be representative of the
model’s ability to generate accurate predictions on new
data that the model has never seen before and that was not used in originally
training the model. Typically, when we split our data into a training
and a test set, we’ll use roughly
80 – 90 percent of the available data that we’ve collected in training
and selecting the model. Then we’ll reserve roughly
10-20 percent to use as our test set in evaluating
our final model's performance. A common problem in building machine learning models is
what’s called data leakage. Data leakage occurs when some of the data that we’ve
set aside to use in our test set is accidentally used in building or training
the model in some way. This can happen in
many different ways, some of which are really
not that obvious. For example, if we used our entire set of data,
including the training, but also the test set
to do selection of features for a model or to
compare different algorithms, we’ve actually already
used the test set data as part of one of the steps
of building our model. Therefore, the test set
data is no longer representative of the
model’s ability to generate predictions
on new data. Since we’ve already
used it as part of the model building
process itself, it really has now become
training data instead. What happens when we
have data leakage and we accidentally use our
test data as part of the model building
process is that it invalidates the estimated
performance of the model. Generally we’ll
find that it causes the performance estimation
to be overoptimistic. The performance on
the test set is actually better than
what we might expect the performance to
be on generating predictions using new data that the model has
never seen before. Often during model-building
and training, we want to compare
different models. Models which may be based
on different algorithms, or may be using
different values for these hyperparameters
or these dials that we can tune on the model. If we were to use
our test set to compare the performance
of different models, our test set is no longer an unbiased indicator of the performance of
our final model. Instead, we generally
split our training data set further and divide it
into two further subsets. One being a training set and the second being what’s
called a validation set. We can then build and train the model on our training
subset and we use this new validation set to compare different models and
perform model selection. Once we’ve selected
our final model, we retrain our final model using the training and the
validation data together, and then we can evaluate
the performance of the model using the test set. In this way, we ensure that
the test set is always left on the side and
only used once we’ve made a final selection
of our model so that the test set can remain
an unbiased indicator of the model’s ability
to generalize and generate accurate predictions on new data it’s never seen before. Typically, when we
do this and break our training set further into a training and
a validation set, we’ll use roughly 60-
80 percent of our set for training purposes and then roughly 10-20
percent for validation.

Video: Cross Validation

Why Cross-Validation?

  • Better Evaluation: Provides a more robust way to estimate how well a model generalizes to new data, compared to using a single, fixed validation set.
  • Maximizes Training Data: Since the validation set changes in every iteration, all available data gets used for training the model at some point. This is especially useful for smaller datasets.

How K-Folds Cross-Validation Works

  1. Divide Data: Split your training data into K subsets (folds), typically 5 or 10.
  2. Iterations: Run K iterations, using a different fold as the validation set in each iteration. The remaining folds become the training set.
  3. Calculate Error: For each iteration, calculate the model’s error on the validation fold.
  4. Average Error: Get the overall error by averaging the error across all iterations.
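
In code (assuming scikit-learn), these four steps correspond closely to KFold and cross_val_score; the sketch below compares two candidate models on synthetic data.

```python
# Sketch of 5-fold cross-validation for comparing two candidate models (synthetic data).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 6))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, 500)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # K = 5 folds -> 5 iterations
for model in (LinearRegression(), DecisionTreeRegressor(max_depth=4, random_state=0)):
    # Each iteration trains on 4 folds and scores on the held-out fold;
    # averaging across the folds gives the cross-validated error estimate.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"{type(model).__name__}: mean CV MSE = {-scores.mean():.3f}")
```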

Benefits of Cross-Validation

  • Reduces Bias: Rotating validation sets prevents accidental bias introduced by a single, fixed validation set.
  • Better Generalization Assessment: Provides a more reliable picture of how well the model performs on unseen data.

Key Takeaway: Cross-validation is a widely used and generally preferred method for evaluating model performance due to its advantages over using a single validation set.

We are building a model and decide to use K-folds cross validation to compare different versions of our model. We set k = 10. How many iterations would we run in our cross validation process, and during each iteration, what percentage of our data would be used for validation (and not training)?

10 iterations, 10% for validation each iteration

Correct. Since k=10 we run 10 iterations, and each iteration we use 1 of the 10 folds (10%) for validation and the remaining 9/10 for training

In the last lesson, we discussed the
importance of setting aside a subset of your
data for testing purposes, to serve as an
unbiased indicator of the model’s performance. We also talked about we can use a validation set to assist us in evaluating the performance of different models as retraining
and building those models. Another common strategy
for evaluating and comparing multiple models is what’s called
cross-validation. In cross-validation,
rather than using a single fixed subset of data as our validation set to compare
the performance of models, we will actually run
multiple iterations, and for each iteration, we will choose a
different subset of data to serve as
the validation set, and the rest of the data will be available to us for
model training. A common method of
cross-validation is what’s called K-folds
cross-validation. In K-folds cross-validation,
we divide our data up into a number of subsets
or folds, as we call them. Typically, we use either
five folds or 10 folds. If we were to perform
five-fold cross-validation, we would divide our training
data up into five folds. We then run five iterations, and for each iteration, we’d use one fifth of our
data as the validation set, and the remaining four-fifths would serve as the training set. The validation fold would rotate each iteration. On the first iteration, we'd use the first fifth of
our data for validation, the last four fifths
for training. The second iteration, we’d use the second fifth of our
data for validation, and the rest for training. After we’ve run through
all five iterations, we can then calculate
the error as the average of the error on each validation fold
at each iteration. We calculate the error on the validation fold for each
of the five iterations, we’d sum the errors up, and we divide by five to get the average error across
the K-fold validation. Cross-validation is very
commonly used in the industry to evaluate the performance
of models as we're training and comparing
multiple models. In fact, it’s generally
considered to be the preferred
approach rather than using a single fixed
validation set. There’s a couple of
reasons for that. The first is that
cross-validation maximizes data available
for training the model. Using a single fixed
validation subset, we remove that subset from
use in training our model. Whereas in cross-validation, because the validation
set rotates each time, we’re able to use all of
the data available to us at some point during one of the iterations
for training our model. If we have a very large dataset, this is really
less of a concern. But if we’re doing
a smaller datasets, this can quickly become much
more of a concern for us. Secondly, cross-validation
generally provides a better evaluation
of how well the model can generalize to
be able to generate accurate predictions on new
data it’s never seen before. One of the risks with using a single fixed validation
set is that we may accidentally bias the model’s performance
on that set, through the choice
of data points to include in that single
fixed validation set. In cross-validation, because our validation subset
is rotating each iteration, we use every data
point available to us one time for validation. Thus, we can compare and evaluate the model's performance over a much broader range of data points, and we reduce the chances of biasing our model's performance estimate through our choice of data to
use in the validation set.

Review


Video: Module Wrap-up

Model Building: A Key Step in Data Science

  • Model building is central to the CRISP-DM data science process, starting with properly framing the problem and defining success metrics in the business understanding phase.

Model Complexity

  • Complexity arises from:
    • Features used in the model.
    • The chosen algorithm.
    • Hyperparameter settings for that algorithm.

Balancing Complexity and Performance

  • Underfitting: Model is too simple and can’t capture real-world patterns in the data.
  • Overfitting: Model is too complex, fits itself to noise in the training data, and doesn’t generalize well to new data.

Model Evaluation

  • Validation Sets and Cross-Validation: Techniques for comparing models and choosing the best one.
  • Test Set: A final, unbiased assessment of the selected model’s performance on new data. This is crucial before deployment.

Key Takeaways

  • Choosing the right algorithm, features, and complexity level is essential for building effective models.
  • Careful evaluation using validation techniques and a held-out test set ensures models will perform well on unseen data in real-world situations.

In this module, we talked about the key steps in the
model building process, and we began to
discuss how to compare models against each other and evaluate a model's performance. Although the modeling
process is just one step in the overall CRISP-DM
data science process, it really begins
right at step one of the CRISP-DM process or in the business
understanding phase. It’s critical to properly
frame the problem and to define metrics for success
so that as we build models, we can compare them against
our metrics and evaluate our performance
and how well we’re doing in actually
solving the problem. We talked about model complexity
and where it comes from: specifically, from the features that we've defined to use in the model, from our choice of
algorithm or template to use to govern the relationship between our input features and our output target
and the choices of hyperparameters or values for those knobs that we can turn on the algorithm
that we’re using. We discussed a little
bit about how to define features and
where they come from. We talked about how
to choose algorithms based on our three
criteria of performance, interpretability, and
computational cost. We also discussed
what happens when our complexity is either
too high or too low, given the amount of data that we have and the problem
we’re trying to solve. We discussed an
important terminology that’s used in the industry, specifically underfitting
which is when we’ve built a model that’s too
simple to capture the inherent
real-world variability in the problems that
we’re trying to model. We also discussed
the opposite problem of when we built a model that’s too complex
and as a result, it’s fitting itself to noise that’s found in
a training data and it’s not then able to generalize well and generate
predictions on new data. This problem is
called overfitting. We talked about ways to compare
models against each other using validation sets and
cross-validation strategies, and also discussed the
importance of setting aside a test set of data so that after we’ve
compared our models, made our final model selection, trained our model, we’re able to generate
predictions on our test set, evaluate
their performance, and use that as an
unbiased indicator of our model’s ability to generate predictions on new
data that it receives.

Module 2 Quiz

What are the three primary considerations when selecting an algorithm to use in modeling (select three)?

What does the term “variance” in modeling refer to?

“Underfitting” refers to the situation in modeling when:

Why do we split our data into a training set and test set, and then hold back the test set while training our model?

When we split our data to create a test set, how much of our data do we generally use for training and how much for testing?

Why would we use cross-validation instead of a fixed validation set (select all that apply)?

If we use standard K-folds cross-validation with k=5, how many times do we use each observation in our data as part of a validation fold?

What does the term “data leakage” mean?

Why is data leakage a dangerous thing in the modeling process?

If we are using cross-validation to compare multiple models and select the best one, how do we compare the models?