
Module 5: Trees, Ensemble Models and Clustering

We will begin this module with a discussion of tree models and their value in modeling complex non-linear problems. We will then introduce the method of creating ensemble models and their benefits. We will wrap this module up by switching gears to unsupervised learning and discussing clustering and the popular K-Means clustering approach.

Learning Objectives

  • Describe how tree-based models differ from linear models
  • Identify the advantages of ensemble models and how they are assembled
  • Explain what clustering is and how K-Means clustering works

Tree and Ensemble Models


Video: Introduction and Objectives

Module Overview

This module covers non-parametric models, which are more complex and flexible than linear models. The topics covered include:

  1. Decision Trees: A type of non-parametric model that uses a series of questions to classify data.
  2. Ensemble Models: Combining multiple models to improve predictions, including Random Forest.
  3. Unsupervised Learning: Focusing on clustering, specifically K-Means Clustering.

Learning Objectives

By the end of this module, you should understand:

  1. How tree-based models differ from linear models.
  2. Why ensemble models are used instead of individual models, and how to create them.
  3. What clustering is, how K-Means Clustering works, and when to apply it.

This module aims to provide a comprehensive understanding of non-parametric models, ensemble models, and unsupervised learning techniques.

In the last module, we covered the family of algorithms called linear models. Linear models are simple and easy to interpret, but one of the challenges they have is that, because of their simple nature, they often underfit on complex real-world problems. Another family of models that we're going to cover in this module is called non-parametric models. We're going to talk about a couple of specific types of non-parametric models, starting with the decision tree, and then we'll cover ensemble models and specifically the random forest. We'll wrap up this module by talking about unsupervised learning, specifically clustering, and the most popular clustering algorithm, K-means clustering. At the end of this module, you should understand how tree-based models differ from linear models. You should understand why we use ensemble models versus individual models and how we can create ensemble models. And you should understand what clustering is, how K-means clustering works, and the types of situations in which we might apply it.

Module Slides

Video: Tree Models

Here’s a summary of the video on decision trees:

What is a Decision Tree?

A decision tree is a machine learning algorithm that asks a series of questions to narrow down a prediction for a given data point. It’s a classification model that uses a series of splits to separate data into different classes.

How Does a Decision Tree Work?

  1. Start with a question (e.g., “Does the animal have horns?”).
  2. Split the data based on the answer (e.g., moose have horns, others don’t).
  3. Ask another question (e.g., “How many legs does the animal have?”).
  4. Split the data again based on the answer (e.g., birds have two legs, others have four).
  5. Continue asking questions and splitting the data until a prediction is made.
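
As a rough sketch of the idea (not from the course materials), the animal tree described above can be written as a chain of if/else questions; the function and feature names below are hypothetical.

```python
# Hypothetical sketch of the animal decision tree as a series of questions.
def classify_animal(has_horns: bool, num_legs: int, color: str) -> str:
    if has_horns:          # split 1: only the moose has horns
        return "moose"
    if num_legs == 2:      # split 2: two legs -> bird
        return "bird"
    if color == "green":   # split 3: green -> lizard, otherwise dog
        return "lizard"
    return "dog"

print(classify_animal(has_horns=False, num_legs=4, color="brown"))  # -> dog
```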

Choosing the Best Splits

The goal is to choose splits that maximize information gain, which is the decrease in impurity (mix of classes) in the data. The algorithm looks at every possible way to split the data and chooses the combination that results in the maximum information gain.
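
To make “impurity” concrete, here is an illustrative calculation using the Gini index, one common impurity measure (the course does not specify which measure it uses). A split that perfectly separates two classes drives the children’s impurity to zero, which is the largest possible information gain for that node.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Decrease in impurity from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = ["A", "A", "B", "B"]                            # highly mixed: impurity 0.5
print(information_gain(parent, ["A", "A"], ["B", "B"]))  # perfect split -> gain 0.5
```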

Predicting with a Decision Tree

Once the tree is created, predictions are made by traversing the tree and calculating the majority vote (for classification) or average value (for regression) at each leaf node.

Tree Depth and Complexity

The depth of the tree determines the complexity of the model. A shallow tree may underfit the data, while a deep tree may overfit the data. The optimal depth depends on the problem and data.
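
As an illustrative experiment (my own sketch on synthetic data, not course code), scikit-learn's DecisionTreeClassifier makes the depth trade-off easy to see; exact scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class dataset, split into train and test sets.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 2),
          round(tree.score(X_test, y_test), 2))
# Expect: depth 1 underfits (low train and test accuracy), while unlimited depth
# fits the training set almost perfectly but generalizes worse to the test set.
```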

Advantages and Challenges

Advantages:

  • Highly interpretable
  • Trained quickly
  • Can handle non-linear relationships
  • No need for scaling or encoding categorical variables

Challenges:

  • Sensitive to tree depth
  • May overfit or underfit the data
  • Individual trees can be unstable and prone to overfitting

Overall, decision trees are a powerful tool for classification and regression problems, but require careful consideration of tree depth and complexity to avoid overfitting or underfitting.

Which of the following are correct statements about decision trees (select all that apply)?
  • Trees can be used for both regression and classification tasks
  • The splitting criteria for each node in the tree is selected as the one which maximizes the information gain, or decrease in impurity, of the observations at the node

The decision tree is a machine learning
algorithm that asks a series of questions in order to narrow in on
a prediction for a given data point. The easiest way to understand a decision
tree is to look at an example. So let’s say we wanted to create
a classification model to classify four types of animals: dogs,
lizards, birds, and moose. And we want to do this by asking a series
of questions about each of these animals to determine which animal it is. We might start by asking,
does the animal have horns? Of the four possible animal
classes that we have, we know that there’s only one class
that has horns, which is a moose. So if the answer to our question is yes,
we can predict that the animal is a moose. However, if the answer to our question is
no, we’re not yet sure what animal it is. It could be a dog, a lizard, or a bird. So we then ask a second question,
how many legs does the animal have? Does it have two legs or four legs? If the animal has two legs,
we can predict it’s a bird. If the animal has four legs,
we’re still not sure. It could be a dog, but
it could also be a lizard. So let’s ask one more question. What color is the animal? If it’s green, we can predict
that the animal is a lizard, and if it’s brown,
we can guess that the animal is a dog. By asking this series of questions,
we’ve now formed a decision tree. And so if we were to take a new animal, we
could map it through the tree and compute the predicted class as the output
based on where it falls within the tree. So how do we choose the splits
that form a decision tree? Our goal is to build the most efficient
tree or the one that uses the minimum number of splits to effectively split
the data into our target classes. To choose the splits, we define an objective function to help
us select which split is the best one. And the objective function
that we use is maximizing the information gain at the split. The information gain is equal to
the decrease in impurity by splitting our data. So impurity means how well mixed our
data is at any point within our tree. If at a certain node in our tree our data is highly mixed between two
classes, let’s say class A and class B, our data has a high degree of impurity. If we create a split that effectively
splits our data between A and B, such that at one leaf we have labels
that are entirely class A, and at the other leaf we have
data that’s entirely class B, we’ve reduced our impurity
all the way to zero. So we’ve created a pretty
significant decrease in impurity. Or put another way, we’ve successfully increased the
information gain coming from that split. So the idea of creating a decision
tree is to find questions or splits that can reduce
the mixture of the data or effectively separated out
into the individual classes. When we’re creating a split, we look at
every possible way that we could split our data at that point in the tree. So we look at each of the features
on which we could split. And for each of those features we look at
different possible values that we could split on. And we choose the combination
of the feature and the value to split on that results
in the maximum information gain, or the biggest decrease in impurity by
splitting on that feature and value. Once we’ve created the tree, how do we actually generate
predictions out of the tree? The bottom nodes in a tree
are called the leaves of the tree. To calculate the actual prediction or
value for all the points which occur
at each leaf of the tree, we generally take an average if we’re
working with a regression model, or a majority vote if we’re working
with a classification model. So let’s say we have some data that’s
mixed at a certain node between class A and class B. We make a split and
we split that down into two leaves. One leaf has a majority of class A,
the other leaf has a majority of class B. The prediction for that leaf is
whichever class has a majority. So for leaf 1, the prediction for all the data
points at that leaf would be Class A. And for leaf 2, because it’s majority
Class B, every point which arrives at that leaf
would be predicted to be Class B. One of the key things
that we need to determine when we’re creating a tree is
the optimal depth of our tree. And this can really make a huge
difference in terms of the prediction ability of your tree. So the depth of a tree is the maximum
number of splits that occur within that tree. And it’s actually a number
that we can choose. We can decide to create a very
shallow tree by limiting ourselves to at most one or two splits in our
tree before we create the leaves. Or we can have an unlimited
number of splits in our tree, such that every leaf contains
only one single point. Very shallow trees with a small number
of splits tend to underfit the data. They’re just too simple to really
capture the patterns within the data and effectively split your data. On the other hand, trees that are very
deep tend to overfit the data, because every single example or
observation can end up at its own leaf. This might fit your training
data set very well, but when you try to use this to generalize on
new data, you’ll find it’s overfitting and it’s not performing very well. Let’s take an example to illustrate
the impact that tree depth has on the complexity of the model and the
resulting outputs it’s able to generate. On the left side of this slide, we have a
set of data organized along two features, x1 shown on the horizontal axis,
and x2 shown on the vertical axis. Our data is labelled into four classes, which are denoted by the color
shading of the data points. Let’s now try to fit a simple
tree model to classify our data. We’ll start by using a tree depth of one,
meaning that we only have a single node or split in our model. We could see the result on
the right side of the slide. Our single node tree model
is using a single split on the value of x2 to split the data. Because it’s using only a single split, it
can split the data into only two classes. In reality we have four
classes in our data set. And so our simple model using only
a single split is underfitting our data by predicting only two classes
relative to the actual four classes that we have in our problem. If we now start to increase
the depth of our tree, we can draw more complex
decision boundaries, splitting our data along x1 and x2 as
denoted by horizontal and vertical lines. And as a result,
we can differentiate our data points and split them into more classes. As we increase our depth to two and
then three, you can see that now we’re starting to be
able to better capture the variability and the split of the data into
each of the four classes. As we continue to increase the complexity
of our model and add more and more layers, we can see that we’re now slicing and dicing our decisions space
into many more partitions. Well, this may improve
the accuracy on a training data. What happens is that when we
now move to a test data set, or we use a more complex model to
generate predictions on new data. We fitted our model so tightly to the
training set that we’ve created partitions based on noise that’s
found in the training set. The same noise is not always found
in our test set or in new data. And as a result,
our model is fairly inflexible and often does not generate great
performance on predicting new data. We can also use trees for
regression types of problems. In a regression problem, rather than taking a majority vote of the
different samples which fall at a leaf, we take the mean of the target values
of each of the samples at that leaf. So let’s say we have a particular
node that results in two leaves. Leaf 1 has 4 samples that fall
at that leaf, with target values of 5, 9, 8, and 6. And leaf 2 has 3 samples with values 4, 2, and 3. To generate the prediction for the samples
that fall at leaf 1, we add up the target values of the four samples and divide by
the number of target values, which is four. And we calculate a prediction of seven,
which is the predicted value for every sample which falls
to this leaf in the tree. Likewise for leaf 2, we can
calculate an average value of three. And so our prediction for
every sample that falls to this leaf in the tree based on
the splits in the tree is three. One of the key benefits of decision
tree models is that they’re highly interpretable. Because of this series of questions or
splits, it’s very easy to follow
the order of the questions and to trace back how we got to a certain
prediction given an input value. They also train very quickly, and
because they’re a nonparametric model, meaning they’re not constrained to
any specific template function, they can handle non-linear
relationships very well. They also don’t require
scaling of our data or extra work encoding categorical variables
before we feed them into our model. One of the challenges of individual
decision tree models is that they’re highly sensitive to the depth
that we choose to grow our tree. If we choose a small depth,
we end up with a very simple model. It really doesn’t do a very good job
predicting on either our training or test set data. One of the bigger problems is
choosing a depth that’s too deep, such that our model performs very well
on the data on which it’s been trained. But our model has actually overfit
itself to that training data. And so when we try to use it to generalize
and create predictions on new data, it really doesn’t do a very good job.

Video: Ensemble Models

What is Overfitting and How to Overcome it?

Overfitting occurs when a machine learning model is too closely fit to the training data, resulting in poor performance on new, unseen data. One popular strategy to overcome overfitting is to create ensemble models, which combine multiple models to improve generalization and reduce overfitting.

How Do Ensemble Models Work?

  1. Create multiple datasets from the original data, either by replicating the data or taking smaller slices of it.
  2. Train a model on each dataset, using the same or different algorithms and hyperparameters.
  3. Use each model to generate predictions.
  4. Combine the predictions using an aggregation function, such as majority vote (for classification) or simple/weighted average (for regression).
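
A minimal sketch of these steps using scikit-learn on synthetic data (my own illustration, not course code): a linear model and a tree model are trained separately and their predictions combined by majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two different member models, aggregated by majority ("hard") vote.
ensemble = VotingClassifier(
    estimators=[("linear", LogisticRegression(max_iter=1000)),
                ("tree", DecisionTreeClassifier(max_depth=4, random_state=0))],
    voting="hard",  # for a regression ensemble, averaging plays the same role
)
ensemble.fit(X_train, y_train)
print(round(ensemble.score(X_test, y_test), 2))
```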

Real-World Applications of Ensemble Models

Ensemble models are commonly used in industries like weather forecasting and electric utilities. For example, weather companies use ensemble models to combine forecasts from various government agencies to generate more accurate predictions. Electric utilities use ensemble models to predict demand and load on their networks, incorporating different weather scenarios and input conditions.

Challenges of Ensemble Models

While ensemble models can reduce overfitting and improve predictions, they also come with challenges:

  1. Training multiple models requires significant time and resources.
  2. Computational costs are higher due to running multiple models in parallel.
  3. Ensemble models can be less interpretable than single models, making it harder to understand how the prediction was made.

Overall, ensemble models can be a powerful tool in machine learning, but require careful consideration of the trade-offs involved.

What is the primary reason we use ensemble models instead of single models?

By combining multiple models we reduce the variance of the predictions and improve the model’s ability to generalize to predict well on new data

Correct. Ensemble models decrease the variance of the predictions and thus the total error, improving the model’s ability to generate good predictions on new data

One of the common challenges we
face in building machine learning models is overfitting our
models to the training data. A popular strategy for
overcoming the challenge of overfitting is creating what’s called ensemble models. The goal of ensembling is to
combine multiple models together into a meta model that is better able to
generalize in predicting on new data. By averaging the output
predictions of each model together, we’re less likely to overfit on
the training data, and as a result, we’re more flexible and better able to
generalize in predicting on new data. The reason for this is that averaging
the output predictions of models, assuming that each of those models is independent
of the others, or close to independent, can lower the variance as compared to
the variance of an individual model. By lowering the variance,
we improve the prediction performance. So, how does ensembling work? We start the process of ensembling
by creating multiple sets of data from our original data set. Each of these new sets of
data can either be a full replicated version of
our original data set or can be some smaller slice
of the original data. We can then train a model on each of
these new data sets that we have. Our models can all use the same algorithm
and be trained in different ways using different hyperparameters
on different sets of data, or they can use different algorithms. We may combine linear models,
for example, with tree models. Once we’ve trained these multiple models, we can use each model to
generate predictions. We then need an aggregation
function to combine the predictions to generate a single output
prediction from our ensemble model. And here again we have a decision to make
in terms of the form of our aggregation function. If we’re working with the classification
model, we might choose to use the majority vote amongst the individual
member models of our ensemble. Or if we’re working with regression
problem, we may use a simple average of the predictions of each
individual member model, or some sort of a weighted average
where we assign different weightings to the output predictions
of each member model. Once we’ve chosen our
aggregation function, we can combine the predictions of our
member models into a single prediction, which is the output of our ensemble model. Ensemble models are nothing new, in fact, they’re commonly used in
a number of industries. One of those industries
that very commonly uses ensemble models is the weather
forecasting industry, private weather companies like
the business that I used to manage. We use ensembles of a large number
of individual weather forecasts, usually coming from various government
agencies from countries around the world. They’ll combine these individual member
models in intelligent ways, usually using some sort of weighted averages
that are dynamically changing over time. And they’ll be able to generate
an ensemble model which has a better prediction ability than each of
the individual member models. Likewise, the electric utility
industry commonly uses ensemble models in predicting demand or
load for the network. In this case, utilities will use
ensemble models because some of the inputs to the load
forecasting models are uncertain. One of the primary inputs to
this type of model is weather. So, the purpose of the ensemble model
might be to use as input different weather scenarios, for example, looking at different possible weather
conditions for the following day. And for each of those different input
conditions, creating an individual member model that predicts an output in terms of
the load or the demand on the network. The ensemble model then combines
those outputs of the member models, which are each provided with
different input conditions into a single output prediction. Which is then used by
the electric utility for planning purposes to schedule their
power production for the following day. Although ensemble models have great
potential to reduce the variance in your model predictions,
reduce the problem of overfitting and generate better predictions on new data,
they also come with challenges. One of those primary
challenges is the time and the resources it takes to train multiple
individual member models of your ensemble. Likewise, you have to consider
the computational costs of running these multiple
models in parallel. Each time you want to
generate a prediction, you have to now run not just one,
but multiple models and then combine each model’s predictions. And finally,
when you use an ensemble model, you lose interpretability
relative to a single model. With individual models, it tends to be
much easier to understand how the model was able to reach its prediction. When you’re ensembling, it becomes a more
challenging task because now you have to dive into each individual model and
then how those model outputs were combined to understand where
the ultimate prediction came from.

Video: Random Forest

Bagging:

  • Bagging (Bootstrap Aggregating) is a method for building ensemble models by training multiple models on bootstrap samples of the data.
  • Bootstrap sampling means sampling with replacement, where each observation can be selected multiple times.
  • Bagging subsets can be defined as a percentage of the original data or a fixed number of rows.
  • Combining the models reduces variance and the likelihood of overfitting.
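
A small illustrative sketch of bootstrap sampling with NumPy (not course code): the same number of rows is drawn, but with replacement, so some rows appear more than once and others are left out.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(10)  # stand-in for 10 observations

# Bootstrap: sample as many rows as the original data, WITH replacement.
bootstrap_idx = rng.choice(len(data), size=len(data), replace=True)
print(data[bootstrap_idx])  # some rows repeat, some are missing entirely
```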

Random Forests:

  • Random forests are a type of bagging model that combines multiple decision trees.
  • Each tree is grown on a different subset of the data, ensuring independence between trees.
  • Random forests are great for complex real-world problems with non-linear relationships between inputs and outputs.
  • However, they can lose interpretability compared to individual decision trees.

Challenges and Decisions:

  • Decisions to make when applying random forest ensemble models include:
    • Number of trees to grow
    • Sampling strategy (bagging sample size and maximum number of features)
    • Depth of each tree (maximum depth or minimum number of samples per leaf)
  • These decisions can affect the performance and interpretability of the model.

Overall, bagging and random forests are powerful tools for building ensemble models that can reduce variance and overfitting. However, careful consideration of the decisions involved is necessary to achieve the best results.
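
As a hedged sketch, the three decision categories above map roughly onto scikit-learn's RandomForestClassifier hyperparameters; the values below are arbitrary, and note that scikit-learn subsamples features at each split (max_features) rather than once per bagging sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # 1) number of trees to grow
    max_samples=0.8,      # 2) bagging sample size: 80% of the rows per tree
    max_features="sqrt",  # 2) features considered at each split
    min_samples_leaf=5,   # 3) depth control: stop splitting below 5 samples per leaf
    random_state=0,
)
forest.fit(X_train, y_train)
print(round(forest.score(X_test, y_test), 2))
```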

When we create a Random Forest model, what is different about each individual model within the Random Forest?

Each member model may use a different subset of both features and observations to train

One specific method that’s commonly
used for building ensemble models is what’s called bagging, which is short for
bootstrap aggregating. In bagging, we use bootstrap samples
to train multiple models that we put together in an ensemble. What does bootstrapped mean? It means sampling with replacement. So let’s say we have a large number of
samples or observations in our data. We randomly take a certain number of those
observations out to use to train a model. And each time we pull an observation out
to use, we replace it in the original set. Meaning that we would actually be able
to take the same observation multiple times because we continue to replace it. So bootstrapping means
sampling with replacement each time we sample a row to use. We select the size of the bagging
subset that we choose, and we can define that either as a percentage
of the original number of rows in our data, or simply as a number
of rows that we choose to use. When we create bagging models, because each model is trained
on a different subset of data, the output predictions from each model
can be considered close to independent. Therefore, we get the benefits of
ensemble models when we combine them. Typically, we combine them
using an average, either a simple average or a weighted average. And by combining these, which are each
created on a separate bagging subset, we reduce the variance in
the overall output predictions and we reduce the likelihood
of overfitting to our data. The most common type of bagging model
is what’s called a random forest. When we think back to our
discussion on decision tree models, one of the challenges with decision
trees is that they tend to overfit data. To overcome this challenge, rather than
growing a single decision tree for a problem that we’re trying to model,
we can grow multiple decision trees and take a majority vote between the trees. In order to ensure that each tree is as
close as possible to being independent of the other trees, we grow trees
using a bagging subset of our data. So for each tree we want to grow,
we apply bootstrap aggregating or bagging to create a subset of data. And we sample from the rows or
observations of our original data, but also from the columns or
the features of data. So that each subset that we train a model
on consists of some number of rows and some number of features
out of our original data. Thus again, the tree models that we grow
can be considered close to independent of each other, because they’re trained
on different subsets of data. We combined these tree models together and we take a majority vote in
the case of classification. Or if we’re applying our random
forest to a regression problem, we take a simple average of
the predictions each one generates. By doing this, we reduce the variance
of the output predictions and we reduce the likelihood of overfitting our
ensemble random forest model to the data. Random forests are great for working with
complex real world types of problems where we have highly nonlinear
relationships between inputs and outputs. Although we lose some of
the interpretability of individual decision trees,
where we can look inside the tree to understand how the predictions
are being generated. The advantage we gain is that
we reduce the variance and reduce the likelihood of
overfitting on the training data. One of the challenges of applying random
forest ensemble models is that there’s a number of decisions we have to make in
how we construct the random forest. These decisions fall into
three main categories. The first is the number of trees that
we’re going to grow in our random forest model, or the number of individual tree
models that we’re going to combine in our ensemble model. The second choice we need to make is our
sampling strategy for applying bagging. So how are we going to choose a subset
of data from the original data to use to grow each tree model? Our sampling strategy includes two parts. Number one is the bagging sample size as a
percentage of the total original dataset, in terms of the number of rows or
observations we have. Second part of our sampling strategy is
the maximum number of features that we want represented in each bagging sample. So when we apply bagging, we don’t
have to sample all of the features for every row that we choose
to use in our subset. We can instead choose to use
a certain percentage of features. This again helps to ensure that the tree
models that we grow based on the different subsamples of data are as close as
possible to being independent of each other. The third choice we have to make is the
depth of each tree in our random forest. And again, the depth of the tree controls
that trade off between underfitting and overfitting our data. We can set this in a couple of ways. One is we can just specify
the maximum depth, or the maximum number of splits,
or nodes in each of our trees. And the second way we can specify this is
by setting a minimum number of samples per leaf in our tree. So if we don’t specify this, we can
grow a very large tree such that each leaf in the tree ends up with only
one single observation at the leaf. This would be a very complex decision
tree model that tightly fits itself to the training data. When we do this, we can also run the risk
of overfitting to the training data. So instead of allowing it to grow so large
that each leaf contains only one sample, we can specify a minimum
number of samples per leaf. So that the tree stops growing itself
when it hits that minimum number of samples at each of its leaves. This results in a shallower tree, which is less likely to
overfit on the training data.

Clustering


Video: Clustering

Clustering

  • Clustering is a technique in unsupervised learning that organizes data into logical groups without using explicit group labels.
  • Unlike supervised learning, clustering does not have access to output labels or target values.
  • The goal of clustering is to group similar input data points together and separate those that are very different from each other.

Examples of Clustering

  • Genetics: inferring population structures using similarities in genetic data.
  • Marketing: partitioning potential customers into target segments based on characteristics such as geographic location, demographic data, or purchase history.
  • Text documents: clustering news articles into topics based on similarities in text characteristics.

Key Decision in Clustering

  • The key decision in clustering is determining the basis for evaluating similarity or difference between data points.
  • The choice of basis for similarity can greatly impact the clustering results.
  • Example: two images of a glass of apple juice and beer can be clustered together or separated based on different criteria such as color, type of object, or ingredients.
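
A toy sketch of this idea (the drinks' feature values are made up purely for illustration): measuring distance on a color-style feature alone makes apple juice and beer look nearly identical, while adding an ingredient-style feature pushes them far apart.

```python
import numpy as np

# Made-up feature vectors: (color_hue, alcohol_percentage) -- purely illustrative.
apple_juice = np.array([0.12, 0.0])
beer        = np.array([0.11, 5.0])

def distance(a, b, features):
    """Euclidean distance measured only on the chosen features."""
    return float(np.linalg.norm(a[features] - b[features]))

print(distance(apple_juice, beer, [0]))     # color only: nearly identical
print(distance(apple_juice, beer, [0, 1]))  # add an ingredient-style feature: far apart
```
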
Suppose we are creating a clustering model to segment the target market for our product. We want to determine the basis to use for calculating similarity/dissimilarity and grouping the potential customers into logical clusters for the purpose of creating targeted ad campaigns for each group. Which of the following might we use as our feature(s) for clustering (select the best answer)?

Any or all of the above

Correct. There are many ways we can calculate “similarity” when clustering. Choosing which features to use to determine similarity is the most important decision we have to make when applying clustering

We’ll now turn our attention
to unsupervised learning. Specifically, an unsupervised
learning technique called clustering.
What is clustering? Clustering is a technique
to organize data into logical groups without using
explicit group labels. The key difference between
unsupervised learning techniques such as clustering and the
supervised learning that we were studying earlier is that
we now do not have access to output labels or target values to use
in training a model. When we studied
classification and regression, our supervised
learning techniques, we were able to use the input values and
the output values from past observations to train a model to relate the
input and the output. In unsupervised learning, we don’t have access
to any output values. Our focus is on organizing the inputs
into logical groups or clusters in such a way that similar input data points are grouped in the same cluster, and input data points which are very different from each other, should fall within
different groups or different clusters. There are many examples of clustering found in
the world around us. One example is in
genetics, for inferring population structures using similarities
in genetic data. When we organize animals
into different types of animals, reptiles,
amphibians, mammals, etc, there’s no master list in the universe that says this is a mammal and this is a reptile. We organize these
things ourselves by looking at similarities
in genetics and characteristics
between animals to organize them
into logical groups. Another common example of clustering is found
in marketing, where we’re partitioning
potential customers for a product into different target segments
so that we can create and apply different
marketing techniques for each of our different
target segments. We might define our
target segments based on geographic location
or demographic data, or whether they’ve bought
products from us before or not. There’s many different
ways to organize potential customers
for a product based on what
characteristics we choose. Likewise, there’s no golden rule or no master
organization that says, this customer falls in Group 1 and this customer
falls in Group 2. We define this for
ourselves as best we can, based on exploiting
similarities and differences between potential
customers that we have to organize them
into logical groups. Another common
example of clustering is the application of
clustering to text documents. Let’s suppose we’re building
an app that’s organizing daily news articles that we
find in the news each day across a number of
different news sources into a set of the most
important topics for the day so that we can
tell our user what are the most important topics that you should pay
attention to today. Again, here, each news article that we come across doesn’t have some specific
pre-existing label that says this article is about Topic A or this article
is about Topic B. We look at the text
of each article. Then we can group
articles which have similar text characteristics as being about the same topic. The key decision that
we have to make when we apply clustering
to a problem is, how are we going to
determine whether something is similar to another thing or different
from another thing? What basis are we going to use
for evaluating similarity? Take an example on
the slide here. We have two images. One image shows a
glass of apple juice, the other image shows
a glass of beer. Depending on our
basis for similarity, we could make the argument that these things are
either very similar to each other or very
different from each other. Let’s suppose our basis for establishing
similarity was color. We might look at
these and say, well, they’re both of a golden color. These things are very similar. They should be in
the same cluster. If we were to look at these
things and say, well, they’re both liquids
that somebody drinks and are generally served cold, we might also say, yes, these things are very
similar to each other. They should be in the same
cluster with each other. However, let’s say instead
of color or type of object, that we choose to use ingredients as our
basis for similarity. Obviously, apple
juice and beer are made from very
different ingredients. If that was our basis for
calculating similarity, we might put these in
different clusters, saying that apple juice
contains a list of ingredients that’s
quite different from what beer is made of. Understanding our
basis for calculating similarity or difference
between two things is really the key decision
that we need to make when we’re applying
clustering to a problem.

Video: K-Means Clustering

Clustering Algorithms

  • K-means clustering is a popular algorithm for clustering data, but there are many other types of clustering algorithms to choose from, depending on the specific problem.
  • K-means is a good starting point, but it’s important to understand the strengths and weaknesses of different algorithms.

How K-means Works

  1. Choose a number of clusters (K) and plot the data points using the features as axes.
  2. Randomly select the locations of the cluster centers.
  3. Assign each data point to the nearest cluster center.
  4. Move the cluster center locations to the actual mean location of the points in each cluster.
  5. Repeat steps 3-4 until the cluster centers no longer move.
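
The loop below is a minimal NumPy sketch of those steps on made-up data; it is illustrative only and does not handle edge cases such as a cluster ending up empty.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # stand-in data with two features
K = 3

# Step 2: pick K random data points as the initial cluster centers.
centers = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(100):
    # Step 3: assign every point to its nearest center.
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 4: move each center to the mean of the points assigned to it.
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 5: stop once the centers no longer move.
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)
```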

Advantages and Disadvantages of K-means

Advantages:

  • Easy to implement
  • Converges quickly
  • A good starting point for clustering tasks

Disadvantages:

  • Requires specifying the number of clusters in advance
  • Doesn’t work well with complex data or a large number of features
  • Creates linear boundaries between clusters, which may not fit complex relationships between features.

Choosing the Number of Clusters

  • Try different numbers of clusters and choose the one that gives the lowest error in terms of total distance between data points and cluster center.
  • Consider the logical number of clusters based on the problem being solved.
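
One common way to compare different numbers of clusters in practice is to run K-Means for several values of K and look at the total within-cluster distance, which scikit-learn exposes as inertia_; the sketch below uses synthetic blob data and is illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # total within-cluster squared distance
# Inertia always falls as K grows; look for where the improvement levels off,
# and sanity-check against how many groups the problem logically suggests.
```
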
Which of the following statements about K-Means clustering are correct (select all that apply)?
  • K-Means is the most popular clustering algorithm
  • K-Means forms linear decision boundaries between clusters and so may not work as well on geographically complex data

After we’ve chosen our basis for establishing similarity or
difference between things, we then select an algorithm to apply to create our clusters. We have focused our discussion today on what’s called
K-means clustering, which is by far the most
popular clustering algorithm. But I also want you to be
aware that there are many, many different types of algorithms that apply
for clustering. There’s so many
algorithms because clustering problems are varied, and it’s important
to understand for your specific problem which of these algorithms
may fit the best. But when in doubt, K-means
is a great place to start. How does K-means work? Once we’ve established
the basis for similarity, we now want to group our data points together
in a set of clusters. The first thing we
need to do is choose a number of clusters in K-means. We can then plot our data points so that we can
visualize these groups. Let’s say we have a problem where we have two
features that we’re using as our basis for
similarity, X1 and X2. We plot the data points
as shown on the slide, using X1 and X2 as the axes. Our intuition would tell us the data points which are
close to each other should be grouped within the same cluster and data points
which are far from each other would logically be grouped in
different clusters. Let’s say we again have
chosen three clusters. We could locate each of
these three clusters and we could assign each of these data points to the nearest cluster
so that they’re organized into three
distinct clusters. The key question though, is where to locate
each of these clusters and how to assign data
points to the cluster. In the K-means algorithm, our objective function is
to minimize the sum of the distances from every point towards the center of
its assigned cluster. So that each data
point is assigned to the cluster with the center
that’s nearest to that point. Let’s look at how this
works in practice. After we’ve selected
a number of clusters, Step 1 is to randomly select the locations of the center
for each of those clusters. Again, we’ve chosen to form three clusters using our data. We’ll pick random locations to put the centers of each
of those clusters, 1, 2, and 3, as we’ve
shown on the slide. Step 2, is then to assign all of our data points to its
nearest cluster center. The blue points that we see
on the slide are assigned to Cluster 1 since they’re closest to the
center of Cluster 1, the orange points are
assigned to Cluster 2, and the purple points are
closest to the center of Cluster 3 so they’re
assigned to Cluster 3. As we can see, once
we’ve assigned our points, the cluster center locations
we’ve chosen are not actually the centers of the clusters of points
we’ve assigned them to. Let’s now move those centers to the actual center location or the mean location of the points in each
of those clusters. We’ll move our cluster
center locations from the original randomly
chosen locations to the actual centers of the data points assigned
to that cluster. We’ll now repeat that
process and we’ll again assign points to the
nearest cluster center, which might stay the same or
more than likely would now change since we’ve moved the
cluster center locations. The point assignments
will change because some points have
now moved closer to another cluster center. We’ll again assign the points to the nearest cluster
center and we’ll again move the cluster centers to the actual mean location of the points within
that cluster. We’ll continue to repeat
this process over and over again until the cluster
centers are no longer moving, meaning they are at the
actual mean locations of the points within
that cluster. Once we’ve found
those mean locations, we’ve completed our
cluster assignment, and we’ve established
our three clusters and the data points within each. One of the advantages of K-means is that it’s very
easy to implement, it converges quickly so it’s
generally quick to run. As a result, it’s generally a very good starting
point when you’re working with clustering tasks. One of the key downsides
of K-means is that it requires the user to specify a number of clusters in advance. For some problems where
we’re applying clustering, we may have in mind a logical
number of clusters to use. In other problems,
we really don’t know how many clusters to use. Generally what we’ll do is we’ll choose different
numbers of clusters, we’ll run K-means and we’ll
look at the one that gives us the lowest error in terms of the total distance between data points and cluster center. But it also fits with
our logical intuition of how many logical
clusters we might expect given the problem
that we’re trying to solve. Another challenge of K-means
is it doesn’t work well for data that’s
geographically very complex, that’s a large
number of features. K-means creates linear
boundaries between the clusters and so in some cases where we’re dealing with very complex data, with complex relationships
between the features, we may want to choose
another type of clustering that can give us a better
fit for our problem.

Review


Video: Module Wrap-up

The text discusses the importance of considering three key criteria when evaluating and choosing machine learning algorithms: performance, interpretability, and computational cost.

  • Performance refers to how well the algorithm performs on a specific task.
  • Interpretability refers to how easily the algorithm’s predictions can be understood and explained. Linear models and decision trees are highly interpretable, while ensemble models are less so.
  • Computational cost refers to the resources required to train and run the model.

The text also touches on unsupervised learning, specifically clustering, and highlights the importance of defining similarity between data points. In clustering, the key decision is how to define similarity, and this can be done in various ways, such as by size, year built, location, or other factors.

The main points are:

  • When choosing a machine learning algorithm, consider performance, interpretability, and computational cost.
  • Interpretability is important when it’s necessary to explain how the model is generating predictions.
  • Clustering requires defining similarity between data points, and this can be done in various ways depending on the problem and data.

In the last two modules, we introduced some of the most common machine
learning algorithms. We started with a discussion on a set of algorithms
called linear models, which includes
linear regression, which is typically used for regression types of
machine learning tasks, and logistic
regression, which is used for classification tasks. We then talked about a very
different set of models, which are called
non-parametric models. Specifically decision
trees and ensemble models, including random
forests, which are composed of many decision trees. We talked about some
of the advantages of this type of machine
learning algorithm, and also some of
their disadvantages relative to simpler
linear models. It’s important to note,
as you evaluate and choose the algorithms to use for machine
learning problems, there are three key
criteria that we consider. The most obvious one
is the performance. One algorithm may
naturally give us a better or worse
performance than another algorithm that we
choose for our problem. The second criteria that we want to consider is
interpretability. One of the great things about linear models is that they’re
highly interpretable. It’s very easy to understand how a linear model is
generating its prediction. If we’re creating
a model where it’s important that we’re
able to explain to our user of our model how
we achieve the prediction, we might consider something like a simple linear regression. Likewise, decision trees are also very simple to understand how a prediction is generated by following the path
through a tree. But as we get to
more complex things like ensemble models, we lose interpretability, and it becomes much more
difficult for us to actually explain how a model is
generating its predictions. In some cases, that
might be okay, where it’s more acceptable
to treat a model as a black box
generating predictions. But when we’re working
with things that have very significant consequences
for individuals, financial consequences,
or otherwise, we need to make sure
that we’re building models that have some
level of interpretability, so that when we have
to go look into why the model is generating the outputs that
it’s generating, that we’re able to do so. The final criteria is
computational cost. Each of these
different algorithms comes with a different level of computational cost and resources to train the model
in the first place, and then to run inference, or generate predictions using the model on a go-forward
basis in a product. As we compare algorithms, and we build models by
selecting an algorithm to use, we need to keep in mind each of these three criteria to
drive our decisions. We also briefly touched in this past module on
unsupervised learning, and specifically, clustering. We looked at k-means clustering. The key thing to remember
for clustering is that the most important decision
we have to make is how we’re defining
similarity between things. To go back to an earlier
example we used where we were building a model to
predict house prices. Let’s say that we’re
going to apply an unsupervised technique
to this type of problem. Rather than generate predictions of sale prices for those houses, we’re going to
attempt to classify houses into different
logical groups. Again, the key thing here to make this type
of a model work is deciding how we define whether houses are
similar or different. There are many ways to do this. We might define similarity
based on something like size, where we use square footage, or number of bedrooms,
or bathrooms. We might define similarity
based on year built, where older houses are more
similar to each other, regardless of their size
relative to new houses. Or we might choose
something like location, neighborhood that
the house is in, or school district where
houses, they’re in. The same neighborhood,
a school district, are more similar to each other
regardless of size or age. Thinking through how to define similarity between things
is really the key to success in clustering approaches regardless of the specific
clustering algorithm that you choose to apply.

Quiz: Module 5 Quiz

What impact does the depth of a decision tree model have on its complexity and likelihood to underfit or overfit?

Shallow decision trees are simpler models and may underfit data, while very deep decision trees are complex models and may overfit data.

Regardless of depth, all decision tree models are simple and very likely to underfit

The depth of a decision tree does not influence the model’s complexity or fit

Deep decision trees are simpler models that commonly underfit data, while shallow decision trees are more complex and can result in overfitting

When used for a regression task such as predicting demand for a product, how does a decision tree make its predictions?

For each new observation, we look at the training data and find the most similar point, and then use the corresponding target label as the prediction for the new observation

We trace the route through the tree for each new observation until we reach a leaf. We then use the majority vote of the class of each point at the leaf as the prediction

For each new observation, it traces the route through the tree until it reaches a leaf. It then uses the mean target value of the training points at that leaf as the prediction.

For each new observation, we re-train the tree with the data including the observation, and then use the mean target value of the points at the leaf as the prediction.

Which of the following are correct statements about decision trees (select all that apply)?
Single decision trees are highly interpretable
Decision trees do not handle non-linear relationships well
Decision trees are prone to overfitting on the training data
The fit of a decision tree is highly influenced by the hyperparameters, specifically those which control the tree’s depth

What is the goal of using ensemble models rather than individual models?

Ensemble models are more interpretable

Ensemble models reduce the computational requirements to train and run the model

Combining multiple models in an ensemble makes the aggregate model less likely to overfit and better at generalizing to new data

Ensemble models usually have better prediction performance on the training dataset

When we create an ensemble model, what decisions do we need to make (select all that apply)?

Whether we use the same algorithm or different algorithms for each member model

Whether each individual member model is trained on the full dataset or different slices of the data

How many individual member models we would like to create in the ensemble

How we would like to aggregate the predictions of each individual member model to form the prediction for the ensemble (for example, using majority voting, simple average, or weighted average)

When we use tree models, why do we often choose to use a Random Forest ensemble model rather than a single decision tree?
Random Forests always generate better predictions than single decision trees on new data
Random Forest models require less computational power to train
Using a Random Forest reduces the risk of overfitting relative to a single decision tree
We can use Random Forest models for both regression and classification, while we can only use decision tree models for classification

What is a key benefit of tree-based models (decision trees or Random Forests) relative to linear models (linear or logistic regression)?
Tree-based models are less likely to overfit than linear models
Decision tree and Random Forest models are easier to interpret than linear models
They can easily model complex, non-linear relationships between the input features and targets with no need for additional feature creation / feature engineering work
They always yield better performance than linear models do in generalizing to make predictions on new data

For which of the below applications would we likely use a clustering approach (select all that apply)?
Modeling and predicting flight delays using historical flight data
Segmenting the potential customer base for a new product into groups for the purpose of developing targeted advertising campaigns for each
Organizing food items into groups based on nutritional content
Training a model for use in a car to autonomously parallel park

What is the most important (and typically first) decision to make when applying clustering to a problem?

Determining what basis we will use for measuring similarity / dissimilarity between datapoints (how we will measure which points are similar and which are not)

Determining which clustering algorithm we should apply

Which metric we will use to evaluate the quality of the clusters formed

How we will split our data between training and test sets

Which of the below are true about K-Means clustering (select all that apply)?
It is easy to implement and generally converges quickly
It is the most popular clustering technique
It does not require the user to provide a specific number of clusters to use in the algorithm
It forms linear decision boundaries between the data when separating into clusters