
In this module we will explore the use of linear models for regression and classification. We will begin by introducing linear regression and continue with a discussion of how to make linear regression work better through regularization. We will then switch to classification and introduce the logistic regression model for both binary and multi-class classification problems.

Learning Objectives

  • Explain how linear regression works
  • Describe the differences between linear and logistic regression
  • Discuss the benefits and types of regularization

Linear Regression & Regularization


Video: Introduction and Objectives

Linear Models vs. Non-Parametric Models

  • Linear Models:
    • Use a predefined form (or template) to model input-output relationships.
    • Train quickly and can work on smaller datasets.
    • Can be overly simplistic for complex problems, leading to underfitting.
  • Non-Parametric Models:
    • Do not assume a predefined form for input-output relationships.
    • Flexible and can handle complex data well.
    • Often achieve better prediction performance.
    • Require more training data and are susceptible to overfitting.

Supervised Learning Algorithms

  • Both linear (parametric) and non-parametric algorithms exist for supervised learning.
  • Complexity can range from simple to advanced.

Focus of the Discussion

  • Beginning with linear models, specifically:
    • Linear Regression
    • Ridge and Lasso Regression (which use regularization to prevent overfitting)
    • Logistic Regression (a classification model, despite the name)

Key Learning Goals

  • Understand how linear regression works.
  • Differentiate between linear regression (for numerical outputs) and logistic regression (for classification tasks).
  • Explain the concept of regularization and its benefits in preventing overfitting.

We’re going to begin our discussion of the algorithms behind machine learning by focusing on a class of models that we call linear models. Linear models are parametric algorithms, meaning that they take on a known form, which is used as a sort of template to model the input-to-output relationship. This template is fixed using a predetermined set of coefficients or parameters, and when we build and train a model, our job is to learn the values of those coefficients or parameters. Parametric models can train quickly, and they work well even on small data. The downside of using parametric algorithms is that they are constrained to the specific form or template that we’ve chosen to model the input-output relationship. Because of that, they can sometimes be too simple for the complex real-world problems that we’re trying to model, and so they’re prone to underfitting.

The other class of models are called non-parametric algorithms. These do not make any particular assumption about the form or template of the input-to-output relationship in advance of building and training the model. Instead, these models are highly flexible and can adapt well to complex, nonlinear data and relationships. As a result, they often deliver higher prediction performance. However, they require more data to train, and they are often prone to overfitting on the training data that we’ve provided.

If we look at the various algorithms available to us for supervised learning, we have both parametric and non-parametric algorithms that we can use. Some of them are very simple and some are more complex. We’re going to start the discussion of algorithms by focusing on the linear models: linear regression, ridge and lasso regression, and logistic regression, which, even though it’s called regression, is actually a classification model.

At the conclusion of this module, you should be able to explain and understand how linear regression works. You should be able to describe the differences between the linear regression model and the logistic regression classification model. And you should also be able to understand and explain the benefits of what we call regularization, which is a technique for reducing overfitting of linear models.

Reading: Download Module Slides

Video: Linear Regression

What is Linear Regression?

  • A statistical model assuming a linear relationship between input features and the target output.
  • Simple yet versatile, forming the basis for more complex models like neural networks.
  • Useful for benchmarks and understanding input-output relationships.

How it Works

  • Equation: y = W_0 + W_1X (+ W_2X_2 + … for multiple features)
    • y = predicted output
    • W_0 = bias term (y-intercept)
    • W_1, W_2, etc. = coefficients (weights) of features X_1, X_2, etc.
  • Training: Finding optimal coefficients to minimize prediction error.
  • Error Calculation:
    • Error = actual value – predicted value
    • Often measured as Sum of Squared Errors (SSE) for convenience

Key Concepts

  • Cost/Loss Function: SSE is the cost function linear regression aims to minimize.
  • Closed Form Solution: Linear regression often has a direct way to compute optimal coefficients (see the code sketch after this list).
  • Non-linear Relationships: Features can be transformed (e.g., squared, cubed, log) to model non-linear patterns (polynomial regression).
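
To make the closed-form idea concrete, here is a minimal sketch of solving for the coefficients with the normal equation in NumPy. The bedroom counts and prices are invented for illustration; in practice you would more likely use a library such as scikit-learn than solve the equation by hand.

```python
import numpy as np

# Toy data: one feature (e.g., number of bedrooms) and a target (sale price).
# Values are made up for illustration only.
X = np.array([[2.0], [3.0], [3.0], [4.0], [5.0]])
y = np.array([200.0, 250.0, 260.0, 310.0, 360.0])

# Add a column of ones so W_0 (the bias / y-intercept) is learned as well.
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Closed-form (normal equation) solution: W = (X^T X)^-1 X^T y
W = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

y_hat = X_b @ W                    # predictions
sse = np.sum((y - y_hat) ** 2)     # Sum of Squared Errors (the cost function)

print("bias W_0 and weight W_1:", W)
print("SSE on the training data:", sse)
```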

Example: Predicting Fuel Efficiency

  • Simple linear regression on horsepower works decently, but there’s room for improvement.
  • Applying a non-linear transformation (horsepower cubed) greatly improves model accuracy, demonstrating the power of polynomial regression.
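
As a rough sketch of how an experiment like this might look in code, the snippet below fits a plain linear regression and a cubed-feature variant with scikit-learn and compares test-set Mean Squared Error. The file name auto_mpg.csv and the column names horsepower and mpg are hypothetical stand-ins for whatever dataset the course actually uses.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical dataset with 'horsepower' and 'mpg' columns.
df = pd.read_csv("auto_mpg.csv")
X = df[["horsepower"]]
y = df["mpg"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Plain linear regression on horsepower.
lin = LinearRegression().fit(X_train, y_train)
print("linear test MSE:", mean_squared_error(y_test, lin.predict(X_test)))

# Polynomial regression: create a new feature, horsepower cubed,
# and fit a linear regression on the transformed feature.
X_train_cubed = X_train ** 3
X_test_cubed = X_test ** 3
poly = LinearRegression().fit(X_train_cubed, y_train)
print("cubic test MSE:", mean_squared_error(y_test, poly.predict(X_test_cubed)))
```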

Which of the following are good reasons to use linear regression (select all that apply)?
  • It is highly interpretable and so is a good choice of model in situations where interpretability is critical
  • It helps us understand the relationships between the input features and the output target
  • It is useful at least as a first model to apply to get a benchmark of what performance we can reasonably expect on our problem
We are building a regression model and find that the relationship of one of our key input features with the target variable appears to be non-linear. Which of the following strategies should we follow?

We should create a new feature from it by applying a non-linear transformation, and then use our newly created feature in the linear regression model

Correct. We can improve the fit of our linear regression model by applying a non-linear transformation to this feature

We’ll start the discussion of linear models with simple linear regression. Many of you are probably familiar with linear regression; it shows up in almost every imaginable field, everything from economics to many branches of the sciences. A linear regression model assumes a linear relationship between the input and output, the inputs being the data features that we’ve defined and the output being the target we’re trying to predict. This relationship is defined by a set of coefficients, which are multipliers of each of the input features.

If linear regression is so simple and common, why are we spending time talking about it? There are actually a couple of answers to that question. The first is that even though it is simple, linear regression forms the basis of many of the more complex machine-learning models that we use. In particular, neural networks, which we’ll talk about in a later lesson, really are founded on the basis of the simple linear regression. Linear regressions can also be surprisingly effective models in certain situations if they’re properly used. They also make a great first model to apply to get a benchmark or sense of the expected performance that you might hope to achieve on a particular machine learning task. By the way, I always recommend, when you’re working on a modeling task, to start with a simple model like a linear regression. Apply that as a first step and see what performance it gives you. Then, once you move on to more complex algorithms, you can compare them back to your original benchmark and see whether you’re really making an improvement or not. Finally, one of the really nice things about linear regression is that it’s highly interpretable, and it’s very easy for us to understand the relationships between the inputs and the outputs in the model that we’re building.

How does a simple linear regression model work? Let’s take the example we were working on before of predicting sale prices for homes. If we were building a simple linear regression involving a single variable, the number of bedrooms, we might provide that variable, the bedrooms, into a model, and as an output from our model, we would be predicting the home price. Our model might look something like this: y = W_0 + W_1X. W_0 is what we call the bias term, or you can think of it as the y-intercept. If all of the features, or in this case the single feature we have, were zero, what would the y value be? We call this the bias. W_1 would be the coefficient, or sometimes also called the weight, of the variable x, which represents the number of bedrooms in our house. This is the multiplier of that feature in calculating the total value of our target sale price.

Let’s now move from the simple linear regression model to the multiple linear regression model. In this case, we have more than one feature; in fact, we have as many features as we would like to put into our model. We might add additional features in this case, such as the square footage of our home, the school district, or the neighborhood our home is in. Again, we represent this with an equation which contains that bias term W_0, but now we have multiple coefficients, one for each of our input features. We may have a coefficient W_1, representing the weight of the number of bedrooms in calculating the final target sale price. W_2 might be the coefficient of the square footage of our home. W_3 might represent the coefficient of the school district that we’re in, and so on. We add all these up, the coefficients or weights times the values of the features, to calculate our y value, our target sale price.

When we train a linear regression model, really what we’re doing is learning the optimal values of these coefficients or weights that can effectively model the relationship between the input features and the output target. The first step in identifying the optimal values of those coefficients or weights is to calculate the total error of the model. We will then alter the coefficients in a way that hopefully reduces that total error, to the point where we’ve minimized our total error. How do we calculate the total error of our model? Well, the error for any given point, in this case any given house for sale, is the actual price that that home sold for minus the predicted sale price. In mathematical notation, we call it y minus our prediction, which we call y-hat. Alternatively, for computational convenience, we often define error in terms of what we call the Sum of Squared Errors, or SSE. SSE is calculated as the sum of the predictions minus the actuals, squared, or y-hat minus y squared, summed up over all the data points that we have. When we build our model, what we’re really trying to do is seek the coefficients that minimize that total value of the Sum of Squared Errors. SSE, in modeling terminology, is called our cost function, also called a loss function. In this case, our cost function, or SSE, is the sum of y-hat minus y squared for every data point. Again, when we’re training our linear regression model, we’re seeking to find the values for those coefficients or weights that minimize the total of our cost function. To do this, we use the training data, the inputs x and outputs y available to us, and we solve for the weights or coefficients that result in the minimum of the cost function. In the case of linear regression, we can usually do this using a closed-form solution. In other types of models, we apply the same strategy, but often there is no closed-form solution, so we use more complex methods for calculating the values that result in the minimum of the cost function.

Many people think that linear regression really only works when there’s a linear relationship between the inputs and the outputs. In reality, you can also model nonlinear relationships between inputs and outputs. In order to do that, we transform an input feature by some nonlinear transformation function and create a new feature, which we then use as an input to the model. For example, we may take an input feature x to a certain power, x squared or x cubed, or we may take the log of x, and we’ll create that as a new input feature. We feed that into our model, and now we’re able to better capture some of those nonlinearities in the relationship between the inputs and our outputs. When we do this, it’s called polynomial regression. There’s actually an unlimited number of transformations that we can apply.

Let’s look at an example of when this comes in handy. In this case, the objective of our modeling task is to predict the fuel efficiency of cars given the horsepower of the engine. You can see here on this slide that I’ve fitted a simple linear regression to horsepower. It looks like it does an okay job at capturing the variability in the pattern that we see in the output, miles per gallon, but there’s certainly room for improvement. On this screen I’ve displayed the Mean Squared Error for the training set and the test set. Now let’s take a look at what happens when we use a nonlinear transformation and apply polynomial regression to the same task. In this case, I’ve taken horsepower cubed and I’m using that as the input to my model; I’m now predicting miles per gallon based on a single input, horsepower cubed. As we can see, our model is doing a much better job at capturing that non-linear relationship between horsepower and miles per gallon, and as a result, the Mean Squared Error on both our training set and our test set has significantly improved.

Video: Regularization

Regularization for Better Regression Models

  • The problem: Linear regression models can overfit the training data, leading to poor predictions on new data.
  • What is regularization? It’s a technique that adds a penalty term to the cost function to discourage overly complex models (those with many features or high weights).
  • How it works: The penalty term makes the model seek a balance between fitting the training data well and keeping the model simple. This improves its ability to generalize to new data.

Types of Regularization

  • Lasso Regression (L1):
    • Penalty term: Sum of absolute values of coefficients.
    • Can force coefficients to zero, effectively removing irrelevant features.
    • Good for feature selection.
  • Ridge Regression (L2):
    • Penalty term: Sum of squared coefficients.
    • Reduces coefficients of irrelevant features towards zero, but not completely.
    • Useful for complex datasets and when many features are somewhat relevant.
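
A minimal sketch of the two penalties in scikit-learn is shown below, using a synthetic dataset in which only a few features truly matter; the alpha value (scikit-learn's name for the lambda penalty strength) is arbitrary, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 features, but only 10 actually influence the target.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

# alpha plays the role of lambda: it controls the strength of the penalty.
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso (L1) can force coefficients of irrelevant features all the way to zero,
# effectively removing those features from the model.
print("non-zero coefficients kept by Lasso:", (lasso.coef_ != 0).sum())

# Ridge (L2) shrinks coefficients toward zero but almost never to exactly zero.
print("non-zero coefficients kept by Ridge:", (ridge.coef_ != 0).sum())
```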

Choosing between Lasso and Ridge

  • Lasso: Choose if you want a simpler model with fewer features for better interpretability.
  • Ridge: Choose if you suspect many features have some influence, even if small, and there’s correlation between features.
  • Experimentation: Often the best way to decide is to try both and see which performs better on your specific dataset.

What is the primary reason we apply regularization when building models?

It reduces complexity in the model, decreasing the probability of overfitting and helping the model better generalize to predict on new data

Correct. Regularization simplifies the model and helps it better generalize to predict on new data

Why can LASSO regression be considered as a method for feature selection while Ridge regression cannot?

LASSO reduces the coefficients of irrelevant features to 0, while Ridge reduces them close to 0 but does not completely remove them

Correct. LASSO removes features which do not add value to the model by reducing their coefficients to 0

In the last lesson, we discussed the training method for linear regression: how we calculate the sum of squared error, and how we seek to find the values for the weights or coefficients that minimize the sum of squared error. One of the challenges of this training method is that it tends to reward overfitting on the training data. Because we’re calculating and trying to minimize SSE on the training data alone, we can sometimes end up with a model that is fit very tightly to the training data, and so when we try to generate predictions on new data with it, we find that it doesn’t generalize particularly well. How do we build a linear regression model in a way that is a little more balanced between fitting tightly on the training data and still being flexible enough to generate accurate predictions on new data, as we can test with the test set?

One method for doing this is to add a penalty factor into our cost function that penalizes complexity, in this case complexity being in the form of features: the number of features that we have included, and the values of the weights for all of those features. The cost function for normal linear regression that we described earlier looks something like this: the sum of squared error is equal to the actual values for y minus the predicted values for y, squared, and summed up over all the data points that we have. When we apply regularization, we add a penalty term that’s a function of the sum of the coefficients or weights in our linear regression equation. Now, when we have more coefficients or higher values of those coefficients, it tends to increase the cost function. Put another way, as we reduce the number of coefficients or reduce the weights of our coefficients, our cost function tends to decrease. Minimizing this new cost function, including this regularization penalty, helps us find the optimal balance between fit on our training data and simplicity of the model in terms of the number of features and the weights of those features. The cost function with regularization applied is now equal to the sum of squared error, y minus y-hat squared and summed over the data points, plus a value lambda times our penalty factor. The value lambda is a fixed value that we set, and it controls the strength of the penalty that we want to apply. As we increase lambda, we apply a higher penalty, and we can decrease lambda to apply a lower penalty.

For the penalty factor, there are two primary choices that we may select from. One is called lasso regression, the other is called ridge regression. In lasso regression, we calculate the penalty factor as the sum of the absolute values of the coefficients, multiplied by our lambda value. Lasso regression actually has the effect of forcing coefficients all the way to zero if the variables behind those coefficients are really not relevant in predicting the output. If we have a large number of features in the model that we’re trying to build, but several of those features really are not adding value to our ability to predict the output, then applying lasso regression with a sufficient penalty factor forces those coefficients to zero and thereby removes those features from the equation altogether. Lasso regression can therefore also be considered a form of feature selection, because it generally reduces the number of features that are present in our final model equation. On the other hand, ridge regression does not have the effect of forcing our coefficients all the way to zero. In ridge regression, our penalty term is the sum of the weights squared across all of the weights or coefficients in our equation. Ridge regression forces the coefficients of irrelevant factors towards zero, but generally not all the way to zero. Ridge regression can be an effective modeling strategy to reduce overfitting and improve the balance between simplicity and fit on the training data, but it’s not a feature selection method like lasso regression is, in the sense that it’s not actually eliminating features that are irrelevant; it’s just reducing the coefficients of those features to something that’s very close to zero.

Regularization can be a highly effective strategy when working with regression models, and it can often give us a better model than a standard linear regression model alone, particularly when we’re dealing with complex data with many features. As far as the choice between lasso and ridge regression, you might have a reason to prefer one or the other, or you might try both and see which one does a better job predicting the output. If you desire a simpler model with a smaller number of features, one that’s more interpretable, lasso can be an effective strategy, because it reduces the number of features by eliminating features from our model that really are not particularly valuable in predicting the output. On the other hand, if we know ahead of time that we have a very complex relationship between the output target and many of our input features, and we have what’s called collinearity, or correlation between some of our input features, ridge regression can be a better strategy. When in doubt, it’s advisable to try both approaches and see which one does a better job in modeling.

Logistic Regression


Video: Logistic Regression

From Linear Regression to Logistic Regression

  • The Problem: Linear regression isn’t ideal for classification tasks where outputs are discrete classes (like 0 or 1). It can produce predictions outside the desired range.
  • The Solution: Instead of predicting the class directly, logistic regression predicts the probability of an output being in a certain class.
  • The Sigmoid Function: The sigmoid function is used to bound the model’s outputs between 0 and 1, making them interpretable as probabilities.
  • Model Structure:
    • A linear regression model is used to calculate an intermediate value (‘z’).
    • This value is then fed into the sigmoid function to get the probability.
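
A minimal sketch of that two-step structure (linear combination, then sigmoid) follows; the weights and the input example are made-up values chosen only to illustrate the shape of the computation.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters and a single input example.
w0 = -1.5                      # bias term
w = np.array([0.8, 2.1])       # one weight per feature
x = np.array([1.0, 0.5])       # feature values for one example

z = w0 + w @ x                 # the linear-regression-style intermediate value 'z'
p = sigmoid(z)                 # probability that y = 1
print("P(y = 1) =", p)
```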

Finding the Best Model (Gradient Descent)

  • Cost Function: A cost function is defined specifically for logistic regression.
  • No Closed-Form Solution: Unlike linear regression, there’s no simple formula to calculate the optimal coefficients (weights) for the model.
  • Gradient Descent: An iterative algorithm used to find the coefficients that minimize the cost function. Here’s how it works:
    1. Start with random coefficients.
    2. Calculate the gradient (direction of steepest increase) of the cost function.
    3. Take a small step in the opposite direction of the gradient (using the learning rate to control step size).
    4. Repeat until a minimum is reached.
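
Putting these steps together, here is a rough from-scratch sketch of gradient descent for logistic regression in NumPy, using the standard log-loss gradient. The synthetic data, learning rate, and iteration count are arbitrary choices for illustration; in practice a library implementation such as scikit-learn's LogisticRegression would normally be used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny synthetic binary-classification dataset (one feature plus a bias column).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + rng.normal(scale=0.5, size=100) > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])   # column of ones learns the bias w_0

w = np.zeros(2)          # 1. start from initial weights (zeros here, often random)
learning_rate = 0.1
for _ in range(1000):
    p = sigmoid(X @ w)                  # probabilities under the current weights
    gradient = X.T @ (p - y) / len(y)   # 2. gradient of the log-loss cost function
    w = w - learning_rate * gradient    # 3. small step opposite the gradient
                                        # 4. repeat until (approximately) converged

print("learned bias and weight:", w)
```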

Why do we use the sigmoid function in logistic regression, and not in linear regression?

In logistic regression we are predicting the probability of the positive class (y=1) and so we need to output values between 0 and 1, which is what the sigmoid function accomplishes. In linear regression we are predicting a numerical output which is not capped in the range of 0 to 1.

Correct. The sigmoid converts our scores into probabilities ranging from 0 to 1

When we perform gradient descent to calculate the optimal coefficients/weights in a logistic regression model, during each iteration of gradient descent we update the values of the weights by a small amount to move closer to the weights which minimize our loss/cost function. What information do we need at each iteration in order to calculate the new weight values (select all that apply)?
  • The previous weights from the prior iteration
  • The value of the learning rate
  • The gradient of the cost/loss function with respect to the weights

Correct. We start with the previous weights and then we subtract the learning rate multiplied by the gradient of the cost function

In the past couple of lessons, we focused our discussion of linear models on regression tasks. Let’s now try to tackle a classification problem using what we’ve learned about linear models so far. Suppose we have a simple problem where we again have a single input variable called x and we’re trying to predict an output variable y. But now, because it’s a classification task, our output is a class. Let’s make it simple and use a binary task where our output is either a 0 or a 1. We could again apply a linear regression to create a model to do this. Our linear regression might look something like this: again, it takes the form y-hat is equal to our bias w_0 plus our coefficient w_1 times our single input x_1. However, we now have a couple of problems. One is that, as we can see on the diagram, the linear regression model is almost always predicting the wrong value. Our output values are either 0 or 1, and in almost every case, our prediction is neither a 0 nor a 1. How do we interpret these predictions that fall between 0 and 1? Additionally, what about values that our model is predicting that are greater than 1, or values it’s predicting that are less than 0?

One solution to some of these problems would be, rather than trying to predict the actual y value of 0 or 1, to predict the probability that y is equal to 1. In this case, those values that fell between 0 and 1 now make sense, because the probability that y is equal to 1 is somewhere between 0 and 1. One of the problems we still have, however, is what to do about those values that are higher than 1 or lower than 0. In order to solve this, let’s now apply a function that produces outputs which only fall within that range of 0 to 1. One option here would be to use what’s called the logistic or sigmoid function. The sigmoid function is a function whose outputs fall between 0 on the lower side and 1 on the upper side. By applying this function, we can generate predictions which make sense: predicting the probability that y is equal to 1, falling somewhere between 0, meaning y is equal to 0, and 1, meaning it’s 100 percent certain that y is equal to 1.

Our desired model output is the probability that y is equal to 1, and we’ve decided to use the sigmoid function in order to create boundaries on the outputs of our model so that they fall between 0 and 1. As an input to this sigmoid function, we provide the output of our linear regression model: our bias w_0 plus our coefficient w_1 times the input feature x_1. In general, we can create a model that takes our input features x times the coefficients and combines them in the form of a linear regression, taking each feature times its respective coefficient. We can call this value z, and we then provide this value z to the sigmoid function. Coming out of the sigmoid, we have a value that falls between 0 and 1, which we then interpret as the probability that y is equal to 1.

Our challenge now is to find the optimal values for the coefficients of our linear model. We can approach this in a similar way to what we did in linear regression. We first define our cost function. We then seek to find the optimal values of the weights or coefficients that minimize the cost function. Again, the way we can do this is very similar to what we did in linear regression. If we have a function that we want to minimize, like our cost function, we calculate the derivative of the function, which is also called the gradient of the function, and we set the derivative equal to 0. We can then solve for the values of the coefficients that make this equation true. In linear regression, there was a simple closed-form solution to this, so we could easily calculate the values of our coefficients. In logistic regression, where we’ve now introduced the sigmoid function, we no longer have a simple closed-form solution. We resort to an iterative solving method that we call gradient descent to solve for the values of the coefficients that minimize the cost function.

How does gradient descent work? Suppose we want to minimize the function y equals x squared. We start at some point on the curve y equals x squared, and we move iteratively towards the minimum, stopping once we’ve reached the minimum. How do we know where to move? Well, the first question is, which direction should we move? The answer is that we move in the direction opposite the value of the derivative or the gradient. If we think back to calculus, the gradient of a function points in the direction of steepest ascent of that function. If we’re trying to find the minimum of a function, we want to move opposite that direction of steepest ascent, in the direction of steepest descent. We move in the direction opposite the value of the gradient at the starting point that we’ve selected. The second question is, how far should we move in that direction? The answer is that we move by some small amount, which is equal to a parameter called the learning rate, multiplied by the value of the gradient at that point. The learning rate is an important parameter in gradient descent, and it’s also critically important in neural networks; we’ll talk about learning rates much more once we get to the neural networks lesson. But for now, the answer to our question of how to minimize that function is that we pick some random starting point on the curve, we calculate the gradient, and we then move in the direction opposite the gradient, by an amount equal to the learning rate times the value of the gradient at that point. We can then do this all over again and move one more step in the direction opposite the gradient, and we continue to move until we’ve reached a stable point where we no longer move any further. Once we reach that stable point, we know that we’ve reached the minimum value of the function.

How do we apply gradient descent in the context of estimating the optimal weights or coefficients for our logistic regression model? Again, we first define our cost function for logistic regression. We then seek to find the values of the weights or coefficients that minimize this cost function. To do this, we apply gradient descent. We first pick a random set of weights. We calculate the cost function using that random set of weights and the training data we have available. We then calculate the gradient of that cost function and use our gradient descent rule to iteratively update the weights. We calculate a new set of weights, which is equal to the previous weights minus the learning rate times the gradient. We then repeat this, each time moving one small step in the direction of our minimum cost, until we’ve reached a minimum or, after a certain number of steps, we terminate the procedure.

Video: Softmax Regression

From Binary to Multi-Class Classification

  • Logistic Regression: Used for binary problems (e.g., is it a 1 or a 0). The sigmoid function calculates the probability of the positive class.
  • Multi-Class Needs Softmax: When you have more than two possible classes, you need the softmax function.

How Softmax Works

  • Similar Structure: Like logistic regression, you take inputs and multiply them by weights to get a ‘z’ value.
  • Key Difference:
    • Instead of one set of weights, you have a set for each potential class.
    • Each ‘z’ value is fed into the softmax function, giving you a probability for each class.
  • Normalization: Softmax ensures the probabilities of all the classes add up to 1.
  • Picking the Prediction: The class with the highest probability is your model’s prediction.
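
As a minimal sketch of the normalization step, the softmax function can be written in a few lines of NumPy. The score values below are arbitrary; they only illustrate how raw per-class z values become probabilities that sum to 1.

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the resulting probabilities.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Hypothetical z values, one per class.
z = np.array([2.0, -1.0, 0.5])
probs = softmax(z)
print(probs, probs.sum())   # the probabilities sum to 1
```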

Example: Animal Classifier

  1. Input: An image of an animal.
  2. Features: The pixel values of the image.
  3. For Each Class (dog, cat, etc):
    • Multiply pixel values by class-specific weights.
    • Calculate ‘z’.
    • Softmax gives the probability of the image belonging to that class.
  4. Output: Example probabilities: Dog 0.8, Cat 0.05… The model predicts the class with the highest probability (here, dog).
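
Continuing the animal example, a rough sketch of the per-class computation might look like the following; the pixel values, weights, and biases are randomly generated stand-ins for parameters a trained model would actually have learned.

```python
import numpy as np

classes = ["dog", "cat", "rabbit", "bear"]

# Hypothetical flattened 8x8 image (64 pixel features) and one weight vector
# (plus a bias) per class, as if they had already been learned.
rng = np.random.default_rng(1)
pixels = rng.random(64)
W = rng.normal(size=(4, 64))   # one row of weights per class
b = rng.normal(size=4)         # one bias per class

z = W @ pixels + b             # one z value per class

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

probs = softmax(z)
for name, p in zip(classes, probs):
    print(f"{name}: {p:.2f}")

print("predicted class:", classes[int(np.argmax(probs))])
```
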
When we are building a classification model to predict more than two classes, why do we use the softmax function?

It normalizes the output probabilities of each class so that they sum to 1, and we can then take the class with the highest probability value as our prediction

Correct. The softmax function normalizes the output probabilities

In the last lesson, we talked about the situation where we’re predicting a class that’s binary: it’s either a one or a zero. In that case, we could use the logistic regression model with the sigmoid function to give us the probability of the positive class, or the probability that y was equal to 1. But what if we have a problem where we have more than two classes? It’s not as simple as predicting y equals one; we actually need to predict which of many classes something might be. In the binary setting where we applied logistic regression, we took our input x’s, we multiplied them by a set of weights, and we calculated a value z, which looked very similar to linear regression. We then took that z and fed it into our sigmoid function, and out came the probability that y was equal to one, or the probability that the input belonged to the positive class.

In the multi-class situation, instead of using the sigmoid function, we use what’s called a softmax function in order to give us the probability of belonging to each class. In this case, we have to calculate separately, for each class, the probability that the input belongs to that class. To do that, we again take our input x’s, but this time, rather than multiplying them by a single set of weights, we multiply them by a set of weights for each class. We calculate a z for each class, feed each z into our softmax function, and calculate the probability that the input belongs to each class. You can think of the softmax function as a normalized sigmoid function. It acts very much the same in that it bounds our output between zero and one for each class. But now, since we have multiple classes, we’d like those outputs to sum to 1, so that the probabilities of each class, summed up over all the possible classes, are equal to 1. We can then identify the class which has the highest probability and use that as the predicted class for the input.

Let’s take an example of how this works. Suppose we’re creating a classification model to classify four types of animals based on pictures that we provide as input. The animals we’re classifying are dogs, cats, rabbits, and bears. As an input to our model, we would supply a picture. The features that we would use are actually the values of each of the pixels within our input image. If we provide an image as input that’s eight pixels by eight pixels, we have a total of 64 features. When we’re using softmax regression to predict multiple classes, we would provide each of those 64 input features, and for each of our four classes, dog, cat, rabbit, and bear, we would multiply the values of those input pixels times the weights for that class. We would calculate the z for that class, feed it into our softmax function, and generate the probability that the input picture belongs to that class. We would do this for all four of our classes, and our output might look something like this: dog, 0.8; cat, 0.05; rabbit, 0.05; bear, 0.1. To now generate a discrete prediction from our model, we would look for the class that corresponds to the highest probability. In this case, dog corresponds to an 80 percent probability, and so the output from our model would be dog.

Review


Video: Module Wrap-up

Linear Models: Foundation and Limitations

  • Types: This module covered linear regression (for predicting continuous values) and logistic regression (for classifying data).
  • Regularization: This technique was discussed as a way to prevent overfitting and improve how linear models work with unseen data.
  • Role: Linear models are important because:
    • They provide a mathematical basis for understanding neural networks.
    • They serve as excellent starting points for modeling tasks, setting a benchmark for performance.
  • Limitations: Their simple form can limit their effectiveness with complex, non-linear datasets, which are common in real-world problems.

What’s Next

The next module will introduce non-parametric algorithms, which are designed to handle the complex, non-linear data that linear models sometimes struggle with.

In this module, we discussed a set of algorithms called linear models. We first talked about linear regression, a simple but commonly used algorithm for regression tasks. We then discussed logistic regression, which is, despite its name, a classification algorithm. We finally talked about regularization and how we can apply regularization to penalize complexity and improve the ability of linear models to generalize and perform well on predicting new data the model has never seen before.

The reason we spent time on linear models is that the mathematical intuition behind linear models is the same foundation as neural networks, which we’ll cover in a later module. Linear models are also a great starting point for modeling efforts. I highly recommend starting with something like a linear or logistic regression as a first model when starting a modeling task, in order to get a benchmark value. You can then move on to other types of algorithms and compare them back to your performance with the simple benchmark model. One of the downsides we discussed of linear models is that their simple form can somewhat constrain their capability, in particular their ability to perform well with highly complex, non-linear sets of data, which is often the case when we’re dealing with real-world problems. In the next module, we’ll talk about some other types of non-parametric algorithms and how those can excel at exactly this type of complex, non-linear data.

Module 4 Quiz

Which of the following are true about linear regression (select all that apply)?

How do we determine the optimal coefficient values in linear regression?

If we have a situation where we are seeking to model a target variable that has a nonlinear relationship with one of our input features, which of the following is true?

Why do we often use regularization in modeling?

What is the key difference between LASSO and Ridge regression?

Which of the following are correct about linear regression and logistic regression (select all that apply)?

Why is the logistic (or sigmoid) function used in logistic regression?

Which of the following are correct regarding the use of gradient descent in logistic regression (select all that apply)?

We are building an app that uses pictures which our users upload from their phones to identify species of birds in the photos. We want to be able to identify 20 different types of birds with our model. Which of the below algorithms might be the best choice to use for our model?

Which of the following are true about softmax regression?