
In this module we will explore the use of linear models for regression and classification. We will begin by introducing linear regression and continue with a discussion of how to make linear regression work better through regularization. We will then switch to classification and introduce the logistic regression model for both binary and multi-class classification problems.

Learning Objectives

  • Explain how linear regression works
  • Describe the differences between linear and logistic regression
  • Discuss the benefits and types of regularization

Linear Regression & Regularization


Video: Introduction and Objectives

Linear Models vs. Non-Parametric Models

  • Linear Models:
    • Use a predefined form (or template) to model input-output relationships.
    • Train quickly and can work on smaller datasets.
    • Can be overly simplistic for complex problems, leading to underfitting.
  • Non-Parametric Models:
    • Do not assume a predefined form for input-output relationships.
    • Flexible and can handle complex data well.
    • Often achieve better prediction performance.
    • Require more training data and are susceptible to overfitting.

Supervised Learning Algorithms

  • Both linear (parametric) and non-parametric algorithms exist for supervised learning.
  • Complexity can range from simple to advanced.

Focus of the Discussion

  • Beginning with linear models, specifically:
    • Linear Regression
    • Ridge and Lasso Regression (which use regularization to prevent overfitting)
    • Logistic Regression (a classification model, despite the name)

Key Learning Goals

  • Understand how linear regression works.
  • Differentiate between linear regression (for numerical outputs) and logistic regression (for classification tasks).
  • Explain the concept of regularization and its benefits in preventing overfitting.

We’re going to begin our discussion of the algorithms behind machine learning by focusing on a class of models that we call linear models. Linear models are parametric algorithms, meaning that they take on a known form, which is used as a sort of template to model the input-to-output relationship. This template is fixed using a predetermined set of coefficients or parameters, and when we build and train a model, our job is to learn the values of those coefficients or parameters. Parametric models can train quickly, and they work well even on small data. The downside of using parametric algorithms is that they are constrained to the specific form or template that we’ve chosen to model the input-output relationship. Because of that, they can sometimes be too simple for the complex real-world problems that we’re trying to model, and so they’re prone to underfitting.

The other class of models are called non-parametric algorithms. These do not make any particular assumption about the form or template of the input-to-output relationship in advance of building and training the model. Instead, these models are highly flexible and can adapt well to complex, nonlinear data and relationships. As a result, they often deliver higher prediction performance. However, they require more data to train, and they are often prone to overfitting on the training data that we’ve provided.

If we look at the various algorithms available to us for supervised learning, we have both parametric and non-parametric algorithms that we can use. Some of them are very simple and some are more complex. We’re going to start the discussion of algorithms by focusing on the linear models: linear regression, ridge and lasso regression, and logistic regression, which, even though it’s called regression, is actually a classification model.

At the conclusion of this module, you should be able to explain and understand how linear regression works. You should be able to describe the differences between the linear regression model and the logistic regression classification model. And you should also be able to understand and explain the benefits of what we call regularization, which is a technique for reducing overfitting of linear models.

Reading: Download Module Slides

Video: Linear Regression

What is Linear Regression?

  • A statistical model assuming a linear relationship between input features and the target output.
  • Simple yet versatile, forming the basis for more complex models like neural networks.
  • Useful for benchmarks and understanding input-output relationships.

How it Works

  • Equation: y = W_0 + W_1X (+ W_2X_2 + … for multiple features)
    • y = predicted output
    • W_0 = bias term (y-intercept)
    • W_1, W_2, etc. = coefficients (weights) of features X_1, X_2, etc.
  • Training: Finding optimal coefficients to minimize prediction error.
  • Error Calculation:
    • Error = actual value – predicted value
    • Often measured as Sum of Squared Errors (SSE) for convenience

Key Concepts

  • Cost/Loss Function: SSE is the cost function linear regression aims to minimize.
  • Closed Form Solution: Linear regression often has a direct way to compute optimal coefficients (see the code sketch after this list).
  • Non-linear Relationships: Features can be transformed (e.g., squared, cubed, log) to model non-linear patterns (polynomial regression).
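
To make the closed-form idea concrete, here is a minimal sketch of solving for the coefficients with the normal equation in NumPy. The bedroom counts and prices are invented for illustration; in practice you would more likely use a library such as scikit-learn than solve the equation by hand.

```python
import numpy as np

# Toy data: one feature (e.g., number of bedrooms) and a target (sale price).
# Values are made up for illustration only.
X = np.array([[2.0], [3.0], [3.0], [4.0], [5.0]])
y = np.array([200.0, 250.0, 260.0, 310.0, 360.0])

# Add a column of ones so W_0 (the bias / y-intercept) is learned as well.
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Closed-form (normal equation) solution: W = (X^T X)^-1 X^T y
W = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

y_hat = X_b @ W                    # predictions
sse = np.sum((y - y_hat) ** 2)     # Sum of Squared Errors (the cost function)

print("bias W_0 and weight W_1:", W)
print("SSE on the training data:", sse)
```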

Example: Predicting Fuel Efficiency

  • Simple linear regression on horsepower works decently, but there’s room for improvement.
  • Applying a non-linear transformation (horsepower cubed) greatly improves model accuracy, demonstrating the power of polynomial regression.
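
As a rough sketch of how an experiment like this might look in code, the snippet below fits a plain linear regression and a cubed-feature variant with scikit-learn and compares test-set Mean Squared Error. The file name auto_mpg.csv and the column names horsepower and mpg are hypothetical stand-ins for whatever dataset the course actually uses.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical dataset with 'horsepower' and 'mpg' columns.
df = pd.read_csv("auto_mpg.csv")
X = df[["horsepower"]]
y = df["mpg"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Plain linear regression on horsepower.
lin = LinearRegression().fit(X_train, y_train)
print("linear test MSE:", mean_squared_error(y_test, lin.predict(X_test)))

# Polynomial regression: create a new feature, horsepower cubed,
# and fit a linear regression on the transformed feature.
X_train_cubed = X_train ** 3
X_test_cubed = X_test ** 3
poly = LinearRegression().fit(X_train_cubed, y_train)
print("cubic test MSE:", mean_squared_error(y_test, poly.predict(X_test_cubed)))
```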

Which of the following are good reasons to use linear regression (select all that apply)?
  • It is highly interpretable and so is a good choice of model in situations where interpretability is critical
  • It helps us understand the relationships between the input features and the output target
  • It is useful at least as a first model to apply to get a benchmark of what performance we can reasonably expect on our problem
We are building a regression model and find that the relationship of one of our key input features with the target variable appears to be non-linear. Which of the following strategies should we follow?

We should create a new feature from it by applying a non-linear transformation, and then use our newly created feature in the linear regression model

Correct. We can improve the fit of our linear regression model by applying a non-linear transformation to this feature

We’ll start the discussion of linear models with simple linear regression. Many of you are probably familiar with linear regression; it shows up in almost every imaginable field, everything from economics to many branches of the sciences. A linear regression model assumes a linear relationship between the input and output, the inputs being the data features that we’ve defined and the output being the target we’re trying to predict. This relationship is defined by a set of coefficients, which are multipliers of each of the input features.

If linear regression is so simple and common, why are we spending time talking about it? There are actually a couple of answers to that question. The first is that even though it is simple, linear regression forms the basis of many of the more complex machine-learning models that we use. In particular, neural networks, which we’ll talk about in a later lesson, really are founded on the basis of the simple linear regression. Linear regressions can also be surprisingly effective models in certain situations if they’re properly used. They also make a great first model to apply to get a benchmark or sense of the expected performance that you might hope to achieve on a particular machine learning task. By the way, I always recommend, when you’re working on a modeling task, to start with a simple model like a linear regression. Apply that as a first step and see what performance it gives you. Then, once you move on to more complex algorithms, you can compare them back to your original benchmark and see whether you’re really making an improvement or not. Finally, one of the really nice things about linear regression is that it’s highly interpretable, and it’s very easy for us to understand the relationships between the inputs and the outputs in the model that we’re building.

How does a simple linear regression model work? Let’s take the example we were working on before of predicting sale prices for homes. If we were building a simple linear regression involving a single variable, the number of bedrooms, we might provide that variable, the bedrooms, into a model, and as an output from our model, we would be predicting the home price. Our model might look something like this: y = W_0 + W_1X. W_0 is what we call the bias term, or you can think of it as the y-intercept. If all of the features, or in this case the single feature we have, were zero, what would the y value be? We call this the bias. W_1 would be the coefficient, or sometimes also called the weight, of the variable x, which represents the number of bedrooms in our house. This is the multiplier of that feature in calculating the total value of our target sale price.

Let’s now move from the simple linear regression model to the multiple linear regression model. In this case, we have more than one feature; in fact, we have as many features as we would like to put into our model. We might add additional features in this case, such as the square footage of our home, the school district, or the neighborhood our home is in. Again, we represent this with an equation which contains that bias term W_0, but now we have multiple coefficients, one for each of our input features. We may have a coefficient W_1, representing the weight of the number of bedrooms in calculating the final target sale price. W_2 might be the coefficient of the square footage of our home. W_3 might represent the coefficient of the school district that we’re in, and so on. We add all these up, the coefficients or weights times the values of the features, to calculate our y value, our target sale price.

When we train a linear regression model, really what we’re doing is learning the optimal values of these coefficients or weights that can effectively model the relationship between the input features and the output target. The first step in identifying the optimal values of those coefficients or weights is to calculate the total error of the model. We will then alter the coefficients in a way that hopefully reduces that total error, to the point where we’ve minimized our total error. How do we calculate the total error of our model? Well, the error for any given point, in this case any given house for sale, is the actual price that that home sold for minus the predicted sale price. In mathematical notation, we call it y minus our prediction, which we call y-hat. Alternatively, for computational convenience, we often define error in terms of what we call the Sum of Squared Errors, or SSE. SSE is calculated as the sum of the predictions minus the actuals, squared, or y-hat minus y squared, summed up over all the data points that we have. When we build our model, what we’re really trying to do is seek the coefficients that minimize that total value of the Sum of Squared Errors. SSE, in modeling terminology, is called our cost function, also called a loss function. In this case, our cost function, or SSE, is the sum of y-hat minus y squared for every data point. Again, when we’re training our linear regression model, we’re seeking to find the values for those coefficients or weights that minimize the total of our cost function. To do this, we use the training data, the inputs x and outputs y available to us, and we solve for the weights or coefficients that result in the minimum of the cost function. In the case of linear regression, we can usually do this using a closed-form solution. In other types of models, we apply the same strategy, but often there is no closed-form solution, so we use more complex methods for calculating the values that result in the minimum of the cost function.

Many people think that linear regression really only works when there’s a linear relationship between the inputs and the outputs. In reality, you can also model nonlinear relationships between inputs and outputs. In order to do that, we transform an input feature by some nonlinear transformation function and create a new feature, which we then use as an input to the model. For example, we may take an input feature x to a certain power, x squared or x cubed, or we may take the log of x, and we’ll create that as a new input feature. We feed that into our model, and now we’re able to better capture some of those nonlinearities in the relationship between the inputs and our outputs. When we do this, it’s called polynomial regression. There’s actually an unlimited number of transformations that we can apply.

Let’s look at an example of when this comes in handy. In this case, the objective of our modeling task is to predict the fuel efficiency of cars given the horsepower of the engine. You can see here on this slide that I’ve fitted a simple linear regression to horsepower. It looks like it does an okay job at capturing the variability in the pattern that we see in the output, miles per gallon, but there’s certainly room for improvement. On this screen I’ve displayed the Mean Squared Error for the training set and the test set. Now let’s take a look at what happens when we use a nonlinear transformation and apply polynomial regression to the same task. In this case, I’ve taken horsepower cubed and I’m using that as the input to my model; I’m now predicting miles per gallon based on a single input, horsepower cubed. As we can see, our model is doing a much better job at capturing that non-linear relationship between horsepower and miles per gallon, and as a result, the Mean Squared Error on both our training set and our test set has significantly improved.

Video: Regularization

Regularization for Better Regression Models

  • The problem: Linear regression models can overfit the training data, leading to poor predictions on new data.
  • What is regularization? It’s a technique that adds a penalty term to the cost function to discourage overly complex models (those with many features or high weights).
  • How it works: The penalty term makes the model seek a balance between fitting the training data well and keeping the model simple. This improves its ability to generalize to new data.

Types of Regularization

  • Lasso Regression (L1):
    • Penalty term: Sum of absolute values of coefficients.
    • Can force coefficients to zero, effectively removing irrelevant features.
    • Good for feature selection.
  • Ridge Regression (L2):
    • Penalty term: Sum of squared coefficients.
    • Reduces coefficients of irrelevant features towards zero, but not completely.
    • Useful for complex datasets and when many features are somewhat relevant.
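
A minimal sketch of the two penalties in scikit-learn is shown below, using a synthetic dataset in which only a few features truly matter; the alpha value (scikit-learn's name for the lambda penalty strength) is arbitrary, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 features, but only 10 actually influence the target.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

# alpha plays the role of lambda: it controls the strength of the penalty.
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso (L1) can force coefficients of irrelevant features all the way to zero,
# effectively removing those features from the model.
print("non-zero coefficients kept by Lasso:", (lasso.coef_ != 0).sum())

# Ridge (L2) shrinks coefficients toward zero but almost never to exactly zero.
print("non-zero coefficients kept by Ridge:", (ridge.coef_ != 0).sum())
```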

Choosing between Lasso and Ridge

  • Lasso: Choose if you want a simpler model with fewer features for better interpretability.
  • Ridge: Choose if you suspect many features have some influence, even if small, and there’s correlation between features.
  • Experimentation: Often the best way to decide is to try both and see which performs better on your specific dataset.

What is the primary reason we apply regularization when building models?

It reduces complexity in the model, decreasing the probability of overfitting and helping the model better generalize to predict on new data

Correct. Regularization simplifies the model and helps it better generalize to predict on new data

Why can LASSO regression be considered as a method for feature selection while Ridge regression cannot?

LASSO reduces the coefficients of irrelevant features to 0, while Ridge reduces them close to 0 but does not completely remove them

Correct. LASSO removes features which do not add value to the model by reducing their coefficients to 0

In the last lesson, we discussed the training method for linear regression: how we calculate the sum of squared error, and how we seek to find the values for the weights or coefficients that minimize the sum of squared error. One of the challenges of this training method is that it tends to reward overfitting on the training data. Because we’re calculating and trying to minimize SSE on the training data alone, we can sometimes end up with a model that is fit very tightly to the training data, and so when we try to generate predictions on new data with it, we find that it doesn’t generalize particularly well. How do we build a linear regression model in a way that is a little more balanced between fitting tightly on the training data and still being flexible enough to generate accurate predictions on new data, as we can test with the test set?

One method for doing this is to add a penalty factor into our cost function that penalizes complexity, in this case complexity being in the form of features: the number of features that we have included, and the values of the weights for all of those features. The cost function for normal linear regression that we described earlier looks something like this: the sum of squared error is equal to the actual values for y minus the predicted values for y, squared, and summed up over all the data points that we have. When we apply regularization, we add a penalty term that’s a function of the sum of the coefficients or weights in our linear regression equation. Now, when we have more coefficients or higher values of those coefficients, it tends to increase the cost function. Put another way, as we reduce the number of coefficients or reduce the weights of our coefficients, our cost function tends to decrease. Minimizing this new cost function, including this regularization penalty, helps us find the optimal balance between fit on our training data and simplicity of the model in terms of the number of features and the weights of those features. The cost function with regularization applied is now equal to the sum of squared error, y minus y-hat squared and summed over the data points, plus a value lambda times our penalty factor. The value lambda is a fixed value that we set, and it controls the strength of the penalty that we want to apply. As we increase lambda, we apply a higher penalty, and we can decrease lambda to apply a lower penalty.

For the penalty factor, there are two primary choices that we may select from. One is called lasso regression, the other is called ridge regression. In lasso regression, we calculate the penalty factor as the sum of the absolute values of the coefficients, multiplied by our lambda value. Lasso regression actually has the effect of forcing coefficients all the way to zero if the variables behind those coefficients are really not relevant in predicting the output. If we have a large number of features in the model that we’re trying to build, but several of those features really are not adding value to our ability to predict the output, then applying lasso regression with a sufficient penalty factor forces those coefficients to zero and thereby removes those features from the equation altogether. Lasso regression can therefore also be considered a form of feature selection, because it generally reduces the number of features that are present in our final model equation. On the other hand, ridge regression does not have the effect of forcing our coefficients all the way to zero. In ridge regression, our penalty term is the sum of the weights squared across all of the weights or coefficients in our equation. Ridge regression forces the coefficients of irrelevant factors towards zero, but generally not all the way to zero. Ridge regression can be an effective modeling strategy to reduce overfitting and improve the balance between simplicity and fit on the training data, but it’s not a feature selection method like lasso regression is, in the sense that it’s not actually eliminating features that are irrelevant; it’s just reducing the coefficients of those features to something that’s very close to zero.

Regularization can be a highly effective strategy when working with regression models, and it can often give us a better model than a standard linear regression model alone, particularly when we’re dealing with complex data with many features. As far as the choice between lasso and ridge regression, you might have a reason to prefer one or the other, or you might try both and see which one does a better job predicting the output. If you desire a simpler model with a smaller number of features, one that’s more interpretable, lasso can be an effective strategy, because it reduces the number of features by eliminating features from our model that really are not particularly valuable in predicting the output. On the other hand, if we know ahead of time that we have a very complex relationship between the output target and many of our input features, and we have what’s called collinearity, or correlation between some of our input features, ridge regression can be a better strategy. When in doubt, it’s advisable to try both approaches and see which one does a better job in modeling.

Logistic Regression


Video: Logistic Regression

From Linear Regression to Logistic Regression

  • The Problem: Linear regression isn’t ideal for classification tasks where outputs are discrete classes (like 0 or 1). It can produce predictions outside the desired range.
  • The Solution: Instead of predicting the class directly, logistic regression predicts the probability of an output being in a certain class.
  • The Sigmoid Function: The sigmoid function is used to bound the model’s outputs between 0 and 1, making them interpretable as probabilities.
  • Model Structure:
    • A linear regression model is used to calculate an intermediate value (‘z’).
    • This value is then fed into the sigmoid function to get the probability.
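
A minimal sketch of that two-step structure (linear combination, then sigmoid) follows; the weights and the input example are made-up values chosen only to illustrate the shape of the computation.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters and a single input example.
w0 = -1.5                      # bias term
w = np.array([0.8, 2.1])       # one weight per feature
x = np.array([1.0, 0.5])       # feature values for one example

z = w0 + w @ x                 # the linear-regression-style intermediate value 'z'
p = sigmoid(z)                 # probability that y = 1
print("P(y = 1) =", p)
```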

Finding the Best Model (Gradient Descent)

  • Cost Function: A cost function is defined specifically for logistic regression.
  • No Closed-Form Solution: Unlike linear regression, there’s no simple formula to calculate the optimal coefficients (weights) for the model.
  • Gradient Descent: An iterative algorithm used to find the coefficients that minimize the cost function. Here’s how it works:
    1. Start with random coefficients.
    2. Calculate the gradient (direction of steepest increase) of the cost function.
    3. Take a small step in the opposite direction of the gradient (using the learning rate to control step size).
    4. Repeat until a minimum is reached.
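
Putting these steps together, here is a rough from-scratch sketch of gradient descent for logistic regression in NumPy, using the standard log-loss gradient. The synthetic data, learning rate, and iteration count are arbitrary choices for illustration; in practice a library implementation such as scikit-learn's LogisticRegression would normally be used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny synthetic binary-classification dataset (one feature plus a bias column).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + rng.normal(scale=0.5, size=100) > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])   # column of ones learns the bias w_0

w = np.zeros(2)          # 1. start from initial weights (zeros here, often random)
learning_rate = 0.1
for _ in range(1000):
    p = sigmoid(X @ w)                  # probabilities under the current weights
    gradient = X.T @ (p - y) / len(y)   # 2. gradient of the log-loss cost function
    w = w - learning_rate * gradient    # 3. small step opposite the gradient
                                        # 4. repeat until (approximately) converged

print("learned bias and weight:", w)
```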

Why do we use the sigmoid function in logistic regression, and not in linear regression?

In logistic regression we are predicting the probability of the positive class (y=1) and so we need to output values between 0 and 1, which is what the sigmoid function accomplishes. In linear regression we are predicting a numerical output which is not capped in the range of 0 to 1.

Correct. The sigmoid converts our scores into probabilities ranging from 0 to 1

When we perform gradient descent to calculate the optimal coefficients/weights in a logistic regression model, during each iteration of gradient descent we update the values of the weights by a small amount to move closer to the weights which minimize our loss/cost function. What information do we need at each iteration in order to calculate the new weight values (select all that apply)?
  • The previous weights from the prior iteration
  • The value of the learning rate
  • The gradient of the cost/loss function with respect to the weights

Correct. We start with the previous weights and then we subtract the learning rate multiplied by the gradient of the cost function

In the past couple of lessons, we focused our discussion of linear models on regression tasks. Let’s now try to tackle a classification problem using what we’ve learned about linear models so far. Suppose we have a simple problem where we again have a single input variable called x and we’re trying to predict an output variable y. But now, because it’s a classification task, our output is a class. Let’s make it simple and use a binary task where our output is either a 0 or a 1. We could again apply a linear regression to create a model to do this. Our linear regression might look something like this: again, it takes the form y-hat is equal to our bias w_0 plus our coefficient w_1 times our single input x_1. However, we now have a couple of problems. One is that, as we can see on the diagram, the linear regression model is almost always predicting the wrong value. Our output values are either 0 or 1, and in almost every case, our prediction is neither a 0 nor a 1. How do we interpret these predictions that fall between 0 and 1? Additionally, what about values that our model is predicting that are greater than 1, or values it’s predicting that are less than 0?

One solution to some of these problems would be, rather than trying to predict the actual y value of 0 or 1, to predict the probability that y is equal to 1. In this case, those values that fell between 0 and 1 now make sense, because the probability that y is equal to 1 is somewhere between 0 and 1. One of the problems we still have, however, is what to do about those values that are higher than 1 or lower than 0. In order to solve this, let’s now apply a function that produces outputs which only fall within that range of 0 to 1. One option here would be to use what’s called the logistic or sigmoid function. The sigmoid function is a function whose outputs fall between 0 on the lower side and 1 on the upper side. By applying this function, we can generate predictions which make sense: predicting the probability that y is equal to 1, falling somewhere between 0, meaning y is equal to 0, and 1, meaning it’s 100 percent certain that y is equal to 1.

Our desired model output is the probability that y is equal to 1, and we’ve decided to use the sigmoid function in order to create boundaries on the outputs of our model so that they fall between 0 and 1. As an input to this sigmoid function, we provide the output of our linear regression model: our bias w_0 plus our coefficient w_1 times the input feature x_1. In general, we can create a model that takes our input features x times the coefficients and combines them in the form of a linear regression, taking each feature times its respective coefficient. We can call this value z, and we then provide this value z to the sigmoid function. Coming out of the sigmoid, we have a value that falls between 0 and 1, which we then interpret as the probability that y is equal to 1.

Our challenge now is to find the optimal values for the coefficients of our linear model. We can approach this in a similar way to what we did in linear regression. We first define our cost function. We then seek to find the optimal values of the weights or coefficients that minimize the cost function. Again, the way we can do this is very similar to what we did in linear regression. If we have a function that we want to minimize, like our cost function, we calculate the derivative of the function, which is also called the gradient of the function, and we set the derivative equal to 0. We can then solve for the values of the coefficients that make this equation true. In linear regression, there was a simple closed-form solution to this, so we could easily calculate the values of our coefficients. In logistic regression, where we’ve now introduced the sigmoid function, we no longer have a simple closed-form solution. We resort to an iterative solving method that we call gradient descent to solve for the values of the coefficients that minimize the cost function.

How does gradient descent work? Suppose we want to minimize the function y equals x squared. We start at some point on the curve y equals x squared, and we move iteratively towards the minimum, stopping once we’ve reached the minimum. How do we know where to move? Well, the first question is, which direction should we move? The answer is that we move in the direction opposite the value of the derivative or the gradient. If we think back to calculus, the gradient of a function points in the direction of steepest ascent of that function. If we’re trying to find the minimum of a function, we want to move opposite that direction of steepest ascent, in the direction of steepest descent. We move in the direction opposite the value of the gradient at the starting point that we’ve selected. The second question is, how far should we move in that direction? The answer is that we move by some small amount, which is equal to a parameter called the learning rate, multiplied by the value of the gradient at that point. The learning rate is an important parameter in gradient descent, and it’s also critically important in neural networks; we’ll talk about learning rates much more once we get to the neural networks lesson. But for now, the answer to our question of how to minimize that function is that we pick some random starting point on the curve, we calculate the gradient, and we then move in the direction opposite the gradient, by an amount equal to the learning rate times the value of the gradient at that point. We can then do this all over again and move one more step in the direction opposite the gradient, and we continue to move until we’ve reached a stable point where we no longer move any further. Once we reach that stable point, we know that we’ve reached the minimum value of the function.

How do we apply gradient descent in the context of estimating the optimal weights or coefficients for our logistic regression model? Again, we first define our cost function for logistic regression. We then seek to find the values of the weights or coefficients that minimize this cost function. To do this, we apply gradient descent. We first pick a random set of weights. We calculate the cost function using that random set of weights and the training data we have available. We then calculate the gradient of that cost function and use our gradient descent rule to iteratively update the weights. We calculate a new set of weights, which is equal to the previous weights minus the learning rate times the gradient. We then repeat this, each time moving one small step in the direction of our minimum cost, until we’ve reached a minimum or, after a certain number of steps, we terminate the procedure.

Video: Softmax Regression

From Binary to Multi-Class Classification

  • Logistic Regression: Used for binary problems (e.g., is it a 1 or a 0). The sigmoid function calculates the probability of the positive class.
  • Multi-Class Needs Softmax: When you have more than two possible classes, you need the softmax function.

How Softmax Works

  • Similar Structure: Like logistic regression, you take inputs and multiply them by weights to get a ‘z’ value.
  • Key Difference:
    • Instead of one set of weights, you have a set for each potential class.
    • Each ‘z’ value is fed into the softmax function, giving you a probability for each class.
  • Normalization: Softmax ensures the probabilities of all the classes add up to 1.
  • Picking the Prediction: The class with the highest probability is your model’s prediction.
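
As a minimal sketch of the normalization step, the softmax function can be written in a few lines of NumPy. The score values below are arbitrary; they only illustrate how raw per-class z values become probabilities that sum to 1.

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the resulting probabilities.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Hypothetical z values, one per class.
z = np.array([2.0, -1.0, 0.5])
probs = softmax(z)
print(probs, probs.sum())   # the probabilities sum to 1
```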

Example: Animal Classifier

  1. Input: An image of an animal.
  2. Features: The pixel values of the image.
  3. For Each Class (dog, cat, etc):
    • Multiply pixel values by class-specific weights.
    • Calculate ‘z’.
    • Softmax gives the probability of the image belonging to that class.
  4. Output: Example probabilities: Dog 0.8, Cat 0.05… The model predicts the class with the highest probability (here, dog).
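
Continuing the animal example, a rough sketch of the per-class computation might look like the following; the pixel values, weights, and biases are randomly generated stand-ins for parameters a trained model would actually have learned.

```python
import numpy as np

classes = ["dog", "cat", "rabbit", "bear"]

# Hypothetical flattened 8x8 image (64 pixel features) and one weight vector
# (plus a bias) per class, as if they had already been learned.
rng = np.random.default_rng(1)
pixels = rng.random(64)
W = rng.normal(size=(4, 64))   # one row of weights per class
b = rng.normal(size=4)         # one bias per class

z = W @ pixels + b             # one z value per class

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

probs = softmax(z)
for name, p in zip(classes, probs):
    print(f"{name}: {p:.2f}")

print("predicted class:", classes[int(np.argmax(probs))])
```
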
When we are building a classification model to predict more than two classes, why do we use the softmax function?

It normalizes the output probabilities of each class so that they sum to 1, and we can then take the class with the highest probability value as our prediction

Correct. The softmax function normalizes the output probabilities

In the last lesson, we talked about the situation where we’re predicting a class that’s binary: it’s either a one or a zero. In that case, we could use the logistic regression model with the sigmoid function to give us the probability of the positive class, or the probability that y was equal to 1. But what if we have a problem where we have more than two classes? It’s not as simple as predicting y equals one; we actually need to predict which of many classes something might be. In the binary setting where we applied logistic regression, we took our input x’s, we multiplied them by a set of weights, and we calculated a value z, which looked very similar to linear regression. We then took that z and fed it into our sigmoid function, and out came the probability that y was equal to one, or the probability that the input belonged to the positive class.

In the multi-class situation, instead of using the sigmoid function, we use what’s called a softmax function in order to give us the probability of belonging to each class. In this case, we have to calculate separately, for each class, the probability that the input belongs to that class. To do that, we again take our input x’s, but this time, rather than multiplying them by a single set of weights, we multiply them by a set of weights for each class. We calculate a z for each class, feed each z into our softmax function, and calculate the probability that the input belongs to each class. You can think of the softmax function as a normalized sigmoid function. It acts very much the same in that it bounds our output between zero and one for each class. But now, since we have multiple classes, we’d like those outputs to sum to 1, so that the probabilities of each class, summed up over all the possible classes, are equal to 1. We can then identify the class which has the highest probability and use that as the predicted class for the input.

Let’s take an example of how this works. Suppose we’re creating a classification model to classify four types of animals based on pictures that we provide as input. The animals we’re classifying are dogs, cats, rabbits, and bears. As an input to our model, we would supply a picture. The features that we would use are actually the values of each of the pixels within our input image. If we provide an image as input that’s eight pixels by eight pixels, we have a total of 64 features. When we’re using softmax regression to predict multiple classes, we would provide each of those 64 input features, and for each of our four classes, dog, cat, rabbit, and bear, we would multiply the values of those input pixels times the weights for that class. We would calculate the z for that class, feed it into our softmax function, and generate the probability that the input picture belongs to that class. We would do this for all four of our classes, and our output might look something like this: dog, 0.8; cat, 0.05; rabbit, 0.05; bear, 0.1. To now generate a discrete prediction from our model, we would look for the class that corresponds to the highest probability. In this case, dog corresponds to an 80 percent probability, and so the output from our model would be dog.

Review


Video: Module Wrap-up

Linear Models: Foundation and Limitations

  • Types: This module covered linear regression (for predicting continuous values) and logistic regression (for classifying data).
  • Regularization: This technique was discussed as a way to prevent overfitting and improve how linear models work with unseen data.
  • Role: Linear models are important because:
    • They provide a mathematical basis for understanding neural networks.
    • They serve as excellent starting points for modeling tasks, setting a benchmark for performance.
  • Limitations: Their simple form can limit their effectiveness with complex, non-linear datasets, which are common in real-world problems.

What’s Next

The next module will introduce non-parametric algorithms, which are designed to handle the complex, non-linear data that linear models sometimes struggle with.

In this module, we discussed a set of algorithms called linear models. We first talked about linear regression, a simple but commonly used algorithm for regression tasks. We then discussed logistic regression, which is, despite its name, a classification algorithm. We finally talked about regularization and how we can apply regularization to penalize complexity and improve the ability of linear models to generalize and perform well on predicting new data the model has never seen before.

The reason we spent time on linear models is that the mathematical intuition behind linear models is the same foundation as neural networks, which we’ll cover in a later module. Linear models are also a great starting point for modeling efforts. I highly recommend starting with something like a linear or logistic regression as a first model when starting a modeling task, in order to get a benchmark value. You can then move on to other types of algorithms and compare them back to your performance with the simple benchmark model. One of the downsides we discussed of linear models is that their simple form can somewhat constrain their capability, in particular their ability to perform well with highly complex, non-linear sets of data, which is often the case when we’re dealing with real-world problems. In the next module, we’ll talk about some other types of non-parametric algorithms and how those can excel at exactly this type of complex, non-linear data.

Module 4 Quiz

Which of the following are true about linear regression (select all that apply)?

How do we determine the optimal coefficient values in linear regression?

If we have a situation where we are seeking to model a target variable that has a nonlinear relationship with one of our input features, which of the following is true?

Why do we often use regularization in modeling?

What is the key difference between LASSO and Ridge regression?

Which of the following are correct about linear regression and logistic regression (select all that apply)?

Why is the logistic (or sigmoid) function used in logistic regression?

Which of the following are correct regarding the use of gradient descent in logistic regression (select all that apply)?

We are building an app that uses pictures which our users upload from their phones to identify species of birds in the photos. We want to be able to identify 20 different types of birds with our model. Which of the below algorithms might be the best choice to use for our model?

Which of the following are true about softmax regression?