
Deep Learning & Course Project

Our final module in this course will focus on a hot area of machine learning called deep learning, or the use of multi-layer neural networks. We will develop an understanding of the intuition and key mathematical principles behind how neural networks work. We will then discuss common applications of deep learning in computer vision and natural language processing. We will wrap up the course with our course project, where you will have an opportunity to apply the modeling process and best practices you have learned to create your own machine learning model.

Learning Objectives

  • Describe the intuition and mathematical principles behind deep learning
  • Identify common applications of deep learning for computer vision and NLP
  • Explain the strengths and challenges of deep learning relative to other forms of ML
  • Gain hands-on experience in the process of building and evaluating machine learning models

Neural Networks


Video: Introduction and Objectives

The final module of the course focuses on deep learning, a sub-field of machine learning that involves the use of neural network models. The module aims to provide an intuition for how neural network models work and an understanding of the mathematical principles behind them. It also discusses common applications of deep learning models in computer vision and natural language processing, two fields that are largely dominated by neural network models. Finally, the module concludes with a discussion of the strengths and weaknesses of deep learning models compared to other forms of machine learning.

Welcome to the final module of our course. In this module, we're going to shift our attention to a sub-field of machine learning called deep learning. Deep learning involves the use of neural network models. Although deep learning models are very complex, we're going to develop an intuition for what's actually going on behind the scenes when we're building a neural network model, and understand some basic mathematical principles behind how these complex neural network models actually work. Also in this module, we're going to discuss some of the common applications of deep learning models for computer vision and natural language processing. These two fields in the modern era of machine learning are largely dominated by the use of neural network models, and for good reason. We'll wrap up this module with a discussion of some of the strengths and weaknesses of deep learning models relative to the other forms of machine learning that we studied in the past couple of modules.

Video: Introduction to Deep Learning

Origins of Neural Networks

  • The term “neural network” comes from the complex network of neurons in the brain, which work together to perform complex calculations.
  • In 1943, Warren McCulloch and Walter Pitts introduced a computational model for how individual neurons in the brain work.
  • They proposed that complex networks of artificial neurons could achieve complex calculations and approximate complex functions.

History of Neural Networks

  • After the original proposal, researchers made progress in the 1940s and 1950s, including the first model of an artificial neuron and the first successfully functioning neural network.
  • However, research was held back due to the lack of data and computing power.
  • Advances in the 1980s, such as the technique of backpropagation, led to breakthroughs in training complex neural networks.
  • It wasn’t until the early 2000s that computational power and data availability caught up, leading to a boom in deep learning.

Key Enablers of Deep Learning

  • Exponential growth in data availability, which is necessary for training large complex neural networks.
  • Advances in computational power and algorithm design, allowing for deeper and more complex neural network architectures.
  • Researchers have also made great advances in terms of the algorithms themselves, overcoming limitations of neural network architectures.

Applications of Deep Learning

  • Image classification and recognition, such as Facebook’s automatic tagging of friends in photos.
  • Neural machine translation, allowing for easy translation between languages.
  • Healthcare applications, such as predicting the onset of sepsis in ICU patients.
  • Quality control in industries, such as a pizza chain using computer vision to inspect pizzas.

Common Themes in Deep Learning

  • Vast amounts of training data are necessary for successful training of deep learning models.
  • Deep learning excels in applications with a large number of features, such as unstructured data like text, imagery, or video.
  • Deep learning models can learn complex relationships between input features and targets.
  • Deep learning is typically applied where explainability is a low concern, since deep models are often considered "black boxes"

Challenges and Limitations

  • One of the challenges of using neural networks is that they are often difficult to interpret and explain, making them less suitable for applications where interpretability is key.

I am building a model for use in an app which allows users to take a photo of a tree with their phone, and the app then identifies the species of tree for them. The model will be trained on a dataset of tree images and corresponding labels. For which of the reasons below might this be a good application for a neural network model?

Since I am working with images I am using each pixel in the image as a feature, and therefore have a large number of features in my model

Correct. Neural networks excel in using large feature sets

There are complex relationships between my input feature set (pixels) and the target label

Correct. Neural networks can handle complex highly non-linear relationships between inputs and outputs well

I am not particularly concerned in this case about interpretability/explainability of my model

Correct. A challenge of neural networks is that they are difficult to interpret, but this should be less of a concern in this situation

To understand deep learning, let's start by understanding the origins of the term neural network. The term comes from the complex network of neurons in the brain, which work together to perform complex calculations. In 1943, the neuropsychologist Warren McCulloch and the mathematician Walter Pitts worked together to introduce a computational model for how these individual neurons in the brain actually work. Multiple signals arrive at the dendrites of a neuron. When the signals come in through the dendrites, they are added together in the cell body, and if the accumulated signal exceeds some threshold, the neuron fires, meaning it's activated and passes on an output signal. It's important to understand that an individual neuron in the brain really can't do a lot, but when connected to the thousands of other neurons in the brain, these neural networks can achieve very complex calculations. Likewise, McCulloch and Pitts proposed in the early 1940s that complex networks of artificial neurons could achieve very complex calculations and approximate very complex functions. A neural network that contains many layers is referred to as a deep neural network, which is the origin of the term deep learning.

So let's look at the history of neural networks. The original proposal by McCulloch and Pitts in the 1940s launched an era of early research into the field of neural networks. Researchers and scientists made great progress throughout the 1940s and 1950s, when the first model of an artificial neuron was proposed by Rosenblatt, and shortly after, the Stanford researchers Widrow and Hoff proposed the first successfully functioning neural network. One of the challenges that held back research during this time period was that researchers didn't have a good way to successfully train large, complex networks of neurons. Advances in the 1980s led to breakthroughs such as the technique of backpropagation, which is a method for training complex neural networks that contain multiple layers. In the 1980s, however, progress was still held back by the lack of both data availability and computing power. It really wasn't until the early 2000s that computational power caught up and we had sufficiently vast amounts of data to train deeper and deeper neural networks. As neural network models became deeper, with more neurons organized into more layers, the power of these networks in terms of achieving very complex calculations continued to grow. As a result, in recent years we've seen a boom in the field of deep learning, where large, powerful neural networks are now being used to achieve a wide variety of very complex tasks.

There are a number of key enablers of this recent boom in the use of deep learning. The first, and probably the most important, is that the amount of data now available for training large, complex neural networks has grown exponentially. One of the important things to remember about neural networks is that they require very large amounts of data to train successfully, and it's really only been in the last decade or two that we have had enough data made available to us through a variety of sources such as computers, connected devices, and pervasive sensors. Scientists, researchers, and engineers have also put in the effort to organize and label all of this data in a way that it can be consumed for training neural networks. Computational power has also caught up, along with the state of the art in algorithm design, and that has allowed for much deeper and much more complex neural network architectures than we could previously achieve. Researchers have also made great advances in the algorithms themselves: there are some inherent limitations of neural network architectures, and recent advances have largely overcome a number of these limitations. As a result, today neural networks can be found all around us in a wide variety of different applications.

Let's look at a couple of representative applications of neural networks. The first would be image classification and image recognition. For example, when you take a picture of someone on your phone and upload it to Facebook, and Facebook automatically tags that picture with a friend's name, it's using a neural network model that has been built and trained to recognize pictures of your friends to enable this automation. Another application of deep learning is found in what's called neural machine translation. Automatic translation websites and applications are able to use complex, specialized types of neural network models to translate very easily back and forth between a very large number of languages. The applications of deep learning in the healthcare space are in their very early days, but there's tremendous application potential for neural networks within healthcare, driven by the large amount of healthcare data that's collected by our healthcare system about patients. One of the early breakthroughs in the use of deep learning models in healthcare was the use of neural networks as a predictive model to predict the onset of sepsis in ICU patients. These automated models predict sepsis onset based on physiological signals coming from sensors, enabling ICU doctors and nurses to better manage and take proactive care of patients at high risk of sepsis onset. Let's look at one final example of an innovative application of deep learning. One of the major pizza chains here in the US is using a computer vision deep learning model to perform quality control on the pizzas coming out of the ovens in its restaurants. Rather than human employees having to perform quality control on the pizzas, they use a camera connected to the deep learning model to do the quality control, looking for things like the ratio of cheese to sauce on the pizza, whether the pizza has the proper ingredients that the customer actually ordered, and whether the number of pepperoni on the pizza is up to the standard number they're supposed to put on the pizza. In this way they're able to automate that quality control process and use their human employees for more sophisticated tasks within the restaurant.

If we think about the examples I just presented, we can see a couple of common themes in terms of where deep learning really excels. One is that we need vast amounts of training data to successfully train deep learning models to perform challenging tasks. The second theme we observe is that deep learning really excels in applications where we have a very large number of features, for example in unstructured data where we're dealing with text, imagery, or video. In the case of image classification, we may have pictures where each individual pixel within the picture represents a separate feature. So if we have, for example, an image that is 512 pixels by 512 pixels, we're dealing with hundreds of thousands of potential features, and this is where deep learning applications really can shine. Third, deep learning applications do very well when we have complex relationships between the input features and the target. Where we have many input features and complex, non-linear relationships between the features and the targets, deep learning networks are able, given enough data, to learn those complex relationships. Finally, it's important to note that applications of deep learning generally have a low concern for explainability. One of the challenges that we'll discuss later about the use of neural networks is that they are often considered black boxes: because they're so complex, with so many equations, it's really difficult to understand how the neural network is reaching an output prediction. As a result, we generally focus their use on applications where we don't necessarily need to present users with a sophisticated explanation of how the machine got to its prediction. For things like tagging images with your friends' names or counting the number of pepperoni on a pizza, this is generally not a concern, because interpretability and explainability are not key drivers in these types of applications. If instead we're building models, for example, to determine whether applicants to a graduate school are accepted, or whether somebody is approved for a loan to which they have applied, these are applications with high stakes for users and, as a result, a high need for interpretability and explainability. For these types of applications, we need to be really careful about the use of neural networks.

Video: Artificial Neurons

Perceptron: A simple artificial neuron model that takes inputs, multiplies them by weights, sums them, and passes the result through a threshold function to produce an output (0 or 1).

Logistic Regression: A type of artificial neuron that adds an activation function (sigmoid) to the perceptron model. The output is a probability that the input belongs to a certain class.

Training a Neuron: The goal is to find the weights that minimize the cost function. Since the cost function is non-linear, iterative methods like gradient descent are used.

Stochastic Gradient Descent: An iterative method that uses one data point at a time to update the weights. It’s suitable for large datasets and online learning.

Batch Gradient Descent: Uses the entire dataset to update the weights at each iteration. It’s more efficient than stochastic gradient descent but can be computationally expensive for large datasets.

Mini-Batch Gradient Descent: A compromise between stochastic and batch gradient descent. It divides the dataset into smaller batches and updates the weights using each batch. It’s commonly used in training neural networks.

Key Parameters: The learning rate is a crucial parameter that controls how big of a step is taken in each gradient descent iteration. If it’s too small, convergence is slow; if it’s too large, the algorithm may diverge.
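
To make these ideas concrete, here is a minimal sketch (not from the course materials) of a single logistic-regression neuron trained with stochastic gradient descent in NumPy. The toy data, learning rate, and number of passes over the data are illustrative assumptions.

```python
import numpy as np

# Toy training data: 4 points with 2 features each, binary labels (illustrative only)
X = np.array([[0.5, 1.2], [1.5, 0.3], [3.0, 2.5], [2.2, 3.1]])
y = np.array([0, 0, 1, 1])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=X.shape[1])  # random initial weights
b = 0.0                                     # bias term
learning_rate = 0.1                         # too small -> slow; too large -> may diverge

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):                 # repeated passes over the data
    for x_i, y_i in zip(X, y):           # stochastic: one data point at a time
        z = np.dot(w, x_i) + b           # linear combination of inputs and weights
        y_hat = sigmoid(z)               # activation -> probability that y = 1
        # Gradient of the log-loss with respect to the weights and bias
        grad_w = (y_hat - y_i) * x_i
        grad_b = (y_hat - y_i)
        # Gradient descent update: new weight = old weight - learning rate * gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print("learned weights:", w, "bias:", b)
print("predictions:", (sigmoid(X @ w + b) >= 0.5).astype(int))  # threshold at 0.5
```

A perceptron would differ only in skipping the sigmoid activation and thresholding z directly at 0 to produce a discrete class prediction.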

Which of the below are differences between a perceptron and a logistic regression model (select all that apply)?
  • Logistic regression uses a sigmoid function to calculate the output, while the perceptron does not
  • The output of a logistic regression model is a probability ranging from 0 to 1 (which can then be converted to a discrete 0/1 prediction). The output from the perceptron is a discrete 0/1 prediction, and we do not get the probabilistic values for each class

Suppose we are building a model on a very large dataset which will not fit in memory. We would also like to rapidly update the model training as new data is received from our users over time. Which type of gradient descent should we use in our model training?

Stochastic gradient descent

Correct. Stochastic gradient descent is commonly used when we have very large datasets and cannot train in batch mode, and also when we wish to update the model rapidly as new data comes in

In order to understand the intuition behind how neural networks work and how we train them, let's start by understanding how an individual artificial neuron works. There are different types of artificial neurons, but let's start with the first and most basic, which is called the perceptron. The perceptron is a simple model where we take a set of inputs x, multiply them by a set of weights or coefficients w, sum the results together, and pass them through what's called a threshold function, where we compare the output of that sum z to 0. If the output is higher than 0, we output a 1. If the output is lower than 0, we output a -1. The perceptron is therefore a model that would be used for a binary classification type of task. As you look at this model, you might recognize much of it from our discussion on linear models. In fact, the perceptron really is a very simple linear model: it starts with a linear combination of our input features x times our coefficients or weights, sums those together, and compares the result against the threshold function to generate an output prediction.

Another type of artificial neuron is logistic regression, which we covered in an earlier module. Logistic regression is, in fact, very similar to the perceptron, but we now add one more component to our model, which is the activation function. In the case of logistic regression, we use a sigmoid function as our activation function. So in logistic regression, we start with our input x, multiply each of our features in x by a weight or coefficient, and sum them together into a z score. We then pass our z through the activation function, the sigmoid function for logistic regression, and as an output we get the probability that y is equal to 1, or the probability that the data point belongs to the positive class. We then pass this probability through our threshold function: if the probability is higher than 0.5, we say that our prediction y hat is 1; if the probability is lower than 0.5, our prediction y hat is 0. We can also use that intermediate value, the probability that y equals 1 that came out of our activation function, to calculate our cost or loss. Our objective in logistic regression, as well as in the perceptron and all of our other models, is to find the values of the weights that minimize this cost function.

Let's now walk through the process of training an artificial neuron. Again, our goal in training a neuron is to find the values of the weights that minimize our cost function. As we remember, to minimize a function, we can take the derivative of that function and set it equal to 0. When we covered linear regression models, we could simply take the derivative of our cost function, set it equal to 0, and calculate the weight values that made that equation 0. When we introduce non-linear activation functions, such as the sigmoid function used in logistic regression, there is no longer an easy way to find a closed-form solution for the weight values that make the derivative equal to 0. Therefore we use an iterative solving method such as gradient descent. We start with some random initial values of our weights, we calculate the cost, and then we slowly move in a direction opposite the gradient, or derivative, of the cost function, towards the point where we reach a minimum cost level, and we solve for the weights that achieve that minimum value of our cost function.

Let's take a look at how we do this using a process called stochastic gradient descent. In stochastic gradient descent, we use one data point at a time. We perform gradient descent, we update our weights, then we take another data point, do the same thing, and continue on through our entire dataset until we've used every point. Step one in training a neuron using stochastic gradient descent is called forward propagation. In this step, we take our first data point and forward propagate it through the model, meaning we multiply our input features by the coefficients or weights, calculate our z, pass our z through our activation function (the sigmoid function for logistic regression), and calculate our y hat, or prediction, as an output. Once we've calculated our y hat prediction, we're then able to calculate the cost by comparing that prediction to the actual y value, and also to calculate the gradient of that cost function. We calculate the gradient of the cost function with respect to each of the weights or coefficients in our model. Once we've calculated the gradient of the cost function with respect to each weight, we're able to update the values of each of those weights using our gradient descent process. Our new value for a weight is equal to the previous value of that weight minus our learning rate times the derivative of our cost function with respect to that weight. We go through each of our weights and update them using this update rule. We then repeat the process by taking the next data point in our dataset, passing it through our model, calculating our y hat, calculating our cost and the derivative of the cost, and then updating the weights one more time. We continue this process until we've looped through our entire dataset. Eventually our gradient descent process should converge to values of the weights that result in the minimum cost, and these are the weights that we then use in our final model.

One of the key parameters that we need to set to enable this process is the learning rate that we saw in the previous update equation. The learning rate controls how big of a step we take each time we perform a gradient descent step. As we'll cover later in our section on neural networks, the learning rate can have a big impact on your ability to train a neural network. If you set the learning rate too small, each time you perform a gradient descent step you're taking a very, very small step, and as a result your algorithm may take a very long time to converge. On the other hand, if you set too large of a learning rate, you end up taking large steps each time, and you may bounce around on your cost function and never find that minimum value. So setting the learning rate to a point where it's neither too small, and takes too long, nor too large, and runs the risk of diverging, is one of the key things you need to focus on as you're training neural network models.

In the previous example, we trained an artificial neuron using what we called stochastic gradient descent: taking one observation or data point at a time to iteratively calculate the gradient and update the weights, then moving on to the next, and looping through our data one point at a time. This approach works very well for large datasets, and it's also the primary approach used in what we call online learning, which is the case where we have a production model receiving data points from users one at a time, and each time we receive a data point, we retrain and update our model. One of the downsides of stochastic gradient descent is that we have to loop over a single observation at a time, and therefore we cannot take advantage of more efficient vectorized or matrix operations. The alternative approach is what we call batch gradient descent. In batch gradient descent, we use the entire dataset for each update, so we calculate the gradient and update the weights based on all of the observations in our training dataset at each iteration. The primary advantage of batch gradient descent is that we can now take advantage of vectorized or matrix operations and perform these computations much more efficiently. One of the challenges of batch gradient descent is that with a very large dataset it can sometimes be impossible, due to the compute power required to process the full dataset at each iteration. So as a compromise, it's very common in training neural networks to use an approach called mini-batch gradient descent. In mini-batch gradient descent, we divide our training data up into smaller subsets, or smaller batches, for example a batch of eight observations at a time or 32 observations at a time. We then perform batch gradient descent using all of the observations within this mini-batch each time. Therefore, we're able to take advantage of the vectorized operations that we can accomplish using batch gradient descent, but we're not using nearly as much computational horsepower as if we were to try to use our entire dataset within each batch. Mini-batch gradient descent is very common in training neural networks because it works very well for large datasets, while also still allowing us to achieve efficient computational operations. One of the challenges that you'll find with mini-batch gradient descent is that it's not as good as stochastic gradient descent for online learning, where we often have a single observation coming in at a time from a user and desire to retrain as each single observation comes in.
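
As a rough illustration of the trade-off described above, the sketch below (my own illustrative code, not from the course) trains the same kind of logistic model with mini-batches so the gradient computation over each batch is vectorized; the batch size, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gradient_descent(X, y, batch_size=32, learning_rate=0.1, epochs=50, seed=0):
    """Train logistic-regression weights with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                   # shuffle the rows once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # one mini-batch of observations
            Xb, yb = X[idx], y[idx]
            y_hat = sigmoid(Xb @ w + b)              # vectorized forward pass over the batch
            grad_w = Xb.T @ (y_hat - yb) / len(idx)  # average gradient over the batch
            grad_b = np.mean(y_hat - yb)
            w -= learning_rate * grad_w              # gradient descent update
            b -= learning_rate * grad_b
    return w, b

# batch_size=1 would behave like stochastic gradient descent;
# batch_size=len(X) would be full batch gradient descent.
```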

Video: From Neurons to Neural Networks

Limitations of Artificial Neurons

  • Artificial neurons can only handle problems with linear decision boundaries
  • Adding more neurons to form a network allows for more complex calculations with non-linear decision boundaries

Stacking Perceptrons

  • Perceptrons can be stacked side-by-side or outputs can be fed into another perceptron to generate a final output
  • Stacking perceptrons enables more complex calculations than using a single artificial neuron

Example: Binary Classification

  • Two perceptrons can be trained to create linear decision boundaries, and their outputs can be fed into a third perceptron to create a non-linear decision boundary
  • This simple model of three perceptrons, organized in two layers, can approximate the non-linear target function (a minimal sketch of this idea follows below)
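
The course's slide example is not reproduced here, but the same idea can be shown with the classic XOR problem, which a single perceptron cannot solve. In the hedged sketch below, two perceptrons each apply a linear boundary and a third perceptron combines their outputs into a non-linear one; the weights are hand-picked for illustration rather than learned.

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: weighted sum of inputs followed by a hard threshold."""
    return 1 if np.dot(w, x) + b > 0 else 0

def stacked_network(x):
    # Layer 1: two perceptrons, each with its own linear decision boundary
    h1 = perceptron(x, w=np.array([1, 1]), b=-0.5)   # fires when x1 + x2 > 0.5
    h2 = perceptron(x, w=np.array([-1, -1]), b=1.5)  # fires when x1 + x2 < 1.5
    # Layer 2: a third perceptron combines the two outputs (a logical AND here),
    # giving a non-linear decision boundary over the original inputs
    return perceptron(np.array([h1, h2]), w=np.array([1, 1]), b=-1.5)

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, "->", stacked_network(np.array(point)))  # reproduces XOR: 0, 1, 1, 0
```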

Multi-Class Classification

  • For more than two possible output classes, multiple units can be used in the output layer
  • Each output unit represents a score for each class, and the class with the highest score is assigned to the input data point

Neural Network Architecture

  • A neural network consists of an input layer, one or more hidden layers, and an output layer
  • Each node in the network computes a linear combination of its inputs and weights, then passes the result through an activation function (e.g. sigmoid, hyperbolic tangent, ReLU)
  • The use of non-linear activation functions enables better modeling of non-linear relationships

Typical Neural Network Architecture

  • Input layer: features are multiplied by weights and fed into nodes in the hidden layer
  • Hidden layer: linear combination of inputs and weights, passed through an activation function
  • Output layer: outputs from the hidden layer are multiplied by weights, passed through an activation function, and generate a prediction (y hat)

We are building a neural network model that contains 10 layers in total. How many hidden layers does the model contain?

Eight

Correct. We have one input layer, one output layer, and eight hidden layers
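
To make the layer-by-layer flow concrete, here is a minimal NumPy sketch (my own, not from the course) of a forward pass through one hidden layer and a multi-class output layer; the layer sizes and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden, n_classes = 4, 8, 3      # illustrative sizes
W1 = rng.normal(size=(n_features, n_hidden))   # weights: input layer -> hidden layer
W2 = rng.normal(size=(n_hidden, n_classes))    # weights: hidden layer -> output layer

def relu(z):
    return np.maximum(0.0, z)  # a commonly used non-linear activation

def forward(x):
    z1 = x @ W1                     # linear combination of input features and weights
    h = relu(z1)                    # hidden layer activations
    scores = h @ W2                 # output layer: one score per class
    return int(np.argmax(scores))   # predicted class = highest score

x = rng.normal(size=n_features)    # a single (random) input data point
print("predicted class:", forward(x))
```

Counting layers as in the question above, this network has one input layer, one hidden layer, and one output layer.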

In the last lesson, we discussed artificial neurons. We talked about how they're structured and how we can train them using gradient descent to iteratively change the weights until we find the optimal weights that minimize the cost. Artificial neurons are powerful, but their power is limited because they can only handle problems with linear decision boundaries. Researchers knew back in the 1950s that adding more neurons to form a network would allow us to perform more complex calculations with non-linear decision boundaries. However, it wasn't until the 1980s, when the backpropagation method was popularized, that we really had a good way to train these neural network models with multiple layers.

Let's look at what happens when we stack multiple perceptrons together. We could stack them in a couple of ways: we could take a couple of perceptrons and put them side by side, but we could also take the outputs of those two perceptrons and feed them into yet another perceptron, which goes through its own calculations and generates a final output. It turns out that when we stack perceptrons together like this, we can perform much more complex calculations than we could using only a single artificial neuron. Let's look at an example to illustrate this. Say we have a problem where we're trying to generate a model to predict some output, which is a binary classification output, either a plus 1 or a minus 1, and as input we have two features, X_1 and X_2. The decision boundary between the plus 1 class and the minus 1 class looks like the one shown on the slide. To approach this problem, we could start by taking two individual perceptrons and training each one so that it is capable of creating a linear decision boundary between the minus 1 and the plus 1 class. We can then take the output of each of those individual perceptrons and feed it into a third perceptron. Our third perceptron is now capable of combining the outputs of the first and second perceptrons and creating a non-linear decision boundary. In this way, our simple model consisting of three perceptrons organized in two layers can now approximate the function that we're trying to approximate.

The exercise we just looked at was a simple example using a binary classification task. But what happens if we have more than two possible output classes? Let's say we're classifying animals or flowers with many different classes. Rather than using only a single output, a single unit in the output layer, we can use multiple units in the output layer. We again combine our perceptrons into layers, where we have an input layer consisting of our input features. Our input features feed into a set of perceptrons in what we call a hidden layer, and we take the outputs of those perceptrons in our hidden layer and feed them into another layer of multiple perceptrons in our output layer. We then have an output from each of the perceptrons in our output layer, and the output from each of these represents a score for one of the classes in our problem. We then look at which class has the highest associated score, and we assign that class label to the input data point.

Additionally, when we combine perceptrons or artificial neurons together, rather than using a perceptron as the node in our network, which has a very simple threshold function, we can choose to use a unit that includes an activation function, such as the sigmoid function we saw in logistic regression. We can also use other functions such as the hyperbolic tangent or the ReLU function, which is now very commonly used as an activation function. Each node in our network will now consist of a linear combination of our inputs times our weights to calculate the z, followed by passing our z through a non-linear activation function and providing the output to the next layer of the network. The use of these non-linear activation functions in each layer, rather than a simple threshold like the perceptron, enables us to better model non-linear relationships.

Let's take a look at a typical neural network architecture. The network begins with an input layer, which consists of each of the features in our input data. The features are then multiplied by a weight and fed into each of the nodes within our first layer, which we call our hidden layer. Again, we take a linear combination of each of the input features times each of the weights, calculate our z, and pass our z through phi of z, which is our activation function. Again, we get to choose our activation function; this could be a sigmoid or it could be a ReLU function. We take the output of that activation function and feed it into the next layer. In this simple example, we have a three-layer neural network. We have a set of inputs that we're providing, multiplying them by the weights, combining them in our hidden layer, passing them through our activation function, and feeding that into our output layer. Those values are then combined in our output layer, multiplying the outputs from the previous hidden layer by the weights, and then again passing the result through an activation function in the output layer to calculate our y hat, or prediction, from our simple neural network.

Video: Training Neural Networks

Training a Neural Network

  • The process of training a neural network is an extension of training a single artificial neuron
  • We work backwards from the output layer to the input layer, calculating the cost and distributing it among the layers
  • We calculate the gradient of the cost function with respect to each weight, using the chain rule
  • We update the weights using gradient descent
  • This process is called backpropagation and is the primary method for training neural networks

Forward Propagation

  • We feed a single data point through the network, calculating the output at each layer
  • We calculate the cost and derivative of the cost using the prediction and actual value
  • We apply the chain rule to calculate the gradient of the cost function with respect to each weight

Updating Weights

  • We update the weights using the gradient and learning rate
  • We repeat the process for each layer and each weight
  • We continue until the weight values converge and we have a final neural network model (a minimal backpropagation sketch follows below)
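
The chain-rule bookkeeping is easiest to see in code. Below is a minimal sketch of my own, assuming a squared-error cost, sigmoid activations, and a single training point, showing forward propagation, backpropagation of the gradient through two layers, and the weight updates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(2, 3))   # input (2 features) -> hidden (3 units)
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden -> output (1 unit)
learning_rate = 0.5

x = np.array([0.4, 0.9])   # one training point (illustrative)
y = np.array([1.0])        # its label

for step in range(200):
    # Forward propagation
    z1 = x @ W1
    h = sigmoid(z1)            # hidden layer activations
    z2 = h @ W2
    y_hat = sigmoid(z2)        # prediction

    # Cost: squared error for this single point
    cost = 0.5 * np.sum((y_hat - y) ** 2)

    # Backpropagation: apply the chain rule, working backwards from the output layer
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # dCost/dz2
    grad_W2 = np.outer(h, delta2)                # dCost/dW2
    delta1 = (delta2 @ W2.T) * h * (1 - h)       # dCost/dz1, pushed back through W2
    grad_W1 = np.outer(x, delta1)                # dCost/dW1

    # Gradient descent update for every weight in every layer
    W2 -= learning_rate * grad_W2
    W1 -= learning_rate * grad_W1

print("final prediction:", sigmoid(sigmoid(x @ W1) @ W2), "final cost:", cost)
```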

Challenges in Neural Networks

  • There are many decisions to make in designing a neural network architecture
  • We need to decide on the number of layers, units per layer, activation functions, regularization, and learning rate
  • We often use one of two approaches:
  1. The “stretch pants” approach, where we use a large network and apply techniques to reduce overfitting
  2. Transfer learning, where we use a pre-trained model and fine-tune it for our specific task

Transfer Learning

  • We use a pre-trained model that someone else has trained on a large dataset
  • We take a generic model and use most of it, but add new final layers that we train on our specific task
  • We fine-tune the model on our specific dataset, benefiting from the pre-training done by others.

Why do we apply backpropagation in training a neural network?

Backpropagation is how we calculate the gradient of the cost function with respect to the weights in each layer of the neural network, which we need in order to update the weights at each iteration of the model training

Correct. We use backpropagation to get the gradient of the cost with respect to each weight in the network. We update the weights at each training iteration by subtracting the learning rate multiplied by the gradient from the previous weight values

Which of the following statements about transfer learning are correct (select all that apply)?
  • Transfer learning can significantly reduce our training time by using a model that has already been pre-trained on a generic task
  • Transfer learning is a very common approach in building computer vision and natural language processing models

In a previous lesson, we walked through the process of training an individual artificial neuron. Let's now extrapolate that intuition to training an entire neural network. In a neural network we now have multiple layers of weights that we need to update. In order to do this we work backwards: we start from the end, or the right side, and work backwards towards the left side, or the beginning layer, of our neural network. We calculate our total cost and distribute that cost among the different layers in our network according to the contribution of each layer towards the total cost. We can then calculate the gradient of the cost for each layer with respect to the weights feeding into that layer. Once we calculate that gradient, we're able to perform gradient descent and update the weights feeding into each of the layers. This process was popularized in the 1980s and is called backpropagation, and still today it is the primary method that we use to train neural networks.

So let's now walk through the process, step by step, for training a neural network. In a similar way to what we saw when we trained an artificial neuron, the first step is the forward propagation step. If we're using stochastic gradient descent, where we're training with a single data point at a time, we'll take our first data point and feed it through the neural network. We'll start with the input layer, multiply our features by the weights, and pass the result through the activation function in the hidden layer. We'll multiply the output of that hidden layer activation function by another set of weights to reach our output layer, pass it through the activation function of our output layer, and calculate our prediction, or y hat value. Once we've calculated y hat, we're able to calculate our cost and the gradient, or derivative, of our cost using our prediction y hat and our actual value y. Again, we want to calculate the gradient of our cost function with respect to each weight within our network. Because our gradient is the product of weights times other weights and activation functions, we have to apply the chain rule in this case, so it gets a little more complicated than the situation with a single artificial neuron. But by applying the chain rule we're still able to calculate the gradient of our cost function with respect to each weight in the neural network. Once we've calculated the gradient, we're able to update our weight values in the same way as we previously did, where our new weight values are equal to our previous weight values minus our learning rate times the derivative of our cost function with respect to each weight. We do this for each layer of our network and for every weight value within each layer of our network. We then update our weight values and repeat the process, taking the next data point, passing it through our network, calculating the gradient, updating our weights, and continuing until our weight values converge and we have a final neural network model to use.

One of the challenges of working with neural networks is that there are many decisions to make in the design of a neural network architecture. We need to decide how many layers we want in the neural network and how many units we would like within each layer, and we have a choice of activation function for each of those units. We have other choices to make, such as whether we want to add regularization, and whether we want to use batch gradient descent, mini-batch gradient descent, or stochastic gradient descent. We also have to choose a value for the learning rate, which can influence how well and how quickly our neural network model trains. Because there are so many decisions to make, in practice we often use one of two approaches. The first approach is what I call the stretch pants approach, meaning that we use a network that is too large for the problem we're trying to solve, and then apply some techniques to reduce the risk of overfitting, stretching it down to fit just right to the data that we have and the problem that we're trying to solve. We've covered some of those techniques, such as regularization, in earlier lessons. The second approach is to use a neural network model that somebody else has already done the important work of setting up and training on a very, very large dataset. This approach is called transfer learning, so let's look a little deeper at how it works.

In transfer learning, we start with a pre-built, pre-trained neural network model that somebody else has trained on a task that is relevant to ours. If we're working on a problem where we need to classify images, let's say for example we're building a model to classify images of flowers, we might want to take a model that's been pre-trained to classify images of different types. Generally, these models are trained on very, very large datasets of many different kinds of images, and we're able to take advantage of all that heavy lifting and all that pre-training that's been done, and then do some final tuning on the model for our specific task. Typically we'll take a generic model that somebody else has pre-trained on a very large image dataset, use a major portion of that model, but cut off the last couple of layers, and then build a new model using that pre-trained portion. By adding a couple of new final layers onto our model, which we then train, the model is fine-tuned for our specific task. So in the example of wanting to build an application that classifies flowers, we can take advantage of a model that somebody has pre-trained on a large set of images of many different types of animals, objects, plants, and so on. We'll take all but the last couple of layers of that model and then add on a new set of final layers. We'll then train those new final layers using a specific dataset that we've collected, specific to our task at hand, which is classifying flowers. This dataset would be, for example, a dataset of images of different types of flowers. Once we've trained the final couple of layers of our model, we have a model that is capable of achieving our specific task of classifying flowers and that has significantly benefited from all the work somebody else has done to pre-train a large portion of that model. We've then added some of our own fine-tuning in order to help that model recognize different types of flowers.
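
As an illustration of the flower-classifier example above, here is one common way this pattern is implemented. This is my own hedged sketch using torchvision's pre-trained ResNet-18; the number of flower classes, the choice of library and model, and the training details are assumptions rather than anything specified in the course.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_FLOWER_CLASSES = 5   # assumption: five flower species in our own labeled dataset

# 1. Start from a model pre-trained on a large, generic image dataset (ImageNet)
base_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 2. Freeze the pre-trained portion so its weights are not updated during training
for param in base_model.parameters():
    param.requires_grad = False

# 3. Cut off the final layer and replace it with a new layer sized for our task
base_model.fc = nn.Linear(base_model.fc.in_features, NUM_FLOWER_CLASSES)

# 4. Fine-tune: only the new final layer's parameters are given to the optimizer,
#    and it is trained on our own (assumed) dataset of labeled flower images
optimizer = torch.optim.Adam(base_model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One mini-batch update of the new final layer."""
    optimizer.zero_grad()
    outputs = base_model(images)   # forward pass through the frozen base + new head
    loss = loss_fn(outputs, labels)
    loss.backward()                # gradients flow only into the new final layer
    optimizer.step()
    return loss.item()
```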

Applications of Deep Learning


Review


Course Wrap Up