
You’ll learn the basic rules for calculating probability for single events. Next, you’ll discover how data professionals use methods such as Bayes’ theorem to describe more complex events. Finally, you’ll learn how probability distributions such as the binomial, Poisson, and normal distribution can help you better understand the structure of data.

Learning Objectives

  • Use Python to model data with a probability distribution
  • Describe the significance and use of z-scores
  • Define the Empirical Rule
  • Describe the features and uses of continuous probability distributions such as the normal distribution
  • Describe the features and uses of discrete probability distributions such as the binomial and Poisson distributions
  • Explain the difference between discrete and continuous random variables
  • Describe Bayes’ theorem and its applications
  • Define dependent events
  • Describe conditional probability and its applications
  • Define different types of events such as mutually exclusive and independent events
  • Apply basic rules of probability such as the complement, addition, and multiplication rules
  • Describe basic probability in mathematical terms
  • Explain the difference between objective and subjective probability
Table Of Contents
  1. Basic concepts of probability
  2. Conditional probability
  3. Discrete probability distributions
  4. Continuous probability distributions
  5. Probability distributions with Python
  6. Review: Probability

Basic concepts of probability


Video: Welcome to module 2

This video provides an overview of the upcoming lessons on probability, highlighting its applications in data-driven decision making.

Key takeaways:

  • Probability: Measures the likelihood of events, used for data-driven decisions under uncertainty.
  • Types of Probability: Objective (based on data) and Subjective (based on personal belief).
  • Topics Covered:
    • Basic rules (complement, addition, multiplication)
    • Conditional probability and Bayes’ theorem
    • Probability distributions (discrete and continuous)
    • Common distributions (binomial, Poisson, normal)
    • Z-scores and their application
    • Applying probability distributions using Python’s SciPy library

This information sets the stage for further exploration of probability concepts and their practical applications in various fields, especially data analysis.

Hey there. I really enjoyed exploring descriptive
statistics with you and I’m excited for what’s next: probability. Probability
is the branch of mathematics that deals with measuring and
quantifying uncertainty. In other words, probability uses math to describe
the likelihood of something happening. For example, the chance of rain
tomorrow or of winning the lottery. Data professionals use probability to
help business leaders make data-driven decisions in situations of uncertainty. No one can know the outcome of future
events with complete certainty. What data professionals can do is use all
the available data to make reasonable predictions based on probability. For instance, imagine you’re working with a
stakeholder at a large aerospace company. They need to decide whether or not to
invest in a new technology to improve the production process for
their jet engines. As a data professional, you can estimate
the probability that the new technology will have a positive impact and
predict what its potential costs and benefits might be. The stakeholder can use this information
to make an informed decision about what’s best for the organization. We’ll start by reviewing the two main
types of probability: objective and subjective. We’ll cover basic
rules of probability, like the complement rule, the addition
rule and the multiplication rule. Then we’ll go over
conditional probability and how to describe the relationship
between dependent events. We’ll check out Bayes’ theorem, a key
formula for conditional probability and the basis for
more advanced Bayesian analysis. You’ll also learn about probability
distributions. Probability distributions describe the likelihood of the possible
outcomes of a random event and can be discrete or continuous. We’ll check out discrete probability
distributions such as the binomial and Poisson and find out how they can help
you model specific kinds of data. Then we will explore continuous
probability distributions and focus on the normal distribution, the most widely
used distribution in all statistics. You’ll discover its main features and how
it applies to many different data sets. Next, we’ll also discuss how z-scores
can help you better understand the relationship between data values
in a standard normal distribution. Finally, you’ll learn how to use
Python’s SciPy stats module to apply probability distributions to your data. When you’re ready to start
learning about probability, join me in the next video.

Video: Objective versus subjective probability

Summary of Probability and its Applications

This video explores the concept of probability and its use in data-driven decision making:

Key Points:

  • Probability: Measures the likelihood of uncertain events, aiding informed decisions. (e.g., wearing appropriate clothing based on weather forecast)
  • Applications: Predicting product sales, investment returns, election outcomes, and medical test accuracy.

Types of Probability:

  1. Objective Probability:
    • Based on data, experiments, and mathematical calculations.
    • Two types:
      • Classical: Applies to events with equally likely outcomes. (e.g., flipping a coin)
        • Calculated by: Favorable outcomes / Total possible outcomes
      • Empirical: Based on historical data or experiments. (e.g., taste test for ice cream preference)
        • Calculated by: Number of times event occurs / Total number of events
  2. Subjective Probability:
    • Based on personal feelings, experience, or judgment. (e.g., predicting a horse race winner)
    • Not based on formal analysis or experiments.
    • Can vary significantly between individuals.

Importance of Distinguishing Probability Types:

  • Objective probability: Crucial for data analysis and making informed decisions.
  • Subjective probability: Can be unreliable and should be used cautiously when evaluating predictions or making decisions.

Example: A CEO’s subjective feeling about a new technology’s success might be inaccurate. Data science based on objective probability can provide a more reliable prediction, enabling a data-driven decision.

Overall:

Understanding probability, both objective and subjective, is crucial for making informed decisions in various fields, especially data analysis. The next video will delve deeper into fundamental probability concepts.

Tutorial: Summary of Probability and its Applications

This tutorial provides an overview of probability, its different types, and its applications in various fields, especially data analysis.

What is Probability?

Probability is a branch of mathematics that deals with measuring the likelihood of events occurring. It allows us to quantify uncertainty and make informed decisions in situations where we cannot know the outcome with complete certainty.

Example: We might use probability to decide what to wear based on the weather forecast. A 70% chance of rain suggests it’s more likely to rain than not, so wearing a raincoat might be a wise choice.

Types of Probability

There are two main types of probability:

  1. Objective Probability:
    • Based on data, experiments, or mathematical calculations.
    • Two main types:
      • Classical Probability: Applies to events with equally likely outcomes.
        • Formula: Favorable outcomes / Total possible outcomes
        • Example: Flipping a coin (heads or tails) has a 50% chance (1/2) of landing on either side.
      • Empirical Probability: Based on historical data or observations.
        • Formula: Number of times event occurs / Total number of events
        • Example: A taste test reveals 80 out of 100 people prefer strawberry ice cream. The probability of someone preferring strawberry is 80/100 = 80%.
  2. Subjective Probability:
    • Based on personal beliefs, feelings, or experiences.
    • Not based on formal calculations or data.
    • Can vary significantly from person to person.
    • Example: You might feel strongly that your favorite team will win the game, but this belief is not based on data or analysis.

Applications of Probability

Probability plays a crucial role in various fields, including:

  • Data Analysis: Predicting product sales, investment returns, election outcomes, and medical test accuracy.
  • Machine Learning: Training algorithms to make predictions based on patterns in data.
  • Finance: Assessing risk and making investment decisions.
  • Insurance: Determining premiums based on the likelihood of claims.
  • Quality Control: Setting standards and monitoring manufacturing processes.

Why is understanding probability important?

Understanding probability allows us to:

  • Make informed decisions in situations with uncertainty.
  • Interpret data and draw meaningful conclusions.
  • Evaluate the reliability of predictions and claims.
  • Reduce risk by understanding the likelihood of potential outcomes.

Conclusion

Probability is a powerful tool that can be applied in various fields to quantify uncertainty and make informed decisions. Recognizing the difference between objective and subjective probability is crucial for evaluating information and making sound judgments. By studying probability, you equip yourself with a valuable skill for navigating the world of uncertainty and making data-driven decisions.

Additional Resources:

  • You can find various online resources and interactive simulations to practice and visualize probability concepts.
  • Consider exploring introductory statistics courses or textbooks for a deeper understanding of probability and its applications.

Probability helps you
measure and quantify uncertainty and make informed decisions about
uncertain outcomes. For example, you might use probability to decide what
to wear on a given day. Today’s weather
forecast says there’s a 70 percent chance of snow. Based on this data, you
decide to wear your hat, gloves, and snow boots. When the snow falls,
you stay warm and dry. Data professionals might use probability to
predict the chances that a company will sell a certain amount of product
in a given time period, a financial investment will
have a positive return, a political candidate
will win an election, or a medical test
will be accurate. In this video, we’ll explore the two main types of probability: objective
and subjective. Objective probability
is based on statistics, experiments, and
mathematical measurements. Subjective probability
is based on personal feelings,
experience, or judgment. Let’s start with
objective probability. Data professionals use
objective probability to analyze and interpret data. There are two types of
objective probability: classical and empirical. Classical probability is
based on formal reasoning about events with
equally likely outcomes. To calculate classical
probability for an event, you divide the number
of desired outcomes by the total number
of possible outcomes. For example, if you flip a coin, the result will be
either heads or tails. Heads and tails are
terms commonly used to refer to the two sides of the coin. There are only two
possible outcomes, and both outcomes
are equally likely. The chance that you get heads is one out of two or 50 percent, the same goes for tails. Or take playing cards. There are 52 cards
in a standard deck. Choosing a card gives
you a one-in-52 chance, or 1.9 percent of getting
any card in the deck, whether it’s the ace of hearts, 10 of clubs, or four of spades. But most events are
more complex and do not have equally
likely outcomes. Usually, the weather isn’t a 50 percent chance
of rain or snow; there might be an 80
percent chance of rain tomorrow and a 20 percent
chance of some other outcome. While classical
probability applies to events with equally
likely outcomes, data professionals use
empirical probability to describe more complex events. Empirical probability
is based on experimental or historical data; it represents the likelihood of an event occurring based on the previous results of an
experiment or past events. To calculate empirical
probability, you divide the number
of times a specific event occurs by the
total number of events. For example, say you conduct a taste test with 100 people to find out whether they
prefer strawberry or mint chip-flavored ice cream. You want to know the
probability that a person prefers
strawberry ice cream. Your taste test reveals that 80 people prefer
strawberry ice cream. To calculate probability, you
divide the number of times the event of preferring
strawberry ice cream occurs, 80, by the total
number of events, 100. 80 divided by 100 equals
0.8 or 80 percent. So the probability
that a person prefers strawberry over mint
chip is 80 percent. Earlier, you learned about inferential statistics and
how data professionals use sample data to make inferences or predictions about
larger populations. Inferential statistics
uses probability too. For instance, a retail company might survey a
representative sample of 100 customers to predict the shopping preferences
of all their customers. Data professionals rely on
empirical probability to help them make accurate
predictions based on sample data. For example, in an A/B
test of a website, you test a sample
of users to make a prediction about the future
behavior of all users. Say the sample of users prefers a green add-to-cart
button over a blue one. You may infer from this data that the larger population of future users will probably
share their preference. An A/B test lets you make a reasonable prediction about future users based on
empirical probability. This probability can
help an online business make smarter decisions
and increase sales. In contrast, the results of subjective probability
are based on personal feelings,
experience, or judgment. This type of probability does not involve formal calculations, statistical analysis, or
scientific experiments. For instance, you may have an overwhelming feeling that a certain horse will
win a horse race, or that your favorite team will win the championship game. You may have good
reasons for your belief, but your reasons are
personal or subjective. Your belief is not based on statistical analysis or
scientific experiments. For this reason, the
subjective probability of an event may differ widely
from person to person. It’s important to know
the difference between subjective and objective
probability when you evaluate a prediction
or make a decision. For example, the CEO of an auto company might
feel confident that using a new technology to
manufacture their pickup truck will cut costs and
increase profits. But if their prediction
is only based on personal feeling or
subjective probability, it may not be reliable. Data science based on statistical analysis or
objective probability can help accurately predict the
potential impact of the new technology and help
the CEO make an informed, data-driven decision about
adopting the technology. That’s all for now. Coming up, we’ll check out some fundamental
concepts of probability.
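
The two objective approaches translate directly into arithmetic. Here is a minimal Python sketch (my own illustration, not part of the course materials) that reproduces the coin, card, and taste-test numbers from this video:

```python
# Classical probability: desired outcomes / total possible outcomes
def classical_probability(desired, total):
    return desired / total

# Empirical probability: times the event occurred / total number of events
def empirical_probability(occurrences, total_events):
    return occurrences / total_events

print(classical_probability(1, 2))     # coin lands heads: 0.5, or 50 percent
print(classical_probability(1, 52))    # drawing a specific card: ~0.019, or 1.9 percent
print(empirical_probability(80, 100))  # preferring strawberry: 0.8, or 80 percent
```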

Video: The principles of probability

Summary of Probability Concepts:

  • Probability: A number between 0 and 1 representing the likelihood of an event occurring.
    • 0: No chance (event won’t occur)
    • 1: Certain (event will occur)
    • Values between 0 and 1 indicate varying degrees of likelihood.
  • Examples: Flipping coins, rolling dice, drawing cards (used for historical and educational reasons).
  • Random Experiment: A process with uncertain outcomes.
    • Has multiple possible outcomes.
    • Each outcome can be identified beforehand.
    • Outcome depends on chance (cannot be predicted with certainty).
  • Calculating Probability:
    • Divide the number of desired outcomes by the total number of possible outcomes.
    • Example: Probability of getting heads from a coin toss = 1 (desired outcome) / 2 (total outcomes) = 0.5 (50%)

This summary provides a basic understanding of probability and its calculation for single events. The video mentions exploring more complex scenarios in future lessons.

If the probability of an event equals 1, what is the chance that the event will occur?

100%

If the probability of an event equals 1, there is a 100% chance that the event will occur. Probability is expressed as a number between 0 and 1. If the probability of an event is close to zero, there is a small chance that it will occur. If the probability is close to 1, there is a strong chance that it will occur.

Recently, you learned that probability uses
math to deal with uncertainty or to determine how likely it is that
an event will occur. In this video, you’ll learn some fundamental
concepts of probability. We’ll discuss the
mathematical definition of probability and how to calculate probability for
single random events. First, I want to give
you some context about the types of
examples we’ll be using. In this part of the course, we’re going to
continue to reference examples of events
like flipping coins, rolling dice, and drawing cards. There are a couple
of reasons for this. One is historical. The modern theory of probability originates in the
analysis of games of chance in the 16th
and 17th centuries. Second, and more importantly, these are events with
clearly defined outcomes that most people
are familiar with. They’re just super
useful examples of basic probability concepts. That’s why they’re used in stats classes around the world. Later on in the course, we will explore probability of for more complex events like the ones
you’ll encounter in your future work as
a data professional. Let’s talk about the fundamental
concepts of probability. First, the probability
that an event will occur is expressed as a
number between 0 and 1. If the probability of
an event equals zero, there’s a zero percent chance
that the event will occur. If the probability of
an event equals one, there’s a 100 percent chance
that the event will occur. There are lots of possibilities
in between 0 and 1. If the probability of
an event equals 0.5, there is a 50
percent chance that the event will
occur or not occur. If the probability of an
event is close to zero, there’s a small chance
that the event will occur. If the probability of an
event is close to one, there’s a strong chance
that the event will occur. For example, if the chance of a stock price
going up this year is 0.05 or five percent, then you probably
don’t want to buy it. If it’s 0.95 or 95 percent, then it’s probably
a good investment. Probability measures the
likelihood of random events. The result of a random event cannot be predicted
with certainty. Before flipping a coin
or rolling a die, you do not know the outcome. The coin could turn
up heads or tails, and a die could show any
number one through six. These are examples of what statisticians call a
random experiment, also known as a
statistical experiment. A random experiment is a process whose outcome cannot be
predicted with certainty. All random experiments have
three things in common. The experiment can have more
than one possible outcome, you can represent each
possible outcome in advance, and the outcome of the
experiment depends on chance. Let’s take the example
of flipping a coin. There’s more than one
possible outcome. You can represent
each possible outcome in advance, heads or tails, and the outcome
depends on chance. Until you actually
toss the coin, you can’t know whether it
will be heads or tails, or think about rolling
a six-sided die. There’s more than one
possible outcome, and all outcomes can be
represented in advance, 1, 2, 3, 4, 5, and 6. The outcome of any roll
depends on chance. Until you roll the die, you can’t know which
number will turn up. To calculate the probability
of a random experiment, you divide the number
of desired outcomes by the total number
of possible outcomes. You may recall that this is also the formula for
classical probability. The probability of tossing a coin and getting heads
is one chance in two. This is 1 divided by 2
equals 0.5 or 50 percent. The probability of rolling
a die and getting a two is one chance out of six. This is 1 divided by 6 equals 0.166 repeating, or
about 16.7 percent. Now, let’s conduct a
different random experiment. Imagine a jar
contains 10 marbles, two marbles are red, three are green,
and five are blue. You decide to take one
marble from the jar. You want to know the probability that the marble will be green. First, count the number
of possible outcomes. You have an equal
chance of choosing any one of the 10 marbles. Next, figure out how many of these outcomes refer to
what you want to know. The chance of choosing
a green marble. Of the 10 total marbles,
three are green. Therefore, the
probability of choosing a green marble is 3
out of 10, or 0.3. In other words, you have
a 30 percent chance of choosing a green marble. Now you know how to calculate the probability of a
single random event. This knowledge will be useful as a building block for more complex calculations
of probability.
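
As a quick check on the marble experiment, here is a short Python sketch (my own, using only the numbers given above) that counts desired outcomes against the sample space:

```python
# The jar's sample space: 10 marbles in total
jar = ["red"] * 2 + ["green"] * 3 + ["blue"] * 5

# Probability = desired outcomes / total possible outcomes
p_green = jar.count("green") / len(jar)
print(p_green)  # 0.3, a 30 percent chance of drawing a green marble
```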

Reading: Fundamental concepts of probability


Video: The basic rules of probability and events

Key Concepts

  • Probability Notation: Using P(A) to represent the probability of event A. The complement (event not occurring) is denoted as P(A’).
  • Complement Rule: The probability of an event occurring and its complement must add up to 1. In other words, P(A) + P(A’) = 1.
  • Mutually Exclusive Events: Events that cannot happen simultaneously (e.g., rolling a 2 and a 4 on a single die roll).
  • Addition Rule: For mutually exclusive events, the probability of either A or B happening is the sum of their individual probabilities: P(A or B) = P(A) + P(B).
  • Independent Events: Events where the outcome of one doesn’t affect the other (e.g., a coin toss and the weather).
  • Multiplication Rule: For independent events, the probability of both A and B happening is the product of their individual probabilities: P(A and B) = P(A) * P(B).

Important Distinctions

  • Addition Rule: Use for mutually exclusive events (can’t happen at the same time).
  • Multiplication Rule: Use for independent events (one doesn’t influence the other).

Examples

  • Complement Rule: 30% chance of rain means a 70% chance it won’t rain.
  • Addition Rule: Probability of rolling a 2 or 4 on a die: 1/6 + 1/6 = 1/3.
  • Multiplication Rule: Probability of tails, then heads on two coin flips: 1/2 * 1/2 = 1/4.

Fill in the blank: The addition rule states that, if the events A and B are ____, then the probability of A or B happening is the sum of the probabilities of A and B.

mutually exclusive

The addition rule states that, if the events A and B are mutually exclusive, then the probability of A or B happening is the sum of the probabilities of A and B. Two events are mutually exclusive if they cannot occur at the same time.

So far, we’ve been focusing on calculating the probability
of single events. Many situations, both in everyday life and
in data analytics, involve more than one event. As a future data professional, you’ll often deal with
probability for multiple events. In this video, we’ll cover three basic rules
of probability: the complement rule,
the addition rule, and the multiplication rule. These rules help you
better understand the probability of
multiple events. We’ll also discuss two
different types of events: mutually exclusive events
and independent events. Then you’ll learn
how to calculate probability for each of them. First, let’s discuss
probability notation, which is the standard way to symbolize probability concepts. As we go along, I’ll share some useful notations that will help us communicate
more efficiently when it comes to
basic probability. The letter P indicates the
probability of an event. For example, if you’re
dealing with two events, you can label one event A, and the other event B. The notation for the
probability of event A is the letter P followed by the
letter A in parenthesis. For the probability of event B, it’s the letter P, followed by the letter B in parenthesis. If you want to talk
about the probability of event A not occurring, add an apostrophe
after the letter A. You can also say this is
the probability of not A. Now, let’s check out
our first basic rule, the complement rule. In stats, the complement of an event is the
event not occurring. For example, either it rains or it does not rain. Either you win the lottery or you don’t win the lottery. The complement of
rain is no rain. The complement of
winning is not winning. The important thing to note is that the two probabilities, the probability of an
event happening and the probability of
it not happening, must add to one. Recall that a probability of
one is the same as saying there’s 100 percent certainty
of an event occurring. Another way to think
about it is that there is a 100 percent chance of one event or the other
event happening. There may be a 30 percent
chance of rain tomorrow, but there is a 100 percent
chance that it will either rain or not
rain tomorrow. The complement rule says that the probability
that event A does not occur is 1 minus the
probability of event A. For example, if the
weather forecast says there’s a 30 percent
chance of rain tomorrow, there’s a probability of 0.3. You can use the
complement rule to calculate the probability that
it does not rain tomorrow. The probability
of no rain equals 1 minus the probability of rain. This is 1 minus 0.3
equals 0.7 or 70 percent. But the complement rule
and our next rule, the addition rule, apply to events that are
mutually exclusive. Two events are
mutually exclusive if they cannot occur
at the same time. For example, you can’t visit both Argentina and
China at the same time, or turn left and right
at the same time. The addition rule says that if the events A and B are
mutually exclusive, then the probability
of A or B happening is the sum of the
probabilities of A and B. Let’s check out an example
using a six-sided die. Say you want to find out
the probability of rolling either a two or a four on
a single roll of the die. These two events are
mutually exclusive. You can roll a two or a four, but not both at the same time. The addition rule
says that to find the probability of
either event happening, you should sum up
their probabilities. The probability of rolling any single
number on a six-sided die is 1/6. The probability of
rolling a two is 1/6, and the probability of
rolling a four is 1/6. 1/6 plus 1/6 equals 1/3. The probability of rolling
either a two or a four is 1/3 or 33 percent. The addition rule applies to
mutually exclusive events. If you want to calculate probability for
independent events, you can use the
multiplication rule. Two events are independent
if the occurrence of one event does not change the probability of
the other event. This means that
one event does not affect the outcome
of the other event. For example, checking
out a book from your local library does not
affect tomorrow’s weather. Drinking coffee in
the morning does not affect the delivery of your
mail in the afternoon. These events are separate
and independent. The multiplication
rule says that if the events A and B
are independent, then the probability of
both A and B happening is the probability of A multiplied
by the probability of B. For instance, imagine
two consecutive tosses. Say you want to know the
probability of tails on the first toss and heads
on the second toss. First, figure out what
events you’re dealing with and then apply
the appropriate rule. Two coin tosses are
independent events. The first toss does not affect the outcome
of the second toss. For any toss, the probability
of getting either heads or tails always remains
1/2 or 50 percent. You would use the multiplication
rule for this event. The probability of
getting tails and heads is the probability
of getting tails, multiplied by the
probability of getting heads. The probability of each
event is 0.5 or 50 percent. Now, plug in the
numbers, 0.5 times 0.5 equals 0.25 or 25 percent. The probability of getting
tails on the first toss and heads on the second
toss is 25 percent. To recap, let’s compare
the addition and multiplication rules and
list their differences. It will be helpful to keep
these differences in mind, so you know when to
use the two rules. The addition rule sums up
the probabilities of events, and the multiplication rule
multiplies the probabilities. The addition rule
applies to events that are mutually exclusive. The multiplication
rule applies to events that are independent. The basic rules of
probability help you describe events that are mutually
exclusive or independent. In an upcoming video, we’ll check out
conditional probability, which applies to
dependent events.
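
Here is a brief Python sketch (mine, not from the course) that restates the three rules with the same numbers used in this video:

```python
# Complement rule: P(A') = 1 - P(A)
p_rain = 0.3
print(1 - p_rain)  # 0.7: a 70 percent chance of no rain

# Addition rule, for mutually exclusive events: P(A or B) = P(A) + P(B)
p_two, p_four = 1 / 6, 1 / 6
print(p_two + p_four)  # 0.333...: rolling either a two or a four

# Multiplication rule, for independent events: P(A and B) = P(A) * P(B)
p_tails, p_heads = 0.5, 0.5
print(p_tails * p_heads)  # 0.25: tails on the first toss, heads on the second
```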

Reading: The probability of multiple events


Practice Quiz: Test your knowledge: Basic concepts of probability

Objective probability is based on personal feeling, experience, or judgment.

Fill in the blank: In statistics, a number between _____ is used to express the probability that an event will occur.

The probability of no snow tomorrow equals 1 minus the probability of snow tomorrow. This is an example of what rule of probability?

Conditional probability


Video: Conditional probability

Conditional Probability

  • Definition: The probability of one event happening given that another event has already occurred.
  • Dependence: This applies when the first event influences the probability of the second event.
  • Applications: Used in finance, insurance, science, machine learning, and everyday decision-making.

Examples of Dependence

  • Needing internet to visit a website
  • Needing a passport for international travel
  • Drawing two aces in a row from a deck of cards (the first draw affects the second draw’s probability).

Conditional Probability Formula

  • P(A and B) = P(A) * P(B given A)
    • Where P(B given A) means the probability of event B happening if event A has already occurred.

Examples:

  • Cards: Probability of drawing two aces in a row is very low (about 0.5%).
  • College: Probability of being admitted and getting a scholarship is even lower (about 0.2%) due to the dependence of the events.

Key Takeaway: Conditional probability helps us understand relationships between dependent events, allowing for more accurate predictions.

Fill in the blank: Two events are _____ if the occurrence of one event changes the probability of the other event.

dependent

Two events are dependent if the occurrence of one event changes the probability of the other event.

So far, you’ve learned how
to calculate probability for a single event and for two
or more independent events. Remember, two events are independent if one
event does not affect the outcome
of the other event, like two coin flips. In this video, you’ll
learn how to calculate probability for two or
more dependent events. This type of probability is known as conditional
probability. Conditional
probability refers to the probability of an event occurring given that another
event has already occurred. Conditional probability is
used in many different fields, like finance, insurance,
science, and machine learning. For example, an agency that sells life
insurance might use conditional probability
to decide how risky it is to insure someone who
skydives for a living. Data professionals,
like those who work on machine
learning models use conditional probability to make accurate predictions
about complex data sets. Before we get into calculating
conditional probability, let’s go over the
concept of dependence. Two events are dependent
if the occurrence of one event changes the
probability of the other event. This means that the
first event affects the outcome of the second event. For instance, if you
want to visit a website, you first need Internet access. Visiting a website depends on you having access
to the Internet. If you want to travel
to another country, you first need to
get a passport. Traveling to another country depends on you
having a passport. In each instance, we can
say that the second event is dependent on or conditional
on the first event. Let’s check out an
example of dependence that’s closer to
probability theory. Imagine you have two events. The first event is drawing an ace from a standard
deck of playing cards, and the second event is drawing another ace from the same deck. There are four aces in
a deck of 52 cards. For the first draw, the chance of getting
an ace is four out of 52, or about 7.7 percent. But for the second draw, the probability
of getting an ace changes because you’ve
removed a card from the deck. Now, there are three aces
in a deck of 51 cards. For the second draw, the chance of getting an ace is three out of 51, or about 5.9 percent. Getting an ace is
now less likely. These two events are dependent
because getting an ace on the first draw changes the probability of getting
an ace on the second draw. Now you have a better
understanding of dependent events. Let’s return to conditional probability and check
out the formula. You don’t need to memorize
the formula, but personally, I find that reviewing
the formula often helps me understand
the concept better. That’s why I’m
sharing it with you. The formula says that for
two dependent events, the probability of event A
and event B occurring equals the probability of event A times the probability of
event B given A. You may notice that we have a new notation in this formula, the vertical bar between
the letters B and A means that event B depends
on event A happening. We say this as the
probability of B given A. The formula can also
be expressed as the probability of
event B given event A equals the probability
that both A and B occur divided by the
probability of A. These are just two ways of representing the same equation. Depending on the situation or what information you
are given upfront, it may be easier to
use one or the other. We can apply the conditional
probability formula to our example of drawing
an ace playing card. The probability of
event A or getting an ace on the first
draw is four out of 52. The probability of
event B given event A, or of getting an ace
on the second draw, is three out of 51. Let’s enter these numbers
into the formula. The probability of
event A and event B, or of getting two aces in a row, is 4 over 52 multiplied
by three over 51. If you do the math, this
equals one over 221. The probability of
getting two aces in a row equals one over 221, or about 0.5 percent. Let’s check out another example. Imagine you’re
applying for college. The college accepts 10 out of every 100 applicants
or 10 percent. If you’re accepted, you also hope to receive an
academic scholarship. The college awards academic
scholarships to two out of every 100 accepted
students or two percent. You want to calculate the
probability that you get accepted and you
get a scholarship. Getting a scholarship depends
on first getting accepted. So this is a
conditional probability because it deals with
two dependent events. Let’s call getting accepted, event A and getting a
scholarship, event B. You want to calculate
the probability of event A and event B. According to the formula, to find the probability
of event A and event B, you can multiply the
probability of event A by the probability of
event B given event A. The probability of event A, getting accepted
is 10 out of 100. The probability of
event B given event A, or getting a scholarship given that you are
first accepted, is two out of 100. Ten divided by 100
times 2 divided by 100 equals 1 divided by 500. The probability of getting accepted and getting
a scholarship is one out of 500 or 0.2 percent. Conditional probability
helps you better understand the relationship
between dependent events. As a data professional, I often use conditional
probability to predict how an event, like an ad campaign, will
impact sales revenue. Then I share my findings
with stakeholders so they can make more
informed business decisions.
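
Both worked examples reduce to the same formula, P(A and B) = P(A) * P(B given A). A minimal Python sketch (my own illustration) with the numbers from this video:

```python
# Dependent events: P(A and B) = P(A) * P(B | A)

# Two aces in a row, drawn without replacement
p_first_ace = 4 / 52           # P(A): four aces in 52 cards
p_second_given_first = 3 / 51  # P(B | A): three aces left in 51 cards
print(p_first_ace * p_second_given_first)  # ~0.0045, about 0.5 percent (1/221)

# Acceptance and scholarship
p_accepted = 10 / 100                   # P(A)
p_scholarship_given_accepted = 2 / 100  # P(B | A)
print(p_accepted * p_scholarship_given_accepted)  # 0.002, or 0.2 percent (1/500)
```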

Reading: Calculate conditional probability for dependent events


Video: Discover Bayes’ theorem

Conditional Probability Recap

  • Conditional probability is the probability of one event happening given another event has already happened (like the chance of getting a second ace after drawing one).

What is Bayes’ Theorem?

  • A formula used to calculate conditional probability.
  • Allows us to update our estimate of a probability based on new information.

Key Terms

  • Prior Probability: The base probability of an event occurring before any new data.
  • Posterior Probability: The revised probability of an event after considering new information.

Bayes’ Theorem Applications

  • Widely Used: Finance, marketing, medical testing, artificial intelligence – all use Bayesian approaches.
  • Example: A medical test’s accuracy can be refined using Bayes’ Theorem based on the patient’s age or other factors.

Example: Planning an Outdoor Event

  1. Prior Probability: Overall, there’s a 10% chance of rain.
  2. New Information: The morning is cloudy.
  3. Bayes’ Theorem: Used to update the chance of rain based on cloud data.
  4. Posterior Probability: Calculation yields a 12.5% chance of rain (it increased slightly)

Key Takeaway

Bayes’ Theorem is a powerful tool to revise our understanding of probabilities as new information becomes available.

What does Bayes’s theorem enable data professionals to calculate?

Posterior probability

Bayes’s theorem enables data professionals to calculate posterior probability, or the updated probability of an event based on new data.

Earlier, you learned that conditional probability
refers to the probability of an event occurring given that another event
has already occurred. For example, when you draw an ace from a deck
of playing cards, this changes the probability of drawing a second ace
from the same deck. In this video,
you’ll learn how to calculate conditional probability
using Bayes’ theorem. Bayes’ theorem, also
known as Bayes’ rule, is a math formula for determining conditional
probability. It’s named after Thomas Bayes, an 18th century mathematician
from London, England. Bayes’ theorem provides a way to update the probability of an event based on new
information about the event. In Bayesian statistics,
prior probability refers to the probability of an event before new data is collected. Posterior probability is
the updated probability of an event based on new data. Posterior means occurring after. Posterior probability
is calculated by updating the prior probability
using Bayes’ theorem. For example, let’s say a medical condition
is related to age. You can use Bayes’ theorem
to more accurately determine the probability
that a person has the condition based on age. The prior probability would be the probability of a person
having the condition. The posterior or
updated probability would be the probability of a person having the condition if they’re in a
certain age group. Bayes’ theorem is the foundation for the field of
Bayesian statistics, also known as
Bayesian inference, which is a powerful
method for analyzing and interpreting data in
modern data analytics. Data professionals
apply Bayes’ theorem in a wide variety of fields from artificial
intelligence to medical testing. For instance,
financial institutions use Bayesian analysis to rate the risk of lending
money to borrowers or to predict the success
of an investment. Online retailers use Bayesian
algorithms to predict whether or not users will like certain products and services. Marketers rely on Bayes’
theorem for identifying positive or negative responses
in customer feedback. Let’s check out the
theorem itself. As always, don’t worry
about memorizing it. Bayes’ theorem is a bit complicated and this is
the basic version of it. Bayes’ theorem says that
for any two events A and B, the probability of A given B equals the probability
of A multiplied by the probability of B given A divided by the
probability of B. In math terms, prior probability is the probability of event A. Posterior probability or what you’re ultimately
trying to figure out is the probability of
event A given event B. The key for Bayes’
theorem is that it includes both the
conditional probability of B given A and the conditional
probability of A given B. If you know one of
these probabilities, Bayes’ theorem can help
you determine the other. Let’s check out an example. Say you’re planning
a big outdoor event like a graduation party. The success of the event
depends on good weather. On the day of the event, you notice that the
morning is cloudy. You want to find out
the chance of rain, given that this day
starts off cloudy. If there’s a high
probability of rain, you may decide to move the event indoors or even cancel it. You know the following
information: at this time of year, the overall chance of
rain is 10 percent. However, cloudy
mornings are common. About 40 percent of all
days start off cloudy and 50 percent of all rainy
days start off cloudy. In this example, your
prior probability is the overall probability
of a rainy day. New data will update
this probability, in this case, the knowledge that the morning is cloudy and
that rain may be coming. What you ultimately
want to find out is the probability that it will
rain given that it’s cloudy. This is your posterior
probability. You can use Bayes’ theorem to update the prior probability that it rains based on the new data that the
morning is cloudy. When you work with
Bayes’ theorem, it’s helpful to first figure out what event A is and
what event B is. This makes it easier
to understand the relationship between
events and use the formula. Let’s use the word rain
to refer to event A, the probability of rain. This is your prior probability. Event B is the probability
that the day will be cloudy. Let’s use the word cloudy
to refer to event B. Now, you can rewrite the
probability of event B given event A as the
probability that it’s cloudy, given that it rains. Finally, the
probability of event A, given event B is the
probability that it rains, given that it’s cloudy. This is your posterior
probability or the updated probability that Bayes’ theorem will
help you calculate. Finally, enter what you
know into the formula. The probability of
rain is 10 percent, the probability that it’s
cloudy is 40 percent, the probability that it’s cloudy given that it rains
is 50 percent. The probability of rain given
that it’s cloudy equals 0.1 times 0.5, divided by 0.4. This equals 0.125
or 12.5 percent. There’s a 12.5 percent
chance of rain today. This is your posterior
probability or the updated probability based on the data that the
morning is cloudy. The odds are still
in your favor. You decide to proceed with your outdoor party.
Hope it’s a fun one.
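
The rain-given-cloudy calculation is a one-liner once the three inputs are named. A minimal Python sketch (my own illustration):

```python
# Bayes' theorem: P(A | B) = P(A) * P(B | A) / P(B)
p_rain = 0.10               # prior probability of rain
p_cloudy = 0.40             # probability that a day starts off cloudy
p_cloudy_given_rain = 0.50  # probability of a cloudy morning, given rain

posterior = p_rain * p_cloudy_given_rain / p_cloudy
print(posterior)  # 0.125: a 12.5 percent chance of rain, given the cloudy morning
```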

Video: The expanded version of Bayes’s theorem

When to Use the Expanded Bayes’ Theorem

  • Situations with Unknowns: The basic Bayes’ Theorem requires you to know the probability of event B. The expanded version is useful when you don’t have this information.
  • Evaluating Tests: A common use case is determining the accuracy of tests (medical tests, spam filters, quality control checks) where false positives and false negatives are a concern.

Key Terms

  • False Positive: A test wrongly indicates something is present when it’s not. (e.g., a spam filter misclassifying an email)
  • False Negative: A test wrongly indicates something is absent when it’s present. (e.g., missing a defective part)

Example: Peanut Allergy Test

  1. Prior Probability: 1% of the population has the allergy.
  2. Test Accuracy:
    • 95% chance of true positive (test positive if the allergy is present)
    • 2% chance of false positive (test positive even if the allergy is absent)
  3. Posterior Probability (what we want): Given a positive test result, what’s the chance the allergy is actually present?

Using the Expanded Bayes’ Theorem By plugging in the known probabilities, we calculate that if the test is positive, there’s only a 32.4% chance the person actually has the allergy. This highlights the impact of false positives when the overall condition is rare.

Key Takeaway: The expanded Bayes’ Theorem helps us understand test accuracy more deeply when dealing with uncertainties and potential errors.

You’ve already
learned that Bayes’ theorem tells you how to update the probability of an event based on new data
about the event. But there are several
different versions of Bayes’ theorem. They’re written in
different ways and used for different
types of problems. In this video, you’ll learn
about an expanded version of Bayes’ theorem and
how to use it to predict the accuracy of a test. The expanded version of
Bayes’ theorem is long. If you’re not an
experienced statistician, it may seem quite intimidating. You don’t need to worry about
memorizing this formula. What’s important
to know is that the expanded version
works better than the basic version in
certain situations. The theorem goes like this. The probability of event A given event B equals the
probability of B given A, multiplied by the
probability of A divided by the
probability of B given A, times the probability of A plus the probability
of B given not A, multiplied by the
probability of not A. Well, that was a lot. You can use the two versions of Bayes’ theorem to deal with
different types of problems. Sometimes for instance, you don’t know the
probability of event B, which is in the denominator of the equation for the
basic Bayes’ theorem. In that case, you can use the expanded version
of Bayes’ theorem, because you don’t need to
know the probability of event B to use the
expanded version. This longer version of Bayes’
theorem is often used to evaluate tests such as
medical diagnostic tests, quality control tests, or software tests such
as spam filters. When evaluating the
accuracy of a test, Bayes’ theorem can take into account the probability
for testing errors known as false
positives and false negatives. A false positive is a test
result that indicates something is present
when it really is not. For example, a spam filter may incorrectly identify a
legitimate email as spam. False positives are often
associated with medical testing, but they also apply to other
areas like software testing. For instance, antivirus software may indicate that a
computer file is a virus, even though the file is normal. A false negative is
a test result that indicates something is not
present when it really is. For example, a spam filter may incorrectly identify a spam
email as legitimate. False negatives also apply
to all kinds of tests. In manufacturing for instance, a quality control
test may incorrectly identify a defective part
as an acceptable part. Next, let’s explore
detailed example of how to use the longer Bayes
theorem to evaluate a test. Let’s say you want to
evaluate the accuracy of a diagnostic test that checks for the presence of
a peanut allergy. Suppose that one percent of the population is
allergic to peanuts. Based on historical data, if a person has the allergy, there is a 95 percent chance
that the test is positive. If a person doesn’t
have the allergy, there is still a
two percent chance that the test is positive. This is a false positive, because it’s a
positive result for a person who does not
actually have the allergy. You want to know given that
a person tests positive, what are the chances that they
actually have the allergy? You can also think of
the situation in terms of prior and posterior
probability, which you learned about
earlier in connection with a basic version
of Bayes’ theorem. You start off with
the prior probability that a person has the
allergy, this is one percent. Then you’ll update
this prior probability with new data based on testing; the probability of getting true positive and false
positive test results. Finally, you want to figure out the posterior probability that the allergy is present given
that the test is positive. There are two main events
in this situation. First, actually
having the allergy. Second, testing positive. Let’s call having the allergy, Event A and testing
positive, Event B. Remember, these two
events are different, because you can test positive
and not have the allergy, which is a false positive. Now, let’s review what you know. First, there is the
probability that a person actually
has the allergy, which is one percent. The probability of Event
A equals one percent. Next, there is a 95
percent chance that a test is positive if the
person has the allergy. This is a conditional
probability for two dependent events. The probability of
a positive test given that the
allergy is present, so the probability of Event B given Event A equals 95 percent. Then there is the
false positive result. The two percent chance
that the test is positive given that the
allergy is not present. This is another
conditional probability; the probability of Event B
given not A equals 2 percent. Finally, if you use
the complement rule, you can also figure
out one more probability. The probability of not
having the allergy. The complement rule says that the probability that
Event A does not occur, is 1 minus the
probability of Event A. If the probability of Event A, actually having the allergy
is one percent or 0.01, then the probability
of not having the allergy is 1 minus 0.01. This equals 0.99 or 99 percent, so the probability of
not A equals 99 percent. These are the
probabilities you know. What you don’t know is the
probability of Event B, the probability that a person gets a positive test result. This is where you’d have trouble using the basic version
of Bayes’ theorem, because the probability of Event B is part of the formula. Instead, you can use the
expanded version since you don’t need to know the probability of Event
B for that formula. Now you can enter what you
know into the formula. The probability of A is
one percent or 0.01. The probability of not A
equals 99 percent or 0.99. The probability of B given A
equals 95 percent or 0.95. The probability of B given not A equals two
percent or 0.02. If you do the math, the result
is 0.324 or 32.4 percent. The probability of
Event A given Event B, or the probability that the
allergy is present given that the test is positive
is 32.4 percent. If 32.4 percent
seems low to you, it’s because the allergy
is rare to begin with. It’s not very likely that
a random person will both test positive and
have the allergy. The expanded version
of Bayes’ theorem gives you a better understanding
of the accuracy of the test by taking into account multiple probabilities.
That’s all for now.
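
Here is the allergy example as a short Python sketch (my own illustration), following the expanded formula term by term:

```python
# Expanded Bayes' theorem:
# P(A | B) = P(B | A) * P(A) / (P(B | A) * P(A) + P(B | A') * P(A'))
p_allergy = 0.01               # P(A): prior probability of the allergy
p_no_allergy = 1 - p_allergy   # P(A'): from the complement rule, 0.99
p_pos_given_allergy = 0.95     # P(B | A): true positive rate
p_pos_given_no_allergy = 0.02  # P(B | A'): false positive rate

numerator = p_pos_given_allergy * p_allergy
denominator = numerator + p_pos_given_no_allergy * p_no_allergy
print(numerator / denominator)  # ~0.324: a 32.4 percent chance the allergy is present
```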

Reading: Calculate conditional probability with Bayes’s theorem


Practice Quiz: Test your knowledge: Conditional probability

What is conditional probability?

Suppose two events occur: The first event is drawing an ace from a standard deck of playing cards, and the second event is drawing another ace from the same deck. Note that the first ace is not reinserted into the deck after it is drawn. What term is used to describe these two events?

Fill in the blank: _____ probability is the updated probability of an event based on new data.

Discrete probability distributions


Video: Introduction to probability distributions

What are Probability Distributions?

  • They describe how likely different outcomes of a random event are.
  • Used to model data and find patterns within it.
  • Example: Probability of a drug curing a disease, or the results of dice rolls.

Random Variables

  • Represent the possible outcomes of an event.
  • Types:
    • Discrete: Countable values, often whole numbers (e.g., number of times a coin lands heads-up).
    • Continuous: Measured values along a range (e.g., height, time, temperature) – infinite decimal possibilities.

Distributions for Each Type of Variable

  • Discrete Probability Distributions:
    • Describe probabilities for each specific outcome (e.g., the exact probability of rolling a 3 on a die is 1/6).
    • Can be shown as tables or bar graphs (histograms).
  • Continuous Probability Distributions:
    • Describe probability of an outcome falling within a range of values (e.g., probability of a tree being between 15-16 ft).
    • Can’t get the exact probability of a single value (it’s essentially zero).
    • Visualized as curves, the most common being the bell curve (normal distribution).

Key Points

  • To determine if a variable is discrete or continuous: Can you count the outcomes or do you need to measure them?
  • Sample Space: All possible outcomes of an event.

Fill in the blank: A _____ random variable has a countable number of possible values.

discrete

A discrete random variable has a countable number of possible values.

So far we’ve covered a lot of key
concepts in basic probability. What you’ve learned about basic
probability will help you better understand probability distributions, our
main topic for this part of the course. In my job as a data professional,
I use probability distributions to model different kinds of data sets and to
identify significant patterns in my data. A probability distribution describes
the likelihood of the possible outcomes of a random event. Probability distributions
can represent the possible outcomes of simple random events. Like tossing a coin or rolling a die. They can also represent
more complex events. Like the probability of a new medication
successfully treating a medical condition. A random variable
represents the values for the possible outcomes of a random event. There are two types of random
variables: discrete and continuous. A discrete random variable has
a countable number of possible values. Often discrete variables are whole
numbers that can be counted. For example, if you roll a die five times you can count
the number of times the die lands on two. If you toss a coin five times you can
count the number of times it lands on heads. A continuous random variable takes all the
possible values in some range of numbers. When it comes to continuous variables, you’re dealing with decimal
values rather than whole numbers. For instance,
all the decimals values between one and two, such as 1.1, 1.12, 1.125 and so on. These values are not countable since there
is no limit to the possible number of decimal values between one and two. Typically these are decimal values
that can be measured such as height, weight, time or temperature. For example, if you measure
the height of a person or object, you can keep on making your
measurement more accurate. The height of a person could
be 70.2 inches, 70.23 inches, 70.237 inches, 70.2375 inches and so on. There is no limit to
the number of possible values. It’s not always immediately obvious if
a variable is discrete or continuous. To help choose between the two, you can
use the following general guidelines. If you can count the number of outcomes
you are working with a discrete random variable. For example, counting the number
of times a coin lands on heads. If you can measure the outcome, you are working with
the continuous random variable. For example, measuring the time it
takes for a person to run a marathon. Now that we’ve explored random variables, let’s return to the topic of probability
distributions which described the probability of each possible
value of the random variable. Discrete distributions represent
discrete random variables and continuous distributions represent
continuous random variables. Once you know the sample
space of a random variable, you can assign probabilities to
each of the possible values. In statistics you can use the term
sample space to describe the set of all possible values for
a random variable. For example, a single coin toss is a
random variable with two possible values: heads and tails. So the sample space is heads and tails. If you roll a six-sided die, you have
a random variable with six possible values or a sample space of one,
two, three, four, five and six. Let’s check out an example of
a discrete probability distribution. Take the familiar random
event of a single die roll. The sample space for a single die roll
is one, two, three, four, five and six. The probability of each
outcome is the same: one out of six, or 16.7%. You can display a discrete probability
distribution as a table or a graph. The distribution table summarizes
the probability for each possible outcome. The top row lists each
outcome of the die roll and the bottom row lists
the corresponding probability. The bar graph or histogram shows
the same probability distribution but in a different form. For a discrete probability distribution, the random
variable is plotted along the X axis and the corresponding probability
is plotted along the Y axis. In this case the X axis
represents each possible outcome of a single die roll, one through six. The Y axis represents
the probability of each outcome. Continuous probability distributions and their graphs work a little differently
from discrete distributions. This is due to the difference between
discrete and continuous random variables. The probability distribution for
a discrete random variable can tell you the exact probability for
each possible value of the variable. For instance,
the probability of rolling a die and getting a three is one out of six or
about 16.7%. The probability distribution for
a continuous random variable can only tell you the probability that the variable
takes on a range of values. Let’s check out an example to learn more. A continuous random variable may have
an infinite number of possible values. Imagine you want to measure the height
of an Oak tree you picked at random from a nearby forest. In this example, the height of the tree
is a continuous random variable. The tree’s height could be, say, 15 ft, or 15.2 ft, or 15.2187 ft, and so on. You can keep on adding another decimal
place to the measurement without limit. Now say you want to know the probability
that the height of the oak tree is exactly 15.2 ft. Because the height of the tree could be any decimal value in the range of 15 ft to 16 ft, the probability that the tree is exactly
any single value is essentially zero. In this example you’ll need to use
a continuous probability distribution to tell you the probability that the height
of the oak tree is in a certain range or interval, such as between 15 ft and 16 ft. The probability of any
specific value is zero, so it only makes sense to talk about
the probabilities of intervals. A convenient way to show
the probabilities of a range or interval of values is with a curve. On a graph,
continuous distributions appear as curves. You may have heard of the bell curve,
which refers to the graph for a continuous distribution
called the normal distribution. On the curve the X axis refers to the
value of the variable you’re measuring, in this case oak tree height. The Y axis refers to something
called probability density. This is a math function that deals
with the values of intervals. You don’t need to focus on
the math part right now, just know that probability density is
not the same thing as probability. There’s a lot more to learn about
probability distributions and how they can help you model
different kinds of data. These topics are complex, so feel free to revisit the video
to go over this part again.
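
To make the die-roll example concrete, here is a minimal Python sketch (matplotlib is introduced later in this course) that plots the discrete distribution described above:

import matplotlib.pyplot as plt

outcomes = [1, 2, 3, 4, 5, 6]    # sample space of a single die roll
probabilities = [1 / 6] * 6      # each outcome is equally likely, about 16.7%

plt.bar(outcomes, probabilities)
plt.xlabel('Outcome of a single die roll')
plt.ylabel('Probability')
plt.show()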

Video: The binomial distribution

What is the Binomial Distribution?

  • Models the probability of events with two outcomes: success or failure.
  • Used in fields like medicine, finance, and machine learning.

Key Requirements for a Binomial Experiment

  1. Fixed Number of Trials: The experiment is repeated a set number of times (e.g., 10 coin flips).
  2. Two Outcomes: Each trial has only ‘success’ or ‘failure’ (the labels are up to you).
  3. Consistent Success Probability: The chance of ‘success’ is the same for every trial.
  4. Independence: One trial’s outcome doesn’t affect the others.

Examples

  • Coin flips
  • Percentage of customers making a return
  • Machine learning image classification (cat or not cat)

Why it Matters

  • If your data fits a binomial experiment, you can use the binomial distribution formula to calculate probabilities.
  • Example: What’s the probability of getting 2 ‘heads’ out of 10 coin flips?
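
A quick way to answer that example question is SciPy’s binom.pmf function (SciPy is introduced later in this course); a minimal sketch:

from scipy import stats

# P(X = 2): exactly 2 heads in n = 10 fair coin flips (p = 0.5)
print(stats.binom.pmf(k=2, n=10, p=0.5))  # about 0.044, or 4.4%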

Key Takeaway: Understanding the binomial distribution helps you model and predict outcomes in a wide range of situations.

Fill in the blank: The binomial distribution models the probability of events with _____ possible outcomes.

two

The binomial distribution models the probability of events with two possible outcomes.

Recently, you learned about discrete probability
distributions, which represent
discrete random events like tossing a coin
or rolling a die. Often, the outcomes of
discrete events are expressed as whole numbers
that can be counted. For example, the number of times a coin lands on
heads in 10 tosses. In this video, you’ll
learn about one of the most widely used discrete
probability distributions, the binomial distribution. The binomial distribution is a discrete distribution
that models the probability of events with only
two possible outcomes, success or failure. This definition assumes
that each event is independent or does not affect the probability
of the others, and that the probability of success is the same
for each event. For example, the binomial
distribution applies to an event like tossing the
same coin 10 times in a row. Keep in mind that
success and failure are labels used for convenience. For example, each toss has only two possible
outcomes, heads or tails. You could choose to label
either heads or tails as a successful outcome based on the needs
of your analysis. Whatever label you
apply to the outcomes, it’s important to know that they must be mutually exclusive. As a quick refresher, two outcomes are mutually exclusive if they cannot
occur at the same time. You can’t get both heads and
tails in a single coin toss. It’s either one or the other. Data professionals use the
binomial distribution to model data in different
fields such as medicine, banking, investing,
and machine learning. For example, data professionals use the binomial distribution to model the probability that a new medication
generates side effects, a credit card transaction
is fraudulent, or a stock price rises
or falls in value. In machine learning, the
binomial distribution is often used to classify data. For example, a data professional may
train an algorithm to recognize whether
a digital image of an animal is or is not a cat. The binomial distribution
represents a type of random event called a
binomial experiment. A binomial experiment is a
type of random experiment. You may recall that a
random experiment is a process whose outcome cannot be predicted
with certainty. All random experiments have
three things in common. The experiment can have more
than one possible outcome. You can represent each
possible outcome in advance, and the outcome of the
experiment depends on chance. On the other hand, a
binomial experiment has the following attributes. The experiment consists of a
number of repeated trials. Each trial has only
two possible outcomes. The probability of success
is the same for each trial, and each trial is independent. An example of a
binomial experiment is tossing a coin
10 times in a row. This is a binomial experiment because it has the
following features. The experiment consists of 10 repeated trials
or coin tosses. Each trial has only two possible outcomes, heads or tails. The probability of success
is the same for each trial. If you define success as heads, then the probability
of success for each toss is the
same, 50 percent. Each trial is independent. The outcome of one
coin toss does not affect the outcome of
any other coin toss. Let’s check out another example
of a binomial experiment. Suppose you want to
know how many customers return an item to a department
store on a given day. Say 100 customers visit
the store each day. Ten percent of all customers who visit the store make a return. You label a return as a success. This is a binomial
experiment because there are 100 repeated trials
or customer visits. Each trial only has
two possible outcomes, return or not return. If you label return a success, the probability of success for each customer visit is
the same, 10 percent. Each trial is independent. The outcome of one customer
visit does not affect the outcome of any
other customer visit. It’s important to understand the features of a binomial
experiment because the binomial distribution could only model data for
this type of event. If you’re working with data for a different type of event, you need to use a different type of probability distribution, like the Poisson
to model the data. Once you’ve determined that your distribution is binomial, you can apply the binomial
distribution formula to calculate the probability. No need to memorize it. You can use your computer
to make the calculations. If you want to learn more, feel free to check out the
relevant reading. In brief, the binomial distribution formula
helps you determine the probability of getting
a certain number of successful outcomes in a
certain number of trials. For example, getting
a certain number of heads in a certain
number of coin flips. In this formula, k refers
to the number of successes, n refers to the
number of trials, p refers to the probability
of success on a given trial, and n choose k refers
to the number of ways to obtain k
successes in n trials. Let’s explore our
department store example to better understand
how the formula works. This time, suppose 10 percent of all customers who visit
the store make a return. Imagine that three
customers visit the store. You label a return as a success. You can use the formula to determine the probability
of getting 0, 1, 2, and 3 returns among
the three customers. In the calculation, X refers
to the number of returns. I’ll skip the calculations and go directly to the results. If you plug in for
the probability that X equals 0 returns, the result is 0.729. For the probability
that X equals 1 return, the result is 0.243. For the probability that
X equals 2 returns, the result is 0.027. For the probability that
X equals 3 returns, the result is 0.001. You can then use a histogram to visualize this
probability distribution. For a discrete
probability distribution, like the binomial distribution, the random variable is
plotted along the x-axis and the corresponding probability is plotted along the y-axis. In this case, the x-axis
shows the number of returns: 0, 1, 2, and 3. The y-axis shows the probability
of getting each result. The binomial distribution lets you model the probability of events with only two
possible outcomes, success or failure. Identifying the distribution of your data is a key step in any analysis and helps you make informed predictions
about future outcomes.
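
To check the numbers from the department store example, here is a minimal sketch using SciPy’s binom.pmf function, which implements the binomial formula described above:

from scipy import stats

# Binomial formula: P(X = k) = (n choose k) * p**k * (1 - p)**(n - k)
# Probability of 0, 1, 2, or 3 returns among three customers,
# where each customer has a 10% chance of making a return
for k in range(4):
    print(k, stats.binom.pmf(k, n=3, p=0.1))
# prints about 0.729, 0.243, 0.027, and 0.001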

Video: The Poisson distribution

The Poisson Distribution

  • Models the probability of a specific number of events happening within a defined time period.
  • Examples:
    • Customer calls per hour at a call center
    • Website visitors per hour
    • Severe storms per month in a city

Prerequisites for a Poisson Experiment

  1. Events can be counted
  2. You know the average number of events in a set time period.
  3. Events are independent (one doesn’t affect another’s chance of happening).

Example: Fast Food Orders

  • A restaurant averages 2 drive-through orders per minute.
  • The Poisson distribution can calculate the probability of getting 0, 1, 2, 3… orders in a given minute.
  • This helps with staff planning.

Formula: (It gets a little mathy; the key is understanding the concept!)
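
P(X = k) = (λ^k × e^(−λ)) / k!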

  • Lambda (λ) = Average number of events in the time period
  • k = Number of events you’re interested in
  • e = Mathematical constant (~2.71828)
  • ! = Factorial function
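
A minimal sketch of this calculation with SciPy’s poisson.pmf function, using the fast food example above:

from scipy import stats

# Poisson formula: P(X = k) = lambda**k * exp(-lambda) / k!
# Probability of 0, 1, 2, or 3 drive-through orders in a given minute,
# with an average (lambda) of 2 orders per minute
for k in range(4):
    print(k, stats.poisson.pmf(k, mu=2))
# prints about 0.1353, 0.2707, 0.2707, and 0.1805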

Poisson vs. Binomial

  • Poisson: You know the average rate of events per time, and want to find the probability of a specific number of events within that time.
  • Binomial: You know the exact probability of a single event and want to find the probability of it happening a certain number of times in repeated trials.

The Poisson distribution can model the probability that a certain number of events will occur during a specific time period.

True

The Poisson distribution can model the probability that a certain number of events will occur during a specific time period.

As a data professional, knowing about probability
distributions is super useful because different types of distributions help you model different
kinds of data. Every time I work
with a new dataset, I try to understand if there is a pattern present in
the distribution of the data. Knowing the probability
distribution of my data also helps me choose the machine learning model that works best. This way, I’m able to get a
better result in less time. Data professionals work with many different types of
probability distributions. As you advance in your career
and continue to learn, you can explore
different distributions and discover how they
apply to your work. In this part of the course, we’re focusing on two of the most common discrete
probability distributions, the Binomial and the Poisson. Earlier, you learned that the binomial distribution
represents experiments with repeated trials that each have two possible outcomes,
success or failure. In this video,
you’ll learn about the main features of the
Poisson distribution. The Poisson distribution is a probability
distribution that models the probability that
a certain number of events will occur during
a specific time period. The Poisson distribution
can also be used to represent the number
of events that occur in a specific space, such as a distance,
area, or volume, but in this course
we’ll focus on time. Baron Simeon Denis Poisson, a French mathematician, originally derived the Poisson distribution in 1830. He developed the distribution to describe the number of times a gambler would win a difficult game of chance in a large
number of tries. Data professionals use the Poisson distribution
to model data, such as the expected number of calls per hour for a customer
service call center, visitors per hour for a website, customers per day
at a restaurant, and severe storms
per month in a city. The Poisson distribution
represents a type of random experiment called
a Poisson experiment. A Poisson experiment has
the following attributes: the number of events in the experiment can be counted; the mean number of events that occur during a specific time period is known; and each event
is independent. Let’s explore an example. Imagine you’re a data
professional working for a large restaurant chain
that serves fast food. You know that the drive-through
service at a restaurant receives an average of
two orders per minute. You want to determine the
probability that a restaurant will receive a certain number of orders in a given minute. This is a Poisson experiment because the number of events in the experiment
can be counted. You can count the
number of orders. The mean number of events that occur during a specific
time period is known. There is an average of
two orders per minute, each outcome is independent. The probability of one person
placing an order does not affect the probability of another person placing an order. Once you know that
you’re working with the Poisson distribution, you can apply the Poisson
distribution formula to calculate the probability. In brief, the formula helps
you determine the probability of a certain number of events occurring during a
specific time period. In this formula, the Greek
letter lambda refers to the mean number of events that occur during a
specific time period. k refers to the
number of events. e is a constant equal to
approximately 2.71828. The exclamation point
stands for factorial, a function that
multiplies a number by every whole number
below it down to one. For example, two factorial
is two multiplied by one. Let’s continue with our
restaurant chain example to better understand
how the formula works. Recall that the
drive-through service at a restaurant receives an average of two orders per minute. You can use the Poisson
formula to determine the probability of the
restaurant receiving 0, 1, 2 or 3 orders
in a given minute. Knowing this
information may help the restaurant organize
staffing for the drive-through. I’ll skip the calculations and go directly to the results. If you plug in for
the probability that X equals 0 orders, the result is 0.1353. For the probability that
X equals one order, the result is 0.2707. For the probability that
X equals two orders, again, the result is 0.2707. For the probability that
X equals three orders, the result is 0.1805. You can then use a histogram to visualize the probability
distribution. The x-axis shows the
number of events, in this case, orders per minute. The y-axis shows the
probability of occurrence. For example, the probability of getting 0 orders
in a minute is about 0.1353 or 13.5 percent. The probability of one order
is 0.2707 or 27.07 percent. The probability of
two orders is also 0.2707 or 27.07 percent. The probability of
three orders is 0.1805 or 18.05 percent. Before we finish up, let’s compare the two discrete
probability distributions you recently learned about, the binomial and the Poisson. Sometimes it can be challenging to figure
out if you should use a binomial distribution
or a Poisson distribution. To help you choose
between the two, you can use the following
general guidelines. Use the Poisson distribution
if you are given the average number of times an event happens for a specific time period, and you want to find out
the probability of a certain number of events
happening in that time period. For example, if a call center averages 10 customer service
calls per hour, you can use the
Poisson distribution to find the probability of getting 12 calls between
2:00 P.M. and 3:00 P.M. Use the binomial
distribution if you are given the exact probability
of an event happening, and you want to find out the
probability of the event happening a certain number of
times in a repeated trial. For example, if the
probability of getting heads for any coin
toss is 50 percent, you can use the binomial
distribution to find the probability of getting
8 heads in 10 coin tosses. That’s all for discrete
probability distributions. In your future career
as a data professional, you’ll use discrete
distributions like the binomial
and the Poisson to better understand
your data and make informed predictions
about future outcomes.
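
To see the difference in code, here is a minimal sketch of the two guideline examples above, using SciPy:

from scipy import stats

# Poisson: you know the average rate (10 calls per hour) and want
# the probability of exactly 12 calls in that hour
print(stats.poisson.pmf(12, mu=10))     # about 0.095

# Binomial: you know the per-trial probability (50% heads) and want
# the probability of exactly 8 heads in 10 tosses
print(stats.binom.pmf(8, n=10, p=0.5))  # about 0.044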

Reading: Discrete probability distributions

Practice Quiz: Test your knowledge: Discrete probability distributions

Which of the following statements describe continuous random variables? Select all that apply.

What probability distribution represents experiments with repeated trials that each have two possible outcomes: success or failure?

Continuous probability distributions


Video: The normal distribution

From Discrete to Continuous

  • Discrete: Outcomes are whole numbers (like rolling 2 or 3 on a die).
  • Continuous: Outcomes can take on any decimal value within a range (like time, height, or temperature).

Introducing the Normal Distribution (a.k.a the Bell Curve)

  • Key Features:
    • Bell-shaped curve
    • Symmetrical around the mean (center)
    • Total area under the curve equals 1 (100% of possible outcomes)
  • Why It Matters: Many real-world datasets follow this pattern (e.g., test scores, heights, salaries), making it essential in statistics and machine learning.

Example: Honeycrisp Apples

  • Assume weights are normally distributed with a mean of 100 grams and a standard deviation of 15 grams.
  • Key Points Illustrated:
    • Mean is at the peak of the curve (most likely weight).
    • Symmetry: 50% of apples are heavier than the mean, 50% are lighter.
    • Farther from the mean = less likely weights.

Standard Deviations and the Empirical Rule

  • Standard Deviation: Measures how spread out the data is around the mean.
  • Empirical Rule: For normal distributions…
    • 68% of data is within 1 standard deviation of the mean.
    • 95% is within 2 standard deviations.
    • 99.7% is within 3 standard deviations.
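
These percentages come from the normal curve itself, so you can verify them with SciPy’s norm.cdf function; a minimal sketch:

from scipy import stats

# Share of a normal distribution within 1, 2, and 3 standard
# deviations of the mean
for n_sd in (1, 2, 3):
    print(n_sd, stats.norm.cdf(n_sd) - stats.norm.cdf(-n_sd))
# prints about 0.6827, 0.9545, and 0.9973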

Applications of the Empirical Rule

  • Estimating Data: Quickly understand how your data is distributed.
  • Outlier Detection: Values beyond 3 standard deviations might be errors.

Example: Plant Heights

  • Knowing the normal distribution and the empirical rule helps you determine the percentage of plants meeting your landscape design criteria.

Key Takeaway: The normal distribution is a powerful tool for understanding and analyzing continuous data throughout various fields.

What shape is the graph of a normal distribution?

Bell-shaped

The normal distribution is a continuous probability distribution that is symmetrical on both sides of the mean and bell-shaped. It is often called the bell curve because its graph has the shape of a bell, with a peak at the center and two downward sloping sides.

So far, we’ve been talking about discrete
probability distributions where the outcomes
of experiments are represented by countable
whole numbers. For example, rolling a die
can result in a two or a three, but not decimal values
such as 2.178 or 3.394. Now, we’ll move from discrete to continuous probability
distributions. Recall that continuous
probability deals with outcomes that can take on all the
values in a range of numbers. Typically, these are decimal values that can be
measured such as: height, weight, time, or temperature. For example, you can keep
on measuring time with more accuracy: 1.1 seconds, 1.12 seconds, 1.1257
seconds, and so on. In this video, we’ll explore the most widely used
probability distribution in statistics, the
normal distribution. The normal distribution is a continuous probability
distribution that is symmetrical on both sides of the
mean and bell-shaped. The normal distribution
is often called the bell curve because its
graph has the shape of a bell with a peak at the center and two
downward-sloping sides. It is also known as the
Gaussian distribution after the German mathematician
Carl Gauss who first described the formula
for the distribution. While we’re on the
subject of formulas, if you want to learn
more about this formula, please check out the
relevant reading where it’s discussed
in more detail. The normal distribution is the most common probability
distribution in statistics because so many
different kinds of data sets display a
bell-shaped curve. For example, if you
randomly sample 100 people you’ll discover a normal distribution curve
for continuous variables, such as: height, weight, blood pressure, IQ scores,
salaries, and more. For example, think of the typical results of
standardized tests. The majority of
people will score close to the average
score or mean. Fewer numbers of people
will score below or above average farther
out from the mean. A very small percentage
of people will score extremely high or extremely low, very far away from the mean. This distribution of scores
generates a bell curve. Most of the data values are
relatively close to the mean. The farther a value is
away from the mean, the less likely it is to occur. On a normal curve the x-axis refers to the value of
the variable you’re measuring and the y-axis refers to how likely you are
to observe the value. In the case of test
scores, the x-axis is the raw score and the y-axis is the percentage of the population
that gets that score. Data professionals use the
normal distribution to model all kinds of different data sets in
the fields of business, science, government, machine
learning, and others. Understanding the
normal distribution is also important for more advanced statistical
methods such as hypothesis testing and
regression analysis which you’ll learn
more about later. Plus many machine
learning algorithms assume that data is
normally distributed. All normal distributions have the following features:
the shape is a bell curve, the mean is located at
the center of the curve, the curve is symmetrical on
both sides of the center, and the total area under
the curve equals 1. To clarify the features of
the normal distribution, let’s graph the weights of honeycrisp apples. Assume that the weights of honeycrisp apples are approximately normally
distributed with a mean of 100 grams and a standard
deviation of 15 grams. First, find the mean at
the center of the curve. This is also the highest
point or peak of the curve. This data point represents the most probable
outcome in the data set, mean weight of 100 grams. Second, notice that the curve is symmetrical on each
side of the mean. Fifty percent of the
data is above the mean, and 50 percent is
below the mean. Third, the farther a point
is away from the mean, the lower the probability
of those outcomes. The points farthest
from the mean represent the least probable
outcomes in the data set. These are apples that have
more extreme weights, either low or high. Finally, the area under
the curve is equal to 1. This means that the area
under the curve accounts for 100 percent of the possible
outcomes in the distribution. On a normal distribution, the distance of a data
point from the mean is often measured in
standard deviations. As a refresher, the standard
deviation calculates the typical distance of a data point from the
mean of your data set. While the mean refers to
the center of your data, the standard deviation
measures spread. As standard deviations
become larger, data values become more
spread out from the mean. In our apple example,
the mean weight is 100 grams and the standard
deviation is 15 grams, so an apple that is
one standard deviation above the mean will
weigh 115 grams: the mean
weight of 100 grams plus the standard
deviation of 15 grams. An apple that is one
standard deviation below the mean will weigh 85 grams or 100 grams
minus 15 grams. An apple that’s two
standard deviations above the mean will
weigh 130 grams, and an apple that’s two
standard deviations below the mean will
weigh 70 grams. The values on a normal
curve are distributed in a regular pattern based on
their distance from the mean. This is known as
the empirical rule. It states that for a given data set with a normal distribution, 68 percent of values fall within one standard
deviation of the mean, 95 percent of values fall within two standard
deviations of the mean, and 99.7 percent of values fall within three
standard deviations of the mean. The empirical rule can give you a clear idea of
how the values in your data set are
distributed which helps you save time and better
understand your data. Let’s continue with
our apple example. The empirical rule
tells you that most apples, or 68 percent, will fall within one
standard deviation of the mean weight of 100 grams. This means that 68 percent
of apples will weigh between 85 grams, which is one standard deviation
below the mean, and 115 grams, one standard deviation
above the mean. Ninety-five percent
of apples will weigh between 70 grams and 130 grams or within two standard
deviations from the mean. Almost all apples, or 99.7 percent, will weigh between 55 grams and 145 grams, or within three standard
deviations of the mean. The empirical rule is useful for estimating data, especially for large data sets like height and weight data for an
entire population. You can use the
empirical rule to get an initial estimate of
the distribution of values in your data set such as what percentage of values
will fall within one, two, or three standard
deviations of the mean. This saves time and helps
you better understand your data. Plus, knowing the location of your values on a normal distribution is
useful for detecting outliers. Recall that an outlier
is a value that differs significantly from
the rest of the data. Typically, data professionals
consider values that lie more than three
standard deviations below or above the
mean to be outliers. It’s important to
identify outliers because some extreme values
may be due to errors in data collection
or data processing. These false values may skew
the results of your analysis. Let’s explore another
example of how the empirical rule can help you better
understand your data. Imagine you have a garden. The height of your plants is normally distributed
with a mean of 32.1 inches and a standard
deviation of 2.2 inches. Let’s say you want to find
out what percentage of plants are greater than
29.9 inches tall. You want your plants to be
at least this tall as part of your landscape design
plan for your backyard. First, find out where the value 29.9 is located on
the distribution. Twenty-nine point nine is
located one standard deviation below the mean. The empirical rule tells
you that 68 percent of values fall within one standard
deviation of the mean. Half of these values or 34
percent fall below the mean. Now you know that 34 percent
of values are between 29.9 and the mean, because 29.9 is one standard deviation below the mean. Plus, 50 percent of all values in a normal distribution fall above the mean, or center of the curve. The sum of these two
percentages will tell you the overall percentage of
values greater than 29.9. 34 percent plus 50 percent
equals 84 percent, so 84 percent of your plants are greater than 29.9 inches tall. The empirical rule
helps you quickly understand the overall
distribution of your data values. Now you know that most
of your plants are tall enough for your
landscape design plan. As a future data professional, you use the normal
distribution to identify significant patterns in a
wide variety of data sets.
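
SciPy can confirm this plant-height estimate directly; a minimal sketch using the norm.sf (survival function) method, which gives P(X > x):

from scipy import stats

# Share of plants taller than 29.9 inches, with heights normally
# distributed around a mean of 32.1 inches (standard deviation 2.2)
print(stats.norm.sf(29.9, loc=32.1, scale=2.2))  # about 0.84, or 84 percent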

Reading: Model data with the normal distribution

Categorize: Probability distributions

Reading

Video: Standardize data using z-scores

What is a Z-Score?

  • Definition: A z-score tells you how many standard deviations a specific data point is away from the mean of a normally distributed dataset.
  • Significance: It standardizes data, allowing you to compare values from different datasets that might have different scales or units.

Interpreting Z-Scores

  • Z = 0: The data point is equal to the mean.
  • Z > 0: The data point is above the mean.
  • Z < 0: The data point is below the mean.

Why Z-Scores are Useful

  1. Comparing Datasets: Even with different scales, z-scores let you compare how individual points perform across different datasets.
  2. Anomaly Detection: Z-scores help find unusual data points (outliers) that might signal fraud, errors, etc.

Z-Score Formula

Z = (x - μ) / σ 

Where:

  • x = Data point (raw score)
  • μ = Population mean
  • σ = Population standard deviation

Example

You score 133 on a test with a mean of 100 and a standard deviation of 15. Your z-score is:

(133 – 100) / 15 = 2.2

This means your score is 2.2 standard deviations above average (a great score!)
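
The arithmetic is simple enough to check directly in Python:

# z = (x - mu) / sigma for the test-score example above
x, mu, sigma = 133, 100, 15
print((x - mu) / sigma)  # 2.2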

Key Points

  • Z-scores are typically used with normally distributed data.
  • Data analysts often use programming languages (like Python) to compute z-scores for large datasets.

What is the z-score of a data value equal to the mean?

0

The z-score is 0 if the data value is equal to the mean. A z-score is a measure of how many standard deviations below or above the population mean a data point is.

Recently, you learned about
the normal distribution and how it applies to many
different kinds of data sets. In this video, you’ll learn about z-scores
and how they can help you compare values from different types of
normally distributed data sets. A z-score is a measure of how
many standard deviations below or above the population mean a data point is. A z-score gives you an idea of how
far from the mean a data point is. For example, the z-score is 0,
if the value is equal to the mean. The z-score is positive if
the value is greater than the mean. The z-score is negative if
the value is less than the mean. Z-scores help you standardize your data. In statistics, standardization is the process of putting
different variables on the same scale. There is a formula for
this, which we’ll check out a little later. Z-scores are also called standard
scores because they’re based on what’s called the standard
normal distribution. A standard normal distribution is just
a normal distribution with a mean of 0 and a standard deviation of 1. Z-scores typically range from -3 to +3. Standardization is useful because it
lets you compare scores from different data sets that may have different units,
mean values and standard deviations. Data professionals use z-scores to better
understand the relationship between data values within a single dataset and
between different data sets. For example, data professionals often
use z-scores for anomaly detection, which finds outliers in datasets. Applications of anomaly detection include
finding fraud in financial transactions, flaws in manufacturing products,
intrusions in computer networks and more. For example, different customer satisfaction surveys
may have different rating scales. One survey could rate a product or
service from 1 to 20, another from 500 to 1,500,
and a third from 130 to 180. Let’s say the same product got
a score of 9 on the first survey, 850 on the second and 142 on the third. These numbers don’t mean much
by themselves, but if you know they all have a z-score of 1, or
one standard deviation above the mean, you can meaningfully compare
ratings across surveys. A z-score for an individual value
can be interpreted as follows. A z-score of 1 is one standard
deviation above the mean. A z-score of 1.5 is 1.5 standard
deviations above the mean. A z-score of -2.3 is 2.3 standard
deviations below the mean. You can use the following
formula to calculate a z-score, Z equals x minus mu divided by sigma. In this formula, x refers to
a single data value or raw score. The Greek letter mu refers
to population mean. The Greek letter sigma refers to
the population standard deviation. So we can also say that Z
equals the raw score or data value minus the mean divided
by the standard deviation. For example,
let’s say you take a standardized test and score 133. The test has a mean score of 100 and
a standard deviation of 15. Assuming a normal distribution, you can
use the formula to calculate your z-score. Your z-score is your raw score,
133, minus the mean score, 100, divided by the standard deviation, 15. This is 133 minus 100, or 33, divided by 15, which equals 2.2. Your z-score of 2.2 tells you that your
test score is 2.2 standard deviations above the mean or average score. That’s a really good score. Recall that the empirical rule tells
you that 95% of values fall within two standard deviations of the mean. Your score of 2.2 is more than two
standard deviations above the mean. Z-scores are useful because
they give us an idea of how an individual value compares
to the rest of the distribution. Let’s take a different exam
with a different grading scale. Say you score an 85, and you want to find out if that’s a good
score relative to the rest of the class. Whether or not it’s a good
score depends on the mean and standard deviation of all exam scores. Suppose the exam scores are normally
distributed with a mean score of 90 and a standard deviation of 4, you can use the formula to calculate
the z-score of a raw score of 85. Your z-score is your raw score, 85, minus the mean score, 90, divided by the standard deviation, 4. This is 85 minus 90, or -5, divided by 4, which equals -1.25. Your z-score of -1.25 tells you
that your exam score of 85 is 1.25 standard deviations below the mean or
average exam score. Z-scores give you an idea of how
individual values compare to the mean. As a data professional, you’ll use
z-scores to help you better understand the relationship between
specific values in your data set. You’ll most likely use a programming
language like Python to calculate z-scores on your computer, as you’ll
learn in an upcoming video.
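
As a preview, SciPy’s stats.zscore function standardizes a whole column of values at once; a minimal sketch with hypothetical exam scores:

import numpy as np
from scipy import stats

scores = np.array([85, 88, 90, 92, 95])  # five hypothetical exam scores
print(stats.zscore(scores))              # about [-1.47 -0.59 0. 0.59 1.47]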

Practice Quiz: Test your knowledge: Continuous probability distributions

The normal distribution has which of the following features? Select all that apply.

What does the empirical rule state?

A data value is 2 standard deviations above the mean. What is its z-score?

Probability distributions with Python


Lab: Annotated follow-along guide: Work with probability distributions in Python

Reading

Video: Work with probability distributions in Python

Background

  • The video teaches how to model data with distributions (like the normal distribution) and find outliers using Z-scores.
  • Scenario: You’re analyzing district literacy rates for a Department of Education.

Key Python Libraries

  • NumPy, pandas, matplotlib.pyplot: Essential data analysis tools.
  • SciPy stats: Specifically designed for statistical work.
  • statsmodels: Provides statistical modeling and testing functions.

Analyzing Your Data

  1. Visualize with Histograms: Plotting a histogram helps you see the shape of data, suggesting what type of distribution might fit.
  2. Check the Empirical Rule (Normal Distribution):
    • The data’s histogram was bell-shaped, indicating a normal distribution.
    • The empirical rule states that for a normal distribution:
      • 68% of data falls within 1 standard deviation of the mean.
      • 95% within 2 standard deviations.
      • 99.7% within 3 standard deviations.
    • The Python calculations closely matched these percentages, confirming the normal distribution.
  3. Why this Matters: Many statistical tests assume a normal distribution.
  4. Calculating Z-Scores
    • Z-Score shows how many standard deviations a data point is from the mean.
    • Python’s stats.zscore function makes the calculation easy.
  5. Detecting Outliers
    • Z-scores outside of +/- 3 standard deviations are often considered outliers.
    • In this example, two districts were discovered with unusually low literacy rates, meriting further investigation.

Key Takeaway: Python’s statistical libraries equip you to analyze data distributions effectively, guiding your analytical choices and pinpointing unusual data.

When I deal with a new dataset, I first go through the
process of EDA and compute descriptive stats to get a basic understanding
of my data. After that, I try to
determine if my data fits a certain type of
probability distribution, like the binomial, Poisson, and normal distributions
you recently learned about. Knowing the distribution of
my data helps me decide what statistical test or
machine learning model will work best for my analysis. Python has a great selection of function libraries
for data analysis. Using Python to work
with distributions saves time and improves the overall efficiency
of my analysis. In this video,
you’ll use Python to model your data with the
normal distribution. Then you’ll compute Z-scores to find any outliers
in your data. We’ll continue with our previous scenario
in which you’re a data professional working for
the Department of Education of a large nation. Recall that you’re
analyzing data on the literacy rate
for each district, and you’ve already computed descriptive statistics
to summarize your data. You’ll continue to use the
dataset you worked with before. If you need to access
the data, do so now. Along with pandas, NumPy,
and matplotlib.pyplot, you’ll use two Python packages
that may be new to you: SciPy stats and statsmodels. SciPy is open-source software you can use for
solving mathematical, scientific, engineering,
and technical problems. It allows you to
manipulate and visualize data with a wide range
of Python commands. SciPy stats is a module designed specifically
for statistics. Statsmodels is a Python package that lets you explore data, work with statistical models, and perform statistical tests. It includes an extensive list of stats functions for
different types of data. Now that you know more about the packages you’ll
be working with, let’s open up a Jupyter
Notebook and load them up. To start, import the Python
packages you will use: NumPy, pandas, and
statsmodels.api and the library you will
use matplotlib.pyplot. To save time, rename each package and library with an abbreviation: np, pd, plt, and sm. To load the SciPy stats module, write: from scipy import stats.
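
Put together, the import cell described above might look like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
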
For the next part of your analysis, you want to find
out if the data on literacy rate fits a specific type of
probability distribution. The first step in trying
to model your data with a probability distribution
is to plot a histogram. This will help you visualize
the shape of your data and determine if it
resembles the shape of a specific distribution. Use matplotlib’s histogram
function to plot a histogram of the district
literacy rate data. Recall that the overall_li
column contains this data. The x-axis of your
plot refers to the literacy rate
of each district, and the y-axis refers to count or to the
number of districts.
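
Assuming the data is loaded in a DataFrame named education_districtwise (a hypothetical name; only the column name overall_li comes from the video), the plot might look like this:

education_districtwise['overall_li'].hist()
plt.xlabel('District literacy rate (%)')
plt.ylabel('Number of districts')
plt.show()
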
The histogram shows that the distribution of your literacy rate data is bell-shaped and symmetric
about the mean. Recall that the normal
distribution is a continuous probability
distribution that is bell-shaped and symmetrical
on both sides of the mean. The mean literacy rate, which is around 73 percent, is located in the
center of the plot. The shape of your
histogram suggests that the normal distribution might be a good modeling
option for your data. To verify that your data
is normally distributed, you can use Python
to find out if your data follows
the empirical rule. Recall that the empirical rule says that for every
normal distribution, about 68 percent of values fall within one standard
deviation of the mean, 95 percent fall within two standard deviations
of the mean, and 99.7 percent fall within three standard
deviations of the mean. Since the normal
distribution seems like a good fit for the district
literacy rate data, you can expect the
empirical rule to apply relatively well. In other words, you
can expect that about 68 percent of literacy rates will fall within one standard deviation
of the mean, 95 percent will fall within
two standard deviations, and 99.7 percent will fall within three
standard deviations. First, name two new variables
to store the values for the mean and
standard deviation of the district literacy rate. Name your first variable
mean_overall_li and compute the mean. Display the value
of your variable. The mean district literacy
rate is about 73.4 percent. Name your second variable std_overall_li and compute the standard deviation and display the value
of your variable. The standard deviation is about 10 percent.
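
A minimal sketch of these two steps (the DataFrame name is hypothetical):

mean_overall_li = education_districtwise['overall_li'].mean()
print(mean_overall_li)  # about 73.4

std_overall_li = education_districtwise['overall_li'].std()
print(std_overall_li)   # about 10
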
If your data follows the empirical rule, you can expect
roughly 68 percent of your values to fall within one standard deviation of the mean district literacy
rate of 73 percent. One standard deviation below
the mean is 63 percent, or 73 minus 1 times 10. One standard deviation above
the mean is 83 percent, or 73 plus 1 times 10. So you can expect roughly
68 percent of your values to fall within this
range of 63-83 percent. Now compute the
actual percentage of district literacy
rates that fall within one standard
deviation of the mean. To do this, first name
two new variables, lower_limit and upper_limit. The lower limit will be one standard deviation
below the mean, or the mean minus 1 times
the standard deviation. The upper limit will be one standard deviation
above the mean, or the mean plus 1 times
the standard deviation. To write the code for
the calculations, use your two previous variables mean_overall_li
and std_overall_li for the mean and
standard deviation. Next, add a line of code that tells the
computer to decide if each value in the
overall literacy column is between the lower
limit and upper limit. In other words, to decide if each value is greater
than or equal to one standard deviation
below the mean and less than or equal to one standard deviation
above the mean. To do this, use the relational operators
greater than or equal to and less than or equal to and the bitwise operator AND. Finally, use the
mean function to divide the number of
values that are within one standard deviation
of the mean by the total number of
values and run the code.
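
A minimal sketch of that calculation (the DataFrame name is hypothetical):

lower_limit = mean_overall_li - 1 * std_overall_li
upper_limit = mean_overall_li + 1 * std_overall_li
within_one_sd = ((education_districtwise['overall_li'] >= lower_limit) &
                 (education_districtwise['overall_li'] <= upper_limit)).mean()
print(within_one_sd)  # about 0.664
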
The output shows you that about 0.664 or
district literacy rates fall within one standard
deviation of the mean. This is very close
to the roughly 68 percent that the
empirical rule suggests. You can use the
same code structure to determine how many values of your literacy rate
values fall within two and three standard
deviations of the mean. Just multiply the
standard deviation by two or three instead of one. About 0.954 or 95.4 percent of your values fall within two standard
deviations of the mean, and about 0.996 or 99.6
percent of your values fall within three standard
deviations of the mean. Here, values of 66.4, 95.4, and 99.6 percent are very close to what the
empirical rule suggests, roughly 68, 95,
and 99.7 percent. At this point, it’s safe to say your data follows a
normal distribution. Knowing that your
data is normally distributed is useful
for analysis because many statistical tests and machine learning models
assume a normal distribution. Plus, when your data follows
a normal distribution, you can use Z-scores to measure the relative position
of your values and find outliers in your data. Let’s explore how to calculate
Z-scores in Python now. Recall that a Z-score
is a measure of how many standard
deviations below or above the population
mean a data point is. A Z-score is useful
because it tells you where a value lies
in a distribution. For example, if I tell you a literacy rate is 80 percent, this doesn’t give you
much information about where the value lies
in the distribution. However, if I tell you the literacy rate has
a Z-score of two, then you know that the value is two standard deviations
above the mean. Data professionals often use Z-scores for outlier detection. Typically, they
consider observations with a Z-score smaller than -3 or larger than +3 as outliers. These are values that lie more than three
standard deviations below or above the mean. To find outliers in your data, first create a new
column called Z_SCORE that includes the Z-scores for each district literacy
rate in your dataset. Then compute the Z-scores with
the function stats.zscore.
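
A minimal sketch (the DataFrame name is hypothetical):

education_districtwise['Z_SCORE'] = stats.zscore(
    education_districtwise['overall_li'])
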
Python takes care of all the calculations. Now, write some code to identify outliers, or districts
with Z-scores that are greater than +3 or less than -3, that is, more than three standard deviations from the mean. Use the relational operators greater than and less than and the bitwise operator OR.
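
A minimal sketch (the DataFrame name is hypothetical):

outliers = education_districtwise[
    (education_districtwise['Z_SCORE'] > 3) |
    (education_districtwise['Z_SCORE'] < -3)]
print(outliers)
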
Using Z-scores, you identify two outlying districts: District 461 and District 429. The literacy rates in
these two districts are more than three
standard deviations below the overall mean, which means they have
unusually low rates. Your analysis gives you important
information to share. The government may want to provide more funding
and resources to these two districts in the hopes of significantly
improving literacy. Probability
distributions are useful for modeling your data and help you determine
which statistical test to use for an analysis. In addition to the
normal distribution, Python can help you work with a wide range of
probability distributions.

Lab: Activity: Explore probability distributions

Reading

Lab: Exemplar: Explore probability distributions

Practice Quiz: Test your knowledge: Probability distributions with Python

A data professional is working with a dataset that has a normal distribution. To test out the empirical rule, they want to find out if roughly 68% of the data values fall within 1 standard deviation of the mean. What Python functions will enable them to compute the mean and standard deviation?

What Python function is used to compute z-scores for data?

Review: Probability


Video: Wrap-up

What You’ve Learned

  • Probability in Data Science: It’s the foundation for making predictions, understanding patterns, and making data-driven decisions.
  • Types of Probability: Objective (data-based) is crucial for data analysis, while subjective involves personal judgment.
  • Probability Rules: Complement, addition, multiplication, and conditional probability govern how events relate.
  • Bayes’ Theorem: Helps you update probabilities as you gain new information.
  • Probability Distributions: Models for understanding the likelihoods of different outcomes.
    • Discrete (e.g., binomial, Poisson): For countable outcomes.
    • Continuous (e.g., normal distribution): For measurements on a spectrum.
  • Z-scores: Help standardize data within the normal distribution.
  • SciPy Stats: A powerful Python tool for working with probability distributions.

Why This Matters

Probability is essential for:

  • Deeper statistical analysis (hypothesis testing, regression)
  • Identifying patterns within data
  • Machine learning

Next Steps

  • Prepare for the graded assessment!
  • Review the new terminology introduced.
  • Revisit any concepts that were challenging for a stronger foundation.

You’ve come to the end of your introduction
to probability. Wow, you’ve learned a lot of important concepts. Great work! Along the way, we’ve explored how data professionals
use probability to make reasonable predictions
about uncertain events and help people and organizations
make data-driven decisions. Basic probability is a foundational part
of data science, and it also informs more
advanced statistical methods, such as hypothesis testing
and regression analysis, which we’ll explore
later in the program. In your future career
as a data professional, you’ll use probability
distributions to discover significant
patterns in your data. Plus a working knowledge of probability distributions is key for machine learning and essential tool in
modern data science. We started off this part
of the course by reviewing the two main types of probability, objective
and subjective. Data professionals use
objective probability to analyze and interpret data. From there, we reviewed the
basic rules of probability, such as the complement, addition, and
multiplication rules. Then you learned about
conditional probability, which helps you
better understand the relationship between
dependent events. We also discussed
Bayes’ theorem, which updates the probability of an event based on new
data about the event. After that, we moved from basic probability to
probability distributions. Probability distributions
describe the likelihood of the possible outcomes of a random event and can be
discrete or continuous. Data professionals use
probability distributions to find meaningful patterns
in complex datasets. Next, we explored discrete
probability distributions, such as the binomial
and Poisson, and discovered how
they can help you model different types of data. Then we moved on to continuous
probability distributions. We focused on the normal
distribution or bell curve, the most widely used
distribution in statistics. We also discussed how Z-scores can help you better understand the relationship
between values on a standard normal distribution. Finally, you learned that
the SciPy stats module is a powerful tool for working with probability
distributions. You used the normal distribution
to model your data and gain useful insights. Coming up, you have a graded
assessment. To prepare, check out the reading that lists all the new terms
you’ve learned. Feel free to revisit
videos, readings, and other resources that
covered key concepts. Until we meet again, good luck.

Reading: Glossary terms from module 2

Terms and definitions from Course 4, Module 2

Quiz: Module 2 challenge

An investor believes there is a 90% chance that the price of a certain stock will increase in the next year. The investor’s prediction is based exclusively on intuition. What type of probability are they using?

The probability of an event is close to 1. Which of the following statements best describes the likelihood that the event will occur?

The probability of rain tomorrow is 40%. What is the probability of the complement of this event?

Which of the following events are mutually exclusive? Select all that apply.

What concept refers to the probability of an event occurring given that another event has already occurred?

Which of the following are examples of continuous random variables? Select all that apply.

What probability distribution can model the probability of getting a certain number of defective products in a sample of 15 products?

A data professional working for a smartphone manufacturer is analyzing sample data on the weight of a specific smartphone. The data follows a normal distribution, with a mean weight of 150g and a standard deviation of 10g. What data value lies 3 standard deviations below the mean?

The mean and the standard deviation of a standard normal distribution always equal what values?

A data analytics team at a water utility works with a dataset that contains information about local reservoirs. They determine that the data follows a normal distribution. What Python function can they use to compute z-scores for the data?